# codetutor/backend/data/questions/contains-duplicate.yaml

title: Contains Duplicate
slug: contains-duplicate
difficulty: easy
leetcode_id: 217
leetcode_url: https://leetcode.com/problems/contains-duplicate/
categories:
  - arrays
  - hash-tables
patterns:
  - slug: hashing
    is_optimal: true

function_signature: "def contains_duplicate(nums: list[int]) -> bool:"

test_cases:
  visible:
    - input: { nums: [1, 2, 3, 1] }
      expected: true
    - input: { nums: [1, 2, 3, 4] }
      expected: false
    - input: { nums: [1, 1, 1, 3, 3, 4, 3, 2, 4, 2] }
      expected: true
  hidden:
    - input: { nums: [1] }
      expected: false
    - input: { nums: [1, 1] }
      expected: true
    - input: { nums: [-1, -1, -2, -3] }
      expected: true
    - input: { nums: [0, 0] }
      expected: true
    - input: { nums: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] }
      expected: false
    - input: { nums: [1, 2, 3, 4, 5, 1] }
      expected: true
    - input: { nums: [1000000000, -1000000000] }
      expected: false

description: |
  Given an integer array `nums`, return `true` if any value appears **at least twice** in the array, and return `false` if every element is distinct.

constraints: |
  - `1 <= nums.length <= 10^5`
  - `-10^9 <= nums[i] <= 10^9`

examples:
  - input: "nums = [1,2,3,1]"
    output: "true"
    explanation: "The element 1 occurs at the indices 0 and 3."
  - input: "nums = [1,2,3,4]"
    output: "false"
    explanation: "All elements are distinct."
  - input: "nums = [1,1,1,3,3,4,3,2,4,2]"
    output: "true"
    explanation: "The values 1, 2, 3, and 4 each appear more than once."

explanation:
  intuition: |
    Imagine you're checking coats at a party and need to ensure no two guests have the same ticket number. As each guest arrives, you could compare their ticket to every previous ticket — but that gets tedious as the party grows. Instead, what if you kept a quick-reference list of all ticket numbers you've seen?

    This is the core insight: **use a data structure that allows instant lookups** to check if you've seen a number before. A *hash set* provides exactly this capability — adding an element and checking membership both take O(1) average time.

    Think of it like this: as you iterate through the array, you maintain a "memory" of all numbers encountered so far. For each new number, you ask: "Have I seen this before?" If yes, you've found a duplicate. If no, add it to your memory and continue.

    The key constraint guiding our solution is the array size (up to 10^5 elements). This rules out O(n^2) approaches and points us toward O(n) or O(n log n) solutions.

  approach: |
    We solve this using a **Hash Set Approach**:

    **Step 1: Create an empty set**

    - `seen`: An empty set to store numbers we've encountered
    - Sets provide O(1) average time for both insertion and membership testing

    &nbsp;

    **Step 2: Iterate through the array**

    - For each number in `nums`, check if it already exists in `seen`
    - If the number is in `seen`, we've found a duplicate — return `True` immediately
    - If the number is not in `seen`, add it to the set and continue

    &nbsp;

    **Step 3: Return the result**

    - If we complete the loop without finding any duplicates, return `False`
    - This means all elements were distinct

    &nbsp;

    This approach works because hash sets give us constant-time lookups on average. We trade space (storing up to n elements) for time (avoiding nested comparisons).
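
    The three steps above can be traced on a small input. The sketch below is illustrative only: the helper name `trace_contains_duplicate` and the shape of the returned trace are assumptions made for this example, not part of the reference solution.

    ```python
    def trace_contains_duplicate(nums: list[int]) -> tuple[bool, list[tuple[int, bool]]]:
        """Return the answer plus a (number, was_already_seen) record per step."""
        seen: set[int] = set()                 # Step 1: start with an empty set
        trace: list[tuple[int, bool]] = []

        for num in nums:                       # Step 2: visit each number once
            already_seen = num in seen
            trace.append((num, already_seen))
            if already_seen:
                return True, trace             # duplicate found, stop early
            seen.add(num)

        return False, trace                    # Step 3: every element was distinct


    result, steps = trace_contains_duplicate([1, 2, 3, 1])
    print(result)  # True
    print(steps)   # [(1, False), (2, False), (3, False), (1, True)]
    ```

    The trace stops at the second `1`, which is the first number already present in `seen`.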

  common_pitfalls:
    - title: The Brute Force Trap
      description: |
        A natural first instinct is to compare every pair of elements:
        - Outer loop `i` from `0` to `n-1`
        - Inner loop `j` from `i+1` to `n-1`
        - Check if `nums[i] == nums[j]`

        This results in **O(n^2) time complexity**. With `nums.length <= 10^5`, the worst case (no duplicates) makes n(n-1)/2 comparisons, roughly 5 billion, which will almost certainly hit **Time Limit Exceeded (TLE)**.

        The hash set approach reduces this to O(n) by eliminating the inner loop entirely.
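
        The comparison count quoted above follows from the number of distinct pairs, n(n-1)/2. A quick check of that arithmetic, with `pair_comparisons` as a hypothetical helper name:

        ```python
        import math

        def pair_comparisons(n: int) -> int:
            """Worst-case comparisons made by the nested loops: one per pair i < j."""
            return n * (n - 1) // 2

        n = 10**5
        print(pair_comparisons(n))                     # 4999950000, roughly 5 billion
        print(pair_comparisons(n) == math.comb(n, 2))  # True
        ```

        By contrast, a single pass over the same input performs only 10^5 set lookups.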
      wrong_approach: "Nested loops comparing all pairs"
      correct_approach: "Hash set for O(1) membership testing"

    - title: Sorting Without Understanding the Trade-off
      description: |
        Sorting the array first (O(n log n)) and then checking adjacent elements works, but it has two downsides:
        - Asymptotically slower than the hash set approach: O(n log n) versus O(n)
        - Modifies the original array (or requires O(n) extra space for a copy)

        However, sorting can be preferable when memory is extremely constrained, because an in-place algorithm such as heapsort needs only O(1) extra space. Note that Python's built-in `list.sort()` (Timsort) can allocate up to O(n) temporary space in the worst case.
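
        If the caller's list must stay untouched, the sorting idea can be applied to a copy instead. This is a minimal sketch, assuming the O(n) memory for the copy is acceptable; the function name is hypothetical.

        ```python
        def contains_duplicate_sorted_copy(nums: list[int]) -> bool:
            # sorted() returns a new list, so the caller's list is left as it was
            ordered = sorted(nums)

            # After sorting, equal values sit next to each other
            for i in range(1, len(ordered)):
                if ordered[i] == ordered[i - 1]:
                    return True
            return False


        data = [3, 1, 2, 1]
        print(contains_duplicate_sorted_copy(data))  # True
        print(data)                                  # [3, 1, 2, 1] (unchanged)
        ```

        This keeps the input intact but gives up the memory advantage that motivated sorting in the first place.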
      wrong_approach: "Always defaulting to sorting"
      correct_approach: "Choose hash set for O(n) time when space permits"

    - title: Using a List Instead of a Set
      description: |
        In Python, checking `if x in list` is O(n), not O(1). Using a list instead of a set turns your "optimised" solution back into O(n^2).

        ```python
        # Wrong - O(n^2) total
        def contains_duplicate(nums: list[int]) -> bool:
            seen = []  # a list, not a set
            for num in nums:
                if num in seen:  # O(n) lookup!
                    return True
                seen.append(num)
            return False
        ```

        Always use a set (or dict) for membership testing.
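
        The difference is easy to observe with the standard `timeit` module. The sketch below is a rough micro-benchmark, not a rigorous measurement: the helper names and the input size of 2,000 are arbitrary choices for illustration, and absolute timings will vary by machine.

        ```python
        import timeit

        def with_list(nums: list[int]) -> bool:
            seen = []
            for num in nums:
                if num in seen:       # linear scan of the list
                    return True
                seen.append(num)
            return False

        def with_set(nums: list[int]) -> bool:
            seen = set()
            for num in nums:
                if num in seen:       # hash lookup, O(1) on average
                    return True
                seen.add(num)
            return False

        # All-distinct input is the worst case: neither version can exit early
        data = list(range(2_000))

        list_time = timeit.timeit(lambda: with_list(data), number=5)
        set_time = timeit.timeit(lambda: with_set(data), number=5)
        print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
        ```

        Both functions return the same answers; only the cost of each membership test differs, and the gap widens as the input grows.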

  key_takeaways:
    - "**Hash sets for membership testing**: When you need to check 'have I seen this before?', a set gives O(1) lookups"
    - "**Space-time trade-off**: Using O(n) extra space gives us O(n) time instead of O(n^2)"
    - "**Early exit optimisation**: Return immediately when a duplicate is found — no need to check the rest"
    - "**Foundation for harder problems**: This pattern appears in problems like Two Sum, finding pairs, and detecting cycles"

  time_complexity: "O(n). We traverse the array once, with O(1) average-time set operations at each step."
  space_complexity: "O(n). In the worst case (all unique elements), we store all n elements in the set."

  pattern_comparison: |
    **Hash Set vs Sorting: The Classic Space-Time Trade-off**

    Two fundamentally different approaches, each optimal in different contexts:

    | Approach | Time | Space | Modifies Input? | Early Exit? |
    |----------|------|-------|-----------------|-------------|
    | **Hash Set** | O(n) | O(n) | No | Yes |
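    | **Sorting** | O(n log n) | O(1)* | Yes | Scan only |

    *Assumes an in-place sort such as heapsort. Quicksort also needs an O(log n) recursion stack on average, and Python's `list.sort()` (Timsort) can allocate up to O(n) temporary space. The full sort always runs before the adjacent-pair scan can exit early.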

    **When Hash Set is better:**
    - Memory is plentiful (most modern systems)
    - Input must not be modified
    - You need the fastest possible runtime
    - Data arrives as a stream, one element at a time

    **When Sorting is better:**
    - Memory is extremely constrained (embedded systems, very large arrays)
    - Modifying the input is acceptable
    - You also need the sorted array for subsequent operations
    - The data is nearly sorted (adaptive sorts like Timsort are very fast)

    **The one-liner alternative:**
    ```python
    return len(nums) != len(set(nums))
    ```
    This is clean and Pythonic, but always processes all elements (no early exit). The iterative set approach can exit immediately upon finding a duplicate, which is faster when duplicates appear early.
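
    The early-exit difference can be made visible by counting how many elements each approach reads. In this sketch, `counting` and `early_exit` are hypothetical helpers, and a one-element list serves as a simple mutable counter.

    ```python
    from collections.abc import Iterable, Iterator

    def counting(nums: Iterable[int], counter: list[int]) -> Iterator[int]:
        """Yield each number while recording how many have been read."""
        for num in nums:
            counter[0] += 1
            yield num

    def early_exit(nums: Iterable[int]) -> bool:
        seen: set[int] = set()
        for num in nums:
            if num in seen:
                return True
            seen.add(num)
        return False

    # The duplicate sits at the very front of 1,002 elements
    data = [7, 7] + list(range(1000))

    loop_reads = [0]
    early_exit(counting(data, loop_reads))
    print(loop_reads[0])  # 2

    set_reads = [0]
    set(counting(data, set_reads))
    print(set_reads[0])   # 1002
    ```

    Because `early_exit` accepts any iterable, the same loop also works on data that arrives one element at a time.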

    **Foundation pattern**: This simple problem teaches the fundamental "have I seen this before?" pattern that appears in Two Sum, cycle detection, and many other problems.

solutions:
  - approach_name: Hash Set
    is_optimal: true
    code: |
      def contains_duplicate(nums: list[int]) -> bool:
          # Set to track numbers we've seen
          seen = set()

          for num in nums:
              # Already seen this number? Duplicate found!
              if num in seen:
                  return True
              # First time seeing this number, remember it
              seen.add(num)

          # No duplicates found after checking all elements
          return False
    explanation: |
      **Time Complexity:** O(n) — Single pass through the array with O(1) set operations.

      **Space Complexity:** O(n) — Set stores up to n elements in the worst case.

      We iterate once, checking each number against our set of seen values. The moment we find a number already in the set, we return `True`. If we finish without finding duplicates, we return `False`.

  - approach_name: One-liner with Set Length
    is_optimal: true
    code: |
      def contains_duplicate(nums: list[int]) -> bool:
          # If set has fewer elements than list, duplicates exist
          return len(nums) != len(set(nums))
    explanation: |
      **Time Complexity:** O(n) — Building a set from the list is O(n).

      **Space Complexity:** O(n) — The set stores up to n elements.

      This elegant one-liner exploits the fact that sets automatically remove duplicates. If the set has fewer elements than the original list, at least one duplicate existed. Note: this always processes all elements, unlike the early-exit version above.

  - approach_name: Sorting
    is_optimal: false
    code: |
      def contains_duplicate(nums: list[int]) -> bool:
          # Sort the array so duplicates become adjacent
          nums.sort()

          # Check adjacent pairs for duplicates
          for i in range(1, len(nums)):
              if nums[i] == nums[i - 1]:
                  return True

          return False
    explanation: |
      **Time Complexity:** O(n log n) — Dominated by the sorting step.

      **Space Complexity:** O(1) with an in-place sort such as heapsort. Python's `list.sort()` (Timsort), used in the code above, can allocate up to O(n) temporary space in the worst case.

      After sorting, any duplicates will be adjacent. We scan through checking consecutive pairs. This approach is useful when memory is extremely limited, but it modifies the original array.

  - approach_name: Brute Force
    is_optimal: false
    code: |
      def contains_duplicate(nums: list[int]) -> bool:
          n = len(nums)

          # Compare every pair of elements
          for i in range(n):
              for j in range(i + 1, n):
                  if nums[i] == nums[j]:
                      return True

          return False
    explanation: |
      **Time Complexity:** O(n^2) — Nested loops comparing all pairs.

      **Space Complexity:** O(1) — No extra data structures used.

      This straightforward approach checks every possible pair. While correct, it's far too slow for large inputs (TLE on LeetCode). Included to illustrate why hash-based approaches are essential.