# codetutor/backend/data/questions/find-the-duplicate-number.yaml

title: Find the Duplicate Number
slug: find-the-duplicate-number
difficulty: medium
leetcode_id: 287
leetcode_url: https://leetcode.com/problems/find-the-duplicate-number/
categories:
  - arrays
  - two-pointers
patterns:
  - slug: fast-slow-pointers
    is_optimal: true
  - slug: binary-search
    is_optimal: false

function_signature: "def find_duplicate(nums: list[int]) -> int:"

test_cases:
  visible:
    - input: { nums: [1, 3, 4, 2, 2] }
      expected: 2
    - input: { nums: [3, 1, 3, 4, 2] }
      expected: 3
    - input: { nums: [3, 3, 3, 3, 3] }
      expected: 3
  hidden:
    - input: { nums: [1, 1] }
      expected: 1
    - input: { nums: [2, 2, 2, 2, 2] }
      expected: 2
    - input: { nums: [1, 4, 4, 2, 4] }
      expected: 4
    - input: { nums: [1, 2, 3, 4, 5, 6, 7, 8, 9, 5] }
      expected: 5
    - input: { nums: [2, 5, 9, 6, 9, 3, 8, 9, 7, 1] }
      expected: 9
    - input: { nums: [1, 1, 2] }
      expected: 1

description: |
  You are given an array of integers `nums` containing `n + 1` integers, where each integer is in the range `[1, n]` inclusive.

  There is only **one repeated number** in `nums`. Return *this repeated number*.

  You must solve the problem **without** modifying the array `nums` and using only constant extra space.

constraints: |
  - `1 <= n <= 10^5`
  - `nums.length == n + 1`
  - `1 <= nums[i] <= n`
  - Every integer in `nums` appears **at most once**, except for **precisely one integer**, which appears **two or more** times

examples:
  - input: "nums = [1,3,4,2,2]"
    output: "2"
    explanation: "The number 2 appears twice in the array."
  - input: "nums = [3,1,3,4,2]"
    output: "3"
    explanation: "The number 3 appears twice in the array."
  - input: "nums = [3,3,3,3,3]"
    output: "3"
    explanation: "The number 3 appears five times in the array."

explanation:
  intuition: |
    This problem has a beautiful constraint: the array has `n + 1` elements but values are only in the range `[1, n]`. By the **Pigeonhole Principle**, at least one value must repeat.

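    The key insight is to view the array as a **linked list** in which each index is a node and the value stored there points to the next index. Since values are in `[1, n]` and indices run over `[0, n]`, every `nums[i]` is a valid index, so treating `nums[i]` as a "next pointer" always yields a valid linked structure.

    Think of it like this: if we start at index `0` and repeatedly jump to `nums[current_index]`, we create a sequence. No value equals `0`, so nothing ever points back to index `0`, and the walk starts outside any cycle. The sequence is finite, so it must eventually revisit an index, and the first revisited index is reached from two different indices that hold the same value. This creates a **cycle**, and the duplicate number is the entry point of this cycle.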

    For example, with `nums = [1,3,4,2,2]`:
    - Index 0 → value 1 → jump to index 1
    - Index 1 → value 3 → jump to index 3
    - Index 3 → value 2 → jump to index 2
    - Index 2 → value 4 → jump to index 4
    - Index 4 → value 2 → jump to index 2 (cycle!)

    The cycle exists because both index 3 and index 4 have value `2`. Floyd's Tortoise and Hare algorithm finds exactly where this cycle begins.
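
    The walk above can be reproduced with a short sketch. The helper name `trace_jumps` and its `max_steps` parameter are assumptions made for illustration. It uses a set to stop at the first revisited index, so it is a visualization aid rather than a constant-space solution.

    ```python
    def trace_jumps(nums: list[int], max_steps: int = 10) -> list[int]:
        """Return the indices visited when nums[i] is followed as a next pointer from index 0."""
        path = [0]
        seen = {0}  # illustration only: this set costs O(n) extra space
        current = 0
        for _ in range(max_steps):
            current = nums[current]
            path.append(current)
            if current in seen:  # first revisited index: the cycle has closed
                break
            seen.add(current)
        return path


    print(trace_jumps([1, 3, 4, 2, 2]))  # [0, 1, 3, 2, 4, 2]
    ```

    The last index in the printed path is the first one visited twice, which is the duplicate value `2`.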

  approach: |
    We solve this using **Floyd's Cycle Detection** (Tortoise and Hare):

    **Step 1: Detect the cycle**

    - `slow`: Moves one step at a time (`slow = nums[slow]`)
    - `fast`: Moves two steps at a time (`fast = nums[nums[fast]]`)
    - Both start at index `0`
    - Keep moving until they meet — this proves a cycle exists

    &nbsp;

    **Step 2: Find the cycle entrance**

    - Reset `slow` to index `0`, keep `fast` at the meeting point
    - Move both pointers one step at a time
    - The point where they meet again is the duplicate number

    &nbsp;

    **Why does this work?**

    Let's say the distance from the start to the cycle entrance is `F`, and the cycle length is `C`. When slow and fast first meet:
    - Slow has traveled `F + a` steps (where `a` is its distance past the entrance, measured around the cycle)
    - Fast has traveled `2(F + a)` steps
    - Both are at the same node in the cycle, so the extra distance fast covered is a whole number of laps: `2(F + a) - (F + a) = kC` for some integer `k >= 1`, so `F + a = kC`

    This means `F = kC - a`. When we reset slow to the start and both move at the same speed, slow travels `F` steps to reach the entrance. Fast also travels `F = kC - a` steps from its position `a` into the cycle, which puts it `a + kC - a = kC` steps past the entrance. That is a whole number of laps, so fast is at the entrance too.
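
    The relationship between `F`, `C`, and `a` can be checked numerically. The sketch below is an illustration under assumptions: the helper name `cycle_stats` is hypothetical, and it uses a dictionary (O(n) space) purely to measure the tail and cycle lengths. It shows that `F + a` is a whole multiple of `C`, not necessarily equal to `C` itself.

    ```python
    def cycle_stats(nums: list[int]) -> tuple[int, int, int]:
        """Return (F, C, a): tail length, cycle length, and meeting offset past the entrance."""
        # Record the step at which each index is first visited (measurement only).
        first_seen: dict[int, int] = {}
        current, step = 0, 0
        while current not in first_seen:
            first_seen[current] = step
            current = nums[current]
            step += 1
        tail = first_seen[current]  # F: steps from index 0 to the cycle entrance
        cycle = step - tail         # C: number of indices on the cycle

        # Phase 1 of Floyd's algorithm, counting how many steps slow takes.
        slow = fast = 0
        slow_steps = 0
        while True:
            slow = nums[slow]
            fast = nums[nums[fast]]
            slow_steps += 1
            if slow == fast:
                break
        offset = (slow_steps - tail) % cycle  # a: meeting position past the entrance
        return tail, cycle, offset


    F, C, a = cycle_stats([1, 3, 4, 2, 2])
    print(F, C, a)      # 3 2 1
    print((F + a) % C)  # 0
    ```

    For this input `F + a = 4` while `C = 2`, so the pointers meet after two full laps' worth of steps.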

    &nbsp;

    **Step 3: Return the result**

    - The meeting point in phase 2 is the duplicate value

  common_pitfalls:
    - title: Using Extra Space
      description: |
        A common first instinct is to use a hash set to track seen numbers:

        ```python
        seen = set()
        for num in nums:
            if num in seen:
                return num
            seen.add(num)
        ```

        While this works and runs in O(n) time, it uses O(n) space. The problem explicitly requires **O(1) space**, so this approach violates the constraints.
      wrong_approach: "Hash set to track seen numbers"
      correct_approach: "Floyd's cycle detection using the array itself"

    - title: Modifying the Array
      description: |
        Another tempting approach is to mark visited indices by negating values:

        ```python
        for num in nums:
            idx = abs(num)
            if nums[idx] < 0:
                return idx
            nums[idx] = -nums[idx]
        ```

        This is O(n) time and O(1) space, but it **modifies the input array**, which the problem forbids. The cycle detection approach leaves the array untouched.
      wrong_approach: "Negating values to mark as visited"
      correct_approach: "Read-only traversal with two pointers"

    - title: Sorting the Array
      description: |
        Sorting and finding adjacent duplicates is intuitive but has two problems:
        - It modifies the array (or requires O(n) space for a copy)
        - It's O(n log n) time, not optimal

        The cycle detection method achieves O(n) time with O(1) space without modification.
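
        For comparison, a minimal sketch of the sorting approach is shown here. The function name `find_duplicate_sorted` is an assumption for illustration. It sorts a copy so the input stays untouched, which is exactly where the O(n) extra space comes from.

        ```python
        def find_duplicate_sorted(nums: list[int]) -> int:
            ordered = sorted(nums)  # new list: O(n) extra space, O(n log n) time
            for i in range(1, len(ordered)):
                # After sorting, equal values sit next to each other
                if ordered[i] == ordered[i - 1]:
                    return ordered[i]
            return -1  # unreachable when the input satisfies the constraints
        ```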
      wrong_approach: "Sort and find adjacent duplicates"
      correct_approach: "Floyd's algorithm for O(n) time, O(1) space"

    - title: Confusing Index with Value
      description: |
        In Floyd's algorithm, we treat values as pointers to indices. A common mistake is confusing when to use the value versus the index.

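        Remember: `slow = nums[slow]` means "jump to the index stored as the value at the current index." Both pointers always hold indices. The duplicate is the **value** shared by two indices, which is exactly the index where the pointers meet in phase 2, so return `slow` itself, not `nums[slow]`.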
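
        The difference is easy to see on a small input. In this sketch the helper name `cycle_entrance` is hypothetical. It returns the index where the two pointers meet in phase 2, and the prints contrast that index with the value stored at it.

        ```python
        def cycle_entrance(nums: list[int]) -> int:
            """Return the index at which the cycle begins."""
            slow = fast = 0
            while True:  # phase 1: find a meeting point inside the cycle
                slow = nums[slow]
                fast = nums[nums[fast]]
                if slow == fast:
                    break
            slow = 0
            while slow != fast:  # phase 2: walk both at the same speed
                slow = nums[slow]
                fast = nums[fast]
            return slow


        nums = [1, 3, 4, 2, 2]
        entrance = cycle_entrance(nums)
        print(entrance)        # 2: the entrance index is itself the duplicate value
        print(nums[entrance])  # 4: the value stored there is only the next hop
        ```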

  key_takeaways:
    - "**Cycle detection pattern**: When array values can be treated as pointers (value in valid index range), consider Floyd's algorithm"
    - "**Pigeonhole Principle**: With `n + 1` items in `n` slots, at least one slot must have multiple items — guaranteeing a duplicate exists"
    - "**Creative problem reframing**: Transforming an array duplicate problem into a linked list cycle problem unlocks an elegant O(1) space solution"
    - "**Two-phase approach**: First detect *that* a cycle exists (fast catches slow), then find *where* it starts (both at same speed)"

  time_complexity: "O(n). Each pointer traverses at most O(n) steps in both phases."
  space_complexity: "O(1). Only two pointer variables are used, regardless of input size."

solutions:
  - approach_name: Floyd's Cycle Detection
    is_optimal: true
    code: |
      def find_duplicate(nums: list[int]) -> int:
          # Phase 1: Find the intersection point in the cycle
          # Both pointers start at index 0, which lies outside the cycle
          slow = 0
          fast = 0

          # Move slow by 1, fast by 2 until they meet
          while True:
              slow = nums[slow]           # One step
              fast = nums[nums[fast]]     # Two steps
              if slow == fast:
                  break

          # Phase 2: Find the entrance to the cycle (the duplicate)
          slow = 0  # Reset slow to the start index

          # Move both at same speed until they meet at cycle entrance
          while slow != fast:
              slow = nums[slow]
              fast = nums[fast]

          # The meeting index is the duplicate number
          return slow
    explanation: |
      **Time Complexity:** O(n) — Each pointer visits at most n nodes in each phase.

      **Space Complexity:** O(1) — Only two pointer variables used.

      By treating array values as "next pointers," we transform this into a cycle detection problem. The duplicate causes a cycle because two indices hold the same value and therefore point to the same next index. Floyd's algorithm finds the cycle entrance in linear time with constant space.

  - approach_name: Binary Search on Value Range
    is_optimal: false
    code: |
      def find_duplicate(nums: list[int]) -> int:
          # Search the value range [1, n], not the array indices
          low, high = 1, len(nums) - 1

          while low < high:
              mid = (low + high) // 2

              # Count numbers <= mid
              count = sum(1 for num in nums if num <= mid)

              # If count > mid, duplicate is in [low, mid]
              # Otherwise, duplicate is in [mid+1, high]
              if count > mid:
                  high = mid
              else:
                  low = mid + 1

          return low
    explanation: |
      **Time Complexity:** O(n log n) — Binary search over n values, each iteration scans n elements.

      **Space Complexity:** O(1) — Only a few variables used.

      This approach binary searches the *value* range, not the array. If there are more than `mid` numbers in `[1, mid]`, the duplicate must be in that range (Pigeonhole Principle). While not optimal, this demonstrates binary search on answer space rather than on array indices.
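
      The pigeonhole test can be tabulated for a small input to show why binary search applies. This is a sketch for illustration, and the helper name `count_at_most` is an assumption. The check `count > mid` is false below the duplicate and true from the duplicate onward.

      ```python
      def count_at_most(nums: list[int], mid: int) -> int:
          """Count how many entries of nums are <= mid."""
          return sum(1 for num in nums if num <= mid)


      nums = [1, 3, 4, 2, 2]
      for mid in range(1, len(nums)):
          count = count_at_most(nums, mid)
          print(mid, count, count > mid)
      # 1 1 False
      # 2 3 True
      # 3 4 True
      # 4 5 True
      ```

      The smallest `mid` for which the check is true is the duplicate, which is what the `low < high` loop converges to.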

  - approach_name: Hash Set
    is_optimal: false
    code: |
      def find_duplicate(nums: list[int]) -> int:
          seen = set()

          for num in nums:
              # If we've seen this number before, it's the duplicate
              if num in seen:
                  return num
              seen.add(num)

          return -1  # Should never reach here given constraints
    explanation: |
      **Time Complexity:** O(n) — Single pass through the array.

      **Space Complexity:** O(n) — Hash set stores up to n elements.

      The most intuitive approach: track seen numbers and return when we find a repeat. While this violates the O(1) space constraint, it's included to show the trade-off between space and algorithmic complexity. Understanding why this isn't acceptable motivates learning Floyd's algorithm.