codetutor/backend/data/questions/find-the-duplicate-number.yaml

title: Find the Duplicate Number
slug: find-the-duplicate-number
difficulty: medium
leetcode_id: 287
leetcode_url: https://leetcode.com/problems/find-the-duplicate-number/
categories:
- arrays
- two-pointers
patterns:
- slug: fast-slow-pointers
is_optimal: true
- slug: binary-search
is_optimal: false
function_signature: "def find_duplicate(nums: list[int]) -> int:"
test_cases:
visible:
- input: { nums: [1, 3, 4, 2, 2] }
expected: 2
- input: { nums: [3, 1, 3, 4, 2] }
expected: 3
- input: { nums: [3, 3, 3, 3, 3] }
expected: 3
hidden:
- input: { nums: [1, 1] }
expected: 1
- input: { nums: [2, 2, 2, 2, 2] }
expected: 2
- input: { nums: [1, 4, 4, 2, 4] }
expected: 4
- input: { nums: [1, 2, 3, 4, 5, 6, 7, 8, 9, 5] }
expected: 5
- input: { nums: [2, 5, 9, 6, 9, 3, 8, 9, 7, 1] }
expected: 9
- input: { nums: [1, 1, 2] }
expected: 1
description: |
You are given an array of integers `nums` containing `n + 1` integers, where each integer is in the range `[1, n]` inclusive.
Exactly **one number** in `nums` is repeated (it may appear more than twice); return *this repeated number*.
You must solve the problem **without** modifying the array `nums` and using only constant extra space.
constraints: |
- `1 <= n <= 10^5`
- `nums.length == n + 1`
- `1 <= nums[i] <= n`
- All the integers in `nums` appear only **once** except for **precisely one integer** which appears **two or more** times
examples:
- input: "nums = [1,3,4,2,2]"
output: "2"
explanation: "The number 2 appears twice in the array."
- input: "nums = [3,1,3,4,2]"
output: "3"
explanation: "The number 3 appears twice in the array."
- input: "nums = [3,3,3,3,3]"
output: "3"
explanation: "The number 3 appears five times in the array."
explanation:
intuition: |
This problem has a beautiful constraint: the array has `n + 1` elements but values are only in the range `[1, n]`. By the **Pigeonhole Principle**, at least one value must repeat.
The key insight is to view the array as a **linked list** in which index `i` points to index `nums[i]`. Since values are in `[1, n]` and the indices are `[0, n]`, every value is a valid index, so following `nums[i]` as a "next pointer" never leaves the array.
Think of it like this: if we start at index `0` and repeatedly jump to `nums[current_index]`, we create a sequence. No value equals `0`, so nothing points back to index `0` and the walk starts outside any cycle. The walk must eventually revisit an index, and the first revisited index is reached from two different indices that hold the same value. This creates a **cycle**, and the duplicate number is the entry point of this cycle.
For example, with `nums = [1,3,4,2,2]`:
- Index 0 → value 1 → jump to index 1
- Index 1 → value 3 → jump to index 3
- Index 3 → value 2 → jump to index 2
- Index 2 → value 4 → jump to index 4
- Index 4 → value 2 → jump to index 2 (cycle!)
The cycle exists because both index 3 and index 4 have value `2`. Floyd's Tortoise and Hare algorithm finds exactly where this cycle begins.
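The walk above can be reproduced with a short sketch. The helper name `trace_jumps` is hypothetical and not part of any solution; it records the indices visited from index `0` until one repeats, and it uses O(n) extra space, so it is for illustration only.

```python
def trace_jumps(nums: list[int]) -> tuple[list[int], int]:
    """Follow index -> nums[index] from index 0 until an index repeats.

    Returns the visited indices in order and the first repeated index.
    """
    visited: list[int] = []
    seen: set[int] = set()
    current = 0
    while current not in seen:
        seen.add(current)
        visited.append(current)
        current = nums[current]  # treat the value as the next index
    return visited, current


path, entry = trace_jumps([1, 3, 4, 2, 2])
print(path)   # [0, 1, 3, 2, 4]
print(entry)  # 2
```

The first repeated index is the cycle entrance, and it equals the duplicated value.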
approach: |
We solve this using **Floyd's Cycle Detection** (Tortoise and Hare):
**Step 1: Detect the cycle**
- `slow`: Moves one step at a time (`slow = nums[slow]`)
- `fast`: Moves two steps at a time (`fast = nums[nums[fast]]`)
- Both start at index `0`
- Keep moving until they meet — this proves a cycle exists
&nbsp;
**Step 2: Find the cycle entrance**
- Reset `slow` to index `0`, keep `fast` at the meeting point
- Move both pointers one step at a time
- The point where they meet again is the duplicate number
&nbsp;
**Why does this work?**
Let's say the distance from start to cycle entrance is `F`, and the cycle length is `C`. When slow and fast first meet:
- Slow has traveled `F + a` steps (where `a` is distance into the cycle)
- Fast has traveled `2(F + a)` steps
- Both are at the same node inside the cycle, so the difference is a whole number of laps: `2(F + a) - (F + a) = kC` for some integer `k >= 1`, so `F + a = kC`
This means `F = kC - a = (k - 1)C + (C - a)`. When we reset slow to start and both move at the same speed, slow travels `F` steps to reach the entrance, while fast, starting `a` steps into the cycle, travels the same `F` steps: `k - 1` full laps plus `C - a` more steps, which also lands it on the entrance!
&nbsp;
**Step 3: Return the result**
- The meeting point in phase 2 is the duplicate value
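The relationship between `F`, `C`, and `a` can be checked numerically. The sketch below is an assumption-laden illustration: `measure_cycle` is a hypothetical helper, and it starts both pointers at index `0` as described in the steps above. It measures the three quantities and confirms that `F + a` is a whole number of cycle lengths.

```python
def measure_cycle(nums: list[int]) -> tuple[int, int, int]:
    """Return (F, C, a) for the walk that starts at index 0.

    F: steps from index 0 to the cycle entrance
    C: cycle length
    a: steps slow has taken inside the cycle at the first meeting
    """
    # Phase 1: count slow's steps until the pointers meet.
    slow = fast = 0
    steps = 0
    while True:
        slow = nums[slow]
        fast = nums[nums[fast]]
        steps += 1
        if slow == fast:
            break

    # Phase 2: count steps from index 0 to the entrance.
    f = 0
    slow = 0
    while slow != fast:
        slow = nums[slow]
        fast = nums[fast]
        f += 1

    # Walk once around the cycle to measure its length.
    c = 1
    probe = nums[slow]
    while probe != slow:
        probe = nums[probe]
        c += 1

    return f, c, steps - f


f, c, a = measure_cycle([1, 3, 4, 2, 2])
print(f, c, a)      # 3 2 1
print((f + a) % c)  # 0
```

For this input `F + a = 4` is two full laps of a cycle of length 2, which shows that the pointers need not meet after exactly one lap.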
common_pitfalls:
- title: Using Extra Space
description: |
A common first instinct is to use a hash set to track seen numbers:
```python
seen = set()
for num in nums:
    if num in seen:
        return num
    seen.add(num)
```
While this works and runs in O(n) time, it uses O(n) space. The problem explicitly requires **O(1) space**, so this approach violates the constraints.
wrong_approach: "Hash set to track seen numbers"
correct_approach: "Floyd's cycle detection using the array itself"
- title: Modifying the Array
description: |
Another tempting approach is to mark visited indices by negating values:
```python
for num in nums:
    idx = abs(num)
    if nums[idx] < 0:
        return idx
    nums[idx] = -nums[idx]
```
This is O(n) time and O(1) space, but it **modifies the input array**, which the problem forbids. The cycle detection approach leaves the array untouched.
wrong_approach: "Negating values to mark as visited"
correct_approach: "Read-only traversal with two pointers"
- title: Sorting the Array
description: |
Sorting and finding adjacent duplicates is intuitive but has two problems:
- It modifies the array (or requires O(n) space for a copy)
- It's O(n log n) time, not optimal
The cycle detection method achieves O(n) time with O(1) space without modification.
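For comparison only, here is a minimal sketch of the sorting approach. The function name `find_duplicate_by_sorting` is illustrative; it sorts a copy so the input stays intact, which is exactly where the O(n) extra space comes from.

```python
def find_duplicate_by_sorting(nums: list[int]) -> int:
    # sorted() returns a new list: O(n) extra space, O(n log n) time.
    ordered = sorted(nums)
    # After sorting, equal values sit next to each other.
    for previous, current in zip(ordered, ordered[1:]):
        if previous == current:
            return current
    raise ValueError("no duplicate found")
```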
wrong_approach: "Sort and find adjacent duplicates"
correct_approach: "Floyd's algorithm for O(n) time, O(1) space"
- title: Confusing Index with Value
description: |
In Floyd's algorithm, we treat values as pointers to indices. A common mistake is confusing when to use the value versus the index.
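Remember: `slow = nums[slow]` reads the value stored at index `slow` and uses it as the next index. The duplicate is a **value**, not a position in the array. Because the pointers always hold values read from `nums`, the number they agree on after phase 2 can be returned directly.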
key_takeaways:
- "**Cycle detection pattern**: When array values can be treated as pointers (value in valid index range), consider Floyd's algorithm"
- "**Pigeonhole Principle**: With `n + 1` items in `n` slots, at least one slot must have multiple items — guaranteeing a duplicate exists"
- "**Creative problem reframing**: Transforming an array duplicate problem into a linked list cycle problem unlocks an elegant O(1) space solution"
- "**Two-phase approach**: First detect *that* a cycle exists (fast catches slow), then find *where* it starts (both at same speed)"
time_complexity: "O(n). Each pointer traverses at most O(n) steps in both phases."
space_complexity: "O(1). Only two pointer variables are used, regardless of input size."
solutions:
- approach_name: Floyd's Cycle Detection
is_optimal: true
code: |
def find_duplicate(nums: list[int]) -> int:
    # Phase 1: Find the intersection point in the cycle.
    # Both pointers start at nums[0], i.e. one jump in from index 0.
    slow = nums[0]
    fast = nums[0]
    # Move slow by 1, fast by 2 until they meet
    while True:
        slow = nums[slow]  # One step
        fast = nums[nums[fast]]  # Two steps
        if slow == fast:
            break
    # Phase 2: Find the entrance to the cycle (the duplicate)
    slow = nums[0]  # Reset slow to the same starting point
    # Move both at same speed until they meet at cycle entrance
    while slow != fast:
        slow = nums[slow]
        fast = nums[fast]
    # The meeting point is the duplicate number
    return slow
explanation: |
**Time Complexity:** O(n) — Each pointer visits at most n nodes in each phase.
**Space Complexity:** O(1) — Only two pointer variables used.
By treating array values as "next pointers," we transform this into a cycle detection problem. The duplicate causes a cycle because two indices hold the same value and therefore point to the same index. Floyd's algorithm finds the cycle entrance in linear time with constant space.
- approach_name: Binary Search on Value Range
is_optimal: false
code: |
def find_duplicate(nums: list[int]) -> int:
    # Search the value range [1, n], not the array indices
    low, high = 1, len(nums) - 1
    while low < high:
        mid = (low + high) // 2
        # Count numbers <= mid
        count = sum(1 for num in nums if num <= mid)
        # If count > mid, duplicate is in [low, mid]
        # Otherwise, duplicate is in [mid+1, high]
        if count > mid:
            high = mid
        else:
            low = mid + 1
    return low
explanation: |
**Time Complexity:** O(n log n) — Binary search over n values, each iteration scans n elements.
**Space Complexity:** O(1) — Only a few variables used.
This approach binary searches the *value* range, not the array. If more than `mid` elements have values in `[1, mid]`, the duplicate must be in that range (Pigeonhole Principle); otherwise it lies in `[mid + 1, high]`. While not optimal, this demonstrates binary search on the answer space rather than on array indices.
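To see the counting argument at work, the following sketch records each iteration of the search. `trace_binary_search` is a hypothetical helper that mirrors the loop above and only adds bookkeeping.

```python
def trace_binary_search(nums: list[int]) -> list[tuple[int, int, int, int]]:
    """Record (low, high, mid, count) for each iteration of the search."""
    low, high = 1, len(nums) - 1
    trace = []
    while low < high:
        mid = (low + high) // 2
        # Number of elements whose value falls in [1, mid].
        count = sum(1 for num in nums if num <= mid)
        trace.append((low, high, mid, count))
        if count > mid:
            high = mid      # too many values in [1, mid]: duplicate is there
        else:
            low = mid + 1   # otherwise it must be in [mid + 1, high]
    return trace


for row in trace_binary_search([1, 3, 4, 2, 2]):
    print(row)
# (1, 4, 2, 3)
# (1, 2, 1, 1)
```

In the first iteration three elements are `<= 2`, which is more than `mid = 2`, so the range shrinks to `[1, 2]`. In the second, only one element is `<= 1`, so the search settles on `2`.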
- approach_name: Hash Set
is_optimal: false
code: |
def find_duplicate(nums: list[int]) -> int:
    seen = set()
    for num in nums:
        # If we've seen this number before, it's the duplicate
        if num in seen:
            return num
        seen.add(num)
    return -1  # Should never reach here given constraints
explanation: |
**Time Complexity:** O(n) — Single pass through the array.
**Space Complexity:** O(n) — Hash set stores up to n elements.
The most intuitive approach: track seen numbers and return when we find a repeat. While this violates the O(1) space constraint, it's included to show the trade-off between space and algorithmic complexity. Understanding why this isn't acceptable motivates learning Floyd's algorithm.