questions F-L
200
backend/data/questions/find-median-from-data-stream.yaml
Normal file
@@ -0,0 +1,200 @@
title: Find Median from Data Stream
slug: find-median-from-data-stream
difficulty: hard
leetcode_id: 295
leetcode_url: https://leetcode.com/problems/find-median-from-data-stream/
categories:
- heap
- sorting
patterns:
- heap

description: |
The **median** is the middle value in an ordered integer list. If the size of the list is even, there is no middle value, and the median is the mean of the two middle values.

- For example, for `arr = [2, 3, 4]`, the median is `3`.
- For example, for `arr = [2, 3]`, the median is `(2 + 3) / 2 = 2.5`.

Implement the `MedianFinder` class:

- `MedianFinder()` initialises the `MedianFinder` object.
- `void addNum(int num)` adds the integer `num` from the data stream to the data structure.
- `double findMedian()` returns the median of all elements so far. Answers within `10^-5` of the actual answer will be accepted.

constraints: |
- `-10^5 <= num <= 10^5`
- There will be at least one element in the data structure before calling `findMedian`.
- At most `5 * 10^4` calls will be made to `addNum` and `findMedian`.

examples:
- input: |
["MedianFinder", "addNum", "addNum", "findMedian", "addNum", "findMedian"]
[[], [1], [2], [], [3], []]
output: "[null, null, null, 1.5, null, 2.0]"
explanation: |
MedianFinder medianFinder = new MedianFinder();
medianFinder.addNum(1); // arr = [1]
medianFinder.addNum(2); // arr = [1, 2]
medianFinder.findMedian(); // return 1.5 (i.e., (1 + 2) / 2)
medianFinder.addNum(3); // arr = [1, 2, 3]
medianFinder.findMedian(); // return 2.0

explanation:
intuition: |
Imagine you're watching numbers flow by on a conveyor belt, and at any moment someone might ask: "What's the median of all numbers you've seen so far?"

The naive approach would be to keep a sorted list and insert each new number in its correct position. But insertion into a sorted list takes O(n) time, so n insertions cost O(n^2) in the worst case, which becomes too slow with many operations.

Here's the key insight: **you don't need the entire sorted list to find the median**. You only need quick access to the middle element(s). Think of splitting the numbers into two halves:

- The **smaller half**: all numbers less than or equal to the median
- The **larger half**: all numbers greater than or equal to the median

If you had instant access to the **maximum of the smaller half** and the **minimum of the larger half**, you could compute the median immediately. This is exactly what two heaps provide:

- A **max-heap** for the smaller half (gives you the largest of the small numbers)
- A **min-heap** for the larger half (gives you the smallest of the large numbers)

By keeping these heaps balanced (differing in size by at most 1), the median is always at the top of one or both heaps.
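
As a rough illustration of the idea above (not the final solution), the sketch below reads the median from two plain lists that are assumed to be already split at the median. The function name `median_from_halves` is hypothetical, and `max()` and `min()` on a list cost O(n); the two heaps exist to make those two lookups O(1).

```python
def median_from_halves(smaller_half, larger_half):
    """Median of data already split into a smaller and a larger half.

    Assumes len(smaller_half) equals len(larger_half) or exceeds it by one.
    """
    if len(smaller_half) > len(larger_half):
        # Odd count: the extra element in the smaller half is the median.
        return float(max(smaller_half))
    # Even count: average the two elements on either side of the split.
    return (max(smaller_half) + min(larger_half)) / 2


print(median_from_halves([1, 2, 3], [4, 5]))  # 3.0
print(median_from_halves([1, 2], [3, 4]))     # 2.5
```

Only the largest element of the smaller half and the smallest element of the larger half are ever read, which is why heap tops are sufficient.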

approach: |
We solve this using the **Two Heaps** pattern:

**Step 1: Initialise two heaps**

- `max_heap`: A max-heap to store the smaller half of numbers (in Python, we negate values since `heapq` is a min-heap)
- `min_heap`: A min-heap to store the larger half of numbers
- We maintain the invariant: `len(max_heap) >= len(min_heap)` and they differ by at most 1

**Step 2: Adding a number**

- First, add the new number to `max_heap` (the smaller half)
- Then, move the largest from `max_heap` to `min_heap` to ensure all elements in `max_heap` are less than or equal to those in `min_heap`
- If `min_heap` becomes larger than `max_heap`, move its smallest element back to balance

This "add-then-balance" approach ensures both heaps stay balanced and maintain the correct ordering.

**Step 3: Finding the median**

- If total count is odd: the median is the top of `max_heap` (the heap holding the extra element)
- If total count is even: the median is the average of both heap tops

This approach guarantees O(log n) insertion and O(1) median retrieval.
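
To make the three steps concrete, here is a small trace written as free functions. It is a sketch for illustration only: the helper names `add_num` and `find_median` and the input `[5, 2, 8, 1]` are assumptions, not part of the problem statement.

```python
import heapq


def add_num(max_heap, min_heap, num):
    # Step 2a: push onto the smaller half (negated to emulate a max-heap).
    heapq.heappush(max_heap, -num)
    # Step 2b: move the largest of the smaller half to the larger half.
    heapq.heappush(min_heap, -heapq.heappop(max_heap))
    # Step 2c: if the larger half now has more elements, move one back.
    if len(min_heap) > len(max_heap):
        heapq.heappush(max_heap, -heapq.heappop(min_heap))


def find_median(max_heap, min_heap):
    # Step 3: odd count reads one top, even count averages both tops.
    if len(max_heap) > len(min_heap):
        return float(-max_heap[0])
    return (-max_heap[0] + min_heap[0]) / 2


max_heap, min_heap = [], []
for num in [5, 2, 8, 1]:
    add_num(max_heap, min_heap, num)
    smaller = sorted(-x for x in max_heap)
    larger = sorted(min_heap)
    print(num, smaller, larger, find_median(max_heap, min_heap))
# 5 [5] [] 5.0
# 2 [2] [5] 3.5
# 8 [2, 5] [8] 5.0
# 1 [1, 2] [5, 8] 3.5
```

After every insertion the smaller half holds either the same number of elements as the larger half or exactly one more, so the median can be read from the tops.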

common_pitfalls:
- title: Sorted List Insertion Trap
description: |
A tempting first approach is to maintain a sorted list using binary search insertion:
- `bisect.insort()` finds the insertion point in O(log n) time
- But the actual insertion into the list still takes O(n) time because the later elements must be shifted

With up to `5 * 10^4` operations, this O(n) insertion leads to O(n^2) total time in the worst case, which may exceed the time limit (TLE).
wrong_approach: "Sorted list with binary search insertion"
correct_approach: "Two heaps for O(log n) insertion"

- title: Single Heap Mistake
description: |
You might think one heap is enough: just keep all elements and find the middle. But heaps only give you efficient access to one extreme (min or max), not the middle.

Finding the median in a single heap requires removing about half the elements, which is O(n log n) per query.
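
For comparison, a single-heap query might look like the sketch below. The function name `median_single_heap` is hypothetical, and the code works on a copy of the data, so the copy and `heapify` add a further O(n) on top of the pops.

```python
import heapq


def median_single_heap(nums):
    """Median via one min-heap: pop elements until the middle is reached."""
    heap = list(nums)
    heapq.heapify(heap)  # O(n) to build the heap
    n = len(heap)
    # Discard the (n - 1) // 2 smallest elements: about n/2 pops, O(log n) each.
    for _ in range((n - 1) // 2):
        heapq.heappop(heap)
    lower_middle = heapq.heappop(heap)
    if n % 2 == 1:
        return float(lower_middle)
    # Even count: the next smallest element is the upper middle.
    return (lower_middle + heap[0]) / 2


print(median_single_heap([4, 1, 3, 2]))  # 2.5
```

Every call repeats the same pops from scratch, which is the cost the two-heap split avoids.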
wrong_approach: "Single heap with repeated extraction"
correct_approach: "Two heaps splitting at the median"

- title: Heap Imbalance
description: |
If the heaps become unbalanced (size difference > 1), the median calculation breaks. For example, if `max_heap` has 5 elements and `min_heap` has 2, the median of the 7 elements is the 4th smallest, but the top of `max_heap` is the 5th smallest, so it is not the median.

Always rebalance after each insertion to maintain the invariant: `0 <= len(max_heap) - len(min_heap) <= 1`.
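
While debugging, the invariant can be checked explicitly after each insertion. The helper below is a hypothetical sketch (the name `check_invariants` is an assumption) and expects `max_heap` to hold negated values, as in a `heapq`-based max-heap.

```python
def check_invariants(max_heap, min_heap):
    """Raise AssertionError if the two-heap invariants are violated."""
    size_gap = len(max_heap) - len(min_heap)
    assert 0 <= size_gap <= 1, f"size gap is {size_gap}, expected 0 or 1"
    if max_heap and min_heap:
        # The largest of the smaller half must not exceed the smallest of the larger half.
        assert -max_heap[0] <= min_heap[0], "halves overlap"


check_invariants([-2, -1], [3, 4])  # balanced and ordered: passes silently
```

Calling it with the unbalanced 5-versus-2 split from the example above would raise an `AssertionError` for the size gap.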
wrong_approach: "Inserting without rebalancing"
correct_approach: "Rebalance heaps after every insertion"

- title: Python Heap Negation
description: |
Python's `heapq` module is built around a min-heap. To simulate a max-heap with it, you must negate values when pushing and negate again when peeking or popping.

Forgetting to negate leads to incorrect ordering: you'd get the minimum of the smaller half instead of the maximum.
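
A minimal demonstration of the negation trick, using only `heapq.heappush` and `heapq.heappop`; the variable names are illustrative:

```python
import heapq

smaller_half = []
for value in [3, 1, 2]:
    heapq.heappush(smaller_half, -value)  # store negated values

largest = -smaller_half[0]  # peek: negate to recover the real value
print(largest)  # 3

popped = -heapq.heappop(smaller_half)  # pop: negate again
print(popped, -smaller_half[0])  # 3 2
```

Without the negation, `smaller_half[0]` would be `1`, the minimum of the smaller half rather than its maximum.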
wrong_approach: "Using heapq as max-heap without negation"
correct_approach: "Negate values: push -x, pop and negate result"

key_takeaways:
- "**Two Heaps pattern**: Split data at the median using a max-heap for the lower half and min-heap for the upper half"
- "**Streaming data structure**: This design handles continuous data with O(log n) updates and O(1) queries"
- "**Heap balancing invariant**: Keep heap sizes within 1 of each other to ensure the median is always accessible at the tops"
- "**Foundation for variations**: This technique extends to finding other percentiles or handling weighted medians"

time_complexity: "O(log n) per `addNum` call due to heap insertion and rebalancing. O(1) per `findMedian` call since we only access heap tops."
space_complexity: "O(n) where n is the total number of elements added, as all elements are stored across the two heaps."

solutions:
- approach_name: Two Heaps
is_optimal: true
code: |
import heapq

class MedianFinder:
    def __init__(self):
        # Max-heap for smaller half (store negated values)
        self.max_heap = []
        # Min-heap for larger half
        self.min_heap = []

    def addNum(self, num: int) -> None:
        # Always add to max_heap first (negate for max-heap behaviour)
        heapq.heappush(self.max_heap, -num)

        # Move largest from max_heap to min_heap
        # This ensures max_heap elements <= min_heap elements
        heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))

        # Rebalance: max_heap should have equal or one more element
        if len(self.min_heap) > len(self.max_heap):
            heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))

    def findMedian(self) -> float:
        # Odd total: median is top of max_heap (cast so the return type is float)
        if len(self.max_heap) > len(self.min_heap):
            return float(-self.max_heap[0])
        # Even total: average of both tops
        return (-self.max_heap[0] + self.min_heap[0]) / 2
explanation: |
**Time Complexity:** O(log n) for `addNum`, since each heap operation is O(log n). O(1) for `findMedian`, which only reads the heap tops.

**Space Complexity:** O(n), storing all n elements across two heaps.

We maintain two heaps that split the data at the median. The max-heap holds the smaller half, the min-heap holds the larger half. After each insertion, we rebalance to keep sizes within 1. The median is always accessible at the top(s) of the heaps.

- approach_name: Sorted List with Binary Search
is_optimal: false
code: |
import bisect

class MedianFinder:
    def __init__(self):
        # Maintain a sorted list of all numbers
        self.nums = []

    def addNum(self, num: int) -> None:
        # Binary search to find insertion point: O(log n)
        # But actual insertion shifts elements: O(n)
        bisect.insort(self.nums, num)

    def findMedian(self) -> float:
        n = len(self.nums)
        mid = n // 2
        # Odd length: return middle element (cast so the return type is float)
        if n % 2 == 1:
            return float(self.nums[mid])
        # Even length: return average of two middle elements
        return (self.nums[mid - 1] + self.nums[mid]) / 2
explanation: |
**Time Complexity:** O(n) for `addNum`: binary search is O(log n) but list insertion is O(n). O(1) for `findMedian`, which is a direct index access.

**Space Complexity:** O(n), storing all n elements in a list.

This approach maintains a sorted list. While it is conceptually simple and gives O(1) median lookup, the O(n) insertion time makes it a poor fit for large inputs. It's included to illustrate why the two-heap approach is preferred.