codetutor/backend/data/questions/design-hashset.yaml
2025-05-25 11:08:40 +01:00

title: Design HashSet
slug: design-hashset
difficulty: easy
leetcode_id: 705
leetcode_url: https://leetcode.com/problems/design-hashset/
categories:
- arrays
- hash-tables
patterns:
- hashing
description: |
Design a HashSet without using any built-in hash table libraries.
Implement `MyHashSet` class:
- `void add(key)` Inserts the value `key` into the HashSet.
- `bool contains(key)` Returns whether the value `key` exists in the HashSet or not.
- `void remove(key)` Removes the value `key` in the HashSet. If `key` does not exist in the HashSet, do nothing.
constraints: |
- `0 <= key <= 10^6`
- At most `10^4` calls will be made to `add`, `remove`, and `contains`.
examples:
- input: '["MyHashSet", "add", "add", "contains", "contains", "add", "contains", "remove", "contains"], [[], [1], [2], [1], [3], [2], [2], [2], [2]]'
output: "[null, null, null, true, false, null, true, null, false]"
explanation: |
MyHashSet myHashSet = new MyHashSet();
myHashSet.add(1); // set = [1]
myHashSet.add(2); // set = [1, 2]
myHashSet.contains(1); // return True
myHashSet.contains(3); // return False, (not found)
myHashSet.add(2); // set = [1, 2]
myHashSet.contains(2); // return True
myHashSet.remove(2); // set = [1]
myHashSet.contains(2); // return False, (already removed)
explanation:
intuition: |
Think of a hash set like a large filing cabinet with numbered drawers. Instead of searching through every drawer to find a document, you use a formula (the **hash function**) to instantly determine which drawer to check.
The core insight is that we need to map a large key space (about a million possible keys, `0` to `10^6`) to a manageable number of storage locations. This is done using the **modulo operation**: `key % num_buckets` gives us a bucket index. For example, with 1000 buckets, keys `5`, `1005`, and `2005` all map to bucket `5`.
But wait — multiple keys can map to the same bucket! This is called a **collision**. To handle collisions, each bucket stores a list of all keys that hash to it. When we add, remove, or search for a key, we first compute its bucket, then operate on the list within that bucket.
The art of hash set design lies in choosing the right number of buckets — enough to keep the lists short (for fast operations), but not so many that we waste memory.
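As a minimal sketch of the bucket mapping described above (the names `NUM_BUCKETS` and `bucket_index` are illustrative and not part of the required `MyHashSet` API):

```python
NUM_BUCKETS = 1000  # illustrative bucket count taken from the example above


def bucket_index(key: int) -> int:
    """Map a key to a bucket index in the range [0, NUM_BUCKETS - 1]."""
    return key % NUM_BUCKETS


# 5, 1005 and 2005 collide in bucket 5; 42 lands in its own bucket.
indices = [bucket_index(key) for key in (5, 1005, 2005, 42)]
print(indices)  # [5, 5, 5, 42]
```

The three colliding keys would all be stored in the list held by bucket `5`.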
approach: |
We use **Separate Chaining** with an array of buckets, where each bucket is a list that handles collisions.
**Step 1: Choose the number of buckets**
- Use a prime number like `769` or `10007` to distribute keys evenly
- A prime reduces clustering from patterns in input data
- With at most `10^4` stored keys and `769` buckets, the average list length is at most ~13 (very fast)
&nbsp;
**Step 2: Initialise the data structure**
- Create an array of `num_buckets` empty lists
- `self.buckets`: The array where `buckets[i]` holds all keys with hash `i`
&nbsp;
**Step 3: Implement the hash function**
- `_hash(key)`: Returns `key % num_buckets`
- This maps any key to a valid bucket index in range `[0, num_buckets - 1]`
&nbsp;
**Step 4: Implement add(key)**
- Compute bucket index using `_hash(key)`
- Check if key already exists in that bucket's list (sets don't allow duplicates)
- If not present, append the key to the list
&nbsp;
**Step 5: Implement remove(key)**
- Compute bucket index using `_hash(key)`
- Search for the key in that bucket's list
- If found, remove it; if not found, do nothing
&nbsp;
**Step 6: Implement contains(key)**
- Compute bucket index using `_hash(key)`
- Return `True` if key is in that bucket's list, `False` otherwise
common_pitfalls:
- title: Using a Boolean Array
description: |
A tempting "simple" approach is to create a boolean array of size `10^6 + 1` where `arr[key] = True` means the key exists.
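While this works and gives O(1) operations, it reserves memory for the entire key range up front, even if you only store 10 keys: about **1 MB** at one byte per flag, and roughly **8 MB** for a Python list, which holds one 8-byte reference per slot on 64-bit CPython. This fails the spirit of the problem (designing a hash set) and wastes memory whenever few keys are stored.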
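The footprint of the direct-indexing approach depends on how each flag is stored. The sketch below is a rough measurement, not a benchmark: it compares a plain Python list with a `bytearray` (used here only as a stand-in for "one byte per flag"), and the exact numbers vary by interpreter and platform.

```python
import sys

KEY_RANGE = 10**6 + 1  # keys run from 0 to 10^6 inclusive

as_list = [False] * KEY_RANGE    # one object reference per slot
as_bytes = bytearray(KEY_RANGE)  # one byte per slot

# getsizeof reports the container itself, which is the relevant cost here:
# every list slot points at the same shared False object.
list_mb = sys.getsizeof(as_list) / 1e6
bytes_mb = sys.getsizeof(as_bytes) / 1e6
print(f"list: {list_mb:.1f} MB, bytearray: {bytes_mb:.1f} MB")
```

Either way, the cost is paid in full before a single key is added.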
wrong_approach: "Boolean array of size 10^6"
correct_approach: "Hash table with buckets using modulo"
- title: Forgetting Duplicate Prevention
description: |
A set must not contain duplicates. If you blindly append to the bucket list on every `add()` call, you'll end up with duplicate entries.
For example, calling `add(5)` three times should result in `5` appearing once, not three times. Always check for existence before adding.
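A small sketch of the difference, using two hypothetical helper functions (`add_blind` and `add_checked`) that operate on a single bucket list:

```python
def add_blind(bucket: list, key: int) -> None:
    """Buggy: appends on every call, so repeated adds create duplicates."""
    bucket.append(key)


def add_checked(bucket: list, key: int) -> None:
    """Correct: appends only when the key is not already in the bucket."""
    if key not in bucket:
        bucket.append(key)


blind, checked = [], []
for _ in range(3):
    add_blind(blind, 5)
    add_checked(checked, 5)

print(blind)    # [5, 5, 5]
print(checked)  # [5]

# Duplicates also break removal: one remove() leaves copies behind.
blind.remove(5)
print(5 in blind)  # True, although the key was "removed"
```

The last line shows why this is more than a memory problem: with duplicates in the bucket, `contains` can return `True` after `remove`.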
wrong_approach: "Always append key to bucket list"
correct_approach: "Check if key exists before appending"
- title: Poor Bucket Count Choice
description: |
Using too few buckets (e.g., 10) means each bucket holds many keys on average, making operations slow — approaching O(n) in the worst case.
Using a non-prime number (e.g., 1000) can cause clustering if keys follow patterns (e.g., all multiples of 1000 land in bucket `0`, and multiples of 100 occupy only 10 of the 1000 buckets).
A prime like `769` or `10007` distributes keys more uniformly.
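To make the effect of patterned keys concrete, the sketch below counts how many distinct buckets are occupied when every key is a multiple of 100. The helper name `buckets_used` is illustrative.

```python
def buckets_used(keys, num_buckets: int) -> int:
    """Count how many distinct buckets a collection of keys lands in."""
    used = [False] * num_buckets
    for key in keys:
        used[key % num_buckets] = True
    return sum(used)


# Every multiple of 100 in the allowed key range 0..10^6.
patterned_keys = range(0, 10**6 + 1, 100)

print(buckets_used(patterned_keys, 1000))  # 10
print(buckets_used(patterned_keys, 769))   # 769
```

With 1000 buckets the keys pile into 10 long chains. With 769 buckets they reach every bucket, because 769 is prime and so shares no factor with the step of 100.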
wrong_approach: "Using 10 buckets or a round number like 1000"
correct_approach: "Use a prime number of buckets (e.g., 769, 1009, 10007)"
- title: Not Handling remove() on Non-Existent Key
description: |
The problem states: "If `key` does not exist in the HashSet, do nothing."
If you use `list.remove(key)` without checking, Python raises `ValueError` when the key isn't found. Always check existence first or use a try/except block.
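Both safe variants can be sketched as follows; the function names `remove_if_present` and `remove_with_try` are illustrative.

```python
bucket = [5, 774]  # two keys that share bucket 5 when there are 769 buckets

# Unsafe: list.remove raises ValueError when the key is missing.
try:
    bucket.remove(42)
except ValueError:
    print("list.remove raised ValueError for a missing key")


def remove_if_present(bucket: list, key: int) -> None:
    """Check membership first, then remove."""
    if key in bucket:
        bucket.remove(key)


def remove_with_try(bucket: list, key: int) -> None:
    """Attempt the removal and ignore the error for a missing key."""
    try:
        bucket.remove(key)
    except ValueError:
        pass


remove_if_present(bucket, 42)  # no-op, no exception
remove_with_try(bucket, 42)    # no-op, no exception
remove_with_try(bucket, 5)
print(bucket)  # [774]
```

The membership check scans the bucket twice when the key is present, while the try/except version scans it once. With short chains the difference is negligible.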
wrong_approach: "Directly calling list.remove(key)"
correct_approach: "Check if key in bucket before removing"
key_takeaways:
- "**Hash function fundamentals**: The modulo operation is the simplest way to map large key spaces to bounded array indices"
- "**Collision handling**: Separate chaining (lists in buckets) is intuitive and works well for moderate load factors"
- "**Prime bucket counts**: Using a prime number of buckets reduces collision clustering from patterned input data"
- "**Design problems**: Understanding the *why* behind data structures (not just using built-ins) is crucial for interviews and system design"
time_complexity: "O(n/k) average for all operations, where `n` is the number of keys and `k` is the number of buckets. With a good hash function and sufficient buckets, this approaches O(1)."
space_complexity: "O(k + n), where `k` is the number of buckets and `n` is the number of stored keys. We allocate `k` empty lists upfront, then store `n` keys across them."
solutions:
- approach_name: Separate Chaining with Array of Lists
is_optimal: true
code: |
class MyHashSet:
    def __init__(self):
        # Use a prime number of buckets to reduce collision clustering
        self.num_buckets = 769
        # Each bucket is a list to handle collisions via chaining
        self.buckets = [[] for _ in range(self.num_buckets)]
    def _hash(self, key: int) -> int:
        # Map any key to a valid bucket index
        return key % self.num_buckets
    def add(self, key: int) -> None:
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        # Only add if not already present (sets don't allow duplicates)
        if key not in bucket:
            bucket.append(key)
    def remove(self, key: int) -> None:
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        # Only remove if present (avoid ValueError)
        if key in bucket:
            bucket.remove(key)
    def contains(self, key: int) -> bool:
        bucket_index = self._hash(key)
        bucket = self.buckets[bucket_index]
        # Check membership in the bucket's list
        return key in bucket
explanation: |
**Time Complexity:** O(n/k) average per operation — where `n` is keys stored and `k` is bucket count. With 769 buckets and up to 10^4 operations, average bucket size stays small, making operations effectively O(1).
**Space Complexity:** O(k + n) — We pre-allocate `k` empty lists (769 pointers) plus store `n` actual keys distributed across buckets.
This approach uses separate chaining to handle collisions. Each bucket is a Python list that stores all keys hashing to that index. The prime bucket count (769) helps distribute keys evenly even when input has patterns.
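The `n / k` figure can be checked with a quick simulation. This is a sketch under stated assumptions: it takes the worst case allowed by the constraints (all `10^4` calls are `add` with distinct keys) and draws the keys uniformly at random with a fixed seed.

```python
import random

NUM_BUCKETS = 769
NUM_KEYS = 10**4  # assumed worst case: every call adds a new key

random.seed(0)  # fixed seed so the run is repeatable
keys = random.sample(range(10**6 + 1), NUM_KEYS)  # distinct keys in range

# Track only chain lengths; the keys themselves are not needed here.
chain_lengths = [0] * NUM_BUCKETS
for key in keys:
    chain_lengths[key % NUM_BUCKETS] += 1

average = sum(chain_lengths) / NUM_BUCKETS
print(f"average chain length: {average:.1f}")  # 13.0, which is n / k
print(f"longest chain: {max(chain_lengths)}")
```

The average is fixed at `n / k` by definition. The longest chain is the number to watch, since it bounds the slowest single operation.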
- approach_name: Boolean Array (Space-Inefficient)
is_optimal: false
code: |
class MyHashSet:
    def __init__(self):
        # Allocate array for entire key range (wasteful)
        # A Python list stores one reference per slot, so this takes
        # ~8 MB on 64-bit CPython regardless of actual usage
        self.data = [False] * (10**6 + 1)
    def add(self, key: int) -> None:
        # Direct indexing - O(1) but wastes space
        self.data[key] = True
    def remove(self, key: int) -> None:
        self.data[key] = False
    def contains(self, key: int) -> bool:
        return self.data[key]
explanation: |
**Time Complexity:** O(1) for all operations — Direct array indexing.
**Space Complexity:** O(max_key) = O(10^6) — Allocates memory for entire key range upfront.
While this achieves O(1) time complexity, it defeats the purpose of the exercise. Real hash sets don't know the key range in advance and must handle arbitrary keys efficiently. This approach reserves memory for every possible key (about 1 MB at one byte per flag, roughly 8 MB for this Python list on 64-bit CPython) even when storing just a few keys, and doesn't teach the fundamental concepts of hashing and collision resolution.