271 lines
11 KiB
YAML
title: Regular Expression Matching
slug: regular-expression-matching
difficulty: hard
leetcode_id: 10
leetcode_url: https://leetcode.com/problems/regular-expression-matching/
categories:
- strings
- dynamic-programming
- recursion
patterns:
- dynamic-programming

function_signature: "def is_match(s: str, p: str) -> bool:"

test_cases:
visible:
- input: { s: "aa", p: "a" }
expected: false
- input: { s: "aa", p: "a*" }
expected: true
- input: { s: "ab", p: ".*" }
expected: true
hidden:
- input: { s: "a", p: "a" }
expected: true
- input: { s: "a", p: "." }
expected: true
- input: { s: "", p: "a*" }
expected: true
- input: { s: "aab", p: "c*a*b" }
expected: true
- input: { s: "mississippi", p: "mis*is*p*." }
expected: false

description: |
Given an input string `s` and a pattern `p`, implement regular expression matching with support for `'.'` and `'*'` where:

- `'.'` Matches any single character.
- `'*'` Matches zero or more of the preceding element.

The matching should cover the **entire** input string (not partial).

constraints: |
- `1 <= s.length <= 20`
- `1 <= p.length <= 20`
- `s` contains only lowercase English letters.
- `p` contains only lowercase English letters, `'.'`, and `'*'`.
- It is guaranteed that each `'*'` is preceded by a valid character to match.

examples:
- input: 's = "aa", p = "a"'
output: "false"
explanation: '"a" does not match the entire string "aa".'
- input: 's = "aa", p = "a*"'
output: "true"
explanation: '"*" means zero or more of the preceding element, "a". Therefore, by repeating "a" once, it becomes "aa".'
- input: 's = "ab", p = ".*"'
output: "true"
explanation: '".*" means "zero or more (*) of any character (.)".'

explanation:
intuition: |
Think of this problem as a **decision tree** where at each step you must decide how to match the current characters.

The key insight is that the `'*'` wildcard creates **branching possibilities**: when you see a pattern like `a*`, you can either:
1. **Use zero occurrences** of `a` (skip `a*` entirely and move on in the pattern)
2. **Use one or more occurrences** of `a` (if the current string character matches, consume it and keep the `a*` available for more matches)

This branching nature makes the problem a natural fit for **recursion** with **memoisation** (or bottom-up dynamic programming). Without memoisation, you'd repeatedly solve the same subproblems, leading to exponential time complexity.

The `'.'` wildcard is simpler: it just matches any single character, so treat it as a "universal match" when comparing characters.

The mental model is: "At each position, what are my options, and does *any* combination of choices lead to a full match?"

approach: |
We solve this using **Dynamic Programming** with a 2D table:

**Step 1: Define the DP state**

- `dp[i][j]`: Whether `s[0:i]` matches `p[0:j]`
- Our answer will be `dp[len(s)][len(p)]`

**Step 2: Initialise the base cases**

- `dp[0][0] = True`: Empty string matches empty pattern
- `dp[0][j]`: Empty string can match patterns like `a*b*c*` where each `x*` uses zero occurrences, so `dp[0][j] = dp[0][j-2]` whenever `p[j-1]` is `'*'`
- `dp[i][0] = False` for `i > 0`: Non-empty string cannot match empty pattern
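
As a minimal sketch of the `dp[0][j]` rule, the first row of the table can be built on its own. The helper name `empty_string_row` is illustrative only and is not part of the required `is_match` signature; the sketch assumes `p` never starts with `'*'`.

```python
from typing import List


def empty_string_row(p: str) -> List[bool]:
    """Return dp[0], where dp[0][j] says whether "" matches p[0:j]."""
    row = [False] * (len(p) + 1)
    row[0] = True  # empty string vs. empty pattern
    for j in range(2, len(p) + 1):
        # A trailing x* can be dropped, so inherit the answer from two columns back.
        if p[j - 1] == "*":
            row[j] = row[j - 2]
    return row


print(empty_string_row("a*b*c*"))
# [True, False, True, False, True, False, True]
```

Only the columns that end exactly on a complete run of `x*` pairs are `True`.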

**Step 3: Fill the DP table**

For each cell `dp[i][j]`, we consider the current pattern character `p[j-1]`:

- **Case 1: `p[j-1]` is `'*'`** (star wildcard)
  - *Option A*: Use zero occurrences of the preceding element: `dp[i][j] = dp[i][j-2]`
  - *Option B*: Use one or more occurrences (only if `s[i-1]` matches `p[j-2]`): `dp[i][j] = dp[i-1][j]`
  - We take the OR of both options

- **Case 2: `p[j-1]` is `'.'` or a letter**
  - Check if `s[i-1]` matches `p[j-1]` (either same letter or `'.'`)
  - If match: `dp[i][j] = dp[i-1][j-1]`
  - If no match: `dp[i][j] = False`
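
The two cases can be isolated in a single-cell helper. This is only a sketch: `cell_value` is a hypothetical name, and it assumes that row `i - 1` and the earlier columns of row `i` are already filled and that `p` never starts with `'*'`.

```python
from typing import List


def cell_value(dp: List[List[bool]], s: str, p: str, i: int, j: int) -> bool:
    """Compute dp[i][j] for i >= 1 and j >= 1 from already-filled neighbours."""
    if p[j - 1] == "*":
        zero = dp[i][j - 2]  # Option A: drop the x* pair entirely
        elem_matches = p[j - 2] in (".", s[i - 1])
        more = elem_matches and dp[i - 1][j]  # Option B: consume s[i-1], keep x*
        return zero or more
    if p[j - 1] in (".", s[i - 1]):
        return dp[i - 1][j - 1]  # Case 2: single-character match
    return False


# Tiny hand-built table for s = "a", p = "a*"; row 0 holds the base cases.
dp = [[True, False, True], [False, False, False]]
dp[1][1] = cell_value(dp, "a", "a*", 1, 1)
dp[1][2] = cell_value(dp, "a", "a*", 1, 2)
print(dp[1])  # [False, True, True]
```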

**Step 4: Return the result**

- Return `dp[len(s)][len(p)]`

common_pitfalls:
- title: Mishandling the Star Wildcard
description: |
The `'*'` doesn't stand alone; it modifies the **preceding character**. A common mistake is treating `*` as "match anything" like in shell globbing.

In regex matching, `a*` means "zero or more `a`s", not "anything". The pattern `.*` means "zero or more of any character" because `.` matches any single character.

Always process `*` together with its preceding character as a single unit.
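
One way to picture "a single unit" is to split the pattern into (element, starred) pairs before matching. The sketch below is purely illustrative: `pattern_units` is a hypothetical helper, not something the solutions depend on, and it assumes a valid pattern (no leading `'*'` and no `'**'`).

```python
from typing import List, Tuple


def pattern_units(p: str) -> List[Tuple[str, bool]]:
    """Split p into (element, starred) pairs so '*' never appears on its own."""
    units = []
    k = 0
    while k < len(p):
        starred = k + 1 < len(p) and p[k + 1] == "*"
        units.append((p[k], starred))
        k += 2 if starred else 1  # a starred element occupies two characters
    return units


print(pattern_units("c*a*b"))  # [('c', True), ('a', True), ('b', False)]
print(pattern_units(".*"))     # [('.', True)]
```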
wrong_approach: "Treating * as an independent wildcard"
correct_approach: "Process * with its preceding character as a unit"

- title: Forgetting the Zero-Match Case
description: |
When you see `x*` in the pattern, you might only consider matching one or more `x`s. But `*` means **zero or more**, so you must also consider skipping `x*` entirely.

For example, matching `s = "aab"` against `p = "c*a*b"`:
- `c*` matches zero `c`s
- `a*` matches two `a`s
- `b` matches `b`

Missing the zero-match case makes the matcher wrongly reject inputs like this one.
wrong_approach: "Only considering one or more matches for x*"
correct_approach: "Always consider both zero matches (skip) and one-or-more matches"

- title: Incorrect Base Case for Empty String
description: |
An empty string `s` can still match certain patterns. For example:
- `s = ""` matches `p = "a*"` (zero `a`s)
- `s = ""` matches `p = "a*b*c*"` (zero of each)

You must carefully initialise `dp[0][j]` by checking if `p[0:j]` can match an empty string. This happens when the pattern consists entirely of `x*` pairs.
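
The "entirely `x*` pairs" condition can be written as a direct check. This is a sketch under the problem's guarantee that every `'*'` follows a valid character; `can_match_empty` is a hypothetical name used only for illustration.

```python
def can_match_empty(p: str) -> bool:
    """True if p is made up solely of x* pairs (the empty pattern included)."""
    # Every second character must be '*', and no element may be left unpaired.
    return len(p) % 2 == 0 and all(p[k] == "*" for k in range(1, len(p), 2))


print(can_match_empty("a*b*c*"))  # True
print(can_match_empty("a*b"))     # False
```

Under that guarantee, `dp[0][j]` should equal `can_match_empty(p[:j])` for every `j`.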
wrong_approach: "Assuming empty string only matches empty pattern"
correct_approach: "Check if pattern can reduce to empty via x* zero-matches"

- title: Off-by-One Errors in Indexing
description: |
The DP table has dimensions `(len(s)+1) x (len(p)+1)` to handle empty string/pattern cases. When accessing `s[i-1]` or `p[j-1]` from `dp[i][j]`, it's easy to make indexing mistakes.

Be consistent: `dp[i][j]` represents matching `s[0:i]` with `p[0:j]`, so the "current" characters are `s[i-1]` and `p[j-1]`.
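
A small debugging aid can make the index mapping concrete. The helper below is a hypothetical illustration (the name `cell_context` is not from the solutions) and assumes `i >= 1` and `j >= 1`.

```python
from typing import Tuple


def cell_context(s: str, p: str, i: int, j: int) -> Tuple[str, str, str, str]:
    """Return the prefixes and "current" characters that dp[i][j] refers to."""
    return s[:i], p[:j], s[i - 1], p[j - 1]


print(cell_context("aab", "c*a*b", 2, 4))  # ('aa', 'c*a*', 'a', '*')
```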

key_takeaways:
- "**DP on two sequences**: When matching/comparing two strings, think of a 2D DP table where `dp[i][j]` represents the answer for prefixes `s[0:i]` and `p[0:j]`"
- "**Handle wildcards as units**: `*` modifies its preceding character; process them together"
- "**Consider all branches**: The `*` creates branching (zero vs. one-or-more matches); use OR logic to combine possibilities"
- "**Foundation for harder problems**: This pattern extends to wildcard matching, edit distance, and other two-string DP problems"

time_complexity: "O(m * n), where m = len(s) and n = len(p). We fill a 2D table of size `(len(s)+1) x (len(p)+1)`, and each cell takes O(1) time."
space_complexity: "O(m * n). We use a 2D DP table. This can be optimised to O(n) using rolling arrays, since each row depends only on itself and the previous row."

solutions:
- approach_name: Dynamic Programming (Bottom-Up)
is_optimal: true
code: |
def is_match(s: str, p: str) -> bool:
    m, n = len(s), len(p)
    # dp[i][j] = True if s[0:i] matches p[0:j]
    dp = [[False] * (n + 1) for _ in range(m + 1)]

    # Base case: empty string matches empty pattern
    dp[0][0] = True

    # Base case: empty string can match patterns like a*, a*b*, etc.
    for j in range(2, n + 1):
        # If current char is *, we can use zero occurrences of preceding char
        if p[j - 1] == '*':
            dp[0][j] = dp[0][j - 2]

    # Fill the DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[j - 1] == '*':
                # Option 1: use zero occurrences of preceding element
                dp[i][j] = dp[i][j - 2]

                # Option 2: use one or more (if current char matches preceding pattern char)
                if p[j - 2] == '.' or p[j - 2] == s[i - 1]:
                    dp[i][j] = dp[i][j] or dp[i - 1][j]

            elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
                # Direct match: current chars match
                dp[i][j] = dp[i - 1][j - 1]
            # else: dp[i][j] remains False (no match)

    return dp[m][n]
explanation: |
**Time Complexity:** O(m * n). We fill each cell of the `(m+1) x (n+1)` table exactly once.

**Space Complexity:** O(m * n). We store the entire DP table.

This bottom-up approach builds the solution from smaller subproblems. The key transitions handle the `*` wildcard by considering both zero matches (skip) and one-or-more matches (consume and stay).
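
If the full table is not needed, the same transitions can run on two rows at a time. The following is a sketch of that rolling-array variant, not a replacement for the solution above; the name `is_match_two_rows` is made up for illustration, and the code assumes `p` never starts with `'*'`.

```python
def is_match_two_rows(s: str, p: str) -> bool:
    """Same recurrence as the 2D table, keeping only the previous and current rows."""
    n = len(p)

    # Row 0: the empty string against every prefix of p.
    prev = [False] * (n + 1)
    prev[0] = True
    for j in range(2, n + 1):
        if p[j - 1] == "*":
            prev[j] = prev[j - 2]

    for ch in s:
        curr = [False] * (n + 1)  # curr[0] stays False: non-empty s vs. empty p
        for j in range(1, n + 1):
            if p[j - 1] == "*":
                # Zero occurrences (same row), or consume ch and keep x* (previous row).
                curr[j] = curr[j - 2] or (p[j - 2] in (".", ch) and prev[j])
            elif p[j - 1] in (".", ch):
                curr[j] = prev[j - 1]
        prev = curr

    return prev[n]


print(is_match_two_rows("aab", "c*a*b"))  # True
print(is_match_two_rows("aa", "a"))       # False
```

This keeps the O(m * n) running time while reducing the extra space to O(n).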

- approach_name: Recursion with Memoisation
is_optimal: true
code: |
def is_match(s: str, p: str) -> bool:
    memo = {}

    def dp(i: int, j: int) -> bool:
        """Check if s[i:] matches p[j:]"""
        if (i, j) in memo:
            return memo[(i, j)]

        # Base case: pattern exhausted
        if j == len(p):
            return i == len(s)

        # Check if first characters match
        first_match = i < len(s) and (p[j] == s[i] or p[j] == '.')

        # Handle star wildcard
        if j + 1 < len(p) and p[j + 1] == '*':
            # Option 1: skip x* (zero occurrences)
            # Option 2: use x* (if first char matches, consume it)
            result = dp(i, j + 2) or (first_match and dp(i + 1, j))
        else:
            # No star: must match current char and recurse
            result = first_match and dp(i + 1, j + 1)

        memo[(i, j)] = result
        return result

    return dp(0, 0)
explanation: |
**Time Complexity:** O(m * n). Each unique `(i, j)` state is computed once and cached.

**Space Complexity:** O(m * n) for the memoisation cache, plus O(m + n) recursion stack depth.

This top-down approach directly translates the recursive thinking. The memoisation dictionary prevents redundant computation of overlapping subproblems.

- approach_name: Recursion (Brute Force)
is_optimal: false
code: |
def is_match(s: str, p: str) -> bool:
    def dp(i: int, j: int) -> bool:
        """Check if s[i:] matches p[j:]"""
        # Base case: pattern exhausted
        if j == len(p):
            return i == len(s)

        # Check if first characters match
        first_match = i < len(s) and (p[j] == s[i] or p[j] == '.')

        # Handle star wildcard
        if j + 1 < len(p) and p[j + 1] == '*':
            # Option 1: skip x* (zero occurrences)
            # Option 2: use x* (if first char matches, consume it)
            return dp(i, j + 2) or (first_match and dp(i + 1, j))
        else:
            # No star: must match current char and recurse
            return first_match and dp(i + 1, j + 1)

    return dp(0, 0)
explanation: |
**Time Complexity:** O(2^(m+n)) in the worst case. Without memoisation, the same subproblems are recomputed exponentially many times.

**Space Complexity:** O(m + n) for the recursion stack depth.

This naive recursive solution is correct but extremely slow. Patterns with many `*` wildcards cause exponential branching. Included to show why memoisation is essential.
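
To see the difference in practice, the recursion can be instrumented with a call counter. This is an illustrative sketch only: `count_calls` and its `use_memo` flag are hypothetical names, and the input is one hand-picked case within the stated constraints where many `x*` pairs compete for the same characters.

```python
from typing import Tuple


def count_calls(s: str, p: str, use_memo: bool) -> Tuple[bool, int]:
    """Run the recursive matcher and report (result, number of dp() calls)."""
    calls = 0
    memo = {}

    def dp(i: int, j: int) -> bool:
        nonlocal calls
        calls += 1
        if use_memo and (i, j) in memo:
            return memo[(i, j)]
        if j == len(p):
            result = i == len(s)
        else:
            first_match = i < len(s) and p[j] in (s[i], ".")
            if j + 1 < len(p) and p[j + 1] == "*":
                result = dp(i, j + 2) or (first_match and dp(i + 1, j))
            else:
                result = first_match and dp(i + 1, j + 1)
        if use_memo:
            memo[(i, j)] = result
        return result

    return dp(0, 0), calls


s, p = "a" * 12, "a*" * 6 + "b"  # no match, so every branch gets explored
print(count_calls(s, p, use_memo=False))
print(count_calls(s, p, use_memo=True))
```

With memoisation the call count is bounded by roughly twice the number of `(i, j)` states, whereas the plain recursion revisits the same states along every path that reaches them.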