# codetutor/backend/data/questions/regular-expression-matching.yaml

title: Regular Expression Matching
slug: regular-expression-matching
difficulty: hard
leetcode_id: 10
leetcode_url: https://leetcode.com/problems/regular-expression-matching/
categories:
  - strings
  - dynamic-programming
  - recursion
patterns:
  - dynamic-programming

function_signature: "def is_match(s: str, p: str) -> bool:"

test_cases:
  visible:
    - input: { s: "aa", p: "a" }
      expected: false
    - input: { s: "aa", p: "a*" }
      expected: true
    - input: { s: "ab", p: ".*" }
      expected: true
  hidden:
    - input: { s: "a", p: "a" }
      expected: true
    - input: { s: "a", p: "." }
      expected: true
    - input: { s: "", p: "a*" }
      expected: true
    - input: { s: "aab", p: "c*a*b" }
      expected: true
    - input: { s: "mississippi", p: "mis*is*p*." }
      expected: false

description: |
  Given an input string `s` and a pattern `p`, implement regular expression matching with support for `'.'` and `'*'` where:

  - `'.'` Matches any single character.
  - `'*'` Matches zero or more of the preceding element.

  The matching should cover the **entire** input string (not partial).

constraints: |
  - `1 <= s.length <= 20`
  - `1 <= p.length <= 20`
  - `s` contains only lowercase English letters.
  - `p` contains only lowercase English letters, `'.'`, and `'*'`.
  - It is guaranteed that for each appearance of the character `'*'`, there is a preceding valid character for it to apply to.

examples:
  - input: 's = "aa", p = "a"'
    output: "false"
    explanation: '"a" does not match the entire string "aa".'
  - input: 's = "aa", p = "a*"'
    output: "true"
    explanation: '"*" means zero or more of the preceding element, "a". Therefore, by repeating "a" once, it becomes "aa".'
  - input: 's = "ab", p = ".*"'
    output: "true"
    explanation: '".*" means "zero or more (*) of any character (.)".'

explanation:
  intuition: |
    Think of this problem as a **decision tree** where at each step you must decide how to match the current characters.

    The key insight is that the `'*'` wildcard creates **branching possibilities**: when you see a pattern like `a*`, you can either:
    1. **Use zero occurrences** of `a` (skip `a*` entirely and move on in the pattern)
    2. **Use one or more occurrences** of `a` (if the current string character matches, consume it and keep the `a*` available for more matches)

    This branching nature makes the problem a natural fit for **recursion** with **memoisation** (or bottom-up dynamic programming). Without memoisation, you'd repeatedly solve the same subproblems, leading to exponential time complexity.

    The `'.'` wildcard is simpler: it just matches any single character, so treat it as a "universal match" when comparing characters.

    The mental model is: "At each position, what are my options, and does *any* combination of choices lead to a full match?"

  approach: |
    We solve this using **Dynamic Programming** with a 2D table:

    **Step 1: Define the DP state**

    - `dp[i][j]`: Whether `s[0:i]` matches `p[0:j]`
    - Our answer will be `dp[len(s)][len(p)]`

    &nbsp;

    **Step 2: Initialise the base cases**

    - `dp[0][0] = True`: Empty string matches empty pattern
    - `dp[0][j]`: Empty string can match patterns like `a*b*c*` where each `x*` uses zero occurrences, so `dp[0][j] = dp[0][j-2]` whenever `p[j-1]` is `'*'` (and `False` otherwise)
    - `dp[i][0] = False` for `i > 0`: Non-empty string cannot match empty pattern
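
    The first-row initialisation can be sketched on its own. This is a minimal illustration, and the helper name `empty_string_row` is hypothetical; it relies on the guarantee that every `'*'` has a preceding element.

    ```python
    from typing import List


    def empty_string_row(p: str) -> List[bool]:
        """Hypothetical helper: return dp[0][0..len(p)] for the empty string."""
        row = [False] * (len(p) + 1)
        row[0] = True  # empty string matches empty pattern
        for j in range(2, len(p) + 1):
            # An "x*" pair ending at column j can be dropped, so look two columns back.
            if p[j - 1] == "*":
                row[j] = row[j - 2]
        return row


    print(empty_string_row("a*b*c"))  # [True, False, True, False, True, False]
    ```

    Only the columns that end a run of `x*` pairs are `True`; the trailing `c` cannot match the empty string.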

    &nbsp;

    **Step 3: Fill the DP table**

    For each cell `dp[i][j]`, we consider the current pattern character `p[j-1]`:

    - **Case 1: `p[j-1]` is `'*'`** (star wildcard)
      - *Option A*: Use zero occurrences of the preceding element, which gives `dp[i][j-2]`
      - *Option B*: Use one or more occurrences, which gives `dp[i-1][j]` and is allowed only if `p[j-2]` is `'.'` or equals `s[i-1]`
      - `dp[i][j]` is the OR of the available options

    - **Case 2: `p[j-1]` is `'.'` or a letter**
      - Check if `s[i-1]` matches `p[j-1]` (either same letter or `'.'`)
      - If match: `dp[i][j] = dp[i-1][j-1]`
      - If no match: `dp[i][j] = False`
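
    The two cases can be written as a single-cell transition. The sketch below is illustrative only: `cell_value` is a hypothetical helper, and it assumes row `i-1` and the earlier columns of row `i` are already filled.

    ```python
    from typing import List


    def cell_value(dp: List[List[bool]], s: str, p: str, i: int, j: int) -> bool:
        """Hypothetical helper: compute dp[i][j] for i, j >= 1."""
        if p[j - 1] == "*":
            zero = dp[i][j - 2]  # Option A: drop the "x*" pair
            char_ok = p[j - 2] in (".", s[i - 1])
            more = char_ok and dp[i - 1][j]  # Option B: consume s[i-1], keep "x*"
            return zero or more
        # Case 2: a letter or '.' must match s[i-1] exactly once
        return p[j - 1] in (".", s[i - 1]) and dp[i - 1][j - 1]


    s, p = "aa", "a*"
    dp = [[False] * (len(p) + 1) for _ in range(len(s) + 1)]
    dp[0][0] = True
    dp[0][2] = True  # base case: "a*" matches the empty string
    for i in range(1, len(s) + 1):
        for j in range(1, len(p) + 1):
            dp[i][j] = cell_value(dp, s, p, i, j)
    print(dp[len(s)][len(p)])  # True
    ```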

    &nbsp;

    **Step 4: Return the result**

    - Return `dp[len(s)][len(p)]`

  common_pitfalls:
    - title: Mishandling the Star Wildcard
      description: |
        The `'*'` doesn't stand alone; it modifies the **preceding character**. A common mistake is treating `*` as "match anything" like in shell globbing.

        In regex matching, `a*` means "zero or more `a`s", not "anything". The pattern `.*` means "zero or more of any character" because `.` matches any single character.

        Always process `*` together with its preceding character as a single unit.
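
        The difference in semantics can be seen with Python's standard library, where `fnmatch` implements shell-style globbing and `re` implements regular expressions. This only illustrates the two meanings of `*`; it is not part of the solution.

        ```python
        import fnmatch
        import re

        # Shell globbing: '*' on its own matches any run of characters.
        print(fnmatch.fnmatchcase("ab", "a*"))       # True

        # Regex: 'a*' means zero or more 'a', so the trailing 'b' is not covered.
        print(re.fullmatch("a*", "ab") is not None)  # False

        # In regex, '.*' is the pattern that matches any run of characters.
        print(re.fullmatch(".*", "ab") is not None)  # True
        ```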
      wrong_approach: "Treating * as an independent wildcard"
      correct_approach: "Process * with its preceding character as a unit"

    - title: Forgetting the Zero-Match Case
      description: |
        When you see `x*` in the pattern, you might only consider matching one or more `x`s. But `*` means **zero or more**, so you must also consider skipping `x*` entirely.

        For example, matching `s = "aab"` against `p = "c*a*b"`:
        - `c*` matches zero `c`s
        - `a*` matches two `a`s
        - `b` matches `b`

        Missing the zero-match case will cause incorrect results.
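
        A small sketch makes the failure concrete. The function `matches` and its `allow_zero` flag are hypothetical, introduced only to simulate the bug; with `allow_zero=False` the star is treated as "one or more".

        ```python
        def matches(s: str, p: str, allow_zero: bool = True) -> bool:
            """Plain recursive matcher; allow_zero=False simulates the pitfall."""
            if not p:
                return not s
            first = bool(s) and p[0] in (".", s[0])
            if len(p) >= 2 and p[1] == "*":
                # Zero occurrences: skip the "x*" pair (disabled in the buggy mode).
                skip = allow_zero and matches(s, p[2:], allow_zero)
                # One or more: consume a character, then stay on "x*" or move past it.
                use = first and (
                    matches(s[1:], p, allow_zero) or matches(s[1:], p[2:], allow_zero)
                )
                return skip or use
            return first and matches(s[1:], p[1:], allow_zero)


        print(matches("aab", "c*a*b"))                    # True
        print(matches("aab", "c*a*b", allow_zero=False))  # False: "c*" cannot be skipped
        ```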
      wrong_approach: "Only considering one or more matches for x*"
      correct_approach: "Always consider both zero matches (skip) and one-or-more matches"

    - title: Incorrect Base Case for Empty String
      description: |
        An empty string `s` can still match certain patterns. For example:
        - `s = ""` matches `p = "a*"` (zero `a`s)
        - `s = ""` matches `p = "a*b*c*"` (zero of each)

        You must carefully initialise `dp[0][j]` by checking if `p[0:j]` can match an empty string. This happens when the pattern consists entirely of `x*` pairs.
      wrong_approach: "Assuming empty string only matches empty pattern"
      correct_approach: "Check if pattern can reduce to empty via x* zero-matches"

    - title: Off-by-One Errors in Indexing
      description: |
        The DP table has dimensions `(len(s)+1) x (len(p)+1)` to handle empty string/pattern cases. When accessing `s[i-1]` or `p[j-1]` from `dp[i][j]`, it's easy to make indexing mistakes.

        Be consistent: `dp[i][j]` represents matching `s[0:i]` with `p[0:j]`, so the "current" characters are `s[i-1]` and `p[j-1]`.

  key_takeaways:
    - "**DP on two sequences**: When matching/comparing two strings, think of a 2D DP table where `dp[i][j]` represents the answer for prefixes `s[0:i]` and `p[0:j]`"
    - "**Handle wildcards as units**: `*` modifies its preceding character; process them together"
    - "**Consider all branches**: The `*` creates branching (zero vs. one-or-more matches); use OR logic to combine possibilities"
    - "**Foundation for harder problems**: This pattern extends to wildcard matching, edit distance, and other two-string DP problems"

  time_complexity: "O(m * n), where `m = len(s)` and `n = len(p)`. We fill a 2D table of size `(m+1) x (n+1)`, and each cell takes O(1) time."
  space_complexity: "O(m * n) for the full 2D DP table. This can be optimised to O(n) with two rolling rows, since each cell depends only on the current row and the previous row."

solutions:
  - approach_name: Dynamic Programming (Bottom-Up)
    is_optimal: true
    code: |
      def is_match(s: str, p: str) -> bool:
          m, n = len(s), len(p)
          # dp[i][j] = True if s[0:i] matches p[0:j]
          dp = [[False] * (n + 1) for _ in range(m + 1)]

          # Base case: empty string matches empty pattern
          dp[0][0] = True

          # Base case: empty string can match patterns like a*, a*b*, etc.
          for j in range(2, n + 1):
              # If current char is *, we can use zero occurrences of preceding char
              if p[j - 1] == '*':
                  dp[0][j] = dp[0][j - 2]

          # Fill the DP table
          for i in range(1, m + 1):
              for j in range(1, n + 1):
                  if p[j - 1] == '*':
                      # Option 1: use zero occurrences of preceding element
                      dp[i][j] = dp[i][j - 2]

                      # Option 2: use one or more (if current char matches preceding pattern char)
                      if p[j - 2] == '.' or p[j - 2] == s[i - 1]:
                          dp[i][j] = dp[i][j] or dp[i - 1][j]

                  elif p[j - 1] == '.' or p[j - 1] == s[i - 1]:
                      # Direct match: current chars match
                      dp[i][j] = dp[i - 1][j - 1]
                  # else: dp[i][j] remains False (no match)

          return dp[m][n]
    explanation: |
      **Time Complexity:** O(m * n) — We fill each cell of the `(m+1) x (n+1)` table exactly once.

      **Space Complexity:** O(m * n) — We store the entire DP table.

      This bottom-up approach builds the solution from smaller subproblems. The key transitions handle the `*` wildcard by considering both zero matches (skip) and one-or-more matches (consume and stay).
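
      If memory matters, the table can be reduced to two rows, because a cell reads only `dp[i][j-2]` (current row) and `dp[i-1][j]` or `dp[i-1][j-1]` (previous row). The following is a sketch of that variant under the same transitions; the name `is_match_two_rows` is hypothetical.

      ```python
      def is_match_two_rows(s: str, p: str) -> bool:
          m, n = len(s), len(p)
          # prev holds dp[i-1][*]; it starts as the empty-string row dp[0][*].
          prev = [False] * (n + 1)
          prev[0] = True
          for j in range(2, n + 1):
              if p[j - 1] == "*":
                  prev[j] = prev[j - 2]

          for i in range(1, m + 1):
              curr = [False] * (n + 1)  # dp[i][0] is False for i > 0
              for j in range(1, n + 1):
                  if p[j - 1] == "*":
                      curr[j] = curr[j - 2]  # zero occurrences: same row
                      if p[j - 2] in (".", s[i - 1]):
                          curr[j] = curr[j] or prev[j]  # one or more: previous row
                  elif p[j - 1] in (".", s[i - 1]):
                      curr[j] = prev[j - 1]
              prev = curr
          return prev[n]
      ```

      This keeps the O(m * n) running time and uses O(n) extra space.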

  - approach_name: Recursion with Memoisation
    is_optimal: true
    code: |
      def is_match(s: str, p: str) -> bool:
          memo = {}

          def dp(i: int, j: int) -> bool:
              """Check if s[i:] matches p[j:]"""
              if (i, j) in memo:
                  return memo[(i, j)]

              # Base case: pattern exhausted
              if j == len(p):
                  return i == len(s)

              # Check if first characters match
              first_match = i < len(s) and (p[j] == s[i] or p[j] == '.')

              # Handle star wildcard
              if j + 1 < len(p) and p[j + 1] == '*':
                  # Option 1: skip x* (zero occurrences)
                  # Option 2: use x* (if first char matches, consume it)
                  result = dp(i, j + 2) or (first_match and dp(i + 1, j))
              else:
                  # No star: must match current char and recurse
                  result = first_match and dp(i + 1, j + 1)

              memo[(i, j)] = result
              return result

          return dp(0, 0)
    explanation: |
      **Time Complexity:** O(m * n) — Each unique `(i, j)` state is computed once and cached.

      **Space Complexity:** O(m * n) — For the memoisation cache, plus O(m + n) recursion stack depth.

      This top-down approach directly translates the recursive thinking. The memoisation dictionary prevents redundant computation of overlapping subproblems.
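
      As an alternative to a hand-written dictionary, the same caching can be sketched with `functools.lru_cache` from the standard library. The name `is_match_cached` is hypothetical; the recursion is the same as above.

      ```python
      from functools import lru_cache


      def is_match_cached(s: str, p: str) -> bool:
          @lru_cache(maxsize=None)  # caches results keyed on (i, j)
          def dp(i: int, j: int) -> bool:
              """Check if s[i:] matches p[j:]"""
              if j == len(p):
                  return i == len(s)
              first_match = i < len(s) and p[j] in (s[i], ".")
              if j + 1 < len(p) and p[j + 1] == "*":
                  # Skip "x*", or consume one character and stay on "x*"
                  return dp(i, j + 2) or (first_match and dp(i + 1, j))
              return first_match and dp(i + 1, j + 1)

          return dp(0, 0)
      ```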

  - approach_name: Recursion (Brute Force)
    is_optimal: false
    code: |
      def is_match(s: str, p: str) -> bool:
          def dp(i: int, j: int) -> bool:
              """Check if s[i:] matches p[j:]"""
              # Base case: pattern exhausted
              if j == len(p):
                  return i == len(s)

              # Check if first characters match
              first_match = i < len(s) and (p[j] == s[i] or p[j] == '.')

              # Handle star wildcard
              if j + 1 < len(p) and p[j + 1] == '*':
                  # Option 1: skip x* (zero occurrences)
                  # Option 2: use x* (if first char matches, consume it)
                  return dp(i, j + 2) or (first_match and dp(i + 1, j))
              else:
                  # No star: must match current char and recurse
                  return first_match and dp(i + 1, j + 1)

          return dp(0, 0)
    explanation: |
      **Time Complexity:** O(2^(m+n)) in the worst case — Without memoisation, the same subproblems are recomputed exponentially many times.

      **Space Complexity:** O(m + n) — Recursion stack depth.

      This naive recursive solution is correct but extremely slow. Patterns with many `*` wildcards cause exponential branching. Included to show why memoisation is essential.
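
      The effect of memoisation can be observed by counting how often the recursive body runs. This is an illustrative experiment only; `count_calls` and the chosen input are assumptions, and exact counts depend on the input.

      ```python
      from functools import lru_cache


      def count_calls(s: str, p: str, memoise: bool) -> int:
          """Hypothetical helper: count executions of the recursive body."""
          calls = 0

          def dp(i: int, j: int) -> bool:
              nonlocal calls
              calls += 1
              if j == len(p):
                  return i == len(s)
              first_match = i < len(s) and p[j] in (s[i], ".")
              if j + 1 < len(p) and p[j + 1] == "*":
                  return solve(i, j + 2) or (first_match and solve(i + 1, j))
              return first_match and solve(i + 1, j + 1)

          # With memoise=True, repeated (i, j) states are served from the cache.
          solve = lru_cache(maxsize=None)(dp) if memoise else dp
          solve(0, 0)
          return calls


      # Many stars plus a final 'b' that never matches forces every branch to be tried.
      s, p = "a" * 10, "a*" * 6 + "b"
      brute = count_calls(s, p, memoise=False)
      cached = count_calls(s, p, memoise=True)
      print(brute, cached)
      ```

      The memoised count cannot exceed the number of distinct `(i, j)` states, `(len(s)+1) * (len(p)+1)`, while the brute-force count grows rapidly as more `x*` pairs are added.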