Skip to content

Instantly share code, notes, and snippets.

@halcyondude
Created June 12, 2025 21:09
Show Gist options
  • Save halcyondude/d44f0fecfe90a65ed527e45667625a7f to your computer and use it in GitHub Desktop.
Save halcyondude/d44f0fecfe90a65ed527e45667625a7f to your computer and use it in GitHub Desktop.
pyahocorasick : add match_any fast function (PR text)

feat: Add Automaton.any_match() for fast existence checking

Abstract

This pull request introduces a new method, Automaton.any_match(string), to provide a high-performance interface for checking if any keyword exists within a given string.

Currently, the most direct way to perform an existence check is next(A.iter(text), None) is not None. While functional, this pattern incurs overhead from the creation and state management of a Python-level iterator object. For high-throughput applications where only the presence of a match is needed, this overhead can be a bottleneck.

The any_match() method is a stateless, C-level function that implements a true early-exit search. It avoids Python object allocation for iterators or results, returning a singleton Py_True or Py_False. This provides a significant performance gain, especially in scenarios where matches are found early in the input text.

Asymptotic and Practical Performance Analysis

Let N be the length of the input string (the haystack) and M be the length of the shortest keyword that matches.

next(A.iter(text), None)

  • Time Complexity: O(p) where p is the end position of the first found match. In the best case, p is approximately M.
  • Space Complexity: O(1) auxiliary space.
  • Practical Overhead: This approach has a higher constant factor overhead due to:
    1. Heap allocation and initialization of a stateful AutomatonSearchIter Python object.
    2. The cost of yielding a Python tuple (end_index, value) from C back to the Python layer.

A.any_match(text) (This PR)

  • Time Complexity: O(p), identical to the iterator approach in the best case (p ~ M) and worst case (p = N).
  • Space Complexity: O(1) auxiliary space (stack only).
  • Practical Overhead: The constant factor overhead is substantially lower. any_match() is a direct, stateless C function call that avoids:
    1. Heap allocation of an iterator object.
    2. The creation of any result tuples. It returns Python's boolean singletons directly.

The primary benefit of any_match() is the reduction in these constant factors and the guarantee of a minimal-overhead early exit, making it measurably faster in practice, especially for frequent calls on shorter texts or texts with matches near the beginning.

Implementation Details

The feature is implemented with the following key changes:

  1. Core Logic (src/trie.c):
    • A new function, trie_any_match(), contains the core search algorithm. It scans the input string and, after each character transition, checks the current state and its fail links for an end-of-word (eow) flag.
    • It returns true immediately upon finding the first match, ensuring the early-exit behavior.
  2. Python Binding (src/Automaton.c):
    • A corresponding wrapper, automaton_any_match(), is exposed to Python.
    • It handles argument parsing from the Python tuple and calls the underlying trie_any_match() function.
    • The automaton_methods array is updated to register the new method.
  3. Documentation (docs/ and src/inline_doc.h):
    • The new method is fully documented, explaining its purpose and performance characteristics.

Validation and Testing

To validate this feature, a new test class, TestAutomatonAnyMatch, has been added to tests/test_unit.py.

  • Correctness: A set of unit tests validates correctness for various scenarios, including matches at the start, middle, and end of the string, as well as overlapping matches and no-match cases.
  • Performance: Performance tests are susceptible to system jitter. The methodology for quick performance testing is as follows:
    • timeit is used to measure the execution time of the any_match() method (new dependency).
    • A _run_best_of_n helper function is used to get a stable performance signal by taking the minimum of several timeit runs.
    • Tests use long, synthetic strings with controlled keyword positions. This amplifies the performance difference between the algorithms, making the benefits of any_match()'s early-exit and low overhead clear and reliably measurable, even in noisy CI environments.
@halcyondude
Copy link
Author

halcyondude commented Jun 12, 2025

For future final furnishing: add a test helper to measure jitter as a data point, to help understand variability of baseline (typically CI) system to inform interactive interrogation and interpretation of results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment