This pull request introduces a new method, Automaton.any_match(string)
, to provide a high-performance interface for checking if any keyword exists within a given string.
Currently, the most direct way to perform an existence check is next(A.iter(text), None) is not None
. While functional, this pattern incurs overhead from the creation and state management of a Python-level iterator object. For high-throughput applications where only the presence of a match is needed, this overhead can be a bottleneck.
The any_match()
method is a stateless, C-level function that implements a true early-exit search. It avoids Python object allocation for iterators or results, returning a singleton Py_True
or Py_False
. This provides a significant performance gain, especially in scenarios where matches are found early in the input text.
Let N be the length of the input string (the haystack) and M be the length of the shortest keyword that matches.
- Time Complexity: O(p) where p is the end position of the first found match. In the best case, p is approximately M.
- Space Complexity: O(1) auxiliary space.
- Practical Overhead: This approach has a higher constant factor overhead due to:
- Heap allocation and initialization of a stateful AutomatonSearchIter Python object.
- The cost of yielding a Python tuple (end_index, value) from C back to the Python layer.
- Time Complexity: O(p), identical to the iterator approach in the best case (p ~ M) and worst case (p = N).
- Space Complexity: O(1) auxiliary space (stack only).
- Practical Overhead: The constant factor overhead is substantially lower. any_match() is a direct, stateless C function call that avoids:
- Heap allocation of an iterator object.
- The creation of any result tuples. It returns Python's boolean singletons directly.
The primary benefit of any_match()
is the reduction in these constant factors and the guarantee of a minimal-overhead early exit, making it measurably faster in practice, especially for frequent calls on shorter texts or texts with matches near the beginning.
The feature is implemented with the following key changes:
- Core Logic (src/trie.c):
- A new function,
trie_any_match()
, contains the core search algorithm. It scans the input string and, after each character transition, checks the current state and its fail links for an end-of-word (eow) flag. - It returns true immediately upon finding the first match, ensuring the early-exit behavior.
- A new function,
- Python Binding (src/Automaton.c):
- A corresponding wrapper,
automaton_any_match()
, is exposed to Python. - It handles argument parsing from the Python tuple and calls the underlying
trie_any_match()
function. - The automaton_methods array is updated to register the new method.
- A corresponding wrapper,
- Documentation (docs/ and src/inline_doc.h):
- The new method is fully documented, explaining its purpose and performance characteristics.
To validate this feature, a new test class, TestAutomatonAnyMatch, has been added to tests/test_unit.py.
- Correctness: A set of unit tests validates correctness for various scenarios, including matches at the start, middle, and end of the string, as well as overlapping matches and no-match cases.
- Performance: Performance tests are susceptible to system jitter. The methodology for quick performance testing is as follows:
timeit
is used to measure the execution time of theany_match()
method (new dependency).- A
_run_best_of_n
helper function is used to get a stable performance signal by taking the minimum of several timeit runs. - Tests use long, synthetic strings with controlled keyword positions. This amplifies the performance difference between the algorithms, making the benefits of
any_match()
's early-exit and low overhead clear and reliably measurable, even in noisy CI environments.
For future final furnishing: add a test helper to measure jitter as a data point, to help understand variability of baseline (typically CI) system to inform interactive interrogation and interpretation of results.