Documentation ¶
Overview ¶
Package meta implements the meta-engine orchestrator that automatically selects the optimal regex execution strategy.
The meta-engine coordinates three engines:
- Prefilter: fast literal-based candidate finding (optional)
- Lazy DFA: deterministic finite automaton with on-demand state construction
- NFA (PikeVM): nondeterministic fallback for complex patterns or DFA cache overflow
Strategy selection is based on:
- Pattern complexity (NFA size, literal quality)
- Prefilter availability (good literals enable fast filtering)
- DFA suitability (patterns without alternations benefit most)
The meta-engine provides the public API for pattern compilation and matching, hiding the complexity of multi-engine coordination from users.
The meta package provides the core Engine type that orchestrates all regex execution strategies. It analyzes patterns, selects optimal strategies, and coordinates search across multiple engines (NFA, DFA, prefilters).
Architecture ¶
The meta-engine uses a multi-strategy approach:
- Pattern analysis extracts literals for prefiltering
- Strategy selection chooses the optimal execution path
- Engine orchestration coordinates NFA, DFA, and specialized searchers
Strategies ¶
Available strategies (selected automatically based on pattern):
- UseNFA: Direct NFA (PikeVM) search for small patterns
- UseDFA: Lazy DFA for patterns with good literals
- UseBoth: Adaptive DFA→NFA with cache eviction handling
- UseReverseAnchored: Reverse DFA for end-anchored patterns ($)
- UseReverseSuffix: Suffix literal + reverse DFA (1124x speedup)
- UseReverseInner: Bidirectional search from inner literal (909x)
- UseReverseSuffixSet: Teddy multi-suffix prefilter (260x)
- UseTeddy: SIMD multi-pattern for exact alternations (242x)
- UseAhoCorasick: Aho-Corasick for large alternations (113x)
- UseCharClassSearcher: Specialized char class searcher (23x)
- UseDigitPrefilter: SIMD digit prefilter for digit-lead patterns
- UseOnePass: OnePass DFA for anchored patterns with captures
- UseBoundedBacktracker: Bounded backtracker for char class patterns
- UseBranchDispatch: O(1) branch dispatch for anchored alternations
- UseCompositeSearcher: For concatenated char classes
- UseAnchoredLiteral: O(1) matching for ^prefix.*suffix$ patterns (32-133x)
Thread Safety ¶
The Engine type is safe for concurrent use from multiple goroutines. Per-search mutable state is managed via sync.Pool.
Usage ¶
engine, err := meta.Compile(`\w+@\w+\.\w+`)
if err != nil {
log.Fatal(err)
}
// Find first match
match := engine.Find([]byte("user@example.com"))
if match != nil {
fmt.Println(match.String())
}
// Check if matches
if engine.IsMatch([]byte("user@example.com")) {
fmt.Println("matches!")
}
// Find with captures
sub := engine.FindSubmatch([]byte("user@example.com"))
if sub != nil {
fmt.Println(sub.GroupString(1)) // first capture group
}
Files ¶
The meta package is organized into these files:
- engine.go: Engine struct definition and core API
- compile.go: Pattern compilation and engine builders
- find.go: Find methods returning *Match
- find_indices.go: FindIndices methods (zero-allocation)
- ismatch.go: IsMatch methods for boolean matching
- findall.go: FindAll*, Count, and FindSubmatch methods
- strategy.go: Strategy constants and selection logic
- config.go: Configuration options
- match.go: Match and MatchWithCaptures types
- search_state.go: Thread-safe state pooling
- anchored_literal.go: UseAnchoredLiteral implementation
- reverse_*.go: Reverse search implementations
Index ¶
- Variables
- func MatchAnchoredLiteral(input []byte, info *AnchoredLiteralInfo) bool
- func StrategyReason(strategy Strategy, n *nfa.NFA, literals *literal.Seq, config Config) string
- type AnchoredLiteralInfo
- type CompileError
- type Config
- type ConfigError
- type Engine
- func (e *Engine) Count(haystack []byte, n int) int
- func (e *Engine) Find(haystack []byte) *Match
- func (e *Engine) FindAllIndicesStreaming(haystack []byte, n int, results [][2]int) [][2]int
- func (e *Engine) FindAllSubmatch(haystack []byte, n int) []*MatchWithCaptures
- func (e *Engine) FindAt(haystack []byte, at int) *Match
- func (e *Engine) FindIndices(haystack []byte) (start, end int, found bool)
- func (e *Engine) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
- func (e *Engine) FindSubmatch(haystack []byte) *MatchWithCaptures
- func (e *Engine) FindSubmatchAt(haystack []byte, at int) *MatchWithCaptures
- func (e *Engine) IsMatch(haystack []byte) bool
- func (e *Engine) IsStartAnchored() bool
- func (e *Engine) NumCaptures() int
- func (e *Engine) ResetStats()
- func (e *Engine) SetLongest(longest bool)
- func (e *Engine) Stats() Stats
- func (e *Engine) Strategy() Strategy
- func (e *Engine) SubexpNames() []string
- type Match
- type MatchWithCaptures
- func (m *MatchWithCaptures) AllGroupStrings() []string
- func (m *MatchWithCaptures) AllGroups() [][]byte
- func (m *MatchWithCaptures) Bytes() []byte
- func (m *MatchWithCaptures) End() int
- func (m *MatchWithCaptures) Group(index int) []byte
- func (m *MatchWithCaptures) GroupIndex(index int) []int
- func (m *MatchWithCaptures) GroupString(index int) string
- func (m *MatchWithCaptures) NumCaptures() int
- func (m *MatchWithCaptures) Start() int
- func (m *MatchWithCaptures) String() string
- type MultilineReverseSuffixSearcher
- func (s *MultilineReverseSuffixSearcher) Find(haystack []byte) *Match
- func (s *MultilineReverseSuffixSearcher) FindAt(haystack []byte, at int) *Match
- func (s *MultilineReverseSuffixSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
- func (s *MultilineReverseSuffixSearcher) IsMatch(haystack []byte) bool
- type ReverseAnchoredSearcher
- type ReverseInnerSearcher
- type ReverseSuffixSearcher
- type ReverseSuffixSetSearcher
- func (s *ReverseSuffixSetSearcher) Find(haystack []byte) *Match
- func (s *ReverseSuffixSetSearcher) FindAt(haystack []byte, at int) *Match
- func (s *ReverseSuffixSetSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
- func (s *ReverseSuffixSetSearcher) IsMatch(haystack []byte) bool
- type SearchState
- type Stats
- type Strategy
Constants ¶
This section is empty.
Variables ¶
var ErrNoInnerLiterals = errors.New("no inner literals available for ReverseInner strategy")
ErrNoInnerLiterals indicates that no inner literals could be extracted for the ReverseInner strategy. This is not a fatal error; it just means the ReverseInner optimization cannot be used.
var ErrNoMultilinePrefilter = errors.New("no prefilter available for multiline suffix literals")
ErrNoMultilinePrefilter indicates that no prefilter could be built for multiline suffix literals.
var ErrNoPrefilter = errors.New("no prefilter available for suffix literals")
ErrNoPrefilter indicates that no prefilter could be built for suffix literals. This is not a fatal error; it just means the ReverseSuffix optimization cannot be used.
var ErrNoSuffixSet = errors.New("no suffix set prefilter available")
ErrNoSuffixSet indicates that no suffix set prefilter could be built.
Functions ¶
func MatchAnchoredLiteral ¶ added in v0.11.0
func MatchAnchoredLiteral(input []byte, info *AnchoredLiteralInfo) bool
MatchAnchoredLiteral performs fast O(prefix + suffix + charclass) matching. This is the runtime execution for UseAnchoredLiteral strategy.
Algorithm:
- Length check (O(1))
- Prefix check (O(len(prefix)))
- Suffix check (O(len(suffix)))
- Charclass bridge check (O(k) where k = distance to find match)
Returns true if input matches the pattern.
func StrategyReason ¶
StrategyReason provides a human-readable explanation for strategy selection.
This is useful for debugging and performance tuning.
Example:
strategy := meta.SelectStrategy(nfa, literals, config)
reason := meta.StrategyReason(strategy, nfa, literals, config)
log.Printf("Using %s: %s", strategy, reason)
Types ¶
type AnchoredLiteralInfo ¶ added in v0.11.0
type AnchoredLiteralInfo struct {
// Prefix is the required prefix literal (may be empty).
// For ^/.*\.php$, this is "/".
Prefix []byte
// Suffix is the required suffix literal (always non-empty).
// For ^/.*\.php$, this is ".php".
Suffix []byte
// CharClassTable is a 256-byte lookup table for the charclass bridge.
// For [\w-], table[c] is true for [A-Za-z0-9_-].
// nil if no charclass bridge required.
CharClassTable *[256]bool
// CharClassMin is the minimum count of charclass matches required.
// Usually 1 for charclass+.
CharClassMin int
// WildcardMin is 0 for .* or 1 for .+
WildcardMin int
// MinLength is the minimum input length for a possible match.
// Calculated as: len(Prefix) + WildcardMin + CharClassMin + len(Suffix)
MinLength int
}
AnchoredLiteralInfo contains extracted components for fast matching. This is used by the UseAnchoredLiteral strategy.
func DetectAnchoredLiteral ¶ added in v0.11.0
func DetectAnchoredLiteral(re *syntax.Regexp) *AnchoredLiteralInfo
DetectAnchoredLiteral analyzes a regex AST to detect patterns suitable for the UseAnchoredLiteral strategy.
Pattern structure (all variations):
^prefix.*charclass+suffix$  (full form)
^prefix.*suffix$            (no charclass bridge)
^.*suffix$                  (no prefix)
^prefix.+suffix$            (.+ instead of .*)
Requirements:
- Must be anchored at both start (^ or \A) and end ($ or \z)
- Must contain .* or .+ (greedy wildcard)
- Must have a literal suffix before end anchor
Returns nil if pattern doesn't match the required structure.
type CompileError ¶
CompileError represents a pattern compilation error.
func (*CompileError) Error ¶
func (e *CompileError) Error() string
Error implements the error interface. For syntax errors, returns the error directly to match stdlib behavior.
func (*CompileError) Unwrap ¶
func (e *CompileError) Unwrap() error
Unwrap returns the underlying error.
type Config ¶
type Config struct {
// EnableDFA enables the Lazy DFA engine.
// When false, only NFA (PikeVM) is used.
// Default: true
EnableDFA bool
// EnablePrefilter enables literal-based prefiltering.
// When false, no prefilter is used even if literals are available.
// Default: true
EnablePrefilter bool
// MaxDFAStates sets the maximum number of DFA states to cache.
// Larger values use more memory but reduce NFA fallback frequency.
// Default: 10000
MaxDFAStates uint32
// DeterminizationLimit caps the number of NFA states per DFA state.
// This prevents exponential blowup in patterns like (a*)*b.
// Default: 1000
DeterminizationLimit int
// MinLiteralLen is the minimum length for prefilter literals.
// Shorter literals may have too many false positives.
// Default: 2
MinLiteralLen int
// MaxLiterals limits the number of literals to extract for prefiltering.
// Must be > 64 to properly detect patterns that exceed Teddy's capacity.
// Default: 256 (allows detecting patterns with >64 literals for Aho-Corasick)
MaxLiterals int
// MaxRecursionDepth limits recursion during NFA compilation.
// Default: 100
MaxRecursionDepth int
// EnableASCIIOptimization enables ASCII runtime detection (V11-002 optimization).
// When true and the pattern contains '.', two NFAs are compiled:
// - UTF-8 NFA: handles all valid UTF-8 codepoints (~28 states per '.')
// - ASCII NFA: optimized for ASCII-only input (1-2 states per '.')
//
// At runtime, input is checked using SIMD (AVX2 on x86-64) to determine if
// all bytes are ASCII. If so, the faster ASCII NFA is used.
//
// Performance impact for Issue #79 pattern ^/.*[\w-]+\.php:
// - Compile time: ~1.5x longer (compiling two NFAs)
// - Match time: up to 1.6x faster on ASCII input
// - Memory: ~1.4x more (two NFAs stored)
//
// The check adds ~3-4ns overhead per search but saves significantly more
// on patterns with '.' when input is ASCII-only.
//
// Default: true
EnableASCIIOptimization bool
}
Config controls meta-engine behavior and performance characteristics.
Configuration options affect:
- Strategy selection (which engine to use)
- Cache sizes (DFA state cache)
- Limits (determinization, recursion)
- Prefilter enablement
Example:
config := meta.DefaultConfig()
config.EnableDFA = false // Force NFA-only execution
engine := meta.NewEngine(nfa, config)
func DefaultConfig ¶
func DefaultConfig() Config
DefaultConfig returns a configuration with sensible defaults.
Defaults are tuned for typical regex patterns with a balance between performance and memory usage:
- DFA enabled with 10K state cache (moderate memory usage)
- Prefilter enabled (5-50x speedup on patterns with literals)
- Conservative determinization limit (prevents exponential blowup)
- Reasonable recursion depth (handles nested patterns)
Example:
config := meta.DefaultConfig()
// Use as-is or customize specific options
config.MaxDFAStates = 50000 // Increase cache for better hit rate
func (Config) Validate ¶
Validate checks if the configuration is valid. Returns an error if any parameter is out of range.
Valid ranges:
- MaxDFAStates: 1 to 1,000,000
- DeterminizationLimit: 10 to 100,000
- MinLiteralLen: 1 to 64
- MaxLiterals: 1 to 1,000
- MaxRecursionDepth: 10 to 1,000
Example:
config := meta.Config{MaxDFAStates: 0} // Invalid!
if err := config.Validate(); err != nil {
log.Fatal(err)
}
type ConfigError ¶
ConfigError represents an invalid configuration parameter.
func (*ConfigError) Error ¶
func (e *ConfigError) Error() string
Error implements the error interface.
type Engine ¶
type Engine struct {
// contains filtered or unexported fields
}
Engine is the meta-engine that orchestrates all regex execution strategies.
The Engine:
- Analyzes the pattern and extracts literals
- Selects the optimal strategy (NFA, DFA, or both)
- Builds prefilter (if literals available)
- Coordinates search across engines
Thread safety: The Engine uses a sync.Pool internally to provide thread-safe concurrent access. Multiple goroutines can safely call search methods (Find, IsMatch, FindSubmatch, etc.) on the same Engine instance concurrently.
The underlying NFA, DFA, and prefilters are immutable after compilation. Per-search mutable state is managed via sync.Pool, following the Go stdlib regexp package pattern.
Example:
// Compile pattern (once)
engine, err := meta.Compile("(foo|bar)\\d+")
if err != nil {
return err
}
// Search (safe to call from multiple goroutines)
haystack := []byte("test foo123 end")
match := engine.Find(haystack)
if match != nil {
println(match.String()) // "foo123"
}
func Compile ¶
Compile compiles a regex pattern string into an executable Engine.
Steps:
- Parse pattern using regexp/syntax
- Compile to NFA
- Extract literals (prefixes, suffixes)
- Build prefilter (if good literals exist)
- Select strategy
- Build DFA (if strategy requires it)
Returns an error if:
- Pattern syntax is invalid
- Pattern is too complex (recursion limit exceeded)
- Configuration is invalid
Example:
engine, err := meta.Compile("hello.*world")
if err != nil {
log.Fatal(err)
}
func CompileRegexp ¶
CompileRegexp compiles a parsed syntax.Regexp with default configuration.
This is useful when you already have a parsed regexp from another source.
Example:
re, _ := syntax.Parse("hello", syntax.Perl)
engine, err := meta.CompileRegexp(re, meta.DefaultConfig())
func CompileWithConfig ¶
CompileWithConfig compiles a pattern with custom configuration.
Example:
config := meta.DefaultConfig()
config.MaxDFAStates = 50000 // Increase cache
engine, err := meta.CompileWithConfig("(a|b|c)*", config)
func (*Engine) Count ¶ added in v0.4.0
Count returns the number of non-overlapping matches in the haystack.
This is optimized for counting without allocating result slices. Uses early termination for boolean checks at each step. If n > 0, counts at most n matches. If n <= 0, counts all matches.
Example:
engine, _ := meta.Compile(`\d+`)
count := engine.Count([]byte("1 2 3 4 5"), -1)
// count == 5
func (*Engine) Find ¶
Find returns the first match in the haystack, or nil if no match.
The search algorithm depends on the selected strategy:
- UseNFA: PikeVM search directly
- UseDFA: Prefilter (if available) → DFA → NFA fallback
- UseBoth: Try DFA, fall back to NFA on cache full
Example:
engine, _ := meta.Compile("hello")
match := engine.Find([]byte("say hello world"))
if match != nil {
println(match.String()) // "hello"
}
func (*Engine) FindAllIndicesStreaming ¶ added in v0.9.4
FindAllIndicesStreaming returns all non-overlapping match indices using a streaming algorithm. For the CharClassSearcher strategy, this uses a single-pass state machine that is significantly faster than repeated FindIndicesAt calls (no per-match function call overhead).
Returns slice of [2]int{start, end} pairs. Limit n (0=no limit) restricts match count. The results slice is reused if provided (pass nil for fresh allocation).
This method is optimized for patterns like \w+, \d+, [a-z]+ where matches are frequent.
func (*Engine) FindAllSubmatch ¶ added in v0.4.0
func (e *Engine) FindAllSubmatch(haystack []byte, n int) []*MatchWithCaptures
FindAllSubmatch returns all successive matches with capture group information. If n > 0, returns at most n matches. If n <= 0, returns all matches.
Example:
engine, _ := meta.Compile(`(\w+)@(\w+)\.(\w+)`)
matches := engine.FindAllSubmatch([]byte("alice@example.com bob@test.org"), -1)
// len(matches) == 2
func (*Engine) FindAt ¶ added in v0.8.6
FindAt finds the first match starting from position 'at' in the haystack. Returns nil if no match is found.
This method is used by FindAll* operations to correctly handle anchors like ^. Unlike Find, it takes the FULL haystack and a starting position, so assertions like ^ correctly check against the original input start, not a sliced position.
Example:
engine, _ := meta.Compile("^test")
match := engine.FindAt([]byte("hello test"), 0) // matches at 0
match = engine.FindAt([]byte("hello test"), 6) // no match (^ won't match at pos 6)
func (*Engine) FindIndices ¶ added in v0.8.15
FindIndices returns the start and end indices of the first match. Returns (-1, -1, false) if no match is found.
This is a zero-allocation alternative to Find() - it returns indices directly instead of creating a Match object.
func (*Engine) FindIndicesAt ¶ added in v0.8.15
FindIndicesAt returns the start and end indices of the first match starting at position 'at'. Returns (-1, -1, false) if no match is found.
func (*Engine) FindSubmatch ¶ added in v0.2.0
func (e *Engine) FindSubmatch(haystack []byte) *MatchWithCaptures
FindSubmatch returns the first match with capture group information. Returns nil if no match is found.
Group 0 is always the entire match. Groups 1+ are explicit capture groups. Unmatched optional groups will have nil values.
When a one-pass DFA is available (for anchored patterns), this method is 10-20x faster than PikeVM for capture group extraction.
Example:
engine, _ := meta.Compile(`(\w+)@(\w+)\.(\w+)`)
match := engine.FindSubmatch([]byte("user@example.com"))
if match != nil {
fmt.Println(match.GroupString(0)) // "user@example.com"
fmt.Println(match.GroupString(1)) // "user"
fmt.Println(match.GroupString(2)) // "example"
fmt.Println(match.GroupString(3)) // "com"
}
func (*Engine) FindSubmatchAt ¶ added in v0.8.6
func (e *Engine) FindSubmatchAt(haystack []byte, at int) *MatchWithCaptures
FindSubmatchAt returns the first match with capture group information, starting from position 'at' in the haystack. Returns nil if no match is found.
This method is used by ReplaceAll* operations to correctly handle anchors like ^. Unlike FindSubmatch, it takes the FULL haystack and a starting position. Thread-safe: uses pooled state for both OnePass cache and PikeVM.
func (*Engine) IsMatch ¶
IsMatch returns true if the pattern matches anywhere in the haystack.
This is optimized for boolean matching:
- Uses early termination (returns immediately on first match)
- Avoids Match object creation
- Uses DFA.IsMatch when available (2-10x faster than Find)
Example:
engine, _ := meta.Compile("hello")
if engine.IsMatch([]byte("say hello world")) {
println("matches!")
}
func (*Engine) IsStartAnchored ¶ added in v0.10.8
IsStartAnchored returns true if the pattern is anchored at the start (^). Start-anchored patterns can only match at position 0.
func (*Engine) NumCaptures ¶ added in v0.2.0
NumCaptures returns the number of capture groups in the pattern. Group 0 is the entire match, groups 1+ are explicit captures.
func (*Engine) ResetStats ¶
func (e *Engine) ResetStats()
ResetStats resets execution statistics to zero.
func (*Engine) SetLongest ¶ added in v0.8.12
SetLongest enables or disables leftmost-longest (POSIX) matching semantics. By default, the engine uses leftmost-first (Perl) semantics where the first alternative in an alternation wins. With longest=true, the longest match wins.
This affects how alternations like `(a|ab)` match:
- longest=false (default): "a" wins (first branch)
- longest=true: "ab" wins (longest match)
func (*Engine) Stats ¶
Stats returns execution statistics.
Useful for performance analysis and debugging.
Example:
stats := engine.Stats()
println("NFA searches:", stats.NFASearches)
println("DFA searches:", stats.DFASearches)
func (*Engine) Strategy ¶
Strategy returns the execution strategy selected for this engine.
Example:
strategy := engine.Strategy()
println(strategy.String()) // "UseDFA"
func (*Engine) SubexpNames ¶ added in v0.5.0
SubexpNames returns the names of capture groups in the pattern. Index 0 is always "" (entire match). Named groups return their names, unnamed groups return "". This matches stdlib regexp.Regexp.SubexpNames() behavior.
type Match ¶
type Match struct {
// contains filtered or unexported fields
}
Match represents a successful regex match with position information.
A Match contains:
- Start position (inclusive)
- End position (exclusive)
- Reference to the original haystack
Note: This is a simple match without capture group support. For capture groups, use FindSubmatch, which returns a MatchWithCaptures.
Example:
match := &Match{start: 5, end: 11, haystack: []byte("test foo123 end")}
println(match.String()) // "foo123"
println(match.Start(), match.End()) // 5, 11
func NewMatch ¶
NewMatch creates a new Match from start and end positions.
Parameters:
- start: inclusive start position in haystack
- end: exclusive end position in haystack
- haystack: the original byte buffer that was searched
The haystack is stored by reference (not copied) for efficiency. Callers must ensure the haystack remains valid for the lifetime of the Match.
Example:
haystack := []byte("hello world")
match := meta.NewMatch(0, 5, haystack) // "hello"
func (*Match) Bytes ¶
Bytes returns the matched bytes as a slice.
The returned slice is a view into the original haystack (not a copy). Callers should copy the bytes if they need to retain them after the haystack is modified or deallocated.
Example:
match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(string(match.Bytes())) // "foo123"
func (*Match) Contains ¶
Contains returns true if the given position is within the match range.
Parameters:
- pos: position to check (must be >= 0)
Returns true if start <= pos < end.
Example:
match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Contains(7)) // true
println(match.Contains(11)) // false (exclusive end)
func (*Match) End ¶
End returns the exclusive end position of the match.
Example:
match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.End()) // 11
func (*Match) IsEmpty ¶
IsEmpty returns true if the match has zero length.
Empty matches can occur with patterns like "" or "(?:)" that match without consuming input.
Example:
match := meta.NewMatch(5, 5, []byte("test"))
println(match.IsEmpty()) // true
func (*Match) Len ¶
Len returns the length of the match in bytes.
Example:
match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Len()) // 6 (11 - 5)
func (*Match) Start ¶
Start returns the inclusive start position of the match.
Example:
match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Start()) // 5
type MatchWithCaptures ¶ added in v0.2.0
type MatchWithCaptures struct {
// contains filtered or unexported fields
}
MatchWithCaptures represents a match with capture group information. Group 0 is always the entire match, groups 1+ are explicit capture groups.
func NewMatchWithCaptures ¶ added in v0.2.0
func NewMatchWithCaptures(haystack []byte, captures [][]int) *MatchWithCaptures
NewMatchWithCaptures creates a MatchWithCaptures from capture positions.
func (*MatchWithCaptures) AllGroupStrings ¶ added in v0.2.0
func (m *MatchWithCaptures) AllGroupStrings() []string
AllGroupStrings returns all captured groups as strings.
func (*MatchWithCaptures) AllGroups ¶ added in v0.2.0
func (m *MatchWithCaptures) AllGroups() [][]byte
AllGroups returns all captured groups as byte slices. Result[0] is the entire match, result[i] may be nil if group i was not captured.
func (*MatchWithCaptures) Bytes ¶ added in v0.2.0
func (m *MatchWithCaptures) Bytes() []byte
Bytes returns the matched bytes for group 0 (entire match).
func (*MatchWithCaptures) End ¶ added in v0.2.0
func (m *MatchWithCaptures) End() int
End returns the end position of the entire match (group 0).
func (*MatchWithCaptures) Group ¶ added in v0.2.0
func (m *MatchWithCaptures) Group(index int) []byte
Group returns the captured text for the specified group. Group 0 is the entire match. Returns nil if group was not captured.
func (*MatchWithCaptures) GroupIndex ¶ added in v0.2.0
func (m *MatchWithCaptures) GroupIndex(index int) []int
GroupIndex returns [start, end] positions for the specified group. Returns nil if group was not captured.
func (*MatchWithCaptures) GroupString ¶ added in v0.2.0
func (m *MatchWithCaptures) GroupString(index int) string
GroupString returns the captured text as string for the specified group.
func (*MatchWithCaptures) NumCaptures ¶ added in v0.2.0
func (m *MatchWithCaptures) NumCaptures() int
NumCaptures returns the number of capture groups (including group 0).
func (*MatchWithCaptures) Start ¶ added in v0.2.0
func (m *MatchWithCaptures) Start() int
Start returns the start position of the entire match (group 0).
func (*MatchWithCaptures) String ¶ added in v0.2.0
func (m *MatchWithCaptures) String() string
String returns the matched text for group 0 (entire match).
type MultilineReverseSuffixSearcher ¶ added in v0.11.1
type MultilineReverseSuffixSearcher struct {
// contains filtered or unexported fields
}
MultilineReverseSuffixSearcher performs suffix literal prefilter + line-aware forward verification.
This strategy is used for multiline patterns like `(?m)^/.*[\w-]+\.php` where:
- The pattern has the multiline flag (?m)
- The pattern is anchored at start of LINE (^), not start of TEXT (\A)
- Has a good suffix literal for prefiltering
- Match can occur at ANY line start, not just position 0
Algorithm:
- Extract suffix literals from pattern
- Build prefilter for suffix literals
- Search algorithm:
  a. Prefilter finds suffix candidates in haystack
  b. For each candidate:
     - Scan backward to find LINE start (\n or start of input)
     - Verify pattern from line start using forward PikeVM
  c. Return first valid match
Key difference from ReverseSuffixSearcher:
- ReverseSuffix: match always starts at position 0 (unanchored .*)
- MultilineReverseSuffix: match starts at LINE start (after \n or pos 0)
Performance:
- Forward naive search: O(n*m) where n=haystack length, m=pattern length
- MultilineReverseSuffix: O(k*l) where k=suffix candidates, l=avg line length
- Expected speedup: 5-20x for patterns like `(?m)^/.*\.php` on large inputs
func NewMultilineReverseSuffixSearcher ¶ added in v0.11.1
func NewMultilineReverseSuffixSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	_ lazy.Config,
) (*MultilineReverseSuffixSearcher, error)
NewMultilineReverseSuffixSearcher creates a multiline-aware suffix searcher.
Requirements:
- Pattern must have good suffix literals
- Pattern must have multiline ^ anchor
- Prefilter must be available
Parameters:
- forwardNFA: the compiled forward NFA
- suffixLiterals: extracted suffix literals from pattern
- config: DFA configuration (unused, kept for API compatibility)
Returns error if multiline reverse suffix optimization cannot be applied.
func (*MultilineReverseSuffixSearcher) Find ¶ added in v0.11.1
func (s *MultilineReverseSuffixSearcher) Find(haystack []byte) *Match
Find searches using suffix literal prefilter + line-aware forward verification.
Algorithm:
- Iterate through suffix candidates using prefilter
- For each candidate, find LINE start (after \n or input start)
- Use forward PikeVM to verify match from line start
- Return first valid match
The key insight is that multiline ^ anchors must be verified using FORWARD matching from the line start, not reverse matching. The ^ anchor's semantics only work correctly in the forward direction.
func (*MultilineReverseSuffixSearcher) FindAt ¶ added in v0.11.1
func (s *MultilineReverseSuffixSearcher) FindAt(haystack []byte, at int) *Match
FindAt searches for a match starting from position 'at'.
Returns the first match starting at or after position 'at'. Essential for FindAll iteration.
func (*MultilineReverseSuffixSearcher) FindIndicesAt ¶ added in v0.11.1
func (s *MultilineReverseSuffixSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
FindIndicesAt returns match indices starting from position 'at' - zero allocation version.
func (*MultilineReverseSuffixSearcher) IsMatch ¶ added in v0.11.1
func (s *MultilineReverseSuffixSearcher) IsMatch(haystack []byte) bool
IsMatch checks if the pattern matches using suffix prefilter + line-aware verification.
Optimized for boolean matching:
- Uses prefilter for fast candidate finding
- Uses forward PikeVM for line-aware verification
- Early termination on first match
- No Match object allocation
type ReverseAnchoredSearcher ¶ added in v0.4.0
type ReverseAnchoredSearcher struct {
// contains filtered or unexported fields
}
ReverseAnchoredSearcher performs reverse search for patterns anchored at end.
This strategy is used for patterns like "abc$" or "pattern.*suffix$" where the pattern must match at the end of the haystack. Instead of trying to match from every position in the haystack (O(n) attempts), we search backward from the end of the haystack (O(1) attempt).
Algorithm:
- Build reverse NFA from forward NFA
- Build reverse DFA from reverse NFA
- Search backward from end of haystack using reverse DFA
- If match found, convert reverse positions to forward positions
Performance:
- Forward search (naive): O(n*m) where n=haystack length, m=pattern length
- Reverse search: O(m) - only one search attempt from the end
- Speedup: ~n/m (e.g., 1000x for 1MB haystack and 1KB pattern)
Example:
// Pattern "Easy1$" on 1MB data
// Forward: 340 seconds (tries match at every position)
// Reverse: ~1 millisecond (one match attempt from end)
func NewReverseAnchoredSearcher ¶ added in v0.4.0
func NewReverseAnchoredSearcher(forwardNFA *nfa.NFA, config lazy.Config) (*ReverseAnchoredSearcher, error)
NewReverseAnchoredSearcher creates a reverse searcher from forward NFA.
Parameters:
- forwardNFA: the compiled forward NFA
- config: DFA configuration for reverse DFA cache
Returns nil if reverse DFA cannot be built (falls back to forward search).
func (*ReverseAnchoredSearcher) Find ¶ added in v0.4.0
func (s *ReverseAnchoredSearcher) Find(haystack []byte) *Match
Find searches backward from end of haystack and returns the match.
Algorithm:
- Use reverse DFA SearchReverse to find match START (zero-allocation)
- For $-anchored patterns, END is always len(haystack)
Performance:
- ZERO-ALLOCATION: no byte reversal needed
- Single DFA scan: O(m) where m = match length
- Much faster than PikeVM + reverseBytes approach
Example:
Forward pattern:  "abc$"
Forward haystack: "xxxabc"
SearchReverse finds start=3, end=6 (because $ anchor)
func (*ReverseAnchoredSearcher) IsMatch ¶ added in v0.4.0
func (s *ReverseAnchoredSearcher) IsMatch(haystack []byte) bool
IsMatch checks if the pattern matches at the end of haystack.
This is optimized for boolean matching:
- Uses reverse DFA for fast rejection
- ZERO-ALLOCATION: backward scan without byte reversal
- No Match object allocation
- Early termination
type ReverseInnerSearcher ¶ added in v0.8.0
type ReverseInnerSearcher struct {
// contains filtered or unexported fields
}
ReverseInnerSearcher performs inner literal prefilter + bidirectional DFA search.
This strategy is used for patterns with inner literals like `prefix.*inner.*suffix` where:
- The pattern has a good inner literal for prefiltering
- Has wildcards/repetitions both BEFORE and AFTER the inner literal
- Can use bidirectional DFA to find exact match bounds
Algorithm:
- Extract inner literals from pattern
- Build prefilter for inner literals
- Search algorithm:
  a. Prefilter finds inner literal candidates in haystack
  b. For each candidate at position P:
     - Reverse DFA scans backward from P to find match START
     - Forward DFA scans forward from P+innerLen to find match END
  c. Return leftmost-longest match
Performance:
- Forward search: O(n*m) where n=haystack length, m=pattern length
- ReverseInner: O(k*(m1+m2)) where k=number of inner candidates, m1=prefix length, m2=suffix length
- Speedup: 10-100x for patterns like `ERROR.*connection.*timeout` on large haystacks
Example:
// Pattern `ERROR.*connection.*timeout` on 1MB log with 5 "connection" occurrences
// Forward: tries pattern match at every position (~1M attempts)
// ReverseInner: prefilter finds 5 "connection" positions, bidirectional DFA verifies (~5 attempts)
// Speedup: ~200,000x
func NewReverseInnerSearcher ¶ added in v0.8.0
func NewReverseInnerSearcher(
	fullNFA *nfa.NFA,
	innerInfo *literal.InnerLiteralInfo,
	config lazy.Config,
) (*ReverseInnerSearcher, error)
NewReverseInnerSearcher creates a reverse inner searcher using AST splitting.
The key optimization (from rust-regex):
- Build reverse NFA from PREFIX AST only (not full pattern)
- Build forward NFA from SUFFIX AST only (not full pattern)
- This enables true bidirectional search with 10-100x speedup
Requirements:
- InnerLiteralInfo must have PrefixAST and SuffixAST populated
- Inner literal must have wildcards both before and after
- Prefilter must be available
Parameters:
- fullNFA: the compiled NFA for the full pattern (for fallback)
- innerInfo: extracted inner literal info with split AST
- config: DFA configuration for reverse/forward DFA cache
Returns error if reverse inner optimization cannot be applied:
- ErrNoInnerLiterals: no inner literals available
- ErrNoPrefilter: prefilter could not be built
- DFA compilation errors
func (*ReverseInnerSearcher) Find ¶ added in v0.8.0
func (s *ReverseInnerSearcher) Find(haystack []byte) *Match
Find searches using inner literal prefilter + bidirectional DFA and returns the match.
Algorithm (leftmost-longest/greedy semantics):
- Use prefilter to find ALL inner literal candidates
- For each candidate at position P:
  a. Reverse DFA scans backward from P to find match START
  b. Forward DFA scans forward from P+innerLen to find match END
- Track leftmost-longest match:
  - Leftmost: earliest start position
  - Longest: if same start, choose longest end
- Return the best match found
Performance:
- ZERO PikeVM calls - uses DFA exclusively
- Bidirectional DFA scan finds both match start and end efficiently
- Prefilter reduces search space dramatically
Example (bidirectional matching):
Pattern: `ERROR.*connection.*timeout`
Haystack: "ERROR: connection lost due to connection timeout"
Inner literal: "connection"

1. Prefilter finds "connection" at position 7, then at position 32
2. Candidate 1 (pos=7):
   a. Reverse DFA from pos=7 backward → finds start=0 (ERROR)
   b. Forward DFA from pos=17 forward → finds end=24 (lost, no timeout)
   c. No valid match (pattern requires "timeout" after)
3. Candidate 2 (pos=32):
   a. Reverse DFA from pos=32 backward → finds start=0 (ERROR)
   b. Forward DFA from pos=42 forward → finds end=49 (timeout)
   c. Valid match [0:49]
4. Return [0:49] = "ERROR: connection lost due to connection timeout"
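The candidate loop can be sketched in plain Go. Here `bytes.LastIndex` and `bytes.Index` stand in for the reverse and forward DFA scans, and all function names are illustrative rather than this package's API, so the semantics only coincide with the real engine for literal prefix/suffix shapes:

```go
package main

import (
	"bytes"
	"fmt"
)

// findCandidates enumerates every occurrence of the inner literal,
// mirroring the prefilter step of the bidirectional search.
func findCandidates(haystack, inner []byte) []int {
	var out []int
	at := 0
	for {
		i := bytes.Index(haystack[at:], inner)
		if i < 0 {
			return out
		}
		out = append(out, at+i)
		at += i + 1
	}
}

// findLeftmostLongest sketches the candidate loop for a pattern shaped
// like `prefix.*inner.*suffix`, with literal searches standing in for
// the reverse and forward DFAs.
func findLeftmostLongest(haystack, prefix, inner, suffix []byte) (start, end int, ok bool) {
	for _, p := range findCandidates(haystack, inner) {
		s := bytes.LastIndex(haystack[:p], prefix) // "reverse DFA": backward from candidate
		if s < 0 {
			continue
		}
		rel := bytes.Index(haystack[p+len(inner):], suffix) // "forward DFA": forward from candidate
		if rel < 0 {
			continue
		}
		e := p + len(inner) + rel + len(suffix)
		// Leftmost-longest: earlier start wins; same start, longer end wins.
		if !ok || s < start || (s == start && e > end) {
			start, end, ok = s, e, true
		}
	}
	return
}

func main() {
	h := []byte("ERROR: connection lost due to connection timeout")
	s, e, ok := findLeftmostLongest(h, []byte("ERROR"), []byte("connection"), []byte("timeout"))
	fmt.Println(s, e, ok, string(h[s:e]))
}
```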
func (*ReverseInnerSearcher) FindIndicesAt ¶ added in v0.8.21
func (s *ReverseInnerSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
FindIndicesAt returns match indices starting from position 'at' - zero allocation version. This is used by FindAll* operations for efficient iteration.
func (*ReverseInnerSearcher) IsMatch ¶ added in v0.8.0
func (s *ReverseInnerSearcher) IsMatch(haystack []byte) bool
IsMatch checks if the pattern matches using inner prefilter + bidirectional DFA.
This is optimized for boolean matching:
- Uses prefilter for fast candidate finding
- Uses bidirectional DFA for fast verification
- No Match object allocation
- Early termination on first match
- ZERO PikeVM calls - DFA confirmation is sufficient
Algorithm:
- Prefilter finds inner literal candidates
- For each candidate:
  a. Reverse DFA checks if we can reach inner from start
  b. Forward DFA checks if we can reach end from inner
- Return true on first valid match
type ReverseSuffixSearcher ¶ added in v0.6.0
type ReverseSuffixSearcher struct {
// contains filtered or unexported fields
}
ReverseSuffixSearcher performs suffix literal prefilter + reverse DFA search.
This strategy is used for patterns with literal suffixes like `.*\.txt` where:
- The pattern is NOT anchored at start (^)
- Has a good suffix literal for prefiltering
- Can use reverse DFA to verify the prefix pattern
Algorithm:
- Extract suffix literals from pattern
- Build prefilter for suffix literals
- Search algorithm:
  a. Prefilter finds suffix candidates in haystack
  b. For each candidate:
     - Build reverse search from haystack start to suffix end
     - Use reverse DFA to verify prefix pattern
     - If match, use forward DFA to find full match end
  c. Return first match
Performance:
- Forward naive search: O(n*m) where n=haystack length, m=pattern length
- ReverseSuffix: O(k*m) where k=number of suffix candidates (usually k << n)
- Speedup: 10-100x for patterns like `.*\.txt` on large haystacks
Example:
// Pattern `.*\.txt` on 1MB data with 10 `.txt` occurrences
// Forward: tries pattern match at every position (~1M attempts)
// ReverseSuffix: prefilter finds 10 `.txt` positions, reverse DFA verifies (~10 attempts)
// Speedup: ~100,000x
func NewReverseSuffixSearcher ¶ added in v0.6.0
func NewReverseSuffixSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	config lazy.Config,
) (*ReverseSuffixSearcher, error)
NewReverseSuffixSearcher creates a reverse suffix searcher from forward NFA.
Requirements:
- Pattern must have good suffix literals
- Pattern must NOT be start-anchored (^)
- Prefilter must be available
Parameters:
- forwardNFA: the compiled forward NFA
- suffixLiterals: extracted suffix literals from pattern
- config: DFA configuration for reverse DFA cache
Returns nil if reverse suffix optimization cannot be applied.
func (*ReverseSuffixSearcher) Find ¶ added in v0.6.0
func (s *ReverseSuffixSearcher) Find(haystack []byte) *Match
Find searches using suffix literal prefilter + reverse DFA and returns the match.
Algorithm (find LAST suffix for greedy semantics):
- Use prefilter to find the LAST suffix literal candidate
- Use reverse DFA to find match START (leftmost)
- Return match immediately (no forward scan needed!)
Why find LAST suffix?
- Pattern `.*\.txt` is greedy - `.*` matches as much as possible
- For input "a.txt.txt", the greedy match is the ENTIRE string [0:9]
- Finding the LAST `.txt` (at position 5) and reverse scanning gives us this
- No expensive forward DFA scan needed!
Performance:
- Single prefilter scan to find last suffix: O(n)
- Single reverse DFA scan: O(n)
- Total: O(n) with small constant
Example:
Pattern: `.*\.txt`
Haystack: "a.txt.txt"
Suffix literal: `.txt`

1. Prefilter finds LAST `.txt` at position 5
2. Reverse DFA from [0,9] finds match start = 0
3. Return [0:9] = "a.txt.txt" (greedy!)
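For the special case of an unanchored `.*suffix` pattern, the last-suffix trick collapses to a single `bytes.LastIndex`. A minimal sketch (illustrative only; the real engine still verifies the prefix with a reverse DFA, which matters for prefixes more complex than `.*`):

```go
package main

import (
	"bytes"
	"fmt"
)

// greedyDotStarSuffix sketches the ReverseSuffix idea for `.*suffix`:
// the greedy match spans from position 0 to the end of the LAST suffix
// occurrence. bytes.LastIndex stands in for the prefilter; reverse DFA
// verification is trivially true here because `.*` matches anything.
func greedyDotStarSuffix(haystack, suffix []byte) (start, end int, ok bool) {
	i := bytes.LastIndex(haystack, suffix)
	if i < 0 {
		return 0, 0, false
	}
	return 0, i + len(suffix), true
}

func main() {
	s, e, ok := greedyDotStarSuffix([]byte("a.txt.txt"), []byte(".txt"))
	fmt.Println(s, e, ok) // the greedy match is the whole string
}
```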
func (*ReverseSuffixSearcher) FindAt ¶ added in v0.8.19
func (s *ReverseSuffixSearcher) FindAt(haystack []byte, at int) *Match
FindAt searches for a match starting from position 'at' using suffix prefilter + reverse DFA.
Unlike Find(), which returns the greedy (last-suffix) match, FindAt returns the first match starting at or after position 'at'. This is essential for FindAll iteration.
Algorithm:
- Use prefilter to find FIRST suffix candidate >= at
- Use reverse DFA to find match START (from 'at' position)
- Return match [start, suffixEnd]
Performance:
- Prefilter scan: O(n) from 'at' position
- Reverse DFA verification: O(m) where m is match length
- Total: O(n) per call
func (*ReverseSuffixSearcher) FindIndicesAt ¶ added in v0.8.19
func (s *ReverseSuffixSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
FindIndicesAt returns match indices starting from position 'at' - zero allocation version.
func (*ReverseSuffixSearcher) IsMatch ¶ added in v0.6.0
func (s *ReverseSuffixSearcher) IsMatch(haystack []byte) bool
IsMatch checks if the pattern matches using suffix prefilter + reverse DFA.
This is optimized for boolean matching:
- Uses prefilter for fast candidate finding
- Uses reverse DFA for fast prefix verification
- No Match object allocation
- Early termination on first match
- ZERO PikeVM calls - reverse DFA confirmation is sufficient
type ReverseSuffixSetSearcher ¶ added in v0.8.20
type ReverseSuffixSetSearcher struct {
// contains filtered or unexported fields
}
ReverseSuffixSetSearcher performs Teddy multi-suffix prefilter + reverse DFA search.
This strategy handles patterns like `.*\.(txt|log|md)` where:
- The suffix is an alternation with no common suffix (LCS is empty)
- Multiple suffix literals are available (2-8 literals, each >= 3 bytes)
- Teddy can efficiently search for any of the suffixes
This optimization is NOT present in rust-regex (they fall back to Core strategy).
Algorithm:
- Build Teddy prefilter from all suffix literals
- Search algorithm:
  a. Teddy finds any suffix literal in haystack
  b. Use reverse DFA to verify prefix pattern
  c. For `.*` prefix patterns, match starts at position 0 (skip reverse scan)
  d. Return match
Performance:
- Without this optimization: O(n*m) using UseBoth strategy
- With Teddy suffix prefilter: O(n + k*m) where k = suffix candidates
- Speedup: 5-10x for patterns like `.*\.(txt|log|md)`
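The suffix-set step can be sketched with a per-literal loop in place of Teddy's SIMD scan (illustrative only; function names are not part of this package's API):

```go
package main

import (
	"bytes"
	"fmt"
)

// lastAnySuffix sketches the suffix-set step for `.*\.(txt|log|md)`:
// find the rightmost end of ANY suffix literal. The real engine uses a
// Teddy SIMD prefilter; a per-literal bytes.LastIndex loop stands in.
func lastAnySuffix(haystack []byte, suffixes [][]byte) (end int, ok bool) {
	end = -1
	for _, s := range suffixes {
		if i := bytes.LastIndex(haystack, s); i >= 0 && i+len(s) > end {
			end = i + len(s)
		}
	}
	return end, end >= 0
}

func main() {
	h := []byte("readme.md data.txt app.log")
	end, ok := lastAnySuffix(h, [][]byte{[]byte(".txt"), []byte(".log"), []byte(".md")})
	fmt.Println(end, ok) // a greedy `.*` match would be h[0:end]
}
```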
func NewReverseSuffixSetSearcher ¶ added in v0.8.20
func NewReverseSuffixSetSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	config lazy.Config,
) (*ReverseSuffixSetSearcher, error)
NewReverseSuffixSetSearcher creates a reverse suffix set searcher.
Requirements:
- Pattern must have 2-8 suffix literals
- Each suffix literal must be >= 2 bytes (allows extensions like ".md")
- Pattern must NOT be start-anchored (^)
Returns nil if the optimization cannot be applied.
func (*ReverseSuffixSetSearcher) Find ¶ added in v0.8.20
func (s *ReverseSuffixSetSearcher) Find(haystack []byte) *Match
Find searches using Teddy suffix prefilter + reverse DFA.
For greedy matching (like `.*`), we need to find the LAST matching suffix. However, with multiple suffix lengths, we iterate through all candidates and track the best (rightmost) match.
func (*ReverseSuffixSetSearcher) FindAt ¶ added in v0.8.20
func (s *ReverseSuffixSetSearcher) FindAt(haystack []byte, at int) *Match
FindAt searches for a match starting from position 'at'.
func (*ReverseSuffixSetSearcher) FindIndicesAt ¶ added in v0.8.20
func (s *ReverseSuffixSetSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)
FindIndicesAt returns match indices - zero allocation version.
func (*ReverseSuffixSetSearcher) IsMatch ¶ added in v0.8.20
func (s *ReverseSuffixSetSearcher) IsMatch(haystack []byte) bool
IsMatch checks if the pattern matches using suffix set prefilter.
type SearchState ¶ added in v0.10.4
type SearchState struct {
// contains filtered or unexported fields
}
SearchState holds per-search mutable state for thread-safe concurrent searches. This struct should be obtained from a sync.Pool to enable safe concurrent usage of the same compiled Engine from multiple goroutines.
Usage pattern:
state := engine.getSearchState()
defer engine.putSearchState(state)
// use state for search operations
Thread safety: Each goroutine must use its own SearchState instance. The SearchState itself is NOT thread-safe - it must not be shared between goroutines.
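The intended usage maps naturally onto sync.Pool. A sketch with illustrative type and field names (the package's actual unexported state differs):

```go
package main

import (
	"fmt"
	"sync"
)

// searchState is a stand-in for per-search scratch buffers.
type searchState struct {
	visited []uint64
}

// statePool hands each concurrent search its own scratch state.
var statePool = sync.Pool{
	New: func() any { return &searchState{visited: make([]uint64, 64)} },
}

// isMatch sketches how each call borrows private scratch state, so the
// same compiled engine can be shared by many goroutines concurrently.
func isMatch(haystack []byte) bool {
	st := statePool.Get().(*searchState)
	defer statePool.Put(st)
	// ... the search would use st.visited here ...
	return len(haystack) > 0
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); _ = isMatch([]byte("abc")) }()
	}
	wg.Wait()
	fmt.Println(isMatch([]byte("abc")))
}
```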
type Stats ¶
type Stats struct {
// NFASearches counts NFA (PikeVM) searches
NFASearches uint64
// DFASearches counts DFA searches
DFASearches uint64
// OnePassSearches counts OnePass DFA searches (for FindSubmatch)
OnePassSearches uint64
// AhoCorasickSearches counts Aho-Corasick automaton searches
AhoCorasickSearches uint64
// PrefilterHits counts successful prefilter matches
PrefilterHits uint64
// PrefilterMisses counts prefilter candidates that didn't match
PrefilterMisses uint64
// PrefilterAbandoned counts times prefilter was abandoned due to high FP rate
PrefilterAbandoned uint64
// DFACacheFull counts times DFA fell back to NFA due to cache full
DFACacheFull uint64
}
Stats tracks execution statistics for performance analysis.
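As one illustration of how these counters can drive decisions, here is a sketch of an adaptive check behind PrefilterAbandoned; the 90% threshold and the 100-sample minimum are assumptions for illustration, not the package's actual values:

```go
package main

import "fmt"

// shouldAbandonPrefilter sketches the idea behind PrefilterAbandoned:
// when most prefilter candidates fail verification, the prefilter
// costs more than it saves, so searches stop using it.
func shouldAbandonPrefilter(hits, misses uint64) bool {
	total := hits + misses
	if total < 100 {
		return false // not enough evidence yet (assumed minimum)
	}
	return misses*10 > total*9 // false-positive rate above 90% (assumed threshold)
}

func main() {
	fmt.Println(shouldAbandonPrefilter(5, 95), shouldAbandonPrefilter(50, 50))
}
```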
type Strategy ¶
type Strategy int
Strategy represents the execution strategy for regex matching.
The meta-engine chooses between:
- UseNFA: use PikeVM exclusively (simple patterns, no cache needed)
- UseDFA: use Lazy DFA with NFA fallback (complex patterns, good literals)
- UseBoth: adaptive strategy (try DFA first, fallback to NFA on cache full)
Strategy selection is automatic based on pattern analysis.
const (
	// UseNFA uses only the NFA (PikeVM) engine.
	// Selected for:
	//   - Very small NFAs (< 20 states) where DFA overhead isn't worth it
	//   - Patterns without literals where DFA has no advantage
	//   - When EnableDFA is false in config
	UseNFA Strategy = iota

	// UseDFA uses Lazy DFA with NFA fallback on cache overflow.
	// Selected for:
	//   - Large NFAs (> 100 states) where DFA is essential
	//   - Patterns with good literals (prefilter + DFA is fastest)
	//   - Simple patterns (no alternations) where DFA doesn't blow up
	UseDFA

	// UseBoth uses adaptive strategy: try DFA, fallback to NFA if cache fills.
	// Selected for:
	//   - Medium-sized NFAs (20-100 states)
	//   - Patterns with some literals but complex structure
	//   - Default when pattern characteristics are unclear
	UseBoth

	// UseReverseAnchored uses reverse DFA search for patterns anchored at end.
	// Selected for:
	//   - Patterns with $ or \z anchor (end of text)
	//   - NOT also anchored at start (^)
	//   - Searches backward from end of haystack
	//   - Converts O(n*m) to O(m) for end-anchored patterns
	UseReverseAnchored

	// UseReverseSuffix uses suffix literal prefilter + reverse DFA search.
	// Selected for:
	//   - Patterns with literal suffix (e.g., `.*\.txt`)
	//   - NOT start-anchored (^)
	//   - Has good suffix literal for prefiltering
	//   - Speedup: 10-100x for patterns like `.*\.txt`
	UseReverseSuffix

	// UseOnePass uses one-pass DFA for anchored patterns with capture groups.
	// Selected for:
	//   - Pattern is always anchored (^ or implicit anchor)
	//   - Pattern is "one-pass" (no ambiguity in matching paths)
	//   - Pattern has capture groups (otherwise lazy DFA is faster)
	//   - Speedup: 10-20x over PikeVM for capture group extraction
	//   - Only used for FindSubmatch, not Find
	UseOnePass

	// UseReverseInner uses inner literal prefilter + bidirectional DFA search.
	// Selected for:
	//   - Patterns with inner literal (e.g., `prefix.*inner.*suffix`)
	//   - NOT start-anchored (^) or end-anchored ($)
	//   - Has good inner literal for prefiltering
	//   - NO good prefix or suffix literals (otherwise prefer UseDFA/UseReverseSuffix)
	//   - Has wildcards both before AND after inner literal
	//   - Speedup: 10-100x for patterns like `ERROR.*connection.*timeout`
	UseReverseInner

	// UseBoundedBacktracker uses bounded backtracking with bit-vector visited tracking.
	// Selected for:
	//   - Simple character class patterns (\d+, \w+, [a-z]+) without literals
	//   - Small enough input (states * inputLen <= threshold)
	//   - No prefilter benefit (no extractable literals)
	//   - Speedup: 2-4x over PikeVM for character class patterns
	UseBoundedBacktracker

	// UseTeddy uses Teddy multi-pattern prefilter directly without DFA.
	// Selected for:
	//   - Exact literal alternations like (foo|bar|baz)
	//   - All literals are complete (no regex engine verification needed)
	//   - 2-32 patterns, each >= 3 bytes
	//   - Speedup: 50-250x over PikeVM by skipping all DFA/NFA overhead
	//
	// This implements the "literal engine bypass" optimization from Rust regex:
	// when patterns are exact literals, the prefilter IS the engine.
	UseTeddy

	// UseReverseSuffixSet uses Teddy multi-pattern prefilter for suffix alternations.
	// Selected for:
	//   - Patterns like `.*\.(txt|log|md)` where suffix is an alternation
	//   - No common suffix (LCS is empty), but multiple suffix literals available
	//   - 2-32 suffix literals, each >= 3 bytes
	//   - Speedup: 5-10x over UseBoth by using Teddy for suffix candidates
	//
	// Algorithm:
	//  1. Teddy finds any of the suffix literals (e.g., ".txt", ".log", ".md")
	//  2. Reverse DFA scan from suffix position to find match start
	//  3. For `.*` prefix patterns, match starts at position 0 (skip reverse scan)
	//
	// This is an optimization NOT present in rust-regex (they fall back to Core).
	UseReverseSuffixSet

	// UseCharClassSearcher uses specialized lookup-table searcher for simple char_class+ patterns.
	// Selected for:
	//   - Patterns like `[\w]+`, `[a-z]+`, `\d+` (simple repeated character class)
	//   - NOT concatenations (those use BoundedBacktracker)
	//   - NOT patterns with capture groups
	//   - Speedup: 14-22x over stdlib, 14-17x over BoundedBacktracker
	//
	// Uses 256-byte membership table for O(1) byte classification instead of
	// NFA state tracking. Optimal for "find all words" type patterns.
	UseCharClassSearcher

	// UseCompositeSearcher uses sequential lookup tables for concatenated char class patterns.
	// Selected for:
	//   - Patterns like [a-zA-Z]+[0-9]+, \d+\s+\w+, [a-z]+[A-Z]+
	//   - Concatenation of 2+ quantified character classes
	//   - No anchors, captures, or alternations within
	//   - Speedup: 5-6x over BoundedBacktracker by using O(1) lookup tables
	//
	// Algorithm:
	//  1. Each char class part has [256]bool membership table
	//  2. Greedy matching: consume max chars for each part
	//  3. Backtrack if min requirement not met
	//
	// Reference: https://github.com/coregx/coregex/issues/72
	UseCompositeSearcher

	// UseBranchDispatch uses O(1) first-byte dispatch for anchored alternations.
	// Selected for:
	//   - Start-anchored patterns like ^(\d+|UUID|hex32)
	//   - Each alternation branch has distinct first bytes (no overlap)
	//   - Speedup: 2-3x on match, 10x+ on no-match by avoiding branch iteration
	//
	// Algorithm:
	//  1. Build [256]int8 dispatch table: first_byte → branch_index
	//  2. On search: dispatch[haystack[0]] gives branch to try
	//  3. Only execute that single branch instead of all branches
	//
	// Reference: https://github.com/coregx/coregex/issues/79
	UseBranchDispatch

	// UseDigitPrefilter uses SIMD digit scanning for patterns that must start with digits.
	// Selected for:
	//   - Patterns where ALL alternation branches must start with a digit [0-9]
	//   - Examples: IP address patterns, numeric validators
	//   - Pattern has no extractable prefix literals (due to alternation structure)
	//   - Speedup: 5-10x by skipping non-digit regions with SIMD
	//
	// Algorithm:
	//  1. SIMD scan haystack for digit sequences
	//  2. At each digit position, run lazy DFA to verify match
	//  3. Skip non-digit regions entirely (major speedup for sparse matches)
	//
	// This addresses Issue #50 (IP regex optimization) where alternations like
	// `25[0-5]|2[0-4][0-9]|...` produce no extractable prefix literals.
	UseDigitPrefilter

	// UseAhoCorasick uses Aho-Corasick automaton for large literal alternations.
	// Selected for:
	//   - Exact literal alternations with >32 patterns (beyond Teddy's limit)
	//   - All literals are complete (no regex engine verification needed)
	//   - Each pattern >= 1 byte
	//   - Speedup: 50-500x over PikeVM by using O(n) multi-pattern matching
	//
	// Uses github.com/coregx/ahocorasick library with:
	//   - Dense array transitions for O(1) state lookup
	//   - Byte class compression for memory efficiency
	//   - ~1.6 GB/s throughput
	//
	// This extends the "literal engine bypass" optimization for large pattern sets
	// where Teddy's SIMD approach becomes impractical.
	UseAhoCorasick

	// UseAnchoredLiteral uses specialized O(1) matching for anchored prefix.*suffix patterns.
	// Selected for:
	//   - Patterns matching: ^prefix.*[charclass+]suffix$
	//   - Both start (^) and end ($) anchored
	//   - Contains .* or .+ wildcard
	//   - Has literal suffix (e.g., ".php", ".txt")
	//   - Optional prefix literal and charclass bridge
	//   - Speedup: 50-90x over stdlib by avoiding NFA/DFA entirely
	//
	// Algorithm (all O(1) or O(k) operations):
	//  1. O(1) length check (MinLength)
	//  2. O(k) prefix check (bytes.HasPrefix equivalent)
	//  3. O(k) suffix check (bytes.HasSuffix equivalent)
	//  4. O(m) charclass bridge verification (if required)
	//
	// Examples:
	//   - ^/.*[\w-]+\.php$ → prefix="/", suffix=".php", charclass=[\w-]
	//   - ^.*\.txt$ → no prefix, suffix=".txt", no charclass
	//   - ^api/v1/.*\.json$ → prefix="api/v1/", suffix=".json"
	//
	// This is a specialized "literal engine bypass" for URL/path matching patterns
	// that are extremely common in web applications and routing tables.
	// Reference: https://github.com/coregx/coregex/issues/79
	UseAnchoredLiteral

	// UseMultilineReverseSuffix uses line-aware suffix search for multiline patterns.
	// Selected for:
	//   - Patterns with (?m) multiline flag AND line-start anchor (^)
	//   - Pattern has good suffix literal (e.g., `(?m)^/.*\.php`)
	//   - The ^ anchor matches at LINE start, not just input start
	//   - Speedup: 5-20x over stdlib by avoiding per-position forward NFA
	//
	// Algorithm (from Rust regex rebar benchmarks):
	//  1. Find suffix literal candidates using prefilter
	//  2. For each candidate, scan backward to find LINE start (\n or pos 0)
	//  3. Verify prefix pattern from line start to suffix
	//  4. Return first valid match
	//
	// Key difference from UseReverseSuffix:
	//   - ReverseSuffix: unanchored (.*suffix), match starts at pos 0
	//   - MultilineReverseSuffix: line-anchored ((?m)^.*suffix), match starts at line start
	//
	// This fixes Issue #97 where `(?m)^/.*[\w-]+\.php` was 24% SLOWER than stdlib
	// because UseReverseSuffix assumed match always starts at position 0.
	// Reference: https://github.com/coregx/coregex/issues/97
	UseMultilineReverseSuffix
)
func SelectStrategy ¶
SelectStrategy analyzes the NFA and literals to choose the best execution strategy.
Algorithm:
- If end-anchored ($ or \z) and not start-anchored → UseReverseAnchored
- If DFA disabled in config → UseNFA
- If NFA is tiny (< 20 states) → UseNFA (DFA overhead not worth it)
- If simple character class pattern without literals → UseNFA (DFA overhead not worth it)
- If good literals exist → UseDFA (prefilter + DFA is fastest)
- If NFA is large (> 100 states) → UseDFA (essential for performance)
- Otherwise → UseBoth (adaptive)
"Good literals" means:
- At least one literal exists
- Longest common prefix (LCP) length >= MinLiteralLen
- This enables effective prefiltering
Parameters:
- n: the compiled NFA to analyze
- re: the parsed regexp (for anchor detection, can be nil)
- literals: extracted prefix literals (can be nil)
- config: meta-engine configuration
Example:
strategy := meta.SelectStrategy(nfa, re, literals, config)
switch strategy {
case meta.UseNFA:
// Use PikeVM only
case meta.UseDFA:
// Use Lazy DFA
case meta.UseReverseAnchored:
// Use reverse search
case meta.UseBoth:
// Adaptive
}
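The "good literals" test described above can be sketched as a longest-common-prefix computation. Function and parameter names here are illustrative, not this package's API:

```go
package main

import "fmt"

// longestCommonPrefix computes the LCP of all extracted literals.
func longestCommonPrefix(lits [][]byte) []byte {
	if len(lits) == 0 {
		return nil
	}
	lcp := lits[0]
	for _, l := range lits[1:] {
		n := 0
		for n < len(lcp) && n < len(l) && lcp[n] == l[n] {
			n++
		}
		lcp = lcp[:n]
	}
	return lcp
}

// hasGoodLiterals mirrors the rule above: at least one literal exists
// and the LCP meets a minimum length (minLiteralLen is illustrative).
func hasGoodLiterals(lits [][]byte, minLiteralLen int) bool {
	return len(lits) > 0 && len(longestCommonPrefix(lits)) >= minLiteralLen
}

func main() {
	lits := [][]byte{[]byte("foobar"), []byte("foobaz")}
	fmt.Println(string(longestCommonPrefix(lits)), hasGoodLiterals(lits, 3))
}
```

When the LCP is long enough, a memchr-style scan for that prefix can discard most of the haystack before any automaton runs, which is why it gates the UseDFA choice.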