Documentation

Overview

Package meta implements the meta-engine orchestrator that automatically selects the optimal regex execution strategy.

The meta-engine coordinates three engines:

  • Prefilter: fast literal-based candidate finding (optional)
  • Lazy DFA: deterministic finite automaton with on-demand state construction
  • NFA (PikeVM): nondeterministic fallback for complex patterns or DFA cache overflow

Strategy selection is based on:

  • Pattern complexity (NFA size, literal quality)
  • Prefilter availability (good literals enable fast filtering)
  • DFA suitability (patterns without alternations benefit most)

The meta-engine provides the public API for pattern compilation and matching, hiding the complexity of multi-engine coordination from users.

The meta package provides the core Engine type that orchestrates all regex execution strategies. It analyzes patterns, selects optimal strategies, and coordinates search across multiple engines (NFA, DFA, prefilters).

Architecture

The meta-engine uses a multi-strategy approach:

  • Pattern analysis extracts literals for prefiltering
  • Strategy selection chooses the optimal execution path
  • Engine orchestration coordinates NFA, DFA, and specialized searchers

Strategies

Available strategies (selected automatically based on pattern):

  • UseNFA: Direct NFA (PikeVM) search for small patterns
  • UseDFA: Lazy DFA for patterns with good literals
  • UseBoth: Adaptive DFA→NFA with cache eviction handling
  • UseReverseAnchored: Reverse DFA for end-anchored patterns ($)
  • UseReverseSuffix: Suffix literal + reverse DFA (1124x speedup)
  • UseReverseInner: Bidirectional search from inner literal (909x)
  • UseReverseSuffixSet: Teddy multi-suffix prefilter (260x)
  • UseTeddy: SIMD multi-pattern for exact alternations (242x)
  • UseAhoCorasick: Aho-Corasick for large alternations (113x)
  • UseCharClassSearcher: Specialized char class searcher (23x)
  • UseDigitPrefilter: SIMD digit prefilter for digit-lead patterns
  • UseOnePass: OnePass DFA for anchored patterns with captures
  • UseBoundedBacktracker: Bounded backtracker for char class patterns
  • UseBranchDispatch: O(1) branch dispatch for anchored alternations
  • UseCompositeSearcher: For concatenated char classes
  • UseAnchoredLiteral: O(1) matching for ^prefix.*suffix$ patterns (32-133x)

Thread Safety

The Engine type is safe for concurrent use from multiple goroutines. Per-search mutable state is managed via sync.Pool.
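
Example (illustrative sketch of concurrent use; the goroutine scaffolding is not part of the API):

engine, err := meta.Compile(`\d+`)
if err != nil {
    log.Fatal(err)
}

var wg sync.WaitGroup
for i := 0; i < 4; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        // Safe: the compiled Engine is immutable; per-search state comes from a sync.Pool.
        _ = engine.IsMatch([]byte("order 12345"))
    }()
}
wg.Wait()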

Usage

engine, err := meta.Compile(`\w+@\w+\.\w+`)
if err != nil {
    log.Fatal(err)
}

// Find first match
match := engine.Find([]byte("user@example.com"))
if match != nil {
    fmt.Println(match.String())
}

// Check if matches
if engine.IsMatch([]byte("user@example.com")) {
    fmt.Println("matches!")
}

// Find with captures
sub := engine.FindSubmatch([]byte("user@example.com"))
if sub != nil {
    fmt.Println(sub.Group(1)) // first capture group: "user"
}

Files

The meta package is organized into these files:

  • engine.go: Engine struct definition and core API
  • compile.go: Pattern compilation and engine builders
  • find.go: Find methods returning *Match
  • find_indices.go: FindIndices methods (zero-allocation)
  • ismatch.go: IsMatch methods for boolean matching
  • findall.go: FindAll*, Count, and FindSubmatch methods
  • strategy.go: Strategy constants and selection logic
  • config.go: Configuration options
  • match.go: Match and MatchWithCaptures types
  • search_state.go: Thread-safe state pooling
  • anchored_literal.go: UseAnchoredLiteral implementation
  • reverse_*.go: Reverse search implementations


Constants

This section is empty.

Variables

View Source
var ErrNoInnerLiterals = errors.New("no inner literals available for ReverseInner strategy")

ErrNoInnerLiterals indicates that no inner literals could be extracted for ReverseInner strategy. This is not a fatal error - it just means ReverseInner optimization cannot be used.

View Source
var ErrNoMultilinePrefilter = errors.New("no prefilter available for multiline suffix literals")

ErrNoMultilinePrefilter indicates that no prefilter could be built for multiline suffix literals.

View Source
var ErrNoPrefilter = errors.New("no prefilter available for suffix literals")

ErrNoPrefilter indicates that no prefilter could be built for suffix literals. This is not a fatal error - it just means ReverseSuffix optimization cannot be used.

View Source
var ErrNoSuffixSet = errors.New("no suffix set prefilter available")

ErrNoSuffixSet indicates that no suffix set prefilter could be built.

Functions

func MatchAnchoredLiteral added in v0.11.0

func MatchAnchoredLiteral(input []byte, info *AnchoredLiteralInfo) bool

MatchAnchoredLiteral performs fast O(prefix + suffix + charclass) matching. This is the runtime execution for UseAnchoredLiteral strategy.

Algorithm:

  1. Length check (O(1))
  2. Prefix check (O(len(prefix)))
  3. Suffix check (O(len(suffix)))
  4. Charclass bridge check (O(k) where k = distance to find match)

Returns true if input matches the pattern.

func StrategyReason

func StrategyReason(strategy Strategy, n *nfa.NFA, literals *literal.Seq, config Config) string

StrategyReason provides a human-readable explanation for strategy selection.

This is useful for debugging and performance tuning.

Example:

strategy := meta.SelectStrategy(nfa, literals, config)
reason := meta.StrategyReason(strategy, nfa, literals, config)
log.Printf("Using %s: %s", strategy, reason)

Types

type AnchoredLiteralInfo added in v0.11.0

type AnchoredLiteralInfo struct {
	// Prefix is the required prefix literal (may be empty).
	// For ^/.*\.php$, this is "/".
	Prefix []byte

	// Suffix is the required suffix literal (always non-empty).
	// For ^/.*\.php$, this is ".php".
	Suffix []byte

	// CharClassTable is a 256-byte lookup table for the charclass bridge.
	// For [\w-], table[c] is true for [A-Za-z0-9_-].
	// nil if no charclass bridge required.
	CharClassTable *[256]bool

	// CharClassMin is the minimum count of charclass matches required.
	// Usually 1 for charclass+.
	CharClassMin int

	// WildcardMin is 0 for .* or 1 for .+
	WildcardMin int

	// MinLength is the minimum input length for a possible match.
	// Calculated as: len(Prefix) + WildcardMin + CharClassMin + len(Suffix)
	MinLength int
}

AnchoredLiteralInfo contains extracted components for fast matching. This is used by the UseAnchoredLiteral strategy.

func DetectAnchoredLiteral added in v0.11.0

func DetectAnchoredLiteral(re *syntax.Regexp) *AnchoredLiteralInfo

DetectAnchoredLiteral analyzes a regex AST to detect patterns suitable for the UseAnchoredLiteral strategy.

Pattern structure (all variations):

^prefix.*charclass+suffix$   (full form)
^prefix.*suffix$             (no charclass bridge)
^.*suffix$                   (no prefix)
^prefix.+suffix$             (.+ instead of .*)

Requirements:

  • Must be anchored at both start (^ or \A) and end ($ or \z)
  • Must contain .* or .+ (greedy wildcard)
  • Must have a literal suffix before end anchor

Returns nil if pattern doesn't match the required structure.
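
Example (illustrative sketch; assumes the pattern below is a shape the detector accepts; DetectAnchoredLiteral returns nil for unsupported structures):

re, err := syntax.Parse(`^/.*[\w-]+\.php$`, syntax.Perl)
if err != nil {
    log.Fatal(err)
}
if info := meta.DetectAnchoredLiteral(re); info != nil {
    fmt.Println(meta.MatchAnchoredLiteral([]byte("/index.php"), info))  // true
    fmt.Println(meta.MatchAnchoredLiteral([]byte("/index.html"), info)) // false (suffix is not ".php")
}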

type CompileError

type CompileError struct {
	Pattern string
	Err     error
}

CompileError represents a pattern compilation error.

func (*CompileError) Error

func (e *CompileError) Error() string

Error implements the error interface. For syntax errors, returns the error directly to match stdlib behavior.

func (*CompileError) Unwrap

func (e *CompileError) Unwrap() error

Unwrap returns the underlying error.
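
Example (illustrative sketch; assumes Compile wraps failures in *CompileError, which errors.As can then unwrap):

_, err := meta.Compile("(unclosed")
var cerr *meta.CompileError
if errors.As(err, &cerr) {
    fmt.Println("pattern:", cerr.Pattern)
    fmt.Println("cause:", cerr.Err)
}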

type Config

type Config struct {
	// EnableDFA enables the Lazy DFA engine.
	// When false, only NFA (PikeVM) is used.
	// Default: true
	EnableDFA bool

	// EnablePrefilter enables literal-based prefiltering.
	// When false, no prefilter is used even if literals are available.
	// Default: true
	EnablePrefilter bool

	// MaxDFAStates sets the maximum number of DFA states to cache.
	// Larger values use more memory but reduce NFA fallback frequency.
	// Default: 10000
	MaxDFAStates uint32

	// DeterminizationLimit caps the number of NFA states per DFA state.
	// This prevents exponential blowup in patterns like (a*)*b.
	// Default: 1000
	DeterminizationLimit int

	// MinLiteralLen is the minimum length for prefilter literals.
	// Shorter literals may have too many false positives.
	// Default: 2
	MinLiteralLen int

	// MaxLiterals limits the number of literals to extract for prefiltering.
	// Must be > 64 to properly detect patterns that exceed Teddy's capacity.
	// Default: 256 (allows detecting patterns with >64 literals for Aho-Corasick)
	MaxLiterals int

	// MaxRecursionDepth limits recursion during NFA compilation.
	// Default: 100
	MaxRecursionDepth int

	// EnableASCIIOptimization enables ASCII runtime detection (V11-002 optimization).
	// When true and the pattern contains '.', two NFAs are compiled:
	//   - UTF-8 NFA: handles all valid UTF-8 codepoints (~28 states per '.')
	//   - ASCII NFA: optimized for ASCII-only input (1-2 states per '.')
	//
	// At runtime, input is checked using SIMD (AVX2 on x86-64) to determine if
	// all bytes are ASCII. If so, the faster ASCII NFA is used.
	//
	// Performance impact for Issue #79 pattern ^/.*[\w-]+\.php:
	//   - Compile time: ~1.5x longer (compiling two NFAs)
	//   - Match time: up to 1.6x faster on ASCII input
	//   - Memory: ~1.4x more (two NFAs stored)
	//
	// The check adds ~3-4ns overhead per search but saves significantly more
	// on patterns with '.' when input is ASCII-only.
	//
	// Default: true
	EnableASCIIOptimization bool
}

Config controls meta-engine behavior and performance characteristics.

Configuration options affect:

  • Strategy selection (which engine to use)
  • Cache sizes (DFA state cache)
  • Limits (determinization, recursion)
  • Prefilter enablement

Example:

config := meta.DefaultConfig()
config.EnableDFA = false // Force NFA-only execution
engine := meta.NewEngine(nfa, config)

func DefaultConfig

func DefaultConfig() Config

DefaultConfig returns a configuration with sensible defaults.

Defaults are tuned for typical regex patterns with a balance between performance and memory usage:

  • DFA enabled with 10K state cache (moderate memory usage)
  • Prefilter enabled (5-50x speedup on patterns with literals)
  • Conservative determinization limit (prevents exponential blowup)
  • Reasonable recursion depth (handles nested patterns)

Example:

config := meta.DefaultConfig()
// Use as-is or customize specific options
config.MaxDFAStates = 50000 // Increase cache for better hit rate

func (Config) Validate

func (c Config) Validate() error

Validate checks if the configuration is valid. Returns an error if any parameter is out of range.

Valid ranges:

  • MaxDFAStates: 1 to 1,000,000
  • DeterminizationLimit: 10 to 100,000
  • MinLiteralLen: 1 to 64
  • MaxLiterals: 1 to 1,000
  • MaxRecursionDepth: 10 to 1,000

Example:

config := meta.Config{MaxDFAStates: 0} // Invalid!
if err := config.Validate(); err != nil {
    log.Fatal(err)
}

type ConfigError

type ConfigError struct {
	Field   string
	Message string
}

ConfigError represents an invalid configuration parameter.

func (*ConfigError) Error

func (e *ConfigError) Error() string

Error implements the error interface.
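
Example (illustrative sketch; assumes Validate reports violations as *ConfigError):

config := meta.Config{MaxDFAStates: 0} // out of range
var cfgErr *meta.ConfigError
if errors.As(config.Validate(), &cfgErr) {
    fmt.Println("invalid field:", cfgErr.Field, "-", cfgErr.Message)
}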

type Engine

type Engine struct {
	// contains filtered or unexported fields
}

Engine is the meta-engine that orchestrates all regex execution strategies.

The Engine:

  1. Analyzes the pattern and extracts literals
  2. Selects the optimal strategy (NFA, DFA, or both)
  3. Builds prefilter (if literals available)
  4. Coordinates search across engines

Thread safety: The Engine uses a sync.Pool internally to provide thread-safe concurrent access. Multiple goroutines can safely call search methods (Find, IsMatch, FindSubmatch, etc.) on the same Engine instance concurrently.

The underlying NFA, DFA, and prefilters are immutable after compilation. Per-search mutable state is managed via sync.Pool, following the Go stdlib regexp package pattern.

Example:

// Compile pattern (once)
engine, err := meta.Compile("(foo|bar)\\d+")
if err != nil {
    return err
}

// Search (safe to call from multiple goroutines)
haystack := []byte("test foo123 end")
match := engine.Find(haystack)
if match != nil {
    println(match.String()) // "foo123"
}

func Compile

func Compile(pattern string) (*Engine, error)

Compile compiles a regex pattern string into an executable Engine.

Steps:

  1. Parse pattern using regexp/syntax
  2. Compile to NFA
  3. Extract literals (prefixes, suffixes)
  4. Build prefilter (if good literals exist)
  5. Select strategy
  6. Build DFA (if strategy requires it)

Returns an error if:

  • Pattern syntax is invalid
  • Pattern is too complex (recursion limit exceeded)
  • Configuration is invalid

Example:

engine, err := meta.Compile("hello.*world")
if err != nil {
    log.Fatal(err)
}

func CompileRegexp

func CompileRegexp(re *syntax.Regexp, config Config) (*Engine, error)

CompileRegexp compiles a parsed syntax.Regexp with default configuration.

This is useful when you already have a parsed regexp from another source.

Example:

re, _ := syntax.Parse("hello", syntax.Perl)
engine, err := meta.CompileRegexp(re, meta.DefaultConfig())

func CompileWithConfig

func CompileWithConfig(pattern string, config Config) (*Engine, error)

CompileWithConfig compiles a pattern with custom configuration.

Example:

config := meta.DefaultConfig()
config.MaxDFAStates = 50000 // Increase cache
engine, err := meta.CompileWithConfig("(a|b|c)*", config)

func (*Engine) Count added in v0.4.0

func (e *Engine) Count(haystack []byte, n int) int

Count returns the number of non-overlapping matches in the haystack.

This is optimized for counting without allocating result slices. Uses early termination for boolean checks at each step. If n > 0, counts at most n matches. If n <= 0, counts all matches.

Example:

engine, _ := meta.Compile(`\d+`)
count := engine.Count([]byte("1 2 3 4 5"), -1)
// count == 5

func (*Engine) Find

func (e *Engine) Find(haystack []byte) *Match

Find returns the first match in the haystack, or nil if no match.

The search algorithm depends on the selected strategy:

UseNFA:   PikeVM search directly
UseDFA:   Prefilter (if available) → DFA → NFA fallback
UseBoth:  Try DFA, fallback to NFA on cache full

Example:

engine, _ := meta.Compile("hello")
match := engine.Find([]byte("say hello world"))
if match != nil {
    println(match.String()) // "hello"
}

func (*Engine) FindAllIndicesStreaming added in v0.9.4

func (e *Engine) FindAllIndicesStreaming(haystack []byte, n int, results [][2]int) [][2]int

FindAllIndicesStreaming returns all non-overlapping match indices using a streaming algorithm. For the CharClassSearcher strategy, this uses a single-pass state machine, which is significantly faster than repeated FindIndicesAt calls (no per-match function call overhead).

Returns slice of [2]int{start, end} pairs. Limit n (0=no limit) restricts match count. The results slice is reused if provided (pass nil for fresh allocation).

This method is optimized for patterns like \w+, \d+, [a-z]+ where matches are frequent.
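
Example (illustrative sketch; the inputs and expected spans are for illustration only):

engine, _ := meta.Compile(`\w+`)
spans := engine.FindAllIndicesStreaming([]byte("alpha beta gamma"), 0, nil)
for _, sp := range spans {
    fmt.Println(sp[0], sp[1]) // [0 5], [6 10], [11 16]
}
// Reuse the backing array on the next call to avoid reallocation:
spans = engine.FindAllIndicesStreaming([]byte("delta epsilon"), 0, spans[:0])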

func (*Engine) FindAllSubmatch added in v0.4.0

func (e *Engine) FindAllSubmatch(haystack []byte, n int) []*MatchWithCaptures

FindAllSubmatch returns all successive matches with capture group information. If n > 0, returns at most n matches. If n <= 0, returns all matches.

Example:

engine, _ := meta.Compile(`(\w+)@(\w+)\.(\w+)`)
matches := engine.FindAllSubmatch([]byte("alice@example.com bob@test.org"), -1)
// len(matches) == 2

func (*Engine) FindAt added in v0.8.6

func (e *Engine) FindAt(haystack []byte, at int) *Match

FindAt finds the first match starting from position 'at' in the haystack. Returns nil if no match is found.

This method is used by FindAll* operations to correctly handle anchors like ^. Unlike Find, it takes the FULL haystack and a starting position, so assertions like ^ correctly check against the original input start, not a sliced position.

Example:

engine, _ := meta.Compile("^test")
match := engine.FindAt([]byte("test hello"), 0) // matches "test" at position 0
match = engine.FindAt([]byte("hello test"), 6)  // nil (^ only matches at the original input start)

func (*Engine) FindIndices added in v0.8.15

func (e *Engine) FindIndices(haystack []byte) (start, end int, found bool)

FindIndices returns the start and end indices of the first match. Returns (-1, -1, false) if no match is found.

This is a zero-allocation alternative to Find() - it returns indices directly instead of creating a Match object.
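
Example (illustrative):

engine, _ := meta.Compile("hello")
start, end, found := engine.FindIndices([]byte("say hello world"))
if found {
    fmt.Println(start, end) // 4 9
}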

func (*Engine) FindIndicesAt added in v0.8.15

func (e *Engine) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)

FindIndicesAt returns the start and end indices of the first match starting at position 'at'. Returns (-1, -1, false) if no match is found.

func (*Engine) FindSubmatch added in v0.2.0

func (e *Engine) FindSubmatch(haystack []byte) *MatchWithCaptures

FindSubmatch returns the first match with capture group information. Returns nil if no match is found.

Group 0 is always the entire match. Groups 1+ are explicit capture groups. Unmatched optional groups will have nil values.

When a one-pass DFA is available (for anchored patterns), this method is 10-20x faster than PikeVM for capture group extraction.

Example:

engine, _ := meta.Compile(`(\w+)@(\w+)\.(\w+)`)
match := engine.FindSubmatch([]byte("user@example.com"))
if match != nil {
    fmt.Println(match.Group(0)) // "user@example.com"
    fmt.Println(match.Group(1)) // "user"
    fmt.Println(match.Group(2)) // "example"
    fmt.Println(match.Group(3)) // "com"
}

func (*Engine) FindSubmatchAt added in v0.8.6

func (e *Engine) FindSubmatchAt(haystack []byte, at int) *MatchWithCaptures

FindSubmatchAt returns the first match with capture group information, starting from position 'at' in the haystack. Returns nil if no match is found.

This method is used by ReplaceAll* operations to correctly handle anchors like ^. Unlike FindSubmatch, it takes the FULL haystack and a starting position. Thread-safe: uses pooled state for both OnePass cache and PikeVM.
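
Example (illustrative sketch, mirroring the FindAt example above):

engine, _ := meta.Compile(`^(\w+)`)
if m := engine.FindSubmatchAt([]byte("hello world"), 0); m != nil {
    fmt.Println(m.GroupString(1)) // "hello"
}
fmt.Println(engine.FindSubmatchAt([]byte("hello world"), 6) == nil) // true: ^ only matches at the original input start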

func (*Engine) IsMatch

func (e *Engine) IsMatch(haystack []byte) bool

IsMatch returns true if the pattern matches anywhere in the haystack.

This is optimized for boolean matching:

  • Uses early termination (returns immediately on first match)
  • Avoids Match object creation
  • Uses DFA.IsMatch when available (2-10x faster than Find)

Example:

engine, _ := meta.Compile("hello")
if engine.IsMatch([]byte("say hello world")) {
    println("matches!")
}

func (*Engine) IsStartAnchored added in v0.10.8

func (e *Engine) IsStartAnchored() bool

IsStartAnchored returns true if the pattern is anchored at the start (^). Start-anchored patterns can only match at position 0.

func (*Engine) NumCaptures added in v0.2.0

func (e *Engine) NumCaptures() int

NumCaptures returns the number of capture groups in the pattern. Group 0 is the entire match, groups 1+ are explicit captures.

func (*Engine) ResetStats

func (e *Engine) ResetStats()

ResetStats resets execution statistics to zero.

func (*Engine) SetLongest added in v0.8.12

func (e *Engine) SetLongest(longest bool)

SetLongest enables or disables leftmost-longest (POSIX) matching semantics. By default, the engine uses leftmost-first (Perl) semantics where the first alternative in an alternation wins. With longest=true, the longest match wins.

This affects how alternations like `(a|ab)` match:

  • longest=false (default): "a" wins (first branch)
  • longest=true: "ab" wins (longest match)
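
Example (illustrative sketch of the two semantics on the pattern above):

engine, _ := meta.Compile(`(a|ab)`)
fmt.Println(engine.Find([]byte("ab")).String()) // "a" (leftmost-first)

engine.SetLongest(true)
fmt.Println(engine.Find([]byte("ab")).String()) // "ab" (leftmost-longest)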

func (*Engine) Stats

func (e *Engine) Stats() Stats

Stats returns execution statistics.

Useful for performance analysis and debugging.

Example:

stats := engine.Stats()
println("NFA searches:", stats.NFASearches)
println("DFA searches:", stats.DFASearches)

func (*Engine) Strategy

func (e *Engine) Strategy() Strategy

Strategy returns the execution strategy selected for this engine.

Example:

strategy := engine.Strategy()
println(strategy.String()) // "UseDFA"

func (*Engine) SubexpNames added in v0.5.0

func (e *Engine) SubexpNames() []string

SubexpNames returns the names of capture groups in the pattern. Index 0 is always "" (entire match). Named groups return their names, unnamed groups return "". This matches stdlib regexp.Regexp.SubexpNames() behavior.
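
Example (illustrative sketch; assumes named groups use the stdlib (?P<name>...) syntax):

engine, _ := meta.Compile(`(?P<user>\w+)@(?P<domain>\w+)`)
fmt.Printf("%q\n", engine.SubexpNames()) // ["" "user" "domain"]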

type Match

type Match struct {
	// contains filtered or unexported fields
}

Match represents a successful regex match with position information.

A Match contains:

  • Start position (inclusive)
  • End position (exclusive)
  • Reference to the original haystack

Note: Match does not carry capture group information. Use FindSubmatch, which returns a MatchWithCaptures, when capture groups are needed.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.String()) // "foo123"
println(match.Start(), match.End()) // 5, 11

func NewMatch

func NewMatch(start, end int, haystack []byte) *Match

NewMatch creates a new Match from start and end positions.

Parameters:

  • start: inclusive start position in haystack
  • end: exclusive end position in haystack
  • haystack: the original byte buffer that was searched

The haystack is stored by reference (not copied) for efficiency. Callers must ensure the haystack remains valid for the lifetime of the Match.

Example:

haystack := []byte("hello world")
match := meta.NewMatch(0, 5, haystack) // "hello"

func (*Match) Bytes

func (m *Match) Bytes() []byte

Bytes returns the matched bytes as a slice.

The returned slice is a view into the original haystack (not a copy). Callers should copy the bytes if they need to retain them after the haystack is modified or deallocated.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(string(match.Bytes())) // "foo123"

func (*Match) Contains

func (m *Match) Contains(pos int) bool

Contains returns true if the given position is within the match range.

Parameters:

  • pos: position to check (must be >= 0)

Returns true if start <= pos < end.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Contains(7))  // true
println(match.Contains(11)) // false (exclusive end)

func (*Match) End

func (m *Match) End() int

End returns the exclusive end position of the match.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.End()) // 11

func (*Match) IsEmpty

func (m *Match) IsEmpty() bool

IsEmpty returns true if the match has zero length.

Empty matches can occur with patterns like "" or "(?:)" that match without consuming input.

Example:

match := meta.NewMatch(5, 5, []byte("test"))
println(match.IsEmpty()) // true

func (*Match) Len

func (m *Match) Len() int

Len returns the length of the match in bytes.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Len()) // 6 (11 - 5)

func (*Match) Start

func (m *Match) Start() int

Start returns the inclusive start position of the match.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.Start()) // 5

func (*Match) String

func (m *Match) String() string

String returns the matched text as a string.

This allocates a new string by copying the matched bytes. For zero-allocation access, use Bytes() instead.

Example:

match := meta.NewMatch(5, 11, []byte("test foo123 end"))
println(match.String()) // "foo123"

type MatchWithCaptures added in v0.2.0

type MatchWithCaptures struct {
	// contains filtered or unexported fields
}

MatchWithCaptures represents a match with capture group information. Group 0 is always the entire match, groups 1+ are explicit capture groups.
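
Example (illustrative sketch of reading groups):

engine, _ := meta.Compile(`(\w+)=(\d+)`)
if m := engine.FindSubmatch([]byte("count=42")); m != nil {
    fmt.Println(m.NumCaptures())  // 3 (group 0 plus two captures)
    fmt.Println(m.GroupString(1)) // "count"
    fmt.Println(m.GroupString(2)) // "42"
}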

func NewMatchWithCaptures added in v0.2.0

func NewMatchWithCaptures(haystack []byte, captures [][]int) *MatchWithCaptures

NewMatchWithCaptures creates a MatchWithCaptures from capture positions.

func (*MatchWithCaptures) AllGroupStrings added in v0.2.0

func (m *MatchWithCaptures) AllGroupStrings() []string

AllGroupStrings returns all captured groups as strings.

func (*MatchWithCaptures) AllGroups added in v0.2.0

func (m *MatchWithCaptures) AllGroups() [][]byte

AllGroups returns all captured groups as byte slices. Result[0] is the entire match, result[i] may be nil if group i was not captured.

func (*MatchWithCaptures) Bytes added in v0.2.0

func (m *MatchWithCaptures) Bytes() []byte

Bytes returns the matched bytes for group 0 (entire match).

func (*MatchWithCaptures) End added in v0.2.0

func (m *MatchWithCaptures) End() int

End returns the end position of the entire match (group 0).

func (*MatchWithCaptures) Group added in v0.2.0

func (m *MatchWithCaptures) Group(index int) []byte

Group returns the captured text for the specified group. Group 0 is the entire match. Returns nil if group was not captured.

func (*MatchWithCaptures) GroupIndex added in v0.2.0

func (m *MatchWithCaptures) GroupIndex(index int) []int

GroupIndex returns [start, end] positions for the specified group. Returns nil if group was not captured.

func (*MatchWithCaptures) GroupString added in v0.2.0

func (m *MatchWithCaptures) GroupString(index int) string

GroupString returns the captured text as string for the specified group.

func (*MatchWithCaptures) NumCaptures added in v0.2.0

func (m *MatchWithCaptures) NumCaptures() int

NumCaptures returns the number of capture groups (including group 0).

func (*MatchWithCaptures) Start added in v0.2.0

func (m *MatchWithCaptures) Start() int

Start returns the start position of the entire match (group 0).

func (*MatchWithCaptures) String added in v0.2.0

func (m *MatchWithCaptures) String() string

String returns the matched text for group 0 (entire match).

type MultilineReverseSuffixSearcher added in v0.11.1

type MultilineReverseSuffixSearcher struct {
	// contains filtered or unexported fields
}

MultilineReverseSuffixSearcher performs suffix literal prefilter + line-aware forward verification.

This strategy is used for multiline patterns like `(?m)^/.*[\w-]+\.php` where:

  • The pattern has the multiline flag (?m)
  • The pattern is anchored at start of LINE (^), not start of TEXT (\A)
  • Has a good suffix literal for prefiltering
  • Match can occur at ANY line start, not just position 0

Algorithm:

  1. Extract suffix literals from pattern
  2. Build prefilter for suffix literals
  3. Search:
     a. Prefilter finds suffix candidates in the haystack
     b. For each candidate, scan backward to find the LINE start (\n or start of input), then verify the pattern from the line start using the forward PikeVM
     c. Return the first valid match

Key difference from ReverseSuffixSearcher:

  • ReverseSuffix: match always starts at position 0 (unanchored .*)
  • MultilineReverseSuffix: match starts at LINE start (after \n or pos 0)

Performance:

  • Forward naive search: O(n*m) where n=haystack length, m=pattern length
  • MultilineReverseSuffix: O(k*l) where k=suffix candidates, l=avg line length
  • Expected speedup: 5-20x for patterns like `(?m)^/.*\.php` on large inputs

func NewMultilineReverseSuffixSearcher added in v0.11.1

func NewMultilineReverseSuffixSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	_ lazy.Config,
) (*MultilineReverseSuffixSearcher, error)

NewMultilineReverseSuffixSearcher creates a multiline-aware suffix searcher.

Requirements:

  • Pattern must have good suffix literals
  • Pattern must have multiline ^ anchor
  • Prefilter must be available

Parameters:

  • forwardNFA: the compiled forward NFA
  • suffixLiterals: extracted suffix literals from pattern
  • config: DFA configuration (unused, kept for API compatibility)

Returns error if multiline reverse suffix optimization cannot be applied.

func (*MultilineReverseSuffixSearcher) Find added in v0.11.1

func (s *MultilineReverseSuffixSearcher) Find(haystack []byte) *Match

Find searches using suffix literal prefilter + line-aware forward verification.

Algorithm:

  1. Iterate through suffix candidates using prefilter
  2. For each candidate, find LINE start (after \n or input start)
  3. Use forward PikeVM to verify match from line start
  4. Return first valid match

The key insight is that multiline ^ anchors must be verified using FORWARD matching from the line start, not reverse matching; the anchor's semantics only hold when evaluated in the forward direction.
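
The line-start scan can be sketched as follows (a hypothetical helper for illustration; it is not part of the package API):

// lineStart returns the index just after the previous '\n' before pos,
// or 0 if pos is on the first line.
func lineStart(haystack []byte, pos int) int {
    if i := bytes.LastIndexByte(haystack[:pos], '\n'); i >= 0 {
        return i + 1
    }
    return 0
}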

func (*MultilineReverseSuffixSearcher) FindAt added in v0.11.1

func (s *MultilineReverseSuffixSearcher) FindAt(haystack []byte, at int) *Match

FindAt searches for a match starting from position 'at'.

Returns the first match starting at or after position 'at'. Essential for FindAll iteration.

func (*MultilineReverseSuffixSearcher) FindIndicesAt added in v0.11.1

func (s *MultilineReverseSuffixSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)

FindIndicesAt returns match indices starting from position 'at' - zero allocation version.

func (*MultilineReverseSuffixSearcher) IsMatch added in v0.11.1

func (s *MultilineReverseSuffixSearcher) IsMatch(haystack []byte) bool

IsMatch checks if the pattern matches using suffix prefilter + line-aware verification.

Optimized for boolean matching:

  • Uses prefilter for fast candidate finding
  • Uses forward PikeVM for line-aware verification
  • Early termination on first match
  • No Match object allocation

type ReverseAnchoredSearcher added in v0.4.0

type ReverseAnchoredSearcher struct {
	// contains filtered or unexported fields
}

ReverseAnchoredSearcher performs reverse search for patterns anchored at end.

This strategy is used for patterns like "abc$" or "pattern.*suffix$" where the pattern must match at the end of the haystack. Instead of trying to match from every position in the haystack (O(n) attempts), we search backward from the end of the haystack (O(1) attempt).

Algorithm:

  1. Build reverse NFA from forward NFA
  2. Build reverse DFA from reverse NFA
  3. Search backward from end of haystack using reverse DFA
  4. If match found, convert reverse positions to forward positions

Performance:

  • Forward search (naive): O(n*m) where n=haystack length, m=pattern length
  • Reverse search: O(m) - only one search attempt from the end
  • Speedup: ~n/m (e.g., 1000x for 1MB haystack and 1KB pattern)

Example:

// Pattern "Easy1$" on 1MB data
// Forward: 340 seconds (tries match at every position)
// Reverse: ~1 millisecond (one match attempt from end)

func NewReverseAnchoredSearcher added in v0.4.0

func NewReverseAnchoredSearcher(forwardNFA *nfa.NFA, config lazy.Config) (*ReverseAnchoredSearcher, error)

NewReverseAnchoredSearcher creates a reverse searcher from forward NFA.

Parameters:

  • forwardNFA: the compiled forward NFA
  • config: DFA configuration for reverse DFA cache

Returns nil if reverse DFA cannot be built (falls back to forward search).

func (*ReverseAnchoredSearcher) Find added in v0.4.0

func (s *ReverseAnchoredSearcher) Find(haystack []byte) *Match

Find searches backward from end of haystack and returns the match.

Algorithm:

  1. Use reverse DFA SearchReverse to find match START (zero-allocation)
  2. For $-anchored patterns, END is always len(haystack)

Performance:

  • ZERO-ALLOCATION: no byte reversal needed
  • Single DFA scan: O(m) where m = match length
  • Much faster than PikeVM + reverseBytes approach

Example:

Forward pattern: "abc$"
Forward haystack: "xxxabc"
SearchReverse finds start=3, end=6 (because $ anchor)

func (*ReverseAnchoredSearcher) IsMatch added in v0.4.0

func (s *ReverseAnchoredSearcher) IsMatch(haystack []byte) bool

IsMatch checks if the pattern matches at the end of haystack.

This is optimized for boolean matching:

  • Uses reverse DFA for fast rejection
  • ZERO-ALLOCATION: backward scan without byte reversal
  • No Match object allocation
  • Early termination

type ReverseInnerSearcher added in v0.8.0

type ReverseInnerSearcher struct {
	// contains filtered or unexported fields
}

ReverseInnerSearcher performs inner literal prefilter + bidirectional DFA search.

This strategy is used for patterns with inner literals like `prefix.*inner.*suffix` where:

  • The pattern has a good inner literal for prefiltering
  • Has wildcards/repetitions both BEFORE and AFTER the inner literal
  • Can use bidirectional DFA to find exact match bounds

Algorithm:

  1. Extract inner literals from pattern
  2. Build prefilter for inner literals
  3. Search:
     a. Prefilter finds inner literal candidates in the haystack
     b. For each candidate at position P, the reverse DFA scans backward from P to find the match START, and the forward DFA scans forward from P+innerLen to find the match END
     c. Return the leftmost-longest match

Performance:

  • Forward search: O(n*m) where n=haystack length, m=pattern length
  • ReverseInner: O(k*(m1+m2)) where k=number of inner candidates, m1=prefix length, m2=suffix length
  • Speedup: 10-100x for patterns like `ERROR.*connection.*timeout` on large haystacks

Example:

// Pattern `ERROR.*connection.*timeout` on 1MB log with 5 "connection" occurrences
// Forward: tries pattern match at every position (~1M attempts)
// ReverseInner: prefilter finds 5 "connection" positions, bidirectional DFA verifies (~5 attempts)
// Speedup: ~200,000x

func NewReverseInnerSearcher added in v0.8.0

func NewReverseInnerSearcher(
	fullNFA *nfa.NFA,
	innerInfo *literal.InnerLiteralInfo,
	config lazy.Config,
) (*ReverseInnerSearcher, error)

NewReverseInnerSearcher creates a reverse inner searcher using AST splitting.

The key optimization (from rust-regex):

  • Build reverse NFA from PREFIX AST only (not full pattern)
  • Build forward NFA from SUFFIX AST only (not full pattern)
  • This enables true bidirectional search with 10-100x speedup

Requirements:

  • InnerLiteralInfo must have PrefixAST and SuffixAST populated
  • Inner literal must have wildcards both before and after
  • Prefilter must be available

Parameters:

  • fullNFA: the compiled NFA for the full pattern (for fallback)
  • innerInfo: extracted inner literal info with split AST
  • config: DFA configuration for reverse/forward DFA cache

Returns error if reverse inner optimization cannot be applied:

  • ErrNoInnerLiterals: no inner literals available
  • ErrNoPrefilter: prefilter could not be built
  • DFA compilation errors

func (*ReverseInnerSearcher) Find added in v0.8.0

func (s *ReverseInnerSearcher) Find(haystack []byte) *Match

Find searches using inner literal prefilter + bidirectional DFA and returns the match.

Algorithm (leftmost-longest/greedy semantics):

  1. Use prefilter to find ALL inner literal candidates
  2. For each candidate at position P:
     a. Reverse DFA scans backward from P to find the match START
     b. Forward DFA scans forward from P+innerLen to find the match END
  3. Track the leftmost-longest match: leftmost means the earliest start position; if two matches share a start, the longer one wins
  4. Return the best match found

Performance:

  • ZERO PikeVM calls - uses DFA exclusively
  • Bidirectional DFA scan finds both match start and end efficiently
  • Prefilter reduces search space dramatically

Example (bidirectional matching):

Pattern: `ERROR.*connection.*timeout`
Haystack: "ERROR: connection lost due to connection timeout"
Inner literal: "connection"

1. Prefilter finds "connection" at position 7, then at position 30
2. Candidate 1 (pos=7):
   a. Reverse DFA from pos=7 backward → finds start=0 (ERROR)
   b. Forward DFA from pos=17 forward → finds end=24 (lost, no timeout)
   c. No valid match (pattern requires "timeout" after)
3. Candidate 2 (pos=30):
   a. Reverse DFA from pos=30 backward → finds start=0 (ERROR)
   b. Forward DFA from pos=40 forward → finds end=48 (timeout)
   c. Valid match [0:48]
4. Return [0:48] = "ERROR: connection lost due to connection timeout"

func (*ReverseInnerSearcher) FindIndicesAt added in v0.8.21

func (s *ReverseInnerSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)

FindIndicesAt returns match indices starting from position 'at' - zero allocation version. This is used by FindAll* operations for efficient iteration.

func (*ReverseInnerSearcher) IsMatch added in v0.8.0

func (s *ReverseInnerSearcher) IsMatch(haystack []byte) bool

IsMatch checks if the pattern matches using inner prefilter + bidirectional DFA.

This is optimized for boolean matching:

  • Uses prefilter for fast candidate finding
  • Uses bidirectional DFA for fast verification
  • No Match object allocation
  • Early termination on first match
  • ZERO PikeVM calls - DFA confirmation is sufficient

Algorithm:

  1. Prefilter finds inner literal candidates
  2. For each candidate:
     a. Reverse DFA checks if we can reach the inner literal from the start
     b. Forward DFA checks if we can reach the end from the inner literal
  3. Return true on first valid match

type ReverseSuffixSearcher added in v0.6.0

type ReverseSuffixSearcher struct {
	// contains filtered or unexported fields
}

ReverseSuffixSearcher performs suffix literal prefilter + reverse DFA search.

This strategy is used for patterns with literal suffixes like `.*\.txt` where:

  • The pattern is NOT anchored at start (^)
  • Has a good suffix literal for prefiltering
  • Can use reverse DFA to verify the prefix pattern

Algorithm:

  1. Extract suffix literals from pattern
  2. Build prefilter for suffix literals
  3. Search:
     a. Prefilter finds suffix candidates in the haystack
     b. For each candidate, run a reverse search from the haystack start to the suffix end: use the reverse DFA to verify the prefix pattern and, on a match, the forward DFA to find the full match end
     c. Return the first match

Performance:

  • Forward naive search: O(n*m) where n=haystack length, m=pattern length
  • ReverseSuffix: O(k*m) where k=number of suffix candidates (usually k << n)
  • Speedup: 10-100x for patterns like `.*\.txt` on large haystacks

Example:

// Pattern `.*\.txt` on 1MB data with 10 `.txt` occurrences
// Forward: tries pattern match at every position (~1M attempts)
// ReverseSuffix: prefilter finds 10 `.txt` positions, reverse DFA verifies (~10 attempts)
// Speedup: ~100,000x

func NewReverseSuffixSearcher added in v0.6.0

func NewReverseSuffixSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	config lazy.Config,
) (*ReverseSuffixSearcher, error)

NewReverseSuffixSearcher creates a reverse suffix searcher from forward NFA.

Requirements:

  • Pattern must have good suffix literals
  • Pattern must NOT be start-anchored (^)
  • Prefilter must be available

Parameters:

  • forwardNFA: the compiled forward NFA
  • suffixLiterals: extracted suffix literals from pattern
  • config: DFA configuration for reverse DFA cache

Returns nil if reverse suffix optimization cannot be applied.

func (*ReverseSuffixSearcher) Find added in v0.6.0

func (s *ReverseSuffixSearcher) Find(haystack []byte) *Match

Find searches using suffix literal prefilter + reverse DFA and returns the match.

Algorithm (find LAST suffix for greedy semantics):

  1. Use prefilter to find the LAST suffix literal candidate
  2. Use reverse DFA to find match START (leftmost)
  3. Return match immediately (no forward scan needed!)

Why find LAST suffix?

  • Pattern `.*\.txt` is greedy - `.*` matches as much as possible
  • For input "a.txt.txt", the greedy match is the ENTIRE string [0:9]
  • Finding the LAST `.txt` (at position 5) and reverse scanning gives us this
  • No expensive forward DFA scan needed!

Performance:

  • Single prefilter scan to find last suffix: O(n)
  • Single reverse DFA scan: O(n)
  • Total: O(n) with small constant

Example:

Pattern: `.*\.txt`
Haystack: "a.txt.txt"
Suffix literal: `.txt`

1. Prefilter finds LAST `.txt` at position 5
2. Reverse DFA from [0,9] finds match start = 0
3. Return [0:9] = "a.txt.txt" (greedy!)

func (*ReverseSuffixSearcher) FindAt added in v0.8.19

func (s *ReverseSuffixSearcher) FindAt(haystack []byte, at int) *Match

FindAt searches for a match starting from position 'at' using suffix prefilter + reverse DFA.

Unlike Find() which returns the greedy (last suffix) match, FindAt returns the first match starting at or after position 'at'. This is essential for FindAll iteration.

Algorithm:

  1. Use prefilter to find FIRST suffix candidate >= at
  2. Use reverse DFA to find match START (from 'at' position)
  3. Return match [start, suffixEnd]

Performance:

  • Prefilter scan: O(n) from 'at' position
  • Reverse DFA verification: O(m) where m is match length
  • Total: O(n) per call

func (*ReverseSuffixSearcher) FindIndicesAt added in v0.8.19

func (s *ReverseSuffixSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)

FindIndicesAt returns match indices starting from position 'at' - zero allocation version.

func (*ReverseSuffixSearcher) IsMatch added in v0.6.0

func (s *ReverseSuffixSearcher) IsMatch(haystack []byte) bool

IsMatch checks if the pattern matches using suffix prefilter + reverse DFA.

This is optimized for boolean matching:

  • Uses prefilter for fast candidate finding
  • Uses reverse DFA for fast prefix verification
  • No Match object allocation
  • Early termination on first match
  • ZERO PikeVM calls - reverse DFA confirmation is sufficient

type ReverseSuffixSetSearcher added in v0.8.20

type ReverseSuffixSetSearcher struct {
	// contains filtered or unexported fields
}

ReverseSuffixSetSearcher performs Teddy multi-suffix prefilter + reverse DFA search.

This strategy handles patterns like `.*\.(txt|log|md)` where:

  • The suffix is an alternation with no common suffix (LCS is empty)
  • Multiple suffix literals are available (2-8 literals, each >= 3 bytes)
  • Teddy can efficiently search for any of the suffixes

This optimization is NOT present in rust-regex (they fall back to Core strategy).

Algorithm:

  1. Build Teddy prefilter from all suffix literals
  2. Search:
     a. Teddy finds any suffix literal in the haystack
     b. Use the reverse DFA to verify the prefix pattern
     c. For `.*` prefix patterns, the match starts at position 0 (skip the reverse scan)
     d. Return the match

Performance:

  • Without this optimization: O(n*m) using UseBoth strategy
  • With Teddy suffix prefilter: O(n + k*m) where k = suffix candidates
  • Speedup: 5-10x for patterns like `.*\.(txt|log|md)`

func NewReverseSuffixSetSearcher added in v0.8.20

func NewReverseSuffixSetSearcher(
	forwardNFA *nfa.NFA,
	suffixLiterals *literal.Seq,
	config lazy.Config,
) (*ReverseSuffixSetSearcher, error)

NewReverseSuffixSetSearcher creates a reverse suffix set searcher.

Requirements:

  • Pattern must have 2-8 suffix literals
  • Each suffix literal must be >= 2 bytes (allows extensions like ".md")
  • Pattern must NOT be start-anchored (^)

Returns nil if the optimization cannot be applied.

func (*ReverseSuffixSetSearcher) Find added in v0.8.20

func (s *ReverseSuffixSetSearcher) Find(haystack []byte) *Match

Find searches using Teddy suffix prefilter + reverse DFA.

For greedy matching (like `.*`), we need to find the LAST matching suffix. However, with multiple suffix lengths, we iterate through all candidates and track the best (rightmost) match.

func (*ReverseSuffixSetSearcher) FindAt added in v0.8.20

func (s *ReverseSuffixSetSearcher) FindAt(haystack []byte, at int) *Match

FindAt searches for a match starting from position 'at'.

func (*ReverseSuffixSetSearcher) FindIndicesAt added in v0.8.20

func (s *ReverseSuffixSetSearcher) FindIndicesAt(haystack []byte, at int) (start, end int, found bool)

FindIndicesAt returns match indices - zero allocation version.

func (*ReverseSuffixSetSearcher) IsMatch added in v0.8.20

func (s *ReverseSuffixSetSearcher) IsMatch(haystack []byte) bool

IsMatch checks if the pattern matches using suffix set prefilter.

type SearchState added in v0.10.4

type SearchState struct {
	// contains filtered or unexported fields
}

SearchState holds per-search mutable state for thread-safe concurrent searches. This struct should be obtained from a sync.Pool to enable safe concurrent usage of the same compiled Engine from multiple goroutines.

Usage pattern:

state := engine.getSearchState()
defer engine.putSearchState(state)
// use state for search operations

Thread safety: Each goroutine must use its own SearchState instance. The SearchState itself is NOT thread-safe - it must not be shared between goroutines.

type Stats

type Stats struct {
	// NFASearches counts NFA (PikeVM) searches
	NFASearches uint64

	// DFASearches counts DFA searches
	DFASearches uint64

	// OnePassSearches counts OnePass DFA searches (for FindSubmatch)
	OnePassSearches uint64

	// AhoCorasickSearches counts Aho-Corasick automaton searches
	AhoCorasickSearches uint64

	// PrefilterHits counts successful prefilter matches
	PrefilterHits uint64

	// PrefilterMisses counts prefilter candidates that didn't match
	PrefilterMisses uint64

	// PrefilterAbandoned counts times prefilter was abandoned due to high FP rate
	PrefilterAbandoned uint64

	// DFACacheFull counts times DFA fell back to NFA due to cache full
	DFACacheFull uint64
}

Stats tracks execution statistics for performance analysis.

type Strategy

type Strategy int

Strategy represents the execution strategy for regex matching.

The meta-engine chooses between:

  • UseNFA: use PikeVM exclusively (simple patterns, no cache needed)
  • UseDFA: use Lazy DFA with NFA fallback (complex patterns, good literals)
  • UseBoth: adaptive strategy (try DFA first, fallback to NFA on cache full)

Strategy selection is automatic based on pattern analysis.

const (
	// UseNFA uses only the NFA (PikeVM) engine.
	// Selected for:
	//   - Very small NFAs (< 20 states) where DFA overhead isn't worth it
	//   - Patterns without literals where DFA has no advantage
	//   - When EnableDFA is false in config
	UseNFA Strategy = iota

	// UseDFA uses Lazy DFA with NFA fallback on cache overflow.
	// Selected for:
	//   - Large NFAs (> 100 states) where DFA is essential
	//   - Patterns with good literals (prefilter + DFA is fastest)
	//   - Simple patterns (no alternations) where DFA doesn't blow up
	UseDFA

	// UseBoth uses adaptive strategy: try DFA, fallback to NFA if cache fills.
	// Selected for:
	//   - Medium-sized NFAs (20-100 states)
	//   - Patterns with some literals but complex structure
	//   - Default when pattern characteristics are unclear
	UseBoth

	// UseReverseAnchored uses reverse DFA search for patterns anchored at end.
	// Selected for:
	//   - Patterns with $ or \z anchor (end of text)
	//   - NOT also anchored at start (^)
	//   - Searches backward from end of haystack
	//   - Converts O(n*m) to O(m) for end-anchored patterns
	UseReverseAnchored

	// UseReverseSuffix uses suffix literal prefilter + reverse DFA search.
	// Selected for:
	//   - Patterns with literal suffix (e.g., `.*\.txt`)
	//   - NOT start-anchored (^)
	//   - Has good suffix literal for prefiltering
	//   - Speedup: 10-100x for patterns like `.*\.txt`
	UseReverseSuffix

	// UseOnePass uses one-pass DFA for anchored patterns with capture groups.
	// Selected for:
	//   - Pattern is always anchored (^ or implicit anchor)
	//   - Pattern is "one-pass" (no ambiguity in matching paths)
	//   - Pattern has capture groups (otherwise lazy DFA is faster)
	//   - Speedup: 10-20x over PikeVM for capture group extraction
	//   - Only used for FindSubmatch, not Find
	UseOnePass

	// UseReverseInner uses inner literal prefilter + bidirectional DFA search.
	// Selected for:
	//   - Patterns with inner literal (e.g., `prefix.*inner.*suffix`)
	//   - NOT start-anchored (^) or end-anchored ($)
	//   - Has good inner literal for prefiltering
	//   - NO good prefix or suffix literals (otherwise prefer UseDFA/UseReverseSuffix)
	//   - Has wildcards both before AND after inner literal
	//   - Speedup: 10-100x for patterns like `ERROR.*connection.*timeout`
	UseReverseInner

	// UseBoundedBacktracker uses bounded backtracking with bit-vector visited tracking.
	// Selected for:
	//   - Simple character class patterns (\d+, \w+, [a-z]+) without literals
	//   - Small enough input (states * inputLen <= threshold)
	//   - No prefilter benefit (no extractable literals)
	//   - Speedup: 2-4x over PikeVM for character class patterns
	UseBoundedBacktracker

	// UseTeddy uses Teddy multi-pattern prefilter directly without DFA.
	// Selected for:
	//   - Exact literal alternations like (foo|bar|baz)
	//   - All literals are complete (no regex engine verification needed)
	//   - 2-32 patterns, each >= 3 bytes
	//   - Speedup: 50-250x over PikeVM by skipping all DFA/NFA overhead
	//
	// This implements the "literal engine bypass" optimization from Rust regex:
	// when patterns are exact literals, the prefilter IS the engine.
	UseTeddy

	// UseReverseSuffixSet uses Teddy multi-pattern prefilter for suffix alternations.
	// Selected for:
	//   - Patterns like `.*\.(txt|log|md)` where suffix is an alternation
	//   - No common suffix (LCS is empty), but multiple suffix literals available
	//   - 2-32 suffix literals, each >= 3 bytes
	//   - Speedup: 5-10x over UseBoth by using Teddy for suffix candidates
	//
	// Algorithm:
	//   1. Teddy finds any of the suffix literals (e.g., ".txt", ".log", ".md")
	//   2. Reverse DFA scan from suffix position to find match start
	//   3. For `.*` prefix patterns, match starts at position 0 (skip reverse scan)
	//
	// This is an optimization NOT present in rust-regex (they fallback to Core).
	UseReverseSuffixSet

	// UseCharClassSearcher uses specialized lookup-table searcher for simple char_class+ patterns.
	// Selected for:
	//   - Patterns like `[\w]+`, `[a-z]+`, `\d+` (simple repeated character class)
	//   - NOT concatenations (those use BoundedBacktracker)
	//   - NOT patterns with capture groups
	//   - Speedup: 14-22x over stdlib, 14-17x over BoundedBacktracker
	//
	// Uses 256-byte membership table for O(1) byte classification instead of
	// NFA state tracking. Optimal for "find all words" type patterns.
	UseCharClassSearcher

	// UseCompositeSearcher uses sequential lookup tables for concatenated char class patterns.
	// Selected for:
	//   - Patterns like [a-zA-Z]+[0-9]+, \d+\s+\w+, [a-z]+[A-Z]+
	//   - Concatenation of 2+ quantified character classes
	//   - No anchors, captures, or alternations within
	//   - Speedup: 5-6x over BoundedBacktracker by using O(1) lookup tables
	//
	// Algorithm:
	//   1. Each char class part has [256]bool membership table
	//   2. Greedy matching: consume max chars for each part
	//   3. Backtrack if min requirement not met
	//
	// Reference: https://github.com/coregx/coregex/issues/72
	UseCompositeSearcher

	// UseBranchDispatch uses O(1) first-byte dispatch for anchored alternations.
	// Selected for:
	//   - Start-anchored patterns like ^(\d+|UUID|hex32)
	//   - Each alternation branch has distinct first bytes (no overlap)
	//   - Speedup: 2-3x on match, 10x+ on no-match by avoiding branch iteration
	//
	// Algorithm:
	//   1. Build [256]int8 dispatch table: first_byte → branch_index
	//   2. On search: dispatch[haystack[0]] gives branch to try
	//   3. Only execute that single branch instead of all branches
	//
	// Reference: https://github.com/coregx/coregex/issues/79
	UseBranchDispatch

	// UseDigitPrefilter uses SIMD digit scanning for patterns that must start with digits.
	// Selected for:
	//   - Patterns where ALL alternation branches must start with a digit [0-9]
	//   - Examples: IP address patterns, numeric validators
	//   - Pattern has no extractable prefix literals (due to alternation structure)
	//   - Speedup: 5-10x by skipping non-digit regions with SIMD
	//
	// Algorithm:
	//   1. SIMD scan haystack for digit sequences
	//   2. At each digit position, run lazy DFA to verify match
	//   3. Skip non-digit regions entirely (major speedup for sparse matches)
	//
	// This addresses Issue #50 (IP regex optimization) where alternations like
	// `25[0-5]|2[0-4][0-9]|...` produce no extractable prefix literals.
	UseDigitPrefilter

	// UseAhoCorasick uses Aho-Corasick automaton for large literal alternations.
	// Selected for:
	//   - Exact literal alternations with >32 patterns (beyond Teddy's limit)
	//   - All literals are complete (no regex engine verification needed)
	//   - Each pattern >= 1 byte
	//   - Speedup: 50-500x over PikeVM by using O(n) multi-pattern matching
	//
	// Uses github.com/coregx/ahocorasick library with:
	//   - Dense array transitions for O(1) state lookup
	//   - Byte class compression for memory efficiency
	//   - ~1.6 GB/s throughput
	//
	// This extends the "literal engine bypass" optimization for large pattern sets
	// where Teddy's SIMD approach becomes impractical.
	UseAhoCorasick

	// UseAnchoredLiteral uses specialized O(1) matching for anchored prefix.*suffix patterns.
	// Selected for:
	//   - Patterns matching: ^prefix.*[charclass+]suffix$
	//   - Both start (^) and end ($) anchored
	//   - Contains .* or .+ wildcard
	//   - Has literal suffix (e.g., ".php", ".txt")
	//   - Optional prefix literal and charclass bridge
	//   - Speedup: 50-90x over stdlib by avoiding NFA/DFA entirely
	//
	// Algorithm (all O(1) or O(k) operations):
	//   1. O(1) length check (MinLength)
	//   2. O(k) prefix check (bytes.HasPrefix equivalent)
	//   3. O(k) suffix check (bytes.HasSuffix equivalent)
	//   4. O(m) charclass bridge verification (if required)
	//
	// Examples:
	//   - ^/.*[\w-]+\.php$ → prefix="/", suffix=".php", charclass=[\w-]
	//   - ^.*\.txt$        → no prefix, suffix=".txt", no charclass
	//   - ^api/v1/.*\.json$ → prefix="api/v1/", suffix=".json"
	//
	// This is a specialized "literal engine bypass" for URL/path matching patterns
	// that are extremely common in web applications and routing tables.
	// Reference: https://github.com/coregx/coregex/issues/79
	UseAnchoredLiteral

	// UseMultilineReverseSuffix uses line-aware suffix search for multiline patterns.
	// Selected for:
	//   - Patterns with (?m) multiline flag AND line-start anchor (^)
	//   - Pattern has good suffix literal (e.g., `(?m)^/.*\.php`)
	//   - The ^ anchor matches at LINE start, not just input start
	//   - Speedup: 5-20x over stdlib by avoiding per-position forward NFA
	//
	// Algorithm (from Rust regex rebar benchmarks):
	//   1. Find suffix literal candidates using prefilter
	//   2. For each candidate, scan backward to find LINE start (\n or pos 0)
	//   3. Verify prefix pattern from line start to suffix
	//   4. Return first valid match
	//
	// Key difference from UseReverseSuffix:
	//   - ReverseSuffix: unanchored (.*suffix), match starts at pos 0
	//   - MultilineReverseSuffix: line-anchored ((?m)^.*suffix), match starts at line start
	//
	// This fixes Issue #97 where `(?m)^/.*[\w-]+\.php` was 24% SLOWER than stdlib
	// because UseReverseSuffix assumed match always starts at position 0.
	// Reference: https://github.com/coregx/coregex/issues/97
	UseMultilineReverseSuffix
)

func SelectStrategy

func SelectStrategy(n *nfa.NFA, re *syntax.Regexp, literals *literal.Seq, config Config) Strategy

SelectStrategy analyzes the NFA and literals to choose the best execution strategy.

Algorithm:

  1. If end-anchored ($ or \z) and not start-anchored → UseReverseAnchored
  2. If DFA disabled in config → UseNFA
  3. If NFA is tiny (< 20 states) → UseNFA (DFA overhead not worth it)
  4. If simple character class pattern without literals → UseNFA (DFA overhead not worth it)
  5. If good literals exist → UseDFA (prefilter + DFA is fastest)
  6. If NFA is large (> 100 states) → UseDFA (essential for performance)
  7. Otherwise → UseBoth (adaptive)

"Good literals" means:

  • At least one literal exists
  • Longest common prefix (LCP) length >= MinLiteralLen
  • This enables effective prefiltering

Parameters:

  • n: the compiled NFA to analyze
  • re: the parsed regexp (for anchor detection, can be nil)
  • literals: extracted prefix literals (can be nil)
  • config: meta-engine configuration

Example:

strategy := meta.SelectStrategy(nfa, re, literals, config)
switch strategy {
case meta.UseNFA:
    // Use PikeVM only
case meta.UseDFA:
    // Use Lazy DFA
case meta.UseReverseAnchored:
    // Use reverse search
case meta.UseBoth:
    // Adaptive
}

func (Strategy) String

func (s Strategy) String() string

String returns a human-readable representation of the Strategy.
