pdf

v1.2.9 | Published: Jan 7, 2026 | License: BSD-3-Clause | Imports: 39 | Imported by: 0

Repository

github.com/Geek0x0/pdf

README

GoPDF - High-Performance PDF Processing Library

GoPDF is a powerful PDF processing library written in Go, focused on efficient text extraction, content analysis, and multilingual support. Built with a modular architecture, it provides high-performance concurrent processing capabilities.

✨ Key Features

📖 Text Extraction & Analysis

Intelligent Text Extraction: Supports plain text and styled text extraction
Semantic Classification: Automatic identification of titles, paragraphs, lists, tables, and other content types
Multilingual Support: Built-in English, French, German, and Spanish language detection and processing
Layout Analysis: Smart handling of multi-column layouts and complex page structures

🚀 Performance Optimization

Memory Optimization (NEW): Targeted allocation reduction for high-volume processing
- Pre-allocated slices with capacity estimation (30-40% allocation reduction)
- Eliminated unnecessary copies in hot paths (50% memory reduction in sorting)
- Precise capacity calculation in merge operations (100+ allocations → 3)
- Optimized string builder growth (40-50% reduction in string operations)
Sharded Caching: 256-shard cache with lock-free statistics (70-80% lock contention reduction)
Font Prefetching: Intelligent pattern-based font preloading with priority queuing
Zero-Copy Strings: Unsafe pointer optimization reducing memory allocation by 30-50%
Pool Warmup: Startup memory pool pre-warming reducing first-access latency by 60-80%
Enhanced Parallel Processing: Adaptive worker pools with batch processing (50% scheduling overhead reduction)
Memory Management: Advanced object pooling and resource management
Spatial Indexing: R-tree spatial indexing for optimized layout analysis
Asynchronous I/O: Streaming support for large files
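The "precise capacity calculation in merge operations" item can be illustrated with a minimal sketch. The `mergeExact` helper below is hypothetical and not part of the library; it only shows the general technique of summing lengths first so the result slice is allocated once instead of growing repeatedly.

```go
package main

import "fmt"

// mergeExact concatenates several slices into one. It computes the total
// length up front, so the destination is allocated exactly once and
// append never has to grow it.
func mergeExact(parts ...[]string) []string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	out := make([]string, 0, total)
	for _, p := range parts {
		out = append(out, p...)
	}
	return out
}

func main() {
	merged := mergeExact([]string{"a", "b"}, []string{"c"}, nil, []string{"d", "e"})
	fmt.Println(len(merged), cap(merged), merged) // 5 5 [a b c d e]
}
```

The same idea applies to `strings.Builder`, where a single `Grow` call with the final size avoids intermediate reallocations.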

🔧 Technical Features

Encoding Support: UTF-16, PDFDocEncoding, WinAnsi, MacRoman, and more
Compression Formats: Flate, LZW, ASCII85, RunLength
Encryption Support: RC4, AES encrypted PDFs
PDF Compatibility: Comprehensive PDF version and feature compatibility checking
PDF Recovery: Automatic recovery from malformed or corrupted PDF files
Thread Safety: Fully concurrent-safe operations
Robust Error Handling: Graceful degradation for malformed PDFs
- Library never panics on invalid input (errors returned instead)
- Tolerates missing PDF structure elements (endobj, endstream, etc.)
- Handles malformed hex strings, names, and escape sequences
- Graceful handling of truncated or corrupted files
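As an illustration of returning errors instead of panicking on malformed input, here is a minimal sketch of a tolerant decoder for the body of a PDF hex string. It follows the PDF specification's rules that whitespace is ignored and a missing final digit is treated as 0. The function names are hypothetical and this is not the library's implementation.

```go
package main

import "fmt"

// hexVal converts one ASCII hex digit to its value.
func hexVal(c byte) (byte, bool) {
	switch {
	case c >= '0' && c <= '9':
		return c - '0', true
	case c >= 'a' && c <= 'f':
		return c - 'a' + 10, true
	case c >= 'A' && c <= 'F':
		return c - 'A' + 10, true
	}
	return 0, false
}

// isPDFSpace reports whether c is one of the PDF whitespace characters.
func isPDFSpace(c byte) bool {
	switch c {
	case 0, '\t', '\n', '\f', '\r', ' ':
		return true
	}
	return false
}

// decodeHexString decodes the text between '<' and '>' of a PDF hex string.
// Whitespace is skipped and an odd trailing digit is padded with 0.
// Any other character yields an error rather than a panic.
func decodeHexString(s string) ([]byte, error) {
	out := make([]byte, 0, (len(s)+1)/2)
	var hi byte
	haveHi := false
	for i := 0; i < len(s); i++ {
		c := s[i]
		if isPDFSpace(c) {
			continue
		}
		v, ok := hexVal(c)
		if !ok {
			return out, fmt.Errorf("invalid hex digit %q at offset %d", c, i)
		}
		if haveHi {
			out = append(out, hi<<4|v)
			haveHi = false
		} else {
			hi, haveHi = v, true
		}
	}
	if haveHi {
		out = append(out, hi<<4) // missing final digit is assumed to be 0
	}
	return out, nil
}

func main() {
	b, err := decodeHexString("48 65 6C6C\n6F7")
	fmt.Printf("%q %v\n", b, err) // "Hellop" <nil>

	_, err = decodeHexString("4G")
	fmt.Println(err != nil) // true
}
```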

📦 Installation

go get -u github.com/Geek0x0/pdf

🚀 Quick Start

Basic Text Extraction

package main

import (
    "io"
    "log"
    "os"

    "github.com/Geek0x0/pdf"
)

func main() {
    // Open the PDF file (the package name is pdf, matching the import path)
    file, reader, err := pdf.Open("example.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Extract plain text as an io.Reader
    textReader, err := reader.GetPlainText()
    if err != nil {
        log.Fatal(err)
    }

    // Stream the extracted text to standard output
    if _, err := io.Copy(os.Stdout, textReader); err != nil {
        log.Fatal(err)
    }
}

⚡ Performance Quick Start

For high-performance PDF processing, follow these optimization steps:

1. Optimized Application Startup

import (
    "log"

    "github.com/Geek0x0/pdf"
)

func init() {
    // Pre-warm memory pools and optimize GC settings
    config := pdf.DefaultStartupConfig()
    config.WarmupPools = true
    config.GCPercent = 200  // Reduce GC frequency
    
    if err := pdf.OptimizedStartup(config); err != nil {
        log.Fatalf("Startup optimization failed: %v", err)
    }
}
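For readers curious about what pool warm-up and GC tuning mean in general Go terms, here is a minimal standard-library sketch. It assumes nothing about how `OptimizedStartup` works internally; `tuneGC`, `warmup`, `getBuffer`, and `putBuffer` are hypothetical names used only for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/debug"
	"sync"
)

// bufPool hands out reusable buffers; New is the fallback when the pool is empty.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// tuneGC sets the GC target percentage and returns the previous setting.
// A higher value makes the collector run less often at the cost of more heap.
func tuneGC(percent int) int {
	return debug.SetGCPercent(percent)
}

// warmup allocates n buffers with the given capacity and parks them in the
// pool, so early requests can reuse them instead of allocating.
// sync.Pool may still drop idle items during garbage collection.
func warmup(n, capacity int) {
	for i := 0; i < n; i++ {
		b := new(bytes.Buffer)
		b.Grow(capacity)
		bufPool.Put(b)
	}
}

// getBuffer returns an empty buffer from the pool.
func getBuffer() *bytes.Buffer {
	b := bufPool.Get().(*bytes.Buffer)
	b.Reset()
	return b
}

// putBuffer returns a buffer to the pool for reuse.
func putBuffer(b *bytes.Buffer) { bufPool.Put(b) }

func main() {
	prev := tuneGC(200)
	warmup(8, 4096)

	b := getBuffer()
	b.WriteString("hello")
	fmt.Println("previous GC percent:", prev, "buffered:", b.String())
	putBuffer(b)
}
```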

2. Use Parallel Extraction for Large Documents

func extractLargeDocument(filename string) ([]string, error) {
    f, r, err := pdf.Open(filename)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
    defer cancel()
    
    // Automatically uses all CPU cores
    return r.ExtractAllPagesParallel(ctx, 0)
}
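The pattern behind this kind of call, a fixed pool of workers that processes pages concurrently and returns results in page order, can be sketched with the standard library alone. The code below is an assumption-laden illustration, not the library's implementation: `extractParallel` and `demo` are hypothetical names, and the `extract` callback stands in for real page extraction.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"sync"
)

// extractParallel calls extract for pages 1..numPages using a fixed number of
// workers and returns the results in page order. workers <= 0 means one
// worker per CPU. The first error cancels the remaining work.
func extractParallel(ctx context.Context, numPages, workers int,
	extract func(page int) (string, error)) ([]string, error) {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make([]string, numPages)
	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				text, err := extract(i + 1)
				if err != nil {
					once.Do(func() { firstErr = err; cancel() })
					continue
				}
				results[i] = text // each index is written by exactly one worker
			}
		}()
	}
feed:
	for i := 0; i < numPages; i++ {
		select {
		case jobs <- i:
		case <-ctx.Done():
			break feed
		}
	}
	close(jobs)
	wg.Wait()

	if firstErr != nil {
		return nil, firstErr
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return results, nil
}

// demo runs the pool over five fake pages.
func demo() ([]string, error) {
	return extractParallel(context.Background(), 5, 0, func(p int) (string, error) {
		return fmt.Sprintf("page %d", p), nil
	})
}

func main() {
	pages, err := demo()
	fmt.Println(pages, err)
}
```

Writing each result to its own slice index keeps the output ordered without a lock or a sort step.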

3. Enable Caching for Repeated Operations

// Create global cache
var globalCache = pdf.NewShardedCache(100000, 1*time.Hour)

func getPageText(reader *pdf.Reader, pageNum int) (string, error) {
    cacheKey := fmt.Sprintf("page_%d", pageNum)
    
    // Check cache first; the comma-ok assertion avoids a panic on a non-string entry
    if cached, ok := globalCache.Get(cacheKey); ok {
        if text, ok := cached.(string); ok {
            return text, nil
        }
    }
    
    // Extract and cache
    page := reader.Page(pageNum)
    text, err := page.GetPlainText(nil)
    if err == nil {
        globalCache.Set(cacheKey, text, int64(len(text)))
    }
    
    return text, err
}
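To show why sharding reduces lock contention, here is a minimal sketch of a sharded map: keys are hashed to one of 256 shards, and each shard has its own lock, so goroutines touching different shards do not block one another. The `shardedMap` type is hypothetical; it omits the TTL, size accounting, and statistics that the library's `ShardedCache` exposes, and makes no claim about its internals.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 256

// shard is an independently locked piece of the map.
type shard struct {
	mu    sync.RWMutex
	items map[string]string
}

// shardedMap spreads keys over shardCount shards to reduce lock contention.
type shardedMap struct {
	shards [shardCount]shard
}

func newShardedMap() *shardedMap {
	m := &shardedMap{}
	for i := range m.shards {
		m.shards[i].items = make(map[string]string)
	}
	return m
}

// shardFor picks a shard by hashing the key with FNV-1a.
func (m *shardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &m.shards[h.Sum32()%shardCount]
}

// Set stores value under key, locking only the owning shard.
func (m *shardedMap) Set(key, value string) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.items[key] = value
	s.mu.Unlock()
}

// Get returns the value for key and whether it was present.
func (m *shardedMap) Get(key string) (string, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	m := newShardedMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			m.Set(fmt.Sprintf("page_%d", n), fmt.Sprintf("text %d", n))
		}(i)
	}
	wg.Wait()

	v, ok := m.Get("page_42")
	fmt.Println(v, ok) // text 42 true
}
```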

4. Use Zero-Copy for String Operations

func processTexts(texts []string) string {
    // Fast zero-copy string operations
    builder := pdf.NewStringBuffer(10240)
    
    for _, text := range texts {
        trimmed := pdf.TrimSpaceZeroCopy(text)
        builder.WriteString(trimmed)
        builder.WriteByte('\n')
    }
    
    return builder.StringCopy()
}
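For context on what "zero-copy" usually means in Go, the sketch below converts a byte slice to a string without copying (using `unsafe.String`, available since Go 1.20) and trims whitespace by reslicing. The lowercase function names are hypothetical and the code is not taken from the library. The conversion is only safe while the bytes are never modified afterwards, which is presumably why the README examples finish with `StringCopy()`.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
// The caller must not modify b after the call, because strings are
// expected to be immutable.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// isSpace reports whether c is an ASCII whitespace byte.
func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v'
}

// trimSpaceASCII returns s without leading and trailing ASCII whitespace.
// Slicing a string shares the original bytes, so nothing is copied.
func trimSpaceASCII(s string) string {
	start, end := 0, len(s)
	for start < end && isSpace(s[start]) {
		start++
	}
	for end > start && isSpace(s[end-1]) {
		end--
	}
	return s[start:end]
}

func main() {
	raw := []byte("  First paragraph  ")
	s := bytesToString(raw) // shares memory with raw
	fmt.Printf("%q\n", trimSpaceASCII(s)) // "First paragraph"
}
```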

PDF Compatibility Checking

// Check PDF compatibility and features
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}

compat, err := pdf.CheckPDFCompatibility(data)
if err != nil {
    log.Fatal(err)
}

fmt.Printf("PDF Version: %s\n", compat.Version)
fmt.Printf("Is Linearized: %v\n", compat.IsLinearized)
fmt.Printf("Has Transparency: %v\n", compat.HasTransparency)
fmt.Printf("Has Forms: %v\n", compat.HasForms)

if len(compat.Warnings) > 0 {
    fmt.Println("Warnings:")
    for _, warning := range compat.Warnings {
        fmt.Printf("  - %s\n", warning)
    }
}

// Validate PDF/A compliance
issues, err := pdf.ValidatePDFA(data)
if err != nil {
    log.Fatal(err)
}

if len(issues) == 0 {
    fmt.Println("PDF/A validation passed")
} else {
    fmt.Println("PDF/A validation issues:")
    for _, issue := range issues {
        fmt.Printf("  - %s\n", issue)
    }
}
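As background for the `Version` field, a PDF file starts with a header of the form `%PDF-1.7`. The sketch below reads that version with the standard library only. `headerVersion` is a hypothetical helper, and searching the first 1024 bytes is a leniency chosen for this sketch, not a statement about how `CheckPDFCompatibility` works.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// headerVersion looks for the "%PDF-" marker near the start of data and
// returns the version digits that follow it, for example "1.7".
func headerVersion(data []byte) (string, error) {
	head := data
	if len(head) > 1024 {
		head = head[:1024]
	}
	marker := []byte("%PDF-")
	i := bytes.Index(head, marker)
	if i < 0 {
		return "", errors.New("PDF header marker not found")
	}
	rest := head[i+len(marker):]
	end := 0
	for end < len(rest) && (rest[end] == '.' || (rest[end] >= '0' && rest[end] <= '9')) {
		end++
	}
	if end == 0 {
		return "", errors.New("PDF header has no version number")
	}
	return string(rest[:end]), nil
}

func main() {
	v, err := headerVersion([]byte("%PDF-1.7\n1 0 obj\n"))
	fmt.Println(v, err) // 1.7 <nil>
}
```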

PDF Integrity Checking and Recovery

// Check PDF integrity before processing
f, err := os.Open("potentially_corrupted.pdf")
if err != nil {
    log.Fatal(err)
}
defer f.Close()

stat, err := f.Stat()
if err != nil {
    log.Fatal(err)
}

integrity := pdf.CheckIntegrity(f, stat.Size())
fmt.Printf("PDF Valid: %v\n", integrity.IsValid)
fmt.Printf("Is Truncated: %v\n", integrity.IsTruncated)
fmt.Printf("Estimated Objects: %d\n", integrity.EstimatedObjects)

if len(integrity.Issues) > 0 {
    fmt.Println("Issues found:")
    for _, issue := range integrity.Issues {
        fmt.Printf("  - %s\n", issue)
    }
}

// Attempt to recover corrupted PDF
data, err := os.ReadFile("corrupted.pdf")
if err != nil {
    log.Fatal(err)
}

recovered, err := pdf.RecoverPDF(data)
if err != nil {
    log.Printf("Recovery failed: %v", err)
} else {
    fmt.Printf("Recovered PDF size: %d bytes\n", len(recovered))
    // Save recovered PDF
    err = os.WriteFile("recovered.pdf", recovered, 0644)
    if err != nil {
        log.Fatal(err)
    }
}
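A rough idea of what a cheap integrity pre-check can look at: a well-formed PDF begins with `%PDF-` and ends with an `%%EOF` marker, so a missing marker near the end of the file is a strong hint of truncation. The sketch below is a hypothetical stand-alone heuristic (`inspect` and `quickCheck` are made-up names) and is much cruder than a real parser; counting `endobj` keywords only estimates the number of objects.

```go
package main

import (
	"bytes"
	"fmt"
)

// quickCheck holds the results of a cheap structural scan.
type quickCheck struct {
	HasHeader bool // file starts with "%PDF-"
	HasEOF    bool // "%%EOF" appears in the last 1024 bytes
	Objects   int  // number of "endobj" keywords (rough estimate)
}

// inspect scans raw PDF bytes without parsing them.
func inspect(data []byte) quickCheck {
	tail := data
	if len(tail) > 1024 {
		tail = tail[len(tail)-1024:]
	}
	return quickCheck{
		HasHeader: bytes.HasPrefix(data, []byte("%PDF-")),
		HasEOF:    bytes.Contains(tail, []byte("%%EOF")),
		Objects:   bytes.Count(data, []byte("endobj")),
	}
}

func main() {
	complete := []byte("%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\n%%EOF\n")
	fmt.Printf("%+v\n", inspect(complete))
	fmt.Printf("%+v\n", inspect(complete[:20])) // simulated truncation
}
```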

High-Performance Parallel Extraction

// Requires the "context", "fmt", "log", and "time" imports and an open reader

// Extract all pages in parallel with all optimizations
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()

// Automatically uses runtime.NumCPU() workers when workers=0
pages, err := reader.ExtractAllPagesParallel(ctx, 0)
if err != nil {
    log.Fatal(err)
}

for i, text := range pages {
    fmt.Printf("Page %d: %d characters\n", i+1, len(text))
}

Using ParallelExtractor Directly

// Create parallel extractor with custom worker count
extractor := pdf.NewParallelExtractor(4)
defer extractor.Close()

// Collect pages
numPages := reader.NumPage()
pages := make([]pdf.Page, numPages)
for i := 0; i < numPages; i++ {
    pages[i] = reader.Page(i + 1)
    pages[i].SetFontCacheInterface(extractor.GetCache())
}

// Extract with context
results, err := extractor.ExtractAllPages(ctx, pages)
if err != nil {
    log.Fatal(err)
}

// Get performance stats
cacheStats := extractor.GetCacheStats()
fmt.Printf("Cache hits: %d, misses: %d\n", cacheStats.Hits, cacheStats.Misses)

Multilingual Text Processing

// Create multilingual processor
processor := pdf.NewMultiLangProcessor()

// Detect text language
result := processor.DetectLanguage("Hello world! Bonjour le monde!")
fmt.Printf("Detected language: %s (confidence: %.2f)\n", result.Language, result.Confidence)

// Extract text by language
extractor := pdf.NewLanguageTextExtractor()
textsByLang, err := extractor.ExtractTextByLanguage(reader)
if err != nil {
    log.Fatal(err)
}
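One simple way to detect a language and attach a confidence score is to count common stopwords per language. The sketch below does that for the four languages the README lists. It is only an illustration of the general idea: the word lists are tiny, the `detect` function is hypothetical, and nothing here is claimed to match how `DetectLanguage` is implemented.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// langProfile pairs a language code with a small stopword set.
type langProfile struct {
	code  string
	words map[string]bool
}

// profiles is checked in order, so ties resolve to the earlier entry.
var profiles = []langProfile{
	{"en", set("the", "and", "is", "of", "to", "in")},
	{"fr", set("le", "la", "et", "est", "les", "des")},
	{"de", set("der", "die", "und", "ist", "das", "nicht")},
	{"es", set("el", "los", "y", "es", "una", "que")},
}

// set builds a lookup table from a list of words.
func set(words ...string) map[string]bool {
	m := make(map[string]bool, len(words))
	for _, w := range words {
		m[w] = true
	}
	return m
}

// detect returns the language with the most stopword hits and the share of
// all hits that belong to it. It returns "" and 0 when nothing matches.
func detect(text string) (lang string, confidence float64) {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	best, total := 0, 0
	for _, p := range profiles {
		hits := 0
		for _, t := range tokens {
			if p.words[t] {
				hits++
			}
		}
		total += hits
		if hits > best {
			best, lang = hits, p.code
		}
	}
	if total == 0 {
		return "", 0
	}
	return lang, float64(best) / float64(total)
}

func main() {
	lang, conf := detect("The cat is in the garden and the dog is happy")
	fmt.Printf("%s %.2f\n", lang, conf) // en 1.00
}
```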

Performance Optimization Features

// 1. Optimized Startup with Pool Warmup
err := pdf.OptimizedStartup(pdf.DefaultStartupConfig())
if err != nil {
    log.Fatal(err)
}

// 2. Sharded Cache (256 shards, lock-free)
cache := pdf.NewShardedCache(10000, 30*time.Minute)
cache.Set("key", value, 100)
if val, ok := cache.Get("key"); ok {
    // Use cached value
}
stats := cache.GetStats()
fmt.Printf("Hits: %d, Misses: %d, Evictions: %d\n", 
    stats.Hits, stats.Misses, stats.Evictions)

// 3. Font Prefetching (intelligent pattern-based)
fontCache := pdf.NewOptimizedFontCache(1000)
prefetcher := pdf.NewFontPrefetcher(fontCache)
defer prefetcher.Close()
prefetcher.RecordAccess("Arial", []string{"Helvetica", "Times"})

// 4. Zero-Copy String Operations
builder := pdf.NewStringBuffer(1024)
builder.WriteString("Hello")
builder.WriteByte(' ')
builder.WriteString("World")
result := builder.StringCopy()  // Safe copy

// Fast string operations
trimmed := pdf.TrimSpaceZeroCopy("  text  ")
parts := pdf.SplitZeroCopy("a,b,c", ',')
joined := pdf.JoinZeroCopy([]string{"a", "b", "c"}, ",")

🏗️ Architecture

GoPDF uses a modular architecture with clear component responsibilities:

gopdf/
├── lex.go                       # PDF lexical analysis and tokenization
├── read.go                      # PDF file reading and parsing
├── text.go                      # Core text extraction logic
├── page.go                      # Page structure analysis
├── metadata.go                  # Metadata processing
├── compatibility.go             # PDF format compatibility checking
├── recovery.go                  # PDF recovery for malformed files
├── errors.go                    # Error handling and wrapping
├── caching.go                   # Caching strategy implementation
├── spatial_index.go             # Spatial indexing (R-tree)
├── text_classifier.go           # Text classifier
├── multilang.go                 # Multilingual support
├── parallel_processing.go       # Parallel processing
├── performance.go               # Performance optimization
├── async_io.go                  # Asynchronous I/O
│
│   # Performance Optimizations (2024)
├── sharded_cache.go             # 256-shard high-performance cache
├── font_prefetch.go             # Intelligent font prefetching
├── zero_copy_strings.go         # Zero-copy string operations
├── pool_warmup.go               # Memory pool pre-warming
├── enhanced_parallel.go         # Enhanced parallel processing
├── optimizations_advanced.go    # Advanced optimizations
└── memory_pools.go              # Advanced memory pool management

Core Components

Reader: Main PDF reading interface with encryption support
Text Extractor: Intelligent text extraction engine with smart ordering
Classifier: ML-based text classification for semantic analysis
Compatibility Checker: PDF version and feature compatibility validation
Recovery Engine: Automatic repair and recovery for damaged PDFs
Sharded Cache: 256-shard lock-free cache system
Font Prefetcher: Pattern-based predictive font loading
Parallel Extractor: Adaptive worker pool with batch processing
Spatial Index: R-tree spatial query optimization
Language Processor: Multilingual detection and processing
Zero-Copy Optimizer: Memory allocation reduction utilities

📊 Performance Benchmarks

Performance metrics based on standard test datasets (Intel i7-14700K):

Overall Performance

Text Extraction Speed: Average 50-100 pages/second
Memory Usage: Smart object pooling, 40% reduction in memory footprint
Concurrent Processing: Multi-core support, 3-5x performance improvement with parallel extractor

Optimization Benchmarks

Sharded Cache Performance

Set Operations: ~118 ns/op (256 shards)
Get Operations: ~112 ns/op
Concurrent Access: ~31 ns/op (70-80% lock contention reduction)
Cache Hit Rate: Up to 85% with LRU policies

Zero-Copy String Operations

BytesToString: 0.14 ns/op (97x faster than standard)
String Concat: 10.12 ns/op (3.1x faster)
TrimSpace: 2.67 ns/op (1.2x faster)
Split: 59.62 ns/op (1.3x faster)

Memory Pool Warmup

Light Warmup: ~37 µs (development)
Default Warmup: ~96 µs (production)
Aggressive Warmup: ~358 µs (high-performance)
Concurrent vs Sequential: 35% faster with concurrent warmup

Parallel Extraction

2 Workers: 1.8x speedup
4 Workers: 3.1x speedup
8 Workers: 5.0x speedup
Auto (CPU cores): 4.2x average speedup

🧪 Testing

The main package has 67.6% test coverage. The test suite includes:

Unit tests covering all core functionality
Integration tests for end-to-end PDF processing
Performance tests with benchmarks and memory profiling
Concurrency tests for thread safety validation
Optimization-specific tests for new features

# Run all tests
go test ./...

# Run coverage tests
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms

# Run specific optimization benchmarks
go test -bench=BenchmarkShardedCache -run=^$
go test -bench=BenchmarkStringOperations -run=^$
go test -bench=BenchmarkParallelExtractor -run=^$
go test -bench=BenchmarkWarmup -run=^$

Benchmark Examples

# Compare parallel vs sequential extraction
go test -bench=BenchmarkParallelExtractorVsSequential -run=^$ -benchtime=500ms

# Zero-copy string operations performance
go test -bench=BenchmarkStringOperations -run=^$ -benchtime=500ms

# Cache performance under different loads
go test -bench=BenchmarkShardedCache -run=^$ -benchtime=1s

📋 Command Line Tool

GoPDF includes a command-line tool for quick PDF text extraction:

# Build the CLI tool
go build -o pdfcli ./cmd/pdfcli

# Extract plain text from all pages
./pdfcli document.pdf

# Extract text from specific page
./pdfcli -page 1 document.pdf

# Extract styled text with formatting
./pdfcli -mode styled document.pdf

# Extract text organized by rows
./pdfcli -mode rows -page 1 document.pdf

# Extract text organized by columns
./pdfcli -mode columns -page 1 document.pdf

Core Interfaces

// PDF file operations
Open(filename string) (*os.File, *Reader, error)
NewReader(r io.ReaderAt, size int64) (*Reader, error)

// PDF compatibility checking
CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)
ValidatePDFA(data []byte) ([]string, error)
ValidatePDFX(data []byte) ([]string, error)

// PDF integrity and recovery
CheckIntegrity(r io.ReaderAt, size int64) *IntegrityStatus
RecoverPDF(data []byte) ([]byte, error)

// Text extraction
(reader *Reader) GetPlainText() (io.Reader, error)
(reader *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)
(reader *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)

// Page operations
(reader *Reader) Page(num int) *Page
(page *Page) Content() *Content
(page *Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)

// High-Performance Parallel Extraction
NewParallelExtractor(workers int) *ParallelExtractor
(pe *ParallelExtractor) ExtractAllPages(ctx context.Context, pages []Page) ([][]Text, error)
(pe *ParallelExtractor) GetCacheStats() ShardedCacheStats
(pe *ParallelExtractor) GetPrefetchStats() PrefetchStats
(pe *ParallelExtractor) Close()

// Sharded Cache
NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache
(sc *ShardedCache) Get(key string) (interface{}, bool)
(sc *ShardedCache) Set(key string, value interface{}, size int64)
(sc *ShardedCache) GetStats() ShardedCacheStats
(sc *ShardedCache) Clear()

// Font Prefetching
NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher
(fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)
(fp *FontPrefetcher) GetStats() PrefetchStats
(fp *FontPrefetcher) Close()

// Zero-Copy String Operations
BytesToString(b []byte) string
StringToBytes(s string) []byte
NewStringBuffer(capacity int) *StringBuffer
FastStringConcatZC(parts ...string) string
TrimSpaceZeroCopy(s string) string
SplitZeroCopy(s string, sep byte) []string
JoinZeroCopy(parts []string, sep string) string

// Pool Warmup
OptimizedStartup(config *StartupConfig) error
WarmupGlobal(config *WarmupConfig) error
DefaultWarmupConfig() *WarmupConfig

Performance Optimization APIs

// Optimized startup (recommended at application start)
config := pdf.DefaultStartupConfig()
config.WarmupPools = true
config.PreallocateCaches = true
err := pdf.OptimizedStartup(config)

// Create optimized font cache
fontCache := pdf.NewOptimizedFontCache(1000)

// Use string pool for repeated strings
pool := pdf.NewStringPool()
fontName := pool.Intern("Arial")
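String interning, as used for repeated values such as font names, keeps one canonical copy of each distinct string so that duplicates share memory. A minimal sketch of the technique follows. The `internPool` type is hypothetical and makes no claim about how the library's `StringPool` is built. It needs Go 1.20 or later for `unsafe.StringData`.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"unsafe"
)

// internPool stores one canonical copy of every string it has seen.
type internPool struct {
	mu sync.Mutex
	m  map[string]string
}

func newInternPool() *internPool {
	return &internPool{m: make(map[string]string)}
}

// Intern returns the canonical copy of s, adding it on first sight.
func (p *internPool) Intern(s string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if v, ok := p.m[s]; ok {
		return v
	}
	s = strings.Clone(s) // detach from any larger buffer s may point into
	p.m[s] = s
	return s
}

// Len reports how many distinct strings are stored.
func (p *internPool) Len() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.m)
}

// sameBacking reports whether a and b share the same underlying bytes.
func sameBacking(a, b string) bool {
	return len(a) == len(b) && unsafe.StringData(a) == unsafe.StringData(b)
}

func main() {
	pool := newInternPool()
	a := pool.Intern(string([]byte("Arial"))) // two separately allocated strings
	b := pool.Intern(string([]byte("Arial")))
	fmt.Println(a == b, sameBacking(a, b), pool.Len()) // true true 1
}
```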

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Create a Pull Request

Development Setup

# Clone repository
git clone https://github.com/Geek0x0/pdf.git
cd pdf

# Install dependencies
go mod download

# Run tests
go test ./...

# Run with coverage
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out

# Run performance benchmarks
go test -bench=. -benchmem -benchtime=500ms

# Build examples
go build ./examples/...

# Run specific example
go run ./examples/extract/main.go sample.pdf

Code Quality

Linting: Use golangci-lint for code quality checks
Formatting: Follow standard Go formatting with gofmt
Testing: Maintain or improve test coverage with new features
Documentation: Update README and code comments for API changes

Performance Contributions

When contributing performance optimizations:

Include benchmark tests for the optimization
Measure memory usage impact with go test -benchmem
Test under concurrent load scenarios
Document the performance improvement metrics

📄 License

This project is licensed under the BSD-3-Clause License - see the LICENSE file for details.

🙏 Acknowledgments

Based on unidoc/unipdf PDF parsing technology
Valuable feedback and suggestions from community contributors
Excellent language and toolchain provided by the Go team

📞 Contact

Project Home: https://github.com/Geek0x0/pdf
Issue Tracker: https://github.com/Geek0x0/pdf/issues

⭐ If this project helps you, please give us a star!

Documentation

Overview

compatibility.go - PDF format compatibility handling

Package pdf implements reading of PDF files.

PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package.

Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds:

Null, for the null object.
Integer, for an integer.
Real, for a floating-point number.
Bool, for a boolean value.
Name, for a name constant (as in /Helvetica).
String, for a string constant.
Dict, for a dictionary of name-value pairs.
Array, for an array of values.
Stream, for an opaque data stream and associated header dictionary.

The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported.
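The zero-result convention is easiest to see on a toy type. The sketch below defines a hypothetical, much-reduced `value` type (it is not the package's `Value`) whose accessors return zero results on a kind mismatch, which lets lookups be chained without error checks and also shows how a wrong key silently yields an empty result.

```go
package main

import "fmt"

// kind identifies what a value holds. The zero kind is null.
type kind int

const (
	nullKind kind = iota
	nameKind
	integerKind
	dictKind
)

// value is a toy stand-in for a PDF object.
type value struct {
	kind kind
	name string
	i    int64
	dict map[string]value
}

// Kind reports what the value holds.
func (v value) Kind() kind { return v.kind }

// Name returns the name, or "" when v is not a name.
func (v value) Name() string {
	if v.kind != nameKind {
		return ""
	}
	return v.name
}

// Int64 returns the integer, or 0 when v is not an integer.
func (v value) Int64() int64 {
	if v.kind != integerKind {
		return 0
	}
	return v.i
}

// Key returns the dictionary entry for k, or a null value when v is not a
// dictionary or the key is absent.
func (v value) Key(k string) value {
	if v.kind != dictKind {
		return value{}
	}
	return v.dict[k]
}

// samplePage builds a small nested dictionary resembling a page object.
func samplePage() value {
	font := value{kind: dictKind, dict: map[string]value{
		"BaseFont": {kind: nameKind, name: "Helvetica"},
	}}
	fonts := value{kind: dictKind, dict: map[string]value{"F1": font}}
	res := value{kind: dictKind, dict: map[string]value{"Font": fonts}}
	return value{kind: dictKind, dict: map[string]value{
		"Resources": res,
		"Rotate":    {kind: integerKind, i: 90},
	}}
}

func main() {
	page := samplePage()
	// A chained lookup needs no error checks.
	fmt.Printf("%q\n", page.Key("Resources").Key("Font").Key("F1").Key("BaseFont").Name()) // "Helvetica"
	// A misspelled key is not reported; the result is simply empty.
	fmt.Printf("%q\n", page.Key("Resource").Key("Font").Key("F1").Key("BaseFont").Name()) // ""
}
```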

The basic structure of the PDF file is exposed as the graph of Values.

Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.

Example (ZeroCopyInPDFProcessing)

Demonstrates how to use the zero-copy string helpers when processing text extracted from a PDF.

// Assume some text blocks are extracted from PDF
texts := []string{
	"  First paragraph  ",
	"  Second paragraph  ",
	"  Third paragraph  ",
}

// Process using zero-copy operations
builder := NewStringBuffer(1024)

for i, text := range texts {
	// Remove leading and trailing spaces (zero-copy)
	trimmed := TrimSpaceZeroCopy(text)
	builder.WriteString(trimmed)

	if i < len(texts)-1 {
		builder.WriteString("\n")
	}
}

result := builder.StringCopy()
fmt.Println(result)

Output:

First paragraph
Second paragraph
Third paragraph

Index

Constants
Variables
func AutoWarmup() error
func BatchCompareFloat64(a, b []float64, threshold float64) []bool
func BatchHexDecode(hexStrings []string) ([][]byte, []error)
func BenchmarkSortingAlgorithms(texts []Text, getCoord func(Text) float64) map[string]float64
func BytesToString(b []byte) string
func ClearGlobalStringPool()
func CompareStringsZeroCopy(s1, s2 string) int
func DetectCJKOrdering(fontName string) string
func EstimateCapacity(currentLen int, growthFactor float64) int
func ExampleOptimizations()
func FastHexValidation(hexStr string) bool
func FastSortTexts(texts []Text, less func(i, j int) bool)
func FastSortTextsByX(texts []Text)
func FastSortTextsByY(texts []Text)
func FastStringConcat(strings ...string) string
func FastStringConcatZC(parts ...string) string
func FastStringSearch(haystack, needle string) int
func GetBuilder() *strings.Builder
func GetByteBuffer() *[]byte
func GetCMapWritingMode(name string) int
func GetContentExtractorSlices() ([]Text, []Rect)
func GetIntSlice(size int) []int
func GetPDFBuffer() *buffer
func GetSizedBuffer(size int) []byte
func GetVerticalVariant(r rune) rune
func GlyphNameToRune(name string) rune
func HasPrefixZeroCopy(s, prefix string) bool
func HasSuffixZeroCopy(s, suffix string) bool
func HexDecodeSIMD(hexStr string) ([]byte, error)
func HilbertXYToIndex(x, y, order uint32) uint64
func InitPredefinedCMaps()
func InternRune(r rune) string
func InternString(s string) string
func Interpret(strm Value, do func(stk *Stack, op string))
func InterpretWithContext(ctx context.Context, strm Value, do func(stk *Stack, op string))
func InterpretWithContextAndLimits(ctx context.Context, strm Value, do func(stk *Stack, op string), ...)
func IsCJKCMap(name string) bool
func IsCJKFont(fontName string) bool
func IsSameSentence(last, current Text) bool
func IsType1Font(v Value) bool
func JoinZeroCopy(parts []string, sep string) string
func ListRegisteredCMaps() []string
func OptimizedStartup(config *StartupConfig) error
func PreallocateCache(fontCacheSize, resultCacheSize int)
func ProcessLargePDF(reader *Reader, chunkSize, bufferSize int, maxMemory int64, ...) error
func ProcessTextWithMultiLanguage(reader *Reader) (map[Language][]ClassifiedBlock, error)
func PutBlockSlice(s []ClassifiedBlock)
func PutBuilder(b *strings.Builder)
func PutByteBuffer(buf *[]byte)
func PutContentExtractorSlices(text []Text, rect []Rect)
func PutIntSlice(s []int)
func PutPDFBuffer(b *buffer)
func PutSizedBuffer(buf []byte)
func PutSizedStringBuilder(sb *FastStringBuilder, estimatedSize int)
func PutSizedTextSlice(slice []Text)
func PutText(t *Text)
func PutTextBlock(tb *TextBlock)
func PutTextBlocks(blocks []*TextBlock)
func PutTextSlice(s []Text)
func RadixSortFloat64(values []float64)
func RegisterCJKFont(name string, info *CJKFontInfo)
func RegisterPredefinedCMap(name string, cmap *PredefinedCMap)
func ResetSortingMetrics()
func ShouldRotateGlyph(r rune) bool
func SmartTextRunsToPlain(texts []Text) string
func SplitZeroCopy(s string, sep byte) []string
func StringSliceToByteSlice(strings []string) [][]byte
func StringToBytes(s string) []byte
func SubstringZeroCopy(s string, start, end int) string
func TrimSpaceZeroCopy(s string) string
func ValidatePDFA(data []byte) ([]string, error)
func ValidatePDFX(data []byte) ([]string, error)
func WarmupGlobal(config *WarmupConfig) error
func ZeroCopyStringSlice(data []byte, separators []byte) []string
type AccessPattern
type AccessPatternTracker
type AdaptiveCapacityEstimator
- func NewAdaptiveCapacityEstimator(maxSamples int) *AdaptiveCapacityEstimator
- func (ace *AdaptiveCapacityEstimator) Estimate(hint int) int
- func (ace *AdaptiveCapacityEstimator) Record(actual int)
type AdaptiveProcessor
- func NewAdaptiveProcessor(min, max int) *AdaptiveProcessor
- func (ap *AdaptiveProcessor) AdjustWorkers()
- func (ap *AdaptiveProcessor) GetWorkerCount() int
- func (ap *AdaptiveProcessor) ProcessAdaptive(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
type AdaptiveSorter
- func NewAdaptiveSorter() *AdaptiveSorter
- func (as *AdaptiveSorter) SortTextsByComparison(texts []Text, less func(i, j int) bool)
- func (as *AdaptiveSorter) SortTextsByCoordinate(texts []Text, getCoord func(Text) float64)
type AsyncReader
- func NewAsyncReader(reader *Reader) *AsyncReader
- func (ar *AsyncReader) AsyncExtractStructured(ctx context.Context) (<-chan []ClassifiedBlock, <-chan error)
- func (ar *AsyncReader) AsyncExtractText(ctx context.Context) (<-chan string, <-chan error)
- func (ar *AsyncReader) AsyncExtractTextWithContext(ctx context.Context, opts ExtractOptions) (<-chan string, <-chan error)
- func (ar *AsyncReader) AsyncStream(ctx context.Context, processor func(Page, int) error) <-chan error
- func (ar *AsyncReader) StreamValueReader(ctx context.Context, v Value) (<-chan []byte, <-chan error)
type AsyncReaderAt
- func NewAsyncReaderAt(reader io.ReaderAt) *AsyncReaderAt
- func (ara *AsyncReaderAt) ReadAtAsync(ctx context.Context, buf []byte, offset int64) (<-chan int, <-chan error)
type BatchExtractOptions
type BatchResult
type BatchStringBuilder
- func NewBatchStringBuilder(texts []Text) *BatchStringBuilder
- func (bsb *BatchStringBuilder) AppendTexts(texts []Text) string
- func (bsb *BatchStringBuilder) Reset()
- func (bsb *BatchStringBuilder) String() string
type BlockType
- func (bt BlockType) String() string
type CCITTFaxDecoder
- func NewCCITTFaxDecoder(r io.Reader, params CCITTFaxParams) *CCITTFaxDecoder
- func (d *CCITTFaxDecoder) Read(p []byte) (n int, err error)
type CCITTFaxParams
- func DefaultCCITTFaxParams() CCITTFaxParams
- func ParseCCITTFaxParams(param Value) CCITTFaxParams
type CFFCache
- func GetGlobalCFFCache() *CFFCache
- func NewCFFCache(maxSize int, ttl time.Duration) *CFFCache
- func (cc *CFFCache) GetDecoding(data []byte) ([]interface{}, bool)
- func (cc *CFFCache) GetFont(data []byte) (*CFFFont, bool)
- func (cc *CFFCache) PutDecoding(data []byte, commands []interface{})
- func (cc *CFFCache) PutFont(data []byte, font *CFFFont)
type CFFCacheEntry
- func (ce *CFFCacheEntry) IsExpired() bool
type CFFCharStringDecoder
- func NewCFFCharStringDecoder(data []byte) *CFFCharStringDecoder
- func (d *CFFCharStringDecoder) Decode() ([]interface{}, error)
- func (d *CFFCharStringDecoder) GetWidth() (float64, bool)
type CFFDict
type CFFFont
- func NewCFFFont(data []byte) (*CFFFont, error)
- func (f *CFFFont) GetCharString(gid int) []byte
- func (f *CFFFont) GetFDIndex(cid int) int
- func (f *CFFFont) GetFontName() string
- func (f *CFFFont) IsCID() bool
type CFFHeader
type CFFIndex
type CFFObjectPool
- func GetGlobalCFFPool() *CFFObjectPool
- func NewCFFObjectPool() *CFFObjectPool
- func (p *CFFObjectPool) GetCommandSlice() []interface{}
- func (p *CFFObjectPool) GetStack() []float64
- func (p *CFFObjectPool) PutCommandSlice(commands []interface{})
- func (p *CFFObjectPool) PutStack(stack []float64)
type CIDFont
- func NewCIDFont() *CIDFont
- func (f *CIDFont) DecodeToUnicode(raw string) string
- func (f *CIDFont) GetWidth(cid int) int
- func (f *CIDFont) SetCIDToGIDMap(m *CIDToGIDMap)
- func (f *CIDFont) SetCMap(cmap *CMap)
- func (f *CIDFont) SetDefaultWidth(w int)
- func (f *CIDFont) SetToUnicode(toUnicode *ToUnicodeCMap)
- func (f *CIDFont) SetWidth(cid, width int)
- func (f *CIDFont) SetWritingMode(wMode int)
- func (f *CIDFont) WritingMode() int
type CIDFontDescriptor
type CIDSystemInfo
type CIDToGIDMap
- func NewCIDToGIDMap(data []byte) *CIDToGIDMap
- func NewIdentityCIDToGIDMap() *CIDToGIDMap
- func (m *CIDToGIDMap) IsIdentity() bool
- func (m *CIDToGIDMap) LookupGID(cid int) int
type CJKFontInfo
- func GetCJKFontInfo(name string) *CJKFontInfo
type CJKFontRegistry
type CJKGlyphMetrics
type CJKTextProcessor
- func NewCJKTextProcessor(font *ExtendedCIDFont, isVertical bool) *CJKTextProcessor
- func (p *CJKTextProcessor) GetGlyphMetrics(cid int) CJKGlyphMetrics
- func (p *CJKTextProcessor) ProcessText(text string) string
type CMap
- func NewCMap(name string, cmapType CMapType) *CMap
- func ParseCMap(r io.Reader, name string) (*CMap, error)
- func (c *CMap) AddBFChar(orig, repl string)
- func (c *CMap) AddBFRange(low, high string, dst Value)
- func (c *CMap) AddCIDChar(code []byte, cid int)
- func (c *CMap) AddCIDRange(low, high []byte, startCID int)
- func (c *CMap) AddCodeSpaceRange(low, high []byte)
- func (c *CMap) Decode(raw string) string
- func (c *CMap) LookupCID(code []byte) (int, bool)
- func (c *CMap) OptimizeCIDLookup()
- func (c *CMap) SetCIDSystemInfo(registry, ordering string, supplement int)
- func (c *CMap) SetUseCMap(parent TextEncoding)
- func (c *CMap) String() string
type CMapInfo
- func GetCMapInfo(name string) *CMapInfo
type CMapParser
type CMapType
type CacheContext
- func NewCacheContext(parent context.Context, cache *ResultCache) *CacheContext
- func (cc *CacheContext) Close()
- func (cc *CacheContext) GetWithTimeout(key string, timeout time.Duration) (interface{}, bool, error)
type CacheEntry
- func (ce *CacheEntry) IsExpired() bool
type CacheKeyGenerator
- func NewCacheKeyGenerator() *CacheKeyGenerator
- func (ckg *CacheKeyGenerator) GenerateFullHash(data string) string
- func (ckg *CacheKeyGenerator) GeneratePageContentKey(pageNum int, readerHash string) string
- func (ckg *CacheKeyGenerator) GenerateReaderHash(reader *Reader) string
- func (ckg *CacheKeyGenerator) GenerateTextClassificationKey(pageNum int, readerHash string, processorParams string) string
- func (ckg *CacheKeyGenerator) GenerateTextOrderingKey(pageNum int, readerHash string, orderingParams string) string
type CacheLineAlignedCounter
- func NewCacheLineAlignedCounter(n int) *CacheLineAlignedCounter
- func (c *CacheLineAlignedCounter) Add(idx int, delta uint64)
- func (c *CacheLineAlignedCounter) Get(idx int) uint64
type CacheLinePadded
type CacheManager
- func NewCacheManager() *CacheManager
- func (cm *CacheManager) GetClassificationCache() *ResultCache
- func (cm *CacheManager) GetMetadataCache() *ResultCache
- func (cm *CacheManager) GetPageCache() *ResultCache
- func (cm *CacheManager) GetTextOrderingCache() *ResultCache
- func (cm *CacheManager) GetTotalStats() CacheStats
type CacheShard
type CacheStats
type CachedReader
- func NewCachedReader(reader *Reader, cache *ResultCache) *CachedReader
- func (cr *CachedReader) CachedClassifyTextBlocks(pageNum int) ([]ClassifiedBlock, error)
- func (cr *CachedReader) CachedPage(pageNum int) ([]Text, error)
type ClassifiedBlock
- func GetBlockSlice() []ClassifiedBlock
- func GetTextByType(blocks []ClassifiedBlock, blockType BlockType) []ClassifiedBlock
- func GetTitles(blocks []ClassifiedBlock, level int) []ClassifiedBlock
type ClassifiedBlockWithLanguage
type Column
type Columns
type ConnectionPool
- func NewConnectionPool(maxSize int, newFunc func() interface{}, closeFunc func(interface{})) *ConnectionPool
- func (cp *ConnectionPool) Close()
- func (cp *ConnectionPool) Get() interface{}
- func (cp *ConnectionPool) Put(conn interface{})
type Content
type CryptoEngine
- func NewCryptoEngine(info *PDFEncryptionInfo) *CryptoEngine
- func (e *CryptoEngine) DecryptData(data []byte, objID, genID int) ([]byte, error)
- func (e *CryptoEngine) EncryptData(data []byte, objID, genID int) ([]byte, error)
- func (e *CryptoEngine) SetKey(key []byte)
type EncryptionMethod
type EncryptionRevision
type EncryptionVersion
type EnhancedParallelProcessor
- func NewEnhancedParallelProcessor(workers int, batchSize int) *EnhancedParallelProcessor
- func (epp *EnhancedParallelProcessor) ProcessPagesEnhanced(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (epp *EnhancedParallelProcessor) ProcessWithLoadBalancing(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (epp *EnhancedParallelProcessor) ProcessWithPipeline(ctx context.Context, pages []Page, stages []func(Page, []Text) ([]Text, error)) ([][]Text, error)
type ExtendedCIDFont
- func NewExtendedCIDFont(v Value) *ExtendedCIDFont
- func (cf *ExtendedCIDFont) Descriptor() *CIDFontDescriptor
- func (cf *ExtendedCIDFont) GID(cid int) uint16
- func (cf *ExtendedCIDFont) Info() *CJKFontInfo
- func (cf *ExtendedCIDFont) IsVertical() bool
- func (cf *ExtendedCIDFont) VerticalOrigin(cid int) (float64, float64)
- func (cf *ExtendedCIDFont) VerticalWidth(cid int) float64
type ExtractMode
type ExtractOptions
type ExtractResult
type Extractor
- func NewExtractor(r *Reader) *Extractor
- func (e *Extractor) Context(ctx context.Context) *Extractor
- func (e *Extractor) Extract() (*ExtractResult, error)
- func (e *Extractor) ExtractStructured() ([]ClassifiedBlock, error)
- func (e *Extractor) ExtractStyledTexts() ([]Text, error)
- func (e *Extractor) ExtractText() (string, error)
- func (e *Extractor) Mode(mode ExtractMode) *Extractor
- func (e *Extractor) Pages(pages ...int) *Extractor
- func (e *Extractor) SmartOrdering(enabled bool) *Extractor
- func (e *Extractor) Workers(n int) *Extractor
type FastStringBuilder
- func GetSizedStringBuilder(estimatedSize int) *FastStringBuilder
- func NewFastStringBuilder(estimatedSize int) *FastStringBuilder
- func (b *FastStringBuilder) Len() int
- func (b *FastStringBuilder) Reset()
- func (b *FastStringBuilder) String() string
- func (b *FastStringBuilder) WriteByte(c byte) error
- func (b *FastStringBuilder) WriteString(s string)
type Font
- func (f Font) BaseFont() string
- func (f *Font) Encoder() TextEncoding
- func (f *Font) ExtendedCIDFont() *ExtendedCIDFont
- func (f Font) FirstChar() int
- func (f Font) LastChar() int
- func (f Font) Width(code int) float64
- func (f Font) Widths() []float64
type FontCache
- func NewFontCache() *FontCache
- func (fc *FontCache) Get(key string) (*Font, bool)
- func (fc *FontCache) Set(key string, font *Font)
type FontCacheInterface
type FontCacheStats
type FontCacheType
type FontPool
- func GetGlobalFontPool() *FontPool
- func NewFontPool() *FontPool
- func (fp *FontPool) Clear()
- func (fp *FontPool) GetFont(id uint32) string
- func (fp *FontPool) GetID(font string) uint32
- func (fp *FontPool) Len() int
type FontPrefetcher
- func NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher
- func (fp *FontPrefetcher) ClearPatterns()
- func (fp *FontPrefetcher) Close()
- func (fp *FontPrefetcher) Disable()
- func (fp *FontPrefetcher) Enable()
- func (fp *FontPrefetcher) GetStats() PrefetchStats
- func (fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)
type GlobalFontCache
- func GetGlobalFontCache() *GlobalFontCache
- func NewGlobalFontCache(maxEntries int, maxAge time.Duration) *GlobalFontCache
- func (gfc *GlobalFontCache) Cleanup() int
- func (gfc *GlobalFontCache) Clear()
- func (gfc *GlobalFontCache) Get(key string) (*Font, bool)
- func (gfc *GlobalFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)
- func (gfc *GlobalFontCache) GetStats() FontCacheStats
- func (gfc *GlobalFontCache) Remove(key string)
- func (gfc *GlobalFontCache) Set(key string, font *Font)
- func (gfc *GlobalFontCache) StartCleanupRoutine(interval time.Duration) chan struct{}
type GridKey
type InplaceStringBuilder
- func NewInplaceStringBuilder(capacity int) *InplaceStringBuilder
- func (isb *InplaceStringBuilder) Append(s string)
- func (isb *InplaceStringBuilder) Build() string
- func (isb *InplaceStringBuilder) Len() int
- func (isb *InplaceStringBuilder) Reset()
type IntegrityStatus
- func CheckIntegrity(f io.ReaderAt, size int64) *IntegrityStatus
type JBIG2Decoder
- func NewJBIG2Decoder(r io.Reader, params JBIG2Params) *JBIG2Decoder
- func (d *JBIG2Decoder) Read(p []byte) (n int, err error)
type JBIG2Params
- func ParseJBIG2Params(param Value) JBIG2Params
type KDNode
type KDTree
- func BuildKDTree(blocks []*TextBlock) *KDTree
- func (tree *KDTree) RangeSearch(targetX, targetY, radiusSq float64) []*TextBlock
- func (tree *KDTree) RangeSearchWithBuffer(targetX, targetY, radiusSq float64, buffer []*TextBlock) []*TextBlock
type LZWPredictor
- func NewLZWPredictor(r io.Reader, params LZWPredictorParams) *LZWPredictor
- func (p *LZWPredictor) Read(b []byte) (n int, err error)
type LZWPredictorParams
- func DefaultLZWPredictorParams() LZWPredictorParams
- func ParseLZWPredictorParams(param Value) LZWPredictorParams
type Language
type LanguageInfo
type LanguageTextExtractor
- func NewLanguageTextExtractor() *LanguageTextExtractor
- func (lte *LanguageTextExtractor) ExtractTextByLanguage(reader *Reader) (map[Language][]Text, error)
- func (lte *LanguageTextExtractor) GetLanguageStats(texts []Text) map[Language]int
- func (lte *LanguageTextExtractor) GetTextsByLanguage(texts []Text, targetLang Language) []Text
type LazyPage
- func NewLazyPage(r *Reader, pageNum int) *LazyPage
- func (lp *LazyPage) GetContent() *Content
- func (lp *LazyPage) IsLoaded() bool
- func (lp *LazyPage) Release()
type LazyPageManager
- func NewLazyPageManager(r *Reader, maxCached int) *LazyPageManager
- func (m *LazyPageManager) Clear()
- func (m *LazyPageManager) GetPage(pageNum int) *LazyPage
- func (m *LazyPageManager) GetStats() (totalPages, loadedPages int)
type LockFreeRingBuffer
- func NewLockFreeRingBuffer(size int) *LockFreeRingBuffer
- func (rb *LockFreeRingBuffer) Pop() (interface{}, bool)
- func (rb *LockFreeRingBuffer) Push(item interface{}) bool
type MemoryArena
- func NewMemoryArena(chunkSize int) *MemoryArena
- func (a *MemoryArena) Alloc(size int) []byte
- func (a *MemoryArena) Reset()
type MemoryEfficientExtractor
- func NewMemoryEfficientExtractor(chunkSize, bufferSize int, maxMemory int64) *MemoryEfficientExtractor
- func (mee *MemoryEfficientExtractor) ExtractTextStream(reader *Reader) (<-chan TextStream, <-chan error)
- func (mee *MemoryEfficientExtractor) ExtractTextToWriter(reader *Reader, writer io.Writer) (err error)
type Metadata
- func (m Metadata) String() string
type MultiLangProcessor
- func NewMultiLangProcessor() *MultiLangProcessor
- func (mlp *MultiLangProcessor) DetectLanguage(text string) LanguageInfo
- func (mlp *MultiLangProcessor) GetLanguageConfidenceThreshold() float64
- func (mlp *MultiLangProcessor) GetLanguageName(lang Language) string
- func (mlp *MultiLangProcessor) GetSupportedLanguages() []Language
- func (mlp *MultiLangProcessor) IsEnglish(text string) bool
- func (mlp *MultiLangProcessor) IsFrench(text string) bool
- func (mlp *MultiLangProcessor) IsGerman(text string) bool
- func (mlp *MultiLangProcessor) IsSpanish(text string) bool
- func (mlp *MultiLangProcessor) ProcessTextWithLanguageDetection(texts []Text) []TextWithLanguage
type MultiLanguageTextClassifier
- func NewMultiLanguageTextClassifier(texts []Text, pageWidth, pageHeight float64) *MultiLanguageTextClassifier
- func (mltc *MultiLanguageTextClassifier) ClassifyBlocksWithLanguage() []ClassifiedBlockWithLanguage
type MultiLevelCache
- func NewMultiLevelCache() *MultiLevelCache
- func (mlc *MultiLevelCache) Get(key string) (interface{}, bool)
- func (mlc *MultiLevelCache) Prefetch(keys []string)
- func (mlc *MultiLevelCache) Put(key string, value interface{})
- func (mlc *MultiLevelCache) Stats() map[string]uint64
type OptimizedCMapCache
- func GetGlobalCMapCache() *OptimizedCMapCache
- func NewOptimizedCMapCache(maxEntries int) *OptimizedCMapCache
- func (c *OptimizedCMapCache) Get(key string) (*CMap, bool)
- func (c *OptimizedCMapCache) GetStats() (hits, misses uint64)
- func (c *OptimizedCMapCache) Put(key string, cmap *CMap)
- func (c *OptimizedCMapCache) Release(key string)
type OptimizedFontCache
- func NewOptimizedFontCache(totalCapacity int) *OptimizedFontCache
- func (ofc *OptimizedFontCache) Clear()
- func (ofc *OptimizedFontCache) Get(key string) (*Font, bool)
- func (ofc *OptimizedFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)
- func (ofc *OptimizedFontCache) GetStats() FontCacheStats
- func (ofc *OptimizedFontCache) Prefetch(keys []string, compute func(key string) (*Font, error))
- func (ofc *OptimizedFontCache) Remove(key string)
- func (ofc *OptimizedFontCache) Set(key string, font *Font)
type OptimizedMemoryPool
- func NewOptimizedMemoryPool(size int) *OptimizedMemoryPool
- func (omp *OptimizedMemoryPool) Get() []byte
- func (omp *OptimizedMemoryPool) Put(bufPtr *[]byte)
type OptimizedSorter
- func NewOptimizedSorter() *OptimizedSorter
- func (os *OptimizedSorter) QuickSortTexts(texts []Text, less func(i, j int) bool)
- func (os *OptimizedSorter) SortTextHorizontalByOptimized(th TextHorizontal)
- func (os *OptimizedSorter) SortTextVerticalByOptimized(tv TextVertical)
- func (os *OptimizedSorter) SortTexts(texts []Text, less func(i, j int) bool)
- func (os *OptimizedSorter) SortTextsWithAlgorithm(texts []Text, less func(i, j int) bool, algorithm string)
type OptimizedTextClusterSorter
- func NewOptimizedTextClusterSorter() *OptimizedTextClusterSorter
- func (otcs *OptimizedTextClusterSorter) SortTextBlocks(blocks []*TextBlock, sortBy string)
type Outline
type PDFCompatibilityInfo
- func CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)
type PDFEncryptionInfo
type PDFError
- func (e *PDFError) Error() string
- func (e *PDFError) Unwrap() error
type PDFVersion
- func (v PDFVersion) IsSupported() bool
- func (v PDFVersion) String() string
type Page
- func (p Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)
- func (p *Page) Cleanup()
- func (p Page) Content() Content
- func (p Page) Font(name string) Font
- func (p Page) Fonts() []string
- func (p *Page) GetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p *Page) GetPlainTextWithSmartOrdering(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p Page) GetTextByColumn() (Columns, error)
- func (p Page) GetTextByRow() (Rows, error)
- func (p Page) OptimizedGetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p Page) OptimizedGetTextByColumn() (Columns, error)
- func (p Page) OptimizedGetTextByRow() (Rows, error)
- func (p Page) Resources() Value
- func (p *Page) SetFontCache(cache *GlobalFontCache)
- func (p *Page) SetFontCacheInterface(cache FontCacheInterface)
type PageStream
type ParallelExtractor
- func NewParallelExtractor(workers int) *ParallelExtractor
- func (pe *ParallelExtractor) Close()
- func (pe *ParallelExtractor) ExtractAllPages(ctx context.Context, pages []Page) ([][]Text, error)
- func (pe *ParallelExtractor) GetCacheStats() ShardedCacheStats
- func (pe *ParallelExtractor) GetPrefetchStats() PrefetchStats
type ParallelProcessor
- func NewParallelProcessor(workers int) *ParallelProcessor
- func (pp *ParallelProcessor) ProcessPages(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (pp *ParallelProcessor) ProcessTextBlocks(ctx context.Context, blocks []*TextBlock, ...) ([]*TextBlock, error)
- func (pp *ParallelProcessor) ProcessTextInParallel(ctx context.Context, texts []Text, processorFunc func(Text) (Text, error)) ([]Text, error)
type ParallelTextExtractor
- func NewParallelTextExtractor(workers int) *ParallelTextExtractor
- func (pte *ParallelTextExtractor) ExtractWithParallelProcessing(ctx context.Context, reader *Reader) ([]Text, error)
- func (pte *ParallelTextExtractor) ParallelSort(ctx context.Context, texts []Text, less func(i, j int) bool) error
type ParseLimits
- func DefaultParseLimits() ParseLimits
type PasswordAuth
- func NewPasswordAuth(info *PDFEncryptionInfo) *PasswordAuth
- func (pa *PasswordAuth) Authenticate(password string) ([]byte, error)
- func (pa *PasswordAuth) AuthenticateOwner(password string) ([]byte, error)
- func (pa *PasswordAuth) AuthenticateUser(password string) ([]byte, error)
- func (pa *PasswordAuth) ValidatePermissions(key []byte) error
type PerformanceMetrics
- func (pm *PerformanceMetrics) GetMetrics() map[string]interface{}
- func (pm *PerformanceMetrics) RecordAllocation(bytes uint64)
- func (pm *PerformanceMetrics) RecordExtractDuration(d time.Duration)
type Point
type PoolStats
type PoolWarmer
- func (pw *PoolWarmer) GetWarmupStats() WarmupStats
- func (pw *PoolWarmer) IsWarmed() bool
- func (pw *PoolWarmer) Reset()
- func (pw *PoolWarmer) Warmup(config *WarmupConfig) error
type PredefinedCMap
- func GetPredefinedCMap(name string) *PredefinedCMap
type PrefetchItem
type PrefetchQueue
- func (pq *PrefetchQueue) Len() int
- func (pq *PrefetchQueue) Less(i, j int) bool
- func (pq *PrefetchQueue) Pop() interface{}
- func (pq *PrefetchQueue) Push(x interface{})
- func (pq *PrefetchQueue) Swap(i, j int)
type PrefetchStats
type RTreeNode
type RTreeSpatialIndex
- func NewRTreeSpatialIndex(texts []Text) *RTreeSpatialIndex
- func (rt *RTreeSpatialIndex) Insert(text Text)
- func (rt *RTreeSpatialIndex) Query(bounds Rect) []Text
type Reader
- func NewReader(f io.ReaderAt, size int64) (*Reader, error)
- func NewReaderEncrypted(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func NewReaderEncryptedWithMmap(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func NewReaderLinearized(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func Open(file string) (*os.File, *Reader, error)
- func RecoverPDF(f io.ReaderAt, size int64, opts *RecoveryOptions) (*Reader, error)
- func (r *Reader) BatchExtractText(pageNums []int, useLazy bool) (map[int]string, error)
- func (r *Reader) ClearCache()
- func (r *Reader) Close() error
- func (r *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)
- func (r *Reader) ExtractPagesBatch(opts BatchExtractOptions) <-chan BatchResult
- func (r *Reader) ExtractPagesBatchToString(opts BatchExtractOptions) (string, error)
- func (r *Reader) ExtractStructuredBatch(opts BatchExtractOptions) <-chan StructuredBatchResult
- func (r *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)
- func (r *Reader) GetCacheCapacity() int
- func (r *Reader) GetCompatibilityInfo() *PDFCompatibilityInfo
- func (r *Reader) GetMetadata() (Metadata, error)
- func (r *Reader) GetPlainText() (reader io.Reader, err error)
- func (r *Reader) GetPlainTextConcurrent(workers int) (io.Reader, error)
- func (r *Reader) GetStyledTexts() (sentences []Text, err error)
- func (r *Reader) NumPage() int
- func (r *Reader) Outline() Outline
- func (r *Reader) Page(num int) Page
- func (r *Reader) SetCacheCapacity(n int)
- func (r *Reader) SetMetadata(meta Metadata) error
- func (r *Reader) Trailer() Value
type RecoveryOptions
- func DefaultRecoveryOptions() *RecoveryOptions
type Rect
type ResourceManager
- func NewResourceManager() *ResourceManager
- func (rm *ResourceManager) Add(resource io.Closer)
- func (rm *ResourceManager) Close() error
type ResultCache
- func GetGlobalCache() *ResultCache
- func NewResultCache(maxSize int64, ttl time.Duration, policy string) *ResultCache
- func (rc *ResultCache) Clear()
- func (rc *ResultCache) Close()
- func (rc *ResultCache) Get(key string) (interface{}, bool)
- func (rc *ResultCache) GetHitRatio() float64
- func (rc *ResultCache) GetStats() CacheStats
- func (rc *ResultCache) Has(key string) bool
- func (rc *ResultCache) Put(key string, value interface{})
- func (rc *ResultCache) Remove(key string) bool
type Row
type Rows
type ShardedCache
- func NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache
- func (sc *ShardedCache) Clear()
- func (sc *ShardedCache) Close()
- func (sc *ShardedCache) Delete(key string)
- func (sc *ShardedCache) Get(key string) (interface{}, bool)
- func (sc *ShardedCache) GetStats() ShardedCacheStats
- func (sc *ShardedCache) Set(key string, value interface{}, size int64)
type ShardedCacheEntry
type ShardedCacheStats
type SizedBytePool
- func NewSizedBytePool() *SizedBytePool
- func (sp *SizedBytePool) Get(size int) []byte
- func (sp *SizedBytePool) Put(buf []byte)
type SizedPool
- func NewSizedPool() *SizedPool
- func (sp *SizedPool) Get(size int) []byte
- func (sp *SizedPool) Put(bufPtr *[]byte)
type SizedTextSlicePool
- func NewSizedTextSlicePool() *SizedTextSlicePool
- func (sp *SizedTextSlicePool) Get(size int) []Text
- func (sp *SizedTextSlicePool) Put(slice []Text)
type SortStrategy
type SortingMetrics
- func GetSortingMetrics() SortingMetrics
type SpatialGrid
- func NewSpatialGrid(blocks []*TextBlock, cellSize float64) *SpatialGrid
- func (g *SpatialGrid) GetNearbyBlocks(blockIdx int) []int
type SpatialIndex
- func NewSpatialIndex(texts []Text) *SpatialIndex
- func (si *SpatialIndex) Query(bounds Rect) []Text
type SpatialIndexInterface
- func NewSpatialIndexInterface(texts []Text) SpatialIndexInterface
type Stack
- func (stk *Stack) DrainTo(dst []Value) []Value
- func (stk *Stack) Len() int
- func (stk *Stack) Pop() Value
- func (stk *Stack) Push(v Value)
type StartupConfig
- func DefaultStartupConfig() *StartupConfig
type StreamProcessor
- func NewStreamProcessor(chunkSize, bufferSize int, maxMemory int64) *StreamProcessor
- func (sp *StreamProcessor) Close()
- func (sp *StreamProcessor) ProcessPageStream(reader *Reader, handler func(PageStream) error) error
- func (sp *StreamProcessor) ProcessTextBlockStream(reader *Reader, handler func(TextBlockStream) error) error
- func (sp *StreamProcessor) ProcessTextStream(reader *Reader, handler func(TextStream) error) error
type StreamingBatchExtractor
- func NewStreamingBatchExtractor(r *Reader, opts BatchExtractOptions) *StreamingBatchExtractor
- func (sbe *StreamingBatchExtractor) Next() *BatchResult
- func (sbe *StreamingBatchExtractor) ProcessAll(callback func(BatchResult) error) error
- func (sbe *StreamingBatchExtractor) Start()
type StreamingMetadataExtractor
- func NewStreamingMetadataExtractor(chunkSize, bufferSize int, maxMemory int64) *StreamingMetadataExtractor
- func (sme *StreamingMetadataExtractor) ExtractMetadataStream(reader *Reader) (<-chan Metadata, <-chan error)
type StreamingTextClassifier
- func NewStreamingTextClassifier(chunkSize, bufferSize int, maxMemory int64) *StreamingTextClassifier
- func (stc *StreamingTextClassifier) ClassifyTextStream(reader *Reader) (<-chan ClassifiedBlock, <-chan error)
type StreamingTextExtractor
- func NewStreamingTextExtractor(r *Reader, maxCachedPages int) *StreamingTextExtractor
- func (e *StreamingTextExtractor) Close()
- func (e *StreamingTextExtractor) GetProgress() float64
- func (e *StreamingTextExtractor) NextBatch() (results map[int]string, hasMore bool, err error)
- func (e *StreamingTextExtractor) NextPage() (pageNum int, text string, hasMore bool, err error)
- func (e *StreamingTextExtractor) Reset()
type StringBuffer
- func NewStringBuffer(capacity int) *StringBuffer
- func (sb *StringBuffer) Bytes() []byte
- func (sb *StringBuffer) Cap() int
- func (sb *StringBuffer) Len() int
- func (sb *StringBuffer) Reset()
- func (sb *StringBuffer) String() string
- func (sb *StringBuffer) StringCopy() string
- func (sb *StringBuffer) WriteByte(b byte) error
- func (sb *StringBuffer) WriteBytes(b []byte)
- func (sb *StringBuffer) WriteString(s string)
type StringBuilderPool
type StringPool
- func NewStringPool() *StringPool
- func (sp *StringPool) Clear()
- func (sp *StringPool) Intern(s string) string
- func (sp *StringPool) Size() int
type StructuredBatchResult
type Task
type Text
- func ConvertOptimizedSliceToText(texts []TextOptimized, pool *FontPool) []Text
- func ConvertOptimizedToText(t TextOptimized, pool *FontPool) Text
- func GetSizedTextSlice(size int) []Text
- func GetText() *Text
- func GetTextBySize(contentLength int) *Text
- func GetTextSlice(minCap int) []Text
type TextBlock
- func ClusterTextBlocksOptimized(texts []Text) []*TextBlock
- func ClusterTextBlocksOptimizedV2(texts []Text) []*TextBlock
- func ClusterTextBlocksParallel(texts []Text) []*TextBlock
- func ClusterTextBlocksParallelV2(texts []Text) []*TextBlock
- func ClusterTextBlocksUltraOptimized(texts []Text) []*TextBlock
- func ClusterTextBlocksUltraV2(texts []Text) []*TextBlock
- func ClusterTextBlocksV3(texts []Text) []*TextBlock
- func ClusterTextBlocksV3Fast(texts []Text, maxClusters int) []*TextBlock
- func ClusterTextBlocksV4(texts []Text) []*TextBlock
- func GetTextBlock() *TextBlock
- func (tb *TextBlock) Bounds() Rect
- func (tb *TextBlock) Center() Point
- func (tb *TextBlock) Height() float64
- func (tb *TextBlock) Width() float64
type TextBlockStream
type TextClassifier
- func NewTextClassifier(texts []Text, pageWidth, pageHeight float64) *TextClassifier
- func (tc *TextClassifier) ClassifyBlocks() []ClassifiedBlock
type TextEncoding
- func EnhancedCMapEncoding(name string) TextEncoding
- func LookupPredefinedCMap(name string) TextEncoding
type TextHorizontal
- func (x TextHorizontal) Len() int
- func (x TextHorizontal) Less(i, j int) bool
- func (x TextHorizontal) Swap(i, j int)
type TextOptimized
- func ConvertTextSliceToOptimized(texts []Text, pool *FontPool) []TextOptimized
- func ConvertTextToOptimized(t Text, pool *FontPool) TextOptimized
- func (t *TextOptimized) IsBold() bool
- func (t *TextOptimized) IsItalic() bool
- func (t *TextOptimized) IsUnderline() bool
- func (t *TextOptimized) IsVertical() bool
- func (t *TextOptimized) SetBold(v bool)
- func (t *TextOptimized) SetItalic(v bool)
- func (t *TextOptimized) SetUnderline(v bool)
- func (t *TextOptimized) SetVertical(v bool)
type TextStream
type TextVertical
- func (x TextVertical) Len() int
- func (x TextVertical) Less(i, j int) bool
- func (x TextVertical) Swap(i, j int)
type TextWithLanguage
type ToUnicodeCMap
- func NewToUnicodeCMap() *ToUnicodeCMap
- func ParseToUnicodeCMap(r io.Reader) (*ToUnicodeCMap, error)
- func (c *ToUnicodeCMap) DecodeCID(cid int) string
type Type1Cache
- func GetGlobalType1Cache() *Type1Cache
- func NewType1Cache(maxSize int, ttl time.Duration) *Type1Cache
- func (tc *Type1Cache) GetFont(data []byte) (*Type1Font, bool)
- func (tc *Type1Cache) PutFont(data []byte, font *Type1Font)
type Type1CacheEntry
- func (ce *Type1CacheEntry) IsExpired() bool
type Type1Font
- func NewType1Font(data []byte) (*Type1Font, error)
- func ParseType1FromStream(v Value) (*Type1Font, error)
- func (f *Type1Font) GlyphName(code byte) string
- func (f *Type1Font) GlyphWidth(name string) float64
- func (f *Type1Font) Info() *Type1FontInfo
type Type1FontInfo
- func GetType1FontInfo(v Value) *Type1FontInfo
type Value
- func (v Value) Bool() bool
- func (v Value) Float64() float64
- func (v Value) Index(i int) Value
- func (v Value) Int64() int64
- func (v Value) IsNull() bool
- func (v Value) Key(key string) Value
- func (v Value) Keys() []string
- func (v Value) Kind() ValueKind
- func (v Value) Len() int
- func (v Value) Name() string
- func (v Value) RawString() string
- func (v Value) Reader() io.ReadCloser
- func (v Value) String() string
- func (v Value) Text() string
- func (v Value) TextFromUTF16() string
type ValueKind
type VerticalTextTransform
- func (vt *VerticalTextTransform) TransformGlyph(x, y, w, h float64) (nx, ny, nw, nh float64)
type WSDeque
- func NewWSDeque(size int) *WSDeque
- func (d *WSDeque) PopBottom() WSTask
- func (d *WSDeque) PushBottom(task WSTask)
- func (d *WSDeque) Steal() WSTask
type WSTask
type WSWorker
type WarmupConfig
- func AggressiveWarmupConfig() *WarmupConfig
- func DefaultWarmupConfig() *WarmupConfig
- func LightWarmupConfig() *WarmupConfig
type WarmupStats
type WorkStealingExecutor
- func NewWorkStealingExecutor(numWorkers int) *WorkStealingExecutor
- func (p *WorkStealingExecutor) Start()
- func (p *WorkStealingExecutor) Stop()
- func (p *WorkStealingExecutor) Submit(task WSTask)
type WorkStealingScheduler
- func NewWorkStealingScheduler(numWorkers int) *WorkStealingScheduler
- func (wss *WorkStealingScheduler) Start()
- func (wss *WorkStealingScheduler) Stop()
- func (wss *WorkStealingScheduler) Submit(task Task)
- func (wss *WorkStealingScheduler) Wait()
type Worker
type WorkerPool
- func (wp *WorkerPool) GetStats() WorkerPoolStats
type WorkerPoolStats
type YBand
type ZeroCopyBuilder
- func NewZeroCopyBuilder(cap int) *ZeroCopyBuilder
- func (b *ZeroCopyBuilder) Reset()
- func (b *ZeroCopyBuilder) UnsafeString() string
- func (b *ZeroCopyBuilder) WriteByte(c byte) error
- func (b *ZeroCopyBuilder) WriteString(s string)
Bugs

Examples ¶

Package (ZeroCopyInPDFProcessing)
BatchExtractOptions (OptimizedCache)
BatchExtractOptions (StandardCache)
FastStringConcatZC
GetGlobalFontCache
GlobalFontCache
JoinZeroCopy
ParallelExtractor (Basic)
Reader.ExtractAllPagesParallel
Reader.ExtractPagesBatch
Reader.ExtractPagesBatchToString
SplitZeroCopy
StreamingBatchExtractor
StringBuffer
StringPool
TrimSpaceZeroCopy

Constants ¶

const (
	FlagVertical  uint8 = 1 << 0 // 0x01
	FlagBold      uint8 = 1 << 1 // 0x02
	FlagItalic    uint8 = 1 << 2 // 0x04
	FlagUnderline uint8 = 1 << 3 // 0x08
)

Flag constants for TextOptimized.Flags
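The flag values above are single bits, so several of them can be packed into one uint8. A minimal sketch of setting, clearing, and testing such bits follows; the constants are local copies so the snippet compiles on its own, and setFlag and hasFlag are hypothetical helpers rather than part of the package API.

```go
package main

import "fmt"

// Local copies of the documented flag values so the sketch is self-contained.
const (
	FlagVertical  uint8 = 1 << 0
	FlagBold      uint8 = 1 << 1
	FlagItalic    uint8 = 1 << 2
	FlagUnderline uint8 = 1 << 3
)

// setFlag returns flags with bit f set when on is true and cleared otherwise.
// Hypothetical helper, not part of the package.
func setFlag(flags, f uint8, on bool) uint8 {
	if on {
		return flags | f
	}
	return flags &^ f
}

// hasFlag reports whether bit f is set in flags.
func hasFlag(flags, f uint8) bool {
	return flags&f != 0
}

func main() {
	var flags uint8
	flags = setFlag(flags, FlagBold, true)
	flags = setFlag(flags, FlagItalic, true)
	flags = setFlag(flags, FlagItalic, false)
	fmt.Printf("%04b bold=%v italic=%v\n",
		flags, hasFlag(flags, FlagBold), hasFlag(flags, FlagItalic))
}
```

The IsBold/SetBold style accessors listed for TextOptimized presumably wrap the same kind of bit operations.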

Variables ¶

var (
	// ErrInvalidFont indicates a font definition is malformed or unsupported
	ErrInvalidFont = errors.New("invalid or unsupported font")

	// ErrUnsupportedEncoding indicates the character encoding is not supported
	ErrUnsupportedEncoding = errors.New("unsupported character encoding")

	// ErrMalformedStream indicates a content stream is malformed
	ErrMalformedStream = errors.New("malformed content stream")

	// ErrInvalidPage indicates an invalid page number or corrupted page
	ErrInvalidPage = errors.New("invalid page")

	// ErrEncrypted indicates the PDF is encrypted and cannot be read without a password
	ErrEncrypted = errors.New("PDF is encrypted")

	// ErrCorrupted indicates the PDF file structure is corrupted
	ErrCorrupted = errors.New("PDF file is corrupted")

	// ErrUnsupportedVersion indicates the PDF version is not supported
	ErrUnsupportedVersion = errors.New("unsupported PDF version")

	// ErrNoContent indicates the page has no content
	ErrNoContent = errors.New("page has no content")
)

Common errors

var DebugOn = false

DebugOn controls debug logging to stdout. Set it to true if problems arise while reading a PDF.

var ErrContextCancelled = errors.New("pdf: context cancelled")

ErrContextCancelled is returned when a context is cancelled during PDF processing

var ErrInvalidPassword = fmt.Errorf("encrypted PDF: invalid password")

var ErrMaxParseTimeExceeded = errors.New("pdf: max parse time exceeded")

ErrMaxParseTimeExceeded is returned when max parse time is exceeded

var ErrMemoryLimitExceeded = errors.New("pdf: stream processor memory limit exceeded")

var ErrTimeout = errors.New("pdf: operation timeout")

ErrTimeout is returned when processing times out

var GlobalMetrics = &PerformanceMetrics{}

GlobalMetrics is the global performance metrics instance.

var GlobalPoolWarmer = &PoolWarmer{
	bytePool: globalSizedBytePool,
	textPool: globalSizedTextSlicePool,
}

GlobalPoolWarmer is the global pool warmer instance.

var SupportedVersions = []PDFVersion{
	{1, 0}, {1, 1}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {1, 7},
	{2, 0},
}

SupportedVersions defines the supported PDF versions
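A version check against a list like this can be sketched as follows. The struct field names (Major, Minor), the parseHeader helper, and the header-parsing approach are assumptions made for illustration; the package's own PDFVersion type and IsSupported method may differ.

```go
package main

import "fmt"

// pdfVersion mirrors the {major, minor} shape of the documented values.
// The field names are assumptions made for this sketch.
type pdfVersion struct {
	Major, Minor int
}

var supportedVersions = []pdfVersion{
	{1, 0}, {1, 1}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {1, 7},
	{2, 0},
}

// isSupported reports whether v appears in supportedVersions.
func isSupported(v pdfVersion) bool {
	for _, s := range supportedVersions {
		if s == v {
			return true
		}
	}
	return false
}

// parseHeader extracts the version from a header line such as "%PDF-1.7".
// Hypothetical helper; "%%" in the format matches a literal percent sign.
func parseHeader(line string) (pdfVersion, bool) {
	var v pdfVersion
	n, err := fmt.Sscanf(line, "%%PDF-%d.%d", &v.Major, &v.Minor)
	return v, err == nil && n == 2
}

func main() {
	for _, h := range []string{"%PDF-1.7", "%PDF-2.0", "%PDF-3.1"} {
		v, ok := parseHeader(h)
		fmt.Println(h, ok && isSupported(v))
	}
}
```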

var Type1GlyphNames = map[string]rune{}/* 183 elements not displayed */

Type1GlyphNames provides glyph name to Unicode mapping

Functions ¶

func AutoWarmup ¶ added in v1.0.2

func AutoWarmup() error

AutoWarmup performs an automatic warmup, selecting a config based on available memory.

func BatchCompareFloat64 ¶

func BatchCompareFloat64(a, b []float64, threshold float64) []bool

BatchCompareFloat64 is a SIMD-friendly batch operation (pseudocode; an actual SIMD version would need assembly).

func BatchHexDecode ¶ added in v1.1.6

func BatchHexDecode(hexStrings []string) ([][]byte, []error)

BatchHexDecode processes multiple hex strings in parallel using SIMD operations

func BenchmarkSortingAlgorithms ¶ added in v1.0.1

func BenchmarkSortingAlgorithms(texts []Text, getCoord func(Text) float64) map[string]float64

BenchmarkSortingAlgorithms compares performance of different algorithms

func BytesToString ¶ added in v1.0.2

func BytesToString(b []byte) string

BytesToString performs a zero-copy conversion from []byte to string. Warning: the returned string directly references the underlying byte array, so do not modify the original []byte.
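To make the warning concrete, here is one common way a zero-copy conversion can be written. This is a sketch of the general technique, not necessarily how the package implements BytesToString; bytesToString is a local, hypothetical name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying its contents.
// It relies on the string header being a prefix of the slice header.
// Hypothetical stand-in for the documented function.
func bytesToString(b []byte) string {
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("Tj operator")
	s := bytesToString(b) // shares memory with b
	c := string(b)        // independent copy, always safe
	fmt.Println(s == c)
	// From here on, writing to b would also change s, which breaks the
	// immutability that Go code expects of strings.
}
```

The trade-off is one saved allocation per conversion in exchange for a lifetime rule the compiler cannot enforce, so the plain string(b) conversion remains the safer choice outside hot paths.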

func ClearGlobalStringPool ¶ added in v1.0.2

func ClearGlobalStringPool()

ClearGlobalStringPool clears the global string pool

func CompareStringsZeroCopy ¶ added in v1.0.2

func CompareStringsZeroCopy(s1, s2 string) int

CompareStringsZeroCopy performs a zero-copy string comparison. It returns -1 if s1 < s2, 0 if s1 == s2, and 1 if s1 > s2.

func DetectCJKOrdering ¶ added in v1.2.8

func DetectCJKOrdering(fontName string) string

DetectCJKOrdering detects the CJK ordering from font name

func EstimateCapacity ¶

func EstimateCapacity(currentLen int, growthFactor float64) int

EstimateCapacity provides better capacity estimation for slices

func ExampleOptimizations ¶

func ExampleOptimizations()

Usage example

func FastHexValidation ¶ added in v1.1.6

func FastHexValidation(hexStr string) bool

FastHexValidation performs SIMD-style validation of hex strings

func FastSortTexts ¶ added in v1.0.1

func FastSortTexts(texts []Text, less func(i, j int) bool)

FastSortTexts sorts texts using the fastest algorithm for the comparison function

func FastSortTextsByX ¶ added in v1.0.1

func FastSortTextsByX(texts []Text)

FastSortTextsByX sorts texts by X coordinate using the fastest algorithm

func FastSortTextsByY ¶ added in v1.0.1

func FastSortTextsByY(texts []Text)

FastSortTextsByY sorts texts by Y coordinate using the fastest algorithm

func FastStringConcat ¶

func FastStringConcat(strings ...string) string

FastStringConcat concatenates strings with optimized memory allocation
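One usual way to get "optimized memory allocation" when concatenating is to sum the input lengths first and grow the buffer once. The sketch below shows that idea with the standard strings.Builder; it is an assumption about the general approach, not the package's actual implementation, and concat is a local name.

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins parts after growing the builder once to the exact total
// length, so the loop below never reallocates.
func concat(parts ...string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	var sb strings.Builder
	sb.Grow(total)
	for _, p := range parts {
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	fmt.Println(concat("Hello", " ", "World", "!")) // Hello World!
}
```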

func FastStringConcatZC ¶ added in v1.0.2

func FastStringConcatZC(parts ...string) string

FastStringConcatZC concatenates multiple strings quickly (zero-copy version).

Example ¶

ExampleFastStringConcatZC demonstrates fast string concatenation.

result := FastStringConcatZC("Hello", " ", "World", "!")
fmt.Println(result)

Output:

Hello World!

func FastStringSearch ¶

func FastStringSearch(haystack, needle string) int

FastStringSearch performs optimized string search using SIMD-like operations. This is a simplified implementation that can be extended with actual SIMD instructions.

func GetBuilder ¶

func GetBuilder() *strings.Builder

GetBuilder retrieves a strings.Builder from the pool
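A builder pool of this kind is typically built on sync.Pool. The following sketch shows the get/use/return cycle; getBuilder, putBuilder, and joinWords are hypothetical names, and how the package returns builders to its pool is not shown in this listing.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// builderPool hands out reusable *strings.Builder values.
var builderPool = sync.Pool{
	New: func() interface{} { return new(strings.Builder) },
}

// getBuilder fetches a builder from the pool (hypothetical helper).
func getBuilder() *strings.Builder {
	return builderPool.Get().(*strings.Builder)
}

// putBuilder resets the builder and returns it to the pool.
func putBuilder(sb *strings.Builder) {
	sb.Reset()
	builderPool.Put(sb)
}

// joinWords builds a space-separated string using a pooled builder.
func joinWords(words []string) string {
	sb := getBuilder()
	defer putBuilder(sb)
	for i, w := range words {
		if i > 0 {
			sb.WriteByte(' ')
		}
		sb.WriteString(w)
	}
	return sb.String()
}

func main() {
	fmt.Println(joinWords([]string{"BT", "/F1", "12", "Tf"}))
}
```

Note that strings.Builder.Reset drops the underlying buffer, so a pool like this saves the builder value rather than its capacity; pooling bytes.Buffer is the usual alternative when retained capacity matters.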

func GetByteBuffer ¶

func GetByteBuffer() *[]byte

GetByteBuffer retrieves a byte buffer from the pool

func GetCMapWritingMode ¶ added in v1.2.8

func GetCMapWritingMode(name string) int

GetCMapWritingMode returns the writing mode for a CMap name: 0 for horizontal, 1 for vertical, and -1 if unknown.
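Predefined CMap names in PDF conventionally end in "-H" for horizontal and "-V" for vertical writing (for example Identity-H and Identity-V). A suffix-based sketch that yields the documented 0 / 1 / -1 values could look like this; it is an assumption that the package decides this way rather than through a lookup table, and writingMode is a local name.

```go
package main

import (
	"fmt"
	"strings"
)

// writingMode guesses the writing mode from a CMap name's suffix.
// Assumption: names follow the predefined-CMap convention ("...-H", "...-V",
// or the bare names "H" and "V"); anything else is reported as unknown.
func writingMode(name string) int {
	switch {
	case name == "H" || strings.HasSuffix(name, "-H"):
		return 0
	case name == "V" || strings.HasSuffix(name, "-V"):
		return 1
	default:
		return -1
	}
}

func main() {
	for _, n := range []string{"Identity-H", "UniJIS-UCS2-V", "Custom"} {
		fmt.Println(n, writingMode(n))
	}
}
```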

func GetContentExtractorSlices ¶ added in v1.2.3

func GetContentExtractorSlices() ([]Text, []Rect)

GetContentExtractorSlices gets pre-allocated slices from the pool

func GetIntSlice ¶ added in v1.2.3

func GetIntSlice(size int) []int

GetIntSlice gets an int slice from the pool

func GetPDFBuffer ¶

func GetPDFBuffer() *buffer

GetPDFBuffer retrieves a PDF buffer from the pool

func GetSizedBuffer ¶ added in v1.0.1

func GetSizedBuffer(size int) []byte

GetSizedBuffer retrieves a byte buffer from the global sized pool. This is a convenience function for common use cases.

func GetVerticalVariant ¶ added in v1.2.8

func GetVerticalVariant(r rune) rune

GetVerticalVariant returns the vertical variant of a character if available

func GlyphNameToRune ¶ added in v1.2.8

func GlyphNameToRune(name string) rune

GlyphNameToRune converts a glyph name to Unicode rune

func HasPrefixZeroCopy ¶ added in v1.0.2

func HasPrefixZeroCopy(s, prefix string) bool

HasPrefixZeroCopy checks whether s begins with prefix, without copying

func HasSuffixZeroCopy ¶ added in v1.0.2

func HasSuffixZeroCopy(s, suffix string) bool

HasSuffixZeroCopy checks whether s ends with suffix, without copying

func HexDecodeSIMD ¶ added in v1.1.6

func HexDecodeSIMD(hexStr string) ([]byte, error)

HexDecodeSIMD performs SIMD-optimized hex string decoding. This function uses vectorized operations to decode hex strings efficiently.

func HilbertXYToIndex ¶

func HilbertXYToIndex(x, y, order uint32) uint64

HilbertXYToIndex calculates the Hilbert curve index of the point (x, y) for a curve of the given order (for spatial indexing)
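For readers unfamiliar with Hilbert indexing, the following is a minimal standalone sketch of the classic iterative xy-to-distance conversion. The helper hilbertIndex is a hypothetical name with the same parameter types as HilbertXYToIndex; whether the package uses this orientation or this exact algorithm is an assumption.

```go
package main

import "fmt"

// hilbertIndex is a hypothetical helper that maps the cell (x, y) of a
// 2^order x 2^order grid to its distance along the Hilbert curve.
// It assumes 1 <= order <= 31 and x, y < 2^order.
func hilbertIndex(x, y, order uint32) uint64 {
	n := uint32(1) << order
	var d uint64
	for s := n / 2; s > 0; s /= 2 {
		var rx, ry uint64
		if x&s != 0 {
			rx = 1
		}
		if y&s != 0 {
			ry = 1
		}
		// Each quadrant at this level contributes s*s cells to the distance.
		d += uint64(s) * uint64(s) * ((3 * rx) ^ ry)
		// Rotate or flip so the next level sees a canonical orientation.
		if ry == 0 {
			if rx == 1 {
				x = n - 1 - x
				y = n - 1 - y
			}
			x, y = y, x
		}
	}
	return d
}

func main() {
	for y := uint32(0); y < 2; y++ {
		for x := uint32(0); x < 2; x++ {
			fmt.Printf("(%d,%d) -> %d\n", x, y, hilbertIndex(x, y, 1))
		}
	}
}
```

Cells that are close along the curve are also close in the plane, which is why sorting rectangles by a Hilbert index is a common way to pack spatial index nodes.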

func InitPredefinedCMaps ¶ added in v1.2.8

func InitPredefinedCMaps()

InitPredefinedCMaps initializes common predefined CMaps. These provide basic Unicode mappings for CJK character sets.

func InternRune ¶ added in v1.2.3

func InternRune(r rune) string

InternRune converts a rune to an interned string

func InternString ¶ added in v1.0.2

func InternString(s string) string

InternString adds a string to the global pool

func Interpret ¶

func Interpret(strm Value, do func(stk *Stack, op string))

Interpret interprets the content in a stream as a basic PostScript program, pushing values onto a stack and then calling the do function to execute operators. The do function may push or pop values from the stack as needed to implement op.

Interpret handles the operators "dict", "currentdict", "begin", "end", "def", and "pop" itself.

Interpret is not a full-blown PostScript interpreter. Its job is to handle the very limited PostScript found in certain supporting file formats embedded in PDF files, such as cmap files that describe the mapping from font code points to Unicode code points.

A stream can also be represented by an array of streams, which has to be handled as a single stream. A simple stream is read only once; otherwise, the length of the stream is obtained in order to handle it properly.

There is no support for executable blocks, among other limitations.

func InterpretWithContext ¶ added in v1.1.5

func InterpretWithContext(ctx context.Context, strm Value, do func(stk *Stack, op string))

InterpretWithContext is like Interpret but accepts a context for cancellation support. When the context is cancelled, interpretation stops and returns.

func InterpretWithContextAndLimits ¶ added in v1.1.5

func InterpretWithContextAndLimits(ctx context.Context, strm Value, do func(stk *Stack, op string), limits *ParseLimits)

InterpretWithContextAndLimits is like InterpretWithContext but also accepts parse limits.

func IsCJKCMap ¶ added in v1.2.8

func IsCJKCMap(name string) bool

IsCJKCMap checks if a CMap name is for CJK (Chinese, Japanese, Korean) encoding

func IsCJKFont ¶ added in v1.2.8

func IsCJKFont(fontName string) bool

IsCJKFont returns true if the font name suggests a CJK font

func IsSameSentence ¶

func IsSameSentence(last, current Text) bool

IsSameSentence checks if the current text segment likely belongs to the same sentence as the last text segment based on font, size, vertical position, and lack of sentence-ending punctuation in the last segment.

func IsType1Font ¶ added in v1.2.8

func IsType1Font(v Value) bool

IsType1Font checks if a font value is a Type1 font

func JoinZeroCopy ¶ added in v1.0.2

func JoinZeroCopy(parts []string, sep string) string

JoinZeroCopy performs zero-copy string joining (single allocation)

Example ¶

ExampleJoinZeroCopy demonstrates zero-copy joining

parts := []string{"apple", "banana", "cherry"}
result := JoinZeroCopy(parts, ", ")
fmt.Println(result)

Output:

apple, banana, cherry

func ListRegisteredCMaps ¶ added in v1.2.8

func ListRegisteredCMaps() []string

ListRegisteredCMaps returns a list of all registered predefined CMap names

func OptimizedStartup ¶ added in v1.0.2

func OptimizedStartup(config *StartupConfig) error

OptimizedStartup runs an optimized startup process that includes pool warmup, cache pre-allocation, etc.

func PreallocateCache ¶ added in v1.0.2

func PreallocateCache(fontCacheSize, resultCacheSize int)

PreallocateCache pre-allocates the font and result caches (additional feature)

func ProcessLargePDF ¶

func ProcessLargePDF(reader *Reader, chunkSize, bufferSize int, maxMemory int64,
	handler func(PageStream) error) error

ProcessLargePDF handles very large PDFs with streaming

func ProcessTextWithMultiLanguage ¶

func ProcessTextWithMultiLanguage(reader *Reader) (map[Language][]ClassifiedBlock, error)

ProcessTextWithMultiLanguage handles multi-language text processing for the entire PDF

func PutBlockSlice ¶

func PutBlockSlice(s []ClassifiedBlock)

PutBlockSlice returns a ClassifiedBlock slice to the pool

func PutBuilder ¶

func PutBuilder(b *strings.Builder)

PutBuilder returns a strings.Builder to the pool after resetting it
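The Get/Put pairs in this package follow the usual pooling pattern: take an object, use it, reset it, and hand it back. A minimal standalone sketch with sync.Pool is shown here; getBuilder, putBuilder, and joinWords are hypothetical helpers, and the package's own pool internals are not shown on this page.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// builderPool hands out reusable *strings.Builder values.
var builderPool = sync.Pool{
	New: func() any { return new(strings.Builder) },
}

// getBuilder takes a builder from the pool (hypothetical helper).
func getBuilder() *strings.Builder {
	return builderPool.Get().(*strings.Builder)
}

// putBuilder resets the builder and returns it to the pool.
func putBuilder(b *strings.Builder) {
	b.Reset()
	builderPool.Put(b)
}

// joinWords joins words with single spaces using a pooled builder.
func joinWords(words ...string) string {
	b := getBuilder()
	defer putBuilder(b)
	for i, w := range words {
		if i > 0 {
			b.WriteByte(' ')
		}
		b.WriteString(w)
	}
	return b.String()
}

func main() {
	fmt.Println(joinWords("pooled", "string", "builder"))
}
```

Pairing every Get with a deferred Put keeps the pool balanced even when the function returns early.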

func PutByteBuffer ¶

func PutByteBuffer(buf *[]byte)

PutByteBuffer returns a byte buffer to the pool

func PutContentExtractorSlices ¶ added in v1.2.3

func PutContentExtractorSlices(text []Text, rect []Rect)

PutContentExtractorSlices returns slices to the pool

func PutIntSlice ¶ added in v1.2.3

func PutIntSlice(s []int)

PutIntSlice returns an int slice to the pool

func PutPDFBuffer ¶

func PutPDFBuffer(b *buffer)

PutPDFBuffer returns a PDF buffer to the pool after resetting

func PutSizedBuffer ¶ added in v1.0.1

func PutSizedBuffer(buf []byte)

PutSizedBuffer returns a byte buffer to the global sized pool. This is a convenience function for common use cases.

func PutSizedStringBuilder ¶ added in v1.0.1

func PutSizedStringBuilder(sb *FastStringBuilder, estimatedSize int)

PutSizedStringBuilder returns a string builder to the appropriate pool

func PutSizedTextSlice ¶ added in v1.0.1

func PutSizedTextSlice(slice []Text)

PutSizedTextSlice returns a Text slice to the global pool

func PutText ¶

func PutText(t *Text)

PutText returns a Text object to the appropriate pool

func PutTextBlock ¶ added in v1.2.3

func PutTextBlock(tb *TextBlock)

PutTextBlock returns a TextBlock to the pool

func PutTextBlocks ¶ added in v1.2.3

func PutTextBlocks(blocks []*TextBlock)

PutTextBlocks returns multiple TextBlocks to the pool

func PutTextSlice ¶

func PutTextSlice(s []Text)

PutTextSlice returns a Text slice to the pool

func RadixSortFloat64 ¶

func RadixSortFloat64(values []float64)

RadixSortFloat64 sorts a slice of float64 values in place using radix sort
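Radix sorting floats relies on remapping the IEEE 754 bit pattern so that unsigned integer order matches numeric order. The sketch below is a standalone illustration of that technique with hypothetical helpers (sortableKey, fromKey, radixSortFloat64); it is not taken from the package source, and NaN handling is left out.

```go
package main

import (
	"fmt"
	"math"
)

// sortableKey maps a float64 to a uint64 whose unsigned order matches the
// numeric order of the floats (NaNs are not considered in this sketch).
func sortableKey(f float64) uint64 {
	b := math.Float64bits(f)
	if b>>63 == 1 {
		return ^b // negative: flip all bits so larger magnitudes sort first
	}
	return b | 1<<63 // non-negative: set the top bit to sort after negatives
}

// fromKey inverts sortableKey.
func fromKey(k uint64) float64 {
	if k>>63 == 1 {
		return math.Float64frombits(k &^ (1 << 63))
	}
	return math.Float64frombits(^k)
}

// radixSortFloat64 sorts values in place with an 8-pass LSD radix sort.
func radixSortFloat64(values []float64) {
	n := len(values)
	if n < 2 {
		return
	}
	keys := make([]uint64, n)
	tmp := make([]uint64, n)
	for i, v := range values {
		keys[i] = sortableKey(v)
	}
	for shift := uint(0); shift < 64; shift += 8 {
		var count [257]int
		for _, k := range keys {
			count[((k>>shift)&0xFF)+1]++
		}
		for i := 1; i < 257; i++ {
			count[i] += count[i-1] // count[d] is now the start offset of digit d
		}
		for _, k := range keys {
			d := (k >> shift) & 0xFF
			tmp[count[d]] = k
			count[d]++
		}
		keys, tmp = tmp, keys
	}
	for i, k := range keys {
		values[i] = fromKey(k)
	}
}

func main() {
	v := []float64{3.5, -1, 0, -7.25, 2}
	radixSortFloat64(v)
	fmt.Println(v)
}
```

Each pass is a stable counting sort on one byte, so eight passes order the full 64-bit keys in linear time at the cost of two auxiliary slices.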

func RegisterCJKFont ¶ added in v1.2.8

func RegisterCJKFont(name string, info *CJKFontInfo)

RegisterCJKFont registers a CJK font

func RegisterPredefinedCMap ¶ added in v1.2.8

func RegisterPredefinedCMap(name string, cmap *PredefinedCMap)

RegisterPredefinedCMap registers a predefined CMap

func ResetSortingMetrics ¶ added in v1.0.1

func ResetSortingMetrics()

ResetSortingMetrics resets the sorting metrics

func ShouldRotateGlyph ¶ added in v1.2.8

func ShouldRotateGlyph(r rune) bool

ShouldRotateGlyph returns true if the glyph should be rotated in vertical text

func SmartTextRunsToPlain ¶

func SmartTextRunsToPlain(texts []Text) string

SmartTextRunsToPlain converts text runs to plain text using improved ordering

func SplitZeroCopy ¶ added in v1.0.2

func SplitZeroCopy(s string, sep byte) []string

SplitZeroCopy performs zero-copy string splitting. Strings in the returned slice are all slices of the original string.

Example ¶

ExampleSplitZeroCopy demonstrates zero-copy splitting

str := "a,b,c,d"
parts := SplitZeroCopy(str, ',')
for _, part := range parts {
	fmt.Println(part)
}

Output:

a
b
c
d

func StringSliceToByteSlice ¶ added in v1.0.2

func StringSliceToByteSlice(strings []string) [][]byte

StringSliceToByteSlice performs a zero-copy conversion of each string in []string. Each element in the returned [][]byte is read-only.

func StringToBytes ¶ added in v1.0.2

func StringToBytes(s string) []byte

StringToBytes performs a zero-copy conversion from string to []byte. Warning: The returned []byte is read-only; do not modify it.
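To make the read-only warning concrete, here is a minimal standalone sketch of a zero-copy string-to-bytes view using the unsafe helpers available since Go 1.20. The functions stringToBytes and sharesMemory are hypothetical; the package's own implementation may differ.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringToBytes is a hypothetical helper that returns a []byte view of s
// without copying. The result must never be written to, because string
// memory is immutable and may live in read-only pages.
func stringToBytes(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

// sharesMemory reports whether b starts at the same address as s.
func sharesMemory(s string, b []byte) bool {
	return len(b) > 0 && unsafe.SliceData(b) == unsafe.StringData(s)
}

func main() {
	s := "hello"
	b := stringToBytes(s)
	fmt.Println(len(b), string(b), sharesMemory(s, b))
	// b[0] = 'H' would be undefined behavior and can crash the program.
}
```

The view stays valid only while the original string is reachable, so it suits short-lived read paths such as hashing or comparison.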

func SubstringZeroCopy ¶ added in v1.0.2

func SubstringZeroCopy(s string, start, end int) string

SubstringZeroCopy performs zero-copy substring extraction. In fact, all string slicing in Go is already zero-copy.

func TrimSpaceZeroCopy ¶ added in v1.0.2

func TrimSpaceZeroCopy(s string) string

TrimSpaceZeroCopy trims leading and trailing spaces without copying

Example ¶

ExampleTrimSpaceZeroCopy demonstrates zero-copy space trimming

str := "   hello world   "
result := TrimSpaceZeroCopy(str)
fmt.Println(result)

Output:

hello world

func ValidatePDFA ¶ added in v1.2.0

func ValidatePDFA(data []byte) ([]string, error)

ValidatePDFA validates PDF/A compliance

func ValidatePDFX ¶ added in v1.2.0

func ValidatePDFX(data []byte) ([]string, error)

ValidatePDFX validates PDF/X compliance

func WarmupGlobal ¶ added in v1.0.2

func WarmupGlobal(config *WarmupConfig) error

WarmupGlobal warms up the global memory pool (convenience function)

func ZeroCopyStringSlice ¶

func ZeroCopyStringSlice(data []byte, separators []byte) []string

ZeroCopyStringSlice creates a string slice without copying data. WARNING: This is unsafe, and the returned strings share memory with the input.

Types ¶

type AccessPattern ¶ added in v1.0.2

type AccessPattern struct {
	// contains filtered or unexported fields
}

AccessPattern records the access pattern of a single font

type AccessPatternTracker ¶ added in v1.0.2

type AccessPatternTracker struct {
	// contains filtered or unexported fields
}

AccessPatternTracker tracks font access patterns

type AdaptiveCapacityEstimator ¶

type AdaptiveCapacityEstimator struct {
	// contains filtered or unexported fields
}

AdaptiveCapacityEstimator is an adaptive capacity estimator. It dynamically adjusts pre-allocated capacity based on historical data, reducing reallocation.

func NewAdaptiveCapacityEstimator ¶

func NewAdaptiveCapacityEstimator(maxSamples int) *AdaptiveCapacityEstimator

NewAdaptiveCapacityEstimator creates a new adaptive estimator

func (*AdaptiveCapacityEstimator) Estimate ¶

func (ace *AdaptiveCapacityEstimator) Estimate(hint int) int

Estimate estimates the required capacity based on historical data

func (*AdaptiveCapacityEstimator) Record ¶

func (ace *AdaptiveCapacityEstimator) Record(actual int)

Record records the actual capacity used
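One way to picture a record-then-estimate loop is a bounded window of recent sizes whose average feeds the next allocation. The sketch below is a standalone illustration under that assumption: capacityEstimator and its methods are hypothetical, and the averaging rule (the larger of the hint and the window mean) is a guess rather than the package's documented behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// capacityEstimator keeps the last max recorded sizes in a ring buffer.
type capacityEstimator struct {
	mu      sync.Mutex
	samples []int
	next    int
	max     int
}

func newCapacityEstimator(maxSamples int) *capacityEstimator {
	if maxSamples < 1 {
		maxSamples = 1
	}
	return &capacityEstimator{max: maxSamples}
}

// record stores the size that was actually needed, evicting the oldest sample
// once the window is full.
func (e *capacityEstimator) record(actual int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if len(e.samples) < e.max {
		e.samples = append(e.samples, actual)
		return
	}
	e.samples[e.next] = actual
	e.next = (e.next + 1) % e.max
}

// estimate returns the larger of hint and the mean of the recorded samples.
func (e *capacityEstimator) estimate(hint int) int {
	e.mu.Lock()
	defer e.mu.Unlock()
	if len(e.samples) == 0 {
		return hint
	}
	sum := 0
	for _, s := range e.samples {
		sum += s
	}
	if avg := sum / len(e.samples); avg > hint {
		return avg
	}
	return hint
}

func main() {
	e := newCapacityEstimator(2)
	e.record(100)
	e.record(200)
	buf := make([]int, 0, e.estimate(10))
	fmt.Println(cap(buf))
}
```

Feeding the estimate into make lets a hot loop converge on a capacity that rarely needs to grow.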

type AdaptiveProcessor ¶ added in v1.0.2

type AdaptiveProcessor struct {
	// contains filtered or unexported fields
}

AdaptiveProcessor is an adaptive processor. It automatically adjusts the concurrency level based on system load.

func NewAdaptiveProcessor ¶ added in v1.0.2

func NewAdaptiveProcessor(min, max int) *AdaptiveProcessor

NewAdaptiveProcessor creates an adaptive processor

func (*AdaptiveProcessor) AdjustWorkers ¶ added in v1.0.2

func (ap *AdaptiveProcessor) AdjustWorkers()

AdjustWorkers adjusts the worker count based on system load

func (*AdaptiveProcessor) GetWorkerCount ¶ added in v1.0.2

func (ap *AdaptiveProcessor) GetWorkerCount() int

GetWorkerCount returns the current worker goroutine count

func (*AdaptiveProcessor) ProcessAdaptive ¶ added in v1.0.2

func (ap *AdaptiveProcessor) ProcessAdaptive(
	ctx context.Context,
	pages []Page,
	processorFunc func(Page) ([]Text, error),
) ([][]Text, error)

ProcessAdaptive processes pages with processorFunc using adaptive concurrency

type AdaptiveSorter ¶ added in v1.0.1

type AdaptiveSorter struct {
	// contains filtered or unexported fields
}

AdaptiveSorter selects the best sorting algorithm based on data characteristics

func NewAdaptiveSorter ¶ added in v1.0.1

func NewAdaptiveSorter() *AdaptiveSorter

NewAdaptiveSorter creates a new adaptive sorter with default thresholds

func (*AdaptiveSorter) SortTextsByComparison ¶ added in v1.0.1

func (as *AdaptiveSorter) SortTextsByComparison(texts []Text, less func(i, j int) bool)

SortTextsByComparison sorts texts using a comparison function

func (*AdaptiveSorter) SortTextsByCoordinate ¶ added in v1.0.1

func (as *AdaptiveSorter) SortTextsByCoordinate(texts []Text, getCoord func(Text) float64)

SortTextsByCoordinate sorts texts by a numeric coordinate using the best algorithm

type AsyncReader ¶

type AsyncReader struct {
	*Reader
	// contains filtered or unexported fields
}

AsyncReader wraps a Reader to provide asynchronous operations

func NewAsyncReader ¶

func NewAsyncReader(reader *Reader) *AsyncReader

NewAsyncReader creates a new async reader with async I/O support

func (*AsyncReader) AsyncExtractStructured ¶

func (ar *AsyncReader) AsyncExtractStructured(ctx context.Context) (<-chan []ClassifiedBlock, <-chan error)

AsyncExtractStructured extracts structured text asynchronously

func (*AsyncReader) AsyncExtractText ¶

func (ar *AsyncReader) AsyncExtractText(ctx context.Context) (<-chan string, <-chan error)

AsyncExtractText extracts text from all pages asynchronously

func (*AsyncReader) AsyncExtractTextWithContext ¶

func (ar *AsyncReader) AsyncExtractTextWithContext(ctx context.Context, opts ExtractOptions) (<-chan string, <-chan error)

AsyncExtractTextWithContext extracts text with cancellation and timeout support

func (*AsyncReader) AsyncStream ¶

func (ar *AsyncReader) AsyncStream(ctx context.Context, processor func(Page, int) error) <-chan error

AsyncStream processes the PDF file with async I/O operations

func (*AsyncReader) StreamValueReader ¶

func (ar *AsyncReader) StreamValueReader(ctx context.Context, v Value) (<-chan []byte, <-chan error)

StreamValueReader provides async streaming of value data

type AsyncReaderAt ¶

type AsyncReaderAt struct {
	// contains filtered or unexported fields
}

AsyncReaderAt provides async I/O for low-level file operations

func NewAsyncReaderAt ¶

func NewAsyncReaderAt(reader io.ReaderAt) *AsyncReaderAt

NewAsyncReaderAt creates a new async reader with async I/O support

func (*AsyncReaderAt) ReadAtAsync ¶

func (ara *AsyncReaderAt) ReadAtAsync(ctx context.Context, buf []byte, offset int64) (<-chan int, <-chan error)

ReadAtAsync reads from the file asynchronously
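The two-channel return shape used by the async methods can be illustrated with a small standalone sketch around io.ReaderAt. The functions readAtAsync, wait, and demoRead are hypothetical, and treating io.EOF on a short read as a non-error is an assumption of this sketch, not a statement about ReadAtAsync.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// readAtAsync runs r.ReadAt in a goroutine and reports the outcome on one of
// two buffered channels, both of which are closed when the goroutine ends.
func readAtAsync(ctx context.Context, r io.ReaderAt, buf []byte, off int64) (<-chan int, <-chan error) {
	nCh := make(chan int, 1)
	errCh := make(chan error, 1)
	go func() {
		defer close(nCh)
		defer close(errCh)
		if err := ctx.Err(); err != nil {
			errCh <- err // cancelled before the read started
			return
		}
		n, err := r.ReadAt(buf, off)
		if err != nil && err != io.EOF {
			errCh <- err
			return
		}
		nCh <- n
	}()
	return nCh, errCh
}

// wait blocks until the read finishes and returns its result.
func wait(nCh <-chan int, errCh <-chan error) (int, error) {
	if err := <-errCh; err != nil {
		return 0, err
	}
	return <-nCh, nil
}

// demoRead reads five bytes at offset 6 from an in-memory source.
func demoRead() (int, string, error) {
	buf := make([]byte, 5)
	n, err := wait(readAtAsync(context.Background(), strings.NewReader("hello world"), buf, 6))
	return n, string(buf[:n]), err
}

func main() {
	fmt.Println(demoRead())
}
```

Buffering both channels means the goroutine never blocks on a receiver that has gone away, which avoids leaking it after cancellation.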

type BatchExtractOptions ¶ added in v1.0.1

type BatchExtractOptions struct {
	// Pages to extract (nil means all pages)
	Pages []int

	// Number of concurrent workers (0 = NumCPU)
	Workers int

	// Whether to use smart text ordering
	SmartOrdering bool

	// Context for cancellation
	Context context.Context

	// Buffer size for each page result (0 = default 2KB)
	PageBufferSize int

	// Whether to enable font cache for this batch (default: false)
	// When enabled, a temporary font cache is created for the batch
	// to reduce redundant font parsing across pages
	UseFontCache bool

	// Maximum number of fonts to cache (0 = default 1000)
	// Only used when UseFontCache is true
	FontCacheSize int

	// FontCacheType specifies which cache implementation to use
	// - FontCacheStandard: Standard implementation (default)
	// - FontCacheOptimized: High-performance optimized cache (10-85x faster)
	// Only used when UseFontCache is true
	FontCacheType FontCacheType

	// PageTimeout is the maximum time allowed for processing a single page
	// If zero, defaults to 30 seconds. Set to negative value to disable.
	PageTimeout time.Duration

	// ParseLimits configures resource limits for parsing operations
	// If nil, uses DefaultParseLimits()
	ParseLimits *ParseLimits
}

BatchExtractOptions configures batch extraction behavior

Example (OptimizedCache) ¶

ExampleBatchExtractOptions_optimizedCache demonstrates using the optimized cache

// This example shows how to use the optimized cache
opts := BatchExtractOptions{
	Workers:       8,
	SmartOrdering: true,
	UseFontCache:  true,
	FontCacheType: FontCacheOptimized, // Optimized cache (10-85x faster)
	FontCacheSize: 2000,
}

fmt.Printf("Cache type: Optimized, Size: %d\n", opts.FontCacheSize)

Output:

Cache type: Optimized, Size: 2000

Example (StandardCache) ¶

ExampleBatchExtractOptions_standardCache demonstrates using the standard cache

// This example shows how to use the standard cache
opts := BatchExtractOptions{
	Workers:       4,
	SmartOrdering: true,
	UseFontCache:  true,
	FontCacheType: FontCacheStandard, // Standard cache
	FontCacheSize: 1000,
}

fmt.Printf("Cache type: Standard, Size: %d\n", opts.FontCacheSize)

Output:

Cache type: Standard, Size: 1000

type BatchResult ¶ added in v1.0.1

type BatchResult struct {
	PageNum int
	Text    string
	Error   error
}

BatchResult contains the result of extracting a single page

type BatchStringBuilder ¶

type BatchStringBuilder struct {
	// contains filtered or unexported fields
}

BatchStringBuilder is a batch string builder. It avoids multiple reallocations by precisely calculating the required capacity.

func NewBatchStringBuilder ¶

func NewBatchStringBuilder(texts []Text) *BatchStringBuilder

NewBatchStringBuilder creates a batch string builder

func (*BatchStringBuilder) AppendTexts ¶

func (bsb *BatchStringBuilder) AppendTexts(texts []Text) string

AppendTexts appends text content in batch

func (*BatchStringBuilder) Reset ¶

func (bsb *BatchStringBuilder) Reset()

Reset resets the builder for reuse

func (*BatchStringBuilder) String ¶

func (bsb *BatchStringBuilder) String() string

String returns the built string

type BlockType ¶

type BlockType int

BlockType represents the semantic type of a text block

const (
	BlockUnknown   BlockType = iota
	BlockTitle               // Title or heading
	BlockParagraph           // Regular paragraph
	BlockList                // List item (numbered or bulleted)
	BlockCaption             // Image or table caption
	BlockFootnote            // Footnote or endnote
	BlockHeader              // Page header
	BlockFooter              // Page footer
)

func (BlockType) String ¶

func (bt BlockType) String() string

String returns the string representation of BlockType
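The iota-plus-String pattern behind an enum like BlockType can be sketched as follows. The type blockKind, its constants, and the returned names are hypothetical stand-ins; the actual strings produced by BlockType.String are not listed on this page.

```go
package main

import "fmt"

// blockKind is a stand-in enum declared with iota, like BlockType.
type blockKind int

const (
	kindUnknown blockKind = iota
	kindTitle
	kindParagraph
	kindList
)

// kindNames is indexed by the enum value.
var kindNames = [...]string{"Unknown", "Title", "Paragraph", "List"}

// String implements fmt.Stringer, falling back to "Unknown" for values
// outside the declared range.
func (k blockKind) String() string {
	if k < 0 || int(k) >= len(kindNames) {
		return "Unknown"
	}
	return kindNames[k]
}

func main() {
	fmt.Println(kindTitle, kindList, blockKind(99))
}
```

Because the type implements fmt.Stringer, the fmt verbs %v and %s print the name rather than the integer.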

type CCITTFaxDecoder ¶ added in v1.2.8

type CCITTFaxDecoder struct {
	// contains filtered or unexported fields
}

CCITTFaxDecoder decodes CCITT Group 3 and Group 4 fax encoded data as specified in PDF 32000-1:2008, Section 7.4.6

func NewCCITTFaxDecoder ¶ added in v1.2.8

func NewCCITTFaxDecoder(r io.Reader, params CCITTFaxParams) *CCITTFaxDecoder

NewCCITTFaxDecoder creates a new CCITT fax decoder

func (*CCITTFaxDecoder) Read ¶ added in v1.2.8

func (d *CCITTFaxDecoder) Read(p []byte) (n int, err error)

Read implements io.Reader

type CCITTFaxParams ¶ added in v1.2.8

type CCITTFaxParams struct {
	K                      int  // <0: pure 2D (Group 4), 0: pure 1D (Group 3), >0: mixed
	EndOfLine              bool // If true, require EOL alignment bits
	EncodedByteAlign       bool // If true, encoded data is byte-aligned after each row
	Columns                int  // Width of image in pixels (default: 1728)
	Rows                   int  // Height of image (0 = unknown)
	EndOfBlock             bool // If true, expect EOFB sequence
	BlackIs1               bool // If true, 1 bits represent black pixels
	DamagedRowsBeforeError int  // Max consecutive damaged rows before error
}

CCITTFaxParams contains parameters for CCITT fax decoding

func DefaultCCITTFaxParams ¶ added in v1.2.8

func DefaultCCITTFaxParams() CCITTFaxParams

DefaultCCITTFaxParams returns default CCITT fax parameters
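For orientation, the sketch below fills a stand-in struct with the default values that PDF 32000-1:2008 gives for the CCITTFaxDecode filter parameters, and spells out what K selects. The names faxParams, specDefaults, and scheme are hypothetical, and it is an assumption that DefaultCCITTFaxParams returns exactly these values.

```go
package main

import "fmt"

// faxParams mirrors the fields of CCITTFaxParams for illustration only.
type faxParams struct {
	K                      int
	EndOfLine              bool
	EncodedByteAlign       bool
	Columns                int
	Rows                   int
	EndOfBlock             bool
	BlackIs1               bool
	DamagedRowsBeforeError int
}

// specDefaults returns the defaults from the PDF specification: K 0,
// Columns 1728, EndOfBlock true, and zero values for everything else.
func specDefaults() faxParams {
	return faxParams{K: 0, Columns: 1728, EndOfBlock: true}
}

// scheme names the encoding scheme selected by K.
func scheme(k int) string {
	switch {
	case k < 0:
		return "Group 4 (pure 2D)"
	case k == 0:
		return "Group 3 (pure 1D)"
	default:
		return "Group 3 (mixed 1D/2D)"
	}
}

func main() {
	p := specDefaults()
	fmt.Println(p.Columns, p.EndOfBlock, scheme(p.K))
	fmt.Println(scheme(-1))
}
```

Most scanned documents embedded in PDFs use a negative K, so the Group 4 branch is the one worth exercising first when testing a decoder.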

func ParseCCITTFaxParams ¶ added in v1.2.8

func ParseCCITTFaxParams(param Value) CCITTFaxParams

ParseCCITTFaxParams parses CCITT fax parameters from a Value

type CFFCache ¶ added in v1.2.8

type CFFCache struct {
	// contains filtered or unexported fields
}

CFFCache provides caching for CFF font parsing and decoding operations

func GetGlobalCFFCache ¶ added in v1.2.8

func GetGlobalCFFCache() *CFFCache

GetGlobalCFFCache returns the global CFF cache instance

func NewCFFCache ¶ added in v1.2.8

func NewCFFCache(maxSize int, ttl time.Duration) *CFFCache

NewCFFCache creates a new CFF cache

func (*CFFCache) GetDecoding ¶ added in v1.2.8

func (cc *CFFCache) GetDecoding(data []byte) ([]interface{}, bool)

GetDecoding retrieves cached character string decoding results

func (*CFFCache) GetFont ¶ added in v1.2.8

func (cc *CFFCache) GetFont(data []byte) (*CFFFont, bool)

GetFont retrieves a cached CFF font

func (*CFFCache) PutDecoding ¶ added in v1.2.8

func (cc *CFFCache) PutDecoding(data []byte, commands []interface{})

PutDecoding caches character string decoding results

func (*CFFCache) PutFont ¶ added in v1.2.8

func (cc *CFFCache) PutFont(data []byte, font *CFFFont)

PutFont caches a CFF font

type CFFCacheEntry ¶ added in v1.2.8

type CFFCacheEntry struct {
	Data        interface{}
	Expiration  time.Time
	LastAccess  time.Time
	AccessCount int64
}

CFFCacheEntry represents a cached CFF font or decoded result

func (*CFFCacheEntry) IsExpired ¶ added in v1.2.8

func (ce *CFFCacheEntry) IsExpired() bool

IsExpired checks if the cache entry has expired
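A time-based expiry check of this kind can be sketched in a few lines. The type ttlEntry and the helpers below are hypothetical; the sketch passes the current time explicitly to stay deterministic, and it assumes that a zero Expiration means the entry never expires, which this page does not state.

```go
package main

import (
	"fmt"
	"time"
)

// ttlEntry is a stand-in for a cache entry with an absolute expiry time.
type ttlEntry struct {
	Data       any
	Expiration time.Time
}

// isExpiredAt reports whether the entry has expired at time now.
// A zero Expiration is treated as "never expires" (an assumption).
func (e *ttlEntry) isExpiredAt(now time.Time) bool {
	return !e.Expiration.IsZero() && now.After(e.Expiration)
}

// demoExpiry checks an entry before and after its one-minute lifetime, and an
// entry without any expiry.
func demoExpiry() (fresh, stale, forever bool) {
	start := time.Date(2026, 1, 7, 0, 0, 0, 0, time.UTC)
	e := &ttlEntry{Data: "font", Expiration: start.Add(time.Minute)}
	fresh = !e.isExpiredAt(start.Add(30 * time.Second))
	stale = e.isExpiredAt(start.Add(2 * time.Minute))
	forever = !(&ttlEntry{}).isExpiredAt(start)
	return fresh, stale, forever
}

func main() {
	fmt.Println(demoExpiry())
}
```

Storing an absolute expiry time rather than a duration keeps the check to a single comparison on the read path.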

type CFFCharStringDecoder ¶ added in v1.2.8

type CFFCharStringDecoder struct {
	// contains filtered or unexported fields
}

CFFCharStringDecoder decodes CFF CharString data

func NewCFFCharStringDecoder ¶ added in v1.2.8

func NewCFFCharStringDecoder(data []byte) *CFFCharStringDecoder

NewCFFCharStringDecoder creates a new CharString decoder with pooled objects

func (*CFFCharStringDecoder) Decode ¶ added in v1.2.8

func (d *CFFCharStringDecoder) Decode() ([]interface{}, error)

Decode decodes the CharString and returns the path commands with caching and pooling

func (*CFFCharStringDecoder) GetWidth ¶ added in v1.2.8

func (d *CFFCharStringDecoder) GetWidth() (float64, bool)

GetWidth returns the glyph width if available

type CFFDict ¶ added in v1.2.8

type CFFDict struct {
	Data map[int]interface{}
}

CFFDict represents a CFF DICT data structure

type CFFFont ¶ added in v1.2.8

type CFFFont struct {
	Header      *CFFHeader
	NameIndex   *CFFIndex
	TopDict     *CFFDict
	StringIndex *CFFIndex
	GlobalSubrs *CFFIndex
	CharStrings *CFFIndex
	PrivateDict *CFFDict
	LocalSubrs  *CFFIndex
	FDArray     []*CFFDict // For CID-keyed fonts
	FDSelect    []byte     // For CID-keyed fonts
	// contains filtered or unexported fields
}

CFFFont represents a parsed CFF font

func NewCFFFont ¶ added in v1.2.8

func NewCFFFont(data []byte) (*CFFFont, error)

NewCFFFont parses CFF font data with caching

func (*CFFFont) GetCharString ¶ added in v1.2.8

func (f *CFFFont) GetCharString(gid int) []byte

GetCharString returns the CharString for a given glyph index

func (*CFFFont) GetFDIndex ¶ added in v1.2.8

func (f *CFFFont) GetFDIndex(cid int) int

GetFDIndex returns the Font DICT index for a CID (CID-keyed fonts)

func (*CFFFont) GetFontName ¶ added in v1.2.8

func (f *CFFFont) GetFontName() string

GetFontName returns the font name

func (*CFFFont) IsCID ¶ added in v1.2.8

func (f *CFFFont) IsCID() bool

IsCID returns true if this is a CID-keyed font

type CFFHeader ¶ added in v1.2.8

type CFFHeader struct {
	Major   uint8
	Minor   uint8
	HdrSize uint8
	OffSize uint8
}

CFFHeader represents the CFF font header

type CFFIndex ¶ added in v1.2.8

type CFFIndex struct {
	Count   uint16
	OffSize uint8
	Offsets []uint32
	Data    [][]byte
}

CFFIndex represents a CFF INDEX structure
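The four fields map directly onto the on-disk INDEX layout from the CFF specification: a 2-byte count, a 1-byte offset size, count+1 big-endian offsets, and the object data. A minimal standalone parser is sketched below; cffIndex and parseIndex are hypothetical names, and the package's own parser may validate differently.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// cffIndex mirrors the exported fields of CFFIndex for illustration.
type cffIndex struct {
	Count   uint16
	OffSize uint8
	Offsets []uint32
	Data    [][]byte
}

var errBadIndex = errors.New("malformed CFF INDEX")

// parseIndex decodes one INDEX from the start of b and returns it together
// with the number of bytes consumed.
func parseIndex(b []byte) (*cffIndex, int, error) {
	if len(b) < 2 {
		return nil, 0, errBadIndex
	}
	idx := &cffIndex{Count: binary.BigEndian.Uint16(b)}
	if idx.Count == 0 {
		return idx, 2, nil // an empty INDEX is just the 2-byte count
	}
	if len(b) < 3 || b[2] < 1 || b[2] > 4 {
		return nil, 0, errBadIndex
	}
	idx.OffSize = b[2]
	offSize, n := int(b[2]), int(idx.Count)+1
	tableEnd := 3 + n*offSize
	if len(b) < tableEnd {
		return nil, 0, errBadIndex
	}
	idx.Offsets = make([]uint32, n)
	for i := range idx.Offsets {
		for _, c := range b[3+i*offSize : 3+(i+1)*offSize] {
			idx.Offsets[i] = idx.Offsets[i]<<8 | uint32(c)
		}
	}
	// Offsets are 1-based, relative to the byte preceding the object data.
	base := tableEnd - 1
	idx.Data = make([][]byte, idx.Count)
	for i := range idx.Data {
		start, end := base+int(idx.Offsets[i]), base+int(idx.Offsets[i+1])
		if idx.Offsets[i] < 1 || start > end || end > len(b) {
			return nil, 0, errBadIndex
		}
		idx.Data[i] = b[start:end]
	}
	return idx, base + int(idx.Offsets[idx.Count]), nil
}

func main() {
	raw := []byte{0, 2, 1, 1, 3, 6, 'a', 'b', 'c', 'd', 'e'}
	idx, n, err := parseIndex(raw)
	fmt.Println(string(idx.Data[0]), string(idx.Data[1]), n, err)
}
```

The returned byte count lets a caller walk consecutive INDEX structures, such as the Name, Top DICT, and String INDEXes that follow the header.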

type CFFObjectPool ¶ added in v1.2.8

type CFFObjectPool struct {
	// contains filtered or unexported fields
}

CFFObjectPool provides object pooling for CFF decoding operations

func GetGlobalCFFPool ¶ added in v1.2.8

func GetGlobalCFFPool() *CFFObjectPool

GetGlobalCFFPool returns the global CFF object pool instance

func NewCFFObjectPool ¶ added in v1.2.8

func NewCFFObjectPool() *CFFObjectPool

NewCFFObjectPool creates a new CFF object pool

func (*CFFObjectPool) GetCommandSlice ¶ added in v1.2.8

func (p *CFFObjectPool) GetCommandSlice() []interface{}

GetCommandSlice retrieves a command slice from the pool

func (*CFFObjectPool) GetStack ¶ added in v1.2.8

func (p *CFFObjectPool) GetStack() []float64

GetStack retrieves a stack slice from the pool

func (*CFFObjectPool) PutCommandSlice ¶ added in v1.2.8

func (p *CFFObjectPool) PutCommandSlice(commands []interface{})

PutCommandSlice returns a command slice to the pool

func (*CFFObjectPool) PutStack ¶ added in v1.2.8

func (p *CFFObjectPool) PutStack(stack []float64)

PutStack returns a stack slice to the pool

type CIDFont ¶ added in v1.2.8

type CIDFont struct {
	// contains filtered or unexported fields
}

CIDFont represents a CID-keyed font

func NewCIDFont ¶ added in v1.2.8

func NewCIDFont() *CIDFont

NewCIDFont creates a new CID font

func (*CIDFont) DecodeToUnicode ¶ added in v1.2.8

func (f *CIDFont) DecodeToUnicode(raw string) string

DecodeToUnicode decodes a string using the CMap and ToUnicode

func (*CIDFont) GetWidth ¶ added in v1.2.8

func (f *CIDFont) GetWidth(cid int) int

GetWidth returns the width for a CID

func (*CIDFont) SetCIDToGIDMap ¶ added in v1.2.8

func (f *CIDFont) SetCIDToGIDMap(m *CIDToGIDMap)

SetCIDToGIDMap sets the CID to GID mapping

func (*CIDFont) SetCMap ¶ added in v1.2.8

func (f *CIDFont) SetCMap(cmap *CMap)

SetCMap sets the CMap for this CID font

func (*CIDFont) SetDefaultWidth ¶ added in v1.2.8

func (f *CIDFont) SetDefaultWidth(w int)

SetDefaultWidth sets the default glyph width

func (*CIDFont) SetToUnicode ¶ added in v1.2.8

func (f *CIDFont) SetToUnicode(toUnicode *ToUnicodeCMap)

SetToUnicode sets the ToUnicode CMap

func (*CIDFont) SetWidth ¶ added in v1.2.8

func (f *CIDFont) SetWidth(cid, width int)

SetWidth sets the width for a specific CID

func (*CIDFont) SetWritingMode ¶ added in v1.2.8

func (f *CIDFont) SetWritingMode(wMode int)

SetWritingMode sets the writing mode (0=horizontal, 1=vertical)

func (*CIDFont) WritingMode ¶ added in v1.2.8

func (f *CIDFont) WritingMode() int

WritingMode returns the writing mode

type CIDFontDescriptor ¶ added in v1.2.8

type CIDFontDescriptor struct {
	FontName     string
	FontFamily   string
	Flags        int
	FontBBox     [4]float64
	ItalicAngle  float64
	Ascent       float64
	Descent      float64
	Leading      float64
	CapHeight    float64
	XHeight      float64
	StemV        float64
	StemH        float64
	AvgWidth     float64
	MaxWidth     float64
	MissingWidth float64
}

CIDFontDescriptor contains font descriptor information for CID fonts

type CIDSystemInfo ¶ added in v1.2.8

type CIDSystemInfo struct {
	Registry   string
	Ordering   string
	Supplement int
}

CIDSystemInfo represents the CIDSystemInfo dictionary in a CMap

type CIDToGIDMap ¶ added in v1.2.8

type CIDToGIDMap struct {
	// contains filtered or unexported fields
}

CIDToGIDMap represents a CIDToGIDMap for CID-keyed fonts

func NewCIDToGIDMap ¶ added in v1.2.8

func NewCIDToGIDMap(data []byte) *CIDToGIDMap

NewCIDToGIDMap creates a CIDToGIDMap from raw data

func NewIdentityCIDToGIDMap ¶ added in v1.2.8

func NewIdentityCIDToGIDMap() *CIDToGIDMap

NewIdentityCIDToGIDMap creates an identity CIDToGIDMap

func (*CIDToGIDMap) IsIdentity ¶ added in v1.2.8

func (m *CIDToGIDMap) IsIdentity() bool

IsIdentity returns true if this is an identity mapping

func (*CIDToGIDMap) LookupGID ¶ added in v1.2.8

func (m *CIDToGIDMap) LookupGID(cid int) int

LookupGID returns the GID for a given CID
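In a PDF CIDToGIDMap stream, the glyph index for CID c is the big-endian 2-byte value at bytes 2c and 2c+1, and the name Identity means the GID equals the CID. The sketch below shows that lookup in isolation; cidToGID and its lookup method are hypothetical, and returning 0 (the .notdef glyph) for out-of-range CIDs is an assumption.

```go
package main

import "fmt"

// cidToGID is a stand-in for a CID-to-GID table.
type cidToGID struct {
	identity bool
	data     []byte // raw stream bytes, two per CID, high byte first
}

// lookup returns the glyph index for cid, or 0 when cid is outside the table.
func (m *cidToGID) lookup(cid int) int {
	if m.identity {
		return cid
	}
	i := 2 * cid
	if cid < 0 || i+1 >= len(m.data) {
		return 0
	}
	return int(m.data[i])<<8 | int(m.data[i+1])
}

func main() {
	m := &cidToGID{data: []byte{0, 0, 0, 5, 1, 2}}
	fmt.Println(m.lookup(1), m.lookup(2), m.lookup(3))
	fmt.Println((&cidToGID{identity: true}).lookup(42))
}
```

Keeping the raw bytes and decoding on lookup avoids building a second table for fonts with tens of thousands of CIDs.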

type CJKFontInfo ¶ added in v1.2.8

type CJKFontInfo struct {
	Name       string // Font name
	Registry   string // Registry (e.g., "Adobe")
	Ordering   string // Ordering (e.g., "GB1", "CNS1", "Japan1", "Korea1")
	Supplement int    // Supplement number
	IsVertical bool   // Whether vertical writing mode
	WMode      int    // Writing mode: 0=horizontal, 1=vertical
}

CJKFontInfo contains information about CJK fonts

func GetCJKFontInfo ¶ added in v1.2.8

func GetCJKFontInfo(name string) *CJKFontInfo

GetCJKFontInfo returns information about a CJK font

type CJKFontRegistry ¶ added in v1.2.8

type CJKFontRegistry struct {
	// contains filtered or unexported fields
}

CJKFontRegistry is a registry of CJK fonts

type CJKGlyphMetrics ¶ added in v1.2.8

type CJKGlyphMetrics struct {
	Width       float64 // Horizontal advance width
	Height      float64 // Vertical advance height
	VOriginX    float64 // Vertical origin X
	VOriginY    float64 // Vertical origin Y
	HasVertical bool    // Whether vertical metrics are available
}

CJKGlyphMetrics contains glyph metrics for CJK fonts

type CJKTextProcessor ¶ added in v1.2.8

type CJKTextProcessor struct {
	// contains filtered or unexported fields
}

CJKTextProcessor processes CJK text for proper rendering

func NewCJKTextProcessor ¶ added in v1.2.8

func NewCJKTextProcessor(font *ExtendedCIDFont, isVertical bool) *CJKTextProcessor

NewCJKTextProcessor creates a new CJK text processor

func (*CJKTextProcessor) GetGlyphMetrics ¶ added in v1.2.8

func (p *CJKTextProcessor) GetGlyphMetrics(cid int) CJKGlyphMetrics

GetGlyphMetrics returns the metrics for a glyph in the current writing mode

func (*CJKTextProcessor) ProcessText ¶ added in v1.2.8

func (p *CJKTextProcessor) ProcessText(text string) string

ProcessText processes CJK text, handling vertical writing and character variants

type CMap ¶ added in v1.2.8

type CMap struct {
	Name          string
	Type          CMapType
	CIDSystemInfo CIDSystemInfo
	WMode         int // 0: horizontal, 1: vertical
	// contains filtered or unexported fields
}

CMap represents a character code to CID or Unicode mapping

func NewCMap ¶ added in v1.2.8

func NewCMap(name string, cmapType CMapType) *CMap

NewCMap creates a new empty CMap

func ParseCMap ¶ added in v1.2.8

func ParseCMap(r io.Reader, name string) (*CMap, error)

ParseCMap parses a CMap from a reader

func (*CMap) AddBFChar ¶ added in v1.2.8

func (c *CMap) AddBFChar(orig, repl string)

AddBFChar adds a single base font character mapping (for ToUnicode)

func (*CMap) AddBFRange ¶ added in v1.2.8

func (c *CMap) AddBFRange(low, high string, dst Value)

AddBFRange adds a base font range mapping (for ToUnicode)

func (*CMap) AddCIDChar ¶ added in v1.2.8

func (c *CMap) AddCIDChar(code []byte, cid int)

AddCIDChar adds a single CID character mapping

func (*CMap) AddCIDRange ¶ added in v1.2.8

func (c *CMap) AddCIDRange(low, high []byte, startCID int)

AddCIDRange adds a CID range mapping

func (*CMap) AddCodeSpaceRange ¶ added in v1.2.8

func (c *CMap) AddCodeSpaceRange(low, high []byte)

AddCodeSpaceRange adds a code space range to the CMap

func (*CMap) Decode ¶ added in v1.2.8

func (c *CMap) Decode(raw string) string

Decode implements TextEncoding interface for ToUnicode CMaps (lock-free)

func (*CMap) LookupCID ¶ added in v1.2.8

func (c *CMap) LookupCID(code []byte) (int, bool)

LookupCID looks up the CID for a given character code (lock-free)

func (*CMap) OptimizeCIDLookup ¶ added in v1.2.8

func (c *CMap) OptimizeCIDLookup()

OptimizeCIDLookup precomputes CID mappings for fast lookup. This should be called after all CID mappings are added to the CMap.

func (*CMap) SetCIDSystemInfo ¶ added in v1.2.8

func (c *CMap) SetCIDSystemInfo(registry, ordering string, supplement int)

SetCIDSystemInfo sets the CIDSystemInfo for the CMap

func (*CMap) SetUseCMap ¶ added in v1.2.8

func (c *CMap) SetUseCMap(parent TextEncoding)

SetUseCMap sets the parent CMap to use for unmapped codes

func (*CMap) String ¶ added in v1.2.8

func (c *CMap) String() string

String implements fmt.Stringer for debugging

type CMapInfo ¶ added in v1.2.8

type CMapInfo struct {
	Name       string
	Registry   string
	Ordering   string
	Supplement int
	WMode      int
	Type       CMapType
}

CMapInfo contains information about a CMap

func GetCMapInfo ¶ added in v1.2.8

func GetCMapInfo(name string) *CMapInfo

GetCMapInfo returns information about a registered CMap

type CMapParser ¶ added in v1.2.8

type CMapParser struct {
	// contains filtered or unexported fields
}

CMapParser parses CMap files/streams

type CMapType ¶ added in v1.2.8

type CMapType int

CMapType represents the type of CMap

const (
	CMapTypeToUnicode CMapType = iota // ToUnicode CMap
	CMapTypeCID                       // CID CMap (for Adobe-* encodings)
)

type CacheContext ¶

type CacheContext struct {
	// contains filtered or unexported fields
}

CacheContext provides a context-aware cache with automatic cleanup

func NewCacheContext ¶

func NewCacheContext(parent context.Context, cache *ResultCache) *CacheContext

NewCacheContext creates a new context-aware cache

func (*CacheContext) Close ¶

func (cc *CacheContext) Close()

Close releases resources used by the cache context

func (*CacheContext) GetWithTimeout ¶

func (cc *CacheContext) GetWithTimeout(key string, timeout time.Duration) (interface{}, bool, error)

GetWithTimeout gets a value with timeout

type CacheEntry ¶

type CacheEntry struct {
	Data        interface{}
	Expiration  time.Time
	AccessCount int64
	LastAccess  time.Time
	Size        int64 // Estimated size in bytes
}

CacheEntry represents a cached item

func (*CacheEntry) IsExpired ¶

func (ce *CacheEntry) IsExpired() bool

IsExpired checks if the cache entry has expired

type CacheKeyGenerator ¶

type CacheKeyGenerator struct{}

CacheKeyGenerator provides functions to generate cache keys

func NewCacheKeyGenerator ¶

func NewCacheKeyGenerator() *CacheKeyGenerator

NewCacheKeyGenerator creates a new key generator

func (*CacheKeyGenerator) GenerateFullHash ¶

func (ckg *CacheKeyGenerator) GenerateFullHash(data string) string

GenerateFullHash generates a hash from arbitrary data

func (*CacheKeyGenerator) GeneratePageContentKey ¶

func (ckg *CacheKeyGenerator) GeneratePageContentKey(pageNum int, readerHash string) string

GeneratePageContentKey generates a cache key for page content

func (*CacheKeyGenerator) GenerateReaderHash ¶

func (ckg *CacheKeyGenerator) GenerateReaderHash(reader *Reader) string

GenerateReaderHash generates a hash for the reader object (simplified)

func (*CacheKeyGenerator) GenerateTextClassificationKey ¶

func (ckg *CacheKeyGenerator) GenerateTextClassificationKey(pageNum int, readerHash string, processorParams string) string

GenerateTextClassificationKey generates a cache key for text classification

func (*CacheKeyGenerator) GenerateTextOrderingKey ¶

func (ckg *CacheKeyGenerator) GenerateTextOrderingKey(pageNum int, readerHash string, orderingParams string) string

GenerateTextOrderingKey generates a cache key for text ordering

type CacheLineAlignedCounter ¶

type CacheLineAlignedCounter struct {
	// contains filtered or unexported fields
}

func NewCacheLineAlignedCounter ¶

func NewCacheLineAlignedCounter(n int) *CacheLineAlignedCounter

func (*CacheLineAlignedCounter) Add ¶

func (c *CacheLineAlignedCounter) Add(idx int, delta uint64)

func (*CacheLineAlignedCounter) Get ¶

func (c *CacheLineAlignedCounter) Get(idx int) uint64
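The type carries no description, so the following is only a guess at the idea its name suggests: per-index counters spaced one cache line apart so that goroutines updating different indices do not contend on the same line (false sharing). The `paddedCounter` type and the 64-byte line size are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// paddedSlot spaces counters 64 bytes apart (an assumed cache line size).
type paddedSlot struct {
	v uint64
	_ [56]byte // padding: 8 bytes of counter + 56 bytes = 64 bytes
}

// paddedCounter is a hypothetical stand-in for an indexed counter set.
type paddedCounter struct {
	slots []paddedSlot
}

func newPaddedCounter(n int) *paddedCounter {
	return &paddedCounter{slots: make([]paddedSlot, n)}
}

// Add atomically adds delta to the counter at idx.
func (c *paddedCounter) Add(idx int, delta uint64) {
	atomic.AddUint64(&c.slots[idx].v, delta)
}

// Get atomically reads the counter at idx.
func (c *paddedCounter) Get(idx int) uint64 {
	return atomic.LoadUint64(&c.slots[idx].v)
}

func main() {
	c := newPaddedCounter(4)
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(idx, 1) // each goroutine touches only its own slot
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(c.Get(0), c.Get(3))
}
```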

type CacheLinePadded ¶

type CacheLinePadded struct {
	// contains filtered or unexported fields
}

CacheLinePadded is a cache-line-aligned structure

type CacheManager ¶

type CacheManager struct {
	// contains filtered or unexported fields
}

CacheManager provides centralized cache management

func NewCacheManager ¶

func NewCacheManager() *CacheManager

NewCacheManager creates a new cache manager with separate caches for different data types

func (*CacheManager) GetClassificationCache ¶

func (cm *CacheManager) GetClassificationCache() *ResultCache

GetClassificationCache returns the classification cache

func (*CacheManager) GetMetadataCache ¶

func (cm *CacheManager) GetMetadataCache() *ResultCache

GetMetadataCache returns the metadata cache

func (*CacheManager) GetPageCache ¶

func (cm *CacheManager) GetPageCache() *ResultCache

GetPageCache returns the page content cache

func (*CacheManager) GetTextOrderingCache ¶

func (cm *CacheManager) GetTextOrderingCache() *ResultCache

GetTextOrderingCache returns the text ordering cache

func (*CacheManager) GetTotalStats ¶

func (cm *CacheManager) GetTotalStats() CacheStats

GetTotalStats returns combined statistics for all caches

type CacheShard ¶ added in v1.0.2

type CacheShard struct {
	// contains filtered or unexported fields
}

CacheShard represents a single shard of the cache

type CacheStats ¶

type CacheStats struct {
	Hits        int64
	Misses      int64
	Evictions   int64
	CurrentSize int64
	MaxSize     int64
	Entries     int64
}

CacheStats provides statistics about cache performance

type CachedReader ¶

type CachedReader struct {
	*Reader
	// contains filtered or unexported fields
}

CachedReader wraps a Reader to provide caching functionality

func NewCachedReader ¶

func NewCachedReader(reader *Reader, cache *ResultCache) *CachedReader

NewCachedReader creates a new cached reader

func (*CachedReader) CachedClassifyTextBlocks ¶

func (cr *CachedReader) CachedClassifyTextBlocks(pageNum int) ([]ClassifiedBlock, error)

CachedClassifyTextBlocks returns classified text blocks with caching

func (*CachedReader) CachedPage ¶

func (cr *CachedReader) CachedPage(pageNum int) ([]Text, error)

CachedPage returns page content with caching

type ClassifiedBlock ¶

type ClassifiedBlock struct {
	Type    BlockType // Semantic type of the block
	Level   int       // Hierarchy level (for titles: 1=h1, 2=h2, etc.)
	Content []Text    // Text runs in this block
	Bounds  Rect      // Bounding box
	Text    string    // Concatenated text content
}

ClassifiedBlock represents a classified block of text with semantic information

func GetBlockSlice ¶

func GetBlockSlice() []ClassifiedBlock

GetBlockSlice retrieves a ClassifiedBlock slice from the pool

func GetTextByType ¶

func GetTextByType(blocks []ClassifiedBlock, blockType BlockType) []ClassifiedBlock

GetTextByType returns all text blocks of a specific type

func GetTitles ¶

func GetTitles(blocks []ClassifiedBlock, level int) []ClassifiedBlock

GetTitles returns all title blocks, optionally filtered by level

type ClassifiedBlockWithLanguage ¶

type ClassifiedBlockWithLanguage struct {
	ClassifiedBlock
	Language LanguageInfo
}

ClassifiedBlockWithLanguage represents a classified block with language information

type Column ¶

type Column struct {
	Position int64
	Content  TextVertical
}

Column represents the contents of a column

type Columns ¶

type Columns []*Column

Columns is a list of columns

type ConnectionPool ¶

type ConnectionPool struct {
	// contains filtered or unexported fields
}

ConnectionPool manages a pool of connections/resources

func NewConnectionPool ¶

func NewConnectionPool(maxSize int, newFunc func() interface{}, closeFunc func(interface{})) *ConnectionPool

NewConnectionPool creates a new connection pool

func (*ConnectionPool) Close ¶

func (cp *ConnectionPool) Close()

Close closes all connections in the pool

func (*ConnectionPool) Get ¶

func (cp *ConnectionPool) Get() interface{}

Get retrieves a connection from the pool

func (*ConnectionPool) Put ¶

func (cp *ConnectionPool) Put(conn interface{})

Put returns a connection to the pool
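One common way to build a bounded pool with this constructor shape is a buffered channel of idle resources. The sketch below assumes that design; the `pool` type is hypothetical, and how the real ConnectionPool behaves when it is empty or full is not stated in this documentation.

```go
package main

import "fmt"

// pool is a hypothetical bounded pool backed by a buffered channel.
type pool struct {
	idle      chan interface{}
	newFunc   func() interface{}
	closeFunc func(interface{})
}

func newPool(maxSize int, newFunc func() interface{}, closeFunc func(interface{})) *pool {
	return &pool{idle: make(chan interface{}, maxSize), newFunc: newFunc, closeFunc: closeFunc}
}

// Get reuses an idle resource if one is available, otherwise creates one.
func (p *pool) Get() interface{} {
	select {
	case r := <-p.idle:
		return r
	default:
		return p.newFunc()
	}
}

// Put keeps the resource for reuse, or closes it when the pool is full.
func (p *pool) Put(r interface{}) {
	select {
	case p.idle <- r:
	default:
		p.closeFunc(r)
	}
}

// Close closes every resource that is currently idle.
func (p *pool) Close() {
	for {
		select {
		case r := <-p.idle:
			p.closeFunc(r)
		default:
			return
		}
	}
}

func main() {
	next := 0
	p := newPool(2, func() interface{} { next++; return next }, func(r interface{}) {
		fmt.Println("closed", r)
	})
	a := p.Get()
	p.Put(a)
	fmt.Println(p.Get() == a) // the idle resource is handed out again
	p.Close()
}
```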

type Content ¶

type Content struct {
	Text []Text
	Rect []Rect
}

Content describes the basic content on a page: the text and any drawn rectangles.

type CryptoEngine ¶ added in v1.2.8

type CryptoEngine struct {
	// contains filtered or unexported fields
}

CryptoEngine provides encryption/decryption functionality

func NewCryptoEngine ¶ added in v1.2.8

func NewCryptoEngine(info *PDFEncryptionInfo) *CryptoEngine

NewCryptoEngine creates a new crypto engine

func (*CryptoEngine) DecryptData ¶ added in v1.2.8

func (e *CryptoEngine) DecryptData(data []byte, objID, genID int) ([]byte, error)

DecryptData decrypts data using the current encryption method
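The objID and genID parameters suggest a per-object key. For RC4 and AESV2 the PDF specification (Algorithm 1) derives that key from the file encryption key, the low-order bytes of the object and generation numbers, and, for AES, the four bytes "sAlT". The sketch below follows that algorithm; `objectKey` is a hypothetical helper and is not claimed to be how CryptoEngine is implemented. AES-256 (AESV3) uses the file key directly and is not covered here.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// objectKey derives a per-object key as described by Algorithm 1 of the PDF
// specification. It applies to RC4 and AESV2 only.
func objectKey(fileKey []byte, objID, genID int, aes bool) []byte {
	buf := make([]byte, 0, len(fileKey)+9)
	buf = append(buf, fileKey...)
	// Three low-order bytes of the object number, low byte first.
	buf = append(buf, byte(objID), byte(objID>>8), byte(objID>>16))
	// Two low-order bytes of the generation number, low byte first.
	buf = append(buf, byte(genID), byte(genID>>8))
	if aes {
		buf = append(buf, 0x73, 0x41, 0x6C, 0x54) // "sAlT"
	}
	sum := md5.Sum(buf)
	n := len(fileKey) + 5 // key length is n+5 bytes, capped at 16
	if n > 16 {
		n = 16
	}
	return sum[:n]
}

func main() {
	fileKey := make([]byte, 5) // a 40-bit file key, all zeros for the demo
	fmt.Println(len(objectKey(fileKey, 12, 0, false)))
	fmt.Println(len(objectKey(make([]byte, 16), 12, 0, true)))
}
```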

func (*CryptoEngine) EncryptData ¶ added in v1.2.8

func (e *CryptoEngine) EncryptData(data []byte, objID, genID int) ([]byte, error)

EncryptData encrypts data using the current encryption method

func (*CryptoEngine) SetKey ¶ added in v1.2.8

func (e *CryptoEngine) SetKey(key []byte)

SetKey sets the encryption key

type EncryptionMethod ¶ added in v1.2.8

type EncryptionMethod int

EncryptionMethod represents the encryption method

const (
	MethodRC4   EncryptionMethod = 0
	MethodAESV2 EncryptionMethod = 1 // AES-128 CBC
	MethodAESV3 EncryptionMethod = 2 // AES-256 CBC
)

type EncryptionRevision ¶ added in v1.2.8

type EncryptionRevision int

EncryptionRevision represents PDF encryption revision

const (
	Revision2 EncryptionRevision = 2 // MD5-based
	Revision3 EncryptionRevision = 3 // MD5-based with key strengthening
	Revision4 EncryptionRevision = 4 // MD5-based with access permissions
	Revision5 EncryptionRevision = 5 // SHA-256-based
	Revision6 EncryptionRevision = 6 // SHA-384/512-based
)

type EncryptionVersion ¶ added in v1.2.8

type EncryptionVersion int

EncryptionVersion represents PDF encryption version

const (
	EncryptionV1 EncryptionVersion = 1 // RC4 40-bit
	EncryptionV2 EncryptionVersion = 2 // RC4 40-128-bit
	EncryptionV4 EncryptionVersion = 4 // RC4 or AES 128-bit
	EncryptionV5 EncryptionVersion = 5 // AES 256-bit
)

type EnhancedParallelProcessor ¶ added in v1.0.2

type EnhancedParallelProcessor struct {
	// contains filtered or unexported fields
}

EnhancedParallelProcessor is an enhanced parallel processor. It provides better concurrency control, load balancing, and error handling.

func NewEnhancedParallelProcessor ¶ added in v1.0.2

func NewEnhancedParallelProcessor(workers int, batchSize int) *EnhancedParallelProcessor

NewEnhancedParallelProcessor creates an enhanced parallel processor

func (*EnhancedParallelProcessor) ProcessPagesEnhanced ¶ added in v1.0.2

func (epp *EnhancedParallelProcessor) ProcessPagesEnhanced(
	ctx context.Context,
	pages []Page,
	processorFunc func(Page) ([]Text, error),
) ([][]Text, error)

ProcessPagesEnhanced processes pages in parallel with enhancements
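The general pattern behind a call of this shape can be sketched with a fixed worker pool that writes each result at its input index, so the output order matches the input order, and that stops feeding work on the first error or on cancellation. `processAll` and `demo` are hypothetical names, and ints and strings stand in for Page and []Text; the real method's batching and error policy are not documented here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// processAll runs fn over items with a fixed number of workers and returns the
// results in input order. It returns the first error reported by fn.
func processAll(ctx context.Context, items []int, workers int, fn func(int) (string, error)) ([]string, error) {
	if workers < 1 {
		workers = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make([]string, len(items))
	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				out, err := fn(items[i])
				if err != nil {
					once.Do(func() { firstErr = err; cancel() })
					continue
				}
				results[i] = out // each index is written by exactly one worker
			}
		}()
	}
feed:
	for i := range items {
		select {
		case jobs <- i:
		case <-ctx.Done():
			break feed // stop handing out work after an error or cancellation
		}
	}
	close(jobs)
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return results, nil
}

// demo processes three stand-in "pages" with two workers.
func demo() ([]string, error) {
	return processAll(context.Background(), []int{1, 2, 3}, 2, func(n int) (string, error) {
		return fmt.Sprintf("page-%d", n), nil
	})
}

func main() {
	out, err := demo()
	fmt.Println(out, err)
}
```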

func (*EnhancedParallelProcessor) ProcessWithLoadBalancing ¶ added in v1.0.2

func (epp *EnhancedParallelProcessor) ProcessWithLoadBalancing(
	ctx context.Context,
	pages []Page,
	processorFunc func(Page) ([]Text, error),
) ([][]Text, error)

ProcessWithLoadBalancing processes with load balancing

func (*EnhancedParallelProcessor) ProcessWithPipeline ¶ added in v1.0.2

func (epp *EnhancedParallelProcessor) ProcessWithPipeline(
	ctx context.Context,
	pages []Page,
	stages []func(Page, []Text) ([]Text, error),
) ([][]Text, error)

ProcessWithPipeline processes with pipeline

type ExtendedCIDFont ¶ added in v1.2.8

type ExtendedCIDFont struct {
	*CIDFont

	V Value
	// contains filtered or unexported fields
}

ExtendedCIDFont extends CIDFont with additional CJK-specific features

func NewExtendedCIDFont ¶ added in v1.2.8

func NewExtendedCIDFont(v Value) *ExtendedCIDFont

NewExtendedCIDFont creates a new ExtendedCIDFont from a PDF value

func (*ExtendedCIDFont) Descriptor ¶ added in v1.2.8

func (cf *ExtendedCIDFont) Descriptor() *CIDFontDescriptor

Descriptor returns the font descriptor

func (*ExtendedCIDFont) GID ¶ added in v1.2.8

func (cf *ExtendedCIDFont) GID(cid int) uint16

GID returns the GID for the given CID

func (*ExtendedCIDFont) Info ¶ added in v1.2.8

func (cf *ExtendedCIDFont) Info() *CJKFontInfo

Info returns the CJK font info

func (*ExtendedCIDFont) IsVertical ¶ added in v1.2.8

func (cf *ExtendedCIDFont) IsVertical() bool

IsVertical returns true if this font uses vertical writing mode

func (*ExtendedCIDFont) VerticalOrigin ¶ added in v1.2.8

func (cf *ExtendedCIDFont) VerticalOrigin(cid int) (float64, float64)

VerticalOrigin returns the vertical origin of the given CID

func (*ExtendedCIDFont) VerticalWidth ¶ added in v1.2.8

func (cf *ExtendedCIDFont) VerticalWidth(cid int) float64

VerticalWidth returns the vertical width of the given CID

type ExtractMode ¶

type ExtractMode int

ExtractMode specifies the type of extraction to perform

const (
	ModePlain      ExtractMode = iota // Plain text extraction
	ModeStyled                        // Text with style information
	ModeStructured                    // Structured text with classification
)

type ExtractOptions ¶

type ExtractOptions struct {
	Workers   int   // Number of concurrent workers (0 = use NumCPU)
	PageRange []int // Specific pages to extract (nil = all pages)
}

ExtractOptions configures text extraction behavior

type ExtractResult ¶

type ExtractResult struct {
	Text             string            // Plain text (for ModePlain)
	StyledTexts      []Text            // Styled texts (for ModeStyled)
	ClassifiedBlocks []ClassifiedBlock // Classified blocks (for ModeStructured)
	Metadata         Metadata          // Document metadata
	PageCount        int               // Total number of pages
}

ExtractResult contains the results of text extraction

type Extractor ¶

type Extractor struct {
	// contains filtered or unexported fields
}

Extractor provides a builder pattern for configuring and executing extraction

func NewExtractor ¶

func NewExtractor(r *Reader) *Extractor

NewExtractor creates a new extractor for the given reader

func (*Extractor) Context ¶

func (e *Extractor) Context(ctx context.Context) *Extractor

Context sets the context for cancellation

func (*Extractor) Extract ¶

func (e *Extractor) Extract() (*ExtractResult, error)

Extract performs the extraction and returns the result

func (*Extractor) ExtractStructured ¶

func (e *Extractor) ExtractStructured() ([]ClassifiedBlock, error)

ExtractStructured is a convenience method for extracting structured text

func (*Extractor) ExtractStyledTexts ¶

func (e *Extractor) ExtractStyledTexts() ([]Text, error)

ExtractStyledTexts is a convenience method for extracting styled texts

func (*Extractor) ExtractText ¶

func (e *Extractor) ExtractText() (string, error)

ExtractText is a convenience method for extracting plain text

func (*Extractor) Mode ¶

func (e *Extractor) Mode(mode ExtractMode) *Extractor

Mode sets the extraction mode

func (*Extractor) Pages ¶

func (e *Extractor) Pages(pages ...int) *Extractor

Pages sets specific pages to extract (1-indexed)

func (*Extractor) SmartOrdering ¶

func (e *Extractor) SmartOrdering(enabled bool) *Extractor

SmartOrdering enables smart text ordering for multi-column layouts

func (*Extractor) Workers ¶

func (e *Extractor) Workers(n int) *Extractor

Workers sets the number of concurrent workers

type FastStringBuilder ¶

type FastStringBuilder struct {
	// contains filtered or unexported fields
}

FastStringBuilder provides optimized string building with pre-allocation

func GetSizedStringBuilder ¶ added in v1.0.1

func GetSizedStringBuilder(estimatedSize int) *FastStringBuilder

GetSizedStringBuilder retrieves a string builder from the appropriate pool

func NewFastStringBuilder ¶

func NewFastStringBuilder(estimatedSize int) *FastStringBuilder

NewFastStringBuilder creates a builder with estimated capacity

func (*FastStringBuilder) Len ¶

func (b *FastStringBuilder) Len() int

Len returns the current length

func (*FastStringBuilder) Reset ¶

func (b *FastStringBuilder) Reset()

Reset clears the builder

func (*FastStringBuilder) String ¶

func (b *FastStringBuilder) String() string

func (*FastStringBuilder) WriteByte ¶

func (b *FastStringBuilder) WriteByte(c byte) error

func (*FastStringBuilder) WriteString ¶

func (b *FastStringBuilder) WriteString(s string)

WriteString appends a string

type Font ¶

type Font struct {
	V Value
	// contains filtered or unexported fields
}

A Font represents a font in a PDF file. The methods interpret a Font dictionary stored in V.

func (Font) BaseFont ¶

func (f Font) BaseFont() string

BaseFont returns the font's name (BaseFont property).

func (*Font) Encoder ¶

func (f *Font) Encoder() TextEncoding

Encoder returns the encoding between font code point sequences and UTF-8. Pointer receiver is required so the computed encoder is cached on the shared Font instance instead of a copy. The previous value-receiver implementation rebuilt the encoder for every call, causing large allocations to pile up during batch extraction.
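The receiver point can be shown in isolation. In this sketch a method with a value receiver stores its computed result on a copy, so every call rebuilds it, while the pointer-receiver version stores it on the shared instance. The `font` and `encoder` types are hypothetical stand-ins, and the sketch ignores concurrency.

```go
package main

import "fmt"

type encoder struct{ name string }

// font is a hypothetical stand-in with a lazily built encoder.
type font struct {
	enc *encoder
}

// byValue operates on a copy of f, so the cached encoder is lost on return.
func (f font) byValue() *encoder {
	if f.enc == nil {
		f.enc = &encoder{name: "built"}
	}
	return f.enc
}

// byPointer stores the encoder on the shared instance, so later calls reuse it.
func (f *font) byPointer() *encoder {
	if f.enc == nil {
		f.enc = &encoder{name: "built"}
	}
	return f.enc
}

func main() {
	var f font
	fmt.Println(f.byValue() == f.byValue())     // a new encoder on every call
	fmt.Println(f.byPointer() == f.byPointer()) // the same cached encoder
}
```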

func (*Font) ExtendedCIDFont ¶ added in v1.2.8

func (f *Font) ExtendedCIDFont() *ExtendedCIDFont

ExtendedCIDFont returns an ExtendedCIDFont for CID-keyed fonts with enhanced CJK support

func (Font) FirstChar ¶

func (f Font) FirstChar() int

FirstChar returns the code point of the first character in the font.

func (Font) LastChar ¶

func (f Font) LastChar() int

LastChar returns the code point of the last character in the font.

func (Font) Width ¶

func (f Font) Width(code int) float64

Width returns the width of the given code point.

func (Font) Widths ¶

func (f Font) Widths() []float64

Widths returns the widths of the glyphs in the font. In a well-formed PDF, len(f.Widths()) == f.LastChar()+1 - f.FirstChar().
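Given that invariant, the width of a code is the entry at offset code minus FirstChar. A minimal sketch of that lookup follows; `widthOf` is a hypothetical helper, and returning 0 for out-of-range codes is an assumption rather than documented behavior of Width.

```go
package main

import "fmt"

// widthOf returns widths[code-first], or 0 when code lies outside
// [first, last] or the widths array is shorter than expected.
func widthOf(widths []float64, first, last, code int) float64 {
	if code < first || code > last {
		return 0
	}
	i := code - first
	if i >= len(widths) {
		return 0 // tolerate a malformed PDF with a short Widths array
	}
	return widths[i]
}

func main() {
	widths := []float64{250, 333, 408} // codes 32, 33, 34
	fmt.Println(widthOf(widths, 32, 34, 33))
	fmt.Println(widthOf(widths, 32, 34, 65))
}
```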

type FontCache ¶

type FontCache struct {
	// contains filtered or unexported fields
}

FontCache stores parsed fonts to avoid re-parsing across pages

func NewFontCache ¶

func NewFontCache() *FontCache

NewFontCache creates a new font cache

func (*FontCache) Get ¶

func (fc *FontCache) Get(key string) (*Font, bool)

Get retrieves a font from the cache

func (*FontCache) Set ¶

func (fc *FontCache) Set(key string, font *Font)

Set stores a font in the cache

type FontCacheInterface ¶ added in v1.0.1

type FontCacheInterface interface {
	Get(key string) (*Font, bool)
	Set(key string, font *Font)
	Clear()
	GetStats() FontCacheStats
}

FontCacheInterface defines the common interface for font caches

type FontCacheStats ¶ added in v1.0.1

type FontCacheStats struct {
	Entries     int
	MaxEntries  int
	Hits        uint64
	Misses      uint64
	HitRate     float64
	AvgAccesses float64
}

FontCacheStats holds font cache statistics

type FontCacheType ¶ added in v1.0.1

type FontCacheType int

FontCacheType specifies which font cache implementation to use

const (
	// FontCacheStandard uses the standard GlobalFontCache (default)
	// - Stable and well-tested
	// - Good performance for most use cases
	// - Simpler implementation
	FontCacheStandard FontCacheType = iota

	// FontCacheOptimized uses the OptimizedFontCache
	// - 10-85x faster than standard (depending on workload)
	// - Lock-free read path with 16 shards
	// - Best for high-concurrency scenarios (>1000 qps)
	// - Recommended for production environments with heavy load
	FontCacheOptimized
)

type FontPool ¶ added in v1.2.3

type FontPool struct {
	// contains filtered or unexported fields
}

FontPool manages a pool of font names and provides compact IDs. Thread-safe for concurrent access.

func GetGlobalFontPool ¶ added in v1.2.3

func GetGlobalFontPool() *FontPool

GetGlobalFontPool returns the global font pool instance

func NewFontPool ¶ added in v1.2.3

func NewFontPool() *FontPool

NewFontPool creates a new FontPool

func (*FontPool) Clear ¶ added in v1.2.3

func (fp *FontPool) Clear()

Clear removes all fonts from the pool. Should only be called when you're sure no TextOptimized objects reference these IDs.

func (*FontPool) GetFont ¶ added in v1.2.3

func (fp *FontPool) GetFont(id uint32) string

GetFont returns the font name for an ID. Returns empty string if ID is invalid. Thread-safe.

func (*FontPool) GetID ¶ added in v1.2.3

func (fp *FontPool) GetID(font string) uint32

GetID returns the ID for a font name, creating a new ID if needed. Thread-safe.
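Mapping names to compact IDs is a form of string interning. The sketch below shows one thread-safe way to do it, with a read lock on the fast path and a re-check under the write lock. The `namePool` type is hypothetical, and starting IDs at 0 is an assumption; the real FontPool may number differently.

```go
package main

import (
	"fmt"
	"sync"
)

// namePool is a hypothetical interning table from names to compact IDs.
type namePool struct {
	mu    sync.RWMutex
	ids   map[string]uint32
	names []string
}

func newNamePool() *namePool {
	return &namePool{ids: make(map[string]uint32)}
}

// GetID returns the ID for name, assigning the next free ID on first use.
func (p *namePool) GetID(name string) uint32 {
	p.mu.RLock()
	id, ok := p.ids[name]
	p.mu.RUnlock()
	if ok {
		return id
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	if id, ok := p.ids[name]; ok {
		return id // another goroutine registered it in the meantime
	}
	id = uint32(len(p.names))
	p.names = append(p.names, name)
	p.ids[name] = id
	return id
}

// GetFont returns the name for id, or "" when the ID is unknown.
func (p *namePool) GetFont(id uint32) string {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if int(id) >= len(p.names) {
		return ""
	}
	return p.names[id]
}

func main() {
	p := newNamePool()
	a := p.GetID("Helvetica")
	b := p.GetID("Times-Roman")
	fmt.Println(a, b, p.GetID("Helvetica"), p.GetFont(b))
}
```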

func (*FontPool) Len ¶ added in v1.2.3

func (fp *FontPool) Len() int

Len returns the number of unique fonts in the pool

type FontPrefetcher ¶ added in v1.0.2

type FontPrefetcher struct {
	// contains filtered or unexported fields
}

FontPrefetcher implements an intelligent font prefetch strategy. It predicts access patterns and preloads fonts that are likely to be needed.

func NewFontPrefetcher ¶ added in v1.0.2

func NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher

NewFontPrefetcher creates a new font prefetcher

func (*FontPrefetcher) ClearPatterns ¶ added in v1.0.2

func (fp *FontPrefetcher) ClearPatterns()

ClearPatterns clears access patterns

func (*FontPrefetcher) Close ¶ added in v1.0.2

func (fp *FontPrefetcher) Close()

Close closes the prefetcher

func (*FontPrefetcher) Disable ¶ added in v1.0.2

func (fp *FontPrefetcher) Disable()

Disable disables prefetching

func (*FontPrefetcher) Enable ¶ added in v1.0.2

func (fp *FontPrefetcher) Enable()

Enable enables prefetching

func (*FontPrefetcher) GetStats ¶ added in v1.0.2

func (fp *FontPrefetcher) GetStats() PrefetchStats

GetStats gets prefetch statistics

func (*FontPrefetcher) RecordAccess ¶ added in v1.0.2

func (fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)

RecordAccess records a font access

type GlobalFontCache ¶ added in v1.0.1

type GlobalFontCache struct {
	// contains filtered or unexported fields
}

GlobalFontCache implements an enhanced global font cache with:
- LRU eviction for memory control
- Hit/miss statistics for monitoring
- Content-based hashing for accurate cache keys

Example ¶

// Create a cache with max 100 entries and 1 hour expiration
cache := NewGlobalFontCache(100, 1*time.Hour)

// Store a font
font := &Font{}
cache.Set("MyFont", font)

// Retrieve the font
retrieved, ok := cache.Get("MyFont")
if ok {
	fmt.Println("Font found in cache")
	_ = retrieved
}

// Get statistics
stats := cache.GetStats()
fmt.Printf("Cache entries: %d, Hit rate: %.2f%%\n",
	stats.Entries, stats.HitRate*100)

func GetGlobalFontCache ¶ added in v1.0.1

func GetGlobalFontCache() *GlobalFontCache

GetGlobalFontCache returns the global font cache instance

Example ¶

// Get the global singleton instance
cache := GetGlobalFontCache()

font := &Font{}
cache.Set("GlobalFont", font)

// The same instance can be accessed from anywhere
sameCacheInstance := GetGlobalFontCache()
retrieved, _ := sameCacheInstance.Get("GlobalFont")
_ = retrieved

func NewGlobalFontCache ¶ added in v1.0.1

func NewGlobalFontCache(maxEntries int, maxAge time.Duration) *GlobalFontCache

NewGlobalFontCache creates a new global font cache

func (*GlobalFontCache) Cleanup ¶ added in v1.0.1

func (gfc *GlobalFontCache) Cleanup() int

Cleanup removes expired entries

func (*GlobalFontCache) Clear ¶ added in v1.0.1

func (gfc *GlobalFontCache) Clear()

Clear removes all fonts from the cache

func (*GlobalFontCache) Get ¶ added in v1.0.1

func (gfc *GlobalFontCache) Get(key string) (*Font, bool)

Get retrieves a font from the cache

func (*GlobalFontCache) GetOrCompute ¶ added in v1.0.1

func (gfc *GlobalFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)

GetOrCompute retrieves a font from the cache or computes it if not present. This is a convenience function that combines Get and Set.

func (*GlobalFontCache) GetStats ¶ added in v1.0.1

func (gfc *GlobalFontCache) GetStats() FontCacheStats

GetStats returns current cache statistics

func (*GlobalFontCache) Remove ¶ added in v1.0.1

func (gfc *GlobalFontCache) Remove(key string)

Remove removes a font from the cache

func (*GlobalFontCache) Set ¶ added in v1.0.1

func (gfc *GlobalFontCache) Set(key string, font *Font)

Set stores a font in the cache

func (*GlobalFontCache) StartCleanupRoutine ¶ added in v1.0.1

func (gfc *GlobalFontCache) StartCleanupRoutine(interval time.Duration) chan struct{}

StartCleanupRoutine starts a background goroutine to periodically clean up expired entries

type GridKey ¶

type GridKey struct {
	X, Y int
}

GridKey represents a grid cell identifier

type InplaceStringBuilder ¶ added in v1.0.2

type InplaceStringBuilder struct {
	// contains filtered or unexported fields
}

InplaceStringBuilder is an in-place string builder that avoids intermediate allocations

func NewInplaceStringBuilder ¶ added in v1.0.2

func NewInplaceStringBuilder(capacity int) *InplaceStringBuilder

NewInplaceStringBuilder creates a new in-place string builder

func (*InplaceStringBuilder) Append ¶ added in v1.0.2

func (isb *InplaceStringBuilder) Append(s string)

Append appends a string

func (*InplaceStringBuilder) Build ¶ added in v1.0.2

func (isb *InplaceStringBuilder) Build() string

Build builds the final string (single allocation)

func (*InplaceStringBuilder) Len ¶ added in v1.0.2

func (isb *InplaceStringBuilder) Len() int

Len returns the total length

func (*InplaceStringBuilder) Reset ¶ added in v1.0.2

func (isb *InplaceStringBuilder) Reset()

Reset resets the builder
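The "single allocation" idea can be sketched by collecting the pieces first, tracking their total length, and sizing the output buffer once at build time. The `pieceBuilder` type is a hypothetical stand-in; the real InplaceStringBuilder may store its data differently.

```go
package main

import (
	"fmt"
	"strings"
)

// pieceBuilder is a hypothetical builder that defers concatenation to Build.
type pieceBuilder struct {
	parts []string
	total int
}

// Append records s without copying its bytes.
func (b *pieceBuilder) Append(s string) {
	b.parts = append(b.parts, s)
	b.total += len(s)
}

// Len returns the total length of all appended strings.
func (b *pieceBuilder) Len() int { return b.total }

// Build concatenates the pieces into a buffer sized once up front.
func (b *pieceBuilder) Build() string {
	var sb strings.Builder
	sb.Grow(b.total) // one allocation for the final string
	for _, p := range b.parts {
		sb.WriteString(p)
	}
	return sb.String()
}

// Reset drops the pieces but keeps the slice capacity for reuse.
func (b *pieceBuilder) Reset() {
	b.parts = b.parts[:0]
	b.total = 0
}

func main() {
	b := &pieceBuilder{}
	b.Append("Hello, ")
	b.Append("PDF")
	fmt.Println(b.Len(), b.Build())
}
```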

type IntegrityStatus ¶ added in v1.2.2

type IntegrityStatus struct {
	// IsValid indicates whether the PDF is valid enough to parse
	IsValid bool
	// IsTruncated indicates whether the file appears to be truncated
	IsTruncated bool
	// HasValidHeader indicates whether a valid PDF header was found
	HasValidHeader bool
	// HasValidEOF indicates whether a valid %%EOF marker was found
	HasValidEOF bool
	// HasStartxref indicates whether a startxref marker was found
	HasStartxref bool
	// HasXref indicates whether xref table or stream was found
	HasXref bool
	// HasTrailer indicates whether trailer dictionary was found
	HasTrailer bool
	// EstimatedObjects is the estimated number of objects in the file
	EstimatedObjects int
	// Issues contains descriptions of any problems found
	Issues []string
}

IntegrityStatus represents the result of a PDF integrity check

func CheckIntegrity ¶ added in v1.2.2

func CheckIntegrity(f io.ReaderAt, size int64) *IntegrityStatus

CheckIntegrity performs a quick integrity check on a PDF file
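Several IntegrityStatus fields correspond to markers that can be found by reading only the start and end of the file. The sketch below checks three of them through an io.ReaderAt. `quickCheck`, `checkBytes`, `quickStatus`, and the 1024-byte tail window are assumptions made for illustration; the real CheckIntegrity inspects more than this.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// quickStatus is a hypothetical subset of the integrity fields.
type quickStatus struct {
	HasValidHeader bool
	HasValidEOF    bool
	HasStartxref   bool
}

// quickCheck looks for the %PDF- header at the start of the file and for the
// startxref and %%EOF markers in the last 1024 bytes.
func quickCheck(f io.ReaderAt, size int64) quickStatus {
	var st quickStatus

	head := make([]byte, 8)
	if n, _ := f.ReadAt(head, 0); n >= 5 {
		st.HasValidHeader = bytes.HasPrefix(head[:n], []byte("%PDF-"))
	}

	tailLen := int64(1024)
	if size < tailLen {
		tailLen = size
	}
	tail := make([]byte, tailLen)
	n, _ := f.ReadAt(tail, size-tailLen)
	tail = tail[:n]
	st.HasValidEOF = bytes.Contains(tail, []byte("%%EOF"))
	st.HasStartxref = bytes.Contains(tail, []byte("startxref"))
	return st
}

// checkBytes runs quickCheck on an in-memory file.
func checkBytes(data []byte) quickStatus {
	return quickCheck(bytes.NewReader(data), int64(len(data)))
}

func main() {
	good := []byte("%PDF-1.7\n1 0 obj\nendobj\nstartxref\n9\n%%EOF\n")
	truncated := good[:16]
	fmt.Printf("%+v\n", checkBytes(good))
	fmt.Printf("%+v\n", checkBytes(truncated))
}
```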

type JBIG2Decoder ¶ added in v1.2.8

type JBIG2Decoder struct {
	// contains filtered or unexported fields
}

JBIG2Decoder decodes JBIG2-encoded data. JBIG2 is a complex format used primarily for scanned documents. This implementation provides basic support for embedded JBIG2 streams.

func NewJBIG2Decoder ¶ added in v1.2.8

func NewJBIG2Decoder(r io.Reader, params JBIG2Params) *JBIG2Decoder

NewJBIG2Decoder creates a new JBIG2 decoder

func (*JBIG2Decoder) Read ¶ added in v1.2.8

func (d *JBIG2Decoder) Read(p []byte) (n int, err error)

Read implements io.Reader

type JBIG2Params ¶ added in v1.2.8

type JBIG2Params struct {
	Globals []byte // Data from JBIG2Globals stream
}

JBIG2Params contains parameters for JBIG2 decoding

func ParseJBIG2Params ¶ added in v1.2.8

func ParseJBIG2Params(param Value) JBIG2Params

ParseJBIG2Params parses JBIG2 parameters from a Value

type KDNode ¶

type KDNode struct {
	// contains filtered or unexported fields
}

KDNode is a KD-tree node. Optimized: it stores fixed float64 coordinates instead of a slice to avoid allocation.

type KDTree ¶

type KDTree struct {
	// contains filtered or unexported fields
}

KDTree is a KD-tree spatial index for nearest-neighbor search with O(log n) time complexity

func BuildKDTree ¶

func BuildKDTree(blocks []*TextBlock) *KDTree

BuildKDTree builds a KD tree from text blocks. Optimized: indices are pre-allocated once and coordinates are fixed-size.

func (*KDTree) RangeSearch ¶

func (tree *KDTree) RangeSearch(targetX, targetY, radiusSq float64) []*TextBlock

RangeSearch returns all text blocks within the specified radius of the target point. Optimized: it uses an object pool for the stack, an inlined distance calculation, and direct coordinates.

func (*KDTree) RangeSearchWithBuffer ¶ added in v1.2.3

func (tree *KDTree) RangeSearchWithBuffer(targetX, targetY, radiusSq float64, buffer []*TextBlock) []*TextBlock

RangeSearchWithBuffer is an optimized version that reuses a provided buffer for results. This eliminates repeated allocations when performing multiple searches. The buffer will be cleared and reused. If buffer is nil, behaves like RangeSearch. Returns the result slice (may be the same as buffer or a new allocation if capacity insufficient).
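The buffer-reuse contract can be illustrated without a tree. The sketch below does a brute-force scan, truncates the caller's buffer to length zero, and appends matches into it, so repeated searches allocate only when the capacity is too small. It assumes, from the parameter name, that radiusSq is the squared radius; `point` and `rangeSearchInto` are hypothetical, and the inclusive comparison is a choice made for this sketch.

```go
package main

import "fmt"

type point struct{ X, Y float64 }

// rangeSearchInto returns every point whose squared distance to (tx, ty) is
// at most radiusSq. Results are appended to buf[:0], so buf's storage is
// reused whenever its capacity is sufficient. A nil buf is allowed.
func rangeSearchInto(pts []point, tx, ty, radiusSq float64, buf []point) []point {
	buf = buf[:0]
	for _, p := range pts {
		dx, dy := p.X-tx, p.Y-ty
		if dx*dx+dy*dy <= radiusSq { // compare squared values, no sqrt needed
			buf = append(buf, p)
		}
	}
	return buf
}

func main() {
	pts := []point{{0, 0}, {3, 4}, {10, 10}}
	buf := make([]point, 0, 8)

	buf = rangeSearchInto(pts, 0, 0, 25, buf)
	fmt.Println(len(buf), cap(buf))

	buf = rangeSearchInto(pts, 10, 10, 1, buf) // same storage, new results
	fmt.Println(len(buf), cap(buf))
}
```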

type LZWPredictor ¶ added in v1.2.8

type LZWPredictor struct {
	// contains filtered or unexported fields
}

LZWPredictor implements PNG prediction filters for LZW decoded data

func NewLZWPredictor ¶ added in v1.2.8

func NewLZWPredictor(r io.Reader, params LZWPredictorParams) *LZWPredictor

NewLZWPredictor creates a new LZW predictor filter

func (*LZWPredictor) Read ¶ added in v1.2.8

func (p *LZWPredictor) Read(b []byte) (n int, err error)

Read implements io.Reader

type LZWPredictorParams ¶ added in v1.2.8

type LZWPredictorParams struct {
	Predictor int // 1=none, 2=TIFF, 10-15=PNG
	Colors    int // Number of color components (default: 1)
	BPC       int // Bits per component (default: 8)
	Columns   int // Pixels per row (default: 1)
}

LZWPredictorParams contains parameters for LZW prediction
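To show how these parameters are used, the sketch below reverses PNG row filters on already-decompressed data. Each row is one filter-type byte followed by ceil(Colors*BPC*Columns/8) data bytes. Only the None (0), Sub (1), and Up (2) filters are handled; Average and Paeth are omitted for brevity. `rowBytes` and `undoPNG` are hypothetical helpers, not the package's implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// rowBytes returns the number of data bytes in one row.
func rowBytes(colors, bpc, columns int) int {
	return (colors*bpc*columns + 7) / 8
}

// undoPNG reverses PNG filters 0 (None), 1 (Sub) and 2 (Up).
func undoPNG(data []byte, colors, bpc, columns int) ([]byte, error) {
	rb := rowBytes(colors, bpc, columns)
	bpp := (colors*bpc + 7) / 8 // bytes per pixel, rounded up to at least 1
	stride := rb + 1            // one filter-type byte precedes each row
	if rb <= 0 || len(data)%stride != 0 {
		return nil, errors.New("data length is not a whole number of rows")
	}
	out := make([]byte, 0, len(data)/stride*rb)
	prev := make([]byte, rb) // the row above the first row is all zeros
	for off := 0; off < len(data); off += stride {
		row := append([]byte(nil), data[off+1:off+stride]...)
		switch data[off] {
		case 0: // None: bytes are stored as-is
		case 1: // Sub: add the byte one pixel to the left
			for i := bpp; i < rb; i++ {
				row[i] += row[i-bpp]
			}
		case 2: // Up: add the byte directly above
			for i := range row {
				row[i] += prev[i]
			}
		default:
			return nil, fmt.Errorf("unsupported PNG filter type %d", data[off])
		}
		out = append(out, row...)
		prev = row
	}
	return out, nil
}

func main() {
	// Three rows of three 8-bit gray pixels: Up, Up, Sub.
	data := []byte{2, 1, 2, 3, 2, 1, 1, 1, 1, 5, 1, 1}
	out, err := undoPNG(data, 1, 8, 3)
	fmt.Println(out, err)
}
```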

func DefaultLZWPredictorParams ¶ added in v1.2.8

func DefaultLZWPredictorParams() LZWPredictorParams

DefaultLZWPredictorParams returns default predictor parameters

func ParseLZWPredictorParams ¶ added in v1.2.8

func ParseLZWPredictorParams(param Value) LZWPredictorParams

ParseLZWPredictorParams parses predictor parameters from a Value

type Language ¶

type Language string

Language represents a language code

const (
	English Language = "en"
	French  Language = "fr"
	German  Language = "de"
	Spanish Language = "es"
	Unknown Language = "unknown"
)

type LanguageInfo ¶

type LanguageInfo struct {
	Language      Language
	Confidence    float64 // Confidence level (0.0 to 1.0)
	Characters    []rune  // Unique characters in the text
	WordCount     int     // Number of words in the text
	SentenceCount int     // Number of sentences in the text
}

LanguageInfo contains information about a detected language

type LanguageTextExtractor ¶

type LanguageTextExtractor struct {
	// contains filtered or unexported fields
}

LanguageTextExtractor extracts text while detecting languages

func NewLanguageTextExtractor ¶

func NewLanguageTextExtractor() *LanguageTextExtractor

NewLanguageTextExtractor creates a new language-aware text extractor

func (*LanguageTextExtractor) ExtractTextByLanguage ¶

func (lte *LanguageTextExtractor) ExtractTextByLanguage(reader *Reader) (map[Language][]Text, error)

ExtractTextByLanguage extracts text grouped by detected language

func (*LanguageTextExtractor) GetLanguageStats ¶

func (lte *LanguageTextExtractor) GetLanguageStats(texts []Text) map[Language]int

GetLanguageStats returns statistics about languages detected in the text

func (*LanguageTextExtractor) GetTextsByLanguage ¶

func (lte *LanguageTextExtractor) GetTextsByLanguage(texts []Text, targetLang Language) []Text

GetTextsByLanguage returns text elements filtered by specific language

type LazyPage ¶

type LazyPage struct {
	// contains filtered or unexported fields
}

LazyPage provides lazy loading of page content to reduce memory usage for large PDFs where not all pages need to be processed

func NewLazyPage ¶

func NewLazyPage(r *Reader, pageNum int) *LazyPage

NewLazyPage creates a lazy-loading page wrapper

func (*LazyPage) GetContent ¶

func (lp *LazyPage) GetContent() *Content

GetContent loads and returns the page content (cached after first call)

func (*LazyPage) IsLoaded ¶

func (lp *LazyPage) IsLoaded() bool

IsLoaded returns whether the page content has been loaded

func (*LazyPage) Release ¶

func (lp *LazyPage) Release()

Release clears the cached content to free memory

type LazyPageManager ¶

type LazyPageManager struct {
	// contains filtered or unexported fields
}

LazyPageManager manages lazy loading of multiple pages

func NewLazyPageManager ¶

func NewLazyPageManager(r *Reader, maxCached int) *LazyPageManager

NewLazyPageManager creates a manager with LRU cache

func (*LazyPageManager) Clear ¶

func (m *LazyPageManager) Clear()

Clear releases all cached pages

func (*LazyPageManager) GetPage ¶

func (m *LazyPageManager) GetPage(pageNum int) *LazyPage

GetPage returns a lazy page, loading it if necessary

func (*LazyPageManager) GetStats ¶

func (m *LazyPageManager) GetStats() (totalPages, loadedPages int)

GetStats returns cache statistics

type LockFreeRingBuffer ¶

type LockFreeRingBuffer struct {
	// contains filtered or unexported fields
}

LockFreeRingBuffer is a lock-free ring buffer for producer-consumer workloads

func NewLockFreeRingBuffer ¶

func NewLockFreeRingBuffer(size int) *LockFreeRingBuffer

func (*LockFreeRingBuffer) Pop ¶

func (rb *LockFreeRingBuffer) Pop() (interface{}, bool)

func (*LockFreeRingBuffer) Push ¶

func (rb *LockFreeRingBuffer) Push(item interface{}) bool
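Neither Push nor Pop is documented beyond its signature. As a rough illustration of the technique only, the following sketch shows a single-producer/single-consumer ring buffer built on atomic indices. The type spscRing and its rounding of the size to a power of two are assumptions made for this example; the library's buffer may support more producers or consumers and may behave differently when full.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// spscRing is a hypothetical single-producer/single-consumer ring buffer.
// It is safe only when exactly one goroutine calls Push and one calls Pop.
type spscRing struct {
	buf  []interface{}
	mask uint64
	head atomic.Uint64 // next slot to read; advanced only by the consumer
	tail atomic.Uint64 // next slot to write; advanced only by the producer
}

// newSPSCRing rounds size up to a power of two so that index wrapping
// can use a bit mask instead of a modulo.
func newSPSCRing(size int) *spscRing {
	n := 1
	for n < size {
		n <<= 1
	}
	return &spscRing{buf: make([]interface{}, n), mask: uint64(n - 1)}
}

// Push stores item and returns true, or returns false if the buffer is full.
func (r *spscRing) Push(item interface{}) bool {
	tail := r.tail.Load()
	if tail-r.head.Load() == uint64(len(r.buf)) {
		return false
	}
	r.buf[tail&r.mask] = item
	r.tail.Store(tail + 1) // publish the slot to the consumer
	return true
}

// Pop removes the oldest item, or returns false if the buffer is empty.
func (r *spscRing) Pop() (interface{}, bool) {
	head := r.head.Load()
	if head == r.tail.Load() {
		return nil, false
	}
	item := r.buf[head&r.mask]
	r.buf[head&r.mask] = nil // drop the reference so the GC can reclaim it
	r.head.Store(head + 1)   // hand the slot back to the producer
	return item, true
}

func main() {
	r := newSPSCRing(2)
	fmt.Println(r.Push("a"), r.Push("b"), r.Push("c")) // the third push finds the buffer full
	v, ok := r.Pop()
	fmt.Println(v, ok)
}
```

Returning a bool instead of blocking leaves the back-pressure policy (retry, drop, or yield) to the caller.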

type MemoryArena ¶

type MemoryArena struct {
	// contains filtered or unexported fields
}

MemoryArena is a memory pool manager that reduces GC pressure

func NewMemoryArena ¶

func NewMemoryArena(chunkSize int) *MemoryArena

func (*MemoryArena) Alloc ¶

func (a *MemoryArena) Alloc(size int) []byte

func (*MemoryArena) Reset ¶

func (a *MemoryArena) Reset()
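Alloc and Reset carry no documentation here, so the following is only a sketch of how a chunked arena (bump allocator) commonly works. The arena type, its handling of oversized requests, and the choice to keep one chunk on Reset are all assumptions for illustration.

```go
package main

import "fmt"

// arena is a hypothetical bump allocator: it hands out sub-slices of
// large chunks so that many small allocations cost one heap object.
type arena struct {
	chunkSize int
	chunks    [][]byte
	off       int // write offset inside the last chunk
}

func newArena(chunkSize int) *arena { return &arena{chunkSize: chunkSize} }

// Alloc returns a slice of exactly size bytes.
func (a *arena) Alloc(size int) []byte {
	if size <= 0 {
		return nil
	}
	if size > a.chunkSize {
		return make([]byte, size) // oversized requests bypass the arena
	}
	if len(a.chunks) == 0 || a.off+size > a.chunkSize {
		a.chunks = append(a.chunks, make([]byte, a.chunkSize))
		a.off = 0
	}
	cur := a.chunks[len(a.chunks)-1]
	// The three-index slice caps capacity so an append by the caller
	// cannot overwrite a neighbouring allocation.
	b := cur[a.off : a.off+size : a.off+size]
	a.off += size
	return b
}

// Reset makes the arena reusable. Slices handed out earlier must no longer
// be used, and reused memory is not zeroed.
func (a *arena) Reset() {
	if len(a.chunks) > 0 {
		a.chunks = a.chunks[:1] // keep one chunk, let the GC take the rest
	}
	a.off = 0
}

// chunkCount reports how many chunks the arena currently owns.
func (a *arena) chunkCount() int { return len(a.chunks) }

func main() {
	a := newArena(8)
	a.Alloc(4)
	a.Alloc(4)
	a.Alloc(4) // does not fit in the first chunk, so a second one is added
	fmt.Println("chunks:", a.chunkCount())
	a.Reset()
	fmt.Println("chunks after reset:", a.chunkCount())
}
```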

type MemoryEfficientExtractor ¶

type MemoryEfficientExtractor struct {
	// contains filtered or unexported fields
}

MemoryEfficientExtractor provides memory-efficient extraction using streaming

func NewMemoryEfficientExtractor ¶

func NewMemoryEfficientExtractor(chunkSize, bufferSize int, maxMemory int64) *MemoryEfficientExtractor

NewMemoryEfficientExtractor creates a new memory-efficient extractor

func (*MemoryEfficientExtractor) ExtractTextStream ¶

func (mee *MemoryEfficientExtractor) ExtractTextStream(reader *Reader) (<-chan TextStream, <-chan error)

ExtractTextStream extracts text in a memory-efficient streaming way

func (*MemoryEfficientExtractor) ExtractTextToWriter ¶

func (mee *MemoryEfficientExtractor) ExtractTextToWriter(reader *Reader, writer io.Writer) (err error)

ExtractTextToWriter extracts text directly to an io.Writer to minimize memory usage

type Metadata ¶

type Metadata struct {
	Title        string            // Document title
	Author       string            // Author name
	Subject      string            // Document subject
	Keywords     []string          // Keywords
	Creator      string            // Application that created the document
	Producer     string            // PDF producer (converter)
	CreationDate time.Time         // Creation date
	ModDate      time.Time         // Last modification date
	Trapped      string            // Trapping information (True/False/Unknown)
	Custom       map[string]string // Custom metadata fields
}

Metadata represents PDF document metadata

func (Metadata) String ¶

func (m Metadata) String() string

String returns a formatted string with document information

type MultiLangProcessor ¶

type MultiLangProcessor struct {
	// contains filtered or unexported fields
}

MultiLangProcessor provides multi-language text processing

func NewMultiLangProcessor ¶

func NewMultiLangProcessor() *MultiLangProcessor

NewMultiLangProcessor creates a new multi-language processor

func (*MultiLangProcessor) DetectLanguage ¶

func (mlp *MultiLangProcessor) DetectLanguage(text string) LanguageInfo

DetectLanguage detects the language of a given text

func (*MultiLangProcessor) GetLanguageConfidenceThreshold ¶

func (mlp *MultiLangProcessor) GetLanguageConfidenceThreshold() float64

GetLanguageConfidenceThreshold returns a confidence threshold for reliable detection

func (*MultiLangProcessor) GetLanguageName ¶

func (mlp *MultiLangProcessor) GetLanguageName(lang Language) string

GetLanguageName returns the full name of a language

func (*MultiLangProcessor) GetSupportedLanguages ¶

func (mlp *MultiLangProcessor) GetSupportedLanguages() []Language

GetSupportedLanguages returns the list of supported languages

func (*MultiLangProcessor) IsEnglish ¶

func (mlp *MultiLangProcessor) IsEnglish(text string) bool

IsEnglish checks if text is likely English

func (*MultiLangProcessor) IsFrench ¶

func (mlp *MultiLangProcessor) IsFrench(text string) bool

IsFrench checks if text is likely French

func (*MultiLangProcessor) IsGerman ¶

func (mlp *MultiLangProcessor) IsGerman(text string) bool

IsGerman checks if text is likely German

func (*MultiLangProcessor) IsSpanish ¶

func (mlp *MultiLangProcessor) IsSpanish(text string) bool

IsSpanish checks if text is likely Spanish

func (*MultiLangProcessor) ProcessTextWithLanguageDetection ¶

func (mlp *MultiLangProcessor) ProcessTextWithLanguageDetection(texts []Text) []TextWithLanguage

ProcessTextWithLanguageDetection processes text with language detection

type MultiLanguageTextClassifier ¶

type MultiLanguageTextClassifier struct {
	*TextClassifier
	// contains filtered or unexported fields
}

MultiLanguageTextClassifier extends the text classifier with language awareness

func NewMultiLanguageTextClassifier ¶

func NewMultiLanguageTextClassifier(texts []Text, pageWidth, pageHeight float64) *MultiLanguageTextClassifier

NewMultiLanguageTextClassifier creates a new multi-language text classifier

func (*MultiLanguageTextClassifier) ClassifyBlocksWithLanguage ¶

func (mltc *MultiLanguageTextClassifier) ClassifyBlocksWithLanguage() []ClassifiedBlockWithLanguage

ClassifyBlocksWithLanguage extends the classification with language information

type MultiLevelCache ¶

type MultiLevelCache struct {
	// contains filtered or unexported fields
}

MultiLevelCache is a multi-level cache manager

func NewMultiLevelCache ¶

func NewMultiLevelCache() *MultiLevelCache

NewMultiLevelCache creates a multi-level cache

func (*MultiLevelCache) Get ¶

func (mlc *MultiLevelCache) Get(key string) (interface{}, bool)

Get retrieves data from the cache

func (*MultiLevelCache) Prefetch ¶

func (mlc *MultiLevelCache) Prefetch(keys []string)

Prefetch prefetches page data for the given keys

func (*MultiLevelCache) Put ¶

func (mlc *MultiLevelCache) Put(key string, value interface{})

Put stores a value in the cache

func (*MultiLevelCache) Stats ¶

func (mlc *MultiLevelCache) Stats() map[string]uint64

Stats returns cache statistics

type OptimizedCMapCache ¶ added in v1.2.8

type OptimizedCMapCache struct {
	// contains filtered or unexported fields
}

OptimizedCMapCache provides high-performance CMap caching with:
- Lock-free read path using atomic operations
- Sharded design to reduce lock contention (8 shards)
- Zero-allocation fast path for cache hits
- LRU eviction with atomic operations

func GetGlobalCMapCache ¶ added in v1.2.8

func GetGlobalCMapCache() *OptimizedCMapCache

GetGlobalCMapCache returns the global CMap cache

func NewOptimizedCMapCache ¶ added in v1.2.8

func NewOptimizedCMapCache(maxEntries int) *OptimizedCMapCache

NewOptimizedCMapCache creates a new optimized CMap cache

func (*OptimizedCMapCache) Get ¶ added in v1.2.8

func (c *OptimizedCMapCache) Get(key string) (*CMap, bool)

Get retrieves a CMap from cache with lock-free fast path

func (*OptimizedCMapCache) GetStats ¶ added in v1.2.8

func (c *OptimizedCMapCache) GetStats() (hits, misses uint64)

GetStats returns cache statistics

func (*OptimizedCMapCache) Put ¶ added in v1.2.8

func (c *OptimizedCMapCache) Put(key string, cmap *CMap)

Put adds a CMap to the cache

func (*OptimizedCMapCache) Release ¶ added in v1.2.8

func (c *OptimizedCMapCache) Release(key string)

Release decrements reference count
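To make the "sharded design" concrete, here is a minimal sketch of key-to-shard routing. It is deliberately simpler than what the description above claims for the library: it uses a read-write mutex per shard rather than a lock-free read path, stores plain strings instead of *CMap values, and has no LRU eviction or reference counting. The names shardedCache, shard, and shardFor are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 8 // matches the shard count quoted in the description

// shard is one independently locked partition of the cache.
type shard struct {
	mu sync.RWMutex
	m  map[string]string
}

// shardedCache is a hypothetical sharded map; it is not OptimizedCMapCache.
type shardedCache struct {
	shards [shardCount]shard
}

func newShardedCache() *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]string)
	}
	return c
}

// shardFor hashes the key with FNV-1a so that a given key always maps
// to the same shard and unrelated keys rarely contend on one lock.
func (c *shardedCache) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%shardCount]
}

func (c *shardedCache) Put(key, value string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key] = value
	s.mu.Unlock()
}

func (c *shardedCache) Get(key string) (string, bool) {
	s := c.shardFor(key)
	s.mu.RLock()
	v, ok := s.m[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	c := newShardedCache()
	c.Put("Identity-H", "cmap data")
	v, ok := c.Get("Identity-H")
	fmt.Println(v, ok)
}
```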

type OptimizedFontCache ¶ added in v1.0.1

type OptimizedFontCache struct {
	// contains filtered or unexported fields
}

OptimizedFontCache implements an ultra-high-performance font cache with:
- Lock-free read path using atomic operations
- Sharded design to reduce lock contention (16 shards)
- Zero-allocation fast path for cache hits
- Inline LRU using lock-free linked list approximation
- Pre-allocated pools for metadata structs
- SIMD-friendly memory layout

func NewOptimizedFontCache ¶ added in v1.0.1

func NewOptimizedFontCache(totalCapacity int) *OptimizedFontCache

NewOptimizedFontCache creates a new optimized font cache

func (*OptimizedFontCache) Clear ¶ added in v1.0.1

func (ofc *OptimizedFontCache) Clear()

Clear removes all entries from all shards

func (*OptimizedFontCache) Get ¶ added in v1.0.1

func (ofc *OptimizedFontCache) Get(key string) (*Font, bool)

Get retrieves a font from the cache (lock-free fast path)

func (*OptimizedFontCache) GetOrCompute ¶ added in v1.0.1

func (ofc *OptimizedFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)

GetOrCompute retrieves a font from cache or computes it if not present
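The get-or-compute pattern can be sketched without the package as follows. This is an assumption-laden illustration: computeCache is a hypothetical type, values are strings rather than *Font, and a single mutex is held while compute runs, which is simpler (and slower under contention) than whatever the library does. Whether the library caches failed computations is not stated; this sketch does not.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// computeCache is a hypothetical get-or-compute cache.
type computeCache struct {
	mu sync.Mutex
	m  map[string]string
}

func newComputeCache() *computeCache {
	return &computeCache{m: make(map[string]string)}
}

// GetOrCompute returns the cached value for key, or calls compute once,
// stores the result, and returns it. Errors are returned but not cached,
// so a later call retries the computation.
func (c *computeCache) GetOrCompute(key string, compute func() (string, error)) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.m[key]; ok {
		return v, nil
	}
	v, err := compute()
	if err != nil {
		return "", err
	}
	c.m[key] = v
	return v, nil
}

func main() {
	c := newComputeCache()
	calls := 0
	parse := func() (string, error) {
		calls++
		return "parsed font", nil
	}
	c.GetOrCompute("F1", parse)
	c.GetOrCompute("F1", parse) // served from the cache
	fmt.Println("compute calls:", calls)

	_, err := c.GetOrCompute("F2", func() (string, error) {
		return "", errors.New("bad font dictionary")
	})
	fmt.Println("error:", err)
}
```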

func (*OptimizedFontCache) GetStats ¶ added in v1.0.1

func (ofc *OptimizedFontCache) GetStats() FontCacheStats

GetStats returns aggregated statistics across all shards

func (*OptimizedFontCache) Prefetch ¶ added in v1.0.1

func (ofc *OptimizedFontCache) Prefetch(keys []string, compute func(key string) (*Font, error))

Prefetch warms up the cache with multiple keys concurrently

func (*OptimizedFontCache) Remove ¶ added in v1.0.1

func (ofc *OptimizedFontCache) Remove(key string)

Remove removes a specific key from the cache

func (*OptimizedFontCache) Set ¶ added in v1.0.1

func (ofc *OptimizedFontCache) Set(key string, font *Font)

Set stores a font in the cache

type OptimizedMemoryPool ¶

type OptimizedMemoryPool struct {
	// contains filtered or unexported fields
}

OptimizedMemoryPool provides better memory pool management

func NewOptimizedMemoryPool ¶

func NewOptimizedMemoryPool(size int) *OptimizedMemoryPool

NewOptimizedMemoryPool creates a pool with size tracking

func (*OptimizedMemoryPool) Get ¶

func (omp *OptimizedMemoryPool) Get() []byte

Get retrieves a buffer from the pool

func (*OptimizedMemoryPool) Put ¶

func (omp *OptimizedMemoryPool) Put(bufPtr *[]byte)

Put returns a buffer to the pool, resetting it
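The Get/Put pairing above follows the usual sync.Pool idiom of pooling *[]byte rather than []byte. The sketch below mirrors the two signatures but is otherwise hypothetical: bufPool, the initial capacity, and the rule for dropping oversized buffers are assumptions, not the library's behaviour.

```go
package main

import (
	"fmt"
	"sync"
)

// bufPool is a hypothetical buffer pool built on sync.Pool.
type bufPool struct {
	size int
	pool sync.Pool
}

func newBufPool(size int) *bufPool {
	p := &bufPool{size: size}
	p.pool.New = func() interface{} {
		b := make([]byte, 0, size)
		return &b // pool pointers so Put does not allocate a slice header
	}
	return p
}

// Get returns an empty buffer with at least p.size bytes of capacity.
func (p *bufPool) Get() []byte {
	return (*p.pool.Get().(*[]byte))[:0]
}

// Put resets the buffer's length to zero and returns it to the pool.
// Buffers that grew far beyond the nominal size are dropped instead,
// so one large document does not pin memory indefinitely.
func (p *bufPool) Put(bufPtr *[]byte) {
	if bufPtr == nil {
		return
	}
	*bufPtr = (*bufPtr)[:0]
	if cap(*bufPtr) > 4*p.size {
		return
	}
	p.pool.Put(bufPtr)
}

func main() {
	p := newBufPool(64)
	buf := p.Get()
	buf = append(buf, "extracted text"...)
	fmt.Println(len(buf), cap(buf) >= 64)
	p.Put(&buf) // buf must not be used after this call
}
```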

type OptimizedSorter ¶

type OptimizedSorter struct {
	// contains filtered or unexported fields
}

OptimizedSorter provides optimized sorting algorithms for large text collections

func NewOptimizedSorter ¶

func NewOptimizedSorter() *OptimizedSorter

NewOptimizedSorter creates a new optimized sorter

func (*OptimizedSorter) QuickSortTexts ¶

func (os *OptimizedSorter) QuickSortTexts(texts []Text, less func(i, j int) bool)

QuickSortTexts implements quicksort for text collections

func (*OptimizedSorter) SortTextHorizontalByOptimized ¶

func (os *OptimizedSorter) SortTextHorizontalByOptimized(th TextHorizontal)

SortTextHorizontalByOptimized sorts TextHorizontal using optimized algorithm

func (*OptimizedSorter) SortTextVerticalByOptimized ¶

func (os *OptimizedSorter) SortTextVerticalByOptimized(tv TextVertical)

SortTextVerticalByOptimized sorts TextVertical using optimized algorithm

func (*OptimizedSorter) SortTexts ¶

func (os *OptimizedSorter) SortTexts(texts []Text, less func(i, j int) bool)

SortTexts sorts a collection of texts using the most appropriate algorithm

func (*OptimizedSorter) SortTextsWithAlgorithm ¶

func (os *OptimizedSorter) SortTextsWithAlgorithm(texts []Text, less func(i, j int) bool, algorithm string)

SortTextsWithAlgorithm allows choosing a specific sorting algorithm

type OptimizedTextClusterSorter ¶

type OptimizedTextClusterSorter struct {
	// contains filtered or unexported fields
}

OptimizedTextClusterSorter provides optimized sorting for text clusters

func NewOptimizedTextClusterSorter ¶

func NewOptimizedTextClusterSorter() *OptimizedTextClusterSorter

NewOptimizedTextClusterSorter creates a new optimized cluster sorter

func (*OptimizedTextClusterSorter) SortTextBlocks ¶

func (otcs *OptimizedTextClusterSorter) SortTextBlocks(blocks []*TextBlock, sortBy string)

SortTextBlocks sorts text blocks by various criteria

type Outline ¶

type Outline struct {
	Title string    // title for this element
	Child []Outline // child elements
}

An Outline is a tree describing the outline (also known as the table of contents) of a document.
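Because Outline is a plain recursive struct, walking it needs nothing more than a depth-first traversal. The sketch below uses a local mirror type (outline) with the same two fields so that it compiles on its own; flatten and sample are hypothetical helpers. It starts at the root's children on the assumption, stated in the documentation of Reader.Outline, that the root usually has no title.

```go
package main

import (
	"fmt"
	"strings"
)

// outline mirrors the two documented fields of Outline.
type outline struct {
	Title string
	Child []outline
}

// flatten returns one indented line per outline entry, depth first.
// The root node itself is skipped because it typically has no title.
func flatten(root outline) []string {
	var out []string
	var walk func(nodes []outline, depth int)
	walk = func(nodes []outline, depth int) {
		for _, n := range nodes {
			out = append(out, strings.Repeat("  ", depth)+n.Title)
			walk(n.Child, depth+1)
		}
	}
	walk(root.Child, 0)
	return out
}

// sample builds a small made-up outline for demonstration purposes.
func sample() outline {
	return outline{Child: []outline{
		{Title: "Chapter 1", Child: []outline{{Title: "Section 1.1"}}},
		{Title: "Chapter 2"},
	}}
}

func main() {
	for _, line := range flatten(sample()) {
		fmt.Println(line)
	}
}
```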

type PDFCompatibilityInfo ¶ added in v1.2.0

type PDFCompatibilityInfo struct {
	Version             PDFVersion
	IsLinearized        bool
	LinearizationParams map[string]interface{}
	SubFormat           string // "PDF/A", "PDF/X", or ""
	Encryption          string
	HasTransparency     bool
	HasLayers           bool
	HasForms            bool
	HasJavaScript       bool
	Warnings            []string
	Errors              []string
}

PDFCompatibilityInfo holds compatibility information

func CheckPDFCompatibility ¶ added in v1.2.0

func CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)

CheckPDFCompatibility analyzes a PDF file for compatibility

type PDFEncryptionInfo ¶ added in v1.2.8

type PDFEncryptionInfo struct {
	Version   EncryptionVersion
	Revision  EncryptionRevision
	Method    EncryptionMethod
	KeyLength int    // in bits
	O         []byte // Owner password hash
	U         []byte // User password hash
	P         uint32 // Permissions
	ID        []byte // Document ID
	OE        []byte // Owner encryption key (V5)
	UE        []byte // User encryption key (V5)
	Perms     []byte // Encrypted permissions (V5)
}

PDFEncryptionInfo contains encryption parameters

type PDFError ¶

type PDFError struct {
	Op   string // Operation that failed (e.g., "extract text", "parse font")
	Page int    // Page number where error occurred (0 if not page-specific)
	Path string // File path if applicable
	Err  error  // Underlying error
}

PDFError represents an error that occurred during PDF processing. It includes contextual information about where the error occurred.

func (*PDFError) Error ¶

func (e *PDFError) Error() string

func (*PDFError) Unwrap ¶

func (e *PDFError) Unwrap() error
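Because PDFError implements Unwrap, it composes with the standard errors.As and errors.Is helpers. The following sketch uses a local mirror type (pdfError) so that it is self-contained; the message format in Error and the helpers failingPage, isTruncated, and sampleError are made up for illustration and say nothing about what the library's Error method prints.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// pdfError mirrors the documented PDFError fields.
type pdfError struct {
	Op   string
	Page int
	Path string
	Err  error
}

// Error uses an assumed format; the real one is not documented here.
func (e *pdfError) Error() string {
	return fmt.Sprintf("%s (page %d): %v", e.Op, e.Page, e.Err)
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *pdfError) Unwrap() error { return e.Err }

// failingPage returns the page number carried by a pdfError anywhere in
// err's chain, or 0 when the chain contains none.
func failingPage(err error) int {
	var pe *pdfError
	if errors.As(err, &pe) {
		return pe.Page
	}
	return 0
}

// isTruncated reports whether the chain contains io.ErrUnexpectedEOF.
func isTruncated(err error) bool {
	return errors.Is(err, io.ErrUnexpectedEOF)
}

// sampleError builds a wrapped error the way a caller might receive it.
func sampleError() error {
	inner := &pdfError{Op: "extract text", Page: 7, Err: io.ErrUnexpectedEOF}
	return fmt.Errorf("batch failed: %w", inner)
}

func main() {
	err := sampleError()
	fmt.Println(err)
	fmt.Println("page:", failingPage(err), "truncated:", isTruncated(err))
}
```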

type PDFVersion ¶ added in v1.2.0

type PDFVersion struct {
	Major int
	Minor int
}

PDFVersion represents a PDF version

func (PDFVersion) IsSupported ¶ added in v1.2.0

func (v PDFVersion) IsSupported() bool

IsSupported checks if a version is supported

func (PDFVersion) String ¶ added in v1.2.0

func (v PDFVersion) String() string

String returns the version string

type Page ¶

type Page struct {
	V Value
	// contains filtered or unexported fields
}

A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.

func (Page) ClassifyTextBlocks ¶

func (p Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)

ClassifyTextBlocks is a convenience function that creates a classifier and runs classification

func (*Page) Cleanup ¶ added in v1.0.7

func (p *Page) Cleanup()

Cleanup releases resources held by the Page, specifically the fontCache reference. Call this after processing a page to prevent memory leaks in batch operations. This method is safe to call multiple times.

func (Page) Content ¶

func (p Page) Content() Content

Content returns the page's content.

func (Page) Font ¶

func (p Page) Font(name string) Font

Font returns the font with the given name associated with the page.

func (Page) Fonts ¶

func (p Page) Fonts() []string

Fonts returns a list of the fonts associated with the page.

func (*Page) GetPlainText ¶

func (p *Page) GetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)

GetPlainText returns all of the page's text without formatting. fonts can be passed in (to improve parsing performance) or left nil. ctx can be used to cancel the extraction operation (pass context.Background() if not needed).

func (*Page) GetPlainTextWithSmartOrdering ¶

func (p *Page) GetPlainTextWithSmartOrdering(ctx context.Context, fonts map[string]*Font) (string, error)

GetPlainTextWithSmartOrdering extracts plain text using an improved text ordering algorithm that handles multi-column layouts and complex reading orders. ctx can be used to cancel the extraction operation (pass context.Background() if not needed)

func (Page) GetTextByColumn ¶

func (p Page) GetTextByColumn() (Columns, error)

GetTextByColumn returns all of the page's text grouped by column

func (Page) GetTextByRow ¶

func (p Page) GetTextByRow() (Rows, error)

GetTextByRow returns all of the page's text grouped by row

func (Page) OptimizedGetPlainText ¶

func (p Page) OptimizedGetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)

OptimizedGetPlainText returns all of the page's text using optimized string building. This version uses object pools and pre-allocation to reduce memory allocations. ctx can be used to cancel the extraction operation (pass context.Background() if not needed).

func (Page) OptimizedGetTextByColumn ¶

func (p Page) OptimizedGetTextByColumn() (Columns, error)

OptimizedGetTextByColumn returns all of the page's text grouped by column using optimized allocation

func (Page) OptimizedGetTextByRow ¶

func (p Page) OptimizedGetTextByRow() (Rows, error)

OptimizedGetTextByRow returns all of the page's text grouped by row using optimized allocation

func (Page) Resources ¶

func (p Page) Resources() Value

Resources returns the resources dictionary associated with the page.

func (*Page) SetFontCache ¶ added in v1.0.1

func (p *Page) SetFontCache(cache *GlobalFontCache)

SetFontCache sets a font cache for this page to improve performance during text extraction by reusing parsed fonts.

Deprecated: Use SetFontCacheInterface for better flexibility.

func (*Page) SetFontCacheInterface ¶ added in v1.0.1

func (p *Page) SetFontCacheInterface(cache FontCacheInterface)

SetFontCacheInterface sets a font cache using the FontCacheInterface interface. This supports both GlobalFontCache and OptimizedFontCache.

type PageStream ¶

type PageStream struct {
	Page      Page
	PageNum   int
	HasText   bool
	TextCount int
}

PageStream represents one page in a stream of pages

type ParallelExtractor ¶ added in v1.0.2

type ParallelExtractor struct {
	// contains filtered or unexported fields
}

ParallelExtractor is an advanced extraction interface that combines all optimizations

Example (Basic) ¶

ExampleParallelExtractor_basic shows basic usage

// Create parallel extractor
extractor := NewParallelExtractor(4) // use 4 worker goroutines
defer extractor.Close()

// Note: actual usage requires creating Page objects
// pages := []Page{...}

ctx := context.Background()

// Simulate empty page list
var pages []Page

// Extract all pages
results, err := extractor.ExtractAllPages(ctx, pages)
if err != nil {
	fmt.Printf("Error: %v\n", err)
	return
}

fmt.Printf("Extracted %d pages\n", len(results))

Output:

Extracted 0 pages

func NewParallelExtractor ¶ added in v1.0.2

func NewParallelExtractor(workers int) *ParallelExtractor

NewParallelExtractor creates a parallel extractor with the given number of worker goroutines

func (*ParallelExtractor) Close ¶ added in v1.0.2

func (pe *ParallelExtractor) Close()

Close closes and cleans up resources

func (*ParallelExtractor) ExtractAllPages ¶ added in v1.0.2

func (pe *ParallelExtractor) ExtractAllPages(
	ctx context.Context,
	pages []Page,
) ([][]Text, error)

ExtractAllPages extracts all pages (using all optimizations)

func (*ParallelExtractor) GetCacheStats ¶ added in v1.0.2

func (pe *ParallelExtractor) GetCacheStats() ShardedCacheStats

GetCacheStats returns cache statistics

func (*ParallelExtractor) GetPrefetchStats ¶ added in v1.0.2

func (pe *ParallelExtractor) GetPrefetchStats() PrefetchStats

GetPrefetchStats returns prefetch statistics

type ParallelProcessor ¶

type ParallelProcessor struct {
	// contains filtered or unexported fields
}

ParallelProcessor handles multi-level parallel processing for PDF text extraction

func NewParallelProcessor ¶

func NewParallelProcessor(workers int) *ParallelProcessor

NewParallelProcessor creates a new parallel processor with the specified number of workers

func (*ParallelProcessor) ProcessPages ¶

func (pp *ParallelProcessor) ProcessPages(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)

ProcessPages processes multiple pages in parallel

func (*ParallelProcessor) ProcessTextBlocks ¶

func (pp *ParallelProcessor) ProcessTextBlocks(ctx context.Context, blocks []*TextBlock, processorFunc func(*TextBlock) (*TextBlock, error)) ([]*TextBlock, error)

ProcessTextBlocks processes multiple text blocks in parallel

func (*ParallelProcessor) ProcessTextInParallel ¶

func (pp *ParallelProcessor) ProcessTextInParallel(ctx context.Context, texts []Text, processorFunc func(Text) (Text, error)) ([]Text, error)

ProcessTextInParallel processes individual text elements in parallel
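The three Process methods share one shape: a slice of inputs, a per-item function, and a result slice plus an error. A bounded worker pool that preserves input order can be sketched as below. This is a generic illustration, not the library's scheduler: processAll and lengths are hypothetical, items are strings rather than Page or Text values, and stopping at the first error is an assumed policy.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// processAll applies fn to every item using at most `workers` goroutines.
// Results keep the order of items. The first error cancels remaining work.
func processAll(ctx context.Context, workers int, items []string, fn func(string) (int, error)) ([]int, error) {
	if workers < 1 {
		workers = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make([]int, len(items))
	jobs := make(chan int) // carries indices so each worker writes its own slot
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				v, err := fn(items[i])
				if err != nil {
					once.Do(func() { firstErr = err; cancel() })
					continue
				}
				results[i] = v
			}
		}()
	}
feed:
	for i := range items {
		select {
		case jobs <- i:
		case <-ctx.Done():
			break feed // stop handing out work after an error or cancellation
		}
	}
	close(jobs)
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	return results, nil
}

// lengths is a toy per-item function wrapper: it "processes" each word by
// measuring it and rejects empty words.
func lengths(words []string, workers int) ([]int, error) {
	return processAll(context.Background(), workers, words, func(s string) (int, error) {
		if s == "" {
			return 0, errors.New("empty item")
		}
		return len(s), nil
	})
}

func main() {
	got, err := lengths([]string{"a", "bb", "ccc"}, 2)
	fmt.Println(got, err)
}
```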

type ParallelTextExtractor ¶

type ParallelTextExtractor struct {
	// contains filtered or unexported fields
}

ParallelTextExtractor provides multi-level parallel extraction

func NewParallelTextExtractor ¶

func NewParallelTextExtractor(workers int) *ParallelTextExtractor

NewParallelTextExtractor creates a new parallel text extractor

func (*ParallelTextExtractor) ExtractWithParallelProcessing ¶

func (pte *ParallelTextExtractor) ExtractWithParallelProcessing(ctx context.Context, reader *Reader) ([]Text, error)

ExtractWithParallelProcessing extracts text using multi-level parallel processing

func (*ParallelTextExtractor) ParallelSort ¶

func (pte *ParallelTextExtractor) ParallelSort(ctx context.Context, texts []Text, less func(i, j int) bool) error

ParallelSort provides parallel sorting for large text collections

type ParseLimits ¶ added in v1.1.5

type ParseLimits struct {
	// MaxParseTime is the maximum time allowed for parsing a single page (0 = no limit)
	MaxParseTime time.Duration

	// MaxHexStringBytes is the maximum size for a single hex string (0 = no limit, default 10MB)
	MaxHexStringBytes int

	// MaxStreamBytes is the maximum size for a single stream (0 = no limit)
	MaxStreamBytes int64

	// CheckInterval specifies how often to check for cancellation during intensive loops
	// Higher values improve performance but reduce responsiveness to cancellation
	// Default: 1000 iterations
	CheckInterval int
}

ParseLimits defines resource limits for PDF parsing operations

func DefaultParseLimits ¶ added in v1.1.5

func DefaultParseLimits() ParseLimits

DefaultParseLimits returns sensible default limits
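The trade-off described for CheckInterval (fewer checks are cheaper, but cancellation is noticed later) can be seen in a small simulation. The functions scan and stoppedAt are hypothetical and do not call into this package; they only model a loop that polls a context every checkInterval iterations.

```go
package main

import (
	"context"
	"fmt"
)

// scan runs work for iterations 0..n-1 and polls ctx once every
// checkInterval iterations. It returns how many iterations completed.
func scan(ctx context.Context, n, checkInterval int, work func(i int)) (int, error) {
	if checkInterval < 1 {
		checkInterval = 1
	}
	for i := 0; i < n; i++ {
		if i%checkInterval == 0 {
			if err := ctx.Err(); err != nil {
				return i, err
			}
		}
		work(i)
	}
	return n, nil
}

// stoppedAt cancels the context during iteration cancelAt and reports the
// iteration at which the loop actually noticed and stopped.
func stoppedAt(n, checkInterval, cancelAt int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	done, _ := scan(ctx, n, checkInterval, func(i int) {
		if i == cancelAt {
			cancel()
		}
	})
	return done
}

func main() {
	// Cancellation is requested during iteration 13 in both runs.
	fmt.Println("interval 1: ", stoppedAt(100, 1, 13))
	fmt.Println("interval 10:", stoppedAt(100, 10, 13))
}
```

With an interval of 10 the loop keeps running until the next multiple of 10, which is the reduced responsiveness the field comment warns about.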

type PasswordAuth ¶ added in v1.2.8

type PasswordAuth struct {
	// contains filtered or unexported fields
}

PasswordAuth authenticates a password using the appropriate algorithm

func NewPasswordAuth ¶ added in v1.2.8

func NewPasswordAuth(info *PDFEncryptionInfo) *PasswordAuth

NewPasswordAuth creates a new password authenticator

func (*PasswordAuth) Authenticate ¶ added in v1.2.8

func (pa *PasswordAuth) Authenticate(password string) ([]byte, error)

Authenticate tries to authenticate with the given password as either user or owner

func (*PasswordAuth) AuthenticateOwner ¶ added in v1.2.8

func (pa *PasswordAuth) AuthenticateOwner(password string) ([]byte, error)

AuthenticateOwner authenticates an owner password

func (*PasswordAuth) AuthenticateUser ¶ added in v1.2.8

func (pa *PasswordAuth) AuthenticateUser(password string) ([]byte, error)

AuthenticateUser authenticates a user password

func (*PasswordAuth) ValidatePermissions ¶ added in v1.2.8

func (pa *PasswordAuth) ValidatePermissions(key []byte) error

ValidatePermissions validates the permissions field for V5 encryption

type PerformanceMetrics ¶

type PerformanceMetrics struct {
	ExtractDuration atomic.Int64 // nanoseconds
	ParseDuration   atomic.Int64
	SortDuration    atomic.Int64
	TotalAllocs     atomic.Uint64
	BytesAllocated  atomic.Uint64
	GoroutineCount  atomic.Int32
	CacheHitRate    atomic.Uint64 // percentage * 100
}

PerformanceMetrics is a performance metrics collector

func (*PerformanceMetrics) GetMetrics ¶

func (pm *PerformanceMetrics) GetMetrics() map[string]interface{}

GetMetrics returns a snapshot of the current metrics

func (*PerformanceMetrics) RecordAllocation ¶

func (pm *PerformanceMetrics) RecordAllocation(bytes uint64)

RecordAllocation records a memory allocation

func (*PerformanceMetrics) RecordExtractDuration ¶

func (pm *PerformanceMetrics) RecordExtractDuration(d time.Duration)

RecordExtractDuration records an extraction duration

type Point ¶

type Point struct {
	X float64
	Y float64
}

A Point represents an X, Y pair.

type PoolStats ¶ added in v1.0.1

type PoolStats struct {
	BucketSize int
	InUse      int // approximation, not perfectly accurate
}

PoolStats holds statistics about pool usage (for debugging/monitoring)

type PoolWarmer ¶ added in v1.0.2

type PoolWarmer struct {
	// contains filtered or unexported fields
}

PoolWarmer is a memory pool warmer. It pre-allocates and fills memory pools at application startup to reduce runtime allocation overhead.

func (*PoolWarmer) GetWarmupStats ¶ added in v1.0.2

func (pw *PoolWarmer) GetWarmupStats() WarmupStats

GetWarmupStats returns warmup statistics

func (*PoolWarmer) IsWarmed ¶ added in v1.0.2

func (pw *PoolWarmer) IsWarmed() bool

IsWarmed reports whether the pools have been warmed up

func (*PoolWarmer) Reset ¶ added in v1.0.2

func (pw *PoolWarmer) Reset()

Reset resets the warmup state

func (*PoolWarmer) Warmup ¶ added in v1.0.2

func (pw *PoolWarmer) Warmup(config *WarmupConfig) error

Warmup performs memory pool warmup

type PredefinedCMap ¶ added in v1.2.8

type PredefinedCMap struct {
	*CMap
}

PredefinedCMap represents a predefined Adobe CMap

func GetPredefinedCMap ¶ added in v1.2.8

func GetPredefinedCMap(name string) *PredefinedCMap

GetPredefinedCMap retrieves a predefined CMap by name

type PrefetchItem ¶ added in v1.0.2

type PrefetchItem struct {
	// contains filtered or unexported fields
}

PrefetchItem is an item in the prefetch queue

type PrefetchQueue ¶ added in v1.0.2

type PrefetchQueue struct {
	// contains filtered or unexported fields
}

PrefetchQueue is a priority queue of prefetch items

func (*PrefetchQueue) Len ¶ added in v1.0.2

func (pq *PrefetchQueue) Len() int

func (*PrefetchQueue) Less ¶ added in v1.0.2

func (pq *PrefetchQueue) Less(i, j int) bool

func (*PrefetchQueue) Pop ¶ added in v1.0.2

func (pq *PrefetchQueue) Pop() interface{}

func (*PrefetchQueue) Push ¶ added in v1.0.2

func (pq *PrefetchQueue) Push(x interface{})

func (*PrefetchQueue) Swap ¶ added in v1.0.2

func (pq *PrefetchQueue) Swap(i, j int)
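The five methods above (Len, Less, Swap, Push, Pop) have the same signatures as the standard library's heap.Interface, which suggests the queue is meant to be driven through container/heap. The sketch below shows that pattern with hypothetical types (item, prioQueue) and helpers (build, drain); the fields of PrefetchItem are unexported, so the key and priority fields, and the choice that a larger priority is served first, are assumptions.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a hypothetical queue element.
type item struct {
	key      string
	priority int
}

// prioQueue implements heap.Interface over a slice of items.
type prioQueue []*item

func (pq prioQueue) Len() int { return len(pq) }

// Less orders by descending priority, so heap.Pop yields the most
// urgent item first.
func (pq prioQueue) Less(i, j int) bool { return pq[i].priority > pq[j].priority }

func (pq prioQueue) Swap(i, j int) { pq[i], pq[j] = pq[j], pq[i] }

// Push and Pop are called by package heap, not directly by users.
func (pq *prioQueue) Push(x interface{}) { *pq = append(*pq, x.(*item)) }

func (pq *prioQueue) Pop() interface{} {
	old := *pq
	n := len(old)
	it := old[n-1]
	old[n-1] = nil // avoid holding a stale reference
	*pq = old[:n-1]
	return it
}

// build inserts the items through heap.Push so the heap invariant holds.
func build(items ...*item) *prioQueue {
	pq := &prioQueue{}
	for _, it := range items {
		heap.Push(pq, it)
	}
	return pq
}

// drain removes every item in priority order and returns the keys.
func drain(pq *prioQueue) []string {
	var keys []string
	for pq.Len() > 0 {
		keys = append(keys, heap.Pop(pq).(*item).key)
	}
	return keys
}

func main() {
	pq := build(&item{"Helvetica", 1}, &item{"Times", 5}, &item{"Courier", 3})
	fmt.Println(drain(pq))
}
```

Calling the Push and Pop methods directly, instead of heap.Push and heap.Pop, would bypass the re-ordering step and break the priority guarantee.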

type PrefetchStats ¶ added in v1.0.2

type PrefetchStats struct {
	PatternsTracked int
	QueueSize       int
	Enabled         bool
}

PrefetchStats holds prefetch statistics

type RTreeNode ¶

type RTreeNode struct {
	// contains filtered or unexported fields
}

RTreeNode represents a node in the R-tree

type RTreeSpatialIndex ¶

type RTreeSpatialIndex struct {
	// contains filtered or unexported fields
}

RTreeSpatialIndex provides a more sophisticated spatial index using a proper R-tree implementation

func NewRTreeSpatialIndex ¶

func NewRTreeSpatialIndex(texts []Text) *RTreeSpatialIndex

NewRTreeSpatialIndex creates a new R-tree based spatial index

func (*RTreeSpatialIndex) Insert ¶

func (rt *RTreeSpatialIndex) Insert(text Text)

Insert adds a text element to the R-tree

func (*RTreeSpatialIndex) Query ¶

func (rt *RTreeSpatialIndex) Query(bounds Rect) []Text

Query returns all text elements that intersect with the given bounds

type Reader ¶

type Reader struct {
	// contains filtered or unexported fields
}

A Reader is a single PDF file open for reading.

func NewReader ¶

func NewReader(f io.ReaderAt, size int64) (*Reader, error)

NewReader opens a file for reading, using the data in f with the given total size.

func NewReaderEncrypted ¶

func NewReaderEncrypted(f io.ReaderAt, size int64, pw func() string) (*Reader, error)

NewReaderEncrypted opens a file for reading, using the data in f with the given total size. If the PDF is encrypted, NewReaderEncrypted calls pw repeatedly to obtain passwords to try. If pw returns the empty string, NewReaderEncrypted stops trying to decrypt the file and returns an error.
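The pw callback contract (called repeatedly, with an empty string meaning "give up") maps naturally onto a closure over a list of candidates. The sketch below is self-contained and does not open a PDF: passwordSource is a hypothetical helper, and tryPasswords only simulates the retry loop described above so that the contract can be exercised.

```go
package main

import "fmt"

// passwordSource returns a callback that yields each candidate once and
// then the empty string, which signals that no passwords remain.
func passwordSource(candidates ...string) func() string {
	i := 0
	return func() string {
		if i >= len(candidates) {
			return ""
		}
		p := candidates[i]
		i++
		return p
	}
}

// tryPasswords simulates a caller of pw: it keeps asking for passwords
// until accept succeeds or pw returns the empty string.
func tryPasswords(pw func() string, accept func(string) bool) (attempts int, ok bool) {
	for {
		p := pw()
		if p == "" {
			return attempts, false
		}
		attempts++
		if accept(p) {
			return attempts, true
		}
	}
}

// attemptsFor reports how many candidates were tried before the correct
// password was found, and whether it was found at all.
func attemptsFor(correct string, candidates ...string) (int, bool) {
	return tryPasswords(passwordSource(candidates...), func(p string) bool {
		return p == correct
	})
}

func main() {
	n, ok := attemptsFor("s3cret", "guess1", "guess2", "s3cret")
	fmt.Println(n, ok)
}
```

A closure like the one returned by passwordSource has the func() string shape that the pw parameter expects.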

func NewReaderEncryptedWithMmap ¶

func NewReaderEncryptedWithMmap(f io.ReaderAt, size int64, pw func() string) (*Reader, error)

NewReaderEncryptedWithMmap opens a file for reading with memory mapping for large files. If the file size exceeds 10MB, it uses memory mapping to reduce memory usage. This is a wrapper around NewReaderEncrypted that optimizes for large files.

func NewReaderLinearized ¶ added in v1.2.0

func NewReaderLinearized(f io.ReaderAt, size int64, pw func() string) (*Reader, error)

NewReaderLinearized creates a reader optimized for linearized PDFs

func Open ¶

func Open(file string) (*os.File, *Reader, error)

Open opens a file for reading.

func RecoverPDF ¶ added in v1.2.2

func RecoverPDF(f io.ReaderAt, size int64, opts *RecoveryOptions) (*Reader, error)

RecoverPDF attempts to recover a malformed PDF

func (*Reader) BatchExtractText ¶

func (r *Reader) BatchExtractText(pageNums []int, useLazy bool) (map[int]string, error)

BatchExtractText extracts text from multiple pages using lazy loading and object pooling. This is optimized for processing many pages without keeping them all in memory.

func (*Reader) ClearCache ¶ added in v1.0.6

func (r *Reader) ClearCache()

ClearCache clears the object cache, releasing all cached objects. This is useful for freeing memory after batch processing large PDFs.

func (*Reader) Close ¶ added in v1.0.2

func (r *Reader) Close() error

Close closes the Reader and releases associated resources. If the underlying ReaderAt implements io.Closer, it will be closed.

func (*Reader) ExtractAllPagesParallel ¶ added in v1.0.2

func (r *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)

ExtractAllPagesParallel extracts the text of all pages using the enhanced parallel extractor. This method integrates all performance optimizations: sharded cache, font prefetch, zero-copy, etc.

Example ¶

ExampleReader_ExtractAllPagesParallel uses Reader's parallel extraction method

// Note: this example requires an actual PDF file,
// so it only illustrates API usage.

/*
	// Open PDF file
	f, r, err := Open("document.pdf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Create context
	ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
	defer cancel()

	// Parallel extract all page texts
	pages, err := r.ExtractAllPagesParallel(ctx, 0) // 0 = auto-detect CPU core count
	if err != nil {
		panic(err)
	}

	// Output text for each page
	for i, pageText := range pages {
		fmt.Printf("Page %d has %d characters\n", i+1, len(pageText))
	}
*/

func (*Reader) ExtractPagesBatch ¶ added in v1.0.1

func (r *Reader) ExtractPagesBatch(opts BatchExtractOptions) <-chan BatchResult

ExtractPagesBatch extracts text from multiple pages in batches. This is optimized for high-throughput scenarios with many pages.

Example ¶

// This example shows how to use batch extraction
// (requires a real PDF file to run)

// f, r, err := Open("document.pdf")
// if err != nil {
//     log.Fatal(err)
// }
// defer f.Close()
//
// opts := BatchExtractOptions{
//     Workers: 4,
//     Pages:   []int{1, 2, 3, 4, 5}, // Extract first 5 pages
// }
//
// for result := range r.ExtractPagesBatch(opts) {
//     if result.Error != nil {
//         log.Printf("Error on page %d: %v", result.PageNum, result.Error)
//         continue
//     }
//     fmt.Printf("Page %d: %d characters\n", result.PageNum, len(result.Text))
// }

func (*Reader) ExtractPagesBatchToString ¶ added in v1.0.1

func (r *Reader) ExtractPagesBatchToString(opts BatchExtractOptions) (string, error)

ExtractPagesBatchToString is a convenience function that collects all results into a single string

Example ¶

// This example shows how to extract all pages to a single string

// f, r, err := Open("document.pdf")
// if err != nil {
//     log.Fatal(err)
// }
// defer f.Close()
//
// opts := BatchExtractOptions{
//     Workers:       8,
//     SmartOrdering: true,
// }
//
// text, err := r.ExtractPagesBatchToString(opts)
// if err != nil {
//     log.Fatal(err)
// }
//
// fmt.Printf("Extracted %d characters from %d pages\n", len(text), r.NumPage())

func (*Reader) ExtractStructuredBatch ¶ added in v1.0.1

func (r *Reader) ExtractStructuredBatch(opts BatchExtractOptions) <-chan StructuredBatchResult

ExtractStructuredBatch extracts structured text in batches

func (*Reader) ExtractWithContext ¶

func (r *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)

ExtractWithContext extracts plain text from all pages with cancellation support

func (*Reader) GetCacheCapacity ¶ added in v1.0.6

func (r *Reader) GetCacheCapacity() int

GetCacheCapacity returns the current object cache capacity. Returns 0 if no capacity limit is set (unbounded cache).

func (*Reader) GetCompatibilityInfo ¶ added in v1.2.0

func (r *Reader) GetCompatibilityInfo() *PDFCompatibilityInfo

GetCompatibilityInfo returns compatibility information for the PDF

func (*Reader) GetMetadata ¶

func (r *Reader) GetMetadata() (Metadata, error)

GetMetadata extracts metadata from the PDF document

func (*Reader) GetPlainText ¶

func (r *Reader) GetPlainText() (reader io.Reader, err error)

GetPlainText returns all the text in the PDF file

func (*Reader) GetPlainTextConcurrent ¶

func (r *Reader) GetPlainTextConcurrent(workers int) (io.Reader, error)

GetPlainTextConcurrent extracts all pages concurrently using the specified number of workers.

func (*Reader) GetStyledTexts ¶

func (r *Reader) GetStyledTexts() (sentences []Text, err error)

GetStyledTexts returns all sentences in the document as a slice of Text values, including their style information.

func (*Reader) NumPage ¶

func (r *Reader) NumPage() int

NumPage returns the number of pages in the PDF file.

func (*Reader) Outline ¶

func (r *Reader) Outline() Outline

Outline returns the document outline. The Outline returned is the root of the outline tree and typically has no Title itself. That is, the children of the returned root are the top-level entries in the outline.

func (*Reader) Page ¶

func (r *Reader) Page(num int) Page

Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns a Page with p.V.IsNull().

func (*Reader) SetCacheCapacity ¶

func (r *Reader) SetCacheCapacity(n int)

func (*Reader) SetMetadata ¶

func (r *Reader) SetMetadata(meta Metadata) error

SetMetadata sets metadata fields in the PDF (for future write support). It is currently not implemented because the library is read-only.

func (*Reader) Trailer ¶

func (r *Reader) Trailer() Value

Trailer returns the file's Trailer value.

type RecoveryOptions ¶ added in v1.2.2

type RecoveryOptions struct {
	// MaxSearchSize limits how many bytes to search for recovery
	MaxSearchSize int64
	// AllowTruncated attempts to recover truncated files
	AllowTruncated bool
	// AllowMissingXref attempts to rebuild xref from object markers
	AllowMissingXref bool
	// AllowMissingTrailer attempts to recover without trailer
	AllowMissingTrailer bool
	// Verbose enables detailed recovery logging
	Verbose bool
}

RecoveryOptions controls how PDF recovery is attempted

func DefaultRecoveryOptions ¶ added in v1.2.2

func DefaultRecoveryOptions() *RecoveryOptions

DefaultRecoveryOptions returns sensible defaults for recovery

type Rect ¶

type Rect struct {
	Min, Max Point
}

A Rect represents a rectangle.

type ResourceManager ¶

type ResourceManager struct {
	// contains filtered or unexported fields
}

ResourceManager provides automatic resource cleanup

func NewResourceManager ¶

func NewResourceManager() *ResourceManager

NewResourceManager creates a new resource manager

func (*ResourceManager) Add ¶

func (rm *ResourceManager) Add(resource io.Closer)

Add adds a resource to be managed

func (*ResourceManager) Close ¶

func (rm *ResourceManager) Close() error

Close closes all managed resources

type ResultCache ¶

type ResultCache struct {
	// contains filtered or unexported fields
}

ResultCache provides caching for parsed and classified results

func GetGlobalCache ¶

func GetGlobalCache() *ResultCache

GetGlobalCache returns a singleton cache instance

func NewResultCache ¶

func NewResultCache(maxSize int64, ttl time.Duration, policy string) *ResultCache

NewResultCache creates a new result cache with specified parameters

func (*ResultCache) Clear ¶

func (rc *ResultCache) Clear()

Clear removes all items from the cache

func (*ResultCache) Close ¶ added in v1.0.5

func (rc *ResultCache) Close()

Close stops the cleanup goroutine and releases resources

func (*ResultCache) Get ¶

func (rc *ResultCache) Get(key string) (interface{}, bool)

Get retrieves an item from the cache

func (*ResultCache) GetHitRatio ¶

func (rc *ResultCache) GetHitRatio() float64

GetHitRatio returns the cache hit ratio

func (*ResultCache) GetStats ¶

func (rc *ResultCache) GetStats() CacheStats

GetStats returns cache statistics

func (*ResultCache) Has ¶

func (rc *ResultCache) Has(key string) bool

Has checks if a key exists in the cache (without updating access stats)

func (*ResultCache) Put ¶

func (rc *ResultCache) Put(key string, value interface{})

Put adds an item to the cache

func (*ResultCache) Remove ¶

func (rc *ResultCache) Remove(key string) bool

Remove removes an item from the cache

type Row ¶

type Row struct {
	Position int64
	Content  TextHorizontal
}

Row represents the contents of a row

type Rows ¶

type Rows []*Row

Rows is a list of rows

type ShardedCache ¶ added in v1.0.2

type ShardedCache struct {
	// contains filtered or unexported fields
}

ShardedCache implements a high-performance sharded cache with the following features:
- 256 shards to minimize lock contention
- Independent locks and LRU linked lists for each shard
- Statistics implemented with atomic operations
- Adaptive eviction strategy
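To make the sharding idea concrete, here is a minimal sketch of a cache with 256 independently locked shards and atomic hit/miss counters. It is not the library's code: `miniShardedCache` is a hypothetical type, the FNV hash is an assumed choice, and the LRU lists, TTL, and eviction are left out. It assumes Go 1.19 or later for `atomic.Uint64`.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
	"sync/atomic"
)

const shardCount = 256 // shard count taken from the description

// shard is one independently locked partition.
type shard struct {
	mu    sync.RWMutex
	items map[string]interface{}
}

// miniShardedCache is a hypothetical, simplified stand-in for ShardedCache.
type miniShardedCache struct {
	shards [shardCount]shard
	hits   atomic.Uint64
	misses atomic.Uint64
}

func newMiniShardedCache() *miniShardedCache {
	c := &miniShardedCache{}
	for i := range c.shards {
		c.shards[i].items = make(map[string]interface{})
	}
	return c
}

// shardFor hashes the key (FNV-1a, an assumed choice) to select a shard.
func (c *miniShardedCache) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%shardCount]
}

// Set stores v under key, locking only the shard that owns the key.
func (c *miniShardedCache) Set(key string, v interface{}) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.items[key] = v
	s.mu.Unlock()
}

// Get looks up key and updates the counters without taking any extra lock.
func (c *miniShardedCache) Get(key string) (interface{}, bool) {
	s := c.shardFor(key)
	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	if ok {
		c.hits.Add(1)
	} else {
		c.misses.Add(1)
	}
	return v, ok
}

func main() {
	c := newMiniShardedCache()
	c.Set("page:1", "hello")
	v, ok := c.Get("page:1")
	_, found := c.Get("page:2")
	fmt.Println(v, ok, found, c.hits.Load(), c.misses.Load())
}
```

Because two keys rarely hash to the same shard, goroutines working on different keys seldom contend for the same lock.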

func NewShardedCache ¶ added in v1.0.2

func NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache

NewShardedCache creates a new sharded cache

func (*ShardedCache) Clear ¶ added in v1.0.2

func (sc *ShardedCache) Clear()

Clear removes all entries from the cache

func (*ShardedCache) Close ¶ added in v1.0.5

func (sc *ShardedCache) Close()

Close stops cleanup goroutine and releases resources

func (*ShardedCache) Delete ¶ added in v1.0.2

func (sc *ShardedCache) Delete(key string)

Delete removes the entry for key from the cache

func (*ShardedCache) Get ¶ added in v1.0.2

func (sc *ShardedCache) Get(key string) (interface{}, bool)

Get returns the value stored for key and reports whether it was found

func (*ShardedCache) GetStats ¶ added in v1.0.2

func (sc *ShardedCache) GetStats() ShardedCacheStats

GetStats returns cache statistics

func (*ShardedCache) Set ¶ added in v1.0.2

func (sc *ShardedCache) Set(key string, value interface{}, size int64)

Set stores value in the cache under key

type ShardedCacheEntry ¶ added in v1.0.2

type ShardedCacheEntry struct {
	// contains filtered or unexported fields
}

ShardedCacheEntry represents a cache entry

type ShardedCacheStats ¶ added in v1.0.2

type ShardedCacheStats struct {
	Hits      uint64
	Misses    uint64
	Evictions uint64
	Entries   int64
	Size      int64
}

ShardedCacheStats holds cache statistics

type SizedBytePool ¶ added in v1.0.1

type SizedBytePool struct {
	// contains filtered or unexported fields
}

SizedBytePool implements a multi-level size-bucketed object pool for byte slices. It reduces memory allocation overhead by reusing buffers of appropriate sizes.

Size buckets: 16B, 32B, 64B, 128B, 256B, 512B, 1KB, 4KB
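As an illustration of size bucketing, the sketch below maps a requested size to the smallest listed bucket that can hold it. The `bucketFor` helper is hypothetical, and returning -1 for oversized requests (presumably served by a plain allocation) is an assumption, not documented behavior.

```go
package main

import "fmt"

// bucketSizes mirrors the size buckets listed above, in bytes.
var bucketSizes = []int{16, 32, 64, 128, 256, 512, 1024, 4096}

// bucketFor returns the index of the smallest bucket that can hold size
// bytes, or -1 when size exceeds the largest bucket (assumed fallback case).
func bucketFor(size int) int {
	for i, b := range bucketSizes {
		if size <= b {
			return i
		}
	}
	return -1
}

func main() {
	for _, n := range []int{10, 16, 17, 1025, 5000} {
		fmt.Printf("request %d -> bucket index %d\n", n, bucketFor(n))
	}
}
```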

func NewSizedBytePool ¶ added in v1.0.1

func NewSizedBytePool() *SizedBytePool

NewSizedBytePool creates a new sized byte pool with 8 size buckets

func (*SizedBytePool) Get ¶ added in v1.0.1

func (sp *SizedBytePool) Get(size int) []byte

Get retrieves a byte slice from the appropriate size bucket. It returns a buffer with at least the requested capacity.

func (*SizedBytePool) Put ¶ added in v1.0.1

func (sp *SizedBytePool) Put(buf []byte)

Put returns a byte slice to the appropriate pool. The slice is cleared before being returned to the pool.

type SizedPool ¶

type SizedPool struct {
	// contains filtered or unexported fields
}

SizedPool is a fine-grained object pool with multi-level size bucketing

func NewSizedPool ¶

func NewSizedPool() *SizedPool

func (*SizedPool) Get ¶

func (sp *SizedPool) Get(size int) []byte

func (*SizedPool) Put ¶

func (sp *SizedPool) Put(bufPtr *[]byte)

type SizedTextSlicePool ¶ added in v1.0.1

type SizedTextSlicePool struct {
	// contains filtered or unexported fields
}

SizedTextSlicePool implements a size-bucketed pool for Text slices. It is similar to SizedBytePool but holds []Text instead of []byte.

func NewSizedTextSlicePool ¶ added in v1.0.1

func NewSizedTextSlicePool() *SizedTextSlicePool

NewSizedTextSlicePool creates a new sized text slice pool. Buckets: 8, 16, 32, 64, 128, 256 texts.

func (*SizedTextSlicePool) Get ¶ added in v1.0.1

func (sp *SizedTextSlicePool) Get(size int) []Text

Get retrieves a Text slice from the appropriate size bucket

func (*SizedTextSlicePool) Put ¶ added in v1.0.1

func (sp *SizedTextSlicePool) Put(slice []Text)

Put returns a Text slice to the appropriate pool

type SortStrategy ¶ added in v1.0.1

type SortStrategy int

SortStrategy represents different sorting algorithms available

const (
	StrategyAuto      SortStrategy = iota // Automatically select best algorithm
	StrategyRadix                         // Radix sort for numeric keys
	StrategyQuick                         // Quicksort for general comparison
	StrategyInsertion                     // Insertion sort for small arrays
	StrategyStandard                      // Go standard library sort
)

type SortingMetrics ¶ added in v1.0.1

type SortingMetrics struct {
	RadixSortCount     int
	QuickSortCount     int
	InsertionSortCount int
	StandardSortCount  int
}

SortingMetrics tracks performance of different sorting strategies

func GetSortingMetrics ¶ added in v1.0.1

func GetSortingMetrics() SortingMetrics

GetSortingMetrics returns current sorting metrics

type SpatialGrid ¶ added in v1.2.3

type SpatialGrid struct {
	// contains filtered or unexported fields
}

SpatialGrid is a spatial partitioning structure for efficient neighbor search. It divides 2D space into a grid of cells, allowing O(1) cell lookup and reducing neighbor search from O(n²) to O(n) for uniformly distributed points.

func NewSpatialGrid ¶ added in v1.2.3

func NewSpatialGrid(blocks []*TextBlock, cellSize float64) *SpatialGrid

NewSpatialGrid creates a new spatial grid for the given blocks. cellSize determines the granularity of the grid; typically should be around 2-3x the expected cluster radius for optimal performance.
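The grid idea can be sketched in a few lines: each point is assigned to a cell by dividing its coordinates by the cell size, and a neighbor query only scans the 3x3 block of cells around the query point. The `grid`, `cell`, and `point` types are hypothetical, and a map keyed by cell coordinates is an assumed representation, not necessarily what SpatialGrid uses internally.

```go
package main

import (
	"fmt"
	"math"
)

type point struct{ X, Y float64 }

// cell identifies one grid cell by its integer coordinates.
type cell struct{ cx, cy int }

// grid is a hypothetical uniform grid storing point indices per cell.
type grid struct {
	size  float64
	cells map[cell][]int
	pts   []point
}

// cellOf computes the cell containing p in constant time.
func cellOf(p point, size float64) cell {
	return cell{int(math.Floor(p.X / size)), int(math.Floor(p.Y / size))}
}

func newGrid(pts []point, size float64) *grid {
	g := &grid{size: size, cells: make(map[cell][]int), pts: pts}
	for i, p := range pts {
		c := cellOf(p, size)
		g.cells[c] = append(g.cells[c], i)
	}
	return g
}

// nearby returns the indices stored in the 3x3 block of cells around point idx
// (the point itself is included).
func (g *grid) nearby(idx int) []int {
	c := cellOf(g.pts[idx], g.size)
	var out []int
	for dx := -1; dx <= 1; dx++ {
		for dy := -1; dy <= 1; dy++ {
			out = append(out, g.cells[cell{c.cx + dx, c.cy + dy}]...)
		}
	}
	return out
}

func main() {
	pts := []point{{5, 5}, {12, 8}, {95, 95}}
	g := newGrid(pts, 10)
	fmt.Println(len(g.nearby(0)), len(g.nearby(2)))
}
```

A cell size that is too small misses true neighbors outside the 3x3 block, while one that is too large makes every query scan most of the points, which is why the cell size is tied to the expected cluster radius.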

func (*SpatialGrid) GetNearbyBlocks ¶ added in v1.2.3

func (g *SpatialGrid) GetNearbyBlocks(blockIdx int) []int

GetNearbyBlocks returns indices of blocks in the same cell and neighboring cells. This is much faster than searching all blocks when they're uniformly distributed. Memory optimized: reuses internal buffer to reduce allocations. WARNING: The returned slice is reused on next call - copy if needed.
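The warning about the reused slice is worth a small demonstration. In the sketch below, a hypothetical `finder` type returns results in one shared buffer, the way the description says GetNearbyBlocks does; the example shows why a caller that needs to keep a result across calls should copy it first.

```go
package main

import "fmt"

// finder is a hypothetical type that reuses one internal buffer for every
// result it returns.
type finder struct{ buf []int }

func newFinder() *finder { return &finder{buf: make([]int, 0, 8)} }

// nearby writes base, base+1, base+2 into the shared buffer and returns it.
func (f *finder) nearby(base int) []int {
	f.buf = f.buf[:0]
	for i := 0; i < 3; i++ {
		f.buf = append(f.buf, base+i)
	}
	return f.buf
}

func main() {
	f := newFinder()
	first := f.nearby(0)
	kept := append([]int(nil), first...) // copy before the next call
	f.nearby(10)                         // overwrites the memory behind first
	fmt.Println(first, kept)             // prints: [10 11 12] [0 1 2]
}
```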

type SpatialIndex ¶

type SpatialIndex struct {
	// contains filtered or unexported fields
}

SpatialIndex provides spatial indexing for efficient text location queries. This is a simple grid-based implementation; for production use, consider a more sophisticated structure such as an R-tree.

func NewSpatialIndex ¶

func NewSpatialIndex(texts []Text) *SpatialIndex

NewSpatialIndex creates a new spatial index from text elements

func (*SpatialIndex) Query ¶

func (si *SpatialIndex) Query(bounds Rect) []Text

Query returns all text elements that potentially intersect with the given bounds

type SpatialIndexInterface ¶

type SpatialIndexInterface interface {
	Query(bounds Rect) []Text
	Insert(text Text)
}

SpatialIndexInterface allows either the grid or the R-tree implementation to be used

func NewSpatialIndexInterface ¶

func NewSpatialIndexInterface(texts []Text) SpatialIndexInterface

NewSpatialIndexInterface creates a spatial index interface (can be switched between implementations)

type Stack ¶

type Stack struct {
	// contains filtered or unexported fields
}

A Stack represents a stack of values.

func (*Stack) DrainTo ¶ added in v1.2.4

func (stk *Stack) DrainTo(dst []Value) []Value

DrainTo copies current stack content into dst (growing dst if needed) and resets the stack without extra allocations. The returned slice length equals the drained element count.
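The drain semantics can be pictured with the sketch below. It uses a hypothetical `intStack` holding ints instead of Values, and the exact growth behavior of the real method is an assumption: here `append` reuses dst's backing array when it is large enough and allocates only when it is not.

```go
package main

import "fmt"

// intStack is a hypothetical stand-in for Stack, holding ints instead of Values.
type intStack struct{ items []int }

func (s *intStack) Push(v int) { s.items = append(s.items, v) }

// DrainTo copies the stack content into dst, reusing dst's backing array when
// it has enough capacity, then resets the stack while keeping its capacity.
func (s *intStack) DrainTo(dst []int) []int {
	dst = append(dst[:0], s.items...)
	s.items = s.items[:0]
	return dst
}

func main() {
	s := &intStack{}
	s.Push(1)
	s.Push(2)
	s.Push(3)
	buf := make([]int, 0, 8) // reusable destination buffer
	out := s.DrainTo(buf)
	fmt.Println(out, len(out), len(s.items)) // prints: [1 2 3] 3 0
}
```

Passing the same destination buffer on every call lets a parsing loop drain operands repeatedly without allocating a fresh slice each time.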

func (*Stack) Len ¶

func (stk *Stack) Len() int

func (*Stack) Pop ¶

func (stk *Stack) Pop() Value

func (*Stack) Push ¶

func (stk *Stack) Push(v Value)

type StartupConfig ¶ added in v1.0.2

type StartupConfig struct {
	WarmupPools       bool
	WarmupConfig      *WarmupConfig
	PreallocateCaches bool
	FontCacheSize     int
	ResultCacheSize   int
	TuneGC            bool
	GCPercent         int
	MemoryBallast     int64
	SetMaxProcs       bool
	MaxProcs          int
}

StartupConfig holds the startup configuration

func DefaultStartupConfig ¶ added in v1.0.2

func DefaultStartupConfig() *StartupConfig

DefaultStartupConfig returns the default startup configuration, optimized for a lower memory footprint

type StreamProcessor ¶

type StreamProcessor struct {
	// contains filtered or unexported fields
}

StreamProcessor handles streaming processing of PDF content to minimize memory usage

func NewStreamProcessor ¶

func NewStreamProcessor(chunkSize, bufferSize int, maxMemory int64) *StreamProcessor

NewStreamProcessor creates a new streaming processor

func (*StreamProcessor) Close ¶

func (sp *StreamProcessor) Close()

Close releases resources used by the stream processor

func (*StreamProcessor) ProcessPageStream ¶

func (sp *StreamProcessor) ProcessPageStream(reader *Reader, handler func(PageStream) error) error

ProcessPageStream processes pages in a streaming fashion

func (*StreamProcessor) ProcessTextBlockStream ¶

func (sp *StreamProcessor) ProcessTextBlockStream(reader *Reader, handler func(TextBlockStream) error) error

ProcessTextBlockStream processes text blocks in a streaming fashion

func (*StreamProcessor) ProcessTextStream ¶

func (sp *StreamProcessor) ProcessTextStream(reader *Reader, handler func(TextStream) error) error

ProcessTextStream processes text in a streaming fashion

type StreamingBatchExtractor ¶ added in v1.0.1

type StreamingBatchExtractor struct {
	// contains filtered or unexported fields
}

StreamingBatchExtractor provides a streaming interface for batch extraction. This is useful for very large PDFs where you want to process results as they arrive.

Example ¶

// This example shows streaming batch extraction with a callback

// r, err := Open("document.pdf")
// if err != nil {
//     log.Fatal(err)
// }
// defer r.Close()
//
// ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
// defer cancel()
//
// opts := BatchExtractOptions{
//     Context: ctx,
//     Workers: 4,
// }
//
// extractor := NewStreamingBatchExtractor(r, opts)
// extractor.Start()
//
// err = extractor.ProcessAll(func(result BatchResult) error {
//     if result.Error != nil {
//         return result.Error
//     }
//     // Process each page as it arrives
//     fmt.Printf("Processing page %d...\n", result.PageNum)
//     return nil
// })
//
// if err != nil {
//     log.Fatal(err)
// }

func NewStreamingBatchExtractor ¶ added in v1.0.1

func NewStreamingBatchExtractor(r *Reader, opts BatchExtractOptions) *StreamingBatchExtractor

NewStreamingBatchExtractor creates a new streaming batch extractor

func (*StreamingBatchExtractor) Next ¶ added in v1.0.1

func (sbe *StreamingBatchExtractor) Next() *BatchResult

Next returns the next result, or nil if done

func (*StreamingBatchExtractor) ProcessAll ¶ added in v1.0.1

func (sbe *StreamingBatchExtractor) ProcessAll(callback func(BatchResult) error) error

ProcessAll processes all pages with a callback function

func (*StreamingBatchExtractor) Start ¶ added in v1.0.1

func (sbe *StreamingBatchExtractor) Start()

Start begins the extraction process

type StreamingMetadataExtractor ¶

type StreamingMetadataExtractor struct {
	// contains filtered or unexported fields
}

StreamingMetadataExtractor extracts metadata in a streaming fashion

func NewStreamingMetadataExtractor ¶

func NewStreamingMetadataExtractor(chunkSize, bufferSize int, maxMemory int64) *StreamingMetadataExtractor

NewStreamingMetadataExtractor creates a new streaming metadata extractor

func (*StreamingMetadataExtractor) ExtractMetadataStream ¶

func (sme *StreamingMetadataExtractor) ExtractMetadataStream(reader *Reader) (<-chan Metadata, <-chan error)

ExtractMetadataStream extracts metadata in a streaming way

type StreamingTextClassifier ¶

type StreamingTextClassifier struct {
	// contains filtered or unexported fields
}

StreamingTextClassifier classifies text in a streaming fashion to minimize memory usage

func NewStreamingTextClassifier ¶

func NewStreamingTextClassifier(chunkSize, bufferSize int, maxMemory int64) *StreamingTextClassifier

NewStreamingTextClassifier creates a new streaming text classifier

func (*StreamingTextClassifier) ClassifyTextStream ¶

func (stc *StreamingTextClassifier) ClassifyTextStream(reader *Reader) (<-chan ClassifiedBlock, <-chan error)

ClassifyTextStream classifies text in a streaming way

type StreamingTextExtractor ¶

type StreamingTextExtractor struct {
	// contains filtered or unexported fields
}

StreamingTextExtractor provides memory-efficient text extraction for large PDFs

func NewStreamingTextExtractor ¶

func NewStreamingTextExtractor(r *Reader, maxCachedPages int) *StreamingTextExtractor

NewStreamingTextExtractor creates a streaming extractor for large PDFs

func (*StreamingTextExtractor) Close ¶

func (e *StreamingTextExtractor) Close()

Close releases resources used by the extractor

func (*StreamingTextExtractor) GetProgress ¶

func (e *StreamingTextExtractor) GetProgress() float64

GetProgress returns the extraction progress (0.0 to 1.0)

func (*StreamingTextExtractor) NextBatch ¶

func (e *StreamingTextExtractor) NextBatch() (results map[int]string, hasMore bool, err error)

NextBatch extracts text from the next batch of pages

func (*StreamingTextExtractor) NextPage ¶

func (e *StreamingTextExtractor) NextPage() (pageNum int, text string, hasMore bool, err error)

NextPage extracts text from the next page

func (*StreamingTextExtractor) Reset ¶

func (e *StreamingTextExtractor) Reset()

Reset resets the extractor to the beginning

type StringBuffer ¶ added in v1.0.2

type StringBuffer struct {
	// contains filtered or unexported fields
}

StringBuffer is a string-building buffer that optimizes repeated concatenation

Example ¶

This example demonstrates usage of StringBuffer

builder := NewStringBuffer(100)

builder.WriteString("Hello")
builder.WriteByte(' ')
builder.WriteString("World")

result := builder.StringCopy()
fmt.Println(result)

Output:

Hello World

func NewStringBuffer ¶ added in v1.0.2

func NewStringBuffer(capacity int) *StringBuffer

NewStringBuffer creates a new string buffer

func (*StringBuffer) Bytes ¶ added in v1.0.2

func (sb *StringBuffer) Bytes() []byte

Bytes returns the underlying byte slice

func (*StringBuffer) Cap ¶ added in v1.0.2

func (sb *StringBuffer) Cap() int

Cap returns the capacity

func (*StringBuffer) Len ¶ added in v1.0.2

func (sb *StringBuffer) Len() int

Len returns the current length

func (*StringBuffer) Reset ¶ added in v1.0.2

func (sb *StringBuffer) Reset()

Reset resets the buffer

func (*StringBuffer) String ¶ added in v1.0.2

func (sb *StringBuffer) String() string

String returns the contents as a string without copying. Warning: do not use the StringBuffer after calling String.

func (*StringBuffer) StringCopy ¶ added in v1.0.2

func (sb *StringBuffer) StringCopy() string

StringCopy safely returns a copy of the contents as a string

func (*StringBuffer) WriteByte ¶ added in v1.0.2

func (sb *StringBuffer) WriteByte(b byte) error

WriteByte writes a single byte

func (*StringBuffer) WriteBytes ¶ added in v1.0.2

func (sb *StringBuffer) WriteBytes(b []byte)

WriteBytes writes a byte slice

func (*StringBuffer) WriteString ¶ added in v1.0.2

func (sb *StringBuffer) WriteString(s string)

WriteString writes a string

type StringBuilderPool ¶ added in v1.0.1

type StringBuilderPool struct {
	// contains filtered or unexported fields
}

StringBuilderPool provides size-aware string builder pooling

type StringPool ¶ added in v1.0.2

type StringPool struct {
	// contains filtered or unexported fields
}

StringPool is a string pool that reuses common strings

Example ¶

This example demonstrates usage of the string pool

pool := NewStringPool()

// Put commonly used strings into the pool
fontName1 := pool.Intern("Arial")
fontName2 := pool.Intern("Arial") // Repeated strings will be reused

fmt.Println(fontName1 == fontName2) // Contents are equal (== compares string values)
fmt.Println(pool.Size())

Output:

true
1

func NewStringPool ¶ added in v1.0.2

func NewStringPool() *StringPool

NewStringPool creates a new string pool

func (*StringPool) Clear ¶ added in v1.0.2

func (sp *StringPool) Clear()

Clear clears the pool

func (*StringPool) Intern ¶ added in v1.0.2

func (sp *StringPool) Intern(s string) string

Intern adds a string to the pool and returns the pooled version. Strings with the same content share memory.

func (*StringPool) Size ¶ added in v1.0.2

func (sp *StringPool) Size() int

Size returns the number of strings in the pool

type StructuredBatchResult ¶ added in v1.0.1

type StructuredBatchResult struct {
	PageNum int
	Blocks  []ClassifiedBlock
	Error   error
}

StructuredBatchResult holds the structured extraction result for a single page: the page number, its classified blocks, and any error encountered

type Task ¶

type Task interface {
	Execute() error
}

Task is the interface implemented by executable tasks

type Text ¶

type Text struct {
	Font      string  // the font used
	FontSize  float64 // the font size, in points (1/72 of an inch)
	X         float64 // the X coordinate, in points, increasing left to right
	Y         float64 // the Y coordinate, in points, increasing bottom to top
	W         float64 // the width of the text, in points
	S         string  // the actual UTF-8 text
	Vertical  bool    // whether the text is drawn vertically
	Bold      bool    // whether the text is bold
	Italic    bool    // whether the text is italic
	Underline bool    // whether the text is underlined
}

A Text represents a single piece of text drawn on a page.

func ConvertOptimizedSliceToText ¶ added in v1.2.3

func ConvertOptimizedSliceToText(texts []TextOptimized, pool *FontPool) []Text

ConvertOptimizedSliceToText converts a slice of TextOptimized to Text

func ConvertOptimizedToText ¶ added in v1.2.3

func ConvertOptimizedToText(t TextOptimized, pool *FontPool) Text

ConvertOptimizedToText converts a TextOptimized back to Text

func GetSizedTextSlice ¶ added in v1.0.1

func GetSizedTextSlice(size int) []Text

GetSizedTextSlice retrieves a Text slice from the global pool

func GetText ¶

func GetText() *Text

GetText retrieves a Text object from the pool

func GetTextBySize ¶

func GetTextBySize(contentLength int) *Text

GetTextBySize retrieves a Text object from the appropriate pool based on content size

func GetTextSlice ¶

func GetTextSlice(minCap int) []Text

GetTextSlice gets a Text slice from pool

type TextBlock ¶

type TextBlock struct {
	Texts       []Text
	MinX        float64
	MaxX        float64
	MinY        float64
	MaxY        float64
	AvgFontSize float64
	// contains filtered or unexported fields
}

TextBlock represents a coherent block of text (like a paragraph or column)

func ClusterTextBlocksOptimized ¶

func ClusterTextBlocksOptimized(texts []Text) []*TextBlock

ClusterTextBlocksOptimized clusters text blocks using a KD tree. This optimized version reduces temporary object allocations and uses an object pool.

func ClusterTextBlocksOptimizedV2 ¶ added in v1.2.3

func ClusterTextBlocksOptimizedV2(texts []Text) []*TextBlock

ClusterTextBlocksOptimizedV2 uses object pools to reduce GC pressure

func ClusterTextBlocksParallel ¶ added in v1.2.3

func ClusterTextBlocksParallel(texts []Text) []*TextBlock

ClusterTextBlocksParallel delegates to ParallelV2 for large inputs. This is the main entry point for parallel clustering.

func ClusterTextBlocksParallelV2 ¶ added in v1.2.3

func ClusterTextBlocksParallelV2(texts []Text) []*TextBlock

ClusterTextBlocksParallelV2 uses a work-partitioning strategy for parallel clustering. Each worker processes a chunk of blocks independently with local edge collection, then edges are merged sequentially. This avoids all lock contention.

func ClusterTextBlocksUltraOptimized ¶ added in v1.2.3

func ClusterTextBlocksUltraOptimized(texts []Text) []*TextBlock

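ClusterTextBlocksUltraOptimized is an extreme performance-optimized version. Goal: minimize memory allocations and GC pressure while maintaining parallel performance.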

func ClusterTextBlocksUltraV2 ¶ added in v1.2.4

func ClusterTextBlocksUltraV2(texts []Text) []*TextBlock

ClusterTextBlocksUltraV2 is an ultra-optimized parallel clustering algorithm. Key optimizations:
1. SOA data layout for SIMD-friendly access
2. Compact spatial grid with binary search (no map lookups in hot path)
3. Pre-allocated edge buffers (zero allocation in hot path)
4. Lock-free union-find with path compression
5. Minimized memory copies and indirections
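Of the optimizations listed, union-find is the piece that turns pairwise "these two blocks are close" edges into clusters. The sketch below is a plain single-threaded version, not the lock-free variant the description mentions. The `unionFind` type is hypothetical, and it uses path halving, one common form of path compression.

```go
package main

import "fmt"

// unionFind is a hypothetical, single-threaded disjoint-set structure.
type unionFind struct{ parent []int }

func newUnionFind(n int) *unionFind {
	u := &unionFind{parent: make([]int, n)}
	for i := range u.parent {
		u.parent[i] = i // every element starts as its own root
	}
	return u
}

// find returns the root of x, shortening the path as it walks (path halving).
func (u *unionFind) find(x int) int {
	for u.parent[x] != x {
		u.parent[x] = u.parent[u.parent[x]]
		x = u.parent[x]
	}
	return x
}

// union merges the sets containing a and b.
func (u *unionFind) union(a, b int) {
	ra, rb := u.find(a), u.find(b)
	if ra != rb {
		u.parent[rb] = ra
	}
}

func main() {
	u := newUnionFind(5)
	// Edges between nearby blocks: 0-1, 3-4, 1-4.
	u.union(0, 1)
	u.union(3, 4)
	u.union(1, 4)
	fmt.Println(u.find(0) == u.find(3), u.find(2) == u.find(0)) // prints: true false
}
```

Elements that end up with the same root form one cluster; block 2 has no edges, so it stays a cluster of its own.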

func ClusterTextBlocksV3 ¶ added in v1.2.3

func ClusterTextBlocksV3(texts []Text) []*TextBlock

ClusterTextBlocksV3 is an improved clustering algorithm that uses a spatial grid.
- Time complexity: O(n) for uniformly distributed blocks (vs. O(n²) for the naive approach)
- Space complexity: O(n) for the grid structure

func ClusterTextBlocksV3Fast ¶ added in v1.2.3

func ClusterTextBlocksV3Fast(texts []Text, maxClusters int) []*TextBlock

ClusterTextBlocksV3Fast is an even faster version with early termination. Suitable for very large documents where absolute precision is less critical.

func ClusterTextBlocksV4 ¶ added in v1.2.3

func ClusterTextBlocksV4(texts []Text) []*TextBlock

ClusterTextBlocksV4 automatically selects the best algorithm based on input size.

func GetTextBlock ¶ added in v1.2.3

func GetTextBlock() *TextBlock

GetTextBlock gets a TextBlock from pool

func (*TextBlock) Bounds ¶

func (tb *TextBlock) Bounds() Rect

Bounds returns the bounding box of the text block

func (*TextBlock) Center ¶

func (tb *TextBlock) Center() Point

Center returns the center point of the text block

func (*TextBlock) Height ¶

func (tb *TextBlock) Height() float64

Height returns the height of the text block

func (*TextBlock) Width ¶

func (tb *TextBlock) Width() float64

Width returns the width of the text block

type TextBlockStream ¶

type TextBlockStream struct {
	Block   *TextBlock
	PageNum int
	Type    BlockType
	Level   int
	Text    string
}

TextBlockStream represents a stream of text blocks

type TextClassifier ¶

type TextClassifier struct {
	// contains filtered or unexported fields
}

TextClassifier classifies text runs into semantic blocks

func NewTextClassifier ¶

func NewTextClassifier(texts []Text, pageWidth, pageHeight float64) *TextClassifier

NewTextClassifier creates a new text classifier

func (*TextClassifier) ClassifyBlocks ¶

func (tc *TextClassifier) ClassifyBlocks() []ClassifiedBlock

ClassifyBlocks classifies text runs into semantic blocks

type TextEncoding ¶

type TextEncoding interface {
	// Decode returns the UTF-8 text corresponding to
	// the sequence of code points in raw.
	Decode(raw string) (text string)
}

A TextEncoding represents a mapping between font code points and UTF-8 text.

func EnhancedCMapEncoding ¶ added in v1.2.8

func EnhancedCMapEncoding(name string) TextEncoding

EnhancedCMapEncoding returns a TextEncoding for the given CMap name, with enhanced support for CJK encodings

func LookupPredefinedCMap ¶ added in v1.2.8

func LookupPredefinedCMap(name string) TextEncoding

LookupPredefinedCMap looks up a CMap by name, checking both predefined and registered CMaps

type TextHorizontal ¶

type TextHorizontal []Text

TextHorizontal implements sort.Interface for sorting a slice of Text values in horizontal order, left to right, and then top to bottom within a column.
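A sort.Interface implementation with this kind of ordering might look like the sketch below. The `item` and `byColumn` types are hypothetical, and the comparison is an assumed reading of the description: smaller X first, and for equal X, larger Y first, because Y increases from bottom to top on a PDF page.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical, reduced version of Text.
type item struct {
	X, Y float64
	S    string
}

// byColumn orders items left to right, then top to bottom within the same X.
type byColumn []item

func (x byColumn) Len() int      { return len(x) }
func (x byColumn) Swap(i, j int) { x[i], x[j] = x[j], x[i] }
func (x byColumn) Less(i, j int) bool {
	if x[i].X != x[j].X {
		return x[i].X < x[j].X
	}
	return x[i].Y > x[j].Y // larger Y is higher on the page
}

// joined sorts the items and concatenates their text.
func joined(items []item) string {
	sort.Sort(byColumn(items))
	out := ""
	for _, it := range items {
		out += it.S
	}
	return out
}

func main() {
	items := []item{{72, 100, "c"}, {36, 500, "a"}, {72, 700, "b"}}
	fmt.Println(joined(items)) // prints: abc
}
```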

func (TextHorizontal) Len ¶

func (x TextHorizontal) Len() int

func (TextHorizontal) Less ¶

func (x TextHorizontal) Less(i, j int) bool

func (TextHorizontal) Swap ¶

func (x TextHorizontal) Swap(i, j int)

type TextOptimized ¶ added in v1.2.3

type TextOptimized struct {
	FontID   uint32  // Font ID from FontPool (4 bytes vs ~16+ bytes for string)
	X        float32 // X coordinate (4 bytes vs 8 bytes)
	Y        float32 // Y coordinate (4 bytes vs 8 bytes)
	FontSize float32 // Font size (4 bytes vs 8 bytes)
	W        float32 // Width (4 bytes vs 8 bytes)
	S        string  // Text content - unavoidable string allocation
	Flags    uint8   // Packed flags: bit0=Vertical, bit1=Bold, bit2=Italic, bit3=Underline
	// contains filtered or unexported fields
}

TextOptimized is a memory-optimized version of Text structure. It uses uint32 for font IDs (via FontPool) instead of storing full font names, uses float32 where precision allows, and packs boolean flags into a single byte. This reduces memory footprint by ~60% compared to the original Text structure.
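The packed-flags layout can be illustrated with ordinary bit operations. The bit positions below follow the comment on the Flags field (bit0=Vertical, bit1=Bold, bit2=Italic, bit3=Underline); the constant and helper names are hypothetical and are not the library's identifiers.

```go
package main

import "fmt"

// Bit positions taken from the Flags field comment; names are hypothetical.
const (
	flagVertical  uint8 = 1 << 0
	flagBold      uint8 = 1 << 1
	flagItalic    uint8 = 1 << 2
	flagUnderline uint8 = 1 << 3
)

// setFlag returns flags with bit set or cleared according to on.
func setFlag(flags, bit uint8, on bool) uint8 {
	if on {
		return flags | bit
	}
	return flags &^ bit
}

// hasFlag reports whether bit is set in flags.
func hasFlag(flags, bit uint8) bool { return flags&bit != 0 }

func main() {
	var f uint8
	f = setFlag(f, flagBold, true)
	f = setFlag(f, flagUnderline, true)
	f = setFlag(f, flagBold, false)
	fmt.Printf("%04b bold=%v underline=%v\n", f, hasFlag(f, flagBold), hasFlag(f, flagUnderline))
	// prints: 1000 bold=false underline=true
}
```

Four booleans stored this way occupy one byte instead of four, which is part of how the struct shrinks.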

func ConvertTextSliceToOptimized ¶ added in v1.2.3

func ConvertTextSliceToOptimized(texts []Text, pool *FontPool) []TextOptimized

ConvertTextSliceToOptimized converts a slice of Text to TextOptimized

func ConvertTextToOptimized ¶ added in v1.2.3

func ConvertTextToOptimized(t Text, pool *FontPool) TextOptimized

ConvertTextToOptimized converts a Text to TextOptimized using the provided font pool

func (*TextOptimized) IsBold ¶ added in v1.2.3

func (t *TextOptimized) IsBold() bool

func (*TextOptimized) IsItalic ¶ added in v1.2.3

func (t *TextOptimized) IsItalic() bool

func (*TextOptimized) IsUnderline ¶ added in v1.2.3

func (t *TextOptimized) IsUnderline() bool

func (*TextOptimized) IsVertical ¶ added in v1.2.3

func (t *TextOptimized) IsVertical() bool

Helper methods for TextOptimized

func (*TextOptimized) SetBold ¶ added in v1.2.3

func (t *TextOptimized) SetBold(v bool)

func (*TextOptimized) SetItalic ¶ added in v1.2.3

func (t *TextOptimized) SetItalic(v bool)

func (*TextOptimized) SetUnderline ¶ added in v1.2.3

func (t *TextOptimized) SetUnderline(v bool)

func (*TextOptimized) SetVertical ¶ added in v1.2.3

func (t *TextOptimized) SetVertical(v bool)

type TextStream ¶

type TextStream struct {
	Text       string
	PageNum    int
	Font       string
	FontSize   float64
	X, Y       float64
	W          float64
	Vertical   bool
	Confidence float64 // Confidence in the text recognition (0-1)
}

TextStream represents a stream of text with metadata

type TextVertical ¶

type TextVertical []Text

TextVertical implements sort.Interface for sorting a slice of Text values in vertical order, top to bottom, and then left to right within a line.

func (TextVertical) Len ¶

func (x TextVertical) Len() int

func (TextVertical) Less ¶

func (x TextVertical) Less(i, j int) bool

func (TextVertical) Swap ¶

func (x TextVertical) Swap(i, j int)

type TextWithLanguage ¶

type TextWithLanguage struct {
	Text       Text
	Language   LanguageInfo
	Confidence float64
}

TextWithLanguage represents text with detected language information

type ToUnicodeCMap ¶ added in v1.2.8

type ToUnicodeCMap struct {
	*CMap
}

ToUnicodeCMap is a specialized CMap for ToUnicode mappings

func NewToUnicodeCMap ¶ added in v1.2.8

func NewToUnicodeCMap() *ToUnicodeCMap

NewToUnicodeCMap creates a new ToUnicode CMap

func ParseToUnicodeCMap ¶ added in v1.2.8

func ParseToUnicodeCMap(r io.Reader) (*ToUnicodeCMap, error)

ParseToUnicodeCMap parses a ToUnicode CMap stream

func (*ToUnicodeCMap) DecodeCID ¶ added in v1.2.8

func (c *ToUnicodeCMap) DecodeCID(cid int) string

DecodeCID decodes a CID value to Unicode using the ToUnicode mapping

type Type1Cache ¶ added in v1.2.8

type Type1Cache struct {
	// contains filtered or unexported fields
}

Type1Cache provides caching for Type1 font parsing operations

func GetGlobalType1Cache ¶ added in v1.2.8

func GetGlobalType1Cache() *Type1Cache

GetGlobalType1Cache returns the global Type1 cache instance

func NewType1Cache ¶ added in v1.2.8

func NewType1Cache(maxSize int, ttl time.Duration) *Type1Cache

NewType1Cache creates a new Type1 cache

func (*Type1Cache) GetFont ¶ added in v1.2.8

func (tc *Type1Cache) GetFont(data []byte) (*Type1Font, bool)

GetFont retrieves a cached Type1 font

func (*Type1Cache) PutFont ¶ added in v1.2.8

func (tc *Type1Cache) PutFont(data []byte, font *Type1Font)

PutFont caches a Type1 font

type Type1CacheEntry ¶ added in v1.2.8

type Type1CacheEntry struct {
	Data        *Type1Font
	Expiration  time.Time
	LastAccess  time.Time
	AccessCount int64
}

Type1CacheEntry represents a cached Type1 font

func (*Type1CacheEntry) IsExpired ¶ added in v1.2.8

func (ce *Type1CacheEntry) IsExpired() bool

IsExpired checks if the cache entry has expired

type Type1Font ¶ added in v1.2.8

type Type1Font struct {
	// contains filtered or unexported fields
}

Type1Font represents a Type1 font

func NewType1Font ¶ added in v1.2.8

func NewType1Font(data []byte) (*Type1Font, error)

NewType1Font creates a new Type1 font from raw font data with caching

func ParseType1FromStream ¶ added in v1.2.8

func ParseType1FromStream(v Value) (*Type1Font, error)

ParseType1FromStream parses Type1 font from a PDF stream

func (*Type1Font) GlyphName ¶ added in v1.2.8

func (f *Type1Font) GlyphName(code byte) string

GlyphName returns the glyph name for a character code

func (*Type1Font) GlyphWidth ¶ added in v1.2.8

func (f *Type1Font) GlyphWidth(name string) float64

GlyphWidth returns the width of a glyph by name

func (*Type1Font) Info ¶ added in v1.2.8

func (f *Type1Font) Info() *Type1FontInfo

Info returns the font info

type Type1FontInfo ¶ added in v1.2.8

type Type1FontInfo struct {
	FontName       string
	FullName       string
	FamilyName     string
	Weight         string
	ItalicAngle    float64
	IsFixedPitch   bool
	UnderlinePos   float64
	UnderlineThick float64
	FontBBox       [4]float64
	UniqueID       int
	XUID           []int

	// Font metrics
	Encoding    string
	PaintType   int
	FontType    int
	FontMatrix  [6]float64
	StrokeWidth float64

	// Private dict values
	BlueValues       []int
	OtherBlues       []int
	FamilyBlues      []int
	FamilyOtherBlues []int
	BlueScale        float64
	BlueShift        int
	BlueFuzz         int
	StdHW            float64
	StdVW            float64
	StemSnapH        []float64
	StemSnapV        []float64
	ForceBold        bool
	LanguageGroup    int
	RndStemUp        bool
	ExpansionFactor  float64
}

Type1FontInfo contains parsed Type1 font header information

func GetType1FontInfo ¶ added in v1.2.8

func GetType1FontInfo(v Value) *Type1FontInfo

GetType1FontInfo extracts Type1 font info from embedded font program

type Value ¶

type Value struct {
	// contains filtered or unexported fields
}

A Value is a single PDF value, such as an integer, dictionary, or array. The zero Value is a PDF null (Kind() == Null, IsNull() == true).

func (Value) Bool ¶

func (v Value) Bool() bool

Bool returns v's boolean value. If v.Kind() != Bool, Bool returns false.

func (Value) Float64 ¶

func (v Value) Float64() float64

Float64 returns v's float64 value, converting from integer if necessary. If v.Kind() != Real and v.Kind() != Integer, Float64 returns 0.

func (Value) Index ¶

func (v Value) Index(i int) Value

Index returns the i'th element in the array v. If v.Kind() != Array or if i is outside the array bounds, Index returns a null Value.

func (Value) Int64 ¶

func (v Value) Int64() int64

Int64 returns v's int64 value. If v.Kind() != Integer, Int64 returns 0.

func (Value) IsNull ¶

func (v Value) IsNull() bool

IsNull reports whether the value is a null. It is equivalent to Kind() == Null.

func (Value) Key ¶

func (v Value) Key(key string) Value

Key returns the value associated with the given name key in the dictionary v. Like the result of the Name method, the key should not include a leading slash. If v is a stream, Key applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Key returns a null Value.

func (Value) Keys ¶

func (v Value) Keys() []string

Keys returns a sorted list of the keys in the dictionary v. If v is a stream, Keys applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Keys returns nil.

func (Value) Kind ¶

func (v Value) Kind() ValueKind

Kind reports the kind of value underlying v.

func (Value) Len ¶

func (v Value) Len() int

Len returns the length of the array v. If v.Kind() != Array, Len returns 0.

func (Value) Name ¶

func (v Value) Name() string

Name returns v's name value. If v.Kind() != Name, Name returns the empty string. The returned name does not include the leading slash: if v corresponds to the name written using the syntax /Helvetica, Name() == "Helvetica".

func (Value) RawString ¶

func (v Value) RawString() string

RawString returns v's string value. If v.Kind() != String, RawString returns the empty string.

func (Value) Reader ¶

func (v Value) Reader() io.ReadCloser

Reader returns the data contained in the stream v. If v.Kind() != Stream, Reader returns a ReadCloser that responds to all reads with a “stream not present” error.

func (Value) String ¶

func (v Value) String() string

String returns a textual representation of the value v. Note that String is not the accessor for values with Kind() == String. To access such values, see RawString, Text, and TextFromUTF16.

func (Value) Text ¶

func (v Value) Text() string

Text returns v's string value interpreted as a “text string” (defined in the PDF spec) and converted to UTF-8. If v.Kind() != String, Text returns the empty string.

func (Value) TextFromUTF16 ¶

func (v Value) TextFromUTF16() string

TextFromUTF16 returns v's string value interpreted as big-endian UTF-16 and then converted to UTF-8. If v.Kind() != String or if the data is not valid UTF-16, TextFromUTF16 returns the empty string.
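
As an illustration of the conversion described above, here is a minimal sketch that decodes big-endian UTF-16 bytes into a UTF-8 Go string using only the standard library. `utf16BEToUTF8` is a hypothetical helper, not the package's code, and it differs from the documented behavior in one respect: unpaired surrogates become U+FFFD instead of producing an empty result.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16BEToUTF8 is a hypothetical helper. It treats raw as a sequence of
// big-endian 16-bit code units and returns the UTF-8 equivalent. Input with
// an odd number of bytes yields the empty string. A leading byte order mark,
// if present, is not stripped.
func utf16BEToUTF8(raw string) string {
	if len(raw)%2 != 0 {
		return ""
	}
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i < len(raw); i += 2 {
		units = append(units, uint16(raw[i])<<8|uint16(raw[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	fmt.Println(utf16BEToUTF8("\x00H\x00i"))
}
```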

type ValueKind ¶

type ValueKind int

A ValueKind specifies the kind of data underlying a Value.

const (
	Null ValueKind = iota
	Bool
	Integer
	Real
	String
	Name
	Dict
	Array
	Stream
)

The PDF value kinds.
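
The accessors on Value share one convention: when the kind does not match, they return the zero value instead of an error. A minimal sketch of that convention follows, using a hypothetical `value` type with only three kinds. It is not the package's Value, whose fields are unexported.

```go
package main

import "fmt"

// valueKind and value are hypothetical stand-ins for ValueKind and Value.
type valueKind int

const (
	kindNull valueKind = iota
	kindInteger
	kindReal
)

type value struct {
	kind valueKind
	i    int64
	f    float64
}

// IsNull reports whether v is the null value. The zero value is null.
func (v value) IsNull() bool { return v.kind == kindNull }

// Int64 returns the integer payload, or 0 when v is not an integer.
func (v value) Int64() int64 {
	if v.kind != kindInteger {
		return 0
	}
	return v.i
}

// Float64 returns the real payload, converting from integer if necessary,
// or 0 when v is neither.
func (v value) Float64() float64 {
	switch v.kind {
	case kindReal:
		return v.f
	case kindInteger:
		return float64(v.i)
	}
	return 0
}

func main() {
	n := value{kind: kindInteger, i: 7}
	fmt.Println(n.Int64(), n.Float64(), value{}.IsNull())
}
```

Because a mismatch is indistinguishable from a stored zero, callers that care about the difference should check Kind first.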

type VerticalTextTransform ¶ added in v1.2.8

type VerticalTextTransform struct {
	Enabled bool
	OriginX float64
	OriginY float64
	Angle   float64 // Rotation angle in degrees (typically 90 or -90)
}

VerticalTextTransform transforms text coordinates for vertical writing

func (*VerticalTextTransform) TransformGlyph ¶ added in v1.2.8

func (vt *VerticalTextTransform) TransformGlyph(x, y, w, h float64) (nx, ny, nw, nh float64)

TransformGlyph transforms a single glyph position for vertical writing
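
The documentation does not spell out the math behind TransformGlyph. As a rough sketch under stated assumptions, the following rotates a point about the configured origin by Angle degrees, counter-clockwise in a coordinate system where Y grows upward. `verticalTransform` and `transformPoint` are hypothetical, and the handling of glyph width and height is left out because the source does not describe it.

```go
package main

import (
	"fmt"
	"math"
)

// verticalTransform mirrors the documented fields of VerticalTextTransform.
type verticalTransform struct {
	Enabled          bool
	OriginX, OriginY float64
	Angle            float64 // rotation angle in degrees
}

// transformPoint rotates (x, y) about the origin by Angle degrees.
// When the transform is disabled the point is returned unchanged.
func (vt verticalTransform) transformPoint(x, y float64) (nx, ny float64) {
	if !vt.Enabled {
		return x, y
	}
	sin, cos := math.Sincos(vt.Angle * math.Pi / 180)
	dx, dy := x-vt.OriginX, y-vt.OriginY
	return vt.OriginX + dx*cos - dy*sin, vt.OriginY + dx*sin + dy*cos
}

// approxEqual compares floats with a small tolerance, since sin and cos of
// multiples of 90 degrees are not exact in floating point.
func approxEqual(a, b float64) bool { return math.Abs(a-b) < 1e-9 }

func main() {
	vt := verticalTransform{Enabled: true, Angle: 90}
	x, y := vt.transformPoint(1, 0)
	fmt.Println(approxEqual(x, 0), approxEqual(y, 1))
}
```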

type WSDeque ¶

type WSDeque struct {
	// contains filtered or unexported fields
}

WSDeque is a work-stealing deque based on the Chase-Lev algorithm.

func NewWSDeque ¶

func NewWSDeque(size int) *WSDeque

func (*WSDeque) PopBottom ¶

func (d *WSDeque) PopBottom() WSTask

PopBottom pops a task from the bottom of the deque (LIFO order). It is intended for the owner thread.

func (*WSDeque) PushBottom ¶

func (d *WSDeque) PushBottom(task WSTask)

PushBottom pushes a task onto the bottom of the deque. It is intended for the owner thread.

func (*WSDeque) Steal ¶

func (d *WSDeque) Steal() WSTask

Steal removes a task from the top of the deque (FIFO order). It is intended for threads other than the owner.
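
The ordering contract of the three methods can be shown with a much simpler structure. The sketch below is a mutex-guarded simplification, not the lock-free Chase-Lev algorithm and not the package's code. `deque`, `task`, and `printTask` are hypothetical names, and returning nil from an empty deque is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// task matches the shape of the documented WSTask interface.
type task interface{ Execute() }

// printTask is a trivial task used for the demonstration.
type printTask string

func (p printTask) Execute() { fmt.Println("running", string(p)) }

// deque keeps the oldest task at index 0 (the top) and the newest at the end
// (the bottom). A single mutex replaces the atomics of a real Chase-Lev deque.
type deque struct {
	mu    sync.Mutex
	tasks []task
}

// PushBottom appends a task at the bottom.
func (d *deque) PushBottom(t task) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.tasks = append(d.tasks, t)
}

// PopBottom removes the most recently pushed task, or returns nil when empty.
func (d *deque) PopBottom() task {
	d.mu.Lock()
	defer d.mu.Unlock()
	n := len(d.tasks)
	if n == 0 {
		return nil
	}
	t := d.tasks[n-1]
	d.tasks[n-1] = nil // drop the reference so it can be collected
	d.tasks = d.tasks[:n-1]
	return t
}

// Steal removes the oldest task, or returns nil when empty.
func (d *deque) Steal() task {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.tasks) == 0 {
		return nil
	}
	t := d.tasks[0]
	d.tasks[0] = nil
	d.tasks = d.tasks[1:]
	return t
}

func main() {
	d := &deque{}
	for _, name := range []string{"a", "b", "c"} {
		d.PushBottom(printTask(name))
	}
	d.PopBottom().Execute() // owner takes the newest task
	d.Steal().Execute()     // a thief takes the oldest task
}
```

The owner working LIFO tends to reuse data that is still warm in cache, while thieves taking the oldest task keeps them away from the end the owner is using.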

type WSTask ¶

type WSTask interface {
	Execute()
}

type WSWorker ¶

type WSWorker struct {
	// contains filtered or unexported fields
}

type WarmupConfig ¶ added in v1.0.2

type WarmupConfig struct {
	// BytePoolWarmup is the number of buffers to warm up for each size bucket.
	BytePoolWarmup map[int]int

	// TextPoolWarmup is the number of text slices to warm up for each size bucket.
	TextPoolWarmup map[int]int

	// Concurrent reports whether to warm up concurrently.
	Concurrent bool

	// MaxGoroutines is the maximum number of concurrent goroutines.
	MaxGoroutines int
}

WarmupConfig holds the pool warmup configuration.

func AggressiveWarmupConfig ¶ added in v1.0.2

func AggressiveWarmupConfig() *WarmupConfig

AggressiveWarmupConfig returns an aggressive warmup configuration (more pre-allocation).

func DefaultWarmupConfig ¶ added in v1.0.2

func DefaultWarmupConfig() *WarmupConfig

DefaultWarmupConfig returns the default warmup configuration, which uses reduced pre-allocation to minimize the initial memory footprint.

func LightWarmupConfig ¶ added in v1.0.2

func LightWarmupConfig() *WarmupConfig

LightWarmupConfig returns a light warmup configuration (less pre-allocation).

type WarmupStats ¶ added in v1.0.2

type WarmupStats struct {
	BytePoolSizes  map[int]int
	TextPoolSizes  map[int]int
	TotalAllocated int64
	IsWarmed       bool
}

WarmupStats holds warmup statistics.
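
To make the warmup idea concrete, here is a minimal sketch that pre-allocates byte buffers per size bucket from a map shaped like BytePoolWarmup (bucket size to buffer count) and reports statistics shaped like WarmupStats. `bytePool`, `warmup`, and `warmupStats` are hypothetical. Using buffered channels as buckets and counting TotalAllocated in bytes are assumptions, not the package's design.

```go
package main

import "fmt"

// warmupStats mirrors part of the documented WarmupStats struct.
type warmupStats struct {
	BytePoolSizes  map[int]int
	TotalAllocated int64 // assumed to be a byte count
	IsWarmed       bool
}

// bytePool holds one buffered channel of ready buffers per size bucket.
type bytePool struct {
	buckets map[int]chan []byte
}

// warmup pre-allocates count buffers of each size listed in plan.
func warmup(plan map[int]int) (*bytePool, warmupStats) {
	p := &bytePool{buckets: make(map[int]chan []byte, len(plan))}
	stats := warmupStats{BytePoolSizes: make(map[int]int, len(plan))}
	for size, count := range plan {
		bucket := make(chan []byte, count)
		for i := 0; i < count; i++ {
			bucket <- make([]byte, 0, size)
			stats.TotalAllocated += int64(size)
		}
		p.buckets[size] = bucket
		stats.BytePoolSizes[size] = count
	}
	stats.IsWarmed = true
	return p, stats
}

// Get returns a pre-allocated buffer when one is available and allocates a
// fresh one otherwise. An unknown size maps to a nil channel, so the select
// falls through to the default case.
func (p *bytePool) Get(size int) []byte {
	select {
	case buf := <-p.buckets[size]:
		return buf
	default:
		return make([]byte, 0, size)
	}
}

func main() {
	pool, stats := warmup(map[int]int{1024: 2, 4096: 1})
	buf := pool.Get(1024) // served from the warmed bucket, no allocation
	fmt.Println(cap(buf), stats.TotalAllocated, stats.IsWarmed)
}
```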

type WorkStealingExecutor ¶

type WorkStealingExecutor struct {
	// contains filtered or unexported fields
}

WorkStealingExecutor is a work-stealing thread pool.

func NewWorkStealingExecutor ¶

func NewWorkStealingExecutor(numWorkers int) *WorkStealingExecutor

func (*WorkStealingExecutor) Start ¶

func (p *WorkStealingExecutor) Start()

func (*WorkStealingExecutor) Stop ¶

func (p *WorkStealingExecutor) Stop()

func (*WorkStealingExecutor) Submit ¶

func (p *WorkStealingExecutor) Submit(task WSTask)

type WorkStealingScheduler ¶

type WorkStealingScheduler struct {
	// contains filtered or unexported fields
}

WorkStealingScheduler is a work-stealing scheduler. It reduces goroutine creation overhead and improves parallel processing efficiency.

func NewWorkStealingScheduler ¶

func NewWorkStealingScheduler(numWorkers int) *WorkStealingScheduler

NewWorkStealingScheduler creates a work-stealing scheduler.

func (*WorkStealingScheduler) Start ¶

func (wss *WorkStealingScheduler) Start()

Start starts the scheduler.

func (*WorkStealingScheduler) Stop ¶

func (wss *WorkStealingScheduler) Stop()

Stop stops the scheduler.

func (*WorkStealingScheduler) Submit ¶

func (wss *WorkStealingScheduler) Submit(task Task)

Submit submits a task.

func (*WorkStealingScheduler) Wait ¶

func (wss *WorkStealingScheduler) Wait()

Wait waits for all tasks to complete.
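
The Start, Submit, Wait, Stop lifecycle can be sketched with a fixed set of goroutines reading from one shared channel. This is a plain worker pool, not a work-stealing scheduler, and it is not the package's implementation. `scheduler` and `runBatch` are hypothetical, and plain func() values are used because the Task type accepted by Submit is not shown in this section.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// scheduler runs submitted functions on a fixed number of goroutines.
type scheduler struct {
	numWorkers int
	tasks      chan func()
	pending    sync.WaitGroup // tasks submitted but not yet finished
	workers    sync.WaitGroup // running worker goroutines
}

func newScheduler(numWorkers int) *scheduler {
	return &scheduler{numWorkers: numWorkers, tasks: make(chan func(), 64)}
}

// Start launches the workers. Call it before Submit so the queue is drained.
func (s *scheduler) Start() {
	for i := 0; i < s.numWorkers; i++ {
		s.workers.Add(1)
		go func() {
			defer s.workers.Done()
			for task := range s.tasks {
				task()
				s.pending.Done()
			}
		}()
	}
}

// Submit queues a task. It blocks while the queue is full.
func (s *scheduler) Submit(task func()) {
	s.pending.Add(1)
	s.tasks <- task
}

// Wait blocks until every submitted task has finished.
func (s *scheduler) Wait() { s.pending.Wait() }

// Stop closes the queue and waits for the workers to exit.
func (s *scheduler) Stop() {
	close(s.tasks)
	s.workers.Wait()
}

// runBatch submits n counting tasks and returns how many ran.
func runBatch(workers, n int) int64 {
	s := newScheduler(workers)
	s.Start()
	var done atomic.Int64
	for i := 0; i < n; i++ {
		s.Submit(func() { done.Add(1) })
	}
	s.Wait()
	s.Stop()
	return done.Load()
}

func main() {
	fmt.Println(runBatch(4, 100))
}
```

Reusing a fixed set of goroutines is what avoids the per-task goroutine creation cost mentioned above. A work-stealing design additionally gives each worker its own queue.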

type Worker ¶

type Worker struct {
	// contains filtered or unexported fields
}

Worker is a worker thread.

type WorkerPool ¶ added in v1.0.2

type WorkerPool struct {
	// contains filtered or unexported fields
}

WorkerPool is a worker pool.

func (*WorkerPool) GetStats ¶ added in v1.0.2

func (wp *WorkerPool) GetStats() WorkerPoolStats

GetStats gets worker pool statistics

type WorkerPoolStats ¶ added in v1.0.2

type WorkerPoolStats struct {
	Workers    int
	ActiveJobs int64
	TotalJobs  int64
}

WorkerPoolStats holds worker pool statistics.

type YBand ¶

type YBand struct {
	MinY, MaxY float64
	Blocks     []*TextBlock
}

YBand represents a horizontal band of text on a page
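
One way such bands could be built is to sort blocks by vertical position and start a new band whenever the gap to the previous block exceeds a tolerance. The sketch below shows that approach under stated assumptions: `block`, `band`, and `groupIntoBands` are hypothetical, each block is reduced to a single Y coordinate, and Y is assumed to grow upward as in PDF user space. The package's own grouping logic is not documented here and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// block is a hypothetical stand-in for TextBlock with a single Y position.
type block struct {
	Text string
	Y    float64
}

// band mirrors the documented YBand fields.
type band struct {
	MinY, MaxY float64
	Blocks     []block
}

// groupIntoBands sorts blocks from the top of the page downward and merges a
// block into the current band when it lies within tolerance of the band's
// lowest block. Bands can therefore chain across several close lines.
func groupIntoBands(blocks []block, tolerance float64) []band {
	sorted := append([]block(nil), blocks...) // leave the input untouched
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var bands []band
	for _, b := range sorted {
		if n := len(bands); n > 0 && bands[n-1].MinY-b.Y <= tolerance {
			bands[n-1].Blocks = append(bands[n-1].Blocks, b)
			bands[n-1].MinY = b.Y
			continue
		}
		bands = append(bands, band{MinY: b.Y, MaxY: b.Y, Blocks: []block{b}})
	}
	return bands
}

func main() {
	blocks := []block{
		{"Title", 700}, {"subtitle", 699.5}, {"Body", 650}, {"page no.", 700.2},
	}
	for _, b := range groupIntoBands(blocks, 1) {
		fmt.Println(b.MinY, b.MaxY, len(b.Blocks))
	}
}
```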

type ZeroCopyBuilder ¶

type ZeroCopyBuilder struct {
	// contains filtered or unexported fields
}

ZeroCopyBuilder is a zero-copy string builder.

func NewZeroCopyBuilder ¶

func NewZeroCopyBuilder(cap int) *ZeroCopyBuilder

func (*ZeroCopyBuilder) Reset ¶

func (b *ZeroCopyBuilder) Reset()

func (*ZeroCopyBuilder) UnsafeString ¶

func (b *ZeroCopyBuilder) UnsafeString() string

UnsafeString returns the built string without copying. Note: the underlying buffer must not be modified afterwards.

func (*ZeroCopyBuilder) WriteByte ¶

func (b *ZeroCopyBuilder) WriteByte(c byte) error

func (*ZeroCopyBuilder) WriteString ¶

func (b *ZeroCopyBuilder) WriteString(s string)
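
The zero-copy technique behind a builder like this can be sketched with the standard library's unsafe.String and unsafe.SliceData (Go 1.20 or later). The `builder` type below is hypothetical and only mirrors the documented method set. How ZeroCopyBuilder is actually implemented is not shown in this documentation.

```go
package main

import (
	"fmt"
	"unsafe"
)

// builder accumulates bytes in a single growable buffer.
type builder struct {
	buf []byte
}

func newBuilder(capacity int) *builder {
	return &builder{buf: make([]byte, 0, capacity)}
}

// WriteString appends s to the buffer.
func (b *builder) WriteString(s string) { b.buf = append(b.buf, s...) }

// WriteByte appends a single byte. The error is always nil and exists to
// match the io.ByteWriter signature.
func (b *builder) WriteByte(c byte) error {
	b.buf = append(b.buf, c)
	return nil
}

// Reset empties the buffer but keeps its capacity for reuse.
func (b *builder) Reset() { b.buf = b.buf[:0] }

// UnsafeString returns a string that shares memory with the buffer instead
// of copying it. The bytes must not be modified while the string is in use,
// so do not Reset and write again until the string is no longer needed.
func (b *builder) UnsafeString() string {
	return unsafe.String(unsafe.SliceData(b.buf), len(b.buf))
}

func main() {
	b := newBuilder(16)
	b.WriteString("Hello, ")
	b.WriteString("PDF")
	if err := b.WriteByte('!'); err != nil {
		panic(err)
	}
	fmt.Println(b.UnsafeString())
}
```

When the result has to outlive the builder's next Reset, a copying conversion such as string(buf) is the safe choice, at the cost of one allocation.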

Notes ¶

Bugs ¶

The package is incomplete, although it has been used successfully on some large real-world PDF files.
The library makes no attempt at efficiency beyond the value cache and font cache. Further optimizations could improve performance for large files.
The support for reading encrypted files is limited to basic RC4 and AES encryption.

Directories ¶

Path	Synopsis
cmd
pdfcli command
test_coords command
test_ordering command
examples
batch_fontcache command
extract command	Example: Extract text from a PDF file with various methods
extract_text_performance command
performance command	Example demonstrating performance optimization features
smart_ordering command
pdfpasswd	Pdfpasswd searches for the password for an encrypted PDF by trying all strings over a given alphabet up to a given length.