Documentation ¶
Overview ¶
compatibility.go - PDF format compatibility handling
Package pdf implements reading of PDF files.
PDF is Adobe's Portable Document Format, ubiquitous on the internet. A PDF document is a complex data format built on a fairly simple structure. This package exposes the simple structure along with some wrappers to extract basic information. If more complex information is needed, it is possible to extract that information by interpreting the structure exposed by this package.
Specifically, a PDF is a data structure built from Values, each of which has one of the following Kinds:
- Null, for the null object.
- Integer, for an integer.
- Real, for a floating-point number.
- Bool, for a boolean value.
- Name, for a name constant (as in /Helvetica).
- String, for a string constant.
- Dict, for a dictionary of name-value pairs.
- Array, for an array of values.
- Stream, for an opaque data stream and associated header dictionary.
The accessors on Value—Int64, Float64, Bool, Name, and so on—return a view of the data as the given type. When there is no appropriate view, the accessor returns a zero result. For example, the Name accessor returns the empty string if called on a Value v for which v.Kind() != Name. Returning zero values this way, especially from the Dict and Array accessors, which themselves return Values, makes it possible to traverse a PDF quickly without writing any error checking. On the other hand, it means that mistakes can go unreported.
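The zero-result convention can be illustrated with a small, self-contained sketch. The `kind` and `value` types below are hypothetical stand-ins written for this illustration; they are not the package's actual ValueKind and Value types, and only mimic the behavior described above.

```go
package main

import "fmt"

// kind and value are hypothetical stand-ins for this sketch only.
type kind int

const (
	kindNull kind = iota
	kindName
	kindDict
)

type value struct {
	k    kind
	name string
	dict map[string]value
}

// Name returns the name, or "" when the value is not a name.
func (v value) Name() string {
	if v.k != kindName {
		return ""
	}
	return v.name
}

// Key returns the dictionary entry, or a null value when v is not a
// dictionary or the key is absent, so lookups can be chained freely.
func (v value) Key(key string) value {
	if v.k != kindDict {
		return value{}
	}
	return v.dict[key]
}

func main() {
	font := value{k: kindDict, dict: map[string]value{
		"BaseFont": {k: kindName, name: "Helvetica"},
	}}
	root := value{k: kindDict, dict: map[string]value{"Font": font}}

	fmt.Println(root.Key("Font").Key("BaseFont").Name()) // Helvetica
	// A misspelled key goes unreported: the chain simply yields "".
	fmt.Printf("%q\n", root.Key("Fnt").Key("BaseFont").Name()) // ""
}
```

The second lookup shows the trade-off: traversal needs no error checks, but a typo in a key is indistinguishable from missing data.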
The basic structure of the PDF file is exposed as the graph of Values.
Most richer data structures in a PDF file are dictionaries with specific interpretations of the name-value pairs. The Font and Page wrappers make the interpretation of a specific Value as the corresponding type easier. They are only helpers, though: they are implemented only in terms of the Value API and could be moved outside the package. Equally important, traversal of other PDF data structures can be implemented in other packages as needed.
Example (ZeroCopyInPDFProcessing) ¶
Demonstrates how to use zero-copy optimization in actual PDF processing.
// Assume some text blocks are extracted from PDF
texts := []string{
	" First paragraph ",
	" Second paragraph ",
	" Third paragraph ",
}
// Process using zero-copy operations
builder := NewStringBuffer(1024)
for i, text := range texts {
	// Remove leading and trailing spaces (zero-copy)
	trimmed := TrimSpaceZeroCopy(text)
	builder.WriteString(trimmed)
	if i < len(texts)-1 {
		builder.WriteString("\n")
	}
}
result := builder.StringCopy()
fmt.Println(result)
Output:
First paragraph
Second paragraph
Third paragraph
Index ¶
- Constants
- Variables
- func AutoWarmup() error
- func BatchCompareFloat64(a, b []float64, threshold float64) []bool
- func BatchHexDecode(hexStrings []string) ([][]byte, []error)
- func BenchmarkSortingAlgorithms(texts []Text, getCoord func(Text) float64) map[string]float64
- func BytesToString(b []byte) string
- func ClearGlobalStringPool()
- func CompareStringsZeroCopy(s1, s2 string) int
- func DetectCJKOrdering(fontName string) string
- func EstimateCapacity(currentLen int, growthFactor float64) int
- func ExampleOptimizations()
- func FastHexValidation(hexStr string) bool
- func FastSortTexts(texts []Text, less func(i, j int) bool)
- func FastSortTextsByX(texts []Text)
- func FastSortTextsByY(texts []Text)
- func FastStringConcat(strings ...string) string
- func FastStringConcatZC(parts ...string) string
- func FastStringSearch(haystack, needle string) int
- func GetBuilder() *strings.Builder
- func GetByteBuffer() *[]byte
- func GetCMapWritingMode(name string) int
- func GetContentExtractorSlices() ([]Text, []Rect)
- func GetIntSlice(size int) []int
- func GetPDFBuffer() *buffer
- func GetSizedBuffer(size int) []byte
- func GetVerticalVariant(r rune) rune
- func GlyphNameToRune(name string) rune
- func HasPrefixZeroCopy(s, prefix string) bool
- func HasSuffixZeroCopy(s, suffix string) bool
- func HexDecodeSIMD(hexStr string) ([]byte, error)
- func HilbertXYToIndex(x, y, order uint32) uint64
- func InitPredefinedCMaps()
- func InternRune(r rune) string
- func InternString(s string) string
- func Interpret(strm Value, do func(stk *Stack, op string))
- func InterpretWithContext(ctx context.Context, strm Value, do func(stk *Stack, op string))
- func InterpretWithContextAndLimits(ctx context.Context, strm Value, do func(stk *Stack, op string), ...)
- func IsCJKCMap(name string) bool
- func IsCJKFont(fontName string) bool
- func IsSameSentence(last, current Text) bool
- func IsType1Font(v Value) bool
- func JoinZeroCopy(parts []string, sep string) string
- func ListRegisteredCMaps() []string
- func OptimizedStartup(config *StartupConfig) error
- func PreallocateCache(fontCacheSize, resultCacheSize int)
- func ProcessLargePDF(reader *Reader, chunkSize, bufferSize int, maxMemory int64, ...) error
- func ProcessTextWithMultiLanguage(reader *Reader) (map[Language][]ClassifiedBlock, error)
- func PutBlockSlice(s []ClassifiedBlock)
- func PutBuilder(b *strings.Builder)
- func PutByteBuffer(buf *[]byte)
- func PutContentExtractorSlices(text []Text, rect []Rect)
- func PutIntSlice(s []int)
- func PutPDFBuffer(b *buffer)
- func PutSizedBuffer(buf []byte)
- func PutSizedStringBuilder(sb *FastStringBuilder, estimatedSize int)
- func PutSizedTextSlice(slice []Text)
- func PutText(t *Text)
- func PutTextBlock(tb *TextBlock)
- func PutTextBlocks(blocks []*TextBlock)
- func PutTextSlice(s []Text)
- func RadixSortFloat64(values []float64)
- func RegisterCJKFont(name string, info *CJKFontInfo)
- func RegisterPredefinedCMap(name string, cmap *PredefinedCMap)
- func ResetSortingMetrics()
- func ShouldRotateGlyph(r rune) bool
- func SmartTextRunsToPlain(texts []Text) string
- func SplitZeroCopy(s string, sep byte) []string
- func StringSliceToByteSlice(strings []string) [][]byte
- func StringToBytes(s string) []byte
- func SubstringZeroCopy(s string, start, end int) string
- func TrimSpaceZeroCopy(s string) string
- func ValidatePDFA(data []byte) ([]string, error)
- func ValidatePDFX(data []byte) ([]string, error)
- func WarmupGlobal(config *WarmupConfig) error
- func ZeroCopyStringSlice(data []byte, separators []byte) []string
- type AccessPattern
- type AccessPatternTracker
- type AdaptiveCapacityEstimator
- type AdaptiveProcessor
- type AdaptiveSorter
- type AsyncReader
- func (ar *AsyncReader) AsyncExtractStructured(ctx context.Context) (<-chan []ClassifiedBlock, <-chan error)
- func (ar *AsyncReader) AsyncExtractText(ctx context.Context) (<-chan string, <-chan error)
- func (ar *AsyncReader) AsyncExtractTextWithContext(ctx context.Context, opts ExtractOptions) (<-chan string, <-chan error)
- func (ar *AsyncReader) AsyncStream(ctx context.Context, processor func(Page, int) error) <-chan error
- func (ar *AsyncReader) StreamValueReader(ctx context.Context, v Value) (<-chan []byte, <-chan error)
- type AsyncReaderAt
- type BatchExtractOptions
- type BatchResult
- type BatchStringBuilder
- type BlockType
- type CCITTFaxDecoder
- type CCITTFaxParams
- type CFFCache
- type CFFCacheEntry
- type CFFCharStringDecoder
- type CFFDict
- type CFFFont
- type CFFHeader
- type CFFIndex
- type CFFObjectPool
- type CIDFont
- func (f *CIDFont) DecodeToUnicode(raw string) string
- func (f *CIDFont) GetWidth(cid int) int
- func (f *CIDFont) SetCIDToGIDMap(m *CIDToGIDMap)
- func (f *CIDFont) SetCMap(cmap *CMap)
- func (f *CIDFont) SetDefaultWidth(w int)
- func (f *CIDFont) SetToUnicode(toUnicode *ToUnicodeCMap)
- func (f *CIDFont) SetWidth(cid, width int)
- func (f *CIDFont) SetWritingMode(wMode int)
- func (f *CIDFont) WritingMode() int
- type CIDFontDescriptor
- type CIDSystemInfo
- type CIDToGIDMap
- type CJKFontInfo
- type CJKFontRegistry
- type CJKGlyphMetrics
- type CJKTextProcessor
- type CMap
- func (c *CMap) AddBFChar(orig, repl string)
- func (c *CMap) AddBFRange(low, high string, dst Value)
- func (c *CMap) AddCIDChar(code []byte, cid int)
- func (c *CMap) AddCIDRange(low, high []byte, startCID int)
- func (c *CMap) AddCodeSpaceRange(low, high []byte)
- func (c *CMap) Decode(raw string) string
- func (c *CMap) LookupCID(code []byte) (int, bool)
- func (c *CMap) OptimizeCIDLookup()
- func (c *CMap) SetCIDSystemInfo(registry, ordering string, supplement int)
- func (c *CMap) SetUseCMap(parent TextEncoding)
- func (c *CMap) String() string
- type CMapInfo
- type CMapParser
- type CMapType
- type CacheContext
- type CacheEntry
- type CacheKeyGenerator
- func (ckg *CacheKeyGenerator) GenerateFullHash(data string) string
- func (ckg *CacheKeyGenerator) GeneratePageContentKey(pageNum int, readerHash string) string
- func (ckg *CacheKeyGenerator) GenerateReaderHash(reader *Reader) string
- func (ckg *CacheKeyGenerator) GenerateTextClassificationKey(pageNum int, readerHash string, processorParams string) string
- func (ckg *CacheKeyGenerator) GenerateTextOrderingKey(pageNum int, readerHash string, orderingParams string) string
- type CacheLineAlignedCounter
- type CacheLinePadded
- type CacheManager
- type CacheShard
- type CacheStats
- type CachedReader
- type ClassifiedBlock
- type ClassifiedBlockWithLanguage
- type Column
- type Columns
- type ConnectionPool
- type Content
- type CryptoEngine
- type EncryptionMethod
- type EncryptionRevision
- type EncryptionVersion
- type EnhancedParallelProcessor
- func (epp *EnhancedParallelProcessor) ProcessPagesEnhanced(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (epp *EnhancedParallelProcessor) ProcessWithLoadBalancing(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (epp *EnhancedParallelProcessor) ProcessWithPipeline(ctx context.Context, pages []Page, stages []func(Page, []Text) ([]Text, error)) ([][]Text, error)
- type ExtendedCIDFont
- func (cf *ExtendedCIDFont) Descriptor() *CIDFontDescriptor
- func (cf *ExtendedCIDFont) GID(cid int) uint16
- func (cf *ExtendedCIDFont) Info() *CJKFontInfo
- func (cf *ExtendedCIDFont) IsVertical() bool
- func (cf *ExtendedCIDFont) VerticalOrigin(cid int) (float64, float64)
- func (cf *ExtendedCIDFont) VerticalWidth(cid int) float64
- type ExtractMode
- type ExtractOptions
- type ExtractResult
- type Extractor
- func (e *Extractor) Context(ctx context.Context) *Extractor
- func (e *Extractor) Extract() (*ExtractResult, error)
- func (e *Extractor) ExtractStructured() ([]ClassifiedBlock, error)
- func (e *Extractor) ExtractStyledTexts() ([]Text, error)
- func (e *Extractor) ExtractText() (string, error)
- func (e *Extractor) Mode(mode ExtractMode) *Extractor
- func (e *Extractor) Pages(pages ...int) *Extractor
- func (e *Extractor) SmartOrdering(enabled bool) *Extractor
- func (e *Extractor) Workers(n int) *Extractor
- type FastStringBuilder
- type Font
- type FontCache
- type FontCacheInterface
- type FontCacheStats
- type FontCacheType
- type FontPool
- type FontPrefetcher
- type GlobalFontCache
- func (gfc *GlobalFontCache) Cleanup() int
- func (gfc *GlobalFontCache) Clear()
- func (gfc *GlobalFontCache) Get(key string) (*Font, bool)
- func (gfc *GlobalFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)
- func (gfc *GlobalFontCache) GetStats() FontCacheStats
- func (gfc *GlobalFontCache) Remove(key string)
- func (gfc *GlobalFontCache) Set(key string, font *Font)
- func (gfc *GlobalFontCache) StartCleanupRoutine(interval time.Duration) chan struct{}
- type GridKey
- type InplaceStringBuilder
- type IntegrityStatus
- type JBIG2Decoder
- type JBIG2Params
- type KDNode
- type KDTree
- type LZWPredictor
- type LZWPredictorParams
- type Language
- type LanguageInfo
- type LanguageTextExtractor
- type LazyPage
- type LazyPageManager
- type LockFreeRingBuffer
- type MemoryArena
- type MemoryEfficientExtractor
- type Metadata
- type MultiLangProcessor
- func (mlp *MultiLangProcessor) DetectLanguage(text string) LanguageInfo
- func (mlp *MultiLangProcessor) GetLanguageConfidenceThreshold() float64
- func (mlp *MultiLangProcessor) GetLanguageName(lang Language) string
- func (mlp *MultiLangProcessor) GetSupportedLanguages() []Language
- func (mlp *MultiLangProcessor) IsEnglish(text string) bool
- func (mlp *MultiLangProcessor) IsFrench(text string) bool
- func (mlp *MultiLangProcessor) IsGerman(text string) bool
- func (mlp *MultiLangProcessor) IsSpanish(text string) bool
- func (mlp *MultiLangProcessor) ProcessTextWithLanguageDetection(texts []Text) []TextWithLanguage
- type MultiLanguageTextClassifier
- type MultiLevelCache
- type OptimizedCMapCache
- type OptimizedFontCache
- func (ofc *OptimizedFontCache) Clear()
- func (ofc *OptimizedFontCache) Get(key string) (*Font, bool)
- func (ofc *OptimizedFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)
- func (ofc *OptimizedFontCache) GetStats() FontCacheStats
- func (ofc *OptimizedFontCache) Prefetch(keys []string, compute func(key string) (*Font, error))
- func (ofc *OptimizedFontCache) Remove(key string)
- func (ofc *OptimizedFontCache) Set(key string, font *Font)
- type OptimizedMemoryPool
- type OptimizedSorter
- func (os *OptimizedSorter) QuickSortTexts(texts []Text, less func(i, j int) bool)
- func (os *OptimizedSorter) SortTextHorizontalByOptimized(th TextHorizontal)
- func (os *OptimizedSorter) SortTextVerticalByOptimized(tv TextVertical)
- func (os *OptimizedSorter) SortTexts(texts []Text, less func(i, j int) bool)
- func (os *OptimizedSorter) SortTextsWithAlgorithm(texts []Text, less func(i, j int) bool, algorithm string)
- type OptimizedTextClusterSorter
- type Outline
- type PDFCompatibilityInfo
- type PDFEncryptionInfo
- type PDFError
- type PDFVersion
- type Page
- func (p Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)
- func (p *Page) Cleanup()
- func (p Page) Content() Content
- func (p Page) Font(name string) Font
- func (p Page) Fonts() []string
- func (p *Page) GetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p *Page) GetPlainTextWithSmartOrdering(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p Page) GetTextByColumn() (Columns, error)
- func (p Page) GetTextByRow() (Rows, error)
- func (p Page) OptimizedGetPlainText(ctx context.Context, fonts map[string]*Font) (string, error)
- func (p Page) OptimizedGetTextByColumn() (Columns, error)
- func (p Page) OptimizedGetTextByRow() (Rows, error)
- func (p Page) Resources() Value
- func (p *Page) SetFontCache(cache *GlobalFontCache)
- func (p *Page) SetFontCacheInterface(cache FontCacheInterface)
- type PageStream
- type ParallelExtractor
- type ParallelProcessor
- func (pp *ParallelProcessor) ProcessPages(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
- func (pp *ParallelProcessor) ProcessTextBlocks(ctx context.Context, blocks []*TextBlock, ...) ([]*TextBlock, error)
- func (pp *ParallelProcessor) ProcessTextInParallel(ctx context.Context, texts []Text, processorFunc func(Text) (Text, error)) ([]Text, error)
- type ParallelTextExtractor
- type ParseLimits
- type PasswordAuth
- type PerformanceMetrics
- type Point
- type PoolStats
- type PoolWarmer
- type PredefinedCMap
- type PrefetchItem
- type PrefetchQueue
- type PrefetchStats
- type RTreeNode
- type RTreeSpatialIndex
- type Reader
- func NewReader(f io.ReaderAt, size int64) (*Reader, error)
- func NewReaderEncrypted(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func NewReaderEncryptedWithMmap(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func NewReaderLinearized(f io.ReaderAt, size int64, pw func() string) (*Reader, error)
- func Open(file string) (*os.File, *Reader, error)
- func RecoverPDF(f io.ReaderAt, size int64, opts *RecoveryOptions) (*Reader, error)
- func (r *Reader) BatchExtractText(pageNums []int, useLazy bool) (map[int]string, error)
- func (r *Reader) ClearCache()
- func (r *Reader) Close() error
- func (r *Reader) ExtractAllPagesParallel(ctx context.Context, workers int) ([]string, error)
- func (r *Reader) ExtractPagesBatch(opts BatchExtractOptions) <-chan BatchResult
- func (r *Reader) ExtractPagesBatchToString(opts BatchExtractOptions) (string, error)
- func (r *Reader) ExtractStructuredBatch(opts BatchExtractOptions) <-chan StructuredBatchResult
- func (r *Reader) ExtractWithContext(ctx context.Context, opts ExtractOptions) (io.Reader, error)
- func (r *Reader) GetCacheCapacity() int
- func (r *Reader) GetCompatibilityInfo() *PDFCompatibilityInfo
- func (r *Reader) GetMetadata() (Metadata, error)
- func (r *Reader) GetPlainText() (reader io.Reader, err error)
- func (r *Reader) GetPlainTextConcurrent(workers int) (io.Reader, error)
- func (r *Reader) GetStyledTexts() (sentences []Text, err error)
- func (r *Reader) NumPage() int
- func (r *Reader) Outline() Outline
- func (r *Reader) Page(num int) Page
- func (r *Reader) SetCacheCapacity(n int)
- func (r *Reader) SetMetadata(meta Metadata) error
- func (r *Reader) Trailer() Value
- type RecoveryOptions
- type Rect
- type ResourceManager
- type ResultCache
- func (rc *ResultCache) Clear()
- func (rc *ResultCache) Close()
- func (rc *ResultCache) Get(key string) (interface{}, bool)
- func (rc *ResultCache) GetHitRatio() float64
- func (rc *ResultCache) GetStats() CacheStats
- func (rc *ResultCache) Has(key string) bool
- func (rc *ResultCache) Put(key string, value interface{})
- func (rc *ResultCache) Remove(key string) bool
- type Row
- type Rows
- type ShardedCache
- type ShardedCacheEntry
- type ShardedCacheStats
- type SizedBytePool
- type SizedPool
- type SizedTextSlicePool
- type SortStrategy
- type SortingMetrics
- type SpatialGrid
- type SpatialIndex
- type SpatialIndexInterface
- type Stack
- type StartupConfig
- type StreamProcessor
- func (sp *StreamProcessor) Close()
- func (sp *StreamProcessor) ProcessPageStream(reader *Reader, handler func(PageStream) error) error
- func (sp *StreamProcessor) ProcessTextBlockStream(reader *Reader, handler func(TextBlockStream) error) error
- func (sp *StreamProcessor) ProcessTextStream(reader *Reader, handler func(TextStream) error) error
- type StreamingBatchExtractor
- type StreamingMetadataExtractor
- type StreamingTextClassifier
- type StreamingTextExtractor
- func (e *StreamingTextExtractor) Close()
- func (e *StreamingTextExtractor) GetProgress() float64
- func (e *StreamingTextExtractor) NextBatch() (results map[int]string, hasMore bool, err error)
- func (e *StreamingTextExtractor) NextPage() (pageNum int, text string, hasMore bool, err error)
- func (e *StreamingTextExtractor) Reset()
- type StringBuffer
- func (sb *StringBuffer) Bytes() []byte
- func (sb *StringBuffer) Cap() int
- func (sb *StringBuffer) Len() int
- func (sb *StringBuffer) Reset()
- func (sb *StringBuffer) String() string
- func (sb *StringBuffer) StringCopy() string
- func (sb *StringBuffer) WriteByte(b byte) error
- func (sb *StringBuffer) WriteBytes(b []byte)
- func (sb *StringBuffer) WriteString(s string)
- type StringBuilderPool
- type StringPool
- type StructuredBatchResult
- type Task
- type Text
- type TextBlock
- func ClusterTextBlocksOptimized(texts []Text) []*TextBlock
- func ClusterTextBlocksOptimizedV2(texts []Text) []*TextBlock
- func ClusterTextBlocksParallel(texts []Text) []*TextBlock
- func ClusterTextBlocksParallelV2(texts []Text) []*TextBlock
- func ClusterTextBlocksUltraOptimized(texts []Text) []*TextBlock
- func ClusterTextBlocksUltraV2(texts []Text) []*TextBlock
- func ClusterTextBlocksV3(texts []Text) []*TextBlock
- func ClusterTextBlocksV3Fast(texts []Text, maxClusters int) []*TextBlock
- func ClusterTextBlocksV4(texts []Text) []*TextBlock
- func GetTextBlock() *TextBlock
- type TextBlockStream
- type TextClassifier
- type TextEncoding
- type TextHorizontal
- type TextOptimized
- func (t *TextOptimized) IsBold() bool
- func (t *TextOptimized) IsItalic() bool
- func (t *TextOptimized) IsUnderline() bool
- func (t *TextOptimized) IsVertical() bool
- func (t *TextOptimized) SetBold(v bool)
- func (t *TextOptimized) SetItalic(v bool)
- func (t *TextOptimized) SetUnderline(v bool)
- func (t *TextOptimized) SetVertical(v bool)
- type TextStream
- type TextVertical
- type TextWithLanguage
- type ToUnicodeCMap
- type Type1Cache
- type Type1CacheEntry
- type Type1Font
- type Type1FontInfo
- type Value
- func (v Value) Bool() bool
- func (v Value) Float64() float64
- func (v Value) Index(i int) Value
- func (v Value) Int64() int64
- func (v Value) IsNull() bool
- func (v Value) Key(key string) Value
- func (v Value) Keys() []string
- func (v Value) Kind() ValueKind
- func (v Value) Len() int
- func (v Value) Name() string
- func (v Value) RawString() string
- func (v Value) Reader() io.ReadCloser
- func (v Value) String() string
- func (v Value) Text() string
- func (v Value) TextFromUTF16() string
- type ValueKind
- type VerticalTextTransform
- type WSDeque
- type WSTask
- type WSWorker
- type WarmupConfig
- type WarmupStats
- type WorkStealingExecutor
- type WorkStealingScheduler
- type Worker
- type WorkerPool
- type WorkerPoolStats
- type YBand
- type ZeroCopyBuilder
- Bugs
Examples ¶
- Package (ZeroCopyInPDFProcessing)
- BatchExtractOptions (OptimizedCache)
- BatchExtractOptions (StandardCache)
- FastStringConcatZC
- GetGlobalFontCache
- GlobalFontCache
- JoinZeroCopy
- ParallelExtractor (Basic)
- Reader.ExtractAllPagesParallel
- Reader.ExtractPagesBatch
- Reader.ExtractPagesBatchToString
- SplitZeroCopy
- StreamingBatchExtractor
- StringBuffer
- StringPool
- TrimSpaceZeroCopy
Constants ¶
const (
	FlagVertical  uint8 = 1 << 0 // 0x01
	FlagBold      uint8 = 1 << 1 // 0x02
	FlagItalic    uint8 = 1 << 2 // 0x04
	FlagUnderline uint8 = 1 << 3 // 0x08
)
Flag constants for TextOptimized.Flags
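As a minimal sketch of how single-bit flags like these are typically combined in one byte, the example below redeclares the four constants so it runs on its own. The `setFlag` and `hasFlag` helpers are hypothetical names introduced for illustration; the package exposes its own setters and getters on TextOptimized.

```go
package main

import "fmt"

// Values copied from the constant block in this documentation.
const (
	FlagVertical  uint8 = 1 << 0
	FlagBold      uint8 = 1 << 1
	FlagItalic    uint8 = 1 << 2
	FlagUnderline uint8 = 1 << 3
)

// setFlag returns flags with bit f set (on == true) or cleared (on == false).
// Hypothetical helper for this sketch.
func setFlag(flags, f uint8, on bool) uint8 {
	if on {
		return flags | f
	}
	return flags &^ f
}

// hasFlag reports whether bit f is set in flags. Hypothetical helper.
func hasFlag(flags, f uint8) bool { return flags&f != 0 }

func main() {
	var flags uint8
	flags = setFlag(flags, FlagBold, true)
	flags = setFlag(flags, FlagItalic, true)
	flags = setFlag(flags, FlagItalic, false)
	fmt.Println(hasFlag(flags, FlagBold), hasFlag(flags, FlagItalic)) // true false
}
```

Packing four booleans into one uint8 keeps each text record small, which matters when a page yields many text runs.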
Variables ¶
var (
	// ErrInvalidFont indicates a font definition is malformed or unsupported
	ErrInvalidFont = errors.New("invalid or unsupported font")
	// ErrUnsupportedEncoding indicates the character encoding is not supported
	ErrUnsupportedEncoding = errors.New("unsupported character encoding")
	// ErrMalformedStream indicates a content stream is malformed
	ErrMalformedStream = errors.New("malformed content stream")
	// ErrInvalidPage indicates an invalid page number or corrupted page
	ErrInvalidPage = errors.New("invalid page")
	// ErrEncrypted indicates the PDF is encrypted and cannot be read without a password
	ErrEncrypted = errors.New("PDF is encrypted")
	// ErrCorrupted indicates the PDF file structure is corrupted
	ErrCorrupted = errors.New("PDF file is corrupted")
	// ErrUnsupportedVersion indicates the PDF version is not supported
	ErrUnsupportedVersion = errors.New("unsupported PDF version")
	// ErrNoContent indicates the page has no content
	ErrNoContent = errors.New("page has no content")
)
Common errors
var DebugOn = false
DebugOn controls whether debug messages are logged to stdout. Set it to true if problems arise during reading.
var ErrContextCancelled = errors.New("pdf: context cancelled")
ErrContextCancelled is returned when a context is cancelled during PDF processing
var ErrInvalidPassword = fmt.Errorf("encrypted PDF: invalid password")
var ErrMaxParseTimeExceeded = errors.New("pdf: max parse time exceeded")
ErrMaxParseTimeExceeded is returned when max parse time is exceeded
var ErrMemoryLimitExceeded = errors.New("pdf: stream processor memory limit exceeded")
var ErrTimeout = errors.New("pdf: operation timeout")
ErrTimeout is returned when processing times out
var GlobalMetrics = &PerformanceMetrics{}
GlobalMetrics is the global performance metrics instance.
var GlobalPoolWarmer = &PoolWarmer{
bytePool: globalSizedBytePool,
textPool: globalSizedTextSlicePool,
}
GlobalPoolWarmer is the global pool warmer instance.
var SupportedVersions = []PDFVersion{
{1, 0}, {1, 1}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {1, 7},
{2, 0},
}
SupportedVersions defines the supported PDF versions
var Type1GlyphNames = map[string]rune{}/* 183 elements not displayed */
Type1GlyphNames provides glyph name to Unicode mapping
Functions ¶
func AutoWarmup ¶ added in v1.0.2
func AutoWarmup() error
AutoWarmup performs an automatic warmup, selecting a configuration based on available memory.
func BatchCompareFloat64 ¶
BatchCompareFloat64 is a SIMD-friendly batch operation (a pseudocode-level implementation; actual assembly would be needed for real SIMD).
func BatchHexDecode ¶ added in v1.1.6
BatchHexDecode processes multiple hex strings in parallel using SIMD operations
func BenchmarkSortingAlgorithms ¶ added in v1.0.1
BenchmarkSortingAlgorithms compares performance of different algorithms
func BytesToString ¶ added in v1.0.2
BytesToString converts a []byte to a string without copying. Warning: the returned string directly references the underlying byte array, so do not modify the original []byte afterwards.
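To make the warning concrete, here is a minimal sketch of one common way such a conversion is written with the unsafe package. The lowercase `bytesToString` is an illustrative stand-in and is not taken from the package's source; the actual implementation may differ.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets the slice header as a string header, so the
// result shares b's backing array instead of copying it.
// Illustrative stand-in, not the package's code.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return *(*string)(unsafe.Pointer(&b))
}

func main() {
	b := []byte("hello")
	s := bytesToString(b)
	fmt.Println(s, len(s)) // hello 5

	// Writing to b from here on would also change what s appears to contain,
	// breaking Go's assumption that strings are immutable. That is why the
	// source slice must be treated as read-only after the conversion.
}
```

The conversion saves one allocation and copy per call, at the cost of pushing the immutability guarantee onto the caller.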
func ClearGlobalStringPool ¶ added in v1.0.2
func ClearGlobalStringPool()
ClearGlobalStringPool clears the global string pool
func CompareStringsZeroCopy ¶ added in v1.0.2
CompareStringsZeroCopy compares two strings without copying. It returns -1 if s1 < s2, 0 if s1 == s2, and 1 if s1 > s2.
func DetectCJKOrdering ¶ added in v1.2.8
DetectCJKOrdering detects the CJK ordering from font name
func EstimateCapacity ¶
EstimateCapacity provides better capacity estimation for slices
func FastHexValidation ¶ added in v1.1.6
FastHexValidation performs SIMD-style validation of hex strings
func FastSortTexts ¶ added in v1.0.1
FastSortTexts sorts texts using the fastest algorithm for the comparison function
func FastSortTextsByX ¶ added in v1.0.1
func FastSortTextsByX(texts []Text)
FastSortTextsByX sorts texts by X coordinate using the fastest algorithm
func FastSortTextsByY ¶ added in v1.0.1
func FastSortTextsByY(texts []Text)
FastSortTextsByY sorts texts by Y coordinate using the fastest algorithm
func FastStringConcat ¶
FastStringConcat concatenates strings with optimized memory allocation
func FastStringConcatZC ¶ added in v1.0.2
FastStringConcatZC quickly concatenates multiple strings (zero-copy version).
Example ¶
ExampleFastStringConcatZC demonstrates fast string concatenation.
result := FastStringConcatZC("Hello", " ", "World", "!")
fmt.Println(result)
Output: Hello World!
func FastStringSearch ¶
FastStringSearch performs optimized string search using SIMD-like operations. This is a simplified implementation that can be extended with actual SIMD instructions.
func GetBuilder ¶
GetBuilder retrieves a strings.Builder from the pool
func GetByteBuffer ¶
func GetByteBuffer() *[]byte
GetByteBuffer retrieves a byte buffer from the pool
func GetCMapWritingMode ¶ added in v1.2.8
GetCMapWritingMode returns the writing mode for a CMap name. It returns 0 for horizontal, 1 for vertical, and -1 if unknown.
func GetContentExtractorSlices ¶ added in v1.2.3
GetContentExtractorSlices gets pre-allocated slices from pool
func GetIntSlice ¶ added in v1.2.3
GetIntSlice gets an int slice from pool
func GetSizedBuffer ¶ added in v1.0.1
GetSizedBuffer retrieves a byte buffer from the global sized pool. This is a convenience function for common use cases.
func GetVerticalVariant ¶ added in v1.2.8
GetVerticalVariant returns the vertical variant of a character if available
func GlyphNameToRune ¶ added in v1.2.8
GlyphNameToRune converts a glyph name to Unicode rune
func HasPrefixZeroCopy ¶ added in v1.0.2
HasPrefixZeroCopy checks for a prefix without copying.
func HasSuffixZeroCopy ¶ added in v1.0.2
HasSuffixZeroCopy checks for a suffix without copying.
func HexDecodeSIMD ¶ added in v1.1.6
HexDecodeSIMD performs SIMD-optimized hex string decoding. This function uses vectorized operations to decode hex strings efficiently.
func HilbertXYToIndex ¶
HilbertXYToIndex performs the Hilbert curve calculation (for spatial indexing).
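For readers unfamiliar with the technique, the following sketch shows the widely used iterative conversion from grid coordinates to a Hilbert curve distance. The lowercase `hilbertXYToIndex` mirrors the documented signature but is an illustrative reimplementation; it is an assumption that the package's function follows the same orientation and bit conventions.

```go
package main

import "fmt"

// hilbertXYToIndex maps (x, y) on a 2^order by 2^order grid to its distance
// along the Hilbert curve. It assumes x and y are below 2^order and that
// order is below 32. Illustrative; not the package's implementation.
func hilbertXYToIndex(x, y, order uint32) uint64 {
	n := uint32(1) << order
	var d uint64
	for s := n / 2; s > 0; s /= 2 {
		var rx, ry uint32
		if x&s != 0 {
			rx = 1
		}
		if y&s != 0 {
			ry = 1
		}
		// Each quadrant at this level contributes s*s cells per step.
		d += uint64(s) * uint64(s) * uint64((3*rx)^ry)
		// Rotate or flip so the next level sees a canonical orientation.
		if ry == 0 {
			if rx == 1 {
				x = n - 1 - x
				y = n - 1 - y
			}
			x, y = y, x
		}
	}
	return d
}

func main() {
	// On a 2x2 grid the curve visits (0,0), (0,1), (1,1), (1,0) in order.
	fmt.Println(hilbertXYToIndex(0, 0, 1), hilbertXYToIndex(0, 1, 1),
		hilbertXYToIndex(1, 1, 1), hilbertXYToIndex(1, 0, 1)) // 0 1 2 3
}
```

Sorting items by this index tends to keep spatially close points close together in the sorted order, which is what makes the curve useful for spatial indexing.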
func InitPredefinedCMaps ¶ added in v1.2.8
func InitPredefinedCMaps()
InitPredefinedCMaps initializes common predefined CMaps. These provide basic Unicode mappings for CJK character sets.
func InternRune ¶ added in v1.2.3
InternRune converts a rune to an interned string.
func InternString ¶ added in v1.0.2
InternString adds a string to the global pool.
func Interpret ¶
Interpret interprets the content in a stream as a basic PostScript program, pushing values onto a stack and then calling the do function to execute operators. The do function may push or pop values from the stack as needed to implement op.
Interpret handles the operators "dict", "currentdict", "begin", "end", "def", and "pop" itself.
Interpret is not a full-blown PostScript interpreter. Its job is to handle the very limited PostScript found in certain supporting file formats embedded in PDF files, such as cmap files that describe the mapping from font code points to Unicode code points.
A stream can also be represented by an array of streams, which has to be handled as a single stream. A simple stream is read only once; otherwise the length of the stream is determined first in order to handle it properly.
There is no support for executable blocks, among other limitations.
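The push-then-dispatch model described above can be sketched in a few lines. Everything here is hypothetical and written only for illustration: the `stack` type stands in for the package's Stack, `interpret` works on whitespace-separated text rather than a stream Value, and only numeric operands are handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stack is a hypothetical stand-in for the package's Stack type.
type stack struct{ vals []float64 }

func (s *stack) Push(v float64) { s.vals = append(s.vals, v) }

// Pop removes and returns the top value, or 0 when the stack is empty.
func (s *stack) Pop() float64 {
	if len(s.vals) == 0 {
		return 0
	}
	v := s.vals[len(s.vals)-1]
	s.vals = s.vals[:len(s.vals)-1]
	return v
}

// interpret pushes numeric tokens onto the stack and hands every other
// token to do as an operator.
func interpret(src string, do func(stk *stack, op string)) {
	var stk stack
	for _, tok := range strings.Fields(src) {
		if n, err := strconv.ParseFloat(tok, 64); err == nil {
			stk.Push(n)
			continue
		}
		do(&stk, tok)
	}
}

func main() {
	interpret("72 720 Td 12 Tf", func(stk *stack, op string) {
		switch op {
		case "Td":
			y, x := stk.Pop(), stk.Pop() // operands come off in reverse order
			fmt.Println("move to", x, y)
		case "Tf":
			fmt.Println("font size", stk.Pop())
		}
	})
	// Prints:
	// move to 72 720
	// font size 12
}
```

The callback owns all operator semantics, which is what lets the same loop serve content streams and cmap files alike.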
func InterpretWithContext ¶ added in v1.1.5
InterpretWithContext is like Interpret but accepts a context for cancellation support. When the context is cancelled, interpretation stops and returns.
func InterpretWithContextAndLimits ¶ added in v1.1.5
func InterpretWithContextAndLimits(ctx context.Context, strm Value, do func(stk *Stack, op string), limits *ParseLimits)
InterpretWithContextAndLimits is like InterpretWithContext but also accepts parse limits.
func IsCJKCMap ¶ added in v1.2.8
IsCJKCMap checks if a CMap name is for CJK (Chinese, Japanese, Korean) encoding
func IsSameSentence ¶
IsSameSentence checks if the current text segment likely belongs to the same sentence as the last text segment based on font, size, vertical position, and lack of sentence-ending punctuation in the last segment.
func IsType1Font ¶ added in v1.2.8
IsType1Font checks if a font value is a Type1 font
func JoinZeroCopy ¶ added in v1.0.2
JoinZeroCopy joins strings with a single allocation (zero-copy joining).
Example ¶
ExampleJoinZeroCopy demonstrates zero-copy joining.
parts := []string{"apple", "banana", "cherry"}
result := JoinZeroCopy(parts, ", ")
fmt.Println(result)
Output: apple, banana, cherry
func ListRegisteredCMaps ¶ added in v1.2.8
func ListRegisteredCMaps() []string
ListRegisteredCMaps returns a list of all registered predefined CMap names
func OptimizedStartup ¶ added in v1.0.2
func OptimizedStartup(config *StartupConfig) error
OptimizedStartup runs an optimized startup process that includes pool warmup, cache pre-allocation, etc.
func PreallocateCache ¶ added in v1.0.2
func PreallocateCache(fontCacheSize, resultCacheSize int)
PreallocateCache pre-allocates cache (additional feature)
func ProcessLargePDF ¶
func ProcessLargePDF(reader *Reader, chunkSize, bufferSize int, maxMemory int64, handler func(PageStream) error) error
ProcessLargePDF handles very large PDFs with streaming
func ProcessTextWithMultiLanguage ¶
func ProcessTextWithMultiLanguage(reader *Reader) (map[Language][]ClassifiedBlock, error)
ProcessTextWithMultiLanguage handles multi-language text processing for the entire PDF
func PutBlockSlice ¶
func PutBlockSlice(s []ClassifiedBlock)
PutBlockSlice returns a ClassifiedBlock slice to the pool
func PutBuilder ¶
PutBuilder returns a strings.Builder to the pool after resetting it
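The get/reset/put cycle that GetBuilder and PutBuilder describe is commonly built on sync.Pool. The sketch below is a self-contained illustration with lowercase, hypothetical names; it is an assumption that the package uses sync.Pool internally.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// builderPool hands out reusable builders. Hypothetical, for illustration.
var builderPool = sync.Pool{
	New: func() interface{} { return new(strings.Builder) },
}

// getBuilder takes a builder from the pool, allocating one if it is empty.
func getBuilder() *strings.Builder {
	return builderPool.Get().(*strings.Builder)
}

// putBuilder resets the builder and returns it to the pool.
func putBuilder(b *strings.Builder) {
	b.Reset()
	builderPool.Put(b)
}

func main() {
	b := getBuilder()
	b.WriteString("page ")
	b.WriteString("1")
	out := b.String() // take the result before giving the builder back
	putBuilder(b)
	fmt.Println(out) // page 1
}
```

The builder must not be used after it has been put back, since another goroutine may pick it up immediately.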
func PutByteBuffer ¶
func PutByteBuffer(buf *[]byte)
PutByteBuffer returns a byte buffer to the pool
func PutContentExtractorSlices ¶ added in v1.2.3
PutContentExtractorSlices returns slices to pool
func PutIntSlice ¶ added in v1.2.3
func PutIntSlice(s []int)
PutIntSlice returns an int slice to pool
func PutPDFBuffer ¶
func PutPDFBuffer(b *buffer)
PutPDFBuffer returns a PDF buffer to the pool after resetting
func PutSizedBuffer ¶ added in v1.0.1
func PutSizedBuffer(buf []byte)
PutSizedBuffer returns a byte buffer to the global sized pool. This is a convenience function for common use cases.
func PutSizedStringBuilder ¶ added in v1.0.1
func PutSizedStringBuilder(sb *FastStringBuilder, estimatedSize int)
PutSizedStringBuilder returns a string builder to the appropriate pool
func PutSizedTextSlice ¶ added in v1.0.1
func PutSizedTextSlice(slice []Text)
PutSizedTextSlice returns a Text slice to the global pool
func PutTextBlock ¶ added in v1.2.3
func PutTextBlock(tb *TextBlock)
PutTextBlock returns a TextBlock to pool
func PutTextBlocks ¶ added in v1.2.3
func PutTextBlocks(blocks []*TextBlock)
PutTextBlocks returns multiple TextBlocks to pool
func RegisterCJKFont ¶ added in v1.2.8
func RegisterCJKFont(name string, info *CJKFontInfo)
RegisterCJKFont registers a CJK font
func RegisterPredefinedCMap ¶ added in v1.2.8
func RegisterPredefinedCMap(name string, cmap *PredefinedCMap)
RegisterPredefinedCMap registers a predefined CMap
func ResetSortingMetrics ¶ added in v1.0.1
func ResetSortingMetrics()
ResetSortingMetrics resets the sorting metrics
func ShouldRotateGlyph ¶ added in v1.2.8
ShouldRotateGlyph returns true if the glyph should be rotated in vertical text
func SmartTextRunsToPlain ¶
SmartTextRunsToPlain converts text runs to plain text using improved ordering
func SplitZeroCopy ¶ added in v1.0.2
SplitZeroCopy splits a string without copying. Every string in the returned slice is a slice of the original string.
Example ¶
ExampleSplitZeroCopy demonstrates zero-copy splitting.
str := "a,b,c,d"
parts := SplitZeroCopy(str, ',')
for _, part := range parts {
	fmt.Println(part)
}
Output:
a
b
c
d
func StringSliceToByteSlice ¶ added in v1.0.2
StringSliceToByteSlice converts each string in a []string to a []byte without copying. Each element in the returned [][]byte is read-only.
func StringToBytes ¶ added in v1.0.2
StringToBytes converts a string to a []byte without copying. Warning: the returned []byte is read-only; do not modify it.
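For readers unfamiliar with how a string can become a []byte without a copy, here is a minimal sketch using unsafe.StringData and unsafe.Slice (Go 1.20 or later). The helper name stringToBytes is hypothetical, and this reference does not show how the package itself performs the conversion.

```go
package main

import (
	"fmt"
	"unsafe"
)

// stringToBytes is a hypothetical stand-in for a zero-copy conversion.
// The result aliases the string's memory and must never be written to.
func stringToBytes(s string) []byte {
	if s == "" {
		return nil
	}
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	s := "hello"
	b := stringToBytes(s)
	// Reading through b is fine; writing to b is undefined behaviour
	// because Go strings are immutable.
	fmt.Println(len(b), string(b) == s)
}
```

The read-only warning is the important part: the compiler may place string data in read-only memory, so a write through the returned slice can crash the program.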
func SubstringZeroCopy ¶ added in v1.0.2
SubstringZeroCopy extracts a substring without copying. Note that all string slicing in Go is already zero-copy.
func TrimSpaceZeroCopy ¶ added in v1.0.2
TrimSpaceZeroCopy trims leading and trailing spaces without copying.
Example ¶
ExampleTrimSpaceZeroCopy demonstrates zero-copy space trimming.
str := " hello world "
result := TrimSpaceZeroCopy(str)
fmt.Println(result)
Output:
hello world
func ValidatePDFA ¶ added in v1.2.0
ValidatePDFA validates PDF/A compliance
func ValidatePDFX ¶ added in v1.2.0
ValidatePDFX validates PDF/X compliance
func WarmupGlobal ¶ added in v1.0.2
func WarmupGlobal(config *WarmupConfig) error
WarmupGlobal warms up the global memory pool (convenience function).
func ZeroCopyStringSlice ¶
ZeroCopyStringSlice creates a string slice without copying data. WARNING: This is unsafe, and the returned strings share memory with the input.
Types ¶
type AccessPattern ¶ added in v1.0.2
type AccessPattern struct {
// contains filtered or unexported fields
}
AccessPattern records the access pattern of a single font
type AccessPatternTracker ¶ added in v1.0.2
type AccessPatternTracker struct {
// contains filtered or unexported fields
}
AccessPatternTracker tracks font access patterns
type AdaptiveCapacityEstimator ¶
type AdaptiveCapacityEstimator struct {
// contains filtered or unexported fields
}
AdaptiveCapacityEstimator is an adaptive capacity estimator. It dynamically adjusts pre-allocated capacity based on historical data, reducing reallocation.
func NewAdaptiveCapacityEstimator ¶
func NewAdaptiveCapacityEstimator(maxSamples int) *AdaptiveCapacityEstimator
NewAdaptiveCapacityEstimator creates a new adaptive estimator
func (*AdaptiveCapacityEstimator) Estimate ¶
func (ace *AdaptiveCapacityEstimator) Estimate(hint int) int
Estimate estimates required capacity based on historical data
func (*AdaptiveCapacityEstimator) Record ¶
func (ace *AdaptiveCapacityEstimator) Record(actual int)
Record records actual capacity used
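To make the Record/Estimate cycle concrete, here is a minimal sketch of one way such an estimator could work. Everything in it is an assumption made for illustration: the ring buffer, the 25% headroom, and the names estimator, record, and estimate. The package's actual estimation strategy is not documented here.

```go
package main

import "fmt"

// estimator is a hypothetical sketch: it keeps the last max observed sizes
// in a ring buffer and suggests their average plus 25% headroom.
type estimator struct {
	samples []int
	next    int // index of the oldest sample once the buffer is full
	max     int
}

func newEstimator(maxSamples int) *estimator {
	if maxSamples < 1 {
		maxSamples = 1
	}
	return &estimator{max: maxSamples}
}

// record stores an observed size, overwriting the oldest sample when full.
func (e *estimator) record(actual int) {
	if len(e.samples) < e.max {
		e.samples = append(e.samples, actual)
		return
	}
	e.samples[e.next] = actual
	e.next = (e.next + 1) % e.max
}

// estimate returns hint when there is no history, otherwise the larger of
// hint and the padded historical average.
func (e *estimator) estimate(hint int) int {
	if len(e.samples) == 0 {
		return hint
	}
	sum := 0
	for _, s := range e.samples {
		sum += s
	}
	avg := sum / len(e.samples)
	padded := avg + avg/4
	if padded > hint {
		return padded
	}
	return hint
}

func main() {
	e := newEstimator(2)
	e.record(100)
	e.record(200)
	buf := make([]byte, 0, e.estimate(10))
	fmt.Println(cap(buf))
}
```

The intended usage is to call the estimate before allocating and to record the size actually used afterwards, so later allocations start closer to the right capacity.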
type AdaptiveProcessor ¶ added in v1.0.2
type AdaptiveProcessor struct {
// contains filtered or unexported fields
}
AdaptiveProcessor is an adaptive processor. It automatically adjusts its concurrency level based on system load.
func NewAdaptiveProcessor ¶ added in v1.0.2
func NewAdaptiveProcessor(min, max int) *AdaptiveProcessor
NewAdaptiveProcessor creates an adaptive processor
func (*AdaptiveProcessor) AdjustWorkers ¶ added in v1.0.2
func (ap *AdaptiveProcessor) AdjustWorkers()
AdjustWorkers adjusts worker count based on system load
func (*AdaptiveProcessor) GetWorkerCount ¶ added in v1.0.2
func (ap *AdaptiveProcessor) GetWorkerCount() int
GetWorkerCount returns the current number of worker goroutines
type AdaptiveSorter ¶ added in v1.0.1
type AdaptiveSorter struct {
// contains filtered or unexported fields
}
AdaptiveSorter selects the best sorting algorithm based on data characteristics
func NewAdaptiveSorter ¶ added in v1.0.1
func NewAdaptiveSorter() *AdaptiveSorter
NewAdaptiveSorter creates a new adaptive sorter with default thresholds
func (*AdaptiveSorter) SortTextsByComparison ¶ added in v1.0.1
func (as *AdaptiveSorter) SortTextsByComparison(texts []Text, less func(i, j int) bool)
SortTextsByComparison sorts texts using a comparison function
func (*AdaptiveSorter) SortTextsByCoordinate ¶ added in v1.0.1
func (as *AdaptiveSorter) SortTextsByCoordinate(texts []Text, getCoord func(Text) float64)
SortTextsByCoordinate sorts texts by a numeric coordinate using the best algorithm
type AsyncReader ¶
type AsyncReader struct {
*Reader
// contains filtered or unexported fields
}
AsyncReader wraps a Reader to provide asynchronous operations
func NewAsyncReader ¶
func NewAsyncReader(reader *Reader) *AsyncReader
NewAsyncReader creates a new async reader with async I/O support
func (*AsyncReader) AsyncExtractStructured ¶
func (ar *AsyncReader) AsyncExtractStructured(ctx context.Context) (<-chan []ClassifiedBlock, <-chan error)
AsyncExtractStructured extracts structured text asynchronously
func (*AsyncReader) AsyncExtractText ¶
func (ar *AsyncReader) AsyncExtractText(ctx context.Context) (<-chan string, <-chan error)
AsyncExtractText extracts text from all pages asynchronously
func (*AsyncReader) AsyncExtractTextWithContext ¶
func (ar *AsyncReader) AsyncExtractTextWithContext(ctx context.Context, opts ExtractOptions) (<-chan string, <-chan error)
AsyncExtractTextWithContext extracts text with cancellation and timeout support
func (*AsyncReader) AsyncStream ¶
func (ar *AsyncReader) AsyncStream(ctx context.Context, processor func(Page, int) error) <-chan error
AsyncStream processes the PDF file with async I/O operations
func (*AsyncReader) StreamValueReader ¶
func (ar *AsyncReader) StreamValueReader(ctx context.Context, v Value) (<-chan []byte, <-chan error)
StreamValueReader provides async streaming of value data
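Several AsyncReader methods return a result channel paired with an error channel. The sketch below shows one common way to consume such a pair while still honouring context cancellation. The producer extractAsync and the helpers await and runDemo are hypothetical; whether the real methods buffer or close their channels is not stated in this reference.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// extractAsync is a hypothetical producer with the same channel shape as
// the async methods above. Both channels are buffered so the goroutine
// never blocks if the caller stops listening.
func extractAsync(ctx context.Context, work func() (string, error)) (<-chan string, <-chan error) {
	textCh := make(chan string, 1)
	errCh := make(chan error, 1)
	go func() {
		if err := ctx.Err(); err != nil {
			errCh <- err
			return
		}
		text, err := work()
		if err != nil {
			errCh <- err
			return
		}
		textCh <- text
	}()
	return textCh, errCh
}

// await waits for exactly one of: a result, an error, or cancellation.
func await(ctx context.Context, textCh <-chan string, errCh <-chan error) (string, error) {
	select {
	case text := <-textCh:
		return text, nil
	case err := <-errCh:
		return "", err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// runDemo wires the producer and consumer together with a one-second timeout.
func runDemo() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	textCh, errCh := extractAsync(ctx, func() (string, error) {
		return "page text", nil
	})
	return await(ctx, textCh, errCh)
}

func main() {
	text, err := runDemo()
	fmt.Println(text, err)
}
```

Selecting on all three channels at once keeps the caller from hanging when the producer fails or the context expires before a result arrives.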
type AsyncReaderAt ¶
type AsyncReaderAt struct {
// contains filtered or unexported fields
}
AsyncReaderAt provides async I/O for low-level file operations
func NewAsyncReaderAt ¶
func NewAsyncReaderAt(reader io.ReaderAt) *AsyncReaderAt
NewAsyncReaderAt creates a new async reader with async I/O support
func (*AsyncReaderAt) ReadAtAsync ¶
func (ara *AsyncReaderAt) ReadAtAsync(ctx context.Context, buf []byte, offset int64) (<-chan int, <-chan error)
ReadAtAsync reads from the file asynchronously
type BatchExtractOptions ¶ added in v1.0.1
type BatchExtractOptions struct {
// Pages to extract (nil means all pages)
Pages []int
// Number of concurrent workers (0 = NumCPU)
Workers int
// Whether to use smart text ordering
SmartOrdering bool
// Context for cancellation
Context context.Context
// Buffer size for each page result (0 = default 2KB)
PageBufferSize int
// Whether to enable font cache for this batch (default: false)
// When enabled, a temporary font cache is created for the batch
// to reduce redundant font parsing across pages
UseFontCache bool
// Maximum number of fonts to cache (0 = default 1000)
// Only used when UseFontCache is true
FontCacheSize int
// FontCacheType specifies which cache implementation to use
// - FontCacheStandard: Standard implementation (default)
// - FontCacheOptimized: High-performance optimized cache (10-85x faster)
// Only used when UseFontCache is true
FontCacheType FontCacheType
// PageTimeout is the maximum time allowed for processing a single page
// If zero, defaults to 30 seconds. Set to negative value to disable.
PageTimeout time.Duration
// ParseLimits configures resource limits for parsing operations
// If nil, uses DefaultParseLimits()
ParseLimits *ParseLimits
}
BatchExtractOptions configures batch extraction behavior
Example (OptimizedCache) ¶
ExampleBatchExtractOptions_optimizedCache demonstrates using optimized cache
// This example shows how to use the optimized cache
opts := BatchExtractOptions{
Workers: 8,
SmartOrdering: true,
UseFontCache: true,
FontCacheType: FontCacheOptimized, // Optimized cache (10-85x faster)
FontCacheSize: 2000,
}
fmt.Printf("Cache type: Optimized, Size: %d\n", opts.FontCacheSize)
Output: Cache type: Optimized, Size: 2000
Example (StandardCache) ¶
ExampleBatchExtractOptions_standardCache demonstrates using standard cache
// This example shows how to use the standard cache
opts := BatchExtractOptions{
Workers: 4,
SmartOrdering: true,
UseFontCache: true,
FontCacheType: FontCacheStandard, // Standard cache
FontCacheSize: 1000,
}
fmt.Printf("Cache type: Standard, Size: %d\n", opts.FontCacheSize)
Output: Cache type: Standard, Size: 1000
type BatchResult ¶ added in v1.0.1
BatchResult contains the result of extracting a single page
type BatchStringBuilder ¶
type BatchStringBuilder struct {
// contains filtered or unexported fields
}
BatchStringBuilder is a batch string builder. It avoids multiple reallocations by precisely calculating the required capacity.
func NewBatchStringBuilder ¶
func NewBatchStringBuilder(texts []Text) *BatchStringBuilder
NewBatchStringBuilder creates a batch string builder
func (*BatchStringBuilder) AppendTexts ¶
func (bsb *BatchStringBuilder) AppendTexts(texts []Text) string
AppendTexts appends text content in batch
func (*BatchStringBuilder) Reset ¶
func (bsb *BatchStringBuilder) Reset()
Reset resets the builder for reuse
func (*BatchStringBuilder) String ¶
func (bsb *BatchStringBuilder) String() string
String returns the built string
type CCITTFaxDecoder ¶ added in v1.2.8
type CCITTFaxDecoder struct {
// contains filtered or unexported fields
}
CCITTFaxDecoder decodes CCITT Group 3 and Group 4 fax encoded data as specified in PDF 32000-1:2008, Section 7.4.6
func NewCCITTFaxDecoder ¶ added in v1.2.8
func NewCCITTFaxDecoder(r io.Reader, params CCITTFaxParams) *CCITTFaxDecoder
NewCCITTFaxDecoder creates a new CCITT fax decoder
type CCITTFaxParams ¶ added in v1.2.8
type CCITTFaxParams struct {
K int // <0: pure 2D (Group 4), 0: pure 1D (Group 3), >0: mixed
EndOfLine bool // If true, require EOL alignment bits
EncodedByteAlign bool // If true, encoded data is byte-aligned after each row
Columns int // Width of image in pixels (default: 1728)
Rows int // Height of image (0 = unknown)
EndOfBlock bool // If true, expect EOFB sequence
BlackIs1 bool // If true, 1 bits represent black pixels
DamagedRowsBeforeError int // Max consecutive damaged rows before error
}
CCITTFaxParams contains parameters for CCITT fax decoding
func DefaultCCITTFaxParams ¶ added in v1.2.8
func DefaultCCITTFaxParams() CCITTFaxParams
DefaultCCITTFaxParams returns default CCITT fax parameters
func ParseCCITTFaxParams ¶ added in v1.2.8
func ParseCCITTFaxParams(param Value) CCITTFaxParams
ParseCCITTFaxParams parses CCITT fax parameters from a Value
type CFFCache ¶ added in v1.2.8
type CFFCache struct {
// contains filtered or unexported fields
}
CFFCache provides caching for CFF font parsing and decoding operations
func GetGlobalCFFCache ¶ added in v1.2.8
func GetGlobalCFFCache() *CFFCache
GetGlobalCFFCache returns the global CFF cache instance
func NewCFFCache ¶ added in v1.2.8
NewCFFCache creates a new CFF cache
func (*CFFCache) GetDecoding ¶ added in v1.2.8
GetDecoding retrieves cached character string decoding results
func (*CFFCache) PutDecoding ¶ added in v1.2.8
PutDecoding caches character string decoding results
type CFFCacheEntry ¶ added in v1.2.8
type CFFCacheEntry struct {
Data interface{}
Expiration time.Time
LastAccess time.Time
AccessCount int64
}
CFFCacheEntry represents a cached CFF font or decoded result
func (*CFFCacheEntry) IsExpired ¶ added in v1.2.8
func (ce *CFFCacheEntry) IsExpired() bool
IsExpired checks if the cache entry has expired
type CFFCharStringDecoder ¶ added in v1.2.8
type CFFCharStringDecoder struct {
// contains filtered or unexported fields
}
CFFCharStringDecoder decodes CFF CharString data
func NewCFFCharStringDecoder ¶ added in v1.2.8
func NewCFFCharStringDecoder(data []byte) *CFFCharStringDecoder
NewCFFCharStringDecoder creates a new CharString decoder with pooled objects
func (*CFFCharStringDecoder) Decode ¶ added in v1.2.8
func (d *CFFCharStringDecoder) Decode() ([]interface{}, error)
Decode decodes the CharString and returns the path commands with caching and pooling
func (*CFFCharStringDecoder) GetWidth ¶ added in v1.2.8
func (d *CFFCharStringDecoder) GetWidth() (float64, bool)
GetWidth returns the glyph width if available
type CFFDict ¶ added in v1.2.8
type CFFDict struct {
Data map[int]interface{}
}
CFFDict represents a CFF DICT data structure
type CFFFont ¶ added in v1.2.8
type CFFFont struct {
Header *CFFHeader
NameIndex *CFFIndex
TopDict *CFFDict
StringIndex *CFFIndex
GlobalSubrs *CFFIndex
CharStrings *CFFIndex
PrivateDict *CFFDict
LocalSubrs *CFFIndex
FDArray []*CFFDict // For CID-keyed fonts
FDSelect []byte // For CID-keyed fonts
// contains filtered or unexported fields
}
CFFFont represents a parsed CFF font
func NewCFFFont ¶ added in v1.2.8
NewCFFFont parses CFF font data with caching
func (*CFFFont) GetCharString ¶ added in v1.2.8
GetCharString returns the CharString for a given glyph index
func (*CFFFont) GetFDIndex ¶ added in v1.2.8
GetFDIndex returns the Font DICT index for a CID (CID-keyed fonts)
func (*CFFFont) GetFontName ¶ added in v1.2.8
GetFontName returns the font name
type CFFObjectPool ¶ added in v1.2.8
type CFFObjectPool struct {
// contains filtered or unexported fields
}
CFFObjectPool provides object pooling for CFF decoding operations
func GetGlobalCFFPool ¶ added in v1.2.8
func GetGlobalCFFPool() *CFFObjectPool
GetGlobalCFFPool returns the global CFF object pool instance
func NewCFFObjectPool ¶ added in v1.2.8
func NewCFFObjectPool() *CFFObjectPool
NewCFFObjectPool creates a new CFF object pool
func (*CFFObjectPool) GetCommandSlice ¶ added in v1.2.8
func (p *CFFObjectPool) GetCommandSlice() []interface{}
GetCommandSlice retrieves a command slice from the pool
func (*CFFObjectPool) GetStack ¶ added in v1.2.8
func (p *CFFObjectPool) GetStack() []float64
GetStack retrieves a stack slice from the pool
func (*CFFObjectPool) PutCommandSlice ¶ added in v1.2.8
func (p *CFFObjectPool) PutCommandSlice(commands []interface{})
PutCommandSlice returns a command slice to the pool
func (*CFFObjectPool) PutStack ¶ added in v1.2.8
func (p *CFFObjectPool) PutStack(stack []float64)
PutStack returns a stack slice to the pool
type CIDFont ¶ added in v1.2.8
type CIDFont struct {
// contains filtered or unexported fields
}
CIDFont represents a CID-keyed font
func (*CIDFont) DecodeToUnicode ¶ added in v1.2.8
DecodeToUnicode decodes a string using the CMap and ToUnicode
func (*CIDFont) SetCIDToGIDMap ¶ added in v1.2.8
func (f *CIDFont) SetCIDToGIDMap(m *CIDToGIDMap)
SetCIDToGIDMap sets the CID to GID mapping
func (*CIDFont) SetDefaultWidth ¶ added in v1.2.8
SetDefaultWidth sets the default glyph width
func (*CIDFont) SetToUnicode ¶ added in v1.2.8
func (f *CIDFont) SetToUnicode(toUnicode *ToUnicodeCMap)
SetToUnicode sets the ToUnicode CMap
func (*CIDFont) SetWritingMode ¶ added in v1.2.8
SetWritingMode sets the writing mode (0=horizontal, 1=vertical)
func (*CIDFont) WritingMode ¶ added in v1.2.8
WritingMode returns the writing mode
type CIDFontDescriptor ¶ added in v1.2.8
type CIDFontDescriptor struct {
FontName string
FontFamily string
Flags int
FontBBox [4]float64
ItalicAngle float64
Ascent float64
Descent float64
Leading float64
CapHeight float64
XHeight float64
StemV float64
StemH float64
AvgWidth float64
MaxWidth float64
MissingWidth float64
}
CIDFontDescriptor contains font descriptor information for CID fonts
type CIDSystemInfo ¶ added in v1.2.8
CIDSystemInfo represents the CIDSystemInfo dictionary in a CMap
type CIDToGIDMap ¶ added in v1.2.8
type CIDToGIDMap struct {
// contains filtered or unexported fields
}
CIDToGIDMap represents a CIDToGIDMap for CID-keyed fonts
func NewCIDToGIDMap ¶ added in v1.2.8
func NewCIDToGIDMap(data []byte) *CIDToGIDMap
NewCIDToGIDMap creates a CIDToGIDMap from raw data
func NewIdentityCIDToGIDMap ¶ added in v1.2.8
func NewIdentityCIDToGIDMap() *CIDToGIDMap
NewIdentityCIDToGIDMap creates an identity CIDToGIDMap
func (*CIDToGIDMap) IsIdentity ¶ added in v1.2.8
func (m *CIDToGIDMap) IsIdentity() bool
IsIdentity returns true if this is an identity mapping
func (*CIDToGIDMap) LookupGID ¶ added in v1.2.8
func (m *CIDToGIDMap) LookupGID(cid int) int
LookupGID returns the GID for a given CID
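A CIDToGIDMap stream in a PDF stores one big-endian 16-bit glyph index per CID, at bytes 2c and 2c+1 for CID c. The sketch below shows that lookup under stated assumptions: lookupGID is a hypothetical helper, a nil table stands for the Identity mapping, and returning glyph 0 for out-of-range CIDs is a choice made for this sketch rather than documented package behaviour.

```go
package main

import "fmt"

// lookupGID is a hypothetical helper, not the package's implementation.
// data holds the decoded CIDToGIDMap stream: the glyph index for CID c is
// the big-endian 16-bit value at bytes 2*c and 2*c+1. A nil table stands
// for the Identity mapping, where the GID equals the CID.
func lookupGID(data []byte, cid int) int {
	if cid < 0 {
		return 0
	}
	if data == nil {
		return cid
	}
	off := 2 * cid
	if off+1 >= len(data) {
		return 0 // assumption: out-of-range CIDs fall back to glyph 0
	}
	return int(data[off])<<8 | int(data[off+1])
}

func main() {
	// Three entries: CID 0 -> GID 0, CID 1 -> GID 42, CID 2 -> GID 256.
	table := []byte{0x00, 0x00, 0x00, 0x2A, 0x01, 0x00}
	fmt.Println(lookupGID(table, 1), lookupGID(table, 2), lookupGID(nil, 7))
}
```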
type CJKFontInfo ¶ added in v1.2.8
type CJKFontInfo struct {
Name string // Font name
Registry string // Registry (e.g., "Adobe")
Ordering string // Ordering (e.g., "GB1", "CNS1", "Japan1", "Korea1")
Supplement int // Supplement number
IsVertical bool // Whether vertical writing mode
WMode int // Writing mode: 0=horizontal, 1=vertical
}
CJKFontInfo contains information about CJK fonts
func GetCJKFontInfo ¶ added in v1.2.8
func GetCJKFontInfo(name string) *CJKFontInfo
GetCJKFontInfo returns information about a CJK font
type CJKFontRegistry ¶ added in v1.2.8
type CJKFontRegistry struct {
// contains filtered or unexported fields
}
CJKFontRegistry is a registry of CJK fonts
type CJKGlyphMetrics ¶ added in v1.2.8
type CJKGlyphMetrics struct {
Width float64 // Horizontal advance width
Height float64 // Vertical advance height
VOriginX float64 // Vertical origin X
VOriginY float64 // Vertical origin Y
HasVertical bool // Whether vertical metrics are available
}
CJKGlyphMetrics contains glyph metrics for CJK fonts
type CJKTextProcessor ¶ added in v1.2.8
type CJKTextProcessor struct {
// contains filtered or unexported fields
}
CJKTextProcessor processes CJK text for proper rendering
func NewCJKTextProcessor ¶ added in v1.2.8
func NewCJKTextProcessor(font *ExtendedCIDFont, isVertical bool) *CJKTextProcessor
NewCJKTextProcessor creates a new CJK text processor
func (*CJKTextProcessor) GetGlyphMetrics ¶ added in v1.2.8
func (p *CJKTextProcessor) GetGlyphMetrics(cid int) CJKGlyphMetrics
GetGlyphMetrics returns the metrics for a glyph in the current writing mode
func (*CJKTextProcessor) ProcessText ¶ added in v1.2.8
func (p *CJKTextProcessor) ProcessText(text string) string
ProcessText processes CJK text, handling vertical writing and character variants
type CMap ¶ added in v1.2.8
type CMap struct {
Name string
Type CMapType
CIDSystemInfo CIDSystemInfo
WMode int // 0: horizontal, 1: vertical
// contains filtered or unexported fields
}
CMap represents a character code to CID or Unicode mapping
func (*CMap) AddBFChar ¶ added in v1.2.8
AddBFChar adds a single base font character mapping (for ToUnicode)
func (*CMap) AddBFRange ¶ added in v1.2.8
AddBFRange adds a base font range mapping (for ToUnicode)
func (*CMap) AddCIDChar ¶ added in v1.2.8
AddCIDChar adds a single CID character mapping
func (*CMap) AddCIDRange ¶ added in v1.2.8
AddCIDRange adds a CID range mapping
func (*CMap) AddCodeSpaceRange ¶ added in v1.2.8
AddCodeSpaceRange adds a code space range to the CMap
func (*CMap) Decode ¶ added in v1.2.8
Decode implements TextEncoding interface for ToUnicode CMaps (lock-free)
func (*CMap) LookupCID ¶ added in v1.2.8
LookupCID looks up the CID for a given character code (lock-free)
func (*CMap) OptimizeCIDLookup ¶ added in v1.2.8
func (c *CMap) OptimizeCIDLookup()
OptimizeCIDLookup precomputes CID mappings for fast lookup. This should be called after all CID mappings are added to the CMap.
func (*CMap) SetCIDSystemInfo ¶ added in v1.2.8
SetCIDSystemInfo sets the CIDSystemInfo for the CMap
func (*CMap) SetUseCMap ¶ added in v1.2.8
func (c *CMap) SetUseCMap(parent TextEncoding)
SetUseCMap sets the parent CMap to use for unmapped codes
type CMapInfo ¶ added in v1.2.8
type CMapInfo struct {
Name string
Registry string
Ordering string
Supplement int
WMode int
Type CMapType
}
CMapInfo contains information about a CMap
func GetCMapInfo ¶ added in v1.2.8
GetCMapInfo returns information about a registered CMap
type CMapParser ¶ added in v1.2.8
type CMapParser struct {
// contains filtered or unexported fields
}
CMapParser parses CMap files/streams
type CacheContext ¶
type CacheContext struct {
// contains filtered or unexported fields
}
CacheContext provides a context-aware cache with automatic cleanup
func NewCacheContext ¶
func NewCacheContext(parent context.Context, cache *ResultCache) *CacheContext
NewCacheContext creates a new context-aware cache
func (*CacheContext) Close ¶
func (cc *CacheContext) Close()
Close releases resources used by the cache context
func (*CacheContext) GetWithTimeout ¶
func (cc *CacheContext) GetWithTimeout(key string, timeout time.Duration) (interface{}, bool, error)
GetWithTimeout gets a value with timeout
type CacheEntry ¶
type CacheEntry struct {
Data interface{}
Expiration time.Time
AccessCount int64
LastAccess time.Time
Size int64 // Estimated size in bytes
}
CacheEntry represents a cached item
func (*CacheEntry) IsExpired ¶
func (ce *CacheEntry) IsExpired() bool
IsExpired checks if the cache entry has expired
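As an illustration of time-based expiry, here is a minimal sketch of an expiration check against a field like Expiration. The entry type, the expiredAt method, and the rule that a zero Expiration means "never expires" are all assumptions of this sketch; the reference does not say how IsExpired treats a zero time.

```go
package main

import (
	"fmt"
	"time"
)

// entry mirrors the two CacheEntry fields relevant to expiry.
type entry struct {
	Data       interface{}
	Expiration time.Time
}

// expiredAt reports whether e has expired at the instant now. Treating a
// zero Expiration as "never expires" is an assumption of this sketch.
func (e *entry) expiredAt(now time.Time) bool {
	if e.Expiration.IsZero() {
		return false
	}
	return now.After(e.Expiration)
}

// demoExpiry checks three entries against a fixed instant: one that expires
// a minute later, one that expired a minute earlier, and one with no expiry.
func demoExpiry() (fresh, stale, forever bool) {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	a := &entry{Data: "a", Expiration: now.Add(time.Minute)}
	b := &entry{Data: "b", Expiration: now.Add(-time.Minute)}
	c := &entry{Data: "c"}
	return a.expiredAt(now), b.expiredAt(now), c.expiredAt(now)
}

func main() {
	fmt.Println(demoExpiry())
}
```

Passing the current time in as a parameter, instead of calling time.Now inside the check, keeps the logic deterministic and easy to test.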
type CacheKeyGenerator ¶
type CacheKeyGenerator struct{}
CacheKeyGenerator provides functions to generate cache keys
func NewCacheKeyGenerator ¶
func NewCacheKeyGenerator() *CacheKeyGenerator
NewCacheKeyGenerator creates a new key generator
func (*CacheKeyGenerator) GenerateFullHash ¶
func (ckg *CacheKeyGenerator) GenerateFullHash(data string) string
GenerateFullHash generates a hash from arbitrary data
func (*CacheKeyGenerator) GeneratePageContentKey ¶
func (ckg *CacheKeyGenerator) GeneratePageContentKey(pageNum int, readerHash string) string
GeneratePageContentKey generates a cache key for page content
func (*CacheKeyGenerator) GenerateReaderHash ¶
func (ckg *CacheKeyGenerator) GenerateReaderHash(reader *Reader) string
GenerateReaderHash generates a hash for the reader object (simplified)
func (*CacheKeyGenerator) GenerateTextClassificationKey ¶
func (ckg *CacheKeyGenerator) GenerateTextClassificationKey(pageNum int, readerHash string, processorParams string) string
GenerateTextClassificationKey generates a cache key for text classification
func (*CacheKeyGenerator) GenerateTextOrderingKey ¶
func (ckg *CacheKeyGenerator) GenerateTextOrderingKey(pageNum int, readerHash string, orderingParams string) string
GenerateTextOrderingKey generates a cache key for text ordering
type CacheLineAlignedCounter ¶
type CacheLineAlignedCounter struct {
// contains filtered or unexported fields
}
func NewCacheLineAlignedCounter ¶
func NewCacheLineAlignedCounter(n int) *CacheLineAlignedCounter
func (*CacheLineAlignedCounter) Add ¶
func (c *CacheLineAlignedCounter) Add(idx int, delta uint64)
func (*CacheLineAlignedCounter) Get ¶
func (c *CacheLineAlignedCounter) Get(idx int) uint64
type CacheLinePadded ¶
type CacheLinePadded struct {
// contains filtered or unexported fields
}
CacheLinePadded is a cache-line-aligned structure
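Cache-line padding is a technique for avoiding false sharing: counters updated by different goroutines are spaced far enough apart that they never occupy the same cache line. The sketch below illustrates the idea with hypothetical types (paddedCounter, counters). The 64-byte line size is an assumption, since some CPUs use 128-byte lines, and the layout of the package's own unexported fields is not shown in this reference.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// paddedCounter occupies 64 bytes (8 for the counter, 56 of padding), so
// the counters of adjacent slice elements are 64 bytes apart and never
// share an assumed 64-byte cache line.
type paddedCounter struct {
	n atomic.Uint64
	_ [56]byte
}

type counters struct{ slots []paddedCounter }

func newCounters(n int) *counters {
	return &counters{slots: make([]paddedCounter, n)}
}

func (c *counters) add(idx int, delta uint64) { c.slots[idx].n.Add(delta) }
func (c *counters) get(idx int) uint64        { return c.slots[idx].n.Load() }

// countConcurrently has each worker increment its own slot perWorker times
// and returns the sum over all slots.
func countConcurrently(workers, perWorker int) uint64 {
	c := newCounters(workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(idx int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.add(idx, 1)
			}
		}(w)
	}
	wg.Wait()
	var total uint64
	for w := 0; w < workers; w++ {
		total += c.get(w)
	}
	return total
}

func main() {
	fmt.Println(countConcurrently(4, 1000))
}
```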
type CacheManager ¶
type CacheManager struct {
// contains filtered or unexported fields
}
CacheManager provides centralized cache management
func NewCacheManager ¶
func NewCacheManager() *CacheManager
NewCacheManager creates a new cache manager with separate caches for different data types
func (*CacheManager) GetClassificationCache ¶
func (cm *CacheManager) GetClassificationCache() *ResultCache
GetClassificationCache returns the classification cache
func (*CacheManager) GetMetadataCache ¶
func (cm *CacheManager) GetMetadataCache() *ResultCache
GetMetadataCache returns the metadata cache
func (*CacheManager) GetPageCache ¶
func (cm *CacheManager) GetPageCache() *ResultCache
GetPageCache returns the page content cache
func (*CacheManager) GetTextOrderingCache ¶
func (cm *CacheManager) GetTextOrderingCache() *ResultCache
GetTextOrderingCache returns the text ordering cache
func (*CacheManager) GetTotalStats ¶
func (cm *CacheManager) GetTotalStats() CacheStats
GetTotalStats returns combined statistics for all caches
type CacheShard ¶ added in v1.0.2
type CacheShard struct {
// contains filtered or unexported fields
}
CacheShard represents a single shard of the cache
type CacheStats ¶
type CacheStats struct {
Hits int64
Misses int64
Evictions int64
CurrentSize int64
MaxSize int64
Entries int64
}
CacheStats provides statistics about cache performance
type CachedReader ¶
type CachedReader struct {
*Reader
// contains filtered or unexported fields
}
CachedReader wraps a Reader to provide caching functionality
func NewCachedReader ¶
func NewCachedReader(reader *Reader, cache *ResultCache) *CachedReader
NewCachedReader creates a new cached reader
func (*CachedReader) CachedClassifyTextBlocks ¶
func (cr *CachedReader) CachedClassifyTextBlocks(pageNum int) ([]ClassifiedBlock, error)
CachedClassifyTextBlocks returns classified text blocks with caching
func (*CachedReader) CachedPage ¶
func (cr *CachedReader) CachedPage(pageNum int) ([]Text, error)
CachedPage returns page content with caching
type ClassifiedBlock ¶
type ClassifiedBlock struct {
Type BlockType // Semantic type of the block
Level int // Hierarchy level (for titles: 1=h1, 2=h2, etc.)
Content []Text // Text runs in this block
Bounds Rect // Bounding box
Text string // Concatenated text content
}
ClassifiedBlock represents a classified block of text with semantic information
func GetBlockSlice ¶
func GetBlockSlice() []ClassifiedBlock
GetBlockSlice retrieves a ClassifiedBlock slice from the pool
func GetTextByType ¶
func GetTextByType(blocks []ClassifiedBlock, blockType BlockType) []ClassifiedBlock
GetTextByType returns all text blocks of a specific type
func GetTitles ¶
func GetTitles(blocks []ClassifiedBlock, level int) []ClassifiedBlock
GetTitles returns all title blocks, optionally filtered by level
type ClassifiedBlockWithLanguage ¶
type ClassifiedBlockWithLanguage struct {
ClassifiedBlock
Language LanguageInfo
}
ClassifiedBlockWithLanguage represents a classified block with language information
type Column ¶
type Column struct {
Position int64
Content TextVertical
}
Column represents the contents of a column
type ConnectionPool ¶
type ConnectionPool struct {
// contains filtered or unexported fields
}
ConnectionPool manages a pool of connections/resources
func NewConnectionPool ¶
func NewConnectionPool(maxSize int, newFunc func() interface{}, closeFunc func(interface{})) *ConnectionPool
NewConnectionPool creates a new connection pool
func (*ConnectionPool) Close ¶
func (cp *ConnectionPool) Close()
Close closes all connections in the pool
func (*ConnectionPool) Get ¶
func (cp *ConnectionPool) Get() interface{}
Get retrieves a connection from the pool
func (*ConnectionPool) Put ¶
func (cp *ConnectionPool) Put(conn interface{})
Put returns a connection to the pool
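The constructor's three arguments (a maximum size, a factory, and a closer) suggest a bounded pool of reusable resources. Here is a minimal sketch of that pattern built on a buffered channel. The pool type and its methods are hypothetical, and the choice to close surplus resources on put is an assumption; ConnectionPool's internals are not shown in this reference.

```go
package main

import "fmt"

// pool is a hypothetical bounded resource pool built on a buffered channel.
type pool struct {
	idle      chan interface{}
	newFunc   func() interface{}
	closeFunc func(interface{})
}

func newPool(maxSize int, newFunc func() interface{}, closeFunc func(interface{})) *pool {
	return &pool{
		idle:      make(chan interface{}, maxSize),
		newFunc:   newFunc,
		closeFunc: closeFunc,
	}
}

// get reuses an idle resource if one is available, otherwise creates one.
func (p *pool) get() interface{} {
	select {
	case r := <-p.idle:
		return r
	default:
		return p.newFunc()
	}
}

// put keeps the resource for reuse, or closes it when the pool is full.
func (p *pool) put(r interface{}) {
	select {
	case p.idle <- r:
	default:
		p.closeFunc(r)
	}
}

// shutdown closes every idle resource. Calling get or put afterwards is
// not supported in this sketch.
func (p *pool) shutdown() {
	close(p.idle)
	for r := range p.idle {
		p.closeFunc(r)
	}
}

func main() {
	next := 0
	p := newPool(2,
		func() interface{} { next++; return next },
		func(r interface{}) { fmt.Println("closed", r) },
	)
	a := p.get()
	p.put(a)
	fmt.Println("reused:", p.get() == a)
	p.shutdown()
}
```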
type CryptoEngine ¶ added in v1.2.8
type CryptoEngine struct {
// contains filtered or unexported fields
}
CryptoEngine provides encryption/decryption functionality
func NewCryptoEngine ¶ added in v1.2.8
func NewCryptoEngine(info *PDFEncryptionInfo) *CryptoEngine
NewCryptoEngine creates a new crypto engine
func (*CryptoEngine) DecryptData ¶ added in v1.2.8
func (e *CryptoEngine) DecryptData(data []byte, objID, genID int) ([]byte, error)
DecryptData decrypts data using the current encryption method
func (*CryptoEngine) EncryptData ¶ added in v1.2.8
func (e *CryptoEngine) EncryptData(data []byte, objID, genID int) ([]byte, error)
EncryptData encrypts data using the current encryption method
func (*CryptoEngine) SetKey ¶ added in v1.2.8
func (e *CryptoEngine) SetKey(key []byte)
SetKey sets the encryption key
type EncryptionMethod ¶ added in v1.2.8
type EncryptionMethod int
EncryptionMethod represents the encryption method
const (
	MethodRC4   EncryptionMethod = 0
	MethodAESV2 EncryptionMethod = 1 // AES-128 CBC
	MethodAESV3 EncryptionMethod = 2 // AES-256 CBC
)
type EncryptionRevision ¶ added in v1.2.8
type EncryptionRevision int
EncryptionRevision represents PDF encryption revision
const (
	Revision2 EncryptionRevision = 2 // MD5-based
	Revision3 EncryptionRevision = 3 // MD5-based with key strengthening
	Revision4 EncryptionRevision = 4 // MD5-based with access permissions
	Revision5 EncryptionRevision = 5 // SHA-256-based
	Revision6 EncryptionRevision = 6 // SHA-384/512-based
)
type EncryptionVersion ¶ added in v1.2.8
type EncryptionVersion int
EncryptionVersion represents PDF encryption version
const (
	EncryptionV1 EncryptionVersion = 1 // RC4 40-bit
	EncryptionV2 EncryptionVersion = 2 // RC4 40-128-bit
	EncryptionV4 EncryptionVersion = 4 // RC4 or AES 128-bit
	EncryptionV5 EncryptionVersion = 5 // AES 256-bit
)
type EnhancedParallelProcessor ¶ added in v1.0.2
type EnhancedParallelProcessor struct {
// contains filtered or unexported fields
}
EnhancedParallelProcessor is an enhanced parallel processor. It provides better concurrency control, load balancing, and error handling.
func NewEnhancedParallelProcessor ¶ added in v1.0.2
func NewEnhancedParallelProcessor(workers int, batchSize int) *EnhancedParallelProcessor
NewEnhancedParallelProcessor creates an enhanced parallel processor
func (*EnhancedParallelProcessor) ProcessPagesEnhanced ¶ added in v1.0.2
func (epp *EnhancedParallelProcessor) ProcessPagesEnhanced(
	ctx context.Context,
	pages []Page,
	processorFunc func(Page) ([]Text, error),
) ([][]Text, error)
ProcessPagesEnhanced processes pages in parallel with enhancements
type ExtendedCIDFont ¶ added in v1.2.8
ExtendedCIDFont extends CIDFont with additional CJK-specific features
func NewExtendedCIDFont ¶ added in v1.2.8
func NewExtendedCIDFont(v Value) *ExtendedCIDFont
NewExtendedCIDFont creates a new ExtendedCIDFont from a PDF value
func (*ExtendedCIDFont) Descriptor ¶ added in v1.2.8
func (cf *ExtendedCIDFont) Descriptor() *CIDFontDescriptor
Descriptor returns the font descriptor
func (*ExtendedCIDFont) GID ¶ added in v1.2.8
func (cf *ExtendedCIDFont) GID(cid int) uint16
GID returns the GID for the given CID
func (*ExtendedCIDFont) Info ¶ added in v1.2.8
func (cf *ExtendedCIDFont) Info() *CJKFontInfo
Info returns the CJK font info
func (*ExtendedCIDFont) IsVertical ¶ added in v1.2.8
func (cf *ExtendedCIDFont) IsVertical() bool
IsVertical returns true if this font uses vertical writing mode
func (*ExtendedCIDFont) VerticalOrigin ¶ added in v1.2.8
func (cf *ExtendedCIDFont) VerticalOrigin(cid int) (float64, float64)
VerticalOrigin returns the vertical origin of the given CID
func (*ExtendedCIDFont) VerticalWidth ¶ added in v1.2.8
func (cf *ExtendedCIDFont) VerticalWidth(cid int) float64
VerticalWidth returns the vertical width of the given CID
type ExtractMode ¶
type ExtractMode int
ExtractMode specifies the type of extraction to perform
const (
	ModePlain      ExtractMode = iota // Plain text extraction
	ModeStyled                        // Text with style information
	ModeStructured                    // Structured text with classification
)
type ExtractOptions ¶
type ExtractOptions struct {
Workers int // Number of concurrent workers (0 = use NumCPU)
PageRange []int // Specific pages to extract (nil = all pages)
}
ExtractOptions configures text extraction behavior
type ExtractResult ¶
type ExtractResult struct {
Text string // Plain text (for ModePlain)
StyledTexts []Text // Styled texts (for ModeStyled)
ClassifiedBlocks []ClassifiedBlock // Classified blocks (for ModeStructured)
Metadata Metadata // Document metadata
PageCount int // Total number of pages
}
ExtractResult contains the results of text extraction
type Extractor ¶
type Extractor struct {
// contains filtered or unexported fields
}
Extractor provides a builder pattern for configuring and executing extraction
func NewExtractor ¶
NewExtractor creates a new extractor for the given reader
func (*Extractor) Extract ¶
func (e *Extractor) Extract() (*ExtractResult, error)
Extract performs the extraction and returns the result
func (*Extractor) ExtractStructured ¶
func (e *Extractor) ExtractStructured() ([]ClassifiedBlock, error)
ExtractStructured is a convenience method for extracting structured text
func (*Extractor) ExtractStyledTexts ¶
ExtractStyledTexts is a convenience method for extracting styled texts
func (*Extractor) ExtractText ¶
ExtractText is a convenience method for extracting plain text
func (*Extractor) Mode ¶
func (e *Extractor) Mode(mode ExtractMode) *Extractor
Mode sets the extraction mode
func (*Extractor) SmartOrdering ¶
SmartOrdering enables smart text ordering for multi-column layouts
type FastStringBuilder ¶
type FastStringBuilder struct {
// contains filtered or unexported fields
}
FastStringBuilder provides optimized string building with pre-allocation
func GetSizedStringBuilder ¶ added in v1.0.1
func GetSizedStringBuilder(estimatedSize int) *FastStringBuilder
GetSizedStringBuilder retrieves a string builder from the appropriate pool
func NewFastStringBuilder ¶
func NewFastStringBuilder(estimatedSize int) *FastStringBuilder
NewFastStringBuilder creates a builder with estimated capacity
func (*FastStringBuilder) Len ¶
func (b *FastStringBuilder) Len() int
Len returns the current length
func (*FastStringBuilder) String ¶
func (b *FastStringBuilder) String() string
func (*FastStringBuilder) WriteByte ¶
func (b *FastStringBuilder) WriteByte(c byte) error
func (*FastStringBuilder) WriteString ¶
func (b *FastStringBuilder) WriteString(s string)
WriteString appends a string
type Font ¶
type Font struct {
V Value
// contains filtered or unexported fields
}
A Font represents a font in a PDF file. The methods interpret a Font dictionary stored in V.
func (*Font) Encoder ¶
func (f *Font) Encoder() TextEncoding
Encoder returns the encoding between font code point sequences and UTF-8. Pointer receiver is required so the computed encoder is cached on the shared Font instance instead of a copy. The previous value-receiver implementation rebuilt the encoder for every call, causing large allocations to pile up during batch extraction.
func (*Font) ExtendedCIDFont ¶ added in v1.2.8
func (f *Font) ExtendedCIDFont() *ExtendedCIDFont
ExtendedCIDFont returns an ExtendedCIDFont for CID-keyed fonts with enhanced CJK support
type FontCache ¶
type FontCache struct {
// contains filtered or unexported fields
}
FontCache stores parsed fonts to avoid re-parsing across pages
type FontCacheInterface ¶ added in v1.0.1
type FontCacheInterface interface {
Get(key string) (*Font, bool)
Set(key string, font *Font)
Clear()
GetStats() FontCacheStats
}
FontCacheInterface defines the common interface for font caches
type FontCacheStats ¶ added in v1.0.1
type FontCacheStats struct {
Entries int
MaxEntries int
Hits uint64
Misses uint64
HitRate float64
AvgAccesses float64
}
FontCacheStats holds cache statistics, as returned by GetStats
type FontCacheType ¶ added in v1.0.1
type FontCacheType int
FontCacheType specifies which font cache implementation to use
const (
	// FontCacheStandard uses the standard GlobalFontCache (default)
	// - Stable and well-tested
	// - Good performance for most use cases
	// - Simpler implementation
	FontCacheStandard FontCacheType = iota
	// FontCacheOptimized uses the OptimizedFontCache
	// - 10-85x faster than standard (depending on workload)
	// - Lock-free read path with 16 shards
	// - Best for high-concurrency scenarios (>1000 qps)
	// - Recommended for production environments with heavy load
	FontCacheOptimized
)
type FontPool ¶ added in v1.2.3
type FontPool struct {
// contains filtered or unexported fields
}
FontPool manages a pool of font names and provides compact IDs. Thread-safe for concurrent access.
func GetGlobalFontPool ¶ added in v1.2.3
func GetGlobalFontPool() *FontPool
GetGlobalFontPool returns the global font pool instance
func (*FontPool) Clear ¶ added in v1.2.3
func (fp *FontPool) Clear()
Clear removes all fonts from the pool. Should only be called when you're sure no TextOptimized objects reference these IDs.
func (*FontPool) GetFont ¶ added in v1.2.3
GetFont returns the font name for an ID. Returns empty string if ID is invalid. Thread-safe.
type FontPrefetcher ¶ added in v1.0.2
type FontPrefetcher struct {
// contains filtered or unexported fields
}
FontPrefetcher implements an intelligent font prefetch strategy based on access-pattern prediction, preloading fonts that are likely to be needed
func NewFontPrefetcher ¶ added in v1.0.2
func NewFontPrefetcher(cache *OptimizedFontCache) *FontPrefetcher
NewFontPrefetcher creates a new font prefetcher
func (*FontPrefetcher) ClearPatterns ¶ added in v1.0.2
func (fp *FontPrefetcher) ClearPatterns()
ClearPatterns clears access patterns
func (*FontPrefetcher) Close ¶ added in v1.0.2
func (fp *FontPrefetcher) Close()
Close closes the prefetcher
func (*FontPrefetcher) Disable ¶ added in v1.0.2
func (fp *FontPrefetcher) Disable()
Disable disables prefetching
func (*FontPrefetcher) Enable ¶ added in v1.0.2
func (fp *FontPrefetcher) Enable()
Enable enables prefetching
func (*FontPrefetcher) GetStats ¶ added in v1.0.2
func (fp *FontPrefetcher) GetStats() PrefetchStats
GetStats returns prefetch statistics
func (*FontPrefetcher) RecordAccess ¶ added in v1.0.2
func (fp *FontPrefetcher) RecordAccess(fontKey string, relatedKeys []string)
RecordAccess records a font access
type GlobalFontCache ¶ added in v1.0.1
type GlobalFontCache struct {
// contains filtered or unexported fields
}
GlobalFontCache implements an enhanced global font cache with:
- LRU eviction for memory control
- Hit/miss statistics for monitoring
- Content-based hashing for accurate cache keys
Example ¶
// Create a cache with max 100 entries and 1 hour expiration
cache := NewGlobalFontCache(100, 1*time.Hour)
// Store a font
font := &Font{}
cache.Set("MyFont", font)
// Retrieve the font
retrieved, ok := cache.Get("MyFont")
if ok {
	fmt.Println("Font found in cache")
	_ = retrieved
}
// Get statistics
stats := cache.GetStats()
fmt.Printf("Cache entries: %d, Hit rate: %.2f%%\n",
stats.Entries, stats.HitRate*100)
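To make the LRU eviction and hit/miss accounting described for GlobalFontCache concrete, here is a minimal, self-contained sketch. It is not the package's implementation: the `lruCache` type, its string values, and the `HitRate` method are hypothetical names chosen for illustration, and the sketch is not safe for concurrent use.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached item; the key is stored so eviction can delete it from the map.
type entry struct {
	key   string
	value string
}

// lruCache is a hypothetical single-goroutine LRU cache with hit/miss counters.
type lruCache struct {
	maxEntries int
	order      *list.List               // front = most recently used
	items      map[string]*list.Element // key -> element in order
	hits       uint64
	misses     uint64
}

func newLRUCache(maxEntries int) *lruCache {
	return &lruCache{
		maxEntries: maxEntries,
		order:      list.New(),
		items:      make(map[string]*list.Element),
	}
}

// Get returns the cached value and marks it as most recently used.
func (c *lruCache) Get(key string) (string, bool) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		c.hits++
		return el.Value.(*entry).value, true
	}
	c.misses++
	return "", false
}

// Set stores a value and evicts the least recently used entry when over capacity.
func (c *lruCache) Set(key, value string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).value = value
		c.order.MoveToFront(el)
		return
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value})
	if c.order.Len() > c.maxEntries {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// HitRate returns hits / (hits + misses) as a fraction between 0 and 1.
func (c *lruCache) HitRate() float64 {
	total := c.hits + c.misses
	if total == 0 {
		return 0
	}
	return float64(c.hits) / float64(total)
}

func main() {
	c := newLRUCache(2)
	c.Set("F1", "Helvetica")
	c.Set("F2", "Times")
	c.Get("F1")            // hit; F1 becomes the most recently used entry
	c.Set("F3", "Courier") // over capacity, so F2 is evicted
	_, ok := c.Get("F2")   // miss
	fmt.Println(ok, c.HitRate())
}
```

The hit rate is kept as a fraction, which matches how the example above multiplies `stats.HitRate` by 100 before printing.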
func GetGlobalFontCache ¶ added in v1.0.1
func GetGlobalFontCache() *GlobalFontCache
GetGlobalFontCache returns the global font cache instance
Example ¶
// Get the global singleton instance
cache := GetGlobalFontCache()
font := &Font{}
cache.Set("GlobalFont", font)
// The same instance can be accessed from anywhere
sameCacheInstance := GetGlobalFontCache()
retrieved, _ := sameCacheInstance.Get("GlobalFont")
_ = retrieved
func NewGlobalFontCache ¶ added in v1.0.1
func NewGlobalFontCache(maxEntries int, maxAge time.Duration) *GlobalFontCache
NewGlobalFontCache creates a new global font cache
func (*GlobalFontCache) Cleanup ¶ added in v1.0.1
func (gfc *GlobalFontCache) Cleanup() int
Cleanup removes expired entries
func (*GlobalFontCache) Clear ¶ added in v1.0.1
func (gfc *GlobalFontCache) Clear()
Clear removes all fonts from the cache
func (*GlobalFontCache) Get ¶ added in v1.0.1
func (gfc *GlobalFontCache) Get(key string) (*Font, bool)
Get retrieves a font from the cache
func (*GlobalFontCache) GetOrCompute ¶ added in v1.0.1
GetOrCompute retrieves a font from the cache or computes it if not present. This is a convenience function that combines Get and Set.
func (*GlobalFontCache) GetStats ¶ added in v1.0.1
func (gfc *GlobalFontCache) GetStats() FontCacheStats
GetStats returns current cache statistics
func (*GlobalFontCache) Remove ¶ added in v1.0.1
func (gfc *GlobalFontCache) Remove(key string)
Remove removes a font from the cache
func (*GlobalFontCache) Set ¶ added in v1.0.1
func (gfc *GlobalFontCache) Set(key string, font *Font)
Set stores a font in the cache
func (*GlobalFontCache) StartCleanupRoutine ¶ added in v1.0.1
func (gfc *GlobalFontCache) StartCleanupRoutine(interval time.Duration) chan struct{}
StartCleanupRoutine starts a background goroutine to periodically clean up expired entries
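The signature returns a `chan struct{}`; a common convention for such a channel is that closing it stops the goroutine, but the documentation here does not confirm that, so treat it as an assumption. The sketch below shows the ticker-plus-stop-channel pattern with a hypothetical `expiringSet` type that is unrelated to the package's internals.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// expiringSet is a hypothetical store of keys with expiry times.
type expiringSet struct {
	mu      sync.Mutex
	expires map[string]time.Time
}

func newExpiringSet() *expiringSet {
	return &expiringSet{expires: make(map[string]time.Time)}
}

// add registers key so that it expires ttl from now.
func (s *expiringSet) add(key string, ttl time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expires[key] = time.Now().Add(ttl)
}

// size returns the number of keys currently stored.
func (s *expiringSet) size() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.expires)
}

// cleanup removes expired keys and returns how many were removed.
func (s *expiringSet) cleanup() int {
	now := time.Now()
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for k, t := range s.expires {
		if !t.After(now) {
			delete(s.expires, k)
			removed++
		}
	}
	return removed
}

// startCleanup runs cleanup on every tick until the returned channel is closed.
func (s *expiringSet) startCleanup(interval time.Duration) chan struct{} {
	stop := make(chan struct{})
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				s.cleanup()
			case <-stop:
				return
			}
		}
	}()
	return stop
}

func main() {
	s := newExpiringSet()
	s.add("F1", 10*time.Millisecond)
	stop := s.startCleanup(5 * time.Millisecond)
	time.Sleep(50 * time.Millisecond)
	close(stop) // assumed convention: closing the channel ends the goroutine
	fmt.Println("entries left:", s.size())
}
```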
type InplaceStringBuilder ¶ added in v1.0.2
type InplaceStringBuilder struct {
// contains filtered or unexported fields
}
InplaceStringBuilder is an in-place string builder that avoids intermediate allocations
func NewInplaceStringBuilder ¶ added in v1.0.2
func NewInplaceStringBuilder(capacity int) *InplaceStringBuilder
NewInplaceStringBuilder creates a new in-place string builder
func (*InplaceStringBuilder) Append ¶ added in v1.0.2
func (isb *InplaceStringBuilder) Append(s string)
Append appends a string
func (*InplaceStringBuilder) Build ¶ added in v1.0.2
func (isb *InplaceStringBuilder) Build() string
Build builds the final string (single allocation)
func (*InplaceStringBuilder) Len ¶ added in v1.0.2
func (isb *InplaceStringBuilder) Len() int
Len returns the total length
func (*InplaceStringBuilder) Reset ¶ added in v1.0.2
func (isb *InplaceStringBuilder) Reset()
Reset resets the builder
type IntegrityStatus ¶ added in v1.2.2
type IntegrityStatus struct {
// IsValid indicates whether the PDF is valid enough to parse
IsValid bool
// IsTruncated indicates whether the file appears to be truncated
IsTruncated bool
// HasValidHeader indicates whether a valid PDF header was found
HasValidHeader bool
// HasValidEOF indicates whether a valid %%EOF marker was found
HasValidEOF bool
// HasStartxref indicates whether a startxref marker was found
HasStartxref bool
// HasXref indicates whether xref table or stream was found
HasXref bool
// HasTrailer indicates whether trailer dictionary was found
HasTrailer bool
// EstimatedObjects is the estimated number of objects in the file
EstimatedObjects int
// Issues contains descriptions of any problems found
Issues []string
}
IntegrityStatus represents the result of a PDF integrity check
func CheckIntegrity ¶ added in v1.2.2
func CheckIntegrity(f io.ReaderAt, size int64) *IntegrityStatus
CheckIntegrity performs a quick integrity check on a PDF file
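As an illustration of what two of the IntegrityStatus flags refer to, the sketch below checks for the `%PDF-` header at the start of the data and for an `%%EOF` marker near the end. The helpers `quickCheck` and `checkBytes` are hypothetical and far simpler than a real integrity check; the 1024-byte tail window is an assumption chosen for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// quickCheck is a hypothetical helper, not the package's CheckIntegrity.
// It reports whether the data starts with "%PDF-" and whether "%%EOF"
// appears in the last 1024 bytes.
func quickCheck(f io.ReaderAt, size int64) (hasHeader, hasEOF bool) {
	head := make([]byte, 5)
	if n, _ := f.ReadAt(head, 0); n == len(head) {
		hasHeader = bytes.Equal(head, []byte("%PDF-"))
	}

	tailLen := int64(1024)
	if size < tailLen {
		tailLen = size
	}
	tail := make([]byte, tailLen)
	n, _ := f.ReadAt(tail, size-tailLen)
	hasEOF = bytes.Contains(tail[:n], []byte("%%EOF"))
	return hasHeader, hasEOF
}

// checkBytes runs quickCheck on an in-memory document.
func checkBytes(data []byte) (hasHeader, hasEOF bool) {
	return quickCheck(bytes.NewReader(data), int64(len(data)))
}

func main() {
	doc := []byte("%PDF-1.7\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n")
	fmt.Println(checkBytes(doc))

	// A truncated copy keeps its header but loses the end-of-file marker.
	fmt.Println(checkBytes(doc[:20]))
}
```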
type JBIG2Decoder ¶ added in v1.2.8
type JBIG2Decoder struct {
// contains filtered or unexported fields
}
JBIG2Decoder decodes JBIG2-encoded data. JBIG2 is a complex format primarily used for scanned documents. This implementation provides basic support for embedded JBIG2 streams.
func NewJBIG2Decoder ¶ added in v1.2.8
func NewJBIG2Decoder(r io.Reader, params JBIG2Params) *JBIG2Decoder
NewJBIG2Decoder creates a new JBIG2 decoder
type JBIG2Params ¶ added in v1.2.8
type JBIG2Params struct {
Globals []byte // Data from JBIG2Globals stream
}
JBIG2Params contains parameters for JBIG2 decoding
func ParseJBIG2Params ¶ added in v1.2.8
func ParseJBIG2Params(param Value) JBIG2Params
ParseJBIG2Params parses JBIG2 parameters from a Value
type KDNode ¶
type KDNode struct {
// contains filtered or unexported fields
}
KDNode is a KD tree node. Optimized: uses fixed float64 values instead of a slice to avoid allocation.
type KDTree ¶
type KDTree struct {
// contains filtered or unexported fields
}
KDTree is a KD tree spatial index for nearest-neighbor search with O(log n) time complexity
func BuildKDTree ¶
BuildKDTree builds a KD tree from text blocks. Optimized: pre-allocates indices once and uses fixed-size coordinates.
func (*KDTree) RangeSearch ¶
RangeSearch returns all text blocks within the specified radius of the target point. Optimized: uses an object pool for the stack, an inlined distance calculation, and direct coordinates.
func (*KDTree) RangeSearchWithBuffer ¶ added in v1.2.3
func (tree *KDTree) RangeSearchWithBuffer(targetX, targetY, radiusSq float64, buffer []*TextBlock) []*TextBlock
RangeSearchWithBuffer is an optimized version that reuses a provided buffer for results. This eliminates repeated allocations when performing multiple searches. The buffer will be cleared and reused. If buffer is nil, behaves like RangeSearch. Returns the result slice (may be the same as buffer or a new allocation if capacity insufficient).
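The buffer-reuse idiom described here can be sketched without a KD tree at all. In the example below, `rangeSearchInto` is a hypothetical linear scan over a hypothetical `point` type; only the handling of the buffer mirrors the description. The parameter name `radiusSq` suggests the radius is passed squared, which this sketch assumes.

```go
package main

import "fmt"

type point struct{ X, Y float64 }

// rangeSearchInto is a hypothetical linear-scan stand-in for a KD-tree query.
// It appends every point whose squared distance to (tx, ty) is at most radiusSq
// to buffer[:0], so the backing array is reused when its capacity suffices.
func rangeSearchInto(points []*point, tx, ty, radiusSq float64, buffer []*point) []*point {
	out := buffer[:0] // keeps the capacity, drops old contents; a nil buffer also works
	for _, p := range points {
		dx, dy := p.X-tx, p.Y-ty
		if dx*dx+dy*dy <= radiusSq {
			out = append(out, p) // allocates only if the capacity is exceeded
		}
	}
	return out
}

func main() {
	pts := []*point{{0, 0}, {1, 1}, {5, 5}}
	buf := make([]*point, 0, 8)

	// Always use the returned slice: it may or may not share memory with buf.
	buf = rangeSearchInto(pts, 0, 0, 4, buf)
	fmt.Println(len(buf), cap(buf))

	buf = rangeSearchInto(pts, 5, 5, 1, buf)
	fmt.Println(len(buf), cap(buf))
}
```

Because the buffer is overwritten on every call, results from an earlier search must be consumed or copied before the next search runs.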
type LZWPredictor ¶ added in v1.2.8
type LZWPredictor struct {
// contains filtered or unexported fields
}
LZWPredictor implements PNG prediction filters for LZW decoded data
func NewLZWPredictor ¶ added in v1.2.8
func NewLZWPredictor(r io.Reader, params LZWPredictorParams) *LZWPredictor
NewLZWPredictor creates a new LZW predictor filter
type LZWPredictorParams ¶ added in v1.2.8
type LZWPredictorParams struct {
Predictor int // 1=none, 2=TIFF, 10-15=PNG
Colors int // Number of color components (default: 1)
BPC int // Bits per component (default: 8)
Columns int // Pixels per row (default: 1)
}
LZWPredictorParams contains parameters for LZW prediction
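For Predictor values 10-15, each decoded row starts with a PNG filter-type byte followed by the filtered row data. The sketch below reverses filter types 0 (None), 1 (Sub), and 2 (Up) to show how Colors, BPC, and Columns determine the row layout. The function `unfilterPNG` is hypothetical, it skips the Average and Paeth filters for brevity, and it is not the package's LZWPredictor.

```go
package main

import (
	"errors"
	"fmt"
)

// unfilterPNG is a hypothetical helper that reverses PNG row filters
// 0 (None), 1 (Sub) and 2 (Up). Filters 3 (Average) and 4 (Paeth) are
// left out of this sketch.
func unfilterPNG(data []byte, colors, bpc, columns int) ([]byte, error) {
	rowLen := (colors*bpc*columns + 7) / 8 // bytes per row, excluding the filter byte
	bpp := (colors*bpc + 7) / 8            // bytes per complete pixel, at least 1
	if rowLen <= 0 || len(data)%(rowLen+1) != 0 {
		return nil, errors.New("data is not a whole number of rows")
	}

	out := make([]byte, 0, len(data)/(rowLen+1)*rowLen)
	prev := make([]byte, rowLen) // the row above the first row counts as all zeros
	for off := 0; off < len(data); off += rowLen + 1 {
		filter := data[off]
		row := append([]byte(nil), data[off+1:off+1+rowLen]...)
		switch filter {
		case 0: // None: row is stored as-is
		case 1: // Sub: add the byte one pixel to the left
			for i := bpp; i < rowLen; i++ {
				row[i] += row[i-bpp]
			}
		case 2: // Up: add the byte directly above
			for i := range row {
				row[i] += prev[i]
			}
		default:
			return nil, fmt.Errorf("filter type %d is not handled in this sketch", filter)
		}
		out = append(out, row...)
		prev = row
	}
	return out, nil
}

func main() {
	// Two rows of three 8-bit gray pixels: the first uses Sub, the second Up.
	encoded := []byte{
		1, 10, 5, 5,
		2, 1, 1, 1,
	}
	decoded, err := unfilterPNG(encoded, 1, 8, 3)
	fmt.Println(decoded, err)
}
```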
func DefaultLZWPredictorParams ¶ added in v1.2.8
func DefaultLZWPredictorParams() LZWPredictorParams
DefaultLZWPredictorParams returns default predictor parameters
func ParseLZWPredictorParams ¶ added in v1.2.8
func ParseLZWPredictorParams(param Value) LZWPredictorParams
ParseLZWPredictorParams parses predictor parameters from a Value
type LanguageInfo ¶
type LanguageInfo struct {
Language Language
Confidence float64 // Confidence level (0.0 to 1.0)
Characters []rune // Unique characters in the text
WordCount int // Number of words in the text
SentenceCount int // Number of sentences in the text
}
LanguageInfo contains information about a detected language
type LanguageTextExtractor ¶
type LanguageTextExtractor struct {
// contains filtered or unexported fields
}
LanguageTextExtractor extracts text while detecting languages
func NewLanguageTextExtractor ¶
func NewLanguageTextExtractor() *LanguageTextExtractor
NewLanguageTextExtractor creates a new language-aware text extractor
func (*LanguageTextExtractor) ExtractTextByLanguage ¶
func (lte *LanguageTextExtractor) ExtractTextByLanguage(reader *Reader) (map[Language][]Text, error)
ExtractTextByLanguage extracts text grouped by detected language
func (*LanguageTextExtractor) GetLanguageStats ¶
func (lte *LanguageTextExtractor) GetLanguageStats(texts []Text) map[Language]int
GetLanguageStats returns statistics about languages detected in the text
func (*LanguageTextExtractor) GetTextsByLanguage ¶
func (lte *LanguageTextExtractor) GetTextsByLanguage(texts []Text, targetLang Language) []Text
GetTextsByLanguage returns text elements filtered by specific language
type LazyPage ¶
type LazyPage struct {
// contains filtered or unexported fields
}
LazyPage provides lazy loading of page content to reduce memory usage for large PDFs where not all pages need to be processed
func NewLazyPage ¶
NewLazyPage creates a lazy-loading page wrapper
func (*LazyPage) GetContent ¶
GetContent loads and returns the page content (cached after first call)
type LazyPageManager ¶
type LazyPageManager struct {
// contains filtered or unexported fields
}
LazyPageManager manages lazy loading of multiple pages
func NewLazyPageManager ¶
func NewLazyPageManager(r *Reader, maxCached int) *LazyPageManager
NewLazyPageManager creates a manager with LRU cache
func (*LazyPageManager) GetPage ¶
func (m *LazyPageManager) GetPage(pageNum int) *LazyPage
GetPage returns a lazy page, loading it if necessary
func (*LazyPageManager) GetStats ¶
func (m *LazyPageManager) GetStats() (totalPages, loadedPages int)
GetStats returns cache statistics
type LockFreeRingBuffer ¶
type LockFreeRingBuffer struct {
// contains filtered or unexported fields
}
LockFreeRingBuffer is a lock-free ring buffer for producer-consumer use
func NewLockFreeRingBuffer ¶
func NewLockFreeRingBuffer(size int) *LockFreeRingBuffer
func (*LockFreeRingBuffer) Pop ¶
func (rb *LockFreeRingBuffer) Pop() (interface{}, bool)
func (*LockFreeRingBuffer) Push ¶
func (rb *LockFreeRingBuffer) Push(item interface{}) bool
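The documentation does not say how many producers and consumers LockFreeRingBuffer supports. The sketch below therefore shows only the simplest lock-free variant, a bounded queue for exactly one producer and one consumer, using `sync/atomic` counters (Go 1.19 or later). The `spscRing` type is hypothetical and should not be read as the package's algorithm.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// spscRing is a hypothetical bounded queue that is safe for exactly one
// producer goroutine and one consumer goroutine.
type spscRing struct {
	buf  []interface{}
	mask uint64
	head atomic.Uint64 // next slot to read; written only by the consumer
	tail atomic.Uint64 // next slot to write; written only by the producer
}

// newSPSCRing rounds size up to a power of two so wrapping is a bit mask.
func newSPSCRing(size int) *spscRing {
	n := 1
	for n < size {
		n <<= 1
	}
	return &spscRing{buf: make([]interface{}, n), mask: uint64(n - 1)}
}

// Push stores item and reports false when the buffer is full.
func (r *spscRing) Push(item interface{}) bool {
	t := r.tail.Load()
	if t-r.head.Load() == uint64(len(r.buf)) {
		return false
	}
	r.buf[t&r.mask] = item
	r.tail.Store(t + 1) // publish only after the slot has been written
	return true
}

// Pop removes the oldest item and reports false when the buffer is empty.
func (r *spscRing) Pop() (interface{}, bool) {
	h := r.head.Load()
	if h == r.tail.Load() {
		return nil, false
	}
	item := r.buf[h&r.mask]
	r.buf[h&r.mask] = nil // drop the reference so it can be collected
	r.head.Store(h + 1)
	return item, true
}

func main() {
	r := newSPSCRing(4)
	done := make(chan int)

	go func() { // consumer
		sum := 0
		for received := 0; received < 100; {
			if v, ok := r.Pop(); ok {
				sum += v.(int)
				received++
			} else {
				runtime.Gosched()
			}
		}
		done <- sum
	}()

	for i := 0; i < 100; i++ { // producer
		for !r.Push(i) {
			runtime.Gosched() // buffer full; let the consumer run
		}
	}
	fmt.Println("sum:", <-done)
}
```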
type MemoryArena ¶
type MemoryArena struct {
// contains filtered or unexported fields
}
MemoryArena is a memory pool manager intended to reduce GC pressure
func NewMemoryArena ¶
func NewMemoryArena(chunkSize int) *MemoryArena
func (*MemoryArena) Alloc ¶
func (a *MemoryArena) Alloc(size int) []byte
func (*MemoryArena) Reset ¶
func (a *MemoryArena) Reset()
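Alloc and Reset are undocumented here, so the following is only a sketch of how a chunked bump allocator with that shape usually works. The `arena` type and its policies (oversized requests bypass the arena, chunks are kept across Reset, reused memory is not zeroed) are assumptions made for this example.

```go
package main

import "fmt"

// arena is a hypothetical bump allocator that hands out slices of large chunks.
type arena struct {
	chunkSize int
	chunks    [][]byte
	cur       int // index of the chunk currently being filled
	off       int // bytes already handed out from chunks[cur]
}

func newArena(chunkSize int) *arena { return &arena{chunkSize: chunkSize} }

// Alloc returns a slice of exactly size bytes. After a Reset the bytes may
// still hold old data, because this sketch does not zero reused chunks.
func (a *arena) Alloc(size int) []byte {
	if size > a.chunkSize {
		return make([]byte, size) // oversized requests bypass the arena
	}
	if len(a.chunks) == 0 || a.off+size > a.chunkSize {
		if len(a.chunks) > 0 {
			a.cur++ // current chunk is exhausted; move to the next one
		}
		if a.cur == len(a.chunks) {
			a.chunks = append(a.chunks, make([]byte, a.chunkSize))
		}
		a.off = 0
	}
	// The three-index slice caps capacity so append cannot overwrite neighbors.
	b := a.chunks[a.cur][a.off : a.off+size : a.off+size]
	a.off += size
	return b
}

// Reset makes all chunks available again without freeing them.
func (a *arena) Reset() {
	a.cur = 0
	a.off = 0
}

func main() {
	a := newArena(64)
	x := a.Alloc(16)
	y := a.Alloc(16)
	copy(x, "first")
	copy(y, "second")
	fmt.Println(string(x[:5]), string(y[:6]), len(a.chunks))

	a.Reset() // x and y must not be used after this point
	_ = a.Alloc(64)
	fmt.Println(len(a.chunks))
}
```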
type MemoryEfficientExtractor ¶
type MemoryEfficientExtractor struct {
// contains filtered or unexported fields
}
MemoryEfficientExtractor provides memory-efficient extraction using streaming
func NewMemoryEfficientExtractor ¶
func NewMemoryEfficientExtractor(chunkSize, bufferSize int, maxMemory int64) *MemoryEfficientExtractor
NewMemoryEfficientExtractor creates a new memory-efficient extractor
func (*MemoryEfficientExtractor) ExtractTextStream ¶
func (mee *MemoryEfficientExtractor) ExtractTextStream(reader *Reader) (<-chan TextStream, <-chan error)
ExtractTextStream extracts text in a memory-efficient streaming way
func (*MemoryEfficientExtractor) ExtractTextToWriter ¶
func (mee *MemoryEfficientExtractor) ExtractTextToWriter(reader *Reader, writer io.Writer) (err error)
ExtractTextToWriter extracts text directly to an io.Writer to minimize memory usage
type Metadata ¶
type Metadata struct {
Title string // Document title
Author string // Author name
Subject string // Document subject
Keywords []string // Keywords
Creator string // Application that created the document
Producer string // PDF producer (converter)
CreationDate time.Time // Creation date
ModDate time.Time // Last modification date
Trapped string // Trapping information (True/False/Unknown)
Custom map[string]string // Custom metadata fields
}
Metadata represents PDF document metadata
type MultiLangProcessor ¶
type MultiLangProcessor struct {
// contains filtered or unexported fields
}
MultiLangProcessor provides multi-language text processing
func NewMultiLangProcessor ¶
func NewMultiLangProcessor() *MultiLangProcessor
NewMultiLangProcessor creates a new multi-language processor
func (*MultiLangProcessor) DetectLanguage ¶
func (mlp *MultiLangProcessor) DetectLanguage(text string) LanguageInfo
DetectLanguage detects the language of a given text
func (*MultiLangProcessor) GetLanguageConfidenceThreshold ¶
func (mlp *MultiLangProcessor) GetLanguageConfidenceThreshold() float64
GetLanguageConfidenceThreshold returns a confidence threshold for reliable detection
func (*MultiLangProcessor) GetLanguageName ¶
func (mlp *MultiLangProcessor) GetLanguageName(lang Language) string
GetLanguageName returns the full name of a language
func (*MultiLangProcessor) GetSupportedLanguages ¶
func (mlp *MultiLangProcessor) GetSupportedLanguages() []Language
GetSupportedLanguages returns the list of supported languages
func (*MultiLangProcessor) IsEnglish ¶
func (mlp *MultiLangProcessor) IsEnglish(text string) bool
IsEnglish checks if text is likely English
func (*MultiLangProcessor) IsFrench ¶
func (mlp *MultiLangProcessor) IsFrench(text string) bool
IsFrench checks if text is likely French
func (*MultiLangProcessor) IsGerman ¶
func (mlp *MultiLangProcessor) IsGerman(text string) bool
IsGerman checks if text is likely German
func (*MultiLangProcessor) IsSpanish ¶
func (mlp *MultiLangProcessor) IsSpanish(text string) bool
IsSpanish checks if text is likely Spanish
func (*MultiLangProcessor) ProcessTextWithLanguageDetection ¶
func (mlp *MultiLangProcessor) ProcessTextWithLanguageDetection(texts []Text) []TextWithLanguage
ProcessTextWithLanguageDetection processes text with language detection
type MultiLanguageTextClassifier ¶
type MultiLanguageTextClassifier struct {
*TextClassifier
// contains filtered or unexported fields
}
MultiLanguageTextClassifier extends the text classifier with language awareness
func NewMultiLanguageTextClassifier ¶
func NewMultiLanguageTextClassifier(texts []Text, pageWidth, pageHeight float64) *MultiLanguageTextClassifier
NewMultiLanguageTextClassifier creates a new multi-language text classifier
func (*MultiLanguageTextClassifier) ClassifyBlocksWithLanguage ¶
func (mltc *MultiLanguageTextClassifier) ClassifyBlocksWithLanguage() []ClassifiedBlockWithLanguage
ClassifyBlocksWithLanguage extends the classification with language information
type MultiLevelCache ¶
type MultiLevelCache struct {
// contains filtered or unexported fields
}
MultiLevelCache is a multi-level cache manager
func NewMultiLevelCache ¶
func NewMultiLevelCache() *MultiLevelCache
NewMultiLevelCache creates a multi-level cache
func (*MultiLevelCache) Get ¶
func (mlc *MultiLevelCache) Get(key string) (interface{}, bool)
Get retrieves data from the cache
func (*MultiLevelCache) Prefetch ¶
func (mlc *MultiLevelCache) Prefetch(keys []string)
Prefetch prefetches page data
func (*MultiLevelCache) Put ¶
func (mlc *MultiLevelCache) Put(key string, value interface{})
Put stores a value in the cache
func (*MultiLevelCache) Stats ¶
func (mlc *MultiLevelCache) Stats() map[string]uint64
Stats returns cache statistics
type OptimizedCMapCache ¶ added in v1.2.8
type OptimizedCMapCache struct {
// contains filtered or unexported fields
}
OptimizedCMapCache provides high-performance CMap caching with:
- Lock-free read path using atomic operations
- Sharded design to reduce lock contention (8 shards)
- Zero-allocation fast path for cache hits
- LRU eviction with atomic operations
func GetGlobalCMapCache ¶ added in v1.2.8
func GetGlobalCMapCache() *OptimizedCMapCache
GetGlobalCMapCache returns the global CMap cache
func NewOptimizedCMapCache ¶ added in v1.2.8
func NewOptimizedCMapCache(maxEntries int) *OptimizedCMapCache
NewOptimizedCMapCache creates a new optimized CMap cache
func (*OptimizedCMapCache) Get ¶ added in v1.2.8
func (c *OptimizedCMapCache) Get(key string) (*CMap, bool)
Get retrieves a CMap from cache with lock-free fast path
func (*OptimizedCMapCache) GetStats ¶ added in v1.2.8
func (c *OptimizedCMapCache) GetStats() (hits, misses uint64)
GetStats returns cache statistics
func (*OptimizedCMapCache) Put ¶ added in v1.2.8
func (c *OptimizedCMapCache) Put(key string, cmap *CMap)
Put adds a CMap to the cache
func (*OptimizedCMapCache) Release ¶ added in v1.2.8
func (c *OptimizedCMapCache) Release(key string)
Release decrements reference count
type OptimizedFontCache ¶ added in v1.0.1
type OptimizedFontCache struct {
// contains filtered or unexported fields
}
OptimizedFontCache implements an ultra-high-performance font cache with:
- Lock-free read path using atomic operations
- Sharded design to reduce lock contention (16 shards)
- Zero-allocation fast path for cache hits
- Inline LRU using lock-free linked list approximation
- Pre-allocated pools for metadata structs
- SIMD-friendly memory layout
func NewOptimizedFontCache ¶ added in v1.0.1
func NewOptimizedFontCache(totalCapacity int) *OptimizedFontCache
NewOptimizedFontCache creates a new optimized font cache
func (*OptimizedFontCache) Clear ¶ added in v1.0.1
func (ofc *OptimizedFontCache) Clear()
Clear removes all entries from all shards
func (*OptimizedFontCache) Get ¶ added in v1.0.1
func (ofc *OptimizedFontCache) Get(key string) (*Font, bool)
Get retrieves a font from the cache (lock-free fast path)
func (*OptimizedFontCache) GetOrCompute ¶ added in v1.0.1
func (ofc *OptimizedFontCache) GetOrCompute(key string, compute func() (*Font, error)) (*Font, error)
GetOrCompute retrieves a font from cache or computes it if not present
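A sharded get-or-compute can be sketched as follows. This is a deliberately simplified stand-in: the hypothetical `shardedCache` stores strings, picks a shard with an FNV-1a hash, and guards each shard with a `sync.RWMutex` instead of the lock-free read path that OptimizedFontCache advertises.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 16

// shard is one independently locked slice of the key space.
type shard struct {
	mu    sync.RWMutex
	items map[string]string
}

// shardedCache is a hypothetical cache used only for this illustration.
type shardedCache struct {
	shards [shardCount]shard
}

func newShardedCache() *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i].items = make(map[string]string)
	}
	return c
}

// shardFor hashes the key so that unrelated keys rarely contend on one lock.
func (c *shardedCache) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return &c.shards[h.Sum32()%shardCount]
}

// GetOrCompute returns the cached value or calls compute once to fill it.
func (c *shardedCache) GetOrCompute(key string, compute func() (string, error)) (string, error) {
	s := c.shardFor(key)

	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	if ok {
		return v, nil
	}

	s.mu.Lock()
	defer s.mu.Unlock()
	if v, ok := s.items[key]; ok { // another goroutine may have filled it meanwhile
		return v, nil
	}
	v, err := compute()
	if err != nil {
		return "", err // failures are not cached
	}
	s.items[key] = v
	return v, nil
}

func main() {
	c := newShardedCache()
	calls := 0
	load := func() (string, error) {
		calls++
		return "parsed font", nil
	}
	c.GetOrCompute("F1", load)
	v, err := c.GetOrCompute("F1", load)
	fmt.Println(v, err, "compute calls:", calls)
}
```

Holding the shard lock while `compute` runs keeps the sketch short but blocks other keys in the same shard; a production cache would typically avoid that.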
func (*OptimizedFontCache) GetStats ¶ added in v1.0.1
func (ofc *OptimizedFontCache) GetStats() FontCacheStats
GetStats returns aggregated statistics across all shards
func (*OptimizedFontCache) Prefetch ¶ added in v1.0.1
func (ofc *OptimizedFontCache) Prefetch(keys []string, compute func(key string) (*Font, error))
Prefetch warms up the cache with multiple keys concurrently
func (*OptimizedFontCache) Remove ¶ added in v1.0.1
func (ofc *OptimizedFontCache) Remove(key string)
Remove removes a specific key from the cache
func (*OptimizedFontCache) Set ¶ added in v1.0.1
func (ofc *OptimizedFontCache) Set(key string, font *Font)
Set stores a font in the cache
type OptimizedMemoryPool ¶
type OptimizedMemoryPool struct {
// contains filtered or unexported fields
}
OptimizedMemoryPool provides better memory pool management
func NewOptimizedMemoryPool ¶
func NewOptimizedMemoryPool(size int) *OptimizedMemoryPool
NewOptimizedMemoryPool creates a pool with size tracking
func (*OptimizedMemoryPool) Get ¶
func (omp *OptimizedMemoryPool) Get() []byte
Get retrieves a buffer from the pool
func (*OptimizedMemoryPool) Put ¶
func (omp *OptimizedMemoryPool) Put(bufPtr *[]byte)
Put returns a buffer to the pool, resetting it
type OptimizedSorter ¶
type OptimizedSorter struct {
// contains filtered or unexported fields
}
OptimizedSorter provides optimized sorting algorithms for large text collections
func NewOptimizedSorter ¶
func NewOptimizedSorter() *OptimizedSorter
NewOptimizedSorter creates a new optimized sorter
func (*OptimizedSorter) QuickSortTexts ¶
func (os *OptimizedSorter) QuickSortTexts(texts []Text, less func(i, j int) bool)
QuickSortTexts implements quicksort for text collections
func (*OptimizedSorter) SortTextHorizontalByOptimized ¶
func (os *OptimizedSorter) SortTextHorizontalByOptimized(th TextHorizontal)
SortTextHorizontalByOptimized sorts TextHorizontal using optimized algorithm
func (*OptimizedSorter) SortTextVerticalByOptimized ¶
func (os *OptimizedSorter) SortTextVerticalByOptimized(tv TextVertical)
SortTextVerticalByOptimized sorts TextVertical using optimized algorithm
func (*OptimizedSorter) SortTexts ¶
func (os *OptimizedSorter) SortTexts(texts []Text, less func(i, j int) bool)
SortTexts sorts a collection of texts using the most appropriate algorithm
func (*OptimizedSorter) SortTextsWithAlgorithm ¶
func (os *OptimizedSorter) SortTextsWithAlgorithm(texts []Text, less func(i, j int) bool, algorithm string)
SortTextsWithAlgorithm allows choosing a specific sorting algorithm
type OptimizedTextClusterSorter ¶
type OptimizedTextClusterSorter struct {
// contains filtered or unexported fields
}
OptimizedTextClusterSorter provides optimized sorting for text clusters
func NewOptimizedTextClusterSorter ¶
func NewOptimizedTextClusterSorter() *OptimizedTextClusterSorter
NewOptimizedTextClusterSorter creates a new optimized cluster sorter
func (*OptimizedTextClusterSorter) SortTextBlocks ¶
func (otcs *OptimizedTextClusterSorter) SortTextBlocks(blocks []*TextBlock, sortBy string)
SortTextBlocks sorts text blocks by various criteria
type Outline ¶
An Outline is a tree describing the outline (also known as the table of contents) of a document.
type PDFCompatibilityInfo ¶ added in v1.2.0
type PDFCompatibilityInfo struct {
Version PDFVersion
IsLinearized bool
LinearizationParams map[string]interface{}
SubFormat string // "PDF/A", "PDF/X", or ""
Encryption string
HasTransparency bool
HasLayers bool
HasForms bool
HasJavaScript bool
Warnings []string
Errors []string
}
PDFCompatibilityInfo holds compatibility information
func CheckPDFCompatibility ¶ added in v1.2.0
func CheckPDFCompatibility(data []byte) (*PDFCompatibilityInfo, error)
CheckPDFCompatibility analyzes a PDF file for compatibility
type PDFEncryptionInfo ¶ added in v1.2.8
type PDFEncryptionInfo struct {
Version EncryptionVersion
Revision EncryptionRevision
Method EncryptionMethod
KeyLength int // in bits
O []byte // Owner password hash
U []byte // User password hash
P uint32 // Permissions
ID []byte // Document ID
OE []byte // Owner encryption key (V5)
UE []byte // User encryption key (V5)
Perms []byte // Encrypted permissions (V5)
}
PDFEncryptionInfo contains encryption parameters
type PDFError ¶
type PDFError struct {
Op string // Operation that failed (e.g., "extract text", "parse font")
Page int // Page number where error occurred (0 if not page-specific)
Path string // File path if applicable
Err error // Underlying error
}
PDFError represents an error that occurred during PDF processing. It includes contextual information about where the error occurred.
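An error type with an underlying `Err` field is normally inspected with `errors.As` and `errors.Is`. The sketch below uses a local `pdfError` type that mirrors the documented fields; its `Error` and `Unwrap` methods and the message format are assumptions, since the methods of PDFError are not listed in this section.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// pdfError mirrors the documented fields of PDFError. The Error and Unwrap
// methods below are assumptions made for this sketch.
type pdfError struct {
	Op   string
	Page int
	Path string
	Err  error
}

func (e *pdfError) Error() string {
	msg := e.Op
	if e.Page > 0 {
		msg += fmt.Sprintf(" (page %d)", e.Page)
	}
	if e.Path != "" {
		msg += " " + e.Path
	}
	if e.Err != nil {
		msg += ": " + e.Err.Error()
	}
	return msg
}

// Unwrap exposes the underlying error to errors.Is and errors.As.
func (e *pdfError) Unwrap() error { return e.Err }

// extractPage is a hypothetical operation that always fails.
func extractPage(page int) error {
	return &pdfError{Op: "extract text", Page: page, Err: io.ErrUnexpectedEOF}
}

func main() {
	err := fmt.Errorf("batch failed: %w", extractPage(3))

	var pe *pdfError
	if errors.As(err, &pe) {
		fmt.Println("operation:", pe.Op, "page:", pe.Page)
	}
	fmt.Println("truncated input:", errors.Is(err, io.ErrUnexpectedEOF))
}
```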
type PDFVersion ¶ added in v1.2.0
PDFVersion represents a PDF version
func (PDFVersion) IsSupported ¶ added in v1.2.0
func (v PDFVersion) IsSupported() bool
IsSupported checks if a version is supported
func (PDFVersion) String ¶ added in v1.2.0
func (v PDFVersion) String() string
String returns the version string
type Page ¶
type Page struct {
V Value
// contains filtered or unexported fields
}
A Page represents a single page in a PDF file. The methods interpret a Page dictionary stored in V.
func (Page) ClassifyTextBlocks ¶
func (p Page) ClassifyTextBlocks() ([]ClassifiedBlock, error)
ClassifyTextBlocks is a convenience function that creates a classifier and runs classification
func (*Page) Cleanup ¶ added in v1.0.7
func (p *Page) Cleanup()
Cleanup releases resources held by the Page, specifically the fontCache reference. Call this after processing a page to prevent memory leaks in batch operations. This method is safe to call multiple times.
func (*Page) GetPlainText ¶
GetPlainText returns all of the page's text without formatting. fonts can be passed in (to improve parsing performance) or left nil. ctx can be used to cancel the extraction operation (pass context.Background() if not needed).
func (*Page) GetPlainTextWithSmartOrdering ¶
func (p *Page) GetPlainTextWithSmartOrdering(ctx context.Context, fonts map[string]*Font) (string, error)
GetPlainTextWithSmartOrdering extracts plain text using an improved text ordering algorithm that handles multi-column layouts and complex reading orders. ctx can be used to cancel the extraction operation (pass context.Background() if not needed)
func (Page) GetTextByColumn ¶
GetTextByColumn returns all of the page's text grouped by column
func (Page) GetTextByRow ¶
GetTextByRow returns all of the page's text grouped by row
func (Page) OptimizedGetPlainText ¶
OptimizedGetPlainText returns all of the page's text using optimized string building. This version uses object pools and pre-allocation to reduce memory allocations. ctx can be used to cancel the extraction operation (pass context.Background() if not needed).
func (Page) OptimizedGetTextByColumn ¶
OptimizedGetTextByColumn returns all of the page's text grouped by column using optimized allocation
func (Page) OptimizedGetTextByRow ¶
OptimizedGetTextByRow returns all of the page's text grouped by row using optimized allocation
func (*Page) SetFontCache ¶ added in v1.0.1
func (p *Page) SetFontCache(cache *GlobalFontCache)
SetFontCache sets a font cache for this page to improve performance during text extraction by reusing parsed fonts. Deprecated: Use SetFontCacheInterface for better flexibility.
func (*Page) SetFontCacheInterface ¶ added in v1.0.1
func (p *Page) SetFontCacheInterface(cache FontCacheInterface)
SetFontCacheInterface sets a font cache using the interface. This supports both GlobalFontCache and OptimizedFontCache.
type PageStream ¶
PageStream represents a stream of pages
type ParallelExtractor ¶ added in v1.0.2
type ParallelExtractor struct {
// contains filtered or unexported fields
}
ParallelExtractor is an advanced extraction interface that combines all optimizations
Example (Basic) ¶
ExampleParallelExtractor_basic demonstrates basic usage
// Create parallel extractor
extractor := NewParallelExtractor(4) // use 4 worker goroutines
defer extractor.Close()
// Note: actual usage requires creating Page objects
// pages := []Page{...}
ctx := context.Background()
// Simulate empty page list
var pages []Page
// Extract all pages
results, err := extractor.ExtractAllPages(ctx, pages)
if err != nil {
	fmt.Printf("Error: %v\n", err)
	return
}
fmt.Printf("Extracted %d pages\n", len(results))
Output: Extracted 0 pages
func NewParallelExtractor ¶ added in v1.0.2
func NewParallelExtractor(workers int) *ParallelExtractor
NewParallelExtractor creates a parallel extractor
func (*ParallelExtractor) Close ¶ added in v1.0.2
func (pe *ParallelExtractor) Close()
Close closes and cleans up resources
func (*ParallelExtractor) ExtractAllPages ¶ added in v1.0.2
func (pe *ParallelExtractor) ExtractAllPages( ctx context.Context, pages []Page, ) ([][]Text, error)
ExtractAllPages extracts all pages (using all optimizations)
func (*ParallelExtractor) GetCacheStats ¶ added in v1.0.2
func (pe *ParallelExtractor) GetCacheStats() ShardedCacheStats
GetCacheStats returns cache statistics
func (*ParallelExtractor) GetPrefetchStats ¶ added in v1.0.2
func (pe *ParallelExtractor) GetPrefetchStats() PrefetchStats
GetPrefetchStats returns prefetch statistics
type ParallelProcessor ¶
type ParallelProcessor struct {
// contains filtered or unexported fields
}
ParallelProcessor handles multi-level parallel processing for PDF text extraction
func NewParallelProcessor ¶
func NewParallelProcessor(workers int) *ParallelProcessor
NewParallelProcessor creates a new parallel processor with the specified number of workers
func (*ParallelProcessor) ProcessPages ¶
func (pp *ParallelProcessor) ProcessPages(ctx context.Context, pages []Page, processorFunc func(Page) ([]Text, error)) ([][]Text, error)
ProcessPages processes multiple pages in parallel
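The shape of ProcessPages (a slice of inputs, a per-item function, and a result slice in input order) maps onto a standard worker pool. The sketch below is a generic version over strings and ints; `processAll`, `wordLen`, `lengthsOf`, and `errEmpty` are hypothetical names, and the choice to stop at the first error is an assumption, not documented behavior of the package.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var errEmpty = errors.New("empty input")

// wordLen stands in for per-page work; it fails on empty input.
func wordLen(s string) (int, error) {
	if s == "" {
		return 0, errEmpty
	}
	return len(s), nil
}

// processAll is a hypothetical worker pool that keeps results in input order
// and stops handing out work after the first error or cancellation.
func processAll(ctx context.Context, inputs []string, workers int, fn func(string) (int, error)) ([]int, error) {
	if workers < 1 {
		workers = 1
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	results := make([]int, len(inputs))
	jobs := make(chan int)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				r, err := fn(inputs[i])
				if err != nil {
					once.Do(func() {
						firstErr = err
						cancel()
					})
					continue
				}
				results[i] = r // each index is written by exactly one worker
			}
		}()
	}

feed:
	for i := range inputs {
		select {
		case jobs <- i:
		case <-ctx.Done():
			break feed
		}
	}
	close(jobs)
	wg.Wait()

	if firstErr != nil {
		return nil, firstErr
	}
	if err := ctx.Err(); err != nil {
		return nil, err // the caller's context was cancelled
	}
	return results, nil
}

// lengthsOf runs wordLen over inputs with a background context.
func lengthsOf(inputs []string, workers int) ([]int, error) {
	return processAll(context.Background(), inputs, workers, wordLen)
}

func main() {
	fmt.Println(lengthsOf([]string{"page", "one", "three"}, 2))
	fmt.Println(lengthsOf([]string{"page", "", "three"}, 2))
}
```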
type ParallelTextExtractor ¶
type ParallelTextExtractor struct {
// contains filtered or unexported fields
}
ParallelTextExtractor provides multi-level parallel extraction
func NewParallelTextExtractor ¶
func NewParallelTextExtractor(workers int) *ParallelTextExtractor
NewParallelTextExtractor creates a new parallel text extractor
func (*ParallelTextExtractor) ExtractWithParallelProcessing ¶
func (pte *ParallelTextExtractor) ExtractWithParallelProcessing(ctx context.Context, reader *Reader) ([]Text, error)
ExtractWithParallelProcessing extracts text using multi-level parallel processing
func (*ParallelTextExtractor) ParallelSort ¶
func (pte *ParallelTextExtractor) ParallelSort(ctx context.Context, texts []Text, less func(i, j int) bool) error
ParallelSort provides parallel sorting for large text collections
type ParseLimits ¶ added in v1.1.5
type ParseLimits struct {
// MaxParseTime is the maximum time allowed for parsing a single page (0 = no limit)
MaxParseTime time.Duration
// MaxHexStringBytes is the maximum size for a single hex string (0 = no limit, default 10MB)
MaxHexStringBytes int
// MaxStreamBytes is the maximum size for a single stream (0 = no limit)
MaxStreamBytes int64
// CheckInterval specifies how often to check for cancellation during intensive loops
// Higher values improve performance but reduce responsiveness to cancellation
// Default: 1000 iterations
CheckInterval int
}
ParseLimits defines resource limits for PDF parsing operations
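The CheckInterval trade-off can be illustrated with a loop that polls a context only every N iterations. This is a sketch under assumptions: sumWithChecks and demo are hypothetical names, and the package's actual parsing loops may check for cancellation differently.

```go
package main

import (
	"context"
	"fmt"
)

// sumWithChecks is a hypothetical helper, not part of this package. It polls
// ctx only once every checkInterval iterations to keep the hot loop cheap.
func sumWithChecks(ctx context.Context, values []int, checkInterval int) (int, error) {
	if checkInterval <= 0 {
		checkInterval = 1000 // mirrors the documented default
	}
	total := 0
	for i, v := range values {
		if i%checkInterval == 0 {
			if err := ctx.Err(); err != nil {
				return total, err
			}
		}
		total += v
	}
	return total, nil
}

// demo sums 5000 ones, optionally cancelling the context before starting.
func demo(cancelFirst bool) (int, error) {
	values := make([]int, 5000)
	for i := range values {
		values[i] = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	return sumWithChecks(ctx, values, 1000)
}

func main() {
	fmt.Println(demo(false)) // 5000 <nil>
	fmt.Println(demo(true))  // 0 context canceled
}
```

A larger interval means fewer checks per iteration, but up to that many iterations may still run after cancellation.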
func DefaultParseLimits ¶ added in v1.1.5
func DefaultParseLimits() ParseLimits
DefaultParseLimits returns sensible default limits
type PasswordAuth ¶ added in v1.2.8
type PasswordAuth struct {
// contains filtered or unexported fields
}
PasswordAuth authenticates a password using the appropriate algorithm
func NewPasswordAuth ¶ added in v1.2.8
func NewPasswordAuth(info *PDFEncryptionInfo) *PasswordAuth
NewPasswordAuth creates a new password authenticator
func (*PasswordAuth) Authenticate ¶ added in v1.2.8
func (pa *PasswordAuth) Authenticate(password string) ([]byte, error)
Authenticate tries to authenticate with the given password as either user or owner
func (*PasswordAuth) AuthenticateOwner ¶ added in v1.2.8
func (pa *PasswordAuth) AuthenticateOwner(password string) ([]byte, error)
AuthenticateOwner authenticates an owner password
func (*PasswordAuth) AuthenticateUser ¶ added in v1.2.8
func (pa *PasswordAuth) AuthenticateUser(password string) ([]byte, error)
AuthenticateUser authenticates a user password
func (*PasswordAuth) ValidatePermissions ¶ added in v1.2.8
func (pa *PasswordAuth) ValidatePermissions(key []byte) error
ValidatePermissions validates the permissions field for V5 encryption
type PerformanceMetrics ¶
type PerformanceMetrics struct {
ExtractDuration atomic.Int64 // nanoseconds
ParseDuration atomic.Int64
SortDuration atomic.Int64
TotalAllocs atomic.Uint64
BytesAllocated atomic.Uint64
GoroutineCount atomic.Int32
CacheHitRate atomic.Uint64 // percentage * 100
}
PerformanceMetrics is a performance metrics collector
func (*PerformanceMetrics) GetMetrics ¶
func (pm *PerformanceMetrics) GetMetrics() map[string]interface{}
GetMetrics returns a snapshot of the current metrics
func (*PerformanceMetrics) RecordAllocation ¶
func (pm *PerformanceMetrics) RecordAllocation(bytes uint64)
RecordAllocation records a memory allocation
func (*PerformanceMetrics) RecordExtractDuration ¶
func (pm *PerformanceMetrics) RecordExtractDuration(d time.Duration)
RecordExtractDuration records an extraction duration
type PoolStats ¶ added in v1.0.1
PoolStats holds statistics about pool usage (for debugging/monitoring)
type PoolWarmer ¶ added in v1.0.2
type PoolWarmer struct {
// contains filtered or unexported fields
}
PoolWarmer is a memory pool warmer. It pre-allocates and fills memory pools at application startup to reduce runtime allocation overhead.
func (*PoolWarmer) GetWarmupStats ¶ added in v1.0.2
func (pw *PoolWarmer) GetWarmupStats() WarmupStats
GetWarmupStats gets warmup statistics
func (*PoolWarmer) IsWarmed ¶ added in v1.0.2
func (pw *PoolWarmer) IsWarmed() bool
IsWarmed reports whether the pools have been warmed up
func (*PoolWarmer) Warmup ¶ added in v1.0.2
func (pw *PoolWarmer) Warmup(config *WarmupConfig) error
Warmup performs memory pool warmup
type PredefinedCMap ¶ added in v1.2.8
type PredefinedCMap struct {
*CMap
}
PredefinedCMap represents a predefined Adobe CMap
func GetPredefinedCMap ¶ added in v1.2.8
func GetPredefinedCMap(name string) *PredefinedCMap
GetPredefinedCMap retrieves a predefined CMap by name
type PrefetchItem ¶ added in v1.0.2
type PrefetchItem struct {
// contains filtered or unexported fields
}
PrefetchItem is a single entry in the prefetch queue
type PrefetchQueue ¶ added in v1.0.2
type PrefetchQueue struct {
// contains filtered or unexported fields
}
PrefetchQueue is a priority queue of prefetch items
func (*PrefetchQueue) Len ¶ added in v1.0.2
func (pq *PrefetchQueue) Len() int
func (*PrefetchQueue) Less ¶ added in v1.0.2
func (pq *PrefetchQueue) Less(i, j int) bool
func (*PrefetchQueue) Pop ¶ added in v1.0.2
func (pq *PrefetchQueue) Pop() interface{}
func (*PrefetchQueue) Push ¶ added in v1.0.2
func (pq *PrefetchQueue) Push(x interface{})
func (*PrefetchQueue) Swap ¶ added in v1.0.2
func (pq *PrefetchQueue) Swap(i, j int)
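The five methods above match the standard container/heap interface, so a queue of this shape is normally driven through heap.Push and heap.Pop instead of calling Push and Pop directly. The sketch below uses hypothetical item and queue types, because the real PrefetchItem fields are unexported; the "higher priority first" ordering is also an assumption.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is a hypothetical queue entry; it is not the package's PrefetchItem.
type item struct {
	page     int
	priority int
}

// queue implements heap.Interface, ordering higher priority first (assumption).
type queue []item

func (q queue) Len() int            { return len(q) }
func (q queue) Less(i, j int) bool  { return q[i].priority > q[j].priority }
func (q queue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x interface{}) { *q = append(*q, x.(item)) }
func (q *queue) Pop() interface{} {
	old := *q
	last := old[len(old)-1]
	*q = old[:len(old)-1]
	return last
}

// drain pops every item and returns the page numbers in pop order.
func drain(q *queue) []int {
	var pages []int
	for q.Len() > 0 {
		pages = append(pages, heap.Pop(q).(item).page)
	}
	return pages
}

// demoOrder pushes three items and returns the order in which they come out.
func demoOrder() []int {
	q := &queue{}
	heap.Push(q, item{page: 3, priority: 1})
	heap.Push(q, item{page: 7, priority: 9})
	heap.Push(q, item{page: 5, priority: 4})
	return drain(q)
}

func main() {
	fmt.Println(demoOrder()) // [7 5 3]
}
```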
type PrefetchStats ¶ added in v1.0.2
PrefetchStats holds prefetch statistics
type RTreeNode ¶
type RTreeNode struct {
// contains filtered or unexported fields
}
RTreeNode represents a node in the R-tree
type RTreeSpatialIndex ¶
type RTreeSpatialIndex struct {
// contains filtered or unexported fields
}
RTreeSpatialIndex provides a more sophisticated spatial index using a proper R-tree implementation
func NewRTreeSpatialIndex ¶
func NewRTreeSpatialIndex(texts []Text) *RTreeSpatialIndex
NewRTreeSpatialIndex creates a new R-tree based spatial index
func (*RTreeSpatialIndex) Insert ¶
func (rt *RTreeSpatialIndex) Insert(text Text)
Insert adds a text element to the R-tree
func (*RTreeSpatialIndex) Query ¶
func (rt *RTreeSpatialIndex) Query(bounds Rect) []Text
Query returns all text elements that intersect with the given bounds
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
A Reader is a single PDF file open for reading.
func NewReaderEncrypted ¶
NewReaderEncrypted opens a file for reading, using the data in f with the given total size. If the PDF is encrypted, NewReaderEncrypted calls pw repeatedly to obtain passwords to try. If pw returns the empty string, NewReaderEncrypted stops trying to decrypt the file and returns an error.
func NewReaderEncryptedWithMmap ¶
NewReaderEncryptedWithMmap opens a file for reading with memory mapping for large files. If the file size exceeds 10MB, it uses memory mapping to reduce memory usage. This is a wrapper around NewReaderEncrypted that optimizes for large files.
func NewReaderLinearized ¶ added in v1.2.0
NewReaderLinearized creates a reader optimized for linearized PDFs
func RecoverPDF ¶ added in v1.2.2
RecoverPDF attempts to recover a malformed PDF
func (*Reader) BatchExtractText ¶
BatchExtractText extracts text from multiple pages using lazy loading and object pooling. It is optimized for processing many pages without keeping them all in memory.
func (*Reader) ClearCache ¶ added in v1.0.6
func (r *Reader) ClearCache()
ClearCache clears the object cache, releasing all cached objects. This is useful for freeing memory after batch processing large PDFs.
func (*Reader) Close ¶ added in v1.0.2
Close closes the Reader and releases associated resources. If the underlying ReaderAt implements io.Closer, it will be closed.
func (*Reader) ExtractAllPagesParallel ¶ added in v1.0.2
ExtractAllPagesParallel extracts the text of all pages using the enhanced parallel extractor. This method integrates all performance optimizations: sharded cache, font prefetch, zero-copy, etc.
Example ¶
This example demonstrates the Reader's parallel extraction method.
// Note: this example requires an actual PDF file,
// so it only illustrates API usage.
/*
// Open PDF file
f, r, err := Open("document.pdf")
if err != nil {
panic(err)
}
defer f.Close()
// Create context
ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute)
defer cancel()
// Parallel extract all page texts
pages, err := r.ExtractAllPagesParallel(ctx, 0) // 0 = auto-detect CPU core count
if err != nil {
panic(err)
}
// Output text for each page
for i, pageText := range pages {
fmt.Printf("Page %d has %d characters\n", i+1, len(pageText))
}
*/
func (*Reader) ExtractPagesBatch ¶ added in v1.0.1
func (r *Reader) ExtractPagesBatch(opts BatchExtractOptions) <-chan BatchResult
ExtractPagesBatch extracts text from multiple pages in batches. It is optimized for high-throughput scenarios with many pages.
Example ¶
// This example shows how to use batch extraction
// (requires a real PDF file to run)
// r, err := Open("document.pdf")
// if err != nil {
// log.Fatal(err)
// }
// defer r.Close()
//
// opts := BatchExtractOptions{
// Workers: 4,
// Pages: []int{1, 2, 3, 4, 5}, // Extract first 5 pages
// }
//
// for result := range r.ExtractPagesBatch(opts) {
// if result.Error != nil {
// log.Printf("Error on page %d: %v", result.PageNum, result.Error)
// continue
// }
// fmt.Printf("Page %d: %d characters\n", result.PageNum, len(result.Text))
// }
func (*Reader) ExtractPagesBatchToString ¶ added in v1.0.1
func (r *Reader) ExtractPagesBatchToString(opts BatchExtractOptions) (string, error)
ExtractPagesBatchToString is a convenience function that collects all results into a single string
Example ¶
// This example shows how to extract all pages to a single string
// r, err := Open("document.pdf")
// if err != nil {
// log.Fatal(err)
// }
// defer r.Close()
//
// opts := BatchExtractOptions{
// Workers: 8,
// SmartOrdering: true,
// }
//
// text, err := r.ExtractPagesBatchToString(opts)
// if err != nil {
// log.Fatal(err)
// }
//
// fmt.Printf("Extracted %d characters from %d pages\n", len(text), r.NumPage())
func (*Reader) ExtractStructuredBatch ¶ added in v1.0.1
func (r *Reader) ExtractStructuredBatch(opts BatchExtractOptions) <-chan StructuredBatchResult
ExtractStructuredBatch extracts structured text in batches
func (*Reader) ExtractWithContext ¶
ExtractWithContext extracts plain text from all pages with cancellation support
func (*Reader) GetCacheCapacity ¶ added in v1.0.6
GetCacheCapacity returns the current object cache capacity. Returns 0 if no capacity limit is set (unbounded cache).
func (*Reader) GetCompatibilityInfo ¶ added in v1.2.0
func (r *Reader) GetCompatibilityInfo() *PDFCompatibilityInfo
GetCompatibilityInfo returns compatibility information for the PDF
func (*Reader) GetMetadata ¶
GetMetadata extracts metadata from the PDF document
func (*Reader) GetPlainText ¶
GetPlainText returns all the text in the PDF file
func (*Reader) GetPlainTextConcurrent ¶
GetPlainTextConcurrent extracts all pages concurrently using the specified number of workers.
func (*Reader) GetStyledTexts ¶
GetStyledTexts returns all sentences in the document as a slice, together with their style information
func (*Reader) Outline ¶
Outline returns the document outline. The Outline returned is the root of the outline tree and typically has no Title itself. That is, the children of the returned root are the top-level entries in the outline.
func (*Reader) Page ¶
Page returns the page for the given page number. Page numbers are indexed starting at 1, not 0. If the page is not found, Page returns a Page with p.V.IsNull().
func (*Reader) SetCacheCapacity ¶
func (*Reader) SetMetadata ¶
SetMetadata sets metadata fields in the PDF (for future write support). It is currently not implemented because the library is read-only.
type RecoveryOptions ¶ added in v1.2.2
type RecoveryOptions struct {
// MaxSearchSize limits how many bytes to search for recovery
MaxSearchSize int64
// AllowTruncated attempts to recover truncated files
AllowTruncated bool
// AllowMissingXref attempts to rebuild xref from object markers
AllowMissingXref bool
// AllowMissingTrailer attempts to recover without trailer
AllowMissingTrailer bool
// Verbose enables detailed recovery logging
Verbose bool
}
RecoveryOptions controls how PDF recovery is attempted
func DefaultRecoveryOptions ¶ added in v1.2.2
func DefaultRecoveryOptions() *RecoveryOptions
DefaultRecoveryOptions returns sensible defaults for recovery
type ResourceManager ¶
type ResourceManager struct {
// contains filtered or unexported fields
}
ResourceManager provides automatic resource cleanup
func NewResourceManager ¶
func NewResourceManager() *ResourceManager
NewResourceManager creates a new resource manager
func (*ResourceManager) Add ¶
func (rm *ResourceManager) Add(resource io.Closer)
Add adds a resource to be managed
func (*ResourceManager) Close ¶
func (rm *ResourceManager) Close() error
Close closes all managed resources
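The Add and Close pair suggests a collection of io.Closer values that is closed in one call. A minimal sketch of that pattern follows. The closerSet type is hypothetical, closing in reverse registration order is an assumption about reasonable behavior rather than documented behavior of ResourceManager, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// closerSet is a hypothetical stand-in for a ResourceManager-like type.
type closerSet struct{ resources []io.Closer }

// Add registers a resource to be closed later.
func (s *closerSet) Add(c io.Closer) { s.resources = append(s.resources, c) }

// Close closes resources in reverse registration order (an assumption) and
// joins any errors so that one failure does not hide the others.
func (s *closerSet) Close() error {
	var errs []error
	for i := len(s.resources) - 1; i >= 0; i-- {
		if err := s.resources[i].Close(); err != nil {
			errs = append(errs, err)
		}
	}
	s.resources = nil
	return errors.Join(errs...) // requires Go 1.20+; nil when errs is empty
}

// named records its name in a shared log when closed.
type named struct {
	name string
	log  *[]string
	err  error
}

func (n named) Close() error {
	*n.log = append(*n.log, n.name)
	return n.err
}

// demoClose registers two resources and reports the close order and error.
func demoClose() ([]string, error) {
	var log []string
	set := &closerSet{}
	set.Add(named{name: "file", log: &log})
	set.Add(named{name: "cache", log: &log, err: errors.New("cache: flush failed")})
	err := set.Close()
	return log, err
}

func main() {
	fmt.Println(demoClose()) // [cache file] cache: flush failed
}
```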
type ResultCache ¶
type ResultCache struct {
// contains filtered or unexported fields
}
ResultCache provides caching for parsed and classified results
func GetGlobalCache ¶
func GetGlobalCache() *ResultCache
GetGlobalCache returns a singleton cache instance
func NewResultCache ¶
func NewResultCache(maxSize int64, ttl time.Duration, policy string) *ResultCache
NewResultCache creates a new result cache with specified parameters
func (*ResultCache) Close ¶ added in v1.0.5
func (rc *ResultCache) Close()
Close stops the cleanup goroutine and releases resources
func (*ResultCache) Get ¶
func (rc *ResultCache) Get(key string) (interface{}, bool)
Get retrieves an item from the cache
func (*ResultCache) GetHitRatio ¶
func (rc *ResultCache) GetHitRatio() float64
GetHitRatio returns the cache hit ratio
func (*ResultCache) GetStats ¶
func (rc *ResultCache) GetStats() CacheStats
GetStats returns cache statistics
func (*ResultCache) Has ¶
func (rc *ResultCache) Has(key string) bool
Has checks if a key exists in the cache (without updating access stats)
func (*ResultCache) Put ¶
func (rc *ResultCache) Put(key string, value interface{})
Put adds an item to the cache
func (*ResultCache) Remove ¶
func (rc *ResultCache) Remove(key string) bool
Remove removes an item from the cache
type Row ¶
type Row struct {
Position int64
Content TextHorizontal
}
Row represents the contents of a row
type ShardedCache ¶ added in v1.0.2
type ShardedCache struct {
// contains filtered or unexported fields
}
ShardedCache implements a high-performance sharded cache with the following features:
- 256 shards to minimize lock contention
- Independent locks and LRU linked lists for each shard
- Statistics implemented with atomic operations
- Adaptive eviction strategy
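The sharding idea can be sketched on its own: hash the key, pick one of 256 shards, and lock only that shard. The example below is a simplified illustration with hypothetical names (shardedMap, shardFor). It uses FNV-1a as an assumed hash and leaves out the LRU lists, TTL handling, statistics, and eviction that ShardedCache is documented to provide.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 256

// shard is one independently locked partition of the key space.
type shard struct {
	mu    sync.RWMutex
	items map[string]interface{}
}

// shardedMap is a hypothetical, stripped-down sharded map.
type shardedMap struct{ shards [shardCount]*shard }

func newShardedMap() *shardedMap {
	m := &shardedMap{}
	for i := range m.shards {
		m.shards[i] = &shard{items: make(map[string]interface{})}
	}
	return m
}

// shardFor hashes the key with FNV-1a and selects one of the 256 shards.
func (m *shardedMap) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return m.shards[h.Sum32()%shardCount]
}

// Set stores value under key, locking only the owning shard.
func (m *shardedMap) Set(key string, value interface{}) {
	s := m.shardFor(key)
	s.mu.Lock()
	s.items[key] = value
	s.mu.Unlock()
}

// Get looks up key, taking a read lock on the owning shard only.
func (m *shardedMap) Get(key string) (interface{}, bool) {
	s := m.shardFor(key)
	s.mu.RLock()
	v, ok := s.items[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	m := newShardedMap()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.Set(fmt.Sprintf("font-%d", i), i)
		}(i)
	}
	wg.Wait()
	fmt.Println(m.Get("font-42")) // 42 true
}
```

Writers that hash to different shards never contend on the same mutex, which is the point of splitting the key space.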
func NewShardedCache ¶ added in v1.0.2
func NewShardedCache(maxSize int, ttl time.Duration) *ShardedCache
NewShardedCache creates a new sharded cache
func (*ShardedCache) Close ¶ added in v1.0.5
func (sc *ShardedCache) Close()
Close stops the cleanup goroutine and releases resources
func (*ShardedCache) Delete ¶ added in v1.0.2
func (sc *ShardedCache) Delete(key string)
Delete removes the entry for key from the cache
func (*ShardedCache) Get ¶ added in v1.0.2
func (sc *ShardedCache) Get(key string) (interface{}, bool)
Get retrieves the value for key from the cache
func (*ShardedCache) GetStats ¶ added in v1.0.2
func (sc *ShardedCache) GetStats() ShardedCacheStats
GetStats gets cache statistics
func (*ShardedCache) Set ¶ added in v1.0.2
func (sc *ShardedCache) Set(key string, value interface{}, size int64)
Set stores a value in the cache under key, with the given size
type ShardedCacheEntry ¶ added in v1.0.2
type ShardedCacheEntry struct {
// contains filtered or unexported fields
}
ShardedCacheEntry represents a cache entry
type ShardedCacheStats ¶ added in v1.0.2
type ShardedCacheStats struct {
Hits uint64
Misses uint64
Evictions uint64
Entries int64
Size int64
}
ShardedCacheStats holds cache statistics
type SizedBytePool ¶ added in v1.0.1
type SizedBytePool struct {
// contains filtered or unexported fields
}
SizedBytePool implements a multi-level size-bucketed object pool for byte slices. It reduces memory allocation overhead by reusing buffers of appropriate sizes.
Size buckets: 16B, 32B, 64B, 128B, 256B, 512B, 1KB, 4KB
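Bucket selection for such a pool can be sketched as "pick the smallest bucket that fits". The example below is an illustration under assumptions, not SizedBytePool itself: bucketPool and bucketIndex are hypothetical names, and how the real pool handles requests above 4KB or buffers of odd capacity is not documented here.

```go
package main

import (
	"fmt"
	"sync"
)

// bucketSizes mirrors the documented size buckets, in bytes.
var bucketSizes = []int{16, 32, 64, 128, 256, 512, 1024, 4096}

// bucketPool is a hypothetical size-bucketed pool with one sync.Pool per bucket.
type bucketPool struct{ pools []sync.Pool }

func newBucketPool() *bucketPool {
	p := &bucketPool{pools: make([]sync.Pool, len(bucketSizes))}
	for i, size := range bucketSizes {
		size := size // capture the per-bucket size for the closure
		p.pools[i].New = func() interface{} { return make([]byte, 0, size) }
	}
	return p
}

// bucketIndex returns the index of the smallest bucket that can hold size
// bytes, or -1 if size exceeds the largest bucket.
func bucketIndex(size int) int {
	for i, b := range bucketSizes {
		if size <= b {
			return i
		}
	}
	return -1
}

// Get returns an empty slice with at least the requested capacity.
func (p *bucketPool) Get(size int) []byte {
	i := bucketIndex(size)
	if i < 0 {
		return make([]byte, 0, size) // too large to pool (assumption)
	}
	return p.pools[i].Get().([]byte)[:0]
}

// Put returns buf to its bucket; buffers of any other capacity are dropped.
func (p *bucketPool) Put(buf []byte) {
	for i, b := range bucketSizes {
		if cap(buf) == b {
			p.pools[i].Put(buf[:0])
			return
		}
	}
}

func main() {
	p := newBucketPool()
	buf := p.Get(100)
	fmt.Println(len(buf), cap(buf)) // 0 128
	p.Put(buf)
	fmt.Println(cap(p.Get(5000))) // 5000
}
```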
func NewSizedBytePool ¶ added in v1.0.1
func NewSizedBytePool() *SizedBytePool
NewSizedBytePool creates a new sized byte pool with 8 size buckets
func (*SizedBytePool) Get ¶ added in v1.0.1
func (sp *SizedBytePool) Get(size int) []byte
Get retrieves a byte slice from the appropriate size bucket. It returns a buffer with at least the requested capacity.
func (*SizedBytePool) Put ¶ added in v1.0.1
func (sp *SizedBytePool) Put(buf []byte)
Put returns a byte slice to the appropriate pool. The slice is cleared before being returned to the pool.
type SizedPool ¶
type SizedPool struct {
// contains filtered or unexported fields
}
SizedPool is a fine-grained object pool that uses multi-level size bucketing
func NewSizedPool ¶
func NewSizedPool() *SizedPool
type SizedTextSlicePool ¶ added in v1.0.1
type SizedTextSlicePool struct {
// contains filtered or unexported fields
}
SizedTextSlicePool implements a size-bucketed pool for Text slices. It is similar to SizedBytePool but holds []Text instead of []byte.
func NewSizedTextSlicePool ¶ added in v1.0.1
func NewSizedTextSlicePool() *SizedTextSlicePool
NewSizedTextSlicePool creates a new sized text slice pool. Buckets: 8, 16, 32, 64, 128, 256 texts.
func (*SizedTextSlicePool) Get ¶ added in v1.0.1
func (sp *SizedTextSlicePool) Get(size int) []Text
Get retrieves a Text slice from the appropriate size bucket
func (*SizedTextSlicePool) Put ¶ added in v1.0.1
func (sp *SizedTextSlicePool) Put(slice []Text)
Put returns a Text slice to the appropriate pool
type SortStrategy ¶ added in v1.0.1
type SortStrategy int
SortStrategy represents different sorting algorithms available
const (
StrategyAuto SortStrategy = iota // Automatically select best algorithm
StrategyRadix // Radix sort for numeric keys
StrategyQuick // Quicksort for general comparison
StrategyInsertion // Insertion sort for small arrays
StrategyStandard // Go standard library sort
)
type SortingMetrics ¶ added in v1.0.1
type SortingMetrics struct {
RadixSortCount int
QuickSortCount int
InsertionSortCount int
StandardSortCount int
}
SortingMetrics tracks performance of different sorting strategies
func GetSortingMetrics ¶ added in v1.0.1
func GetSortingMetrics() SortingMetrics
GetSortingMetrics returns current sorting metrics
type SpatialGrid ¶ added in v1.2.3
type SpatialGrid struct {
// contains filtered or unexported fields
}
SpatialGrid is a spatial partitioning structure for efficient neighbor search. It divides 2D space into a grid of cells, allowing O(1) cell lookup and reducing neighbor search from O(n²) to O(n) for uniformly distributed points.
func NewSpatialGrid ¶ added in v1.2.3
func NewSpatialGrid(blocks []*TextBlock, cellSize float64) *SpatialGrid
NewSpatialGrid creates a new spatial grid for the given blocks. cellSize determines the granularity of the grid; typically should be around 2-3x the expected cluster radius for optimal performance.
func (*SpatialGrid) GetNearbyBlocks ¶ added in v1.2.3
func (g *SpatialGrid) GetNearbyBlocks(blockIdx int) []int
GetNearbyBlocks returns indices of blocks in the same cell and neighboring cells. This is much faster than searching all blocks when they're uniformly distributed. Memory optimized: reuses internal buffer to reduce allocations. WARNING: The returned slice is reused on next call - copy if needed.
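The grid lookup described for SpatialGrid can be sketched with plain points: map each point to a cell, then gather candidates from the surrounding 3x3 block of cells. The types and names below (point, cell, grid, nearby) are hypothetical. Unlike GetNearbyBlocks, this sketch allocates a fresh slice on every call, and excluding the queried point itself is an assumption.

```go
package main

import (
	"fmt"
	"math"
)

type point struct{ X, Y float64 }

// cell identifies one grid square by column and row.
type cell struct{ col, row int }

// grid is a hypothetical uniform grid over a set of points.
type grid struct {
	cellSize float64
	points   []point
	cells    map[cell][]int // cell -> indices of the points inside it
}

func newGrid(points []point, cellSize float64) *grid {
	g := &grid{cellSize: cellSize, points: points, cells: make(map[cell][]int)}
	for i, p := range points {
		c := g.cellOf(p)
		g.cells[c] = append(g.cells[c], i)
	}
	return g
}

// cellOf maps a point to its cell in O(1).
func (g *grid) cellOf(p point) cell {
	return cell{int(math.Floor(p.X / g.cellSize)), int(math.Floor(p.Y / g.cellSize))}
}

// nearby returns the indices of other points located in the 3x3 block of
// cells centered on the cell that contains points[idx].
func (g *grid) nearby(idx int) []int {
	c := g.cellOf(g.points[idx])
	var out []int
	for dc := -1; dc <= 1; dc++ {
		for dr := -1; dr <= 1; dr++ {
			for _, j := range g.cells[cell{c.col + dc, c.row + dr}] {
				if j != idx {
					out = append(out, j)
				}
			}
		}
	}
	return out
}

func main() {
	pts := []point{{5, 5}, {12, 8}, {95, 95}, {18, 22}}
	g := newGrid(pts, 10)
	fmt.Println(g.nearby(0)) // [1]
}
```

Only nine cells are inspected per query, so the cost depends on local density instead of the total number of points.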
type SpatialIndex ¶
type SpatialIndex struct {
// contains filtered or unexported fields
}
SpatialIndex provides spatial indexing for efficient text location queries. This is a simple grid-based implementation; for production use, consider a more sophisticated structure such as an R-tree.
func NewSpatialIndex ¶
func NewSpatialIndex(texts []Text) *SpatialIndex
NewSpatialIndex creates a new spatial index from text elements
func (*SpatialIndex) Query ¶
func (si *SpatialIndex) Query(bounds Rect) []Text
Query returns all text elements that potentially intersect with the given bounds
type SpatialIndexInterface ¶
SpatialIndexInterface is an interface that allows using either the grid or the R-tree implementation
func NewSpatialIndexInterface ¶
func NewSpatialIndexInterface(texts []Text) SpatialIndexInterface
NewSpatialIndexInterface creates a spatial index interface (can be switched between implementations)
type Stack ¶
type Stack struct {
// contains filtered or unexported fields
}
A Stack represents a stack of values.
type StartupConfig ¶ added in v1.0.2
type StartupConfig struct {
WarmupPools bool
WarmupConfig *WarmupConfig
PreallocateCaches bool
FontCacheSize int
ResultCacheSize int
TuneGC bool
GCPercent int
MemoryBallast int64
SetMaxProcs bool
MaxProcs int
}
StartupConfig holds the startup configuration
func DefaultStartupConfig ¶ added in v1.0.2
func DefaultStartupConfig() *StartupConfig
DefaultStartupConfig returns the default startup configuration, optimized for a lower memory footprint
type StreamProcessor ¶
type StreamProcessor struct {
// contains filtered or unexported fields
}
StreamProcessor handles streaming processing of PDF content to minimize memory usage
func NewStreamProcessor ¶
func NewStreamProcessor(chunkSize, bufferSize int, maxMemory int64) *StreamProcessor
NewStreamProcessor creates a new streaming processor
func (*StreamProcessor) Close ¶
func (sp *StreamProcessor) Close()
Close releases resources used by the stream processor
func (*StreamProcessor) ProcessPageStream ¶
func (sp *StreamProcessor) ProcessPageStream(reader *Reader, handler func(PageStream) error) error
ProcessPageStream processes pages in a streaming fashion
func (*StreamProcessor) ProcessTextBlockStream ¶
func (sp *StreamProcessor) ProcessTextBlockStream(reader *Reader, handler func(TextBlockStream) error) error
ProcessTextBlockStream processes text blocks in a streaming fashion
func (*StreamProcessor) ProcessTextStream ¶
func (sp *StreamProcessor) ProcessTextStream(reader *Reader, handler func(TextStream) error) error
ProcessTextStream processes text in a streaming fashion
type StreamingBatchExtractor ¶ added in v1.0.1
type StreamingBatchExtractor struct {
// contains filtered or unexported fields
}
StreamingBatchExtractor provides a streaming interface for batch extraction. This is useful for very large PDFs where you want to process results as they arrive.
Example ¶
// This example shows streaming batch extraction with a callback
// r, err := Open("document.pdf")
// if err != nil {
// log.Fatal(err)
// }
// defer r.Close()
//
// ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
// defer cancel()
//
// opts := BatchExtractOptions{
// Context: ctx,
// Workers: 4,
// }
//
// extractor := NewStreamingBatchExtractor(r, opts)
// extractor.Start()
//
// err = extractor.ProcessAll(func(result BatchResult) error {
// if result.Error != nil {
// return result.Error
// }
// // Process each page as it arrives
// fmt.Printf("Processing page %d...\n", result.PageNum)
// return nil
// })
//
// if err != nil {
// log.Fatal(err)
// }
func NewStreamingBatchExtractor ¶ added in v1.0.1
func NewStreamingBatchExtractor(r *Reader, opts BatchExtractOptions) *StreamingBatchExtractor
NewStreamingBatchExtractor creates a new streaming batch extractor
func (*StreamingBatchExtractor) Next ¶ added in v1.0.1
func (sbe *StreamingBatchExtractor) Next() *BatchResult
Next returns the next result, or nil if done
func (*StreamingBatchExtractor) ProcessAll ¶ added in v1.0.1
func (sbe *StreamingBatchExtractor) ProcessAll(callback func(BatchResult) error) error
ProcessAll processes all pages with a callback function
func (*StreamingBatchExtractor) Start ¶ added in v1.0.1
func (sbe *StreamingBatchExtractor) Start()
Start begins the extraction process
type StreamingMetadataExtractor ¶
type StreamingMetadataExtractor struct {
// contains filtered or unexported fields
}
StreamingMetadataExtractor extracts metadata in a streaming fashion
func NewStreamingMetadataExtractor ¶
func NewStreamingMetadataExtractor(chunkSize, bufferSize int, maxMemory int64) *StreamingMetadataExtractor
NewStreamingMetadataExtractor creates a new streaming metadata extractor
func (*StreamingMetadataExtractor) ExtractMetadataStream ¶
func (sme *StreamingMetadataExtractor) ExtractMetadataStream(reader *Reader) (<-chan Metadata, <-chan error)
ExtractMetadataStream extracts metadata in a streaming way
type StreamingTextClassifier ¶
type StreamingTextClassifier struct {
// contains filtered or unexported fields
}
StreamingTextClassifier classifies text in a streaming fashion to minimize memory usage
func NewStreamingTextClassifier ¶
func NewStreamingTextClassifier(chunkSize, bufferSize int, maxMemory int64) *StreamingTextClassifier
NewStreamingTextClassifier creates a new streaming text classifier
func (*StreamingTextClassifier) ClassifyTextStream ¶
func (stc *StreamingTextClassifier) ClassifyTextStream(reader *Reader) (<-chan ClassifiedBlock, <-chan error)
ClassifyTextStream classifies text in a streaming way
type StreamingTextExtractor ¶
type StreamingTextExtractor struct {
// contains filtered or unexported fields
}
StreamingTextExtractor provides memory-efficient text extraction for large PDFs
func NewStreamingTextExtractor ¶
func NewStreamingTextExtractor(r *Reader, maxCachedPages int) *StreamingTextExtractor
NewStreamingTextExtractor creates a streaming extractor for large PDFs
func (*StreamingTextExtractor) Close ¶
func (e *StreamingTextExtractor) Close()
Close releases resources used by the extractor
func (*StreamingTextExtractor) GetProgress ¶
func (e *StreamingTextExtractor) GetProgress() float64
GetProgress returns the extraction progress (0.0 to 1.0)
func (*StreamingTextExtractor) NextBatch ¶
func (e *StreamingTextExtractor) NextBatch() (results map[int]string, hasMore bool, err error)
NextBatch extracts text from the next batch of pages
func (*StreamingTextExtractor) NextPage ¶
func (e *StreamingTextExtractor) NextPage() (pageNum int, text string, hasMore bool, err error)
NextPage extracts text from the next page
func (*StreamingTextExtractor) Reset ¶
func (e *StreamingTextExtractor) Reset()
Reset resets the extractor to the beginning
type StringBuffer ¶ added in v1.0.2
type StringBuffer struct {
// contains filtered or unexported fields
}
StringBuffer is a string-building buffer that optimizes repeated concatenation
Example ¶
This example demonstrates the usage of StringBuffer.
builder := NewStringBuffer(100)
builder.WriteString("Hello")
builder.WriteByte(' ')
builder.WriteString("World")
result := builder.StringCopy()
fmt.Println(result)
Output: Hello World
func NewStringBuffer ¶ added in v1.0.2
func NewStringBuffer(capacity int) *StringBuffer
NewStringBuffer creates a new string buffer with the given initial capacity
func (*StringBuffer) Bytes ¶ added in v1.0.2
func (sb *StringBuffer) Bytes() []byte
Bytes returns the underlying byte slice
func (*StringBuffer) Len ¶ added in v1.0.2
func (sb *StringBuffer) Len() int
Len returns the current length
func (*StringBuffer) String ¶ added in v1.0.2
func (sb *StringBuffer) String() string
String returns the contents as a string without copying. Warning: do not use the StringBuffer after calling String.
func (*StringBuffer) StringCopy ¶ added in v1.0.2
func (sb *StringBuffer) StringCopy() string
StringCopy returns a safe copy of the contents as a string
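The difference between a zero-copy string and a copied one can be made concrete by checking whether the string shares storage with the buffer. The sketch below uses a hypothetical buffer type and assumes Go 1.20 or later for unsafe.String and unsafe.StringData; it illustrates the general technique and is not the StringBuffer implementation.

```go
package main

import (
	"fmt"
	"unsafe"
)

// buffer is a hypothetical stand-in for a StringBuffer-like type.
type buffer struct{ buf []byte }

func (b *buffer) WriteString(s string) { b.buf = append(b.buf, s...) }

// zeroCopyString returns a string that aliases the buffer's bytes.
// The bytes must not be modified while the returned string is in use.
func (b *buffer) zeroCopyString() string {
	if len(b.buf) == 0 {
		return ""
	}
	return unsafe.String(&b.buf[0], len(b.buf))
}

// copiedString returns an independent copy that stays valid after later writes.
func (b *buffer) copiedString() string { return string(b.buf) }

// sharesStorage reports whether s points at the first byte of buf.
func sharesStorage(s string, buf []byte) bool {
	return len(buf) > 0 && unsafe.StringData(s) == &buf[0]
}

func main() {
	b := &buffer{}
	b.WriteString("Hello World")
	fmt.Println(sharesStorage(b.zeroCopyString(), b.buf)) // true
	fmt.Println(sharesStorage(b.copiedString(), b.buf))   // false
}
```

Because the zero-copy string aliases the buffer, a later write that reuses the same backing array would change the string's contents, which is why the copying variant is the safe default.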
func (*StringBuffer) WriteByte ¶ added in v1.0.2
func (sb *StringBuffer) WriteByte(b byte) error
WriteByte writes a single byte
func (*StringBuffer) WriteBytes ¶ added in v1.0.2
func (sb *StringBuffer) WriteBytes(b []byte)
WriteBytes writes a byte slice
func (*StringBuffer) WriteString ¶ added in v1.0.2
func (sb *StringBuffer) WriteString(s string)
WriteString writes a string
type StringBuilderPool ¶ added in v1.0.1
type StringBuilderPool struct {
// contains filtered or unexported fields
}
StringBuilderPool provides size-aware string builder pooling
type StringPool ¶ added in v1.0.2
type StringPool struct {
// contains filtered or unexported fields
}
StringPool is a string pool that reuses common strings
Example ¶
This example demonstrates the usage of the string pool.
pool := NewStringPool()
// Put commonly used strings into the pool
fontName1 := pool.Intern("Arial")
fontName2 := pool.Intern("Arial") // Repeated strings will be reused
fmt.Println(fontName1 == fontName2) // Equal contents; the pooled copies also share memory
fmt.Println(pool.Size())
Output:
true
1
func NewStringPool ¶ added in v1.0.2
func NewStringPool() *StringPool
NewStringPool creates a new string pool
func (*StringPool) Intern ¶ added in v1.0.2
func (sp *StringPool) Intern(s string) string
Intern adds a string to the pool and returns the pooled version. Strings with the same content share memory.
func (*StringPool) Size ¶ added in v1.0.2
func (sp *StringPool) Size() int
Size returns the number of strings in the pool
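String interning of this kind is commonly built on a map from string to string. The following sketch uses hypothetical names (internPool, sameStorage) and a mutex for safety; it shows the general technique and makes no claim about how StringPool is implemented. It assumes Go 1.20 or later for unsafe.StringData, which is used only to show that the two results share storage.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// internPool is a hypothetical map-backed string interner.
type internPool struct {
	mu      sync.Mutex
	strings map[string]string
}

func newInternPool() *internPool {
	return &internPool{strings: make(map[string]string)}
}

// Intern returns the pooled copy of s, adding s to the pool on first sight.
func (p *internPool) Intern(s string) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if pooled, ok := p.strings[s]; ok {
		return pooled
	}
	p.strings[s] = s
	return s
}

// Size returns the number of distinct strings in the pool.
func (p *internPool) Size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.strings)
}

// sameStorage reports whether a and b point at the same backing bytes.
func sameStorage(a, b string) bool {
	return unsafe.StringData(a) == unsafe.StringData(b)
}

func main() {
	p := newInternPool()
	// Build the two inputs from byte slices so they start as separate values.
	a := p.Intern(string([]byte("Arial")))
	b := p.Intern(string([]byte("Arial")))
	fmt.Println(sameStorage(a, b), p.Size()) // true 1
}
```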
type StructuredBatchResult ¶ added in v1.0.1
type StructuredBatchResult struct {
PageNum int
Blocks []ClassifiedBlock
Error error
}
StructuredBatchResult holds the structured text extracted from a single page during batch extraction: the page number, its classified blocks, and any error encountered
type Text ¶
type Text struct {
Font string // the font used
FontSize float64 // the font size, in points (1/72 of an inch)
X float64 // the X coordinate, in points, increasing left to right
Y float64 // the Y coordinate, in points, increasing bottom to top
W float64 // the width of the text, in points
S string // the actual UTF-8 text
Vertical bool // whether the text is drawn vertically
Bold bool // whether the text is bold
Italic bool // whether the text is italic
Underline bool // whether the text is underlined
}
A Text represents a single piece of text drawn on a page.
func ConvertOptimizedSliceToText ¶ added in v1.2.3
func ConvertOptimizedSliceToText(texts []TextOptimized, pool *FontPool) []Text
ConvertOptimizedSliceToText converts a slice of TextOptimized to Text
func ConvertOptimizedToText ¶ added in v1.2.3
func ConvertOptimizedToText(t TextOptimized, pool *FontPool) Text
ConvertOptimizedToText converts a TextOptimized back to Text
func GetSizedTextSlice ¶ added in v1.0.1
GetSizedTextSlice retrieves a Text slice from the global pool
func GetText ¶
func GetText() *Text
GetText retrieves a Text object from the pool
func GetTextBySize ¶
GetTextBySize retrieves a Text object from the appropriate pool based on content size
type TextBlock ¶
type TextBlock struct {
Texts []Text
MinX float64
MaxX float64
MinY float64
MaxY float64
AvgFontSize float64
// contains filtered or unexported fields
}
TextBlock represents a coherent block of text (like a paragraph or column)
func ClusterTextBlocksOptimized ¶
ClusterTextBlocksOptimized clusters text blocks using a KD tree. This optimized version reduces temporary object allocation by using an object pool.
func ClusterTextBlocksOptimizedV2 ¶ added in v1.2.3
ClusterTextBlocksOptimizedV2 uses object pools to reduce GC pressure
func ClusterTextBlocksParallel ¶ added in v1.2.3
ClusterTextBlocksParallel delegates to ParallelV2 for large inputs. This is the main entry point for parallel clustering.
func ClusterTextBlocksParallelV2 ¶ added in v1.2.3
ClusterTextBlocksParallelV2 uses a work-partitioning strategy for parallel clustering. Each worker processes a chunk of blocks independently with local edge collection, then edges are merged sequentially. This avoids all lock contention.
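The work-partitioning strategy described above can be sketched as follows: each worker scans its own chunk and appends to a private edge slice, the slices are concatenated afterwards, and a union-find pass turns edges into clusters. All names (pt, edge, collectEdges, clusterCount) are hypothetical, and the brute-force distance test is a simplification chosen for brevity, not the neighbor search the package uses.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

type pt struct{ X, Y float64 }

// edge links two point indices that are close enough to be clustered.
type edge struct{ a, b int }

// collectEdges splits the index range into one chunk per worker. Each worker
// appends only to its own slice, so no locks are needed; the slices are
// merged sequentially once all workers have finished.
func collectEdges(points []pt, maxDist float64, workers int) []edge {
	if workers < 1 {
		workers = 1
	}
	local := make([][]edge, workers)
	chunk := (len(points) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start, end := w*chunk, (w+1)*chunk
		if end > len(points) {
			end = len(points)
		}
		if start >= end {
			continue
		}
		wg.Add(1)
		go func(w, start, end int) {
			defer wg.Done()
			for i := start; i < end; i++ {
				for j := i + 1; j < len(points); j++ {
					d := math.Hypot(points[i].X-points[j].X, points[i].Y-points[j].Y)
					if d <= maxDist {
						local[w] = append(local[w], edge{i, j})
					}
				}
			}
		}(w, start, end)
	}
	wg.Wait()

	var merged []edge
	for _, es := range local {
		merged = append(merged, es...)
	}
	return merged
}

// find returns the root of x, compressing the path by halving as it goes.
func find(parent []int, x int) int {
	for parent[x] != x {
		parent[x] = parent[parent[x]]
		x = parent[x]
	}
	return x
}

// clusterCount unions the endpoints of every edge and counts the sets left.
func clusterCount(n int, edges []edge) int {
	parent := make([]int, n)
	for i := range parent {
		parent[i] = i
	}
	count := n
	for _, e := range edges {
		ra, rb := find(parent, e.a), find(parent, e.b)
		if ra != rb {
			parent[ra] = rb
			count--
		}
	}
	return count
}

func main() {
	points := []pt{{0, 0}, {1, 0}, {2, 0}, {50, 50}, {51, 50}}
	edges := collectEdges(points, 1.5, 2)
	fmt.Println(len(edges), clusterCount(len(points), edges)) // 3 2
}
```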
func ClusterTextBlocksUltraOptimized ¶ added in v1.2.3
ClusterTextBlocksUltraOptimized is the extreme performance-optimized version. Goal: minimize memory allocation and GC pressure while maintaining parallel performance.
func ClusterTextBlocksUltraV2 ¶ added in v1.2.4
ClusterTextBlocksUltraV2 is an ultra-optimized parallel clustering algorithm.
Key optimizations:
1. SOA data layout for SIMD-friendly access
2. Compact spatial grid with binary search (no map lookups in hot path)
3. Pre-allocated edge buffers (zero allocation in hot path)
4. Lock-free union-find with path compression
5. Minimized memory copies and indirections
func ClusterTextBlocksV3 ¶ added in v1.2.3
ClusterTextBlocksV3 is an improved clustering algorithm using a spatial grid.
Time complexity: O(n) for uniformly distributed blocks (vs O(n²) for the naive approach).
Space complexity: O(n) for the grid structure.
func ClusterTextBlocksV3Fast ¶ added in v1.2.3
ClusterTextBlocksV3Fast is an even faster version with early termination. Suitable for very large documents where absolute precision is less critical.
func ClusterTextBlocksV4 ¶ added in v1.2.3
ClusterTextBlocksV4 automatically selects the best algorithm based on input size.
func GetTextBlock ¶ added in v1.2.3
func GetTextBlock() *TextBlock
GetTextBlock gets a TextBlock from pool
type TextBlockStream ¶
TextBlockStream represents a stream of text blocks
type TextClassifier ¶
type TextClassifier struct {
// contains filtered or unexported fields
}
TextClassifier classifies text runs into semantic blocks
func NewTextClassifier ¶
func NewTextClassifier(texts []Text, pageWidth, pageHeight float64) *TextClassifier
NewTextClassifier creates a new text classifier
func (*TextClassifier) ClassifyBlocks ¶
func (tc *TextClassifier) ClassifyBlocks() []ClassifiedBlock
ClassifyBlocks classifies text runs into semantic blocks
type TextEncoding ¶
type TextEncoding interface {
// Decode returns the UTF-8 text corresponding to
// the sequence of code points in raw.
Decode(raw string) (text string)
}
A TextEncoding represents a mapping between font code points and UTF-8 text.
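As an illustration of what satisfying this interface can look like, the sketch below implements a hypothetical single-byte, table-driven encoding. The `tableEncoding` type is not part of the package, and mapping unknown codes to U+FFFD is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// TextEncoding mirrors the interface documented above.
type TextEncoding interface {
	Decode(raw string) (text string)
}

// tableEncoding is a hypothetical single-byte encoding: each byte of raw
// is one font code point, looked up in a table.
type tableEncoding map[byte]rune

// Decode converts each code to its rune; codes missing from the table
// become U+FFFD (an assumption made for this sketch).
func (t tableEncoding) Decode(raw string) string {
	var sb strings.Builder
	for i := 0; i < len(raw); i++ {
		r, ok := t[raw[i]]
		if !ok {
			r = '\uFFFD'
		}
		sb.WriteRune(r)
	}
	return sb.String()
}

func main() {
	var enc TextEncoding = tableEncoding{0x01: 'H', 0x02: 'i', 0x03: 'é'}
	fmt.Println(enc.Decode("\x01\x02\x03"))
}
```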
func EnhancedCMapEncoding ¶ added in v1.2.8
func EnhancedCMapEncoding(name string) TextEncoding
EnhancedCMapEncoding returns a TextEncoding for the given CMap name, with enhanced support for CJK encodings
func LookupPredefinedCMap ¶ added in v1.2.8
func LookupPredefinedCMap(name string) TextEncoding
LookupPredefinedCMap looks up a CMap by name, checking both predefined and registered CMaps
type TextHorizontal ¶
type TextHorizontal []Text
TextHorizontal implements sort.Interface for sorting a slice of Text values in horizontal order, left to right, and then top to bottom within a column.
func (TextHorizontal) Len ¶
func (x TextHorizontal) Len() int
func (TextHorizontal) Less ¶
func (x TextHorizontal) Less(i, j int) bool
func (TextHorizontal) Swap ¶
func (x TextHorizontal) Swap(i, j int)
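The ordering described for TextHorizontal can be sketched with sort.Interface on a stand-in type. The `item` type and `sortHorizontal` helper are hypothetical, and the sketch assumes PDF-style coordinates where Y grows upward, so "top to bottom" means descending Y.

```go
package main

import (
	"fmt"
	"sort"
)

// item is a hypothetical stand-in for the package's Text type.
type item struct {
	S    string
	X, Y float64
}

// byHorizontal orders items left to right, then top to bottom.
type byHorizontal []item

func (x byHorizontal) Len() int      { return len(x) }
func (x byHorizontal) Swap(i, j int) { x[i], x[j] = x[j], x[i] }
func (x byHorizontal) Less(i, j int) bool {
	if x[i].X != x[j].X {
		return x[i].X < x[j].X
	}
	// Assumes Y increases upward, so a larger Y is nearer the top.
	return x[i].Y > x[j].Y
}

// sortHorizontal sorts items in place using byHorizontal.
func sortHorizontal(items []item) {
	sort.Sort(byHorizontal(items))
}

func main() {
	items := []item{{"c", 10, 5}, {"b", 0, 1}, {"a", 0, 9}}
	sortHorizontal(items)
	for _, it := range items {
		fmt.Println(it.S, it.X, it.Y)
	}
}
```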
type TextOptimized ¶ added in v1.2.3
type TextOptimized struct {
FontID uint32 // Font ID from FontPool (4 bytes vs ~16+ bytes for string)
X float32 // X coordinate (4 bytes vs 8 bytes)
Y float32 // Y coordinate (4 bytes vs 8 bytes)
FontSize float32 // Font size (4 bytes vs 8 bytes)
W float32 // Width (4 bytes vs 8 bytes)
S string // Text content - unavoidable string allocation
Flags uint8 // Packed flags: bit0=Vertical, bit1=Bold, bit2=Italic, bit3=Underline
// contains filtered or unexported fields
}
TextOptimized is a memory-optimized version of Text structure. It uses uint32 for font IDs (via FontPool) instead of storing full font names, uses float32 where precision allows, and packs boolean flags into a single byte. This reduces memory footprint by ~60% compared to the original Text structure.
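The flag packing can be sketched as below. The bit assignments follow the comment on the Flags field (bit0=Vertical, bit1=Bold, bit2=Italic, bit3=Underline); the `packed` type, the constant names, and the `set`/`has` helpers are hypothetical and only illustrate the technique.

```go
package main

import "fmt"

// Bit masks matching the documented layout of the Flags byte.
const (
	flagVertical  uint8 = 1 << iota // bit0
	flagBold                        // bit1
	flagItalic                      // bit2
	flagUnderline                   // bit3
)

// packed is a hypothetical stand-in holding only the flags byte.
type packed struct{ Flags uint8 }

// set turns the given flag on or off without touching the other bits.
func (p *packed) set(mask uint8, v bool) {
	if v {
		p.Flags |= mask
	} else {
		p.Flags &^= mask
	}
}

// has reports whether the given flag is set.
func (p *packed) has(mask uint8) bool { return p.Flags&mask != 0 }

func main() {
	var p packed
	p.set(flagBold, true)
	p.set(flagUnderline, true)
	fmt.Printf("%04b bold=%v italic=%v\n", p.Flags, p.has(flagBold), p.has(flagItalic))
}
```

Four booleans stored this way occupy one byte instead of four, which is part of how the struct stays compact.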
func ConvertTextSliceToOptimized ¶ added in v1.2.3
func ConvertTextSliceToOptimized(texts []Text, pool *FontPool) []TextOptimized
ConvertTextSliceToOptimized converts a slice of Text to TextOptimized
func ConvertTextToOptimized ¶ added in v1.2.3
func ConvertTextToOptimized(t Text, pool *FontPool) TextOptimized
ConvertTextToOptimized converts a Text to TextOptimized using the provided font pool
func (*TextOptimized) IsBold ¶ added in v1.2.3
func (t *TextOptimized) IsBold() bool
func (*TextOptimized) IsItalic ¶ added in v1.2.3
func (t *TextOptimized) IsItalic() bool
func (*TextOptimized) IsUnderline ¶ added in v1.2.3
func (t *TextOptimized) IsUnderline() bool
func (*TextOptimized) IsVertical ¶ added in v1.2.3
func (t *TextOptimized) IsVertical() bool
Helper methods for TextOptimized
func (*TextOptimized) SetBold ¶ added in v1.2.3
func (t *TextOptimized) SetBold(v bool)
func (*TextOptimized) SetItalic ¶ added in v1.2.3
func (t *TextOptimized) SetItalic(v bool)
func (*TextOptimized) SetUnderline ¶ added in v1.2.3
func (t *TextOptimized) SetUnderline(v bool)
func (*TextOptimized) SetVertical ¶ added in v1.2.3
func (t *TextOptimized) SetVertical(v bool)
type TextStream ¶
type TextStream struct {
Text string
PageNum int
Font string
FontSize float64
X, Y float64
W float64
Vertical bool
Confidence float64 // Confidence in the text recognition (0-1)
}
TextStream represents a stream of text with metadata
type TextVertical ¶
type TextVertical []Text
TextVertical implements sort.Interface for sorting a slice of Text values in vertical order, top to bottom, and then left to right within a line.
func (TextVertical) Len ¶
func (x TextVertical) Len() int
func (TextVertical) Less ¶
func (x TextVertical) Less(i, j int) bool
func (TextVertical) Swap ¶
func (x TextVertical) Swap(i, j int)
type TextWithLanguage ¶
type TextWithLanguage struct {
Text Text
Language LanguageInfo
Confidence float64
}
TextWithLanguage represents text with detected language information
type ToUnicodeCMap ¶ added in v1.2.8
type ToUnicodeCMap struct {
*CMap
}
ToUnicodeCMap is a specialized CMap for ToUnicode mappings
func NewToUnicodeCMap ¶ added in v1.2.8
func NewToUnicodeCMap() *ToUnicodeCMap
NewToUnicodeCMap creates a new ToUnicode CMap
func ParseToUnicodeCMap ¶ added in v1.2.8
func ParseToUnicodeCMap(r io.Reader) (*ToUnicodeCMap, error)
ParseToUnicodeCMap parses a ToUnicode CMap stream
func (*ToUnicodeCMap) DecodeCID ¶ added in v1.2.8
func (c *ToUnicodeCMap) DecodeCID(cid int) string
DecodeCID decodes a CID value to Unicode using the ToUnicode mapping
type Type1Cache ¶ added in v1.2.8
type Type1Cache struct {
// contains filtered or unexported fields
}
Type1Cache provides caching for Type1 font parsing operations
func GetGlobalType1Cache ¶ added in v1.2.8
func GetGlobalType1Cache() *Type1Cache
GetGlobalType1Cache returns the global Type1 cache instance
func NewType1Cache ¶ added in v1.2.8
func NewType1Cache(maxSize int, ttl time.Duration) *Type1Cache
NewType1Cache creates a new Type1 cache
func (*Type1Cache) GetFont ¶ added in v1.2.8
func (tc *Type1Cache) GetFont(data []byte) (*Type1Font, bool)
GetFont retrieves a cached Type1 font
func (*Type1Cache) PutFont ¶ added in v1.2.8
func (tc *Type1Cache) PutFont(data []byte, font *Type1Font)
PutFont caches a Type1 font
type Type1CacheEntry ¶ added in v1.2.8
type Type1CacheEntry struct {
Data *Type1Font
Expiration time.Time
LastAccess time.Time
AccessCount int64
}
Type1CacheEntry represents a cached Type1 font
func (*Type1CacheEntry) IsExpired ¶ added in v1.2.8
func (ce *Type1CacheEntry) IsExpired() bool
IsExpired checks if the cache entry has expired
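A time-based expiry check of this kind might look like the sketch below. It is not the package's code: the `cacheEntry` type keeps only the Expiration field, IsExpired takes the current time as a parameter so that it can be tested deterministically, and deriving the cache key from a SHA-256 hash of the font data is purely an assumption about how a cache keyed by raw bytes could work.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// cacheEntry is a hypothetical, reduced version of a cache entry.
type cacheEntry struct {
	Expiration time.Time
}

// IsExpired reports whether now is past the entry's expiration time.
func (e *cacheEntry) IsExpired(now time.Time) bool {
	return now.After(e.Expiration)
}

// keyFor derives a fixed-size key from raw font data (an assumption of
// this sketch; the package may key its cache differently).
func keyFor(data []byte) [sha256.Size]byte {
	return sha256.Sum256(data)
}

// expiredAfter creates an entry with the given TTL at a fixed start time
// and reports whether it has expired once elapsed has passed.
func expiredAfter(ttl, elapsed time.Duration) bool {
	created := time.Unix(0, 0)
	e := &cacheEntry{Expiration: created.Add(ttl)}
	return e.IsExpired(created.Add(elapsed))
}

func main() {
	fmt.Println(expiredAfter(time.Minute, 30*time.Second))
	fmt.Println(expiredAfter(time.Minute, 2*time.Minute))
	fmt.Println(keyFor([]byte("font-a")) == keyFor([]byte("font-a")))
}
```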
type Type1Font ¶ added in v1.2.8
type Type1Font struct {
// contains filtered or unexported fields
}
Type1Font represents a Type1 font
func NewType1Font ¶ added in v1.2.8
NewType1Font creates a new Type1 font from raw font data with caching
func ParseType1FromStream ¶ added in v1.2.8
ParseType1FromStream parses Type1 font from a PDF stream
func (*Type1Font) GlyphWidth ¶ added in v1.2.8
GlyphWidth returns the width of a glyph by name
func (*Type1Font) Info ¶ added in v1.2.8
func (f *Type1Font) Info() *Type1FontInfo
Info returns the font info
type Type1FontInfo ¶ added in v1.2.8
type Type1FontInfo struct {
FontName string
FullName string
FamilyName string
Weight string
ItalicAngle float64
IsFixedPitch bool
UnderlinePos float64
UnderlineThick float64
FontBBox [4]float64
UniqueID int
XUID []int
// Font metrics
Encoding string
PaintType int
FontType int
FontMatrix [6]float64
StrokeWidth float64
// Private dict values
BlueValues []int
OtherBlues []int
FamilyBlues []int
FamilyOtherBlues []int
BlueScale float64
BlueShift int
BlueFuzz int
StdHW float64
StdVW float64
StemSnapH []float64
StemSnapV []float64
ForceBold bool
LanguageGroup int
RndStemUp bool
ExpansionFactor float64
}
Type1FontInfo contains parsed Type1 font header information
func GetType1FontInfo ¶ added in v1.2.8
func GetType1FontInfo(v Value) *Type1FontInfo
GetType1FontInfo extracts Type1 font info from embedded font program
type Value ¶
type Value struct {
// contains filtered or unexported fields
}
A Value is a single PDF value, such as an integer, dictionary, or array. The zero Value is a PDF null (Kind() == Null, IsNull() == true).
func (Value) Float64 ¶
Float64 returns v's float64 value, converting from integer if necessary. If v.Kind() != Real and v.Kind() != Integer, Float64 returns 0.
func (Value) Index ¶
Index returns the i'th element in the array v. If v.Kind() != Array or if i is outside the array bounds, Index returns a null Value.
func (Value) IsNull ¶
IsNull reports whether the value is a null. It is equivalent to Kind() == Null.
func (Value) Key ¶
Key returns the value associated with the given name key in the dictionary v. Like the result of the Name method, the key should not include a leading slash. If v is a stream, Key applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Key returns a null Value.
func (Value) Keys ¶
Keys returns a sorted list of the keys in the dictionary v. If v is a stream, Keys applies to the stream's header dictionary. If v.Kind() != Dict and v.Kind() != Stream, Keys returns nil.
func (Value) Name ¶
Name returns v's name value. If v.Kind() != Name, Name returns the empty string. The returned name does not include the leading slash: if v corresponds to the name written using the syntax /Helvetica, Name() == "Helvetica".
func (Value) RawString ¶
RawString returns v's string value. If v.Kind() != String, RawString returns the empty string.
func (Value) Reader ¶
func (v Value) Reader() io.ReadCloser
Reader returns the data contained in the stream v. If v.Kind() != Stream, Reader returns a ReadCloser that responds to all reads with a “stream not present” error.
func (Value) String ¶
String returns a textual representation of the value v. Note that String is not the accessor for values with Kind() == String. To access such values, see RawString, Text, and TextFromUTF16.
func (Value) Text ¶
Text returns v's string value interpreted as a “text string” (defined in the PDF spec) and converted to UTF-8. If v.Kind() != String, Text returns the empty string.
func (Value) TextFromUTF16 ¶
TextFromUTF16 returns v's string value interpreted as big-endian UTF-16 and then converted to UTF-8. If v.Kind() != String or if the data is not valid UTF-16, TextFromUTF16 returns the empty string.
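The conversion can be sketched with the standard unicode/utf16 package. The function name `utf16BEToUTF8` is hypothetical. Note one difference from the behavior documented above: this sketch returns the empty string only for odd-length input, while unpaired surrogates are replaced with U+FFFD by utf16.Decode rather than rejected.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// utf16BEToUTF8 interprets raw as big-endian UTF-16 code units and
// returns the UTF-8 equivalent. Odd-length input yields "".
func utf16BEToUTF8(raw string) string {
	if len(raw)%2 != 0 {
		return ""
	}
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i < len(raw); i += 2 {
		// High byte first: big-endian.
		units = append(units, uint16(raw[i])<<8|uint16(raw[i+1]))
	}
	return string(utf16.Decode(units))
}

func main() {
	fmt.Println(utf16BEToUTF8("\x00H\x00i"))
	fmt.Println(utf16BEToUTF8("\xd8\x3d\xde\x00")) // surrogate pair
}
```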
type VerticalTextTransform ¶ added in v1.2.8
type VerticalTextTransform struct {
Enabled bool
OriginX float64
OriginY float64
Angle float64 // Rotation angle in degrees (typically 90 or -90)
}
VerticalTextTransform transforms text coordinates for vertical writing
func (*VerticalTextTransform) TransformGlyph ¶ added in v1.2.8
func (vt *VerticalTextTransform) TransformGlyph(x, y, w, h float64) (nx, ny, nw, nh float64)
TransformGlyph transforms a single glyph position for vertical writing
type WSDeque ¶
type WSDeque struct {
// contains filtered or unexported fields
}
WSDeque is a work-stealing deque (Chase-Lev algorithm).
func NewWSDeque ¶
func (*WSDeque) PushBottom ¶
PushBottom is called by the owner thread to push onto the bottom of the deque.
type WarmupConfig ¶ added in v1.0.2
type WarmupConfig struct {
// BytePoolWarmup number of buffers to warmup for each size bucket
BytePoolWarmup map[int]int
// TextPoolWarmup number of text slices to warmup for each size bucket
TextPoolWarmup map[int]int
// Concurrent whether to warmup concurrently
Concurrent bool
// MaxGoroutines maximum number of concurrent goroutines
MaxGoroutines int
}
WarmupConfig holds the warmup configuration.
func AggressiveWarmupConfig ¶ added in v1.0.2
func AggressiveWarmupConfig() *WarmupConfig
AggressiveWarmupConfig returns an aggressive warmup configuration (more pre-allocation).
func DefaultWarmupConfig ¶ added in v1.0.2
func DefaultWarmupConfig() *WarmupConfig
DefaultWarmupConfig returns the default warmup configuration. It uses reduced pre-allocation to minimize the initial memory footprint.
func LightWarmupConfig ¶ added in v1.0.2
func LightWarmupConfig() *WarmupConfig
LightWarmupConfig returns a light warmup configuration (less pre-allocation).
type WarmupStats ¶ added in v1.0.2
type WarmupStats struct {
BytePoolSizes map[int]int
TextPoolSizes map[int]int
TotalAllocated int64
IsWarmed bool
}
WarmupStats holds warmup statistics.
type WorkStealingExecutor ¶
type WorkStealingExecutor struct {
// contains filtered or unexported fields
}
WorkStealingExecutor is a work-stealing thread pool.
func NewWorkStealingExecutor ¶
func NewWorkStealingExecutor(numWorkers int) *WorkStealingExecutor
func (*WorkStealingExecutor) Start ¶
func (p *WorkStealingExecutor) Start()
func (*WorkStealingExecutor) Stop ¶
func (p *WorkStealingExecutor) Stop()
func (*WorkStealingExecutor) Submit ¶
func (p *WorkStealingExecutor) Submit(task WSTask)
type WorkStealingScheduler ¶
type WorkStealingScheduler struct {
// contains filtered or unexported fields
}
WorkStealingScheduler is a work-stealing scheduler. It reduces goroutine creation overhead and improves parallel processing efficiency.
func NewWorkStealingScheduler ¶
func NewWorkStealingScheduler(numWorkers int) *WorkStealingScheduler
NewWorkStealingScheduler creates a work-stealing scheduler.
func (*WorkStealingScheduler) Start ¶
func (wss *WorkStealingScheduler) Start()
Start starts the scheduler.
func (*WorkStealingScheduler) Submit ¶
func (wss *WorkStealingScheduler) Submit(task Task)
Submit submits a task.
func (*WorkStealingScheduler) Wait ¶
func (wss *WorkStealingScheduler) Wait()
Wait waits for all tasks to complete.
type WorkerPool ¶ added in v1.0.2
type WorkerPool struct {
// contains filtered or unexported fields
}
WorkerPool is a worker pool.
func (*WorkerPool) GetStats ¶ added in v1.0.2
func (wp *WorkerPool) GetStats() WorkerPoolStats
GetStats gets worker pool statistics
type WorkerPoolStats ¶ added in v1.0.2
WorkerPoolStats holds worker pool statistics.
type ZeroCopyBuilder ¶
type ZeroCopyBuilder struct {
// contains filtered or unexported fields
}
ZeroCopyBuilder is a zero-copy string builder.
func NewZeroCopyBuilder ¶
func NewZeroCopyBuilder(cap int) *ZeroCopyBuilder
func (*ZeroCopyBuilder) Reset ¶
func (b *ZeroCopyBuilder) Reset()
func (*ZeroCopyBuilder) UnsafeString ¶
func (b *ZeroCopyBuilder) UnsafeString() string
UnsafeString returns the accumulated bytes as a string without copying. Note: the underlying buffer must not be modified while the returned string is in use.
func (*ZeroCopyBuilder) WriteByte ¶
func (b *ZeroCopyBuilder) WriteByte(c byte) error
func (*ZeroCopyBuilder) WriteString ¶
func (b *ZeroCopyBuilder) WriteString(s string)
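The zero-copy technique behind a method like UnsafeString can be sketched as follows. The `builder` type here is a hypothetical stand-in, not the package's implementation, and the sketch assumes Go 1.20 or later for unsafe.String and unsafe.SliceData.

```go
package main

import (
	"fmt"
	"unsafe"
)

// builder is a hypothetical stand-in that accumulates bytes in a slice.
type builder struct{ buf []byte }

// WriteString appends s to the buffer.
func (b *builder) WriteString(s string) { b.buf = append(b.buf, s...) }

// Reset empties the buffer but keeps its capacity for reuse.
func (b *builder) Reset() { b.buf = b.buf[:0] }

// UnsafeString returns a string that shares memory with the buffer.
// No bytes are copied, so the caller must not write to or reset the
// builder while the returned string is still in use.
func (b *builder) UnsafeString() string {
	if len(b.buf) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b.buf), len(b.buf))
}

func main() {
	b := &builder{}
	b.WriteString("zero")
	b.WriteString("-copy")
	fmt.Println(b.UnsafeString())
}
```

The saving is one allocation and one copy per string. The cost is that reusing the builder after calling UnsafeString can silently change a string that the rest of the program treats as immutable.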
Notes ¶
Bugs ¶
The package is incomplete, although it has been used successfully on some large real-world PDF files.
The library makes no attempt at efficiency beyond the value cache and font cache. Further optimizations could improve performance for large files.
The support for reading encrypted files is limited to basic RC4 and AES encryption.
Source Files ¶
- adaptive_sort.go
- ascii85.go
- async_io.go
- batch_extract.go
- caching.go
- clustering_optimized.go
- clustering_parallel.go
- clustering_ultra_opt.go
- clustering_ultra_v2.go
- cmap.go
- compatibility.go
- context_support.go
- crypto.go
- enhanced_parallel.go
- errors.go
- extract.go
- extractor.go
- filter_decode.go
- font_cache_global.go
- font_cache_optimized.go
- font_cff.go
- font_cff_cache.go
- font_cff_pool.go
- font_cjk.go
- font_prefetch.go
- font_type1.go
- font_type1_cache.go
- lex.go
- memory_pools.go
- metadata.go
- multilang.go
- name.go
- optimization_examples.go
- optimizations_advanced.go
- optimized_extraction.go
- optimized_sorting.go
- page.go
- parallel_processing.go
- performance.go
- pool_sized.go
- pool_warmup.go
- ps.go
- read.go
- recovery.go
- sharded_cache.go
- simd_amd64.go
- simd_optimized.go
- simsys_amd64.go
- spatial_index.go
- streaming.go
- text.go
- text_classifier.go
- text_optimized.go
- text_ordering.go
- zero_copy_strings.go
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| pdfcli (command) | |
| test_coords (command) | |
| test_ordering (command) | |
| examples | |
| batch_fontcache (command) | |
| extract (command) | Example: Extract text from a PDF file with various methods |
| extract_text_performance (command) | |
| performance (command) | Example demonstrating performance optimization features |
| smart_ordering (command) | |
| | Pdfpasswd searches for the password for an encrypted PDF by trying all strings over a given alphabet up to a given length. |
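The enumeration strategy described for Pdfpasswd (all strings over an alphabet, up to a given length) can be sketched as below. The `search` function and its callback are hypothetical and say nothing about how the command itself is written. The sketch tries the empty string first, then every string of length 1, 2, and so on, advancing an index vector like an odometer.

```go
package main

import "fmt"

// search calls try on every string over alphabet with length 0..maxLen,
// shortest first, and returns the first string for which try is true.
func search(alphabet string, maxLen int, try func(string) bool) (string, bool) {
	letters := []rune(alphabet)
	if try("") {
		return "", true
	}
	if len(letters) == 0 {
		return "", false
	}
	for n := 1; n <= maxLen; n++ {
		idx := make([]int, n) // idx[i] selects the letter at position i
		buf := make([]rune, n)
		for {
			for i, k := range idx {
				buf[i] = letters[k]
			}
			if s := string(buf); try(s) {
				return s, true
			}
			// Advance like an odometer, rightmost position first.
			pos := n - 1
			for pos >= 0 {
				idx[pos]++
				if idx[pos] < len(letters) {
					break
				}
				idx[pos] = 0
				pos--
			}
			if pos < 0 {
				break // every combination of length n has been tried
			}
		}
	}
	return "", false
}

func main() {
	got, ok := search("abc", 3, func(s string) bool { return s == "cab" })
	fmt.Println(got, ok)
}
```

The number of candidates grows as the alphabet size raised to the length, so this approach is only practical for short lengths and small alphabets.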