cyberneticlibrary

Extract and process PDF documents

pdf subagent setup · L2 · ★0

happyvertical/sdk

What it does

Extracts text and metadata from PDFs, using direct text extraction for text-based files and OCR for image-based (scanned) ones.

Best for

Document processing pipelines that need both text-based and image-based PDF content extraction with confidence scoring.

Inputs

· PDF file path
· OCR language code
· preprocessing options

Outputs

· extracted text
· document metadata
· OCR confidence scores
· page structure
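
The four outputs listed above could be carried in a single result object. The sketch below is one possible shape, not the SDK's actual types: `PageResult`, `ExtractionResult`, and `buildResult` are hypothetical names, and the 0 to 1 confidence scale is an assumption.

```typescript
// Hypothetical result shape; field names are illustrative, not taken from the SDK.
interface PageResult {
  pageNumber: number;     // 1-based page index
  text: string;           // text recovered for this page
  ocrConfidence?: number; // assumed 0..1 scale; absent for pages read without OCR
}

interface ExtractionResult {
  text: string;                     // full document text
  metadata: Record<string, string>; // e.g. title, author
  pages: PageResult[];              // page structure, in reading order
}

// Assemble a document-level result from per-page results.
// Pages are sorted by number so the joined text follows reading order.
function buildResult(
  pages: PageResult[],
  metadata: Record<string, string>,
): ExtractionResult {
  const ordered = [...pages].sort((a, b) => a.pageNumber - b.pageNumber);
  return {
    text: ordered.map((p) => p.text).join("\n\n"),
    metadata,
    pages: ordered,
  };
}
```

Keeping the confidence score per page rather than per document lets a pipeline re-check only the pages that were scanned.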

Requires

· unpdf library
· @gutenye/ocr-node
· ONNX Runtime
· C++ stdlib

Preconditions

unpdf or the OCR library (@gutenye/ocr-node) is installed, and the required system dependencies (libstdc++.so.6) are available.

Failure modes

· corrupted PDF
· password-protected document
· OCR dependency missing
· memory exhaustion on large files
· low OCR confidence on scanned text
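
A pipeline might map raw errors onto the failure modes above so that each one can be handled differently. The sketch below is a heuristic and rests on several assumptions: that password and corruption errors surface with the names `PasswordException` and `InvalidPDFException` (as PDF.js names them), that a missing OCR dependency shows up as a Node module-resolution error or a message mentioning `libstdc++`, and that OCR confidence is on a 0 to 1 scale. `classifyFailure` and `isLowConfidence` are hypothetical helpers.

```typescript
type FailureMode =
  | "corrupted-pdf"
  | "password-protected"
  | "ocr-dependency-missing"
  | "memory-exhaustion"
  | "unknown";

// Heuristic mapping from a thrown value to a failure mode.
// The error names and codes checked here are assumptions; verify them
// against the library versions actually in use.
function classifyFailure(err: unknown): FailureMode {
  const e = err as { name?: string; code?: string; message?: string } | null;
  const name = e?.name ?? "";
  const code = e?.code ?? "";
  const message = e?.message ?? "";

  if (name === "PasswordException") return "password-protected";
  if (name === "InvalidPDFException") return "corrupted-pdf";
  if (
    code === "MODULE_NOT_FOUND" ||
    code === "ERR_MODULE_NOT_FOUND" ||
    message.includes("libstdc++")
  ) {
    return "ocr-dependency-missing";
  }
  // Failed buffer allocations usually raise a RangeError, though a
  // RangeError can have other causes, so this is only a hint.
  if (name === "RangeError" || /out of memory/i.test(message)) {
    return "memory-exhaustion";
  }
  return "unknown";
}

// Low OCR confidence is not an exception, so it is checked separately:
// flag a document whose mean confidence falls below the threshold.
function isLowConfidence(scores: number[], threshold = 0.6): boolean {
  if (scores.length === 0) return false; // nothing was OCR'd
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean < threshold;
}
```

The 0.6 threshold is arbitrary and would need tuning against real scans.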

Trust signals

· checks latest unpdf/OCR docs before recommending
· validates extraction quality
· handles fallback strategies
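
One common fallback strategy is to try direct text extraction first and run OCR only when the result looks empty. The sketch below assumes that approach; the card does not specify which strategy is used. `needsOcr`, `extractWithFallback`, and the 50 characters-per-page cutoff are hypothetical, and the two extractor callbacks stand in for whatever wrappers exist around unpdf and @gutenye/ocr-node.

```typescript
type Strategy = "text" | "ocr";

// Assumed heuristic: a PDF yielding very few non-whitespace characters
// per page is probably scanned and needs OCR.
function needsOcr(text: string, pageCount: number, minCharsPerPage = 50): boolean {
  if (pageCount <= 0) return true; // no pages reported: direct extraction gave nothing usable
  const chars = text.replace(/\s+/g, "").length;
  return chars / pageCount < minCharsPerPage;
}

// The extractors are injected so this sketch does not depend on any
// particular library signature.
async function extractWithFallback(
  file: Uint8Array,
  extractText: (f: Uint8Array) => Promise<{ text: string; pageCount: number }>,
  runOcr: (f: Uint8Array) => Promise<{ text: string; confidence: number }>,
): Promise<{ text: string; strategy: Strategy; confidence?: number }> {
  const direct = await extractText(file);
  if (!needsOcr(direct.text, direct.pageCount)) {
    return { text: direct.text, strategy: "text" };
  }
  // Direct extraction looked empty, so fall back to the slower OCR path.
  const ocr = await runOcr(file);
  return { text: ocr.text, strategy: "ocr", confidence: ocr.confidence };
}
```

Trying the cheap path first keeps OCR, and its native dependencies, off the hot path for PDFs that already carry a text layer.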