cyberneticlibrary

Extract and process PDF documents

pdfsubagentsetup L20
happyvertical/sdk
What it does

Extracts text and metadata from PDFs, using direct text extraction for text-based files and OCR for scanned or image-based ones.

Best for

Document processing pipelines that need both text-based and image-based PDF content extraction with confidence scoring.

Inputs
  • PDF file path
  • OCR language code
  • preprocessing options
Outputs
  • extracted text
  • document metadata
  • OCR confidence scores
  • page structure
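
The outputs listed above could be modelled roughly as follows. This is only a sketch: the type names, field names, and the assumption that confidence scores fall in [0, 1] are illustrative and are not taken from the subagent's actual schema.

```typescript
// Hypothetical result shape; all names here are illustrative assumptions.
interface RecognizedLine {
  text: string;
  confidence: number; // assumed to lie in [0, 1]
}

interface PageResult {
  pageNumber: number; // 1-based
  text: string;
  ocrLines: RecognizedLine[]; // empty when the page came from direct text extraction
}

interface ExtractionResult {
  text: string;
  metadata: Record<string, string>;
  pages: PageResult[];
}

// Mean OCR confidence across all pages, or null when no page went through OCR.
function meanOcrConfidence(result: ExtractionResult): number | null {
  let sum = 0;
  let count = 0;
  for (const page of result.pages) {
    for (const line of page.ocrLines) {
      sum += line.confidence;
      count += 1;
    }
  }
  return count === 0 ? null : sum / count;
}
```

Returning null rather than 0 when no OCR ran keeps "no score available" distinct from "very low score".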
Requires
  • unpdf library
  • @gutenye/ocr-node
  • ONNX Runtime
  • C++ stdlib
Preconditions

unpdf or the OCR library (@gutenye/ocr-node) must be installed, and the required system dependencies (libstdc++.so.6) must be available.

Failure modes
  • corrupted PDF
  • password-protected document
  • OCR dependency missing
  • memory exhaustion on large files
  • low OCR confidence on scanned text
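
Some of the failure modes above can be caught cheaply before any extraction runs. The sketch below is a hypothetical pre-flight check, not part of unpdf or the subagent: it tests for the `%PDF-` file header, scans for an `/Encrypt` key as a rough hint that the file is password-protected, and applies an assumed 50 MiB size limit. Both the encryption heuristic and the limit are illustrative and can produce false positives.

```typescript
// Hypothetical pre-flight check; names and thresholds are assumptions.
type PreflightIssue = "not-a-pdf" | "possibly-encrypted" | "too-large";

const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // the ASCII bytes of "%PDF-"

// Naive byte search for an ASCII needle.
function containsAscii(bytes: Uint8Array, needle: string): boolean {
  outer: for (let i = 0; i + needle.length <= bytes.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle.charCodeAt(j)) continue outer;
    }
    return true;
  }
  return false;
}

function preflight(
  bytes: Uint8Array,
  maxBytes: number = 50 * 1024 * 1024, // assumed limit, tune per deployment
): PreflightIssue[] {
  const issues: PreflightIssue[] = [];
  // Strict check: the header must sit at byte 0.
  const hasMagic = PDF_MAGIC.every((b, i) => bytes[i] === b);
  if (!hasMagic) issues.push("not-a-pdf");
  // Heuristic only: an /Encrypt key usually marks an encrypted document.
  if (containsAscii(bytes, "/Encrypt")) issues.push("possibly-encrypted");
  if (bytes.length > maxBytes) issues.push("too-large");
  return issues;
}
```

An empty result means none of these cheap checks fired; it does not guarantee that the file will parse.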
Trust signals
  • checks the latest unpdf and OCR documentation before making recommendations
  • validates extraction quality
  • handles fallback strategies
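
One way the quality validation and fallback strategy above might fit together is sketched below. The extractors are injected as plain functions so that the sketch makes no claim about the real unpdf or @gutenye/ocr-node APIs. The `Extractors` interface, the 20-characters-per-page heuristic, and the 0.6 confidence threshold are all assumptions made for illustration.

```typescript
// Hypothetical adapter interface; real library calls would be wrapped to fit it.
interface Extractors {
  extractText(path: string): Promise<{ text: string; pageCount: number }>;
  runOcr(path: string, language: string): Promise<{ text: string; confidences: number[] }>;
}

interface Outcome {
  text: string;
  source: "text" | "ocr";
  meanConfidence: number | null; // null when OCR was not needed
  lowConfidence: boolean;
}

// Heuristic: very little non-whitespace text per page suggests a scanned document.
function looksLikeScan(text: string, pageCount: number, minCharsPerPage = 20): boolean {
  if (pageCount <= 0) return true;
  const chars = text.replace(/\s+/g, "").length;
  return chars / pageCount < minCharsPerPage;
}

async function extractWithFallback(
  path: string,
  language: string,
  extractors: Extractors,
  minConfidence = 0.6, // assumed threshold
): Promise<Outcome> {
  let direct: { text: string; pageCount: number } | null = null;
  try {
    direct = await extractors.extractText(path);
  } catch (_err) {
    direct = null; // fall through to OCR; an OCR failure propagates to the caller
  }
  if (direct !== null && !looksLikeScan(direct.text, direct.pageCount)) {
    return { text: direct.text, source: "text", meanConfidence: null, lowConfidence: false };
  }
  const ocr = await extractors.runOcr(path, language);
  const n = ocr.confidences.length;
  const mean = n === 0 ? 0 : ocr.confidences.reduce((s, c) => s + c, 0) / n;
  return { text: ocr.text, source: "ocr", meanConfidence: mean, lowConfidence: mean < minConfidence };
}
```

The `lowConfidence` flag leaves the caller to decide whether to retry with different preprocessing options or to accept the OCR text as it is.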