Extract and process PDF documents
pdf, subagent, setup
happyvertical/sdk
What it does
Extracts text and metadata from PDFs, using direct text extraction for text-based files and OCR for image-based ones.
Best for
Document processing pipelines that need both text-based and image-based PDF content extraction with confidence scoring.
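A minimal sketch of how a pipeline might route pages between the two paths, assuming the text layer has already been extracted page by page and that a page with almost no text is probably a scanned image. The helper name `pagesNeedingOcr` and the 20-character threshold are illustrative assumptions, not part of the skill itself.

```typescript
// Hypothetical helper: decide which pages should be sent to OCR.
// Assumption: a page whose text layer holds fewer than `minChars`
// non-whitespace characters is treated as image-based (scanned).
function pagesNeedingOcr(pageTexts: string[], minChars = 20): number[] {
  const needOcr: number[] = [];
  pageTexts.forEach((text, index) => {
    const visibleChars = text.replace(/\s+/g, "").length;
    if (visibleChars < minChars) {
      needOcr.push(index + 1); // report 1-based page numbers
    }
  });
  return needOcr;
}

// Example: page 1 has a usable text layer, pages 2 and 3 do not.
const samplePages = [
  "Quarterly report: revenue grew in every region this year.",
  "   \n ",
  "Page 3",
];
console.log(pagesNeedingOcr(samplePages)); // pages 2 and 3
```

The threshold is a tuning knob: documents with legitimately sparse pages (title pages, section dividers) may need a lower value to avoid unnecessary OCR.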
Inputs
- · PDF file path
- · OCR language code
- · preprocessing options
Outputs
- · extracted text
- · document metadata
- · OCR confidence scores
- · page structure
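One way the four outputs above could be carried in a single result object is sketched below. The type and field names (`PageResult`, `ExtractionResult`, `buildResult`) are hypothetical, and the 0 to 1 confidence scale is an assumption; the actual shape returned by the skill may differ.

```typescript
// Hypothetical result shape; all names here are illustrative.
interface PageResult {
  pageNumber: number; // 1-based
  text: string;
  source: "text-layer" | "ocr";
  confidence?: number; // assumed 0..1 scale, only set for OCR pages
}

interface ExtractionResult {
  text: string; // all pages joined
  metadata: Record<string, unknown>;
  pages: PageResult[]; // page structure
  meanOcrConfidence: number | null; // null when no page used OCR
}

function buildResult(
  pages: PageResult[],
  metadata: Record<string, unknown> = {},
): ExtractionResult {
  // Average confidence over OCR pages only; text-layer pages carry no score.
  const scores = pages
    .filter((p) => p.source === "ocr" && typeof p.confidence === "number")
    .map((p) => p.confidence as number);
  const meanOcrConfidence =
    scores.length === 0
      ? null
      : scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return {
    text: pages.map((p) => p.text).join("\n\n"),
    metadata,
    pages,
    meanOcrConfidence,
  };
}
```

Keeping `source` on each page lets downstream steps tell apart text that came from the PDF's own text layer and text that was recognized by OCR.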
Requires
- · unpdf library
- · @gutenye/ocr-node
- · ONNX Runtime
- · C++ stdlib
Preconditions
The unpdf library or the OCR library is installed, and the system dependencies (libstdc++.so.6) are available.
Failure modes
- · corrupted PDF
- · password-protected document
- · OCR dependency missing
- · memory exhaustion on large files
- · low OCR confidence on scanned text
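The failure modes above can be mapped to explicit categories so that callers can react to each one. The sketch below is an assumption-heavy illustration: it guesses that PDF parsing errors carry pdf.js-style names (`PasswordException`, `InvalidPDFException`), that a missing module or shared library surfaces as a Node.js error code, and it uses arbitrary limits (100 MB, 0.6 confidence). All function names are hypothetical.

```typescript
type FailureMode =
  | "corrupted-pdf"
  | "password-protected"
  | "ocr-dependency-missing"
  | "file-too-large"
  | "low-ocr-confidence"
  | "unknown";

// Assumption: errors expose pdf.js-style `name` values or Node.js `code` values.
function classifyError(err: unknown): FailureMode {
  const e = (err ?? {}) as { name?: string; code?: string; message?: string };
  if (e.name === "PasswordException") return "password-protected";
  if (e.name === "InvalidPDFException") return "corrupted-pdf";
  const missingModule =
    e.code === "MODULE_NOT_FOUND" ||
    e.code === "ERR_MODULE_NOT_FOUND" ||
    e.code === "ERR_DLOPEN_FAILED";
  // A missing C++ runtime usually mentions the library in the message.
  if (missingModule || /libstdc\+\+/.test(e.message ?? "")) {
    return "ocr-dependency-missing";
  }
  return "unknown";
}

// Guard against memory exhaustion before loading the file (limit is illustrative).
function checkFileSize(bytes: number, maxBytes = 100 * 1024 * 1024): FailureMode | null {
  return bytes > maxBytes ? "file-too-large" : null;
}

// Flag scanned text whose mean OCR confidence falls below a chosen floor.
function checkConfidence(mean: number | null, min = 0.6): FailureMode | null {
  return mean !== null && mean < min ? "low-ocr-confidence" : null;
}
```

Returning a category instead of rethrowing keeps the decision about retrying, falling back to OCR, or surfacing the problem to the user in the calling pipeline.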
Trust signals
- · checks latest unpdf/OCR docs before recommending
- · validates extraction quality
- · handles fallback strategies