cyberneticlibrary

Extract structured data from PDFs

deepread-ocr skill setup L20
Sheshiyer/skill-clusters
What it does

Extracts text and structured JSON from PDFs, with a confidence score for each extracted field.

Best for

Extracting structured data from invoices, forms, and receipts, where 90% automatic extraction plus 10% human review beats 100% manual entry.

Inputs
  • PDF file
  • JSON schema (optional, for structured extraction)
Outputs
  • Clean markdown text
  • Structured fields with confidence scores
  • hil_flag per field (flags the field for human-in-the-loop review)
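
A minimal sketch of how the per-field hil_flag could drive the review split described under "Best for". The response shape here is an assumption: the card only says each field carries a confidence score and an hil_flag, so the key names (`value`, `confidence`, `hil_flag`), the helper `split_for_review`, and the sample values are all hypothetical and are not taken from the DeepRead API.

```python
from typing import Any

# Hypothetical response shape and made-up sample values, for illustration only.
sample_fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.98, "hil_flag": False},
    "total": {"value": "1,250.00", "confidence": 0.95, "hil_flag": False},
    "signature_name": {"value": "J. Sm?th", "confidence": 0.41, "hil_flag": True},
}


def split_for_review(
    fields: dict[str, dict[str, Any]],
) -> tuple[dict[str, Any], list[str]]:
    """Return (auto-accepted values, names of fields needing human review)."""
    accepted: dict[str, Any] = {}
    needs_review: list[str] = []
    for name, field in fields.items():
        # Treat a missing flag as uncertain so nothing skips review by accident.
        if field.get("hil_flag", True):
            needs_review.append(name)
        else:
            accepted[name] = field.get("value")
    return accepted, needs_review


accepted, needs_review = split_for_review(sample_fields)
```

Defaulting a missing flag to "needs review" is a deliberately conservative choice; a pipeline that trusts the service more could also gate on the confidence score.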
Requires
  • DeepRead REST API
  • API key
Preconditions

A DeepRead API key and an accessible PDF. The free tier allows 2,000 pages/month.
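
Since the free tier is capped at 2,000 pages/month, it may help to check a batch against the remaining budget before submitting it. The sketch below assumes page usage is tracked locally by the caller; the function names are hypothetical and nothing here queries the DeepRead API.

```python
FREE_TIER_PAGES_PER_MONTH = 2000  # figure stated in the card


def pages_remaining(pages_used: int, quota: int = FREE_TIER_PAGES_PER_MONTH) -> int:
    """Pages left this month, never negative."""
    return max(quota - pages_used, 0)


def fits_in_quota(
    batch_page_counts: list[int],
    pages_used: int,
    quota: int = FREE_TIER_PAGES_PER_MONTH,
) -> bool:
    """True if every PDF in the batch can be processed within the remaining quota."""
    return sum(batch_page_counts) <= pages_remaining(pages_used, quota)
```

For example, with 1,850 pages already used, a batch of PDFs totalling 150 pages fits, while one totalling 160 pages does not.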

Failure modes
  • Handwritten or obscured text is marked hil_flag=true
  • Monthly quota exhausted
  • Complex nested-array schemas fail
Trust signals
  • Per-field confidence scores
  • hil_flag indicates uncertainty
  • Multi-pass validation
  • Free tier available