cyberneticlibrary

Extract and parse content from URLs and files

parser, skill, setup, L20
Sheshiyer/skill-clusters
What it does

Parse structured data from unstructured text or documents

Best for

Messy human-written text where meaning matters more than exact format. Semantic parsing tends to handle variants and typos better than regex.

Inputs
  • raw text, HTML, PDF, or Markdown
Outputs
  • structured JSON matching the target schema
  • per-field extraction confidence scores
Requires
  • LLM (for semantic parsing)
  • optional: regex for known patterns
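
One way the two requirements could fit together is to try a regex for a known pattern first and call the LLM only when it misses. The sketch below assumes a single email field; `EMAIL_RE`, `extract_email`, and `fake_llm` are hypothetical names, and `fake_llm` is a stand-in for a real model call, not part of this skill.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern for one known field; a real schema would define its own.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_email(
    text: str,
    llm_extract: Optional[Callable[[str], Optional[str]]] = None,
) -> dict:
    """Try the cheap regex first; fall back to the semantic parser on a miss."""
    match = EMAIL_RE.search(text)
    if match:
        return {"value": match.group(0), "source": "regex"}
    if llm_extract is not None:
        return {"value": llm_extract(text), "source": "llm"}
    return {"value": None, "source": "none"}


def fake_llm(text: str) -> Optional[str]:
    """Stand-in for a model call: normalizes the 'name at domain dot com' variant."""
    normalized = text.lower().replace(" at ", "@").replace(" dot ", ".")
    match = EMAIL_RE.search(normalized)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(extract_email("Contact: jane@example.com"))
    print(extract_email("reach me: jane at example dot com", fake_llm))
```

Recording the `source` of each value keeps it visible which extractions came from a deterministic pattern and which came from the model.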
Preconditions

Input document exists; target schema defined

Failure modes

Hallucinated fields when the schema is too loose; missed data when the input uses a format variant the parser has not seen; low confidence on ambiguous input
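
A strict schema check is one possible guard against the first failure mode. The sketch below assumes a small invoice schema; `SCHEMA` and `enforce_schema` are illustrative names, not part of this skill. Fields outside the schema are dropped and reported rather than passed through.

```python
from typing import Any

# Hypothetical target schema: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}


def enforce_schema(
    extracted: dict[str, Any], schema: dict[str, type]
) -> tuple[dict[str, Any], list[str]]:
    """Keep only fields the schema allows; report everything else."""
    clean: dict[str, Any] = {}
    problems: list[str] = []
    for name, value in extracted.items():
        if name not in schema:
            # Likely a hallucinated field: the schema never asked for it.
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, schema[name]):
            problems.append(f"wrong type for {name}")
        else:
            clean[name] = value
    for name in schema:
        if name not in extracted:
            problems.append(f"missing field: {name}")
    return clean, problems


if __name__ == "__main__":
    raw = {"invoice_id": "A-17", "total": 99.5, "discount_code": "X"}
    print(enforce_schema(raw, SCHEMA))
```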

Trust signals
  • confidence scores per field
  • validation against schema constraints
  • fallback to human review for low-confidence extracts
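
The first and third signals above can be combined into a simple routing step. This is a minimal sketch: the `Field` type, the `route` helper, and the 0.8 cutoff are all assumptions chosen for illustration, and a real threshold would need tuning on labeled data.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.8  # assumed cutoff, not a value defined by this skill


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the parser


def route(
    fields: list[Field], threshold: float = REVIEW_THRESHOLD
) -> tuple[dict[str, str], list[str]]:
    """Accept confident fields; queue the rest for human review."""
    accepted = {f.name: f.value for f in fields if f.confidence >= threshold}
    needs_review = [f.name for f in fields if f.confidence < threshold]
    return accepted, needs_review


if __name__ == "__main__":
    fields = [
        Field("invoice_id", "A-17", 0.97),
        Field("currency", "EUR", 0.55),
    ]
    print(route(fields))
```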