All Classes and Interfaces
Class
Description
Anthropic Messages API provider.
A single audit event emitted by the library, such as extraction success, extraction failure,
or citation-below-threshold.
Callback the library invokes for every auditable event.
Visual / structural classification of a TextSection as detected by Layer 1 parsing: a geometric / typographic judgement, NOT a semantic one.
The verifiable evidence anchor for a single extracted field.
A confidence score for a single extracted field, plus a free-form rationale.
How a ParsedDocument is rendered into the bounded context window of an LLM call.
Layer 1 entry point: read a CSV file from disk into a ParsedDocument containing exactly one TableSection that mirrors the CSV row-major.
DeepSeek Chat Completions provider.
Public entry point for the library.
Minimal command-line entry point for build-time migration helpers.
Metadata for a ParsedDocument: the source filename, total page count, and (optionally) the timestamp at which the source document was authored / published.
Layer 1 entry point: read a DOCX file from disk into a ParsedDocument with one TextSection per non-blank paragraph and one TableSection per table.
Immutable fluent builder for one extraction call.
Thrown from the public extraction API when an extraction run fails after exhausting retries
or when an invariant is violated mid-flight.
The output of DocTruth.extract(...).run(): the extracted value plus per-field citations, per-field confidence scores, and run-level provenance.
A figure (image, chart, diagram) recovered from the source document, represented by its caption text plus a SourceLocation.
Google Gemini generateContent provider.
Multi-level summarisation: condense the document at increasing granularities, hand the LLM the level that fits the budget.
Immutable fluent builder for JSON Schema-driven extraction.
Caller-supplied JSON Schema for schema-bound extraction.
The Layer 2 backend abstraction: an LLM API client.
Pixel bounding box for one OCR region on a rendered page image.
Optional OCR backend, plugged into PdfDocumentParser to recover text from scanned (image-only) pages.
Output of one ocr call.
One OCR-recovered text region with its pixel bounding box on the rendered page image.
OpenAI Chat-Completions API provider.
What a PriorityTruncate strategy does when the priority sections alone exceed the configured maxChars budget.
A single section of a parsed source document.
Thrown by Layer 1 document parsers (PDF / DOCX) when a source file cannot be parsed or
when a structural invariant is violated.
Layer 1 entry point: read a PDF file from disk into a ParsedDocument with source locations preserved per detected layout block.
Smart-context strategy for keeping priority sections while trimming everything else to fit.
Bi-temporal provenance for an ExtractionResult: the model that produced it, when the extraction ran, and (optionally) when the source document was authored, the region the extraction was processed in, and the retention horizon of the audit record.
Supplemental provenance metadata kept behind Provenance so the public provenance record stays small while preserving retry, data-residency, and retention semantics.
Thrown by Layer 2 LLM providers (Anthropic, OpenAI, Gemini, DeepSeek) when an upstream call fails.
Per-call knobs passed to an LlmProvider on every request.
What the library hands an LlmProvider on every call: the system prompt, the user prompt (rendered from a ParsedDocument by the configured ContextStrategy), the JSON Schema for the target type, and the per-call options.
What an LlmProvider returns on a successful call: the raw JSON the LLM produced plus the per-call ProviderUsage.
Token-usage and model-version data returned by an LLM provider on every successful call.
Render an ExtractionResult as W3C PROV-O JSON-LD.
Sign / wrap an audit JSON document for tamper-evident persistence.
Fixed-size character windows with optional overlap.
A 1-indexed page + line span into a parsed source document, plus a 0-indexed character
offset into the source page text.
A flat string-cell table recovered from the source document, anchored to a SourceLocation.
A run of plain text recovered from the source document, anchored to a SourceLocation and tagged with a BlockKind that classifies the geometric / typographic shape of the block (HEADING / BODY / LIST / OTHER).
Layer 1 entry point: read an XLSX file from disk into a ParsedDocument with one TableSection per non-empty sheet.
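
The index entry "Fixed-size character windows with optional overlap" describes a chunking technique that is easier to picture with a small example. The following is a minimal sketch of that general technique only, not the library's implementation: the class name SlidingWindowSketch, the method windows, and the parameters size and overlap are hypothetical names chosen for illustration, and the library's actual class, signature, and edge-case behaviour may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of the documented API.
final class SlidingWindowSketch {
    private SlidingWindowSketch() {}

    // Splits text into windows of at most `size` characters, where each window
    // repeats the last `overlap` characters of the previous one.
    static List<String> windows(String text, int size, int overlap) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        if (overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("overlap must be in [0, size)");
        }
        List<String> out = new ArrayList<>();
        int step = size - overlap; // how far the window start advances each time
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            out.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the final window already reaches the end of the text
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Window size 4 with 1 character of overlap over a 10-character string.
        System.out.println(windows("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

In this sketch the overlap is assumed to be strictly smaller than the window size so that every step makes progress; an overlap of 0 yields plain non-overlapping chunks.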