Package ai.doctruth
Public API of DocTruth: auditable LLM extraction for Java.
Every type in this package (and only this package) is part of the stable public API.
Subpackages under ai.doctruth.internal are explicitly NOT public API and may
change without a major version bump.
See CONTRIBUTING.md in the repository root for the engineering contract.
- Since: 0.1.0
Class descriptions
- Anthropic Messages API provider.
- Visual / structural classification of a TextSection as detected by Layer 1 parsing: a geometric / typographic judgement, NOT a semantic one.
- The verifiable evidence anchor for a single extracted field.
- A confidence score for a single extracted field, plus a free-form rationale.
- How a ParsedDocument is rendered into the bounded context window of an LLM call.
- Layer 1 entry point: read a CSV file from disk into a ParsedDocument containing exactly one TableSection that mirrors the CSV row-major.
- DeepSeek Chat Completions provider.
- Public entry point for the library.
- Metadata for a ParsedDocument: the source filename, total page count, and (optionally) the timestamp at which the source document was authored / published.
- Layer 1 entry point: read a DOCX file from disk into a ParsedDocument with one TextSection per non-blank paragraph and one TableSection per table.
- Immutable fluent builder for one extraction call.
- Thrown from the public extraction API when an extraction run fails after exhausting retries or when an invariant is violated mid-flight.
- The output of DocTruth.extract(...).run(): the extracted value plus per-field citations, per-field confidence scores, and run-level provenance.
- A figure (image, chart, diagram) recovered from the source document, represented by its caption text plus a SourceLocation.
- Google Gemini generateContent provider.
- Multi-level summarisation: condense the document at increasing granularities, hand the LLM the level that fits the budget.
- Immutable fluent builder for JSON Schema-driven extraction.
- Caller-supplied JSON Schema for schema-bound extraction.
- The Layer 2 backend abstraction: an LLM API client.
- OpenAI Chat-Completions API provider.
- What a PriorityTruncate strategy does when the priority sections alone exceed the configured maxChars budget.
- A single section of a parsed source document.
- Thrown by Layer 1 document parsers (PDF / DOCX) when a source file cannot be parsed or when a structural invariant is violated.
- Layer 1 entry point: read a PDF file from disk into a ParsedDocument with source locations preserved per detected layout block.
- Smart-context strategy for keeping priority sections while trimming everything else to fit.
- Bi-temporal provenance for an ExtractionResult: the model that produced it, when the extraction ran, and (optionally) when the source document was authored, the region the extraction was processed in, and the retention horizon of the audit record.
- Supplemental provenance metadata kept behind Provenance so the public provenance record stays small while preserving retry, data-residency, and retention semantics.
- Thrown by Layer 2 LLM providers (Anthropic, OpenAI, Gemini, DeepSeek) when an upstream call fails.
- Per-call knobs passed to an LlmProvider on every request.
- What the library hands an LlmProvider on every call: the system prompt, the user prompt (rendered from a ParsedDocument by the configured ContextStrategy), the JSON Schema for the target type, and the per-call options.
- What an LlmProvider returns on a successful call: the raw JSON the LLM produced plus the per-call ProviderUsage.
- Token-usage and model-version data returned by an LLM provider on every successful call.
- Fixed-size character windows with optional overlap.
- A 1-indexed page + line span into a parsed source document, plus a 0-indexed character offset into the source page text.
- A flat string-cell table recovered from the source document, anchored to a SourceLocation.
- A run of plain text recovered from the source document, anchored to a SourceLocation and tagged with a BlockKind that classifies the geometric / typographic shape of the block (HEADING / BODY / LIST / OTHER).
- Layer 1 entry point: read an XLSX file from disk into a ParsedDocument with one TableSection per non-empty sheet.
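
One of the context strategies listed here is described as "Fixed-size character windows with optional overlap." The sketch below illustrates that general technique in plain Java, to make the description concrete. It is not DocTruth's implementation and uses none of its types: the class name WindowSketch, the method windows, and the parameters size and overlap are hypothetical names chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the ai.doctruth API.
final class WindowSketch {
    private WindowSketch() {}

    // Splits text into windows of at most `size` characters.
    // Consecutive windows share `overlap` characters; the last window may be shorter.
    public static List<String> windows(String text, int size, int overlap) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        if (overlap < 0 || overlap >= size) {
            throw new IllegalArgumentException("overlap must be in [0, size)");
        }
        List<String> out = new ArrayList<>();
        int step = size - overlap; // how far each window start advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + size, text.length());
            out.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the tail is already covered; avoid a redundant trailing window
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Windows of 4 characters, each sharing 1 character with its predecessor.
        System.out.println(windows("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

The overlap means text that straddles a window boundary appears whole in at least one window, at the cost of sending some characters twice.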