Class DocxDocumentParser

java.lang.Object
ai.doctruth.DocxDocumentParser

public final class DocxDocumentParser extends Object
Layer 1 entry point: reads a DOCX file from disk into a ParsedDocument with one TextSection per non-blank paragraph and one TableSection per table. Backed by Apache POI (XWPFDocument), chosen per CONTRIBUTING.md §4 "Build, don't synthesize": POI is the canonical Java OOXML library, and hand-rolling a DOCX zip+XML parser would violate that principle.

v0.1.0-alpha intentionally treats DOCX as a single logical page (metadata.pageCount == 1, every section anchored to pageStart == 1). Word page breaks are a render-time concept driven by the consuming reader's font and page-size settings, and POI does not expose post-pagination page numbers without a layout engine. Section-break-aware multi-page tracking is deferred to a later parser improvement.

The parser is a stateless utility with no per-instance configuration in v0.1.0-alpha, so a static method is the right level of API surface (per Engineering Principles §5 "elegance over cleverness").
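
The mapping described above (one text section per non-blank paragraph, one table section per table, everything anchored to page 1) can be illustrated with a minimal, self-contained sketch. Every type and method name below is a hypothetical stand-in invented for this illustration. The real ParsedDocument, TextSection and TableSection types, and the parser's actual static method, may differ, and the sketch deliberately leaves out Apache POI so that it runs on the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: these nested types are NOT the real
// ai.doctruth classes. They only mirror the behavior the class
// description states for v0.1.0-alpha.
final class DocxPageModelSketch {

    /** A body element in document order: a paragraph or a table. */
    interface BodyElement {}
    record Paragraph(String text) implements BodyElement {}
    record Table(List<List<String>> rows) implements BodyElement {}

    /** A parsed section; pageStart is always 1 in this model. */
    interface Section { int pageStart(); }
    record TextSection(String text, int pageStart) implements Section {}
    record TableSection(List<List<String>> rows, int pageStart) implements Section {}

    record ParsedDocument(int pageCount, List<Section> sections) {}

    // Stateless utility: no instances, only a static entry point.
    private DocxPageModelSketch() {}

    /** Maps body elements to sections, skipping blank paragraphs. */
    static ParsedDocument toParsedDocument(List<BodyElement> body) {
        List<Section> sections = new ArrayList<>();
        for (BodyElement element : body) {
            if (element instanceof Paragraph p) {
                if (!p.text().isBlank()) {
                    sections.add(new TextSection(p.text(), 1));
                }
            } else if (element instanceof Table t) {
                sections.add(new TableSection(t.rows(), 1));
            }
        }
        // The whole document is treated as a single logical page.
        return new ParsedDocument(1, List.copyOf(sections));
    }

    public static void main(String[] args) {
        ParsedDocument doc = toParsedDocument(List.of(
                new Paragraph("Introduction"),
                new Paragraph("   "), // blank, so it yields no section
                new Table(List.of(List.of("a", "b")))));
        System.out.println(doc.pageCount() + " page, "
                + doc.sections().size() + " sections");
    }
}
```

In the sketch, the private constructor plus a single static method mirrors the stateless-utility shape described above. A caller that needs real page numbers would still require a layout engine, which this model does not attempt.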

Since:
0.1.0