Class DocxDocumentParser
java.lang.Object
ai.doctruth.DocxDocumentParser
Layer 1 entry point: read a DOCX file from disk into a
ParsedDocument with one
TextSection per non-blank paragraph and one TableSection per table. Backed
by Apache POI (XWPFDocument) — chosen per CONTRIBUTING.md §4 "Build, don't synthesize"
(POI is the canonical Java OOXML lib; hand-rolling a DOCX zip+XML parser would violate the
principle).
v0.1.0-alpha intentionally treats DOCX as a single logical page
(metadata.pageCount == 1, every section anchored to pageStart == 1). Word
page breaks are a render-time concept driven by the consuming reader's font + page-size
settings — POI does not expose post-pagination page numbers without a layout engine.
Section-break-aware multi-page tracking is intentionally left for a later parser
improvement.
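The single-page model described above can be illustrated with a small sketch. It is not the library's code: SketchTextSection and SketchDocument are hypothetical stand-ins for TextSection and ParsedDocument, pageCount is modeled as a direct field rather than under metadata, and tables are left out. It only shows the two rules stated here: blank paragraphs produce no section, and every section is anchored to page 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; the real ai.doctruth types may be shaped differently.
record SketchTextSection(String text, int pageStart) {}

record SketchDocument(int pageCount, List<SketchTextSection> sections) {}

final class SinglePageSketch {
    private SinglePageSketch() {}

    /** Maps paragraph texts to sections, skipping blank ones; everything lands on page 1. */
    static SketchDocument toDocument(List<String> paragraphs) {
        List<SketchTextSection> sections = new ArrayList<>();
        for (String paragraph : paragraphs) {
            if (paragraph == null || paragraph.isBlank()) {
                continue; // blank paragraphs do not become sections
            }
            sections.add(new SketchTextSection(paragraph, 1)); // pageStart is always 1
        }
        return new SketchDocument(1, List.copyOf(sections)); // pageCount is always 1
    }

    public static void main(String[] args) {
        SketchDocument doc = toDocument(List.of("Title", "   ", "Body text", ""));
        // prints "2 sections, pageCount=1"
        System.out.println(doc.sections().size() + " sections, pageCount=" + doc.pageCount());
    }
}
```

Callers that need page-accurate anchors should therefore not rely on pageStart for DOCX input in this release.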
The parser is a stateless utility — it has no per-instance config in v0.1.0-alpha (so the static method form is the right level of API surface, per Engineering Principles §5 "elegance over cleverness").
- Since:
- 0.1.0
Method Summary
Modifier and Type: static ParsedDocument
Method: parse
Description: Parse the DOCX at docxPath into a ParsedDocument.
Method Details
parse
Parse the DOCX at docxPath into a ParsedDocument.
- Throws:
- NullPointerException - if docxPath is null.
- ParseException - if the file is missing, is not a DOCX, or POI raises any IO error while reading. The cause chain preserves the underlying IOException.
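The exception contract can be sketched with a stand-in method. Everything here is an assumption made for illustration: SketchParseException stands in for ParseException (whether the real one is checked is not stated), readFirstByte stands in for parse, and docxPath is assumed to be a java.nio.file.Path. The sketch shows only the ordering: the null check comes first, and IO failures are wrapped with the original IOException kept as the cause.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

// Hypothetical stand-in for the library's ParseException.
class SketchParseException extends Exception {
    SketchParseException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class ParseContractSketch {
    private ParseContractSketch() {}

    /** Stand-in for parse: rejects null, then wraps any IOException and keeps it as the cause. */
    static int readFirstByte(Path docxPath) throws SketchParseException {
        Objects.requireNonNull(docxPath, "docxPath"); // NullPointerException on null input
        try (InputStream in = Files.newInputStream(docxPath)) {
            return in.read();
        } catch (IOException e) {
            // A missing or unreadable file surfaces as the wrapper, with the cause chain intact.
            throw new SketchParseException("Cannot read " + docxPath, e);
        }
    }

    public static void main(String[] args) {
        try {
            readFirstByte(Path.of("no-such-file.docx"));
        } catch (SketchParseException e) {
            System.out.println("cause is IOException: " + (e.getCause() instanceof IOException));
        }
    }
}
```

A caller of the real parse method would typically handle ParseException in the same way and inspect getCause() when the underlying IO failure matters.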