Class XlsxDocumentParser
Parses an XLSX workbook into a ParsedDocument with one
TableSection per non-empty sheet. Backed by Apache POI (XSSFWorkbook) —
chosen per CONTRIBUTING.md §4 "Build, don't synthesize" (POI is the canonical Java OOXML lib).
v0.1.0-alpha sheet-as-page analogy: spreadsheet workbooks have no native "pages" the way
PDFs do, but every sheet is a self-contained tabular surface. We map each sheet to a logical
page and each row to a logical line so that SourceLocation stays consistent across
formats — a Citation pointing at "page 2 line 5" of an XLSX document means
"sheet index 1 (0-indexed) row index 4". Sheet name is intentionally not part of the
location record (would force a 6th component); downstream consumers can fetch it from
DocumentMetadata extensions in a later phase.
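The page/line convention above can be sketched as a pair of off-by-one conversions. This is a minimal illustration, not the parser's code: the class SheetLocationMapping and its method names are hypothetical, and it assumes logical pages and lines are 1-based while sheet and row indices are 0-based, as the "page 2 line 5" example states.

```java
// Hypothetical helper for illustration only; not part of the documented API.
final class SheetLocationMapping {
    private SheetLocationMapping() {}

    /** Converts a 1-based logical page number to a 0-based sheet index. */
    static int sheetIndexForPage(int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1, got " + page);
        }
        return page - 1;
    }

    /** Converts a 1-based logical line number to a 0-based row index. */
    static int rowIndexForLine(int line) {
        if (line < 1) {
            throw new IllegalArgumentException("line must be >= 1, got " + line);
        }
        return line - 1;
    }

    /** Converts a 0-based sheet index to a 1-based logical page number. */
    static int pageForSheetIndex(int sheetIndex) {
        if (sheetIndex < 0) {
            throw new IllegalArgumentException("sheetIndex must be >= 0, got " + sheetIndex);
        }
        return sheetIndex + 1;
    }

    /** Converts a 0-based row index to a 1-based logical line number. */
    static int lineForRowIndex(int rowIndex) {
        if (rowIndex < 0) {
            throw new IllegalArgumentException("rowIndex must be >= 0, got " + rowIndex);
        }
        return rowIndex + 1;
    }

    public static void main(String[] args) {
        // "page 2 line 5" corresponds to sheet index 1, row index 4.
        System.out.println("sheet index " + sheetIndexForPage(2)
                + ", row index " + rowIndexForLine(5));
    }
}
```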
Cell rendering uses POI's DataFormatter.formatCellValue(Cell) so dates,
percentages, and formula-cached values appear as the user sees them in Excel — not as raw
doubles. Empty/null cells render as the empty string ""; trailing all-blank rows
are trimmed from each sheet, but interior all-blank rows are preserved (they convey layout).
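The trimming rule can be illustrated on rows that have already been rendered to strings. The sketch below is an assumption-laden stand-in, not the parser's implementation: TrailingBlankRowTrimmer is a hypothetical name, each row is modelled as a List of cell strings (as DataFormatter.formatCellValue would produce them), and a cell counts as blank only when it is null or the empty string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the documented API.
final class TrailingBlankRowTrimmer {
    private TrailingBlankRowTrimmer() {}

    /** A row is blank when every cell is null or "" (assumption: whitespace is not blank). */
    static boolean isBlankRow(List<String> row) {
        for (String cell : row) {
            if (cell != null && !cell.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    /** Drops all-blank rows at the end of the sheet; interior blank rows are kept. */
    static List<List<String>> trimTrailingBlankRows(List<List<String>> rows) {
        int end = rows.size();
        // Walk backwards until the last row that contains at least one non-blank cell.
        while (end > 0 && isBlankRow(rows.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(rows.subList(0, end));
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
                List.of("Name", "Total"),
                List.of("", ""),          // interior blank row: preserved
                List.of("Alice", "42"),
                List.of("", ""));         // trailing blank row: trimmed
        System.out.println(trimTrailingBlankRows(rows).size()); // prints 3
    }
}
```

Scanning from the end keeps the interior blank rows in place, so row indices (and therefore logical line numbers) of the surviving rows do not shift.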
The parser is a stateless utility with no per-instance configuration in v0.1.0-alpha, so a static method is the right level of API surface (per Engineering Principles §5, "elegance over cleverness").
Since: 0.1.0
Method Summary
Modifier and Type       Method   Description
static ParsedDocument   parse    Parse the XLSX at xlsxPath into a ParsedDocument.
Method Details
parse
Parse the XLSX at xlsxPath into a ParsedDocument.
Throws:
NullPointerException - if xlsxPath is null.
ParseException - if the file is missing, is not an XLSX (e.g. legacy .xls binary, plain text, PDF mis-renamed), or POI raises any IO error while reading. Cause-chain preserves the underlying IOException.
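The exception contract of parse can be sketched without POI, covering only the null and missing-file cases. Everything named here is hypothetical: SketchParseException stands in for the library's ParseException (whose package and checked/unchecked status are not stated on this page, so a checked exception is assumed), and XlsxOpenSketch.readWorkbookBytes only reads the file rather than parsing it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

// Hypothetical stand-in for the library's ParseException (assumed checked here).
final class SketchParseException extends Exception {
    SketchParseException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical helper for illustration only; it does not parse anything.
final class XlsxOpenSketch {
    private XlsxOpenSketch() {}

    static byte[] readWorkbookBytes(Path xlsxPath) throws SketchParseException {
        // A null path is a programming error and fails fast with NullPointerException.
        Objects.requireNonNull(xlsxPath, "xlsxPath");
        try {
            return Files.readAllBytes(xlsxPath);
        } catch (IOException e) {
            // Wrap the IO failure and keep the original exception as the cause.
            throw new SketchParseException("Cannot read XLSX at " + xlsxPath, e);
        }
    }
}
```

A caller that needs the low-level reason, such as a missing file versus a permissions problem, can inspect getCause() on the caught exception.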