Interface OcrEngine
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
Optional OCR backend, plugged into
PdfDocumentParser to recover text from scanned
(image-only) pages. DocTruth ships only this interface plus the NOOP
default. Real OCR engines are supplied by callers because OCR model/runtime choices are
outside the generic extraction library boundary:
- Local engines such as Tesseract.
- Cloud OCR adapters such as Textract or Document AI.
- Any custom implementation of this single-method interface.
Threading: implementations MUST be thread-safe — the parser may invoke ocr
concurrently across pages on virtual threads.
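The threading contract above can be sketched as follows. This is a minimal illustration, not the library's code: `OcrPageResult` and `OcrEngine` here are hypothetical stand-ins declared locally (the real `OcrPageResult` shape is not shown on this page), `ocrAllPages` is an invented helper, and a fixed thread pool stands in for the parser's virtual threads. The engine keeps its only shared state in an `AtomicInteger`, so concurrent calls are safe.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ConcurrentOcrSketch {

    // Hypothetical stand-in for the library's result type; the real shape may differ.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    // Hypothetical stand-in mirroring the single-method interface documented here.
    @FunctionalInterface
    interface OcrEngine {
        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    /** Invokes the engine concurrently on blank page images and returns results in page order. */
    static List<OcrPageResult> ocrAllPages(OcrEngine engine, int pageCount) {
        // Assumption: a fixed pool stands in for the parser's virtual threads.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<OcrPageResult>> futures = new ArrayList<>();
            for (int page = 1; page <= pageCount; page++) {
                final int pageNumber = page; // 1-indexed, as in the ocr contract
                BufferedImage image = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
                futures.add(pool.submit(() -> engine.ocr(image, pageNumber)));
            }
            List<OcrPageResult> results = new ArrayList<>();
            for (Future<OcrPageResult> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("OCR task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Shared state lives in an atomic counter, so the lambda is thread-safe.
        AtomicInteger calls = new AtomicInteger();
        OcrEngine engine = (image, pageNumber) -> {
            calls.incrementAndGet();
            return new OcrPageResult(pageNumber, "page " + pageNumber);
        };
        List<OcrPageResult> results = ocrAllPages(engine, 16);
        System.out.println(results.size() + " pages, " + calls.get() + " calls");
    }
}
```

On Java 21 or later, `Executors.newVirtualThreadPerTaskExecutor()` could replace the fixed pool to mirror the virtual-thread setup more closely.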
- Since:
- 0.1.0
Field Summary
- NOOP: No-op engine; returns empty text on every page.
Method Summary
- ocr(BufferedImage pageImage, int pageNumber): OCR a single rendered page image.
Field Details
NOOP
No-op engine: returns empty text on every page. This is the default, which means PdfDocumentParser treats scanned pages as zero-content (matching the v0.1.0-alpha behaviour before this SPI shipped). Callers who want real OCR plug in a richer implementation.
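One way to picture the NOOP default is the sketch below. It rests on assumptions: `OcrPageResult` and `OcrEngine` are hypothetical local stand-ins, the `NOOP` body is a guess at how "returns empty text on every page" could be written, and `orNoop` is an invented helper showing how a caller might fall back to the default when no engine is configured.

```java
import java.awt.image.BufferedImage;

final class NoopFallbackSketch {

    // Hypothetical stand-in for the library's result type; the real shape may differ.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    // Hypothetical stand-in for the interface, including an assumed NOOP definition.
    @FunctionalInterface
    interface OcrEngine {
        // Assumption: the no-op engine simply returns an empty result for every page.
        OcrEngine NOOP = (pageImage, pageNumber) -> OcrPageResult.empty(pageNumber);

        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    /** Invented helper: use the configured engine if present, otherwise the no-op default. */
    static OcrEngine orNoop(OcrEngine configured) {
        return configured != null ? configured : OcrEngine.NOOP;
    }

    public static void main(String[] args) {
        BufferedImage blank = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        OcrPageResult result = orNoop(null).ocr(blank, 3);
        System.out.println("page " + result.pageNumber() + " text empty: " + result.text().isEmpty());
    }
}
```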
Method Details
ocr
OCR a single rendered page image.
- Parameters:
- pageImage - the page rendered to a raster image (caller responsibility; typically PDFRenderer.renderImageWithDPI(page, 150)).
- pageNumber - the 1-indexed page number, surfaced for logging / region traceability.
- Returns:
- the OCR result; never null. Implementations that cannot OCR should return OcrPageResult.empty(int) rather than throwing.
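The "never null, prefer empty over throwing" part of the contract could be enforced with a small wrapper, sketched below under stated assumptions. `OcrPageResult` and `OcrEngine` are hypothetical local stand-ins, and `nonThrowing` is an invented helper rather than part of the library. It turns runtime failures and null results from a delegate engine into empty results for the same page.

```java
import java.awt.image.BufferedImage;

final class SafeOcrSketch {

    // Hypothetical stand-in for the library's result type; the real shape may differ.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    // Hypothetical stand-in mirroring the single-method interface documented here.
    @FunctionalInterface
    interface OcrEngine {
        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    /** Invented helper: wraps a delegate so failures and nulls become empty results. */
    static OcrEngine nonThrowing(OcrEngine delegate) {
        return (pageImage, pageNumber) -> {
            try {
                OcrPageResult result = delegate.ocr(pageImage, pageNumber);
                // Guard the "never null" part of the contract.
                return result != null ? result : OcrPageResult.empty(pageNumber);
            } catch (RuntimeException e) {
                // Guard the "return empty rather than throwing" part of the contract.
                return OcrPageResult.empty(pageNumber);
            }
        };
    }

    public static void main(String[] args) {
        BufferedImage blank = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        OcrEngine broken = (image, pageNumber) -> {
            throw new IllegalStateException("engine unavailable");
        };
        OcrPageResult result = nonThrowing(broken).ocr(blank, 7);
        System.out.println("page " + result.pageNumber() + " text empty: " + result.text().isEmpty());
    }
}
```

A wrapper like this keeps one failing page from aborting a whole document, at the cost of hiding the failure, so a real implementation would likely log the exception before returning the empty result.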