Interface OcrEngine

Functional Interface:
This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

@FunctionalInterface
public interface OcrEngine
Optional OCR backend that plugs into PdfDocumentParser to recover text from scanned (image-only) pages. DocTruth ships only this interface and the NOOP default. Real OCR engines are supplied by callers, because the choice of OCR model and runtime lies outside the scope of a generic extraction library. Typical options:
  • Local engines such as Tesseract.
  • Cloud OCR adapters such as Textract or Document AI.
  • Any custom implementation of this single-method interface.
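Because the interface has a single method, a custom engine can be written as a lambda. The following is a minimal sketch under stated assumptions, not the library's own code: the nested OcrEngine and OcrPageResult types are local stand-ins (the real OcrPageResult's fields are not shown on this page, so the record shape here is hypothetical), and recognize is a hypothetical placeholder for a call into Tesseract or a cloud OCR service.

```java
import java.awt.image.BufferedImage;

public class OcrEngineLambdaSketch {

    // Hypothetical stand-in: only empty(int) is documented for the real OcrPageResult.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    // Local mirror of the single-method interface documented on this page.
    @FunctionalInterface
    interface OcrEngine {
        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    // Hypothetical recognizer; a real engine would call Tesseract or a cloud API here.
    static String recognize(BufferedImage image) {
        return "image " + image.getWidth() + "x" + image.getHeight();
    }

    // The lambda is the whole engine: it is stateless, so it is also thread-safe.
    static OcrEngine sampleEngine() {
        return (pageImage, pageNumber) ->
                new OcrPageResult(pageNumber, recognize(pageImage));
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(4, 2, BufferedImage.TYPE_BYTE_GRAY);
        OcrPageResult result = sampleEngine().ocr(page, 1);
        System.out.println(result.pageNumber() + ": " + result.text());
    }
}
```

A method reference works the same way when an existing class already exposes a method with a matching signature.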

Threading: implementations MUST be thread-safe — the parser may invoke ocr concurrently across pages on virtual threads.
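One way to meet the thread-safety requirement when the underlying recognizer is not itself thread-safe is to hand each call an exclusive instance from a bounded pool. The sketch below illustrates that idea under assumptions: Recognizer and PooledOcrEngine are hypothetical names, the OcrEngine and OcrPageResult types are local stand-ins, and a fixed thread pool stands in for the parser's virtual threads so the example also runs on JDKs older than 21 (on Java 21, Executors.newVirtualThreadPerTaskExecutor() could be swapped in).

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledOcrEngineSketch {

    // Hypothetical stand-in: only empty(int) is documented for the real OcrPageResult.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    @FunctionalInterface
    interface OcrEngine {
        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    // Hypothetical recognizer with unsynchronized mutable state (not thread-safe).
    static final class Recognizer {
        private final StringBuilder scratch = new StringBuilder();

        String recognize(BufferedImage image, int pageNumber) {
            scratch.setLength(0);
            scratch.append("page ").append(pageNumber)
                   .append(" (").append(image.getWidth()).append("px wide)");
            return scratch.toString();
        }
    }

    // Thread-safe engine: every call borrows one recognizer exclusively.
    static final class PooledOcrEngine implements OcrEngine {
        private final BlockingQueue<Recognizer> pool;

        PooledOcrEngine(int size) {
            pool = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++) {
                pool.add(new Recognizer());
            }
        }

        @Override
        public OcrPageResult ocr(BufferedImage pageImage, int pageNumber) {
            Recognizer recognizer;
            try {
                recognizer = pool.take(); // blocks until an instance is free
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return OcrPageResult.empty(pageNumber); // never null, never throws
            }
            try {
                return new OcrPageResult(pageNumber, recognizer.recognize(pageImage, pageNumber));
            } finally {
                pool.add(recognizer); // always hand the instance back
            }
        }
    }

    // Simulates a parser calling ocr concurrently, one task per page.
    static List<OcrPageResult> ocrAllPages(OcrEngine engine, int pageCount) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            List<Future<OcrPageResult>> futures = new ArrayList<>();
            for (int page = 1; page <= pageCount; page++) {
                final int pageNumber = page;
                BufferedImage image = new BufferedImage(10, 10, BufferedImage.TYPE_BYTE_GRAY);
                futures.add(executor.submit(() -> engine.ocr(image, pageNumber)));
            }
            List<OcrPageResult> results = new ArrayList<>();
            for (Future<OcrPageResult> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        for (OcrPageResult r : ocrAllPages(new PooledOcrEngine(2), 8)) {
            System.out.println(r.pageNumber() + ": " + r.text());
        }
    }
}
```

A blocking pool suits virtual threads better than a ThreadLocal cache, since a per-thread cache would create one recognizer per page task. A fully stateless engine needs neither.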

Since:
0.1.0
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final OcrEngine
    NOOP
    No-op engine that returns empty text for every page.
  • Method Summary

    Modifier and Type
    Method
    Description
    OcrPageResult
    ocr(BufferedImage pageImage, int pageNumber)
    OCR a single rendered page image.
  • Field Details

    • NOOP

      static final OcrEngine NOOP
      No-op engine that returns empty text for every page. This is the default: with it, PdfDocumentParser treats scanned pages as having no content, which matches the v0.1.0-alpha behaviour before this SPI shipped. Callers who want real OCR supply their own implementation.
  • Method Details

    • ocr

      OcrPageResult ocr(BufferedImage pageImage, int pageNumber)
      OCR a single rendered page image.
      Parameters:
      pageImage - the page rendered to a raster image. Rendering is the caller's responsibility, typically via PDFRenderer.renderImageWithDPI(page, 150).
      pageNumber - the 1-indexed page number, surfaced for logging / region traceability.
      Returns:
      the OCR result; never null. Implementations that cannot OCR should return OcrPageResult.empty(int) rather than throwing.
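
The "never null, never throw" contract can be enforced around an arbitrary engine with a small decorator. This is a sketch under assumptions rather than part of the library: failSoft is a hypothetical helper name, and the OcrEngine and OcrPageResult types are local stand-ins, with only empty(int) taken from this page.

```java
import java.awt.image.BufferedImage;

public class FailSoftOcrEngineSketch {

    // Hypothetical stand-in: only empty(int) is documented for the real OcrPageResult.
    record OcrPageResult(int pageNumber, String text) {
        static OcrPageResult empty(int pageNumber) {
            return new OcrPageResult(pageNumber, "");
        }
    }

    @FunctionalInterface
    interface OcrEngine {
        OcrPageResult ocr(BufferedImage pageImage, int pageNumber);
    }

    // Wraps an engine so that failures and null results become an empty result.
    static OcrEngine failSoft(OcrEngine delegate) {
        return (pageImage, pageNumber) -> {
            try {
                OcrPageResult result = delegate.ocr(pageImage, pageNumber);
                return result != null ? result : OcrPageResult.empty(pageNumber);
            } catch (RuntimeException e) {
                // The page number is passed in precisely so failures can be traced.
                System.err.println("OCR failed on page " + pageNumber + ": " + e.getMessage());
                return OcrPageResult.empty(pageNumber);
            }
        };
    }

    public static void main(String[] args) {
        OcrEngine broken = (pageImage, pageNumber) -> {
            throw new IllegalStateException("OCR backend unavailable");
        };
        BufferedImage page = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        OcrPageResult result = failSoft(broken).ocr(page, 3);
        System.out.println("page " + result.pageNumber() + " text length: " + result.text().length());
    }
}
```

Wrapping at construction time keeps the fallback logic out of each engine, at the cost of hiding failures unless they are logged as shown.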