Class PdfExtractor
java.lang.Object
cloud.opencode.base.pdf.operation.PdfExtractor
PDF Content Extractor
Extracts text and images from PDF documents.
Features:
- Extract text from all pages
- Extract text from specific pages
- Extract images
- Extract metadata
Usage Examples:
// Extract all text
String text = PdfExtractor.of(Path.of("document.pdf"))
    .extractText();
// Extract text from specific pages (page numbers are 1-based)
String pageText = PdfExtractor.of(Path.of("document.pdf"))
    .extractText(1, 2, 3);
// Extract images
List<ExtractedImage> images = PdfExtractor.of(Path.of("document.pdf"))
    .extractImages();
Security:
- Thread-safe: No. Not designed for concurrent use.
- Null-safe: Yes. Parameters are validated.
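Because an extractor is not designed for concurrent use, one way to process several files in parallel is to give each task its own instance rather than sharing one. The sketch below illustrates that pattern with a hypothetical `TextExtractor` interface standing in for `PdfExtractor`; the interface, the `extractAll` helper, and the factory function are assumptions made for illustration and are not part of this API.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical stand-in for an extractor that must not be shared between threads. */
interface TextExtractor {
    String extractText();
}

final class PerTaskExtraction {

    /**
     * Extracts text from every file in parallel. Each task calls the factory
     * itself, so no extractor instance is ever used by two threads.
     */
    static List<String> extractAll(List<Path> files,
                                   Function<Path, TextExtractor> factory,
                                   int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (Path file : files) {
                // The extractor is created inside the task, on the worker thread.
                Callable<String> task = () -> factory.apply(file).extractText();
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pending) {
                results.add(future.get()); // results keep the order of the input list
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for extraction", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real class, the factory argument would presumably be something like `path -> PdfExtractor.of(path)::extractText`, subject to how `OpenPdfException` is declared.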
- Since:
- JDK 25, opencode-base-pdf V1.0.0
- Author:
- Leon Soo www.LeonSoo.com
Nested Class Summary
Nested Classes
Modifier and Type     Class            Description
static final record   ExtractedImage   Extracted image
Method Summary
Modifier and Type      Method                                          Description
static PdfExtractor    create()                                        Creates a new extractor.
List<byte[]>           extractImageBytes()                             Extracts images as byte arrays.
List<ExtractedImage>   extractImages()                                 Extracts all images from PDF.
List<ExtractedImage>   extractImages(int... pageNumbers)               Extracts images from specific pages.
String                 extractText()                                   Extracts all text from PDF.
String                 extractText(int... pageNumbers)                 Extracts text from specific pages.
String                 extractTextRange(int startPage, int endPage)    Extracts text from a page range.
static PdfExtractor    of(PdfDocument document)                        Creates extractor for document.
static PdfExtractor    of(Path path)                                   Creates extractor for file.
                       saveImages(Path directory, String namePrefix)   Saves extracted images to directory.
PdfExtractor           source(PdfDocument document)                    Sets source PDF document.
PdfExtractor           source(Path path)                               Sets source PDF file.
-
Method Details
-
source
Sets the source PDF file.
- Parameters:
- path - PDF file path
- Returns:
- this extractor
-
source
Sets the source PDF document.
- Parameters:
- document - PDF document
- Returns:
- this extractor
-
extractText
Extracts all text from the PDF.
- Returns:
- extracted text
- Throws:
- OpenPdfException - if extraction fails
-
extractText
Extracts text from specific pages.
- Parameters:
- pageNumbers - page numbers (1-based)
- Returns:
- extracted text
- Throws:
- OpenPdfException - if extraction fails
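Page numbers here are 1-based, which is an easy source of off-by-one errors when the caller tracks pages with 0-based indices. This page does not say how out-of-range page numbers are handled, so a caller may want to check them first. The following is a minimal sketch of such a check; `PageNumbers`, `requireValid`, and `toZeroBased` are hypothetical helpers, and the page count is assumed to be known to the caller.

```java
import java.util.Arrays;

final class PageNumbers {

    /**
     * Returns the given 1-based page numbers unchanged if every one of them
     * lies in the range 1..pageCount; otherwise throws IllegalArgumentException.
     */
    static int[] requireValid(int pageCount, int... pageNumbers) {
        for (int page : pageNumbers) {
            if (page < 1 || page > pageCount) {
                throw new IllegalArgumentException(
                        "Page " + page + " is outside 1.." + pageCount
                                + " (requested " + Arrays.toString(pageNumbers) + ")");
            }
        }
        return pageNumbers;
    }

    /** Converts 1-based page numbers to 0-based indices, e.g. for indexing a List of pages. */
    static int[] toZeroBased(int... pageNumbers) {
        return Arrays.stream(pageNumbers).map(page -> page - 1).toArray();
    }
}
```

A call such as `extractor.extractText(PageNumbers.requireValid(pageCount, 1, 2, 3))` would then fail fast in caller code, before any extraction is attempted.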
-
extractTextRange
Extracts text from a page range.
- Parameters:
- startPage - start page (1-based)
- endPage - end page (1-based)
- Returns:
- extracted text
- Throws:
- OpenPdfException - if extraction fails
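Both endpoints are described as 1-based page numbers, but this page does not state whether the end page is included. Assuming an inclusive range, `extractTextRange(2, 4)` would cover the same pages as `extractText(2, 3, 4)`. The sketch below shows that expansion; `PageRanges.expand` is a hypothetical helper, not part of this class.

```java
import java.util.stream.IntStream;

final class PageRanges {

    /** Expands an inclusive 1-based range such as (2, 4) into the array {2, 3, 4}. */
    static int[] expand(int startPage, int endPage) {
        if (startPage < 1 || endPage < startPage) {
            throw new IllegalArgumentException(
                    "Invalid page range " + startPage + ".." + endPage);
        }
        // rangeClosed includes both endpoints, matching the inclusive assumption.
        return IntStream.rangeClosed(startPage, endPage).toArray();
    }
}
```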
-
extractImages
Extracts all images from the PDF.
- Returns:
- list of extracted images
- Throws:
- OpenPdfException - if extraction fails
-
extractImages
Extracts images from specific pages.
- Parameters:
- pageNumbers - page numbers (1-based)
- Returns:
- list of extracted images
- Throws:
- OpenPdfException - if extraction fails
-
extractImageBytes
Extracts images as byte arrays.
- Returns:
- list of image byte arrays
- Throws:
- OpenPdfException - if extraction fails
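This page does not say which encoding the returned byte arrays use. If they hold complete image files, which is an assumption, a caller can guess the format from the leading signature bytes before deciding how to store or decode them. The sketch below checks the well-known PNG, JPEG, and GIF signatures; `ImageFormats.detect` is a hypothetical helper.

```java
final class ImageFormats {

    // Standard file signatures ("magic numbers") for common image formats.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] JPEG = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF};
    private static final byte[] GIF = {'G', 'I', 'F', '8'};

    /** Returns a file extension guessed from the leading bytes, or "bin" if unknown. */
    static String detect(byte[] data) {
        if (startsWith(data, PNG)) {
            return "png";
        }
        if (startsWith(data, JPEG)) {
            return "jpg";
        }
        if (startsWith(data, GIF)) {
            return "gif";
        }
        return "bin";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Raw, undecoded pixel data would fall through to `"bin"`, which is a useful signal that the bytes are not a standalone image file.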
-
saveImages
Saves extracted images to a directory.
- Parameters:
- directory - target directory
- namePrefix - file name prefix
- Returns:
- list of saved file paths
- Throws:
- OpenPdfException - if extraction fails
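The file naming scheme that `saveImages` applies to `namePrefix` is not documented here. For callers who need full control over the names, the same effect can be approximated by writing the byte arrays from `extractImageBytes()` by hand. The sketch below uses a prefix plus a 1-based counter; that scheme, the `ImageFiles.save` helper, and the `extension` parameter are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ImageFiles {

    /**
     * Writes each byte array to the directory as namePrefix + index + "." + extension,
     * numbering from 1, and returns the paths in the order they were written.
     */
    static List<Path> save(List<byte[]> images, Path directory,
                           String namePrefix, String extension) {
        try {
            Files.createDirectories(directory); // no-op if the directory already exists
            List<Path> saved = new ArrayList<>();
            for (int i = 0; i < images.size(); i++) {
                Path target = directory.resolve(namePrefix + (i + 1) + "." + extension);
                Files.write(target, images.get(i));
                saved.add(target);
            }
            return saved;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```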
-
getSourcePath
-
getSourceDocument
-
create
Creates a new extractor.
- Returns:
- PDF extractor
-
of
Creates an extractor for a file.
- Parameters:
- path - PDF file path
- Returns:
- PDF extractor
-
of
Creates an extractor for a document.
- Parameters:
- document - PDF document
- Returns:
- PDF extractor
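The factory methods together with the `source` setters suggest a fluent style in which `of(path)` is shorthand for `create().source(path)`. That equivalence is not stated on this page, so the sketch below is only an illustration of the pattern: `MiniExtractor` is a hypothetical stand-in that mirrors the method names, including the null check implied by the "parameters are validated" note.

```java
import java.nio.file.Path;
import java.util.Objects;

/** Hypothetical stand-in that mirrors the factory and setter names of PdfExtractor. */
final class MiniExtractor {

    private Path sourcePath;

    private MiniExtractor() {
    }

    /** Creates an extractor with no source set yet. */
    static MiniExtractor create() {
        return new MiniExtractor();
    }

    /** Shorthand assumed here to be equivalent to create().source(path). */
    static MiniExtractor of(Path path) {
        return create().source(path);
    }

    /** Validates the argument, stores it, and returns this instance for chaining. */
    MiniExtractor source(Path path) {
        this.sourcePath = Objects.requireNonNull(path, "path must not be null");
        return this;
    }

    Path getSourcePath() {
        return sourcePath;
    }
}
```

Returning `this` from the setter is what allows a call chain such as `create().source(path)` to end directly in an extraction method.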
-