Interface Tokenizer
- Functional Interface:
- This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.
Tokenizer interface for SimHash text processing.
Defines how to split text into tokens for SimHash computation, and provides built-in implementations of common tokenization strategies.
Features:
- Whitespace tokenization
- N-gram tokenization
- Character tokenization
- Custom tokenization
Usage Examples:
// Whitespace tokenizer
Tokenizer tokenizer = Tokenizer.whitespace();
List<String> tokens = tokenizer.tokenize("Hello World");
// N-gram tokenizer
Tokenizer ngram = Tokenizer.ngram(3);
List<String> grams = ngram.tokenize("Hello");
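Because Tokenizer is a functional interface, a custom strategy can be supplied as a lambda. The sketch below is only an illustration: `SketchTokenizer` is a hypothetical stand-in with a single `tokenize(String)` method so the example compiles without the library, and the real interface's abstract method may be declared differently.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the library's Tokenizer (assumption, not the real type).
@FunctionalInterface
interface SketchTokenizer {
    List<String> tokenize(String text);
}

final class CustomTokenizerSketch {
    // Custom strategy: split on commas, trim each field, drop empty fields.
    static final SketchTokenizer CSV_FIELDS = text ->
            Arrays.stream(text.split(","))
                  .map(String::trim)
                  .filter(s -> !s.isEmpty())
                  .toList();

    public static void main(String[] args) {
        System.out.println(CSV_FIELDS.tokenize("red, green,,blue"));
    }
}
```

With the real library, the same lambda body would be assigned to a `Tokenizer` variable instead of the stand-in type.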
Security:
- Thread-safe: Implementation-dependent
- Null-safe: Yes (validates inputs)
Performance:
- Time complexity: O(L) for whitespace, words, and characters tokenization, where L is the input length; O(L) for ngram(n), which produces L-n+1 tokens
- Space complexity: O(L) for the resulting token list
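The L-n+1 token count for ngram(n) follows from sliding a window of n characters across the input. The sketch below shows that counting argument; it is not the library's implementation, and the class name `NgramSketch`, the method `ngrams`, and the handling of inputs shorter than n (an empty list here) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class NgramSketch {
    // Character n-grams via a sliding window (illustrative, not the library's code).
    static List<String> ngrams(String text, int n) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        // One window per start index 0..L-n, so L-n+1 windows when L >= n.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("Hello", 3)); // L = 5, n = 3, so 3 tokens
    }
}
```

In this sketch each `substring` call copies n characters, so the total work is proportional to L times n, which is linear in L when n is treated as a constant.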
- Since:
- JDK 25, opencode-base-hash V1.0.0
- Author:
- Leon Soo www.LeonSoo.com
Method Summary
Modifier and Type | Method | Description
static Tokenizer | characters() | Creates a character tokenizer
static Tokenizer | cjkCharacters() | Creates a Chinese character tokenizer (single characters for CJK)
static Tokenizer | combined | Combines multiple tokenizers
static Tokenizer | ngram(int n) | Creates an N-gram tokenizer
 | tokenize | Tokenizes the input text
static Tokenizer | whitespace() | Creates a whitespace tokenizer
static Tokenizer | words() | Creates a word boundary tokenizer (alphanumeric words)
-
Method Details
-
tokenize
-
apply
-
whitespace
Creates a whitespace tokenizer
- Returns:
- whitespace tokenizer
-
ngram
Creates an N-gram tokenizer
- Parameters:
- n - gram size
- Returns:
- n-gram tokenizer
-
characters
Creates a character tokenizer
- Returns:
- character tokenizer
-
words
Creates a word boundary tokenizer (alphanumeric words)
- Returns:
- word boundary tokenizer
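As a rough illustration of word-boundary tokenization, the sketch below extracts maximal runs of alphanumeric characters with a regular expression. It assumes "alphanumeric" means ASCII letters and digits; the library may use a different definition (for example, Unicode letters), and `WordsSketch` and its `words(String)` method are hypothetical names.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

final class WordsSketch {
    // Assumption: a "word" is a maximal run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    static List<String> words(String text) {
        // Matcher.results() streams every non-overlapping match in order.
        return WORD.matcher(text).results().map(MatchResult::group).toList();
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, World! 42"));
    }
}
```

Under this assumption, punctuation inside a word splits it, so "don't" would yield the two tokens "don" and "t".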
-
cjkCharacters
Creates a Chinese character tokenizer (single characters for CJK)
- Returns:
- Chinese character tokenizer
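One way to emit single CJK characters as tokens is to walk the text by code point and keep those in the Han script. The sketch below does only that; how the library treats non-CJK text, kana, or Hangul is not documented here, so skipping everything that is not Han is an assumption, as are the names `CjkSketch` and `hanCharacters`.

```java
import java.util.List;

final class CjkSketch {
    // Emits one token per Han code point and skips all other characters (assumption).
    static List<String> hanCharacters(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .mapToObj(cp -> Character.toString(cp))
                .toList();
    }

    public static void main(String[] args) {
        // "\u5206\u8BCD\u5668" is three Han characters written as Unicode escapes.
        System.out.println(hanCharacters("SimHash \u5206\u8BCD\u5668"));
    }
}
```

Iterating by code point rather than by `char` keeps supplementary-plane ideographs intact as single tokens.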
-
combined
-
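The summary only says that combined "combines multiple tokenizers", so its exact semantics are not stated here. A plausible reading, sketched below purely as an assumption, is that each tokenizer runs on the same input and the token lists are concatenated in order. The sketch uses `java.util.function.Function<String, List<String>>` as a stand-in for the Tokenizer type, and `CombinedSketch`, `WHITESPACE`, and `CHARS` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class CombinedSketch {
    // Stand-in tokenizers (assumptions, not the library's built-ins).
    static final Function<String, List<String>> WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));
    static final Function<String, List<String>> CHARS =
            text -> text.chars()
                        .filter(c -> !Character.isWhitespace(c))
                        .mapToObj(c -> String.valueOf((char) c))
                        .toList();

    // Assumed semantics: run every part on the same text and concatenate the results.
    static Function<String, List<String>> combined(List<Function<String, List<String>>> parts) {
        return text -> {
            List<String> out = new ArrayList<>();
            for (Function<String, List<String>> part : parts) {
                out.addAll(part.apply(text));
            }
            return out;
        };
    }

    public static void main(String[] args) {
        System.out.println(combined(List.of(WHITESPACE, CHARS)).apply("ab cd"));
    }
}
```

If this reading is right, combining a word-level and a character-level tokenizer would let a SimHash fingerprint reflect both kinds of features.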