Class ChineseSegmenter
java.lang.Object
cloud.opencode.base.string.unicode.ChineseSegmenter
Chinese Word Segmenter
Provides basic Chinese text segmentation using dictionary-based and rule-based approaches.
Features:
- Forward Maximum Matching (FMM)
- Backward Maximum Matching (BMM)
- Bidirectional Maximum Matching
- Custom dictionary support
- Mixed Chinese/English text handling
Usage Examples:
// Basic segmentation
List<String> words = ChineseSegmenter.segment("我爱中华人民共和国");
// Result: ["我", "爱", "中华人民共和国"]
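// With a custom dictionary
ChineseSegmenter segmenter = ChineseSegmenter.builder()
    .addWord("人工智能")
    .addWord("机器学习")
    .maxWordLength(8)
    .build();
// Use the instance method so the custom words are applied;
// segment(...) is static and would be resolved against the class, not this instance
List<String> customWords = segmenter.segmentText("人工智能和机器学习");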
// Different algorithms
List<String> fmm = ChineseSegmenter.segmentFMM("研究生命起源");
List<String> bmm = ChineseSegmenter.segmentBMM("研究生命起源");
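The two calls above can return different results for the same input, because FMM scans from the left and BMM from the right. The following is a minimal standalone sketch of that behaviour, not the ChineseSegmenter source: the class `MaxMatchSketch`, its method names, the toy dictionary, and the tie-break rule in `bidirectional` (fewer words, then fewer single-character words, then the backward result) are all assumptions chosen for illustration. It indexes by `char`, so characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of opencode-base-string.
final class MaxMatchSketch {

    // Forward Maximum Matching: scan left to right, taking the longest
    // dictionary word that starts at the current position.
    static List<String> fmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int len = Math.min(maxLen, text.length() - start);
            // Shrink the window until it is a dictionary word or one character.
            while (len > 1 && !dict.contains(text.substring(start, start + len))) {
                len--;
            }
            out.add(text.substring(start, start + len));
            start += len;
        }
        return out;
    }

    // Backward Maximum Matching: scan right to left, taking the longest
    // dictionary word that ends at the current position.
    static List<String> bmm(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int len = Math.min(maxLen, end);
            while (len > 1 && !dict.contains(text.substring(end - len, end))) {
                len--;
            }
            out.add(text.substring(end - len, end));
            end -= len;
        }
        Collections.reverse(out); // words were collected from the right
        return out;
    }

    // Assumed heuristic: prefer fewer words, then fewer single-character
    // words, then the backward result.
    static List<String> bidirectional(String text, Set<String> dict, int maxLen) {
        List<String> f = fmm(text, dict, maxLen);
        List<String> b = bmm(text, dict, maxLen);
        if (f.size() != b.size()) {
            return f.size() < b.size() ? f : b;
        }
        long fSingles = f.stream().filter(w -> w.length() == 1).count();
        long bSingles = b.stream().filter(w -> w.length() == 1).count();
        return fSingles < bSingles ? f : b;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("研究", "研究生", "生命", "起源");
        System.out.println(fmm("研究生命起源", dict, 4));           // [研究生, 命, 起源]
        System.out.println(bmm("研究生命起源", dict, 4));           // [研究, 生命, 起源]
        System.out.println(bidirectional("研究生命起源", dict, 4)); // [研究, 生命, 起源]
    }
}
```

With this toy dictionary, the forward scan greedily consumes "研究生" and leaves "命" stranded, while the backward scan finds "生命" first. The actual output of `segmentFMM` and `segmentBMM` depends on the library's built-in dictionary.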
Security:
- Thread-safe: Yes (immutable instance, ConcurrentHashMap dictionary)
- Null-safe: No (input text must not be null)
- Since:
- JDK 25, opencode-base-string V1.0.0
- Author:
- Leon Soo www.LeonSoo.com
Nested Class Summary
Nested Classes
Modifier and Type, Class: Description
- static final class ChineseSegmenter.Builder: Builder for ChineseSegmenter.
Method Summary
Modifier and Type, Method: Description
- static void addToDictionary(String word): Adds a word to the core dictionary (affects all instances).
- static void addToDictionary(Collection<String> words): Adds multiple words to the core dictionary.
- static ChineseSegmenter.Builder builder(): Creates a new builder.
- boolean containsWord(String word): Checks if a word exists in the dictionary.
- static ChineseSegmenter getDefault(): Gets the default segmenter instance.
- int getDictionarySize(): Gets the dictionary size.
- static void loadDictionary(InputStream inputStream): Loads words from an input stream (one word per line).
- segment(String text): Segments Chinese text using bidirectional maximum matching (default).
- static String segmentAndJoin(String text, String delimiter): Joins segmented words with a delimiter.
- segmentBackward(String text): Segments text using Backward Maximum Matching (BMM).
- segmentBMM(String text): Segments Chinese text using Backward Maximum Matching (BMM).
- segmentFMM(String text): Segments Chinese text using Forward Maximum Matching (FMM).
- segmentForward(String text): Segments text using Forward Maximum Matching (FMM).
- segmentText(String text): Segments text using bidirectional maximum matching.
-
Method Details
-
builder
Creates a new builder.
- Returns:
- the builder
-
getDefault
Gets the default segmenter instance.
- Returns:
- the default segmenter
-
segment
-
segmentFMM
-
segmentBMM
-
segmentAndJoin
-
segmentText
-
segmentForward
-
segmentBackward
-
containsWord
Checks if a word exists in the dictionary.
- Parameters:
- word - the word to check
- Returns:
- true if the word exists
-
getDictionarySize
public int getDictionarySize()
Gets the dictionary size.
- Returns:
- the dictionary size
-
addToDictionary
Adds a word to the core dictionary (affects all instances).
- Parameters:
- word - the word to add
-
addToDictionary
Adds multiple words to the core dictionary.
- Parameters:
- words - the words to add
-
loadDictionary
Loads words from an input stream (one word per line).
- Parameters:
- inputStream - the input stream
- Throws:
- IOException - if reading fails
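The one-word-per-line format can be read as in the sketch below. It is a standalone illustration, not the library's implementation: the class `DictionaryLoadSketch` and the method `readWords` are hypothetical, and the UTF-8 decoding, whitespace stripping, and skipping of blank lines are assumptions that this page does not confirm.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration class; not part of opencode-base-string.
final class DictionaryLoadSketch {

    // Reads one word per line into a concurrent set.
    // Assumptions: UTF-8 input, surrounding whitespace ignored, blank lines skipped.
    static Set<String> readWords(InputStream in) throws IOException {
        Set<String> words = ConcurrentHashMap.newKeySet();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.strip();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "人工智能\n机器学习\n\n".getBytes(StandardCharsets.UTF_8);
        Set<String> words = readWords(new ByteArrayInputStream(data));
        System.out.println(words.size()); // 2
    }
}
```

With the library itself, a stream in this format would be passed to `ChineseSegmenter.loadDictionary(inputStream)`, with the caller handling the declared `IOException`.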
-