Class SimHashBuilder
java.lang.Object
cloud.opencode.base.hash.simhash.SimHashBuilder
Builder for SimHash configuration.
Provides a fluent API for configuring SimHash instances with custom tokenizers, hash functions, and weighting strategies.
Usage Examples:
    // Basic configuration: 3-gram tokens, Murmur3 token hashing, 64-bit fingerprint
    SimHash simHash = SimHash.builder()
        .nGram(3)
        .hashFunction(OpenHash.murmur3_64())
        .bits(64)
        .build();

    // With custom tokenizer and weight function (requires java.util.Arrays)
    SimHash custom = SimHash.builder()
        .tokenizer(text -> Arrays.asList(text.split(",")))
        .weightFunction(token -> token.length())
        .build();
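To show where each builder option enters the computation, here is a minimal, self-contained sketch of a SimHash fingerprint. It does not use or reproduce the library's internals: the class `SimHashSketch`, its `fingerprint` method, and the FNV-1a stand-in hash are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Hypothetical illustration; not the library's implementation.
final class SimHashSketch {

    // FNV-1a (64-bit), used here only as a stand-in token hash.
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L;              // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;                   // FNV prime
        }
        return h;
    }

    // Each token votes +weight or -weight on every bit position;
    // the fingerprint keeps the bits whose total vote is positive.
    static long fingerprint(String text,
                            Function<String, List<String>> tokenizer,
                            ToLongFunction<String> hash,
                            Function<String, Integer> weight,
                            int bits) {
        if (bits != 32 && bits != 64) {
            throw new IllegalArgumentException("bits must be 32 or 64");
        }
        long[] votes = new long[bits];
        for (String token : tokenizer.apply(text)) {
            long h = hash.applyAsLong(token);
            int w = weight.apply(token);
            for (int i = 0; i < bits; i++) {
                votes[i] += ((h >>> i) & 1L) == 1L ? w : -w;
            }
        }
        long result = 0L;
        for (int i = 0; i < bits; i++) {
            if (votes[i] > 0) {
                result |= 1L << i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Function<String, List<String>> commaSplit = text -> Arrays.asList(text.split(","));
        long a = fingerprint("red,green,blue,cyan", commaSplit, SimHashSketch::fnv1a64, String::length, 64);
        long b = fingerprint("red,green,blue,teal", commaSplit, SimHashSketch::fnv1a64, String::length, 64);
        // Similar inputs are expected to differ in relatively few bits.
        System.out.println("Differing bits: " + Long.bitCount(a ^ b));
    }
}
```

In this sketch the tokenizer decides what counts as a feature, the hash spreads each feature over the bit positions, and the weight controls how strongly a feature pulls each bit.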
Features:
- Fluent builder API for SimHash construction
- Configurable hash bit length
- Custom tokenizer support
Security:
- Thread-safe: No (builder pattern; create one builder per thread)
- Null-safe: Yes (validates inputs)
Performance:
- Time complexity: O(1) for build() (configuration only, no hashing is performed)
- Space complexity: O(1) (stores only configuration references)
Since: JDK 25, opencode-base-hash V1.0.0
Author: Leon Soo www.LeonSoo.com
Method Summary
Method                                                    Description
bits(int bits)                                            Sets the fingerprint bit size (32 or 64)
build()                                                   Builds the SimHash instance
characterTokenizer()                                      Uses character tokenization
hashFunction(HashFunction hashFunction)                   Sets the hash function for token hashing
lengthWeighted()                                          Uses token length as weight
nGram(int n)                                              Uses N-gram tokenization
tokenizer(...)                                            Sets a Tokenizer instance
tokenizer(...)                                            Sets a custom tokenizer
uniformWeight()                                           Uses uniform weight (1 for all tokens)
weightFunction(Function<String, Integer> weightFunction)  Sets the token weight function
whitespaceTokenizer()                                     Uses whitespace tokenization
wordTokenizer()                                           Uses word tokenization
-
Method Details
-
tokenizer
Sets a custom tokenizer.
Parameters:
tokenizer - tokenizer function
Returns:
this builder
-
tokenizer
Sets a Tokenizer instance.
Parameters:
tokenizer - tokenizer
Returns:
this builder
-
nGram
Uses N-gram tokenization.
Parameters:
n - gram size
Returns:
this builder
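As an illustration of what N-gram tokenization produces, the sketch below slides a window of `n` characters over the input. Whether the library's `nGram` works on characters or on words is not stated here, so the character-based behavior, the class name `NGramSketch`, and the method `charNGrams` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character N-grams; not the library's tokenizer.
final class NGramSketch {

    // Returns every contiguous substring of length n, in order.
    // Inputs shorter than n yield an empty list.
    static List<String> charNGrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(charNGrams("simhash", 3)); // [sim, imh, mha, has, ash]
    }
}
```

Smaller values of `n` make fingerprints more tolerant of small edits, while larger values capture more local context per token.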
-
whitespaceTokenizer
Uses whitespace tokenization.
Returns:
this builder
-
wordTokenizer
Uses word tokenization.
Returns:
this builder
-
characterTokenizer
Uses character tokenization.
Returns:
this builder
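The three preset tokenizers differ in what they treat as a token. The sketch below shows one plausible reading of each, written as plain `Function<String, List<String>>` values. The exact splitting rules of the library's presets are not documented here, so the class `TokenizerSketch` and the rules it uses are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration; the library's built-in tokenizers may split differently.
final class TokenizerSketch {

    // Splits on runs of whitespace; punctuation stays attached to tokens.
    static final Function<String, List<String>> WHITESPACE = text ->
        Arrays.stream(text.split("\\s+"))
              .filter(token -> !token.isEmpty())
              .collect(Collectors.toList());

    // Splits on anything that is not a Unicode letter or digit.
    static final Function<String, List<String>> WORD = text ->
        Arrays.stream(text.split("[^\\p{L}\\p{N}]+"))
              .filter(token -> !token.isEmpty())
              .collect(Collectors.toList());

    // Emits one token per Unicode code point.
    static final Function<String, List<String>> CHARACTER = text ->
        text.codePoints()
            .mapToObj(cp -> new String(Character.toChars(cp)))
            .collect(Collectors.toList());

    public static void main(String[] args) {
        System.out.println(WHITESPACE.apply("Hello, world!")); // [Hello,, world!]
        System.out.println(WORD.apply("Hello, world!"));       // [Hello, world]
        System.out.println(CHARACTER.apply("abc"));            // [a, b, c]
    }
}
```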
-
hashFunction
Sets the hash function for token hashing.
Parameters:
hashFunction - hash function
Returns:
this builder
-
bits
Sets the fingerprint bit size (32 or 64).
Parameters:
bits - bit size
Returns:
this builder
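The bit size matters when fingerprints are compared, because similarity is usually judged by how many bit positions differ. The sketch below computes a Hamming distance over a 32- or 64-bit fingerprint held in a `long`. How the library itself stores and compares fingerprints is not shown here, so the class `FingerprintCompareSketch` and its methods are assumptions.

```java
// Hypothetical illustration of comparing fingerprints; not the library's API.
final class FingerprintCompareSketch {

    // Counts differing bits, considering only the low `bits` positions.
    static int hammingDistance(long a, long b, int bits) {
        if (bits != 32 && bits != 64) {
            throw new IllegalArgumentException("bits must be 32 or 64");
        }
        long diff = a ^ b;
        if (bits == 32) {
            diff &= 0xFFFFFFFFL; // ignore the unused high half
        }
        return Long.bitCount(diff);
    }

    // Maps the distance to a similarity in [0.0, 1.0].
    static double similarity(long a, long b, int bits) {
        return 1.0 - (double) hammingDistance(a, b, bits) / bits;
    }

    public static void main(String[] args) {
        System.out.println(hammingDistance(0b1011L, 0b0001L, 64)); // 2
        System.out.println(similarity(0b1011L, 0b0001L, 64));      // 0.96875
    }
}
```

A 64-bit fingerprint gives finer-grained distances than a 32-bit one, at the cost of twice the storage per document.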
-
weightFunction
Sets the token weight function.
Parameters:
weightFunction - weight function
Returns:
this builder
-
lengthWeighted
Uses token length as weight.
Returns:
this builder
-
uniformWeight
Uses uniform weight (1 for all tokens).
Returns:
this builder
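The weighting options can be pictured as plain `Function<String, Integer>` values, matching the parameter type of `weightFunction`. The sketch below writes out uniform and length-based weights plus one custom example. The class `WeightSketch`, the `stopWordAware` helper, and the weights 1 and 5 are hypothetical and chosen only for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of token weight functions; not the library's code.
final class WeightSketch {

    // Every token counts equally.
    static final Function<String, Integer> UNIFORM = token -> 1;

    // Longer tokens count more.
    static final Function<String, Integer> LENGTH = String::length;

    // Custom example: common words get weight 1, all others weight 5.
    static Function<String, Integer> stopWordAware(Set<String> stopWords) {
        return token -> stopWords.contains(token) ? 1 : 5;
    }

    // Sums the weights of all tokens, to compare the strategies side by side.
    static int totalWeight(List<String> tokens, Function<String, Integer> weight) {
        int total = 0;
        for (String token : tokens) {
            total += weight.apply(token);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "similarity", "hash");
        System.out.println(totalWeight(tokens, UNIFORM));                      // 3
        System.out.println(totalWeight(tokens, LENGTH));                       // 17
        System.out.println(totalWeight(tokens, stopWordAware(Set.of("the")))); // 11
    }
}
```

A function like `stopWordAware(...)` could be passed to `weightFunction` in the same way as the `token -> token.length()` lambda in the usage example.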
-
build
Builds the SimHash instance.
Returns:
SimHash instance
-