Class SimHashBuilder

java.lang.Object
cloud.opencode.base.hash.simhash.SimHashBuilder

public final class SimHashBuilder extends Object
Builder for SimHash configuration.

Provides a fluent API for configuring SimHash instances with a custom tokenizer, hash function, and token weighting strategy.

Usage Examples:

SimHash simHash = SimHash.builder()
    .nGram(3)
    .hashFunction(OpenHash.murmur3_64())
    .bits(64)
    .build();

// With custom tokenizer
SimHash custom = SimHash.builder()
    .tokenizer(text -> Arrays.asList(text.split(",")))
    .weightFunction(token -> token.length())
    .build();
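
To make the roles of the tokenizer, hash function, and weight function concrete, the following is a minimal, self-contained sketch of how a 64-bit SimHash fingerprint is typically computed from those three pieces. It is not the opencode-base-hash implementation: the class `SimHashSketch`, its method names, and the FNV-1a stand-in hash are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToLongFunction;

/** Illustrative only: a hypothetical sketch, not the library's SimHash class. */
final class SimHashSketch {

    /** 64-bit FNV-1a, used here as a stand-in for a configurable token hash. */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L; // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L; // FNV prime
        }
        return hash;
    }

    /** Combines tokenizer, token hash, and weight function into one fingerprint. */
    static long fingerprint(String text,
                            Function<String, List<String>> tokenizer,
                            ToLongFunction<String> hash,
                            Function<String, Integer> weight) {
        long[] votes = new long[64];
        for (String token : tokenizer.apply(text)) {
            long h = hash.applyAsLong(token);
            int w = weight.apply(token);
            for (int bit = 0; bit < 64; bit++) {
                // A set bit votes +w for its position, a clear bit votes -w.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? w : -w;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= 1L << bit; // keep positions with a positive vote total
            }
        }
        return result;
    }

    /** Number of differing bits between two fingerprints. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

Under this scheme, texts that share most of their tokens produce fingerprints with a small Hamming distance, which is why the choice of tokenizer and weights affects what counts as "similar".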

Features:

  • Fluent builder API for SimHash construction
  • Configurable hash bit length
  • Custom tokenizer support

Security:

  • Thread-safe: No (create a separate builder per thread)
  • Null-safe: Yes (validates inputs)

Performance:

  • Time complexity: O(1) for build(), which only stores configuration and performs no hashing
  • Space complexity: O(1), since the builder stores only configuration references
Since:
JDK 25, opencode-base-hash V1.0.0
Author:
Leon Soo www.LeonSoo.com
  • Method Details

    • tokenizer

      public SimHashBuilder tokenizer(Function<String, List<String>> tokenizer)
      Sets a custom tokenizer function that splits input text into tokens.
      Parameters:
      tokenizer - tokenizer function
      Returns:
      this builder
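
As an illustration of a function that fits the `Function<String, List<String>>` parameter, the sketch below splits on commas, trims each piece, and drops empty tokens. The class name `CommaTokenizerSketch` and the constant `COMMA_TOKENIZER` are hypothetical; whether the builder accepts empty or untrimmed tokens is not stated here, so the cleanup is a defensive assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Illustrative only: a hypothetical tokenizer for comma-separated input. */
final class CommaTokenizerSketch {

    /** Splits on commas, trims whitespace, and discards empty tokens. */
    static final Function<String, List<String>> COMMA_TOKENIZER = text ->
            Arrays.stream(text.split(","))
                  .map(String::trim)
                  .filter(token -> !token.isEmpty())
                  .collect(Collectors.toList());
}
```

A function like this could then be passed as `SimHash.builder().tokenizer(CommaTokenizerSketch.COMMA_TOKENIZER)`.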
    • tokenizer

      public SimHashBuilder tokenizer(Tokenizer tokenizer)
      Sets the Tokenizer instance used to split input text.
      Parameters:
      tokenizer - tokenizer
      Returns:
      this builder
    • nGram

      public SimHashBuilder nGram(int n)
      Uses N-gram tokenization.
      Parameters:
      n - gram size
      Returns:
      this builder
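
For readers unfamiliar with N-gram tokenization, here is a minimal sketch of a sliding-window tokenizer. It assumes character-level N-grams; this page does not say whether `nGram(int)` works on characters or words, nor how it treats input shorter than `n`, so both behaviors below are assumptions. `NGramSketch` and `charNGrams` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical character N-gram tokenizer. */
final class NGramSketch {

    /**
     * Returns every contiguous substring of length n.
     * Input shorter than n yields an empty list (an assumption of this sketch).
     */
    static List<String> charNGrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }
}
```

With `n = 3`, the input `"hello"` produces the windows `hel`, `ell`, and `llo`.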
    • whitespaceTokenizer

      public SimHashBuilder whitespaceTokenizer()
      Uses whitespace tokenization.
      Returns:
      this builder
    • wordTokenizer

      public SimHashBuilder wordTokenizer()
      Uses word tokenization.
      Returns:
      this builder
    • characterTokenizer

      public SimHashBuilder characterTokenizer()
      Uses character tokenization.
      Returns:
      this builder
    • hashFunction

      public SimHashBuilder hashFunction(HashFunction hashFunction)
      Sets the hash function used to hash each token.
      Parameters:
      hashFunction - hash function
      Returns:
      this builder
    • bits

      public SimHashBuilder bits(int bits)
      Sets the fingerprint size in bits (32 or 64).
      Parameters:
      bits - fingerprint size in bits
      Returns:
      this builder
    • weightFunction

      public SimHashBuilder weightFunction(Function<String,Integer> weightFunction)
      Sets the function that assigns a weight to each token.
      Parameters:
      weightFunction - weight function
      Returns:
      this builder
    • lengthWeighted

      public SimHashBuilder lengthWeighted()
      Uses each token's length as its weight.
      Returns:
      this builder
    • uniformWeight

      public SimHashBuilder uniformWeight()
      Uses a uniform weight (1 for every token).
      Returns:
      this builder
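
The three weighting options can be pictured as plain `Function<String, Integer>` values. The sketch below shows presumed equivalents of `lengthWeighted()` and `uniformWeight()` plus one custom example; the class `WeightSketch` and the stop-word rule are hypothetical, and the library's built-in strategies may be implemented differently.

```java
import java.util.Set;
import java.util.function.Function;

/** Illustrative only: hypothetical weight functions for SimHash tokens. */
final class WeightSketch {

    /** Presumed equivalent of lengthWeighted(): longer tokens count more. */
    static final Function<String, Integer> LENGTH = String::length;

    /** Presumed equivalent of uniformWeight(): every token counts once. */
    static final Function<String, Integer> UNIFORM = token -> 1;

    /** Custom example: stop words get weight 1, all other tokens weight 3. */
    static Function<String, Integer> stopWordAware(Set<String> stopWords) {
        return token -> stopWords.contains(token) ? 1 : 3;
    }
}
```

A custom function such as `WeightSketch.stopWordAware(Set.of("the", "a"))` would be supplied through `weightFunction(...)`.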
    • build

      public SimHash build()
      Builds the SimHash instance from the current configuration.
      Returns:
      SimHash instance