Class SimHash

java.lang.Object
cloud.opencode.base.hash.simhash.SimHash

public final class SimHash extends Object
SimHash algorithm implementation for text fingerprinting.

SimHash is a locality-sensitive hashing algorithm: similar input texts produce hash values that differ in only a few bits. This makes it useful for detecting near-duplicate content and measuring text similarity.
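To make the idea concrete, here is a minimal sketch of how a 64-bit SimHash can be computed. It is not this class's implementation: the class name SimHashSketch, the character n-gram tokenizer, the handling of short input, and the FNV-1a token hash are all assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the library's implementation. */
final class SimHashSketch {

    /** Splits text into overlapping character n-grams (assumed tokenization). */
    static List<String> nGrams(String text, int n) {
        List<String> tokens = new ArrayList<>();
        if (text.length() <= n) {
            tokens.add(text); // short input: the whole text is the only token
            return tokens;
        }
        for (int i = 0; i + n <= text.length(); i++) {
            tokens.add(text.substring(i, i + n));
        }
        return tokens;
    }

    /** 64-bit FNV-1a hash of a token (an assumed choice of token hash). */
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L; // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L; // FNV prime
        }
        return h;
    }

    /** Computes a 64-bit SimHash in which every token has weight 1. */
    static long hash(String text, int n) {
        int[] votes = new int[64];
        for (String token : nGrams(text, n)) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                // Each token votes +1 for a set bit and -1 for a clear bit.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit; // majority of tokens had this bit set
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        long a = hash("Hello World", 3);
        long b = hash("Hello World!", 3);
        // Number of bit positions in which the two fingerprints differ.
        System.out.println(Long.bitCount(a ^ b));
    }
}
```

Because each output bit is a majority vote over all tokens, changing a few tokens tends to flip only a few bits. The two nested loops also show where the O(n * k) time and O(k) space figures come from.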

Features:

  • 64-bit and 32-bit fingerprints
  • Configurable tokenization
  • Token weighting support
  • Hamming distance calculation
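Token weighting can be pictured as letting important tokens cast larger votes for each bit. The sketch below is an assumption about how such weighting might work, not the API of this class: the class name WeightedSimHashSketch, the Map-based input, and the FNV-1a token hash are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the weighting API of the real class may differ. */
final class WeightedSimHashSketch {

    /** 64-bit FNV-1a hash of a token (an assumed choice of token hash). */
    static long tokenHash(String token) {
        long h = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Computes a 64-bit SimHash from tokens mapped to integer weights. */
    static long hash(Map<String, Integer> weightedTokens) {
        long[] votes = new long[64];
        for (Map.Entry<String, Integer> entry : weightedTokens.entrySet()) {
            long h = tokenHash(entry.getKey());
            int weight = entry.getValue();
            for (int bit = 0; bit < 64; bit++) {
                // A token adds its weight for a set bit and subtracts it for a clear bit.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? weight : -weight;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) { // ties leave the bit clear
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        Map<String, Integer> tokens = new LinkedHashMap<>();
        tokens.put("simhash", 5); // a distinctive term gets a high weight
        tokens.put("the", 1);     // a common word gets a low weight
        System.out.println(Long.toHexString(hash(tokens)));
    }
}
```

With these weights, a token whose weight exceeds the combined weight of all others decides every bit on its own, which is why down-weighting common words keeps them from dominating the fingerprint.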

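Usage Examples:

// Build a SimHash that tokenizes input into 3-grams
SimHash simHash = SimHash.builder()
    .nGram(3)
    .build();

// Fingerprint two nearly identical strings
long hash1 = simHash.hash("Hello World");
long hash2 = simHash.hash("Hello World!");

// Compare the fingerprints: differing bit count, and a score in 0.0 - 1.0
int distance = SimHash.hammingDistance(hash1, hash2);
double similarity = SimHash.similarity(hash1, hash2);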

Security:

  • Thread-safe: Yes (stateless)

Performance:

  • Time complexity: O(n * k), where n is the number of tokens and k is the number of hash bits
  • Space complexity: O(k), where k is the number of hash bits (default 64)
Since:
JDK 25, opencode-base-hash V1.0.0
Author:
Leon Soo www.LeonSoo.com
  • Method Summary

    Modifier and Type        Method                                            Description
    int                      bits()                                            Gets the number of bits in the fingerprint
    static SimHashBuilder    builder()                                         Creates a builder
    static SimHash           create()                                          Creates a default SimHash instance with 3-gram tokenization
    Fingerprint              fingerprint(String text)                          Computes a Fingerprint object for text
    static int               hammingDistance(long hash1, long hash2)           Calculates the Hamming distance between two hash values
    long                     hash(String text)                                 Computes the SimHash for text
    static boolean           isSimilar(long hash1, long hash2, int threshold)  Checks whether two hashes are similar within a threshold
    static double            similarity(long hash1, long hash2)                Calculates the similarity between two hash values
    static double            similarity(long hash1, long hash2, int bits)      Calculates similarity using the specified number of bits

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • hash

      public long hash(String text)
      Computes the SimHash for text.
      Parameters:
      text - input text
      Returns:
      SimHash value
    • fingerprint

      public Fingerprint fingerprint(String text)
      Computes a Fingerprint object for text.
      Parameters:
      text - input text
      Returns:
      Fingerprint object
    • bits

      public int bits()
      Gets the number of bits in the fingerprint.
      Returns:
      number of bits
    • hammingDistance

      public static int hammingDistance(long hash1, long hash2)
      Calculates the Hamming distance between two hash values.
      Parameters:
      hash1 - first hash
      hash2 - second hash
      Returns:
      number of differing bits
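For reference, the Hamming distance of two 64-bit values is the population count of their XOR. The sketch below shows that standard technique; the class name HammingSketch is hypothetical, and the source does not confirm that this class computes it the same way.

```java
/** Illustrative sketch of the standard XOR-and-popcount technique. */
final class HammingSketch {

    /** Counts the bit positions in which the two values differ. */
    static int hammingDistance(long hash1, long hash2) {
        // XOR leaves a 1 exactly where the inputs differ; bitCount counts those 1s.
        return Long.bitCount(hash1 ^ hash2);
    }

    public static void main(String[] args) {
        // 1010 and 0110 differ in the two highest of their four bits.
        System.out.println(hammingDistance(0b1010L, 0b0110L)); // prints 2
    }
}
```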
    • similarity

      public static double similarity(long hash1, long hash2)
      Calculates the similarity between two hash values.
      Parameters:
      hash1 - first hash
      hash2 - second hash
      Returns:
      similarity (0.0 - 1.0)
    • similarity

      public static double similarity(long hash1, long hash2, int bits)
      Calculates the similarity between two hash values using the specified number of bits.
      Parameters:
      hash1 - first hash
      hash2 - second hash
      bits - number of bits
      Returns:
      similarity (0.0 - 1.0)
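The source states only that the result lies in 0.0 - 1.0. One common way to obtain such a score is 1 - distance / bits, sketched below. The formula, the 64-bit default for the two-argument overload, the masking to the low bits, and the class name SimilaritySketch are all assumptions rather than documented behavior.

```java
/** Illustrative sketch only; the real formula is not given in the source. */
final class SimilaritySketch {

    /** Similarity over the low {@code bits} bits: 1.0 means identical, 0.0 means all bits differ. */
    static double similarity(long hash1, long hash2, int bits) {
        if (bits <= 0 || bits > 64) {
            throw new IllegalArgumentException("bits must be in 1..64");
        }
        // Keep only the low `bits` bits so unused high bits cannot affect the score.
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        int distance = Long.bitCount((hash1 ^ hash2) & mask);
        return 1.0 - (double) distance / bits;
    }

    /** Assumed default: compare all 64 bits. */
    static double similarity(long hash1, long hash2) {
        return similarity(hash1, hash2, 64);
    }

    public static void main(String[] args) {
        // 8 of 32 bits differ, so the score is 1 - 8/32.
        System.out.println(similarity(0L, 0xFFL, 32)); // prints 0.75
    }
}
```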
    • isSimilar

      public static boolean isSimilar(long hash1, long hash2, int threshold)
      Checks whether two hashes are similar within a threshold.
      Parameters:
      hash1 - first hash
      hash2 - second hash
      threshold - Hamming distance threshold
      Returns:
      true if the hashes are similar
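A threshold check of this kind is typically a comparison of the Hamming distance against the threshold. The sketch below assumes the comparison is inclusive (distance <= threshold); the source does not say whether the boundary counts as similar, and the class name NearDuplicateSketch is hypothetical.

```java
/** Illustrative sketch only; whether the real check is inclusive is an assumption. */
final class NearDuplicateSketch {

    /** Returns true when at most {@code threshold} bits differ (inclusive, by assumption). */
    static boolean isSimilar(long hash1, long hash2, int threshold) {
        return Long.bitCount(hash1 ^ hash2) <= threshold;
    }

    public static void main(String[] args) {
        System.out.println(isSimilar(0b1111L, 0b1100L, 3)); // 2 bits differ: prints true
        System.out.println(isSimilar(0b1111L, 0b0000L, 3)); // 4 bits differ: prints false
    }
}
```

A small threshold keeps false positives low at the cost of missing more heavily edited duplicates, so the right value depends on the corpus and fingerprint width.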
    • builder

      public static SimHashBuilder builder()
      Creates a builder.
      Returns:
      builder
    • create

      public static SimHash create()
      Creates a default SimHash instance with 3-gram tokenization.
      Returns:
      SimHash instance