Class SimHash
java.lang.Object
cloud.opencode.base.hash.simhash.SimHash
SimHash algorithm implementation for text fingerprinting.
SimHash is a locality-sensitive hashing algorithm that produces similar hash values for similar input texts. It is useful for detecting near-duplicate content and measuring text similarity.
Features:
- 64-bit and 32-bit fingerprints
- Configurable tokenization
- Token weighting support
- Hamming distance calculation
Usage Examples:
SimHash simHash = SimHash.builder()
    .nGram(3)
    .build();
long hash1 = simHash.hash("Hello World");
long hash2 = simHash.hash("Hello World!");
int distance = SimHash.hammingDistance(hash1, hash2);
double similarity = SimHash.similarity(hash1, hash2);
Security:
- Thread-safe: Yes (stateless)
Performance:
- Time complexity: O(n * k), where n is the number of tokens and k is the number of hash bits
- Space complexity: O(k), where k is the number of hash bits (default 64)
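The O(n * k) time and O(k) space bounds follow from the usual SimHash construction: each of the n tokens casts one vote on each of the k bit positions, and only the k vote counters are kept. The sketch below illustrates that construction; it is not the source of this class. The class name `SimHashSketch`, the character n-gram tokenizer, the FNV-1a token hash, and the uniform token weight of 1 are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names and choices here are hypothetical, not the library's. */
final class SimHashSketch {

    /** Splits text into overlapping character n-grams (assumed tokenization). */
    static List<String> charNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        if (text.length() <= n) {
            grams.add(text);                      // short input becomes a single token
            return grams;
        }
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    /** 64-bit FNV-1a; picked because it is short, the real token hash is unknown. */
    static long fnv1a64(String token) {
        long h = 0xcbf29ce484222325L;             // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                  // FNV prime
        }
        return h;
    }

    /** Computes a 64-bit SimHash with every token weighted 1. */
    static long simHash64(String text, int n) {
        int[] votes = new int[64];                // O(k) working space
        for (String gram : charNGrams(text, n)) { // n tokens ...
            long h = fnv1a64(gram);
            for (int bit = 0; bit < 64; bit++) {  // ... times k bits
                // A weighted variant would add or subtract the token weight instead of 1.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {                 // majority of tokens had this bit set
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    public static void main(String[] args) {
        long a = simHash64("Hello World", 3);
        long b = simHash64("Hello World!", 3);
        System.out.println("Differing bits: " + Long.bitCount(a ^ b));
    }
}
```

Because each output bit is a majority vote, changing a few tokens usually flips only a few bits, which is what makes Hamming distance a meaningful similarity measure for these fingerprints.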
- Since:
- JDK 25, opencode-base-hash V1.0.0
- Author:
- Leon Soo www.LeonSoo.com
Method Summary
Modifier and Type | Method | Description
int | bits() | Gets the number of bits in the fingerprint
static SimHashBuilder | builder() | Creates a builder
static SimHash | create() | Creates a default SimHash instance with 3-gram tokenization
Fingerprint | fingerprint(String text) | Computes a Fingerprint object for text
static int | hammingDistance(long hash1, long hash2) | Calculates the Hamming distance between two hash values
long | hash(String text) | Computes the SimHash for text
static boolean | isSimilar(long hash1, long hash2, int threshold) | Checks whether two hashes are similar within a threshold
static double | similarity(long hash1, long hash2) | Calculates the similarity between two hash values
static double | similarity(long hash1, long hash2, int bits) | Calculates the similarity using the specified number of bits
-
Method Details
-
hash
public long hash(String text)
Computes the SimHash for the given text.
- Parameters:
- text - input text
- Returns:
- SimHash value
-
fingerprint
public Fingerprint fingerprint(String text)
Computes a Fingerprint object for the given text.
- Parameters:
- text - input text
- Returns:
- Fingerprint object
-
bits
public int bits()
Gets the number of bits in the fingerprint.
- Returns:
- number of bits
-
hammingDistance
public static int hammingDistance(long hash1, long hash2)
Calculates the Hamming distance between two hash values.
- Parameters:
- hash1 - first hash
- hash2 - second hash
- Returns:
- number of differing bits
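The Hamming distance between two 64-bit values is the number of set bits in their XOR. A minimal sketch of that calculation follows; the class name `HammingSketch` is hypothetical, and this is not necessarily how the method is implemented here.

```java
/** Illustrative sketch only; not the library's source. */
final class HammingSketch {

    /** Counts the bit positions in which a and b differ. */
    static int hammingDistance(long a, long b) {
        // XOR leaves a 1 exactly where the two values disagree.
        return Long.bitCount(a ^ b);
    }
}
```

For example, 0b1011 and 0b0010 differ in two positions, so their distance is 2.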
-
similarity
public static double similarity(long hash1, long hash2)
Calculates the similarity between two hash values.
- Parameters:
- hash1 - first hash
- hash2 - second hash
- Returns:
- similarity (0.0 - 1.0)
-
similarity
public static double similarity(long hash1, long hash2, int bits)
Calculates the similarity using the specified number of bits.
- Parameters:
- hash1 - first hash
- hash2 - second hash
- bits - number of bits
- Returns:
- similarity (0.0 - 1.0)
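The documentation gives only the 0.0 to 1.0 range, not the formula. A common way to map Hamming distance onto that range is 1 - distance / bits, and the sketch below assumes that definition. The class name `SimilaritySketch`, the masking of the low `bits` bits, and the argument check are assumptions, not confirmed behavior of this class.

```java
/** Illustrative sketch only; the formula is an assumption, not the library's source. */
final class SimilaritySketch {

    /** Similarity over the low {@code bits} bits: 1.0 means identical, 0.0 means all bits differ. */
    static double similarity(long a, long b, int bits) {
        if (bits < 1 || bits > 64) {
            throw new IllegalArgumentException("bits must be in 1..64");
        }
        // Shifting a long by 64 is a no-op in Java, so 64 bits needs its own mask.
        long mask = bits == 64 ? -1L : (1L << bits) - 1;
        int distance = Long.bitCount((a ^ b) & mask);
        return 1.0 - (double) distance / bits;
    }

    /** Convenience overload that assumes 64-bit fingerprints. */
    static double similarity(long a, long b) {
        return similarity(a, b, 64);
    }
}
```

Under this definition, two 32-bit fingerprints that differ in 16 positions have a similarity of 0.5.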
-
isSimilar
public static boolean isSimilar(long hash1, long hash2, int threshold)
Checks whether two hashes are similar within a threshold.
- Parameters:
- hash1 - first hash
- hash2 - second hash
- threshold - Hamming distance threshold
- Returns:
- true if similar
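A threshold check of this kind is typically used to filter near-duplicates from a collection of fingerprints. The sketch below shows one way to do that. It assumes the comparison is inclusive (distance <= threshold), which the documentation does not state, and the class name `NearDuplicateSketch` and the helper `dropNearDuplicates` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the inclusive threshold is an assumption. */
final class NearDuplicateSketch {

    /** True when the two fingerprints differ in at most {@code threshold} bits. */
    static boolean isSimilar(long a, long b, int threshold) {
        return Long.bitCount(a ^ b) <= threshold;
    }

    /** Keeps the first fingerprint of each group of near-duplicates, in input order. */
    static List<Long> dropNearDuplicates(List<Long> fingerprints, int threshold) {
        List<Long> kept = new ArrayList<>();
        for (long candidate : fingerprints) {
            boolean duplicate = false;
            for (long existing : kept) {
                if (isSimilar(candidate, existing, threshold)) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

This pairwise scan is quadratic in the number of fingerprints, which is fine for small batches; larger collections usually need an index over fingerprint segments.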
-
builder
public static SimHashBuilder builder()
Creates a builder.
-
create
public static SimHash create()
Creates a default SimHash instance with 3-gram tokenization.
- Returns:
- SimHash instance
-