Interface Tokenizer

All Superinterfaces:
Function<String, List<String>>
Functional Interface:
This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

@FunctionalInterface public interface Tokenizer extends Function<String, List<String>>
Tokenizer interface for SimHash text processing.

Defines how to split text into tokens for SimHash computation. Provides built-in implementations for common tokenization strategies.

Features:

  • Whitespace tokenization
  • N-gram tokenization
  • Character tokenization
  • Custom tokenization

Usage Examples:

// Whitespace tokenizer
Tokenizer tokenizer = Tokenizer.whitespace();
List<String> tokens = tokenizer.tokenize("Hello World");

// N-gram tokenizer
Tokenizer ngram = Tokenizer.ngram(3);
List<String> grams = ngram.tokenize("Hello");
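
Because Tokenizer is a functional interface that extends Function<String, List<String>>, a custom strategy can be written as a lambda. The sketch below is self-contained and does not use the library: SketchTokenizer is a hypothetical stand-in that mirrors the documented shape of Tokenizer, and it assumes that apply simply forwards to tokenize, which this page does not state explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class CustomTokenizerSketch {

    // Hypothetical stand-in mirroring the documented shape of Tokenizer.
    @FunctionalInterface
    interface SketchTokenizer extends Function<String, List<String>> {
        List<String> tokenize(String text);

        // Assumption: apply forwards to tokenize.
        @Override
        default List<String> apply(String text) {
            return tokenize(text);
        }
    }

    // Custom strategy as a lambda: split on commas, trim, drop empty fields.
    static final SketchTokenizer COMMA = text ->
            Arrays.stream(text.split(","))
                  .map(String::trim)
                  .filter(s -> !s.isEmpty())
                  .collect(Collectors.toList());

    // andThen is inherited from Function, so post-processing can be chained.
    static final Function<String, Integer> TOKEN_COUNT = COMMA.andThen(List::size);

    public static void main(String[] args) {
        System.out.println(COMMA.tokenize("a, b ,c"));   // [a, b, c]
        System.out.println(TOKEN_COUNT.apply("a, b ,c")); // 3
    }
}
```

With the real interface, the same lambda could presumably be assigned directly to a Tokenizer variable, since tokenize is its only abstract method.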

Security:

  • Thread-safe: Implementation-dependent
  • Null-safe: Yes (validates inputs)

Performance:

  • Time complexity: O(L) for whitespace/words/characters tokenization, where L is the input length; O(L) for ngram(n), which produces L-n+1 tokens (treating n as a constant)
  • Space complexity: O(L) for the resulting token list
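
The L-n+1 token count for ngram(n) follows from sliding a window of n characters across the input one position at a time. The following is a minimal sketch of that idea, not the library's implementation: the class NgramSketch, its argument checks, and the choice to return an empty list when the input is shorter than n are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramSketch {

    // Slides a window of n chars (UTF-16 units) across the text.
    static List<String> ngrams(String text, int n) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> grams = new ArrayList<>();
        // Assumption: input shorter than n yields no tokens.
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "Hello" has length 5, so n = 3 gives 5 - 3 + 1 = 3 tokens.
        System.out.println(ngrams("Hello", 3)); // [Hel, ell, llo]
    }
}
```

Each substring call copies n characters, so the total work in this sketch is proportional to L·n; it is linear in L when n is a small fixed value.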
Since:
JDK 25, opencode-base-hash V1.0.0
Author:
Leon Soo www.LeonSoo.com
  • Method Summary

    Modifier and Type
    Method
    Description
    default List<String>
    apply(String text)
     
    static Tokenizer
    characters()
    Creates a character tokenizer
    static Tokenizer
    cjkCharacters()
    Creates a Chinese character tokenizer (single characters for CJK)
    static Tokenizer
    combined(Tokenizer... tokenizers)
    Combines multiple tokenizers
    static Tokenizer
    ngram(int n)
    Creates an N-gram tokenizer
    List<String>
    tokenize(String text)
    Tokenizes the input text
    static Tokenizer
    whitespace()
    Creates a whitespace tokenizer
    static Tokenizer
    words()
    Creates a word boundary tokenizer (alphanumeric words)

    Methods inherited from interface Function

    andThen, compose
  • Method Details

    • tokenize

      List<String> tokenize(String text)
      Tokenizes the input text
      Parameters:
      text - input text
      Returns:
      list of tokens
    • apply

      default List<String> apply(String text)
      Specified by:
      apply in interface Function<String, List<String>>
    • whitespace

      static Tokenizer whitespace()
      Creates a whitespace tokenizer
      Returns:
      whitespace tokenizer
    • ngram

      static Tokenizer ngram(int n)
      Creates an N-gram tokenizer
      Parameters:
      n - gram size
      Returns:
      n-gram tokenizer
    • characters

      static Tokenizer characters()
      Creates a character tokenizer
      Returns:
      character tokenizer
    • words

      static Tokenizer words()
      Creates a word boundary tokenizer (alphanumeric words)
      Returns:
      word boundary tokenizer
    • cjkCharacters

      static Tokenizer cjkCharacters()
      Creates a Chinese character tokenizer (single characters for CJK)
      Returns:
      Chinese character tokenizer
    • combined

      static Tokenizer combined(Tokenizer... tokenizers)
      Combines multiple tokenizers
      Parameters:
      tokenizers - tokenizers to combine
      Returns:
      combined tokenizer
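
One plausible reading of combined is that it runs every tokenizer on the same input and concatenates the results in argument order. This page does not say how results are merged or whether duplicates are removed, so the following is only a sketch under that assumption. CombinedSketch, its combined helper, and the WHITESPACE and CHARACTERS stand-ins are hypothetical and use plain Function<String, List<String>> values in place of Tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class CombinedSketch {

    // Assumption: tokens from each tokenizer are appended in argument
    // order, duplicates included.
    @SafeVarargs
    static Function<String, List<String>> combined(
            Function<String, List<String>>... tokenizers) {
        // Defensive copy so later changes to the array have no effect.
        List<Function<String, List<String>>> parts =
                List.copyOf(Arrays.asList(tokenizers));
        return text -> {
            List<String> all = new ArrayList<>();
            for (Function<String, List<String>> t : parts) {
                all.addAll(t.apply(text));
            }
            return all;
        };
    }

    // Simple stand-ins for whitespace and character tokenization.
    static final Function<String, List<String>> WHITESPACE =
            text -> Arrays.asList(text.trim().split("\\s+"));
    static final Function<String, List<String>> CHARACTERS =
            text -> Arrays.asList(text.split(""));

    public static void main(String[] args) {
        System.out.println(combined(WHITESPACE, CHARACTERS).apply("abc"));
    }
}
```

Feeding both word-level and character-level tokens into one list is one way such a combination could give a SimHash computation features at more than one granularity.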