Class CsvSampling

java.lang.Object
cloud.opencode.base.csv.sampling.CsvSampling

public final class CsvSampling extends Object
CSV Sampling - Sampling utilities for CSV documents

Provides static methods for random, systematic, and stratified sampling from a CsvDocument. Every method preserves the headers of the source document.

Features:

  • Random sampling (Fisher-Yates, with optional seed)
  • Systematic sampling (every Nth row)
  • Stratified sampling (proportional by group)

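Usage Examples:

CsvDocument randomSample = CsvSampling.random(doc, 10);                      // 10 random rows
CsvDocument seededSample = CsvSampling.random(doc, 10, 42L);                 // 10 random rows, reproducible with seed 42
CsvDocument systematicSample = CsvSampling.systematic(doc, 5);               // every 5th row, random start offset
CsvDocument stratifiedSample = CsvSampling.stratified(doc, "category", 20);  // about 20 rows, proportional by "category"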

Security:

  • Thread-safe: Yes (stateless utility)
  • Null-safe: Validates all inputs
Since:
JDK 25, opencode-base-csv V1.0.3
Author:
Leon Soo www.LeonSoo.com
  • Method Summary

    Modifier and Type
    Method
    Description
    static CsvDocument
    random(CsvDocument doc, int sampleSize)
    Randomly samples rows without replacement
    static CsvDocument
    random(CsvDocument doc, int sampleSize, long seed)
    Randomly samples rows without replacement using a seed for reproducibility
    static CsvDocument
    stratified(CsvDocument doc, String column, int sampleSize)
    Performs stratified sampling, sampling proportionally from each group defined by a column
    static CsvDocument
    stratified(CsvDocument doc, String column, int sampleSize, long seed)
    Performs stratified sampling with a seed for reproducibility
    static CsvDocument
    systematic(CsvDocument doc, int interval)
    Performs systematic sampling, selecting every Nth row starting from a random offset
    static CsvDocument
    systematic(CsvDocument doc, int interval, int startOffset)
    Performs systematic sampling with a specified start offset

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Method Details

    • random

      public static CsvDocument random(CsvDocument doc, int sampleSize)
      Randomly samples rows without replacement

      Uses a Fisher-Yates shuffle on the row indices. If sampleSize >= rowCount, the entire document is returned.
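The index shuffle described above can be illustrated with a minimal sketch. It is not the library's source: the class FisherYatesSketch and the method sampleIndices are hypothetical names, IllegalArgumentException stands in for OpenCsvException, and the order of the returned indices is an assumption, since the documentation does not specify it. The sketch runs a partial Fisher-Yates shuffle, stopping after sampleSize swaps, and takes a java.util.Random so that a seeded generator reproduces the same sample.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not part of opencode-base-csv.
final class FisherYatesSketch {

    // Returns distinct row indices chosen without replacement.
    // If sampleSize >= rowCount, every index is returned in source order.
    static int[] sampleIndices(int rowCount, int sampleSize, Random rng) {
        if (rowCount < 0) {
            throw new IllegalArgumentException("rowCount must not be negative");
        }
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        int[] indices = new int[rowCount];
        for (int i = 0; i < rowCount; i++) {
            indices[i] = i;
        }
        if (sampleSize >= rowCount) {
            return indices; // nothing to shuffle: the whole document is the sample
        }
        // Partial Fisher-Yates: only the first sampleSize positions are settled.
        for (int i = 0; i < sampleSize; i++) {
            int j = i + rng.nextInt(rowCount - i); // j is uniform in [i, rowCount)
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        // Assumption: indices are returned in shuffle order, not sorted.
        return Arrays.copyOf(indices, sampleSize);
    }

    public static void main(String[] args) {
        // Two generators with the same seed yield the same indices.
        System.out.println(Arrays.toString(sampleIndices(100, 10, new Random(42L))));
        System.out.println(Arrays.toString(sampleIndices(100, 10, new Random(42L))));
    }
}
```

Stopping after sampleSize swaps keeps the number of random draws proportional to the sample size rather than to the row count, while each remaining index is still equally likely to be chosen at every step.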

      Parameters:
      doc - the source document
      sampleSize - the number of rows to sample
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc is null
      OpenCsvException - if sampleSize is not positive
    • random

      public static CsvDocument random(CsvDocument doc, int sampleSize, long seed)
      Randomly samples rows without replacement using a seed for reproducibility
      Parameters:
      doc - the source document
      sampleSize - the number of rows to sample
      seed - the random seed
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc is null
      OpenCsvException - if sampleSize is not positive
    • systematic

      public static CsvDocument systematic(CsvDocument doc, int interval)
      Performs systematic sampling, selecting every Nth row starting from a random offset
      Parameters:
      doc - the source document
      interval - the sampling interval (every Nth row)
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc is null
      OpenCsvException - if interval is not positive
    • systematic

      public static CsvDocument systematic(CsvDocument doc, int interval, int startOffset)
      Performs systematic sampling with a specified start offset
      Parameters:
      doc - the source document
      interval - the sampling interval
      startOffset - the 0-based starting row offset
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc is null
      OpenCsvException - if interval is not positive or startOffset is invalid
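The two systematic overloads can be pictured as index selection over the row range. The following is a minimal sketch under stated assumptions, not the library's code: SystematicSketch and systematicIndices are hypothetical names, IllegalArgumentException stands in for OpenCsvException, a negative startOffset is treated as the only invalid offset, and the random offset is assumed to be drawn from [0, interval). The documentation does not confirm either of the last two choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not part of opencode-base-csv.
final class SystematicSketch {

    // Selects startOffset, startOffset + interval, startOffset + 2 * interval, ...
    // for as long as the index stays below rowCount.
    static List<Integer> systematicIndices(int rowCount, int interval, int startOffset) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (startOffset < 0) { // assumption: only negative offsets are rejected
            throw new IllegalArgumentException("startOffset must not be negative");
        }
        List<Integer> picked = new ArrayList<>();
        // A long counter avoids int overflow when rowCount is close to Integer.MAX_VALUE.
        for (long i = startOffset; i < rowCount; i += interval) {
            picked.add((int) i);
        }
        return picked;
    }

    // Assumption: the random start offset is uniform in [0, interval).
    static List<Integer> systematicIndices(int rowCount, int interval, Random rng) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return systematicIndices(rowCount, interval, rng.nextInt(interval));
    }

    public static void main(String[] args) {
        System.out.println(systematicIndices(10, 3, 1)); // prints [1, 4, 7]
    }
}
```

With 10 rows, an interval of 3, and a start offset of 1, the selected 0-based rows are 1, 4, and 7. The next candidate, 10, falls outside the document.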
    • stratified

      public static CsvDocument stratified(CsvDocument doc, String column, int sampleSize)
      Performs stratified sampling, sampling proportionally from each group defined by a column

      Groups rows by the value in the specified column, then samples from each group in proportion to its size. Each group contributes at least 1 row if possible. Because of rounding, the total number of rows returned may differ slightly from sampleSize.
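The proportional allocation described above, and the reason the total can drift from sampleSize, can be shown with a small sketch. This is one plausible reading of the description and not the library's implementation: StratifiedAllocationSketch and allocate are hypothetical names, the per-group quota is assumed to be round(sampleSize * groupSize / totalRows) clamped to the range [1, groupSize], and IllegalArgumentException stands in for OpenCsvException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of opencode-base-csv.
final class StratifiedAllocationSketch {

    // Maps each group to the number of rows that would be drawn from it.
    static Map<String, Integer> allocate(Map<String, Integer> groupSizes, int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        long totalRows = 0;
        for (int size : groupSizes.values()) {
            totalRows += size;
        }
        Map<String, Integer> quotas = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : groupSizes.entrySet()) {
            int size = entry.getValue();
            // Assumed rule: proportional share, rounded to the nearest integer.
            long quota = Math.round((double) sampleSize * size / totalRows);
            quota = Math.max(1, quota);    // at least 1 row per group...
            quota = Math.min(size, quota); // ...but never more rows than the group holds
            quotas.put(entry.getKey(), (int) quota);
        }
        return quotas;
    }

    public static void main(String[] args) {
        Map<String, Integer> sizes = new LinkedHashMap<>();
        sizes.put("a", 60);
        sizes.put("b", 30);
        sizes.put("c", 10);
        System.out.println(allocate(sizes, 20)); // prints {a=12, b=6, c=2}
    }
}
```

Under the same assumed rule, groups of 98, 1, and 1 rows with a sampleSize of 10 would receive quotas of 10, 1, and 1. That is 12 rows in total, which shows how rounding combined with the minimum of 1 row per group can push the result past the target.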

      Parameters:
      doc - the source document
      column - the column to group by
      sampleSize - the target total sample size
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc or column is null
      OpenCsvException - if sampleSize is not positive or column is not found
    • stratified

      public static CsvDocument stratified(CsvDocument doc, String column, int sampleSize, long seed)
      Performs stratified sampling with a seed for reproducibility
      Parameters:
      doc - the source document
      column - the column to group by
      sampleSize - the target total sample size
      seed - the random seed
      Returns:
      a new document containing the sampled rows
      Throws:
      NullPointerException - if doc or column is null
      OpenCsvException - if sampleSize is not positive or column is not found