Class CsvSampling
java.lang.Object
cloud.opencode.base.csv.sampling.CsvSampling
CSV Sampling: sampling utilities for CSV documents.
Provides static methods for random, systematic, and stratified sampling
from a CsvDocument. All methods preserve the original document's headers.
Features:
- Random sampling (Fisher-Yates, with optional seed)
- Systematic sampling (every Nth row)
- Stratified sampling (proportional by group)
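Usage Examples:
// 10 random rows, unseeded
CsvDocument randomSample = CsvSampling.random(doc, 10);
// 10 random rows, reproducible with seed 42
CsvDocument seededSample = CsvSampling.random(doc, 10, 42L);
// every 5th row, starting from a random offset
CsvDocument systematicSample = CsvSampling.systematic(doc, 5);
// about 20 rows, allocated in proportion to each "category" group
CsvDocument stratifiedSample = CsvSampling.stratified(doc, "category", 20);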
Security:
- Thread-safe: Yes (stateless utility)
- Null-safe: Validates all inputs
Since: JDK 25, opencode-base-csv V1.0.3
Author: Leon Soo (www.LeonSoo.com)
Method Summary
- static CsvDocument random(CsvDocument doc, int sampleSize): Randomly samples rows without replacement.
- static CsvDocument random(CsvDocument doc, int sampleSize, long seed): Randomly samples rows without replacement, using a seed for reproducibility.
- static CsvDocument stratified(CsvDocument doc, String column, int sampleSize): Performs stratified sampling, drawing proportionally from each group defined by a column.
- static CsvDocument stratified(CsvDocument doc, String column, int sampleSize, long seed): Performs stratified sampling with a seed for reproducibility.
- static CsvDocument systematic(CsvDocument doc, int interval): Performs systematic sampling, selecting every Nth row starting from a random offset.
- static CsvDocument systematic(CsvDocument doc, int interval, int startOffset): Performs systematic sampling with a specified start offset.
Method Details
random
static CsvDocument random(CsvDocument doc, int sampleSize)
Randomly samples rows without replacement. Uses a Fisher-Yates shuffle on row indices. If sampleSize >= rowCount, the entire document is returned.
- Parameters:
doc - the source document
sampleSize - the number of rows to sample
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc is null
OpenCsvException - if sampleSize is not positive
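The Fisher-Yates selection described for random can be sketched on plain row indices. This is a minimal illustration, not the library's source: the class FisherYatesSketch and the method sampleIndices are hypothetical names, the sketch throws IllegalArgumentException where the library documents OpenCsvException, and returning the indices in original order when sampleSize >= rowCount is only one reading of "the entire document is returned".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class FisherYatesSketch {

    /** Picks distinct row indices with a partial Fisher-Yates shuffle (hypothetical helper). */
    static List<Integer> sampleIndices(int rowCount, int sampleSize, Random rng) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        int[] indices = new int[rowCount];
        for (int i = 0; i < rowCount; i++) {
            indices[i] = i;
        }
        List<Integer> picked = new ArrayList<>();
        if (sampleSize >= rowCount) {
            // Assumption: "the entire document" means every row, in original order.
            for (int index : indices) {
                picked.add(index);
            }
            return picked;
        }
        for (int i = 0; i < sampleSize; i++) {
            // Swap slot i with a uniformly chosen slot in [i, rowCount).
            int j = i + rng.nextInt(rowCount - i);
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
            picked.add(indices[i]);
        }
        return picked;
    }

    public static void main(String[] args) {
        // Two generators with the same seed produce the same sample.
        List<Integer> first = sampleIndices(100, 10, new Random(42L));
        List<Integer> second = sampleIndices(100, 10, new Random(42L));
        System.out.println(first.equals(second)); // prints "true"
        System.out.println(first);
    }
}
```

Stopping the shuffle after sampleSize swaps is enough because each swap fixes one uniformly chosen, not yet selected index, so no row can be picked twice.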
random
static CsvDocument random(CsvDocument doc, int sampleSize, long seed)
Randomly samples rows without replacement, using a seed for reproducibility.
- Parameters:
doc - the source document
sampleSize - the number of rows to sample
seed - the random seed
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc is null
OpenCsvException - if sampleSize is not positive
systematic
static CsvDocument systematic(CsvDocument doc, int interval)
Performs systematic sampling, selecting every Nth row starting from a random offset.
- Parameters:
doc - the source document
interval - the sampling interval (every Nth row)
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc is null
OpenCsvException - if interval is not positive
systematic
static CsvDocument systematic(CsvDocument doc, int interval, int startOffset)
Performs systematic sampling with a specified start offset.
- Parameters:
doc - the source document
interval - the sampling interval
startOffset - the 0-based starting row offset
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc is null
OpenCsvException - if interval is not positive or startOffset is invalid
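Both systematic overloads reduce to stepping through row indices at a fixed interval. The sketch below shows that idea on plain indices and is not the library's code: SystematicSketch and systematicIndices are hypothetical names, IllegalArgumentException stands in for OpenCsvException, only a negative startOffset is rejected because the documentation does not say which offsets count as invalid, and drawing the random offset from [0, interval) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class SystematicSketch {

    /** Returns startOffset, startOffset + interval, ... while the index is below rowCount. */
    static List<Integer> systematicIndices(int rowCount, int interval, int startOffset) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (startOffset < 0) {
            throw new IllegalArgumentException("startOffset must not be negative");
        }
        List<Integer> picked = new ArrayList<>();
        for (int i = startOffset; i < rowCount; i += interval) {
            picked.add(i);
        }
        return picked;
    }

    /** Same walk, with the offset drawn uniformly from [0, interval) (assumed range). */
    static List<Integer> systematicIndices(int rowCount, int interval, Random rng) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return systematicIndices(rowCount, interval, rng.nextInt(interval));
    }

    public static void main(String[] args) {
        // 12 rows, every 5th row, starting at row 2.
        System.out.println(systematicIndices(12, 5, 2)); // prints "[2, 7]"
    }
}
```

Unlike random sampling, the result size here is determined by the interval and the offset rather than requested directly: roughly rowCount / interval rows.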
stratified
static CsvDocument stratified(CsvDocument doc, String column, int sampleSize)
Performs stratified sampling, drawing proportionally from each group defined by a column. Groups rows by the value of the specified column, then samples proportionally from each group. Each group gets at least 1 row if possible. The total may differ slightly from sampleSize due to rounding.
- Parameters:
doc - the source document
column - the column to group by
sampleSize - the target total sample size
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc or column is null
OpenCsvException - if sampleSize is not positive or column is not found
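The rounding effect mentioned above is easiest to see in the per-group allocation step. The following sketch computes only the quotas, under stated assumptions: StratifiedSketch and allocate are hypothetical names, the quota rule (round half up, at least 1, at most the group size) is a guess at what "proportionally" and "at least 1 row if possible" mean, and IllegalArgumentException stands in for OpenCsvException. The library's actual rule may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StratifiedSketch {

    /** Maps each group value to the number of rows to draw from it (hypothetical helper). */
    static Map<String, Integer> allocate(List<String> groupValues, int sampleSize) {
        if (sampleSize <= 0) {
            throw new IllegalArgumentException("sampleSize must be positive");
        }
        // Count rows per group, keeping first-seen order.
        Map<String, Integer> sizes = new LinkedHashMap<>();
        for (String value : groupValues) {
            sizes.merge(value, 1, Integer::sum);
        }
        int total = groupValues.size();
        Map<String, Integer> quotas = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : sizes.entrySet()) {
            int groupSize = entry.getValue();
            // Proportional share, rounded half up (assumed rounding rule).
            long rounded = Math.round((double) groupSize * sampleSize / total);
            // At least one row per group, never more than the group holds.
            int quota = (int) Math.max(1L, Math.min(rounded, (long) groupSize));
            quotas.put(entry.getKey(), quota);
        }
        return quotas;
    }

    public static void main(String[] args) {
        // 6 rows of "a", 3 of "b", 1 of "c"; target sample size 5.
        List<String> column = List.of("a", "a", "a", "a", "a", "a", "b", "b", "b", "c");
        System.out.println(allocate(column, 5)); // prints "{a=3, b=2, c=1}"
    }
}
```

In this example the quotas add up to 6 rather than the requested 5, because the shares 1.5 and 0.5 both round up. Drawing the rows within each group could then reuse the same shuffle-based selection as random.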
stratified
static CsvDocument stratified(CsvDocument doc, String column, int sampleSize, long seed)
Performs stratified sampling with a seed for reproducibility.
- Parameters:
doc - the source document
column - the column to group by
sampleSize - the target total sample size
seed - the random seed
- Returns:
a new document containing the sampled rows
- Throws:
NullPointerException - if doc or column is null
OpenCsvException - if sampleSize is not positive or column is not found