T - Type of records represented by the source.public abstract class FileBasedSource<T> extends ByteOffsetBasedSource<T>
Sources. Extend this class to implement your own
file-based custom source.
A file-based Source is a Source backed by a file pattern defined as a Java
glob, a single file, or a offset range for a single file. See ByteOffsetBasedSource for
semantics of offset ranges.
This source stores a String that is an IOChannelFactory specification for a
file or file pattern. There should be an IOChannelFactory defined for the file
specification provided. Please refer to IOChannelUtils and IOChannelFactory for
more information on this.
In addition to the methods left abstract from Source, subclasses must implement
methods to create a sub-source and a reader for a range of a single file -
createForSubrangeOfFile(java.lang.String, long, long) and createSingleFileReader(com.google.cloud.dataflow.sdk.options.PipelineOptions, com.google.cloud.dataflow.sdk.coders.Coder<T>, com.google.cloud.dataflow.sdk.util.ExecutionContext).
| Modifier and Type | Class and Description |
|---|---|
static class |
FileBasedSource.FileBasedReader<T>
A
reader that implements code common to readers of
FileBasedSources. |
static class |
FileBasedSource.Mode
A given
FileBasedSource represents a file resource of one of these types. |
ByteOffsetBasedSource.ByteOffsetBasedReader<T>Source.Reader<T>| Constructor and Description |
|---|
FileBasedSource(boolean isFilePattern,
java.lang.String fileOrPatternSpec,
long minBundleSize,
long startOffset,
long endOffset)
Create a
FileBasedSource based on a file or a file pattern specification. |
| Modifier and Type | Method and Description |
|---|---|
protected Source.Reader<T> |
createBasicReader(PipelineOptions options,
Coder<T> coder,
com.google.cloud.dataflow.sdk.util.ExecutionContext executionContext)
Creates a basic (non-windowed) reader for this source.
|
abstract FileBasedSource<T> |
createForSubrangeOfFile(java.lang.String fileName,
long start,
long end)
Creates and returns a new
FileBasedSource of the same type as the current
FileBasedSource backed by a given file and an offset range. |
abstract FileBasedSource.FileBasedReader<T> |
createSingleFileReader(PipelineOptions options,
Coder<T> coder,
com.google.cloud.dataflow.sdk.util.ExecutionContext executionContext)
Creates and returns an instance of a
FileBasedReader implementation for the current
source assuming the source represents a single file. |
FileBasedSource<T> |
createSourceForSubrange(long start,
long end)
Returns a
ByteOffsetBasedSource for a subrange of the current source. |
long |
getEstimatedSizeBytes(PipelineOptions options)
An estimate of the total size (in bytes) of the data that would be read from this source.
|
java.lang.String |
getFileOrPatternSpec() |
long |
getMaxEndOffset(PipelineOptions options)
Returns the exact ending offset of the current source.
|
FileBasedSource.Mode |
getMode() |
java.util.List<? extends FileBasedSource<T>> |
splitIntoBundles(long desiredBundleSizeBytes,
PipelineOptions options)
Splits the source into bundles.
|
java.lang.String |
toString() |
void |
validate()
Checks that this source is valid, before it can be used in a pipeline.
|
getEndOffset, getMinBundleSize, getStartOffsetcreateWindowedReader, getDefaultOutputCoder, producesSortedKeyspublic FileBasedSource(boolean isFilePattern,
java.lang.String fileOrPatternSpec,
long minBundleSize,
long startOffset,
long endOffset)
FileBasedSource based on a file or a file pattern specification.
See ByteOffsetBasedSource for detailed descriptions of minBundleSize,
startOffset, and endOffset.
isFilePattern - if true provided fileOrPatternSpec may be a file pattern
and FileBasedSource will try to expand the file pattern, if false
provided fileOrPatternSpec will be considered a single file and will be used
verbatim.fileOrPatternSpec - IOChannelFactory specification of file or file pattern
represented by the FileBasedSource.minBundleSize - minimum bundle size in bytes.startOffset - starting byte offset.endOffset - ending byte offset. If the specified value >= #getMaxEndOffset() it
implies #getMaxEndOffSet().public final java.lang.String getFileOrPatternSpec()
public final FileBasedSource.Mode getMode()
public final FileBasedSource<T> createSourceForSubrange(long start, long end)
ByteOffsetBasedSourceByteOffsetBasedSource for a subrange of the current source. [start, end) will
be within the range [startOffset, endOffset] of the current source.createSourceForSubrange in class ByteOffsetBasedSource<T>public abstract FileBasedSource<T> createForSubrangeOfFile(java.lang.String fileName, long start, long end)
FileBasedSource of the same type as the current
FileBasedSource backed by a given file and an offset range. When current source is
being split, this method is used to generate new sub-sources. When creating the source
subclasses must call the constructor of FileBasedSource with exactly the same
start and end values passed here.fileName - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE, in
which case it will be inferred using getMaxEndOffset(com.google.cloud.dataflow.sdk.options.PipelineOptions).public abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options, Coder<T> coder, com.google.cloud.dataflow.sdk.util.ExecutionContext executionContext)
FileBasedReader implementation for the current
source assuming the source represents a single file. File patterns will be handled by
FileBasedSource implementation automatically.public final long getEstimatedSizeBytes(PipelineOptions options) throws java.lang.Exception
SourcegetEstimatedSizeBytes in class Source<T>java.lang.Exceptionpublic final java.util.List<? extends FileBasedSource<T>> splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options) throws java.lang.Exception
Source PipelineOptions can be used to get information such as
credentials for accessing an external storage.
splitIntoBundles in class ByteOffsetBasedSource<T>java.lang.Exceptionprotected final Source.Reader<T> createBasicReader(PipelineOptions options, Coder<T> coder, com.google.cloud.dataflow.sdk.util.ExecutionContext executionContext) throws java.io.IOException
SourcecreateBasicReader in class Source<T>java.io.IOExceptionpublic java.lang.String toString()
toString in class ByteOffsetBasedSource<T>public void validate()
SourceIt is recommended to use Preconditions for implementing
this method.
validate in class ByteOffsetBasedSource<T>public final long getMaxEndOffset(PipelineOptions options) throws java.lang.Exception
ByteOffsetBasedSourceLong.MAX_VALUE.getMaxEndOffset in class ByteOffsetBasedSource<T>java.lang.Exception