public class WindowedWordCount extends Object
This class, WindowedWordCount, is the third in a series of three successively more
detailed 'word count' examples. First take a look at MinimalWordCount and WordCount. This class extends the WordCount class.
Basic concepts, also in the MinimalWordCount and WordCount examples: Reading text files; counting a PCollection; writing to GCS; executing a Pipeline both locally and using the Dataflow service; defining DoFns; creating a custom aggregator; user-defined Ptransforms; defining Pipeline options.
New Concepts:
To execute this pipeline locally, specify general pipeline configuration:
--project=YOUR_PROJECT_ID
To execute this pipeline using the Dataflow service, specify pipeline configuration:
--project=YOUR_PROJECT_ID
--stagingLocation=gs://YOUR_STAGING_DIRECTORY
--runner=BlockingDataflowPipelineRunner
Optionally specify the input file path via:
--inputFile=gs://INPUT_PATH,
which defaults to gs://dataflow-samples/shakespeare/kinglear.txt.
Specify an output BigQuery dataset and optionally, a table for the output. If you don't
specify the table, one will be created for you using the job name. If you don't specify the
dataset, a dataset called dataflow-examples must already exist in your project.
--bigQueryDataset=YOUR-DATASET --bigQueryTable=YOUR-NEW-TABLE-NAME.
Decide whether you want your pipeline to run with 'bounded' or 'unbounded' input.
To run with unbounded input, set:
--unbounded=true.
Then, optionally specify the Google Cloud PubSub topic to read from via
--pubsubTopic=/topics/PROJECT ID/YOUR-TOPIC-NAME.
If the topic does not exist, the pipeline will create one for you. It will delete this topic
when it terminates.
The pipeline will automatically launch an auxiliary batch pipeline to populate the given
PubSub topic with the contents of the --inputFile, in order to make the example easy to run. If
you want to use an independently-populated PubSub topic, indicate this by setting --inputFile to
the empty string. In that case, the auxiliary pipeline will not be started.
By default, the pipeline will do fixed windowing, on 1-minute windows. You can change this interval by setting the --windowSize parameter, e.g. --windowSize=10 for 10-minute windows.
| Modifier and Type | Class and Description |
|---|---|
static interface |
WindowedWordCount.Options
Options supported by
WindowedWordCount. |
| Constructor and Description |
|---|
WindowedWordCount() |
public static void main(String[] args) throws IOException
IOException