public class WindowedWordCount extends Object
This class, WindowedWordCount, is the last in a series of four successively more
detailed 'word count' examples. First take a look at MinimalWordCount,
WordCount, and DebuggingWordCount.
Basic concepts, also in the MinimalWordCount, WordCount, and DebuggingWordCount examples: Reading text files; counting a PCollection; writing to GCS; executing a Pipeline both locally and using the Dataflow service; defining DoFns; creating a custom aggregator; user-defined PTransforms; defining PipelineOptions.
New Concepts:
1. Unbounded and bounded pipeline input modes 2. Adding timestamps to data 3. PubSub topics as sources 4. Windowing 5. Re-using PTransforms over windowed PCollections 6. Writing to BigQuery
To execute this pipeline locally, specify general pipeline configuration:
--project=YOUR_PROJECT_ID
To execute this pipeline using the Dataflow service, specify pipeline configuration:
--project=YOUR_PROJECT_ID
--stagingLocation=gs://YOUR_STAGING_DIRECTORY
--runner=BlockingDataflowPipelineRunner
Optionally specify the input file path via:
--inputFile=gs://INPUT_PATH,
which defaults to gs://dataflow-samples/shakespeare/kinglear.txt.
Specify an output BigQuery dataset and optionally, a table for the output. If you don't
specify the table, one will be created for you using the job name. If you don't specify the
dataset, a dataset called dataflow-examples must already exist in your project.
--bigQueryDataset=YOUR-DATASET --bigQueryTable=YOUR-NEW-TABLE-NAME.
Decide whether you want your pipeline to run with 'bounded' (such as files in GCS) or
'unbounded' input (such as a PubSub topic). To run with unbounded input, set
--unbounded=true. Then, optionally specify the Google Cloud PubSub topic to read from
via --pubsubTopic=projects/PROJECT_ID/topics/YOUR_TOPIC_NAME. If the topic does not
exist, the pipeline will create one for you. It will delete this topic when it terminates.
The pipeline will automatically launch an auxiliary batch pipeline to populate the given PubSub
topic with the contents of the --inputFile, in order to make the example easy to run.
If you want to use an independently-populated PubSub topic, indicate this by setting
--inputFile="". In that case, the auxiliary pipeline will not be started.
By default, the pipeline will do fixed windowing, on 1-minute windows. You can
change this interval by setting the --windowSize parameter, e.g. --windowSize=10
for 10-minute windows.
| Modifier and Type | Class and Description |
|---|---|
static interface |
WindowedWordCount.Options
Options supported by
WindowedWordCount. |
| Constructor and Description |
|---|
WindowedWordCount() |
public static void main(String[] args) throws IOException
IOException