public class DebuggingWordCount extends Object
This class, DebuggingWordCount, is the third in a series of four successively more
detailed 'word count' examples. You may first want to take a look at MinimalWordCount
and WordCount. After you've looked at this example, see the
WindowedWordCount pipeline, which introduces additional concepts.
Basic concepts, also in the MinimalWordCount and WordCount examples: Reading text files; counting a PCollection; executing a Pipeline both locally and using the Dataflow service; defining DoFns.
New Concepts:
1. Logging to Cloud Logging
2. Controlling Dataflow worker log levels
3. Creating a custom aggregator
4. Testing your Pipeline via DataflowAssert
To execute this pipeline locally, specify general pipeline configuration:
--project=YOUR_PROJECT_ID
To execute this pipeline using the Dataflow service and the additional logging discussed below, specify pipeline configuration:
--project=YOUR_PROJECT_ID
--stagingLocation=gs://YOUR_STAGING_DIRECTORY
--runner=BlockingDataflowPipelineRunner
--workerLogLevelOverrides={"com.google.cloud.dataflow.examples":"DEBUG"}
Note that when you run via mvn exec, you may need to escape
the quotation marks as appropriate for your shell. For example, in bash:
mvn compile exec:java ... \
-Dexec.args="... \
--workerLogLevelOverrides={\\\"com.google.cloud.dataflow.examples\\\":\\\"DEBUG\\\"}"
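The quoting in the bash example above can also be produced programmatically. The following is a minimal sketch, not part of the SDK: the class name `ExecArgsEscapingSketch` and the method `escapeForBashExecArgs` are hypothetical, and the rule it applies (each double quote becomes three backslashes followed by a quote) is taken only from the bash example shown here. Other shells may need different escaping.

```java
class ExecArgsEscapingSketch {
    /**
     * Escapes double quotes in a flag so that it can be pasted inside
     * -Dexec.args="..." on a bash command line.
     * Assumption: only double quotes need escaping, as in the bash example.
     */
    static String escapeForBashExecArgs(String flag) {
        // Replace every " with three backslashes followed by ".
        return flag.replace("\"", "\\\\\\\"");
    }

    public static void main(String[] args) {
        // The unescaped flag, as it would be passed directly to the pipeline.
        String flag =
            "--workerLogLevelOverrides={\"com.google.cloud.dataflow.examples\":\"DEBUG\"}";
        // Prints the form to embed in -Dexec.args="..." under bash.
        System.out.println(escapeForBashExecArgs(flag));
    }
}
```

The helper leaves everything except double quotes untouched, so flags without quotes, such as --project, pass through unchanged.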
Concept #2: Dataflow workers that execute user code log to Cloud Logging at the "INFO" level and higher by default. You can override log levels for specific logging namespaces by specifying:
--workerLogLevelOverrides={"Name1":"Level1","Name2":"Level2",...}
For example, by specifying:
--workerLogLevelOverrides={"com.google.cloud.dataflow.examples":"DEBUG"}
when executing this pipeline using the Dataflow service, Cloud Logging would contain
"DEBUG" or higher level logs for the com.google.cloud.dataflow.examples package, in
addition to the default "INFO" or higher level logs for all other namespaces. The default Dataflow worker
logging configuration can also be overridden by specifying
--defaultWorkerLogLevel=<one of TRACE, DEBUG, INFO, WARN, ERROR>. For example,
by specifying --defaultWorkerLogLevel=DEBUG when executing this pipeline with
the Dataflow service, Cloud Logging would contain all "DEBUG" or higher level logs. Note
that changing the default worker log level to TRACE or DEBUG will significantly increase
the volume of log output.
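The interplay between a default level and a per-namespace override can be illustrated locally with java.util.logging. This is only an analogy under stated assumptions: it does not use the Dataflow worker configuration, the class `NamespaceLogLevelSketch` and its methods are hypothetical, `Level.FINE` stands in for "DEBUG", and `org.example.other` is a made-up namespace.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class NamespaceLogLevelSketch {
    // Strong references keep the configured loggers from being garbage collected.
    static final Logger ROOT = Logger.getLogger("");
    static final Logger EXAMPLES =
        Logger.getLogger("com.google.cloud.dataflow.examples");

    /** Sets a default level on the root logger and an override on one namespace. */
    static void configure(Level defaultLevel, Level examplesLevel) {
        ROOT.setLevel(defaultLevel);
        EXAMPLES.setLevel(examplesLevel);
    }

    /** Reports whether a logger with this name would emit a record at this level. */
    static boolean wouldLog(String loggerName, Level level) {
        // Child loggers inherit the level of their nearest configured ancestor.
        return Logger.getLogger(loggerName).isLoggable(level);
    }

    public static void main(String[] args) {
        // Default INFO everywhere, FINE (standing in for DEBUG) for the examples package.
        configure(Level.INFO, Level.FINE);
        String inside = "com.google.cloud.dataflow.examples.DebuggingWordCount";
        String outside = "org.example.other"; // hypothetical namespace
        System.out.println(inside + " FINE: " + wouldLog(inside, Level.FINE));
        System.out.println(outside + " FINE: " + wouldLog(outside, Level.FINE));
        System.out.println(outside + " INFO: " + wouldLog(outside, Level.INFO));
    }
}
```

In this sketch the override lowers the threshold only for loggers under the examples package, while every other namespace keeps the default, which mirrors the behavior described for --workerLogLevelOverrides.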
The input file defaults to gs://dataflow-samples/shakespeare/kinglear.txt and can be
overridden with --inputFile.
| Modifier and Type | Class and Description |
|---|---|
| static class | DebuggingWordCount.FilterTextFn: A DoFn that filters for a specific key based upon a regular expression. |
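The filtering idea behind FilterTextFn, combined with the custom-aggregator concept, can be sketched in plain Java without the Dataflow SDK. Everything here is an assumption made for illustration: the class `FilterTextSketch`, its counter fields, the choice of a full match against the key, and the sample words and counts are hypothetical and are not taken from the actual DoFn.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class FilterTextSketch {
    private final Pattern filter;
    // Plain counters standing in for custom aggregators (hypothetical names).
    long matchedWords = 0;
    long unmatchedWords = 0;

    FilterTextSketch(String pattern) {
        this.filter = Pattern.compile(pattern);
    }

    /** Keeps only the entries whose key fully matches the regular expression. */
    Map<String, Long> apply(Map<String, Long> wordCounts) {
        Map<String, Long> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : wordCounts.entrySet()) {
            if (filter.matcher(entry.getKey()).matches()) {
                matchedWords++;
                kept.put(entry.getKey(), entry.getValue());
            } else {
                unmatchedWords++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical word counts, not real pipeline output.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("king", 3L);
        counts.put("queen", 2L);
        counts.put("lear", 1L);

        FilterTextSketch sketch = new FilterTextSketch("king|lear");
        System.out.println("kept: " + sketch.apply(counts));
        System.out.println("matched=" + sketch.matchedWords
            + " unmatched=" + sketch.unmatchedWords);
    }
}
```

In a real pipeline the counters would be aggregators reported by the service, and the filtered output could be checked with DataflowAssert rather than printed.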
| Constructor and Description |
|---|
| DebuggingWordCount() |
public static void main(String[] args)