Sample: Opinion Analysis of News, Threaded Conversations, and User Generated Content

This sample uses Cloud Dataflow to build an opinion analysis processing pipeline for news, threaded conversations on sites such as Hacker News, Reddit, or Twitter, and other user-generated content such as email.

Opinion analysis can be used for lead generation, user research, or automated testimonial harvesting.

About the sample

This sample contains three components:

How to run the sample

The steps for configuring and running this sample are as follows:


Set up your Google Cloud Platform project and permissions

If they are not already on your system, install the tools needed to compile and deploy the code in this sample: git, the Google Cloud SDK, Python (for orchestration scripts), and Java and Maven (for the Dataflow pipelines)

Create and set up a Cloud Storage bucket and Cloud Pub/Sub topics

Create or verify a configuration for your project
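Before working through the steps above, it can help to confirm that the required tools are installed. The following is a minimal sketch using only the Python standard library. The command names in REQUIRED_TOOLS are assumptions about how each tool appears on your PATH (for example, Python may be installed as python3), and missing_tools is a hypothetical helper, not part of the sample.

```python
import shutil

# Assumed command names for the tools listed in the setup steps.
# Adjust them if your system uses different names (e.g. "python3").
REQUIRED_TOOLS = ["git", "gcloud", "python", "java", "mvn"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

An empty result only means each command was found on PATH. It does not check tool versions.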

Important: This tutorial uses several billable components of Google Cloud Platform. New Cloud Platform users may be eligible for a free trial.

Clone the sample code

To clone the GitHub repository to your computer, run the following command:

git clone

Go to the dataflow-opinion-analysis directory. The exact path depends on where you placed the directory when you cloned the sample files from GitHub.

cd dataflow-opinion-analysis

Specify cron jobs for the App Engine scheduling app

After you deploy the App Engine application, it uses the App Engine Cron Service to schedule sending messages to the Cloud Pub/Sub control topics. If the control Cloud Pub/Sub topic specified in your Python scripts does not exist, the application creates it.

You can see the cron jobs in the Cloud Console under:

Compute > App Engine > Task queues > Cron Jobs

You can also see the control topic in the Cloud Console:

Big Data > Pub/Sub
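The create-if-missing behavior described above can be sketched as follows. This is an illustration, not the sample's actual scheduling code: the helper names, the placeholder project and topic names, and the b"start" payload are all assumptions. The sketch relies on the list_topics, create_topic, and publish methods of PublisherClient in the google-cloud-pubsub package, which is imported lazily so that the helpers can be read and tested without it.

```python
def topic_path(project_id, topic_id):
    """Build the fully qualified Pub/Sub topic name."""
    return f"projects/{project_id}/topics/{topic_id}"


def ensure_topic(publisher, project_id, topic_id):
    """Create the topic if it is not already listed in the project."""
    path = topic_path(project_id, topic_id)
    existing = publisher.list_topics(request={"project": f"projects/{project_id}"})
    if not any(topic.name == path for topic in existing):
        publisher.create_topic(request={"name": path})
    return path


def send_control_message(publisher, project_id, topic_id, payload):
    """Publish `payload` (bytes) to the control topic and return the message ID."""
    path = ensure_topic(publisher, project_id, topic_id)
    future = publisher.publish(path, payload)
    return future.result()  # blocks until the service acknowledges the message


def main():
    # Requires the google-cloud-pubsub package and application default credentials.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # "my-project" and "my-control-topic" are placeholders, not names from the sample.
    print(send_control_message(publisher, "my-project", "my-control-topic", b"start"))
```

Listing topics before creating one keeps the sketch free of library-specific exception handling. Two processes running it at the same moment could still both attempt the create call, so production code would normally also tolerate an "already exists" error.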

Create the BigQuery dataset

Table schema definitions are located in the *Schema.json files in the bigquery directory. View definitions are located in the shell script
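To see which tables and columns the schema files define, a short script can read them. The sketch below assumes each *Schema.json file holds a JSON array of field objects with a "name" key, which is the usual BigQuery schema file layout but is not confirmed by this document. load_schemas is a hypothetical helper.

```python
import glob
import json
import os


def load_schemas(directory="bigquery"):
    """Map each *Schema.json file name to the list of column names it defines."""
    schemas = {}
    for path in sorted(glob.glob(os.path.join(directory, "*Schema.json"))):
        with open(path, encoding="utf-8") as handle:
            fields = json.load(handle)  # assumed: a JSON array of field objects
        schemas[os.path.basename(path)] = [field["name"] for field in fields]
    return schemas


if __name__ == "__main__":
    for file_name, columns in load_schemas().items():
        print(file_name + ": " + ", ".join(columns))
```

Run it from the repository root so that the default bigquery directory resolves correctly.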

Deploy the Dataflow pipelines

Download and install the Sirocco sentiment analysis packages

If you would like to use this sample for deep textual analysis, download and install Sirocco, a framework maintained by @datancoffee.

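# Install the sirocco-sa jar into your local Maven repository
# (replace x.y.z with the version you downloaded)
mvn install:install-file \
  -DgroupId=sirocco.sirocco-sa \
  -DartifactId=sirocco-sa \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-sa-x.y.z.jar

# Install the sirocco-mo jar the same way
mvn install:install-file \
  -DgroupId=sirocco.sirocco-mo \
  -DartifactId=sirocco-mo \
  -Dpackaging=jar \
  -Dversion=x.y.z \
  -Dfile=sirocco-mo-x.y.z.jar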
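The two install commands differ only in the artifact name, so an orchestration script could generate them from the version number. The following sketch builds the same arguments shown above. install_command and install_sirocco are hypothetical helpers, and the sketch assumes the jar files sit in the current directory and follow the artifact-version.jar naming used in the commands.

```python
import subprocess


def install_command(artifact, version):
    """Build the mvn install:install-file argument list for one Sirocco jar."""
    return [
        "mvn", "install:install-file",
        f"-DgroupId=sirocco.{artifact}",
        f"-DartifactId={artifact}",
        "-Dpackaging=jar",
        f"-Dversion={version}",
        f"-Dfile={artifact}-{version}.jar",
    ]


def install_sirocco(version, run=subprocess.run):
    """Install both Sirocco jars; `run` can be swapped out for a dry run."""
    for artifact in ("sirocco-sa", "sirocco-mo"):
        run(install_command(artifact, version), check=True)
```

Passing the arguments as a list avoids shell line continuations entirely, which sidesteps the class of error where a stray trailing backslash joins two commands.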

Build and Deploy your Controller pipeline to Cloud Dataflow

Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly, as described in the Release Notes for version 0.6.4.

cd scripts
cd ..
scripts/ &

Run a verification job

Note (May 22, 2018): We are in the process of updating the Controller pipeline. Skip this step and instead launch Indexing jobs directly, as described in the Release Notes for version 0.6.4.

You can use the included news articles (from Google's blogs) in the src/test/resources/input directory to run a test pipeline.

SELECT * FROM opinions.sentiment 
ORDER BY DocumentTime DESC
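The same check can be run from a Python script instead of the BigQuery console. This is a sketch under assumptions: it uses the google-cloud-bigquery client library (imported lazily, so the query builder works without it), relies on your default project and credentials, and adds an optional LIMIT that is not part of the query above. Both function names are hypothetical.

```python
def verification_query(dataset="opinions", table="sentiment", limit=None):
    """Return the verification query, optionally capped at `limit` rows."""
    sql = f"SELECT * FROM {dataset}.{table}\nORDER BY DocumentTime DESC"
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


def run_verification(limit=10):
    """Run the query and return the most recent rows as a list."""
    # Requires the google-cloud-bigquery package and application default credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    return list(client.query(verification_query(limit=limit)).result())
```

If run_verification returns rows, the test pipeline wrote sentiment records to the dataset. An empty list suggests the job has not finished or did not write any output.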

Clean up

Now that you have tested the sample, delete the cloud resources you created to prevent further billing for them on your account.


Copyright 2017 Google Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.