Simplest way to get Tweets into BigQuery. Uses Google Cloud & App Engine, as well as Python and D3.
This sample code helps you stream Twitter data into BigQuery and run simple visualizations. It also generates queries that you can run directly in the BigQuery interface or extend for your own applications.
Additionally, you can use other public or private datasets in BigQuery to do additional joins and develop other insights/correlations.
To work with Google Cloud and BigQuery, follow the instructions below to create a new project and service account and to get your PEM file.
Convert the P12 key to a PEM file with the following:

```
cat key.p12 | openssl pkcs12 -nodes -nocerts -passin pass:notasecret | openssl rsa > key.pem
```
Fill out the following fields:
Run `setup.py` to generate the appropriate yaml and config files in the `image_gnip` and `image_twitter` directories.
As a prerequisite for setting up BigQuery, you first need to set up a billing account. To do so:
The enclosed sample includes a simple `load.py` file to stream Tweets directly into BigQuery.
Run `python load.py` to begin loading data from your local machine.
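Conceptually, the streaming load works like the sketch below. This is an illustration, not the repo's actual `load.py`: the `insert_rows` callable stands in for a BigQuery streaming-insert call, and the field selection is an assumption about the schema.

```python
def tweet_to_row(tweet):
    """Pick a few fields out of a raw tweet dict; the real schema is richer."""
    user = tweet.get("user") or {}
    return {
        "id_str": tweet.get("id_str"),
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        "screen_name": user.get("screen_name"),
    }


def stream_tweets(insert_rows, tweets, batch_size=500):
    """Push tweets through insert_rows in batches.

    insert_rows is a callable taking a list of row dicts -- in practice a
    wrapper around a BigQuery streaming-insert request (assumed, not shown).
    """
    batch = []
    for tweet in tweets:
        batch.append(tweet_to_row(tweet))
        if len(batch) >= batch_size:
            insert_rows(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_rows(batch)
```

Batching matters here because BigQuery streaming inserts are billed and rate-limited per request, so one row per request is wasteful.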
When developing on top of the Twitter platform, you must abide by the Developer Agreement & Policy.
Most notably, you must respect the section entitled "Maintain the Integrity of Twitter's Products", including removing all relevant Content with regard to unfavorites, deletes and other user actions.
To help simplify your setup, this project is designed to use:
```
curl https://sdk.cloud.google.com | bash
```
The `Dockerfile` describes the required libraries and packaging for the container. The steps below walk through creating your own container and deploying it to Google Compute Engine.
```
# start docker locally
boot2docker start
$(boot2docker shellinit)
```
Build and run the docker image locally:

```
docker build -t gcr.io/twitter_for_bigquery/image .
docker run -i -t gcr.io/twitter_for_bigquery/image
```
Push to the Google Cloud container registry:

```
gcloud preview docker push gcr.io/twitter_for_bigquery/image
```
Create an instance with the docker container:

```
gcloud compute instances create examplecontainervm01
```
Log into the new instance:

```
gcloud compute instances list
gcloud compute --project "twitter-for-bigquery" ssh --zone "us-central1-b" "examplecontainervm01"
```
Pull the container and run it in docker:

```
sudo docker pull gcr.io/twitter_for_bigquery/image
sudo docker run -d gcr.io/twitter_for_bigquery/image
```
View the logs to confirm it's running:

```
sudo -s
sudo docker ps
sudo docker logs --follow=true 5d
```
More notes for Docker + Google Cloud:
From the command line, you can use `dev_appserver.py` to run your local server. You'll need to specify your service account and private key file on the command line, like so:

```
dev_appserver.py . --appidentity_email_address="[email protected]" --appidentity_private_key_path=/PATH/TO/key.pem
```
Once this is complete, open your browser to http://localhost:8080.
To run in Google App Engine, do the following:
In the "Extra Flags" section, add the same command line flags as above:

```
--appidentity_email_address="[email protected]" --appidentity_private_key_path=/PATH_TO/key.pem
```
To confirm the deploy worked, you can do the following to view the logs:
If you need large amounts of past Tweets loaded into BigQuery, you will need to use Gnip's Historical PowerTrack. The best way to load large volumes of Tweets is:
Run the `batch.py` file to process each gzip file and load it into BigQuery.
When running the above processing, choose an environment that is optimized for network performance, as you may be downloading multiple gigabytes of files to your server and then streaming them into BigQuery.
Install the Google Cloud SDK, which includes the BigQuery command line tool:

```
curl https://sdk.cloud.google.com | bash
```
The `load.py` file takes Tweets and loads them one-by-one into BigQuery. Some basic scrubbing of the data is done to simplify the dataset. (For more information, see the `Utils.scrub()` function.) Additionally, JSON files are provided in `/schema` as samples of the data formats from Gnip/Twitter as stored in BigQuery.
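The exact behavior of `Utils.scrub()` lives in the source; as a rough illustration, scrubbing of this kind often means recursively dropping null and empty fields, along these lines (an assumption, not the project's actual code):

```python
def scrub(value):
    """Recursively remove None values and empty containers from a tweet dict."""
    if isinstance(value, dict):
        cleaned = {k: scrub(v) for k, v in value.items()}
        # drop keys whose values scrubbed down to nothing
        return {k: v for k, v in cleaned.items() if v not in (None, {}, [])}
    if isinstance(value, list):
        return [scrub(v) for v in value if v is not None]
    return value
```

Raw Tweet payloads carry many nullable fields, so a pass like this keeps the BigQuery rows compact and the schema simpler.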
To help you get started, below are some sample queries.
Querying for Tweets containing a specific word or phrase.
```sql
SELECT text FROM [twitter.tweets] WHERE text CONTAINS ' something ' LIMIT 10
```
Searching for specific hashtags.
```sql
SELECT entities.hashtags.text, HOUR(TIMESTAMP(created_at)) AS create_hour, COUNT(*) AS count
FROM [twitter.tweets]
WHERE LOWER(entities.hashtags.text) IN ('john', 'paul', 'george', 'ringo')
GROUP BY create_hour, entities.hashtags.text
ORDER BY entities.hashtags.text ASC, create_hour ASC
```

Note that the search terms must be lowercase, since the `WHERE` clause lowercases the hashtag text before comparing.
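If you generate this query from Python, lowercasing the terms up front keeps the `LOWER(...)` comparison able to match. The helper below is an illustrative assumption, not part of the repo:

```python
def hashtag_hour_query(table, hashtags):
    """Build the hourly hashtag-count query for a given table and tag list."""
    terms = ", ".join("'%s'" % tag.lower() for tag in hashtags)
    return (
        "SELECT entities.hashtags.text, HOUR(TIMESTAMP(created_at)) AS create_hour, "
        "COUNT(*) AS count "
        "FROM [%s] "
        "WHERE LOWER(entities.hashtags.text) IN (%s) "
        "GROUP BY create_hour, entities.hashtags.text "
        "ORDER BY entities.hashtags.text ASC, create_hour ASC" % (table, terms)
    )
```

For hashtags coming from untrusted input you would want proper escaping or query parameters rather than string interpolation.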
Listing the most popular Twitter applications.
```sql
SELECT source, COUNT(*) AS count
FROM [twitter.tweets]
GROUP BY source
ORDER BY count DESC
LIMIT 1000
```
Finding the most popular content shared on Twitter.
```sql
SELECT text, entities.urls.url
FROM [twitter.tweets]
WHERE entities.urls.url IS NOT NULL
LIMIT 10
```
Users that tweet the most.
```sql
SELECT user.screen_name, COUNT(*) AS count
FROM [twitter.tweets]
GROUP BY user.screen_name
ORDER BY count DESC
LIMIT 10
```
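Since the project visualizes results with D3, query rows typically need to become a small JSON document first. A sketch, assuming rows arrive as `(screen_name, count)` tuples (the shape D3 receives is an assumption, not dictated by the repo):

```python
import json


def rows_to_d3_json(rows):
    """Convert (label, count) query rows into the array-of-objects
    shape that D3 charts commonly consume."""
    return json.dumps([{"label": label, "count": count} for label, count in rows])
```

The resulting string can be served from an App Engine handler and passed straight to `d3.json()` on the client.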
To learn more about querying, see the [BigQuery query reference](https://cloud.google.com/bigquery/query-reference).
Using BigQuery allows you to combine Twitter data with other public sources of information. Here are some ideas to inspire your next project:
You can also visit http://demo.redash.io/ to perform queries and visualizations against publicly available data sources.
You will want to create your own app_id in `app.yaml`. If that does not work, then per [this thread](http://stackoverflow.com/questions/10407955/google-app-engine-this-application-does-not-exist), try the following:
The default Google AppEngine TaskQueue (named 'default') has a limit of 10 minutes for any task. To run a task for longer, you need to set up a custom task queue and a backend server. The instructions are above, but the basics include:
Run the `appcfg.py update app.yaml backfill.yaml` command to start both the main app and the background app.
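The custom queue from the steps above might be declared in a `queue.yaml` along these lines; the queue name, rate, and retry limit here are illustrative assumptions, not the project's actual configuration:

```yaml
queue:
- name: backfill
  rate: 1/s
  retry_parameters:
    task_retry_limit: 2
```

Deploy queue changes with `appcfg.py update_queues .` from the application directory.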
Google AppEngine has usage quotas to regulate billing and usage. You can read about the quotas for various products here:
To increase quota limits, you can go into Compute->App Engine->Settings and edit your daily budget to allow for increased usage.
The following documents serve as additional information on streaming data from Twitter and working with BigQuery.
The following developers and bloggers have aided greatly in the development of this source. I'm appreciative of contributions and knowledge sharing.