Process Common Crawl data with Python and Spark
This project provides examples of how to process the Common Crawl dataset with Apache Spark and Python:
* count HTML tags in Common Crawl's raw response data (WARC files)
* count web server names in Common Crawl's metadata (WAT files or WARC files)
* list host names and corresponding IP addresses (WAT files or WARC files)
* word count (term and document frequency) in Common Crawl's extracted text (WET files)
A list of WARC record coordinates can also be passed as a CSV file via `--csv` to the Spark job.
Further information about the examples and available options is shown via the command-line option `--help`.
To develop and test locally, you will need to install
* Spark, see the detailed instructions, and
* all required Python modules by running `pip install -r requirements.txt` (a quick import check is sketched below)
* (optionally, and only if you want to query the columnar index) install S3 support libraries so that Spark can load the columnar index from S3
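A quick way to verify the local installation is to import the main dependencies and print their versions (a minimal sketch; the module names follow requirements.txt, and warcio's version attribute is looked up defensively since it may not be exposed):

```python
# check_setup.py - hypothetical helper, not part of this repository:
# verify that the core dependencies are importable.
import boto3
import pyspark
import warcio

print("pyspark:", pyspark.__version__)
print("boto3:  ", boto3.__version__)
print("warcio: ", getattr(warcio, "__version__", "(version attribute not exposed)"))
```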
Tested with Spark 2.1.0 – 2.4.6 in combination with Python 2.7 or 3.5, 3.6, 3.7, and with Spark 3.0.0 in combination with Python 3.7 and 3.8
The script `get-data.sh` will download the sample data. It also writes input files containing
* sample input as `file://` URLs
* all input of one monthly crawl as `s3://` URLs
Note that the sample data is from an older crawl (`CC-MAIN-2017-13`, run in March 2017). If you want to use more recent data, please visit the Common Crawl site.
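To get a first look at the sample data without Spark, the WARC file can be iterated directly with warcio; the path below is a placeholder for wherever `get-data.sh` stored the sample (a sketch):

```python
from warcio.archiveiterator import ArchiveIterator

# Placeholder path - point this at the sample WARC file downloaded by get-data.sh.
warc_path = "input/sample.warc.gz"

with open(warc_path, "rb") as stream:
    responses = 0
    for record in ArchiveIterator(stream):
        # 'response' records hold the raw HTTP responses (headers + payload).
        if record.rec_type == "response":
            responses += 1

print("HTTP response records:", responses)
```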
First, point the environment variable `SPARK_HOME` to your Spark installation. Then submit a job via
```
$SPARK_HOME/bin/spark-submit ./server_count.py \
    --num_output_partitions 1 --log_level WARN \
    ./input/test_warc.txt servernames
```
This will count web server names sent in HTTP response headers for the sample WARC input and store the resulting counts in the SparkSQL table "servernames" in your warehouse location defined by `spark.sql.warehouse.dir` (usually `./spark-warehouse/servernames` in your working directory).
The output table can be accessed via SparkSQL, e.g.,
```
$SPARK_HOME/bin/pyspark
>>> df = sqlContext.read.parquet("spark-warehouse/servernames")
>>> for row in df.sort(df.val.desc()).take(10): print(row)
...
Row(key=u'Apache', val=9396)
Row(key=u'nginx', val=4339)
Row(key=u'Microsoft-IIS/7.5', val=3635)
Row(key=u'(no server in HTTP header)', val=3188)
Row(key=u'cloudflare-nginx', val=2743)
Row(key=u'Microsoft-IIS/8.5', val=1459)
Row(key=u'Microsoft-IIS/6.0', val=1324)
Row(key=u'GSE', val=886)
Row(key=u'Apache/2.2.15 (CentOS)', val=827)
Row(key=u'Apache-Coyote/1.1', val=790)
```
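The same lookup can also be scripted with a standalone SparkSession instead of the interactive shell (a sketch, assuming the default warehouse location `spark-warehouse/` in the working directory):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InspectServerNames").getOrCreate()

# Read the Parquet table written by server_count.py and show the most frequent server names.
df = spark.read.parquet("spark-warehouse/servernames")
df.createOrReplaceTempView("servernames")
spark.sql("SELECT key, val FROM servernames ORDER BY val DESC LIMIT 10").show(truncate=False)

spark.stop()
```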
As the Common Crawl dataset lives in the Amazon Public Datasets program, you can access and process it on Amazon AWS (in the us-east-1 AWS region) without incurring any transfer costs. The only cost that you incur is the cost of the machines running your Spark cluster.
A few things to take care of when running the examples in a Spark cluster over large amounts of data:
* spinning up the Spark cluster: AWS EMR contains a ready-to-use Spark installation, but you'll find multiple descriptions on the web how to deploy Spark on a cheap cluster of AWS spot instances. See also launching Spark on a cluster.
* choose job settings appropriate for your cluster and input size (e.g., `--num_output_partitions`, see below)
* don't forget to deploy all dependencies in the cluster, see advanced dependency management
* also the file `sparkcc.py` needs to be deployed or added as argument to `spark-submit` (e.g., via `--py-files sparkcc.py`); a minimal job sketch illustrating this dependency follows below. Note: some of the examples require further Python files as dependencies.
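For illustration, a stripped-down job in the style of the bundled examples: it assumes the `CCSparkJob` base class from `sparkcc.py` with a `process_record()` hook operating on warcio records, which is how `server_count.py` is organized; treat it as a sketch rather than a drop-in example:

```python
from sparkcc import CCSparkJob


class TinyServerCountJob(CCSparkJob):
    """Count the HTTP Server header of WARC response records (illustrative sketch)."""

    name = "TinyServerCount"

    def process_record(self, record):
        # Only WARC 'response' records carry HTTP response headers.
        if record.rec_type != 'response':
            return
        server = record.http_headers.get_header('server')
        # Yield (key, count) pairs; the base class reduces them by key.
        yield (server or '(no server in HTTP header)'), 1


if __name__ == '__main__':
    job = TinyServerCountJob()
    job.run()
```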
All examples show the available command-line options if called with the parameter `--help`, e.g.,
```
$SPARK_HOME/bin/spark-submit ./server_count.py --help
```
It's possible to overwrite Spark properties when submitting the job:
```
$SPARK_HOME/bin/spark-submit \
    --conf spark.sql.warehouse.dir=myWareHouseDir \
    ... (other Spark options, flags, config properties) \
    ./server_count.py \
    ... (program-specific options)
```
While WARC/WAT/WET files are read using boto3, accessing the columnar URL index (see option `--query` of CCIndexSparkJob) is done directly by the SparkSQL engine and requires that S3 support libraries are available. These libs are usually provided when the Spark job is run on a Hadoop cluster running on AWS (e.g., EMR). However, they may not be provided for any Spark distribution and are usually absent when running Spark locally (not in a Hadoop cluster). In these situations, the easiest way is to add the libs as required packages by adding `--packages org.apache.hadoop:hadoop-aws:3.2.0` to the arguments of `spark-submit`. This will make Spark manage the dependencies: the hadoop-aws package and transitive dependencies are downloaded as Maven dependencies. Note that the required version of the hadoop-aws package depends on the Hadoop version bundled with your Spark installation, e.g., Spark 3.0.0 is bundled with Hadoop 3.2.0 (spark-3.0.0-bin-hadoop3.2.tgz).
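If you are unsure which Hadoop version your Spark installation bundles, it can be printed from PySpark via the JVM gateway (an internal but commonly used accessor, shown as a sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HadoopVersionCheck").getOrCreate()
# org.apache.hadoop.util.VersionInfo reports the Hadoop version on the classpath;
# sparkContext._jvm is PySpark-internal, so use this only as a convenience check.
print("Hadoop version:",
      spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
spark.stop()
```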
Please also note that:
- the schema of the URL referencing the columnar index depends on the actual S3 file system implementation: it's `s3://` on EMR but `s3a://` when using s3a.
- data can be accessed anonymously using `s3a.AnonymousAWSCredentialsProvider`. This requires Hadoop 2.9 or newer.
- without anonymous access, valid AWS credentials need to be provided, e.g., by setting `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` in the Spark configuration.
Example call to count words in 10 WARC records hosted under the `.is` top-level domain:
```
$SPARK_HOME/bin/spark-submit \
    --packages org.apache.hadoop:hadoop-aws:3.2.0 \
    --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
    ./cc_index_word_count.py \
    --query "SELECT url, warc_filename, warc_record_offset, warc_record_length, content_charset FROM ccindex WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is' LIMIT 10" \
    s3a://commoncrawl/cc-index/table/cc-main/warc/ \
    myccindexwordcountoutput \
    --num_output_partitions 1 \
    --output_format json
```
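For comparison, roughly the same query can be run from a plain PySpark session without the job wrapper (a sketch; it assumes the hadoop-aws package is on the classpath as described above and uses anonymous access):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("CCIndexQuerySketch")
    # Anonymous access to the public bucket (requires Hadoop 2.9 or newer).
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
    .getOrCreate()
)

# Register the columnar index as a table and query it with SparkSQL.
df = spark.read.parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/")
df.createOrReplaceTempView("ccindex")
spark.sql("""
    SELECT url, warc_filename, warc_record_offset, warc_record_length
    FROM ccindex
    WHERE crawl = 'CC-MAIN-2020-24' AND subset = 'warc' AND url_host_tld = 'is'
    LIMIT 10
""").show(truncate=False)

spark.stop()
```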
The schema of the columnar URL index has been extended over time by adding new columns. If you want to query one of the new columns (e.g., `content_languages`), the following Spark configuration option needs to be set: `--conf spark.sql.parquet.mergeSchema=true`.
However, this option impacts the query performance, so use with care! Please also read cc-index-table about configuration options to improve the performance of Spark SQL queries.
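Schema merging can also be enabled for a single read instead of globally, via the standard Parquet reader option (a sketch assuming a SparkSession as in the examples above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CCIndexMergeSchema").getOrCreate()

# Enable Parquet schema merging for this read only, instead of setting
# spark.sql.parquet.mergeSchema=true for the whole session.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/"))
print(df.schema.simpleString())
```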
Alternatively, it's possible to configure the table schema explicitly:
- download the latest table schema as JSON
- and use it by adding the command-line argument `--table_schema cc-index-schema-flat.json`
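A sketch of how an explicitly provided schema could be applied when reading the table with plain PySpark; the file name mirrors the schema file referenced above, and `StructType.fromJson` is standard PySpark API (whether the downloaded file matches this format is an assumption to verify):

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("CCIndexExplicitSchema").getOrCreate()

# Load the table schema from the downloaded JSON file (name assumed, see above)
# and use it instead of schema inference or merging.
with open("cc-index-schema-flat.json") as f:
    schema = StructType.fromJson(json.load(f))

df = (spark.read
      .schema(schema)
      .parquet("s3a://commoncrawl/cc-index/table/cc-main/warc/"))
df.createOrReplaceTempView("ccindex")
print(df.columns)
```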
Examples are originally ported from Stephen Merity's cc-mrjob with the following changes and upgrades:
* based on Apache Spark (instead of mrjob)
* boto3 supporting multi-part download of data from S3
* warcio, a Python 2 and Python 3 compatible module to access WARC files
Further inspirations are taken from
* cosr-back, written by Sylvain Zimmer for Common Search. You should definitely have a look at it if you need a more sophisticated WARC processor (including an HTML parser, for example).
* Mark Litwintschik's blog post Analysing Petabytes of Websites
MIT License, as per LICENSE