The Common Crawl Crawler Engine and Related MapReduce code (2008-2012)

This is the primary repository for the services & map-reduce jobs used to produce the CommonCrawl web corpus from 2008 to 2012.

Tree Structure

  • org.commoncrawl.async - Utility code used to build the async server.
  • - ARCInputFormat and related classes.
  • org.commoncrawl.hadoop.mergeutils - Support for merge-sorts outside the context of a Hadoop job.
  • org.commoncrawl.hadoop.template - Sample Hadoop Job.
  • - CommonCrawl IO library used by crawlers.
  • org.commoncrawl.mapred - Root for all MapReduce jobs. Also contains data structure definitions shared across jobs (database.jr).
  • org.commoncrawl.mapred.ec2.parser - Code used to generate ARCFiles and intermediate data on EC2 using EMR.
  • org.commoncrawl.mapred.ec2.postprocess.deduper - Code to support a parallel dedupe using a 64-bit Simhash.
  • org.commoncrawl.mapred.ec2.postprocess.linkCollector - Code to merge metadata generated by the parser job.
  • org.commoncrawl.mapred.pipelineV3 - The start of the new Nutch-free map-reduce pipeline used to process crawl metadata and generate new crawl lists.
  • org.commoncrawl.mapred.segmenter - Support code used to generate Crawl Segments (URL lists consumed by the crawlers).
  • org.commoncrawl.protocol - Shared data structure and enum definitions (generated).
  • org.commoncrawl.rpc - CommonCrawl RPC library used to build distributed systems.
  • org.commoncrawl.server - CommonCrawl Server base class used by various services.
  • org.commoncrawl.service - All long-lived processes in the CommonCrawl system are housed under this directory.
  • org.commoncrawl.service.crawler - The crawler long-running process (consumes Crawl Lists, writes content to HDFS).
  • org.commoncrawl.service.crawlhistory - A service that manages a crawler's crawl state in a BloomFilter.
  • - A barebones service used to store and subscribe to lists via a path.
  • org.commoncrawl.service.dns - CommonCrawl DNS Service (used by crawlers to queue up DNS requests).
  • org.commoncrawl.service.listcrawler - A different type of list crawler that supports dynamic uploading and crawling of very large lists of URLs.
  • org.commoncrawl.service.pagerank - PageRank Master / Slave implementations (and related code) used to compute PageRank across the graph.
  • org.commoncrawl.service.parser - The beginnings of a distributed parser service that crawlers can use to do on-demand link extraction.
  • org.commoncrawl.service.queryserver - The (deprecated) crawl metadata service.
  • org.commoncrawl.service.statscollector - Service that receives crawl stats.
  • org.commoncrawl.util - The catch-all repository of Utility classes used by the CommonCrawl system.
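
The deduper entry above mentions a parallel dedupe built on a 64-bit Simhash. As a rough illustration of the general technique only, and not the code in org.commoncrawl.mapred.ec2.postprocess.deduper, the sketch below builds a 64-bit fingerprint from whitespace-separated tokens and compares two fingerprints by Hamming distance. The class name SimHashSketch, the helper names, the FNV-1a token hash, the whitespace tokenizer, and the distance threshold are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class SimHashSketch {

    // 64-bit FNV-1a hash of one token. This is a stand-in hash chosen for
    // the example; the repository may use a different function.
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xffL);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Each token votes +1 or -1 on every bit position; the fingerprint
    // keeps a 1 wherever the positive votes win.
    static long simhash(String text) {
        int[] votes = new int[64];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty()) {
                continue; // skip the empty token produced by leading whitespace
            }
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    // Number of bit positions in which two fingerprints differ.
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Treat two documents as near-duplicates when their fingerprints differ
    // in at most maxDistance bits (the threshold is a tuning assumption).
    static boolean nearDuplicate(String a, String b, int maxDistance) {
        return hammingDistance(simhash(a), simhash(b)) <= maxDistance;
    }

    public static void main(String[] args) {
        String first = "the quick brown fox jumps over the lazy dog";
        String second = "the quick brown fox jumped over the lazy dog";
        System.out.println("distance = " + hammingDistance(simhash(first), simhash(second)));
        System.out.println("near duplicate = " + nearDuplicate(first, second, 16));
    }
}
```

Because similar documents tend to produce fingerprints that differ in only a few bits, a job can group or sort fingerprints and compare only likely candidates rather than every pair of documents, which is what makes this kind of dedupe amenable to parallel processing.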


This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see


Ahad Rana (ahad at
