A fast to iterate, fast to run, Go based toolkit for ETL and feature extraction on Hadoop.

Use crunch-starter for a boilerplate project to kickstart a production setup.

Quick Start

Crunch is optimized to be a big-bang-for-the-buck library, yet almost every aspect is extensible.

Let's say you have a log of semi-structured and deeply nested JSON. Each line contains a record.

You would like to:

  1. Parse JSON records
  2. Extract fields
  3. Cleanup/process fields
  4. Extract features - run custom code on field values and output the result as new field(s)

So here's a detailed view:

// Describe your row
transform := crunch.NewTransformer()
row := crunch.NewRow()
// Use "field_name type". Types are Hive types.
row.FieldWithValue("ev_smp int", "1.0")
// If no type given, assume 'string'
row.FieldWithDefault("ip", "", makeQuery("head.x-forwarded-for"), transform.AsIs)
row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)
row.Feature("doing ip to location", []string{"country", "city"},
  func(r crunch.DataReader, row *crunch.Row)[]string{
    // call your "standard" Go code for doing ip2location
    return ip2location(row["ip"])
  })

// By default, will build a hadoop-compatible streamer process that understands json: (stdin[JSON] to stdout[TSV])
// Also will plug-in Crunch's CLI utility functions (use -help)
crunch.ProcessJson(row)
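
The makeQuery helper used above is not shown in this README. As a rough illustration of the 'Extract' step only, here is a minimal sketch of a dotted-path lookup over a decoded JSON record, using just the Go standard library. The name lookupPath and its signature are assumptions made for this example and are not part of Crunch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookupPath is a hypothetical helper (not part of Crunch): it walks a decoded
// JSON object along a dot-separated path and returns the value found there.
func lookupPath(doc map[string]interface{}, path string) (interface{}, bool) {
	var cur interface{} = doc
	for _, key := range strings.Split(path, ".") {
		obj, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false // the path goes deeper than the data does
		}
		cur, ok = obj[key]
		if !ok {
			return nil, false // key missing at this level
		}
	}
	return cur, true
}

func main() {
	line := `{"head":{"x-forwarded-for":"10.0.0.1"},"action":{"source":"web"}}`
	var doc map[string]interface{}
	if err := json.Unmarshal([]byte(line), &doc); err != nil {
		panic(err)
	}
	ip, ok := lookupPath(doc, "head.x-forwarded-for")
	fmt.Println(ip, ok)
	_, ok = lookupPath(doc, "action.timestamp") // absent in this record
	fmt.Println(ok)
}
```

A miss is reported through the boolean rather than an error, which makes it easy for the caller to fall back to a field's default value.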

Build your processor

$ go build my_processor.go

Generate a Pig driver that uses your processor binary, and a Hive table creation DDL:
$ ./my_processor -crunch.stubs="."

You can now ship your binary and scripts (crunch.hql, crunch.pig) to your cluster.

In your cluster, you can now setup your table with Hive and run an ETL job with Pig:

$ hive -f crunch.hql
$ pig -stop_on_failure --param inurl=s3://inbucket/logs/dt=20140304 --param outurl=s3://outbucket/success/dt=20140304 crunch.pig

Row Setup

The row setup is the most important part of the processor.

Make a row:

transform := crunch.NewTransformer()
row := crunch.NewRow()

And start describing fields in it:

row.FieldWithDefault("name type", "default-value", <lookup function>, <transform function>)

A field description is:

  • A "name type" pair, where types are Hive types.
  • A default value (other variants of this exist; see the API docs).
  • A lookup function (the 'Extract' part of ETL) - see one in the example processor. It outputs the raw extracted value.
  • A transform function, which should eventually represent that value as a string, though its contents can change based on semantics (JSON, int values, dates, etc.).
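
To make the transform part of a field description more concrete, here is a minimal standard-library sketch. It assumes the lookup yields an untyped Go value (interface{}) and that a transform turns it into a string; the Transform type and the asIs, asInt, and withDefault functions are hypothetical names for this example, not Crunch's actual API.

```go
package main

import (
	"fmt"
	"strconv"
)

// Transform is a hypothetical signature: it turns a looked-up value into the
// string that ends up in the output column.
type Transform func(v interface{}) string

// asIs keeps strings untouched and falls back to Go's default formatting.
func asIs(v interface{}) string {
	if s, ok := v.(string); ok {
		return s
	}
	return fmt.Sprint(v)
}

// asInt renders numbers without a fractional part (truncating), which suits a
// Hive "int" column. encoding/json decodes JSON numbers as float64.
func asInt(v interface{}) string {
	switch n := v.(type) {
	case float64:
		return strconv.FormatInt(int64(n), 10)
	case string:
		if f, err := strconv.ParseFloat(n, 64); err == nil {
			return strconv.FormatInt(int64(f), 10)
		}
	}
	return "" // not a number we know how to render
}

// withDefault wraps a transform so that missing (nil) values yield def.
func withDefault(def string, t Transform) Transform {
	return func(v interface{}) string {
		if v == nil {
			return def
		}
		return t(v)
	}
}

func main() {
	ts := withDefault("0", asInt)
	fmt.Println(ts(1393891200.0), ts(nil), asIs("web"))
}
```

Keeping the default outside the transform means each transform only has to handle values that were actually found.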

The Processor

Crunch comes with a built-in processor rig that packs its API into a ready-made processor, invoked in the Quick Start as crunch.ProcessJson(row).

This processor reads JSON and outputs Hadoop-streaming TSV that is compatible with Pig STREAM (which we use later), based on your row description and functions.

It also injects the following commands into your binary:

$ ./simple_processor -help
Usage of ./simple_processor:
  -crunch.cpuprofile="": Turn on CPU profiling and write to the specified file.
  -crunch.hivetemplate="": Custom Hive template for stub generation.
  -crunch.pigtemplate="": Custom Pig template for stub generation.
  -crunch.stubs="": Generate stubs and output to given path, and exit.
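
For readers curious how namespaced flags like these can be wired up, here is a minimal sketch using Go's standard flag package, which accepts dots in flag names. The options struct and parseOptions function are hypothetical and only mirror two of the flags listed above; how Crunch registers them internally is not documented here.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options mirrors two of the flags listed above; the struct is hypothetical.
type options struct {
	stubs      string
	cpuprofile string
}

// parseOptions registers the dotted flag names on a private FlagSet and
// parses args (typically os.Args[1:]).
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("processor", flag.ContinueOnError)
	fs.StringVar(&o.stubs, "crunch.stubs", "", "Generate stubs and output to given path, and exit.")
	fs.StringVar(&o.cpuprofile, "crunch.cpuprofile", "", "Turn on CPU profiling and write to the specified file.")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseOptions(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	if o.stubs != "" {
		fmt.Println("would generate stubs in", o.stubs)
		return
	}
	fmt.Println("would start streaming")
}
```

Prefixing every flag with a namespace keeps the injected flags from colliding with flags your own processor code defines.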

Building a binary

Since Go packs all dependencies into your binary, this makes a great delivery package for Hadoop.

Simply take a starter processor and base your own processor on it. Then build it:
$ go build simple_processor.go
$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql

The resulting binary should be ready for action, using Pig (see next section).

Generating Pig and Hive stubs

Crunch injects useful commands into your processor. One of them generates the scripts that create your Hive table and run your Pig job.

$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql

You can use your own templates with the -crunch.hivetemplate and -crunch.pigtemplate flags, as long as each template includes the special schema pragma (the Pig script has its own) so that Crunch will replace it with the actual Pig or Hive schema.
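
The idea of deriving a schema from the row description and substituting it into a template can be sketched as follows. Everything here is an assumption made for illustration: the %%schema%% placeholder is invented for this example (Crunch's real pragma may be spelled differently), and hiveColumns and render are hypothetical helpers, not Crunch functions.

```go
package main

import (
	"fmt"
	"strings"
)

// hiveColumns turns "name type" field specs into a Hive column list,
// defaulting to string when no type is given.
func hiveColumns(specs []string) string {
	cols := make([]string, 0, len(specs))
	for _, spec := range specs {
		parts := strings.Fields(spec)
		if len(parts) == 0 {
			continue // ignore empty specs
		}
		typ := "string"
		if len(parts) > 1 {
			typ = parts[1]
		}
		cols = append(cols, parts[0]+" "+typ)
	}
	return strings.Join(cols, ", ")
}

// schemaPragma is a made-up placeholder for this sketch.
const schemaPragma = "%%schema%%"

// render replaces every occurrence of the placeholder with the column list.
func render(tmpl string, specs []string) string {
	return strings.ReplaceAll(tmpl, schemaPragma, hiveColumns(specs))
}

func main() {
	tmpl := "CREATE EXTERNAL TABLE events (%%schema%%);"
	fmt.Println(render(tmpl, []string{"ev_smp int", "ip", "ev_ts"}))
}
```

Driving both the TSV output and the generated schema from the same list of field specs is what keeps the table definition and the processor's column order in sync.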

Extending Crunch

[this section is WIP]

Crunch is packaged into use cases accessible from the crunch package, such as the JSON processor used in the Quick Start.

However, beneath the use-case facade lies an extensible API that lets you use Crunch at whatever level of granularity you need.

Some detailed examples can be seen in



Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).


Copyright (c) 2014 Dotan Nahum @jondot. See MIT-LICENSE for further details.
