crunch

by jondot

A fast to develop, fast to run, Go based toolkit for ETL and feature extraction on Hadoop.

Use crunch-starter for a boilerplate project to kickstart a production setup.

Quick Start

Crunch is optimized to be a big-bang-for-the-buck library, yet almost every aspect is extensible.

Let's say you have a log of semi-structured and deeply nested JSON. Each line contains a record.

You would like to:

  1. Parse JSON records
  2. Extract fields
  3. Clean up and process fields
  4. Extract features - run custom code on field values and output the result as new field(s)

So here's a detailed view:

// Describe your row
transform := crunch.NewTransformer()
row := crunch.NewRow()
// Use "field_name type". Types are Hive types.
row.FieldWithValue("ev_smp int", "1.0")
// If no type given, assume 'string'
row.FieldWithDefault("ip", "0.0.0.0", makeQuery("head.x-forwarded-for"), transform.AsIs)
row.FieldWithDefault("ev_ts", "", makeQuery("action.timestamp"), transform.AsIs)
row.FieldWithDefault("ev_source", "", makeQuery("action.source"), transform.AsIs)
row.Feature("doing ip to location", []string{"country", "city"},
  func(r crunch.DataReader, row *crunch.Row)[]string{
    // call your "standard" Go code for doing ip2location
    return ip2location(row["ip"])
  })

// By default, will build a hadoop-compatible streamer process that understands json: (stdin[JSON] to stdout[TSV])
// Also will plug-in Crunch's CLI utility functions (use -help)
crunch.ProcessJson(row)

Build your processor

$ go build my_processor.go

Generate a Pig driver that uses my_processor, and a Hive table creation DDL:

$ ./my_processor -crunch.stubs="."

You can now ship your binary and scripts (crunch.hql, crunch.pig) to your cluster.

In your cluster, you can now set up your table with Hive and run an ETL job with Pig:

$ hive -f crunch.hql
$ pig -stop_on_failure --param inurl=s3://inbucket/logs/dt=20140304 --param outurl=s3://outbucket/success/dt=20140304 crunch.pig

Row Setup

The row setup is the most important part of the processor.

Make a row:

transform := crunch.NewTransformer()
row := crunch.NewRow()

And start describing fields in it:

row.FieldWithDefault("name type", "default-value", <lookup function>, <transform function>)

A field description is:

  • A name type pair, where types are Hive types.
  • A default value (for FieldWithDefault; there are variants of this, see the API docs).
  • A lookup function (the 'Extract' part of ETL); see one in the example processor. It outputs an interface{}.
  • A transform function, which should eventually represent that interface{} as a string, though its contents can change based on semantics (JSON, int values, dates, etc).
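To make the lookup/transform split concrete, here is a minimal standalone sketch that uses only the Go standard library. It does not use Crunch's own types: lookupFunc, transformFunc, asIs and extract are hypothetical names, and this makeQuery is only one plausible way to write a dotted-path lookup, not necessarily how the example processor implements it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lookupFunc pulls a raw value out of a decoded JSON record (hypothetical type).
type lookupFunc func(record map[string]interface{}) interface{}

// transformFunc renders an extracted value as the output string (hypothetical type).
type transformFunc func(value interface{}) string

// makeQuery returns a lookup that walks a dot-separated path such as
// "action.timestamp" through nested JSON objects. It returns nil when a
// segment is missing or a parent is not an object.
func makeQuery(path string) lookupFunc {
	keys := strings.Split(path, ".")
	return func(record map[string]interface{}) interface{} {
		var current interface{} = record
		for _, key := range keys {
			obj, ok := current.(map[string]interface{})
			if !ok {
				return nil
			}
			current, ok = obj[key]
			if !ok {
				return nil
			}
		}
		return current
	}
}

// asIs passes strings through unchanged and formats anything else with fmt.
func asIs(value interface{}) string {
	if s, ok := value.(string); ok {
		return s
	}
	return fmt.Sprint(value)
}

// extract runs lookup then transform, falling back to def when nothing is found.
func extract(record map[string]interface{}, def string, lookup lookupFunc, transform transformFunc) string {
	value := lookup(record)
	if value == nil {
		return def
	}
	return transform(value)
}

func main() {
	line := `{"head":{"x-forwarded-for":"10.1.2.3"},"action":{"source":"web"}}`
	var record map[string]interface{}
	if err := json.Unmarshal([]byte(line), &record); err != nil {
		panic(err)
	}
	fmt.Println(extract(record, "0.0.0.0", makeQuery("head.x-forwarded-for"), asIs))
	fmt.Println(extract(record, "", makeQuery("action.source"), asIs))
	fmt.Println(extract(record, "n/a", makeQuery("action.timestamp"), asIs))
}
```

Keeping the lookup separate from the transform means the same path query can feed different output formats, which is the shape the field description above suggests.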

The Processor

Crunch comes with a built-in processor rig that packs its API into a ready-made processor:

crunch.ProcessJson(row)

This processor reads JSON and outputs Hadoop-streaming TSV that is compatible with Pig STREAM (which we use later), based on your row description and functions.
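For readers unfamiliar with the streaming contract, the sketch below shows the general JSON-lines-in, TSV-out shape using only the standard library. It is not Crunch's implementation: column, rowFromJSON and processJSONLines are hypothetical names, only top-level keys are read, and replacing tabs and newlines with spaces is an assumption made to keep the TSV well-formed.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// column names a top-level JSON key and the default emitted when it is absent
// (hypothetical type).
type column struct {
	key string
	def string
}

// sanitizer keeps values from breaking the TSV layout.
var sanitizer = strings.NewReplacer("\t", " ", "\n", " ", "\r", " ")

// rowFromJSON decodes one JSON object and renders the requested columns as a
// single tab-separated row. UseNumber keeps numbers in their literal form.
func rowFromJSON(line []byte, cols []column) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(line))
	dec.UseNumber()
	var record map[string]interface{}
	if err := dec.Decode(&record); err != nil {
		return "", err
	}
	fields := make([]string, len(cols))
	for i, c := range cols {
		fields[i] = c.def
		if v, ok := record[c.key]; ok && v != nil {
			fields[i] = sanitizer.Replace(fmt.Sprint(v))
		}
	}
	return strings.Join(fields, "\t"), nil
}

// processJSONLines reads one record per line, writes one row per record, and
// returns how many lines were skipped because they failed to parse.
func processJSONLines(in io.Reader, out io.Writer, cols []column) (int, error) {
	scanner := bufio.NewScanner(in)
	// Allow lines up to 1 MiB; the scanner's default limit is 64 KiB.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	w := bufio.NewWriter(out)
	skipped := 0
	for scanner.Scan() {
		row, err := rowFromJSON(scanner.Bytes(), cols)
		if err != nil {
			skipped++
			continue
		}
		if _, err := fmt.Fprintln(w, row); err != nil {
			return skipped, err
		}
	}
	if err := scanner.Err(); err != nil {
		return skipped, err
	}
	return skipped, w.Flush()
}

func main() {
	cols := []column{{key: "ip", def: "0.0.0.0"}, {key: "ts", def: ""}}
	// A real streamer would read os.Stdin; a fixed sample keeps the sketch self-contained.
	sample := `{"ip":"10.1.2.3","ts":1394000000}
{"ts":1394000001}
not json`
	skipped, err := processJSONLines(strings.NewReader(sample), os.Stdout, cols)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Fprintln(os.Stderr, "skipped lines:", skipped)
}
```

Skipping unparseable lines instead of aborting is a design choice for this sketch; a production job may prefer to fail fast or route bad records elsewhere.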

It also injects the following commands into your binary:

$ ./simple_processor -help
Usage of ./simple_processor:
  -crunch.cpuprofile="": Turn on CPU profiling and write to the specified file.
  -crunch.hivetemplate="": Custom Hive template for stub generation.
  -crunch.pigtemplate="": Custom Pig template for stub generation.
  -crunch.stubs="": Generate stubs and output to given path, and exit.
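As an illustration of what a flag like -crunch.cpuprofile typically does in a Go program, the following sketch wires a flag of the same name to the standard runtime/pprof package. How Crunch implements its flag is not shown in this README, so treat startCPUProfile and the surrounding wiring as assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"runtime/pprof"
)

// startCPUProfile begins CPU profiling into the file at path and returns a
// function that stops profiling and closes the file. An empty path disables
// profiling and returns a no-op stop function. (Hypothetical helper.)
func startCPUProfile(path string) (func(), error) {
	if path == "" {
		return func() {}, nil
	}
	f, err := os.Create(path)
	if err != nil {
		return nil, err
	}
	if err := pprof.StartCPUProfile(f); err != nil {
		f.Close()
		return nil, err
	}
	return func() {
		pprof.StopCPUProfile()
		f.Close()
	}, nil
}

func main() {
	cpuprofile := flag.String("crunch.cpuprofile", "", "write a CPU profile to this file")
	flag.Parse()

	stop, err := startCPUProfile(*cpuprofile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not start profiling:", err)
		os.Exit(1)
	}
	defer stop()

	// The record-processing loop would run here.
}
```

A profile written this way can be inspected with go tool pprof, passing the binary and the profile file.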

Building a binary

Since Go packs all dependencies into your binary, the binary makes a great delivery package for Hadoop.

Simply take a starter processor from /examples and build your processor based on it. Then build it:

$ go build simple_processor.go
$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql

The resulting binary should be ready for action, using Pig (see next section).

Generating Pig and Hive stubs

Crunch injects useful commands into your processor. One of them generates the scripts that create your Hive table and run your Pig job.

$ ./simple_processor -crunch.stubs="."
Generated crunch.pig
Generated crunch.hql

You can use your own templates with the -crunch.hivetemplate and -crunch.pigtemplate flags, as long as you include a %%schema%% (and %%process%% for the Pig script) special pragma so that Crunch will replace it with the actual Pig or Hive schema.
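The pragma mechanism amounts to placeholder substitution, which can be sketched as follows. This is an assumption-laden illustration rather than Crunch's generator: parseField, hiveSchema and render are hypothetical helpers, the template text is invented for the example, and the exact schema text Crunch emits may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// parseField splits a "name type" spec; the type defaults to "string" when
// omitted, mirroring the rule stated for row fields.
func parseField(spec string) (name, hiveType string) {
	parts := strings.Fields(spec)
	switch len(parts) {
	case 0:
		return "", "string"
	case 1:
		return parts[0], "string"
	default:
		return parts[0], strings.Join(parts[1:], " ")
	}
}

// hiveSchema renders field specs as a comma-separated Hive column list.
func hiveSchema(specs []string) string {
	cols := make([]string, 0, len(specs))
	for _, spec := range specs {
		name, hiveType := parseField(spec)
		cols = append(cols, name+" "+hiveType)
	}
	return strings.Join(cols, ", ")
}

// render replaces every %%key%% pragma in the template with its value.
func render(template string, values map[string]string) string {
	for key, value := range values {
		template = strings.ReplaceAll(template, "%%"+key+"%%", value)
	}
	return template
}

func main() {
	specs := []string{"ev_smp int", "ip", "ev_ts", "ev_source"}
	// Hypothetical template; a custom Hive template only needs the pragma.
	tmpl := "CREATE EXTERNAL TABLE events (%%schema%%)\n" +
		"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t';"
	fmt.Println(render(tmpl, map[string]string{"schema": hiveSchema(specs)}))
}
```

A Pig template would follow the same pattern with a second %%process%% entry in the values map.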

Extending Crunch

[this section is WIP]

Crunch is packaged into use cases accessible from the crunch package, crunch.ProcessJson to name one.

However, beneath the use-case facade lies an extensible API that lets you use Crunch at whatever level of granularity you need.

Some detailed examples can be seen in /examples/detailed_processor.go.

Contributing

Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).

Copyright

Copyright (c) 2014 Dotan Nahum @jondot. See MIT-LICENSE for further details.
