Hydra is a distributed data processing and storage system originally developed at AddThis. It ingests streams of data (think log files) and builds trees that are aggregates, summaries, or transformations of the data. These trees can be used by humans to explore (tiny queries), as part of a machine learning pipeline (big queries), or to support live consoles on websites (lots of queries).
You can run hydra from the command line to slice and dice that Apache access log you have sitting around (or that gargantuan csv file). Or, if terabytes per day is your cup of tea, run a Hydra Cluster that supports your job with resource sharing, job management, distributed backups, data partitioning, and efficient bulk file transfer.
The Hydra Documentation Page contains concepts, tutorials, guides, and the web api.
The Hydra User Reference is built automatically from the source code and contains reference material on hydra's configurable job components.
Getting Started With Hydra is a blog post that contains a nice self-contained introduction to hydra processing.
AddThis Java Code Style is the code style that hydra tries to adhere to.
Assuming you have Apache Maven installed and configured:
Should compile and build jars. All hydra dependencies should be available on maven central but hydra itself is not yet published.
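The build step above can be sketched as follows. This is a minimal sketch, assuming the standard Maven package phase is what produces the jars; the hydra_build function and the MVN override are hypothetical helpers, not part of Hydra.

```shell
# Sketch: build Hydra from the repository root.
# Assumption: the standard Maven "package" phase builds the jars.
# hydra_build and MVN are hypothetical names used for illustration only.
hydra_build() {
    # Optional first argument: a Maven profile to activate.
    profile="${1:-}"
    if [ -n "$profile" ]; then
        "${MVN:-mvn}" clean package -P "$profile"
    else
        "${MVN:-mvn}" clean package
    fi
}
```

Run hydra_build for a plain build, or hydra_build bdbje to activate the bdbje profile named in this README.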
Berkeley DB Java Edition is used for several core features. The Sleepycat License has strong copyleft properties that do not match the rest of the project. It is set as a non-transitive dependency to avoid inadvertently pulling it into downstream projects. In the future hydra should have pluggable storage with multiple implementations.
The hydra-uber module builds an execjar containing hydra and all of its dependencies. To include BDB JE, build with -P bdbje. The main class of the execjar launches the various components of a hydra cluster by name.
JDK 8 is required. Hydra has been developed on Linux (CentOS 6) and should work on any modern Linux distro. Other Unix-like systems should work with minor changes but have not been tested. Mac OS X should work for building and running local-stack (see below).
Hydra uses rabbitmq for low volume command and control message exchange. On a modern Linux system, running apt-get install rabbitmq-server and using the default settings is adequate in most cases.
To run efficiently, Hydra needs a mechanism to take copy-on-write backups of the output of jobs. This is currently accomplished by adding the fl-cow library to LD_PRELOAD. Other approaches, such as ZFS or cp --reflink, are under consideration.
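To illustrate the cp --reflink alternative mentioned above, here is a minimal sketch. It is not how Hydra currently takes backups (that relies on fl-cow); it assumes GNU coreutils cp, and the cow_backup function name and its arguments are hypothetical.

```shell
# Sketch: snapshot a job output directory as a copy-on-write clone.
# Assumptions: GNU coreutils cp; cow_backup and its paths are hypothetical.
cow_backup() {
    src="$1"
    dest="$2"
    # --reflink=auto shares data blocks where the filesystem supports it
    # (for example Btrfs or XFS) and falls back to a regular copy elsewhere.
    # -a preserves permissions, timestamps and directory structure.
    cp -a --reflink=auto "$src" "$dest"
}
```

With --reflink=auto the command still succeeds on filesystems without clone support, at the cost of a full copy; --reflink=always would fail instead, which makes it easier to detect that no space is being shared.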
Many components assume that there is a local user called hydra and that all minion nodes can ssh as that user to each other. This is used most prominently for rsync based replicas. The user hydra is not necessary when running a local-stack environment (see below).
On OS X several utilities are necessary to run the local-stack environment:
brew install coreutils
brew install wget
While hydra can be used for ad-hoc analysis of csv and other local files, it's most commonly used in a distributed cluster. In that case the following components are involved:
A typical configuration is to have a cluster head with Spawn & QueryMaster backed by a homogeneous cluster of nodes running Minion, QueryWorker, and Meshy.
For local development, all of the above components can run together in a single stack run out of hydra-local. There is a local-stack.sh script to assist with this. To run the local stack:
The first time the script is run, a hydra-local directory will be created.
./hydra-uber/local/bin/local-stack.sh start - start ZooKeeper
./hydra-uber/local/bin/local-stack.sh start - start spawn, querymaster etc.
./hydra-uber/local/bin/local-stack.sh seed - add some sample data
You can then navigate to http://localhost:5052/ and you should see the spawn web interface.
./hydra-uber/local/bin/local-stack.sh stop will stop everything except ZooKeeper, and running stop a second time will bring that process down as well.
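The start and stop sequences described above can be wrapped in two small helpers. This is a sketch only: the STACK variable and the function names are hypothetical conveniences, and the script path is the one given in this README, relative to the repository root.

```shell
# Sketch: wrap the documented local-stack sequences in two helpers.
# STACK, start_local_stack and stop_local_stack are hypothetical names.
STACK="${STACK:-./hydra-uber/local/bin/local-stack.sh}"

start_local_stack() {
    "$STACK" start   # first call starts ZooKeeper
    "$STACK" start   # second call starts spawn, querymaster etc.
    "$STACK" seed    # adds some sample data
}

stop_local_stack() {
    "$STACK" stop    # stops everything except ZooKeeper
    "$STACK" stop    # second call brings ZooKeeper down as well
}
```

After start_local_stack returns, the spawn web interface should be reachable at http://localhost:5052/ as described above.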
There are sample job configurations located in
Mailing list: http://groups.google.com/group/hydra-oss
It's x.y.z where:
Hydra is released under the Apache License Version 2.0. See Apache or the LICENSE file in this distribution for details.
Hydra logo by Appy Vohra.