Connecting Apache Spark with different data stores [DEPRECATED]
*Disclaimer: As of 01/06/2015 this project has been deprecated. Thank you for your understanding and continued help throughout the project's life.*
Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We currently support Apache Cassandra, MongoDB, Elasticsearch, Aerospike, HDFS, S3 and any database accessible through JDBC, but in the near future we will add support for several other datastores.
In order to compile the deep-jdbc module it is necessary to add the Oracle ojdbc driver to your local repository. You can download it from the URL: http://www.oracle.com/technetwork/database/features/jdbc/default-2280470.html. On that page you must click on "Accept License Agreement" and then download the ojdbc7.jar library. You need a free Oracle account to download the official driver.
To install the ojdbc driver in your local repository you must execute the command below:
mvn install:install-file -Dfile=/path/to/ojdbc7.jar -DgroupId=com.oracle -DartifactId=ojdbc7 -Dversion=12.1.0.2 -Dpackaging=jar
After that you can compile Deep by executing the following steps:
cd deep-parent
mvn clean install
If you want to create a Deep distribution you must execute the following steps:
cd deep-scripts
./make-distribution-deep.sh
During the creation you'll see the following question:
What tag want to use for Aerospike native repository?
You must type 0.7.0 and press enter.
The integration is not based on Cassandra's Hadoop interface.
Deep comes with a user-friendly API that lets developers create Spark RDDs mapped to Cassandra column families. We provide two different interfaces:
The first one lets developers map Cassandra tables to plain old Java objects (POJOs), just as if you were using any other ORM. We call this API the 'entity objects' API. This abstraction is quite handy: it lets you work on RDDs while, under the hood, Deep transparently maps Cassandra's columns to entity properties. Your domain entities must be correctly annotated using Deep annotations (take a look at the deep-core example entities in package com.stratio.deep.core.entity).
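The annotation pattern can be sketched with a minimal self-contained snippet. Note that the annotation types below are inline stand-ins defined only so the sketch compiles on its own; the real annotations live in com.stratio.deep.commons.annotations, and their exact names and attributes should be checked against the Deep API documentation.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class UserEntity {
    // Stand-ins for Deep's entity annotations, defined inline to keep the sketch self-contained.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface DeepEntity {}

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface DeepField {
        String fieldName() default "";
        boolean isPartOfPartitionKey() default false;
    }

    // A POJO mapped to a hypothetical 'users' table: each annotated field maps to a column.
    @DeepEntity
    public static class User {
        @DeepField(fieldName = "id", isPartOfPartitionKey = true)
        private String id;

        @DeepField(fieldName = "address")
        private String address;

        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getAddress() { return address; }
        public void setAddress(String address) { this.address = address; }
    }

    // Resolve a POJO field to its mapped column name via reflection,
    // the way an ORM layer would; returns null if the field is unmapped or missing.
    public static String columnNameOf(String fieldName) {
        try {
            DeepField f = User.class.getDeclaredField(fieldName).getAnnotation(DeepField.class);
            return f == null ? null : f.fieldName();
        } catch (NoSuchFieldException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(columnNameOf("address")); // prints "address"
    }
}
```

Once the entity is annotated, Deep can build an RDD of User objects directly, with no per-row conversion code in your application.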
The second one is a more generic 'cell' API that lets developers work on RDDs of 'Cells' objects, where a 'Cells' object is a collection of com.stratio.deep.entity.Cell objects. Column metadata is automatically fetched from the data store. This interface is a little more cumbersome to work with (see the example below), but has the advantage that it doesn't require the definition of additional entity classes. Example: you have a table called 'users' and you decide to use the 'Cells' interface. Once you get an instance 'c' of the Cells object, to get the value of column 'address' you can issue c.getCellByName("address").getCellValue(). Please refer to the Deep API documentation to learn more about the Cells and Cell objects.
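The access pattern above can be illustrated with a self-contained mock. The Cell and Cells classes below are minimal hypothetical stand-ins written only to make the snippet runnable; the real classes are com.stratio.deep.entity.Cell and Cells and carry column metadata beyond what is shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class CellsSketch {
    // Minimal stand-in for Deep's Cell: a named column holding a value.
    public static class Cell {
        private final String cellName;
        private final Object cellValue;
        public Cell(String cellName, Object cellValue) {
            this.cellName = cellName;
            this.cellValue = cellValue;
        }
        public String getCellName() { return cellName; }
        public Object getCellValue() { return cellValue; }
    }

    // Minimal stand-in for Deep's Cells: one row, i.e. a collection of Cell objects.
    public static class Cells {
        private final List<Cell> cells = new ArrayList<>();
        public void add(Cell cell) { cells.add(cell); }
        public Cell getCellByName(String name) {
            for (Cell cell : cells) {
                if (cell.getCellName().equals(name)) return cell;
            }
            return null;
        }
    }

    // Read the 'address' column of a row by name, as in the README's example.
    public static Object addressOf(Cells c) {
        return c.getCellByName("address").getCellValue();
    }

    public static void main(String[] args) {
        // Build one row of the hypothetical 'users' table.
        Cells c = new Cells();
        c.add(new Cell("name", "Jane"));
        c.add(new Cell("address", "10 Main St"));
        System.out.println(addressOf(c)); // prints "10 Main St"
    }
}
```

The trade-off is visible in the sketch: no entity class is needed, but every column access goes through a string lookup instead of a typed getter.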
We encourage you to read the more comprehensive documentation hosted on the Openstratio website.
Deep comes with an example sub-project called 'deep-examples' containing a set of working examples, both in Java and Scala. Please refer to the deep-examples project README for further information on how to set up a working environment.
The Spark-MongoDB connector is based on Hadoop-MongoDB.
Support for MongoDB has been added in version 0.3.0.
We provide two different interfaces:
ORM API: you just have to annotate your POJOs with Deep annotations, and you will be able to connect MongoDB with Spark using your own model entities.
Generic cell API: you do not need to specify the collection's schema or add anything to your POJOs; each document will be transformed into a "Cells" object.
We added a few working examples for MongoDB in the deep-examples sub-project, take a look at:
Entities:
Cells:
You can check out our first steps guide here:
We are working on further improvements!
Support for ElasticSearch has been added in version 0.5.0.
Support for Aerospike has been added in version 0.6.0.
Examples:
Entities:
Cells:
Support for JDBC has been added in version 0.7.0.
Examples:
Entities:
Cells:
Put Deep to work on a working Cassandra + Spark cluster. You have several options:
You already have a working Cassandra server on your development machine: you need a Spark + Deep bundle; we suggest creating one by running:
cd deep-scripts
./make-distribution-deep.sh
this will build a Spark distribution package with Stratio Deep and Cassandra's jars included (depending on your machine this script could take a while, since it will compile Spark from sources). The package will be called spark-deep-distribution-X.Y.Z.tgz; untar it to a folder of your choice, enter that folder and issue ./stratio-deep-shell. This will start an interactive shell where you can test Stratio Deep (you may have noticed this starts a development cluster with MASTER="local").
You already have a working installation of Cassandra and Spark on your development machine: this is the most difficult way to start testing Deep, but if you know what you're doing you will have to:
1. Copy the Stratio Deep jars to Spark's 'jars' folder ($SPARK_HOME/jars).
2. Copy Cassandra's jars to Spark's 'jars' folder.
3. Copy the Datastax Java Driver jar (v 2.0.x) to Spark's 'jars' folder.
4. Start the Spark shell and import the following:
import com.stratio.deep.commons.annotations._
import com.stratio.deep.commons.config._
import com.stratio.deep.commons.entity._
import com.stratio.deep.core.context._
import com.stratio.deep.cassandra.config._
import com.stratio.deep.cassandra.extractor._
import com.stratio.deep.mongodb.config._
import com.stratio.deep.mongodb.extractor._
import com.stratio.deep.es.config._
import com.stratio.deep.es.extractor._
import com.stratio.deep.aerospike.config._
import com.stratio.deep.aerospike.extractor._
import org.apache.spark.rdd._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.api.java.JavaSQLContext
import org.apache.spark.sql.api.java.JavaSchemaRDD
import org.apache.spark.sql.api.java.Row
import scala.collection.JavaConversions._
Once you have a working development environment you can finally start testing Deep. These are the basic steps you will always have to perform in order to use Deep:
From version 0.4.x, Deep supports multiple datastores. In order to correctly implement this new feature, Deep underwent a huge refactoring between versions 0.2.9 and 0.4.x. To port your code to the new version you should take into account a few changes we made.
From version 0.4.x, Deep supports multiple datastores; in your project you should import only the Maven dependency you will use: deep-cassandra, deep-mongodb, deep-elasticsearch or deep-aerospike.
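For instance, a Cassandra-only project would declare just the deep-cassandra artifact. This is a sketch: the groupId shown and the version placeholder X.Y.Z are assumptions, so check the coordinates actually published for your release.

```xml
<!-- pom.xml: pull in only the Deep module you need (deep-cassandra here). -->
<dependency>
    <groupId>com.stratio.deep</groupId>
    <artifactId>deep-cassandra</artifactId>
    <version>X.Y.Z</version>
</dependency>
```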
Examples:
Cells cells1 = new Cells(); // instantiate a Cells object whose default table name is generated internally
Cells cells2 = new Cells("mydefaulttable"); // creates a new Cells object whose default table name is specified by the user
cells2.add(new Cell(...)); // adds to the 'cells2' object a new Cell object associated to the default table
cells2.add("myothertable", new Cell(...)); // adds to the 'cells2' object a new Cell associated to "myothertable"
The methods used to create Cell and Entity RDDs have been merged into one single method: