Distributed Stream and Batch Processing
Jet is an open-source, in-memory, distributed batch and stream processing engine. You can use it to process large volumes of real-time events or huge batches of static datasets. To give a sense of scale, a single node of Jet has been proven to aggregate 10 million events per second with latency under 10 milliseconds.
It provides a Java API to build stream and batch processing applications through the use of a dataflow programming model. After you deploy your application to a Jet cluster, Jet will automatically use all the computational resources on the cluster to run your application.
If you add more nodes to the cluster while your application is running, Jet automatically scales the application up to run on the new nodes. If you remove nodes from the cluster, it scales the application down seamlessly without losing the current computational state, providing exactly-once processing guarantees.
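A job's processing guarantee and state-snapshot interval can be chosen when the job is submitted. The following is a minimal sketch using Jet's JobConfig API (the pipeline itself is elided here); EXACTLY_ONCE trades some latency for the stronger guarantee:

```java
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.config.JobConfig;
import com.hazelcast.jet.config.ProcessingGuarantee;
import com.hazelcast.jet.pipeline.Pipeline;

JetInstance jet = Jet.bootstrappedInstance();
Pipeline pipeline = Pipeline.create();
// ... build the pipeline ...

// Request exactly-once processing, with distributed state
// snapshots taken every 10 seconds.
JobConfig config = new JobConfig()
        .setProcessingGuarantee(ProcessingGuarantee.EXACTLY_ONCE)
        .setSnapshotIntervalMillis(10_000);
jet.newJob(pipeline, config);
```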
For example, you can express the classical word count problem, which reads some local files and outputs the frequency of each word to the console, using the following API:
JetInstance jet = Jet.bootstrappedInstance();

Pipeline p = Pipeline.create();
p.readFrom(Sources.files("/path/to/text-files"))
 .flatMap(line -> traverseArray(line.toLowerCase().split("\\W+")))
 .filter(word -> !word.isEmpty())
 .groupingKey(word -> word)
 .aggregate(counting())
 .writeTo(Sinks.logger());

jet.newJob(p).join();
and then deploy the application to the cluster:
bin/jet submit word-count.jar
Another application which aggregates millions of sensor readings per second with 10-millisecond resolution from Kafka looks like the following:
Pipeline p = Pipeline.create();
p.readFrom(KafkaSources.kafka(kafkaProperties, "sensors"))
 .withTimestamps(event -> event.getValue().timestamp(), 10) // use event timestamps, allowed lag in ms
 .groupingKey(reading -> reading.sensorId())
 .window(sliding(1_000, 10)) // sliding window of 1s, in 10ms steps
 .aggregate(averagingDouble(reading -> reading.temperature()))
 .writeTo(Sinks.logger());
Jet comes with out-of-the-box support for many kinds of data sources and sinks.
Jet is a good fit when you need to process large amounts of data in a distributed fashion. You can use it to build a variety of data-processing applications.
The engine can run anywhere from tens to thousands of jobs concurrently on a fixed number of threads.
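Running many jobs on a few threads works through cooperative multitasking: each unit of work does a small chunk of processing per invocation and yields, so one thread can drive many jobs without blocking. The sketch below illustrates the idea in plain Java with hypothetical names; it is not Jet's actual internal API:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy illustration of cooperative multitasking: each "tasklet" does a
// small chunk of work per call and reports whether it has finished, so a
// single thread can drive many concurrent jobs round-robin.
public class CooperativeLoop {
    interface Tasklet {
        boolean call(); // do a little work; return true when done
    }

    // Runs all tasklets to completion on the calling thread, round-robin.
    static void runAll(Queue<Tasklet> tasklets) {
        while (!tasklets.isEmpty()) {
            Tasklet t = tasklets.poll();
            if (!t.call()) {
                tasklets.add(t); // not done yet, re-enqueue for the next round
            }
        }
    }

    public static void main(String[] args) {
        Queue<Tasklet> tasklets = new ArrayDeque<>();
        int[] counters = new int[3];
        for (int i = 0; i < 3; i++) {
            final int id = i;
            tasklets.add(() -> ++counters[id] == 5); // each job needs 5 chunks
        }
        runAll(tasklets);
        for (int c : counters) {
            if (c != 5) throw new AssertionError("expected 5 chunks, got " + c);
        }
        System.out.println("all jobs finished on one thread");
    }
}
```

In Jet itself, blocking operations (such as some connectors) run on dedicated threads instead, so they cannot stall the cooperative workers.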
Jet stores computational state in a distributed, replicated in-memory store and requires neither a distributed file system nor infrastructure such as ZooKeeper to provide high availability and fault tolerance.
Jet implements a version of the Chandy-Lamport algorithm to provide exactly-once processing in the face of failures. When interfacing with external transactional systems such as databases, it can provide end-to-end processing guarantees using two-phase commit.
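The key idea in Chandy-Lamport style snapshotting is barrier alignment: a snapshot barrier flows through the data stream, and a processor with several inputs captures its state only after it has seen the barrier on all of them, so the snapshot reflects one consistent point in the stream. A simplified standalone sketch (not Jet's actual implementation) of a summing processor with two inputs:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy illustration of barrier alignment: a processor summing integers from
// two inputs snapshots its state only after the snapshot barrier has
// arrived on *both* inputs. Once an input delivers the barrier, it is not
// consumed again until the snapshot is taken.
public class BarrierAlignment {
    static final Object BARRIER = new Object();

    static long snapshotSum(Queue<Object> in1, Queue<Object> in2) {
        long sum = 0;
        boolean done1 = false, done2 = false;
        while (!done1 || !done2) {
            // consume only inputs that have not yet delivered the barrier
            if (!done1 && !in1.isEmpty()) {
                Object o = in1.poll();
                if (o == BARRIER) done1 = true; else sum += (Integer) o;
            }
            if (!done2 && !in2.isEmpty()) {
                Object o = in2.poll();
                if (o == BARRIER) done2 = true; else sum += (Integer) o;
            }
        }
        return sum; // the state captured at the aligned barrier
    }

    public static void main(String[] args) {
        Queue<Object> in1 = new ArrayDeque<>(List.of(1, 2, BARRIER));
        // the 99 arrives after the barrier, so it is not part of this snapshot
        Queue<Object> in2 = new ArrayDeque<>(List.of(10, BARRIER, 99));
        System.out.println("snapshot state: " + snapshotSum(in1, in2)); // 1 + 2 + 10 = 13
    }
}
```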
Event data often arrives out of order, and Jet has first-class support for dealing with this disorder. It uses a technique called distributed watermarks to process disordered events as if they had arrived in order.
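The intuition behind watermarks: if events arrive at most some allowed lag behind the newest timestamp seen so far, any event at or below the watermark (newest timestamp minus the lag) can safely be emitted in timestamp order, because nothing older can still arrive. A simplified standalone sketch of this idea (not Jet's internal implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy illustration of watermark-based reordering: events may arrive out of
// order, but no more than `allowedLag` behind the newest timestamp seen.
// Buffering events until the watermark passes them restores timestamp order.
public class Watermarks {
    static List<Long> reorder(List<Long> arrivalOrder, long allowedLag) {
        PriorityQueue<Long> buffer = new PriorityQueue<>();
        List<Long> emitted = new ArrayList<>();
        long maxSeen = Long.MIN_VALUE;
        for (long ts : arrivalOrder) {
            buffer.add(ts);
            maxSeen = Math.max(maxSeen, ts);
            long watermark = maxSeen - allowedLag;
            // events at or below the watermark can no longer be preceded
            // by a late arrival, so they are safe to emit in order
            while (!buffer.isEmpty() && buffer.peek() <= watermark) {
                emitted.add(buffer.poll());
            }
        }
        while (!buffer.isEmpty()) {
            emitted.add(buffer.poll()); // end of stream: drain the buffer
        }
        return emitted;
    }

    public static void main(String[] args) {
        // events arrive out of order, each at most 10 ms late
        List<Long> out = reorder(List.of(100L, 105L, 102L, 115L, 111L, 120L), 10);
        System.out.println(out); // [100, 102, 105, 111, 115, 120]
    }
}
```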
Follow the Get Started guide to start using Jet.
You can download Jet from https://jet-start.sh.
Alternatively, you can use the latest docker image:
docker run -p 5701:5701 hazelcast/hazelcast-jet
Use the following Maven coordinates to add Jet to your application:
<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>4.2</version>
</dependency>
See the tutorials for step-by-step guides to using Jet.
Jet supports a variety of transforms and operators.
You are also encouraged to join the hazelcast-jet mailing list if you are interested in community discussions.
Thanks for your interest in contributing! The easiest way is to just send a pull request. Have a look at the issues marked as good first issue for some guidance.
To build, use:
./mvnw clean package -DskipTests
You can always use the latest snapshot release if you want to try the features currently under development.
<repositories>
    <repository>
        <id>snapshot-repository</id>
        <name>Maven2 Snapshot Repository</name>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        <snapshots>
            <enabled>true</enabled>
            <updatePolicy>daily</updatePolicy>
        </snapshots>
    </repository>
</repositories>

<dependency>
    <groupId>com.hazelcast.jet</groupId>
    <artifactId>hazelcast-jet</artifactId>
    <version>4.3-SNAPSHOT</version>
</dependency>
When you create a pull request (PR), it must pass a build-and-test procedure. Maintainers will be notified about your PR, and they can trigger the build using special comments. These are the phrases you may see used in the comments on your PR:
verify - run the default PR builder, equivalent to mvn clean install
run-nightly-tests - use the settings for the nightly build (mvn clean install -Pnightly). This includes slower tests in the run, which we don't normally run on every PR.
run-windows - run the tests on a Windows machine (HighFive is not supported here)
run-cdc-debezium-tests - run all tests in the CDC Debezium module
run-cdc-mysql-tests - run all tests in the CDC MySQL module
run-cdc-postgres-tests - run all tests in the CDC PostgreSQL module
Where not indicated, the builds run on a Linux machine with Oracle JDK 8.
Source code in this repository is covered by one of two licenses. The default license throughout the repository is the Apache License 2.0, unless a file's header specifies another license. Please see the Licensing section for more information.
We owe (the good parts of) our CLI tool's user experience to picocli.
Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.
Visit www.hazelcast.com for more info.