# sparkle: Haskell on Apache Spark
sparkle [spär′kəl]: a library for writing resilient analytics applications in Haskell that scale to thousands of nodes, using Spark and the rest of the Apache ecosystem under the hood. See this blog post for the details.
The tl;dr, using the `hello` app as an example, on your local machine:
```
$ nix-shell --pure --run "bazel build //apps/hello:sparkle-example-hello_deploy.jar"
$ nix-shell --pure --run "bazel run spark-submit -- --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4 $PWD/bazel-bin/apps/hello/sparkle-example-hello_deploy.jar"
```
You'll need Nix for the above to work.
## How it works

sparkle is a tool for creating self-contained Spark applications in Haskell. Spark applications are typically distributed as JAR files, so that's what sparkle creates. We embed Haskell native object code as compiled by GHC in these JAR files, along with any shared libraries required by this object code to run. Spark dynamically loads this object code into its address space at runtime and interacts with it via the Java Native Interface (JNI).
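For a feel of the Haskell side, here is a minimal sketch of a sparkle application, modelled on the bundled apps. It assumes sparkle's `Control.Distributed.Spark` bindings (`newSparkConf`, `getOrCreateSparkContext`, `parallelize`, `RDD.filter`, `collect`); treat the exact names and signatures as illustrative rather than authoritative:

```
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE StaticPointers #-}

module Main where

import Control.Distributed.Closure (closure)
import Control.Distributed.Spark as RDD

main :: IO ()
main = do
    -- Configure and obtain a Spark context, as in any Spark application.
    conf <- newSparkConf "Hello sparkle!"
    sc   <- getOrCreateSparkContext conf
    -- Distribute a small dataset across the cluster.
    rdd  <- parallelize sc [1 .. 10 :: Int]
    -- Apply a static Haskell closure to each element, on the executors.
    evens <- RDD.filter (closure (static (even :: Int -> Bool))) rdd
    -- Bring the results back to the driver.
    collect evens >>= print
```

Built through `sparkle_package` (see below), this compiles into the self-contained JAR described above.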
To run a Spark application the process is as follows:

1. **create** an application in the `apps/` folder, in-repo or as a submodule;
2. **build** the app;
3. **submit** it to a local or cluster deployment of Spark.
If you run into issues, read the Troubleshooting section below first.
Include the following in a `BUILD.bazel` file next to your source code:

```
package(default_visibility = ["//visibility:public"])

load(
    "@rules_haskell//haskell:defs.bzl",
    "haskell_library",
)

haskell_library(
    name = "hello-hs",
    srcs = ...,
    deps = ...,
    ...
)

sparkle_package(
    name = "sparkle-example-hello",
    src = ":hello-hs",
)
```
You might want to add the following settings to your Bazel configuration (e.g. a `.bazelrc.local` file):

```
common --repository_cache=~/.bazel_repo_cache
common --disk_cache=~/.bazel_disk_cache
common --local_cpu_resources=4
```
And then ask Bazel to build a deploy JAR file:

```
$ nix-shell --pure --run "bazel build //apps/hello:sparkle-example-hello_deploy.jar"
```
`sparkle` builds on Mac OS X, but running it requires installing binaries for Spark. Another alternative is to build and run `sparkle` via Docker on non-Linux platforms, using a Docker image provisioned with Nix.
## Using sparkle in another project
Since `sparkle` interacts with the JVM, you need to tell `ghc` where JVM-specific headers and libraries are; it needs to be able to locate them at build time. `sparkle` uses `inline-java` to embed fragments of Java code in Haskell modules, which requires running the `javac` compiler, which must be available in the `PATH` of the shell. Moreover, `javac` needs to find the Spark classes that `inline-java` quotations refer to. Therefore, these classes need to be added to the `CLASSPATH` when building sparkle. Depending on your build system, how to do this might vary. In this repo, we use `gradle` to install Spark, and we query `gradle` to get the paths we need to add to the `CLASSPATH`.
Additionally, the classes need to be found at runtime in order to load them. The main thread can find them, but other threads need to invoke the thread-initialization functions of `Control.Distributed.Spark` first. If the `main` function terminates with unhandled exceptions, they can be propagated to Spark with `Control.Distributed.Spark.forwardUnhandledExceptionsToSpark`. This allows Spark both to report the exception and to clean up before termination.
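As a minimal sketch of the exception-forwarding part, assuming `forwardUnhandledExceptionsToSpark` wraps an `IO` action (the `newSparkConf` and `getOrCreateSparkContext` calls are illustrative, taken from the bundled apps):

```
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Distributed.Spark

main :: IO ()
main =
    -- Forward any exception escaping 'main' to Spark, so that Spark
    -- can report it and clean up before the application terminates.
    forwardUnhandledExceptionsToSpark $ do
        conf <- newSparkConf "my app"
        _sc  <- getOrCreateSparkContext conf
        -- actual driver logic goes here
        return ()
```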
Finally, to run your application, for example locally:
```
$ nix-shell --pure --run "bazel run spark-submit -- /path/to/<target-name>_deploy.jar"
```

`<target-name>` is the name of the Bazel target producing the jar file. See the apps in the `apps/` folder for examples.
RTS options can be passed as a Java property:

```
$ nix-shell --pure --run "bazel run spark-submit -- --driver-java-options=-Dghc_rts_opts='+RTS\ -s\ -RTS' /path/to/<target-name>_deploy.jar"
```

or as command-line arguments:

```
$ nix-shell --pure --run "bazel run spark-submit -- /path/to/<target-name>_deploy.jar +RTS -s -RTS"
```
The context class loader of threads needs to be set appropriately before JNI calls can find classes in Spark. Calling the thread-initialization functions of `Control.Distributed.Spark` should set it.
When using inline-java, it is recommended to use the Kryo serializer, which is currently not the default in Spark but is faster anyway. If you don't use the Kryo serializer, objects of anonymous classes, which arise e.g. when using Java 8 function literals,
```
foo :: RDD Int -> IO (RDD Bool)
foo rdd = [java| $rdd.map((Integer x) -> x.equals(0)) |]
```
won't be deserialized properly in multi-node setups. To avoid this problem, switch to the Kryo serializer by setting the appropriate configuration properties in your `SparkConf`, as sketched below.
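A minimal sketch of what this can look like with sparkle's bindings, assuming `confSet` for setting configuration properties and assuming `io.tweag.sparkle.kryo.InlineJavaRegistrator` as the registrator class (verify both against the sparkle version you use):

```
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Distributed.Spark

main :: IO ()
main = do
    conf <- newSparkConf "my app"
    -- Route serialization through Kryo instead of Java serialization.
    confSet conf "spark.serializer" "org.apache.spark.serializer.KryoSerializer"
    -- Register the classes produced by inline-java quotations with Kryo.
    confSet conf "spark.kryo.registrator" "io.tweag.sparkle.kryo.InlineJavaRegistrator"
    _sc <- getOrCreateSparkContext conf
    return ()
```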
See #104 for more details.
## Troubleshooting

### `java.lang.UnsatisfiedLinkError: /tmp/sparkle-app...: failed to map segment from shared object`

Sparkle unzips the Haskell binary program in a temporary location on the filesystem and then loads it from there. For loading to succeed, the temporary location must not be mounted with the `noexec` option. Alternatively, the temporary location can be changed with:

```
spark-submit --driver-java-options="-Djava.io.tmpdir=..." \
    --conf "spark.executor.extraJavaOptions=-Djava.io.tmpdir=..."
```
### `java.io.IOException: No FileSystem for scheme: s3n`

Spark 2.4 requires explicitly specifying extra JAR files to `spark-submit` in order to work with AWS. To work around this, add an additional `--packages` argument when submitting the job:

```
spark-submit --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4
```
## License

Copyright (c) 2015-2016 EURL Tweag.
All rights reserved.
sparkle is free software, and may be redistributed under the terms specified in the LICENSE file.
sparkle is maintained by Tweag I/O.
Have questions? Need help? Tweet at @tweagio.