This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x.
Your next API to work with Apache Spark.
This project adds a missing layer of compatibility between Kotlin and Apache Spark. It allows Kotlin developers to use familiar language features such as data classes and lambda expressions as simple expressions in curly braces or method references.
We have opened a Spark Project Improvement Proposal: Kotlin support for Apache Spark to work with the community towards getting Kotlin support as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.
| Apache Spark | Scala | Kotlin for Apache Spark |
|:------------:|:-----:|:-----------------------:|
| 3.0.0+ | 2.12 | kotlin-spark-api-3.0.0:1.0.0-preview2 |
| 2.4.1+ | 2.12 | kotlin-spark-api-2.4_2.12:1.0.0-preview2 |
| 2.4.1+ | 2.11 | kotlin-spark-api-2.4_2.11:1.0.0-preview2 |
The list of Kotlin for Apache Spark releases is available here. The Kotlin for Spark artifacts adhere to the following convention:

`[Apache Spark version]_[Scala core version]:[Kotlin for Apache Spark API version]`
You can add Kotlin for Apache Spark as a dependency to your project: Maven, Gradle, SBT, and Leiningen are supported.
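For Gradle users, here is a minimal sketch of the dependency block in the Kotlin DSL. The coordinates follow the compatibility table above; the exact Spark version is an assumption and should match your cluster:

```kotlin
// build.gradle.kts — a sketch, not a canonical build file
dependencies {
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0:1.0.0-preview2")
    implementation("org.apache.spark:spark-sql_2.12:3.0.0") // assumed Spark version
}
```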
Here's an example `pom.xml`:

```xml
<dependency>
  <groupId>org.jetbrains.kotlinx.spark</groupId>
  <artifactId>kotlin-spark-api-3.0.0</artifactId>
  <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>
```
Note that `core` is compiled against Scala version `2.12`.
You can find complete `pom.xml` and `build.gradle` examples in the Quick Start Guide.
Once you have configured the dependency, you only need to add the following import to your Kotlin file:
```kotlin
import org.jetbrains.kotlinx.spark.api.*
```
```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application")
    .orCreate

spark.toDS("a" to 1, "b" to 2)
```
The example above produces `Dataset<Pair<String, Int>>`.
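Data classes work in the same way; below is a minimal sketch (the `Person` type, the local master, and the `List.toDS()` call are illustrative assumptions):

```kotlin
import org.jetbrains.kotlinx.spark.api.*

// Hypothetical domain type, for illustration only
data class Person(val name: String, val age: Int)

fun main() = withSpark(master = "local[2]", appName = "Data class demo") {
    // Encoders for Kotlin data classes are generated by the API automatically
    listOf(Person("Alice", 30), Person("Bob", 25))
        .toDS()
        .map { it.name to it.age + 1 }
        .show()
}
```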
There are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design. For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`. Note that we are forcing `RIGHT` to be nullable for you as a developer to be able to handle this situation. `NullPointerException`s are hard to debug in Spark, and we are doing our best to make them as rare as possible.
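A sketch of how the nullable right side plays out in practice (we assume `Pair` fields are encoded as columns `first` and `second`; adjust if your encoder differs):

```kotlin
import org.jetbrains.kotlinx.spark.api.*

fun main() = withSpark {
    val left = dsOf(1 to "a", 2 to "b")
    val right = dsOf(1 to "x", 3 to "y")

    // leftJoin yields Dataset<Pair<LEFT, RIGHT?>>:
    // rows with no match on the right carry null, which Kotlin forces us to handle
    left.leftJoin(right, left.col("first").equalTo(right.col("first")))
        .map { (l, r) -> l.second to (r?.second ?: "no match") }
        .show()
}
```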
We provide you with the useful function `withSpark`, which accepts everything that may be needed to run Spark: properties, name, master location, and so on. It also accepts a block of code to execute inside the Spark context. After the work block ends, `spark.stop()` is called automatically.
```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```
`dsOf` is just one more way to create a `Dataset` (here, `Dataset<Int>`) from varargs.
It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it then becomes difficult to control when we are working with a cached `Dataset` and when not. It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory than intended. To solve these problems we've added the `withCached` function:
```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()

            filter { it.first % 2 == 0 }.showDS()
        }
        .map { c(it.first, it.second, (it.first + it.second) * 2) }
        .show()
}
```
Here we're showing the cached `Dataset` for debugging purposes, then filtering it. The `filter` method returns the filtered `Dataset`, and then the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.
For more idiomatic Kotlin code we've added `toList` and `toArray` methods to this API. You can still use the `collect` method as in the Scala API, however the result should be cast to `Array`. This is because `collect` returns a Scala array, which is not the same as a Java/Kotlin one.
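A short sketch of the difference (assumptions: a local master, and that `toList` is available directly on the `Dataset` as described above):

```kotlin
import org.jetbrains.kotlinx.spark.api.*

fun main() = withSpark {
    val ds = dsOf(1, 2, 3)

    // Idiomatic Kotlin: a typed List straight from the API
    val list: List<Int> = ds.toList()
    println(list.sum())

    // Scala-style collect still works, but needs an explicit cast
    @Suppress("UNCHECKED_CAST")
    val array = ds.collect() as Array<Int>
    println(array.size)
}
```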
For more, check out the examples module. To get up and running quickly, check out this tutorial.
Please use GitHub issues for filing feature requests and bug reports. You are also welcome to join the kotlin-spark channel in the Kotlin Slack.
This project and the corresponding community are governed by the JetBrains Open Source and Community Code of Conduct. Please make sure you read it.
Kotlin for Apache Spark is licensed under the Apache 2.0 License.