Essential Spark extensions and helper methods ✨😲
Spark helper methods to maximize developer productivity.
Fetch the JAR file from Maven.
```scala
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
```
You can find the spark-daria Scala 2.11 versions here and the Scala 2.12 versions here. The legacy versions are here.
You should generally use Scala 2.11 with Spark 2 and Scala 2.12 with Spark 3.
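For instance, a Spark 2 project might pair the versions like this in its `build.sbt` (the Scala and Spark patch versions below are illustrative, not requirements):

```scala
// Illustrative build.sbt pairing: Scala 2.11 with Spark 2 (versions are examples)
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5" % "provided"
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "0.38.2"
```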
Reading Beautiful Spark Code is the best way to learn how to build Spark projects and leverage spark-daria.
spark-daria will make you a more productive Spark programmer. Studying the spark-daria codebase will help you understand how to organize Spark codebases.
Use quinn to access similar functions in PySpark.
spark-daria provides different types of functions that will make your life as a Spark developer easier:
The following overview will give you an idea of the types of functions that are provided by spark-daria, but you'll need to dig into the docs to learn about all the methods.
The core extensions add methods to existing Spark classes that will help you write beautiful code.
The native Spark API forces you to write code like this:

```scala
col("is_nice_person").isNull && col("likes_peanut_butter") === false
```
When you import the spark-daria `ColumnExt` class, you can write idiomatic Scala code like this:

```scala
import com.github.mrpowers.spark.daria.sql.ColumnExt._

col("is_nice_person").isNull && col("likes_peanut_butter").isFalse
```
This blog post describes how to use the spark-daria `createDF()` method that's much better than the `toDF()` and `createDataFrame()` methods provided by Spark.
See the `ColumnExt`, `DataFrameExt`, and `SparkSessionExt` objects for all the core extensions offered by spark-daria.
Column functions can be used in addition to the functions defined in `org.apache.spark.sql.functions`.
Here is how to remove all whitespace from a string with the native Spark API:

```scala
import org.apache.spark.sql.functions._

regexp_replace(col("first_name"), "\\s+", "")
```
The spark-daria `removeAllWhitespace()` function lets you express this logic with code that's more readable:

```scala
import com.github.mrpowers.spark.daria.sql.functions._

removeAllWhitespace(col("first_name"))
```
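The underlying string operation is a simple regex replacement; here is what it does to a single value, sketched on a plain Scala `String` (no Spark required):

```scala
// The same whitespace-stripping rule, applied to one value:
// every run of whitespace characters is deleted.
val cleaned = "fun   times ".replaceAll("\\s+", "")
// cleaned == "funtimes"
```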
spark-daria also provides datetime column functions:

- `beginningOfWeek`
- `endOfWeek`
- `beginningOfMonth`
- `endOfMonth`
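To give an intuition for what these column functions compute, here is a hypothetical pure-Scala sketch on single `java.time` dates (the week-boundary logic below assumes Sunday-starting weeks, which is an assumption for illustration, not necessarily spark-daria's default):

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

val d = LocalDate.of(2020, 1, 15) // a Wednesday

// beginningOfMonth / endOfMonth analogues for one date
val monthStart = d.withDayOfMonth(1)                          // 2020-01-01
val monthEnd   = d.`with`(TemporalAdjusters.lastDayOfMonth()) // 2020-01-31

// beginningOfWeek analogue, assuming weeks start on Sunday
val weekStart = d.`with`(TemporalAdjusters.previousOrSame(DayOfWeek.SUNDAY)) // 2020-01-12
```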
Custom transformations have the following method signature so they can be passed as arguments to the Spark `DataFrame#transform()` method:

```scala
def someCustomTransformation(arg1: String)(df: DataFrame): DataFrame = {
  // code that returns a DataFrame
}
```
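The curried signature is what makes this composable: partially applying the config arguments yields a `DataFrame => DataFrame` function that `transform` can accept. A minimal pure-Scala sketch of the same pattern, using a hypothetical `Frame` stand-in instead of a real `DataFrame`:

```scala
// Hypothetical stand-in for DataFrame, just to illustrate the pattern.
case class Frame(cols: Seq[String]) {
  // Mirrors Dataset#transform: apply a Frame => Frame function to this Frame.
  def transform(f: Frame => Frame): Frame = f(this)
}

// A "custom transformation": config args in the first parameter list, the
// Frame last, so withCol("sport") is a Frame => Frame.
def withCol(name: String)(df: Frame): Frame = Frame(df.cols :+ name)

val out = Frame(Seq("team"))
  .transform(withCol("sport"))
  .transform(withCol("city"))
// out.cols == Seq("team", "sport", "city")
```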
The spark-daria `snakeCaseColumns()` custom transformation snake_cases all of the column names in a DataFrame:

```scala
import com.github.mrpowers.spark.daria.sql.transformations._

val betterDF = df.transform(snakeCaseColumns())
```
Protip: You'll always want to deal with snake_case column names in Spark - use this function if your column names contain spaces or uppercase letters.
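The per-name rewrite is simple string munging; a hypothetical pure-Scala sketch of a rule like the one `snakeCaseColumns()` applies (the exact rule in spark-daria may differ):

```scala
// Assumed rule for illustration: lowercase everything and replace
// runs of whitespace with underscores.
def toSnakeCase(name: String): String =
  name.trim.toLowerCase.replaceAll("\\s+", "_")

val renamed = Seq("First Name", "Favorite Color").map(toSnakeCase)
// renamed == Seq("first_name", "favorite_color")
```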
The DataFrame helper methods make it easy to convert DataFrame columns into Arrays or Maps. Here's how to convert a column to an Array:

```scala
import com.github.mrpowers.spark.daria.sql.DataFrameHelpers._

// assumes sourceDF has an integer column named "num"
val arr = columnToArray[Int](sourceDF, "num")
```
DataFrame validators check that DataFrames contain certain columns or a specific schema. They throw descriptive error messages if the DataFrame schema is not as expected. DataFrame validators are a great way to make sure your application gives descriptive error messages.
Let's look at a method that makes sure a DataFrame contains the expected columns.
```scala
val sourceDF = Seq(
  ("jets", "football"),
  ("nacional", "soccer")
).toDF("team", "sport")

val requiredColNames = Seq("team", "sport", "country", "city")

validatePresenceOfColumns(sourceDF, requiredColNames)

// throws this error message:
// com.github.mrpowers.spark.daria.sql.MissingDataFrameColumnsException:
//   The [country, city] columns are not included in the DataFrame
//   with the following columns [team, sport]
```
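The check itself amounts to a set difference over column names. A hypothetical pure-Scala sketch of the comparison a validator like `validatePresenceOfColumns` performs:

```scala
// Collect the required column names that the DataFrame's schema is missing;
// a real validator would throw if this sequence is non-empty.
def missingColumns(dfColumns: Seq[String], required: Seq[String]): Seq[String] =
  required.filterNot(dfColumns.contains)

val missing = missingColumns(
  dfColumns = Seq("team", "sport"),
  required  = Seq("team", "sport", "country", "city")
)
// missing == Seq("country", "city")
```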
Here is the latest spark-daria documentation.
Studying these docs will make you a better Spark developer!
We are actively looking for contributors to add functionality that fills in the gaps of the Spark source code.
To get started, fork the project and submit a pull request. Please write tests!
After submitting a couple of good pull requests, you'll be added as a contributor to the project.
Continued excellence will be rewarded with push access to the master branch.
1. Version bump commit and create GitHub tag.
2. Publish documentation with `sbt ghpagesPushSite`.
3. Publish the JAR:
   - Run `sbt` to open the SBT console.
   - Run `> ; + publishSigned; sonatypeBundleRelease` to create the JAR files and release them to Maven. These commands are made available by the sbt-sonatype plugin.
When the release command is run, you'll be prompted to enter your GPG passphrase.
The Sonatype credentials should be stored in the `~/.sbt/sonatype_credentials` file in this format:

```
realm=Sonatype Nexus Repository Manager
host=oss.sonatype.org
user=$USERNAME
password=$PASSWORD
```