Joblib Apache Spark Backend
This library provides an Apache Spark backend for joblib to distribute tasks on a Spark cluster.
joblibspark requires Python 3.6+, joblib>=0.14 and pyspark>=2.4 to run. To install joblibspark, run:
pip install joblibspark
The installation does not install PySpark because for most users, PySpark is already installed. If you do not have PySpark installed, you can install pyspark together with joblibspark:
pip install "pyspark>=3.0.0" joblibspark
If you want to use joblibspark with scikit-learn, please install scikit-learn>=0.21.
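For example, with pip (the quotes keep the version specifier from being interpreted by the shell):

pip install "scikit-learn>=0.21"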
Run the following example code in the pyspark shell:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register spark backend

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
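The pyspark shell already provides an active SparkSession. If you instead run the example as a plain Python script, you will typically want to create and configure a SparkSession yourself before registering the backend. A minimal sketch, where the local master and app name are only illustrative:

from pyspark.sql import SparkSession

# Illustrative only: point the session at whatever cluster you actually use.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("joblibspark-example") \
    .getOrCreate()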
joblibspark does not generally support running model inference and feature engineering in parallel. For example:
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10)
with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on spark, it will still run locally.
    h.transform(...)

from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

with parallel_backend('spark', n_jobs=3):
    # This won't run in parallel on spark, it will still run locally.
    regr.predict(diabetes_X_test)
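By contrast, workloads that fan out through joblib, such as cross-validated hyperparameter search, can use the Spark backend. A minimal sketch, reusing the iris data from the earlier example (the parameter grid here is arbitrary):

from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn import datasets, svm
from joblibspark import register_spark

register_spark()  # register spark backend

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
with parallel_backend('spark', n_jobs=3):
    # Cross-validation fits are dispatched through joblib to the Spark backend.
    search.fit(iris.data, iris.target)
print(search.best_params_)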
Note: sklearn.ensemble.RandomForestClassifier has an n_jobs parameter, which means the algorithm supports model training and inference in parallel. However, its inference implementation binds the backend to the built-in joblib backends, so the Spark backend does not take effect in this case.
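A small sketch of what that note means in practice, again using the iris data (the n_jobs values are arbitrary):

from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import parallel_backend
from sklearn import datasets
from joblibspark import register_spark

register_spark()

iris = datasets.load_iris()
clf = RandomForestClassifier(n_estimators=100, n_jobs=4)
clf.fit(iris.data, iris.target)

with parallel_backend('spark', n_jobs=4):
    # The inference implementation pins a built-in backend, so this
    # prediction still runs locally rather than on the Spark cluster.
    clf.predict(iris.data)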