Cylon is a fast, scalable, distributed memory, parallel runtime for DataFrames.
Cylon is a fast, scalable, distributed-memory, data-parallel library for processing structured data. It implements a set of relational operators to process data. While "Core Cylon" is implemented in system-level C/C++, multiple language interfaces (Python and Java) are provided to integrate seamlessly with existing applications, enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. By default, it uses MPI to distribute applications.
Internally, Cylon uses Apache Arrow to represent data in a columnar format.
The documentation can be found at https://cylondata.org
Email - [email protected]
Mailing List - Join
We can use Conda to install PyCylon. At the moment, Cylon only works on Linux systems. The Conda binaries need Ubuntu 16.04 or higher.
conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7
conda activate cylon-0.4.0
Now let's run our first Cylon application. The following code creates two DataFrames and joins them.
from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

df1 = DataFrame([[1, 2, 3], [2, 3, 4]])
df2 = DataFrame([[1, 1, 1], [2, 3, 4]])

df3 = df1.merge(right=df2, on=[0, 1])
print("Local Merge")
print(df3)
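To make the semantics of the merge above concrete, here is a minimal pure-Python sketch of an inner join on column positions. It does not use PyCylon and is not how Cylon implements the operator. The helper names `rows_from_columns` and `inner_join` are hypothetical, and the sketch assumes that the nested lists passed to `DataFrame` are columns (column 0 is `[1, 2, 3]`, and so on).

```python
from collections import defaultdict


def rows_from_columns(columns):
    """Turn column-major data ([[col0...], [col1...]]) into a list of row tuples."""
    return list(zip(*columns))


def inner_join(left_rows, right_rows, on):
    """Inner join two lists of row tuples on the column positions in `on`."""
    # Build a hash index over the right side, keyed by the join columns.
    index = defaultdict(list)
    for row in right_rows:
        index[tuple(row[i] for i in on)].append(row)
    # Probe the index with each left row and emit the concatenated matches.
    joined = []
    for row in left_rows:
        for match in index.get(tuple(row[i] for i in on), []):
            joined.append(row + match)
    return joined


# Same values as df1 and df2 above, interpreted as columns (an assumption).
left = rows_from_columns([[1, 2, 3], [2, 3, 4]])
right = rows_from_columns([[1, 1, 1], [2, 3, 4]])
result = inner_join(left, right, on=[0, 1])
print(result)
```

Under that assumption, only the row whose values in columns 0 and 1 appear on both sides survives the join. The column layout and naming of the real PyCylon output may differ.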
Now let's run a parallel version of this program. If we create n processes (the parallelism), n instances of the program will run. Each instance loads two DataFrames into its own memory, and together they perform a distributed join across all of the DataFrames. The results are likewise spread across the n processes.
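from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
import random

# Initialize the distributed environment (one instance per MPI process)
env = CylonEnv(config=MPIConfig())

# Each process samples its values from a range that depends on its rank
df1 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
# Index df2 on column 0 (assumed key column)
df2.set_index([0], inplace=True)
print("Distributed Join")
# Passing env makes the join run across all processes
df3 = df1.join(other=df2, on=[0], env=env)
print(df3)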
You can run the above program in the Conda environment by using the following command. It uses the mpirun command with 2 parallel processes.
mpirun -np 2 python
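To build intuition for what a distributed join has to accomplish, the following sketch simulates 2 processes inside a single Python process, with no MPI or PyCylon involved. It uses hash partitioning on the join key, which is one common strategy; this is an illustration and not a description of Cylon's internals. `WORLD_SIZE`, `KEY`, `local_rows`, `shuffle`, and `join` are hypothetical names, and joining on column 0 is an assumption.

```python
import random

WORLD_SIZE = 2  # hypothetical process count, mirroring `mpirun -np 2`
KEY = 0         # join on column 0 (an assumption for this sketch)


def local_rows(rank, seed):
    """Rows one rank would hold: two columns of 5 values sampled per rank."""
    rng = random.Random(seed)
    values = range(10 * rank, 15 * (rank + 1))  # rank 0: 0..14, rank 1: 10..29
    return list(zip(rng.sample(values, 5), rng.sample(values, 5)))


def shuffle(partitions):
    """Route every row to the rank selected by hashing its join key."""
    received = [[] for _ in range(WORLD_SIZE)]
    for rows in partitions:
        for row in rows:
            received[hash(row[KEY]) % WORLD_SIZE].append(row)
    return received


def join(left_rows, right_rows):
    """Plain inner join on KEY, used both per rank and as the reference."""
    return [left + right
            for left in left_rows
            for right in right_rows
            if left[KEY] == right[KEY]]


# Each simulated rank creates its own pair of tables.
left_parts = [local_rows(rank, seed=rank) for rank in range(WORLD_SIZE)]
right_parts = [local_rows(rank, seed=100 + rank) for rank in range(WORLD_SIZE)]

# After the shuffle, equal keys are co-located, so each rank can join locally.
left_shuffled = shuffle(left_parts)
right_shuffled = shuffle(right_parts)
per_rank = [join(lhs, rhs) for lhs, rhs in zip(left_shuffled, right_shuffled)]

# Reference: join all rows in a single process and compare.
all_left = [row for part in left_parts for row in part]
all_right = [row for part in right_parts for row in part]
reference = join(all_left, all_right)
distributed = [row for part in per_rank for row in part]

print("matches single-process join:", sorted(distributed) == sorted(reference))
for rank, rows in enumerate(per_rank):
    print("rank", rank, "holds", rows)
```

Note that the per-rank value ranges overlap (rank 0 draws from 0..14 and rank 1 from 10..29), so matching keys can originate on different processes. That is why rows must be exchanged before joining, and why the result ends up spread across the processes rather than gathered in one place.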
Refer to the documentation for instructions on how to compile Cylon.
Cylon uses the Apache License, Version 2.0.