Cylon is a fast, scalable, distributed memory, parallel runtime for DataFrames.

Cylon is a fast, scalable, distributed-memory, data-parallel library for processing structured data. It implements a set of relational operators to process data. While "Core Cylon" is implemented in system-level C/C++, language interfaces for Python and Java are provided to integrate with existing applications, so that both data and AI/ML engineers can invoke data processing operators in a familiar programming language. By default it uses MPI to distribute applications.

Internally Cylon uses Apache Arrow to represent the data in a column format.

The documentation can be found at

Email - [email protected]

Mailing List - Join

Get Started

We can use Conda to install PyCylon. At the moment Cylon only works on Linux systems. The Conda binaries need Ubuntu 16.04 or higher.

conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7
conda activate cylon-0.4.0

Now let's run our first Cylon application. The following code creates two DataFrames and joins them.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

df1 = DataFrame([[1, 2, 3], [2, 3, 4]])
df2 = DataFrame([[1, 1, 1], [2, 3, 4]])

# local merge
df3 = df1.merge(right=df2, on=[0, 1])
print("Local Merge")
print(df3)
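
To make the semantics of a merge on columns 0 and 1 concrete, here is a minimal plain-Python sketch of an inner join. It does not use PyCylon and is not how Cylon implements the operator. The helper `inner_join` is hypothetical, and the sketch assumes that each inner list passed to DataFrame is one column.

```python
def inner_join(left_cols, right_cols, on):
    """Inner-join two column-major tables on the column positions in `on`.

    Hypothetical helper for illustration only; not part of PyCylon.
    """
    # Assumption: each inner list is a column, so zip(*cols) yields rows.
    left_rows = list(zip(*left_cols))
    right_rows = list(zip(*right_cols))

    # Build a hash index over the right table, keyed by the join columns.
    index = {}
    for row in right_rows:
        key = tuple(row[i] for i in on)
        index.setdefault(key, []).append(row)

    # Probe the index with each left row and emit the concatenated matches.
    joined = []
    for row in left_rows:
        key = tuple(row[i] for i in on)
        for match in index.get(key, []):
            joined.append(row + match)
    return joined


left = [[1, 2, 3], [2, 3, 4]]
right = [[1, 1, 1], [2, 3, 4]]
result = inner_join(left, right, on=[0, 1])
print(result)
```

Under these assumptions only the row (1, 2) appears in both inputs, so the sketch yields a single joined row. How PyCylon names and orders the output columns may differ.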

Now let's run a parallel version of this program. If we create n processes (the parallelism), n instances of the program will run. Each instance loads two DataFrames into its own memory, and the instances together perform a distributed join across all of the DataFrames. The results are likewise spread across the n processes.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
import random

# distributed join
env = CylonEnv(config=MPIConfig())

df1 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2 = DataFrame([random.sample(range(10*env.rank, 15*(env.rank+1)), 5),
                 random.sample(range(10*env.rank, 15*(env.rank+1)), 5)])
df2.set_index([0], inplace=True)
print("Distributed Join")
df3 = df1.join(other=df2, on=[0], env=env)
print(df3)

You can run the above program in the Conda environment with the following command, which uses mpirun to launch 2 parallel processes. The path to the saved program goes after python.
mpirun -np 2 python 
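
One way to picture the distributed join described above is as a shuffle followed by independent local joins. The sketch below simulates this in a single Python process, without MPI or PyCylon. The function names and the choice of hash partitioning on the join key are assumptions made for illustration; the text above does not say which strategy Cylon actually uses.

```python
def partition(rows, n_workers, key_col=0):
    """Route each row to the worker selected by hashing its join key."""
    buckets = [[] for _ in range(n_workers)]
    for row in rows:
        buckets[hash(row[key_col]) % n_workers].append(row)
    return buckets


def local_join(left_rows, right_rows, key_col=0):
    """Plain hash join of the rows that ended up on one worker."""
    index = {}
    for r in right_rows:
        index.setdefault(r[key_col], []).append(r)
    return [l + r for l in left_rows for r in index.get(l[key_col], [])]


def simulated_distributed_join(left_parts, right_parts):
    """Simulate n workers: shuffle both sides by key, then join locally."""
    n = len(left_parts)
    left_shuffled = [[] for _ in range(n)]
    right_shuffled = [[] for _ in range(n)]
    # Shuffle step: every worker sends each row to the worker owning its key.
    for part in left_parts:
        for dest, rows in enumerate(partition(part, n)):
            left_shuffled[dest].extend(rows)
    for part in right_parts:
        for dest, rows in enumerate(partition(part, n)):
            right_shuffled[dest].extend(rows)
    # Equal keys now share a worker, so the local joins need no more traffic.
    return [local_join(left_shuffled[w], right_shuffled[w]) for w in range(n)]


# Two simulated workers, each starting with its own slice of both tables.
left_parts = [[(1, "a"), (2, "b")], [(3, "c"), (4, "d")]]
right_parts = [[(4, "w"), (3, "x")], [(2, "y"), (7, "z")]]
per_worker = simulated_distributed_join(left_parts, right_parts)
for worker, rows in enumerate(per_worker):
    print(worker, rows)
```

Because rows with equal keys always land on the same simulated worker, the union of the per-worker results equals the join of the full tables, and each worker holds only its share of the output, which matches the statement that the results are created in the n processes.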

Compiling Cylon

Refer to the documentation on how to compile Cylon

Compiling on Linux


Cylon uses the Apache License, Version 2.0.
