Need help with cylon?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

cylondata
144 Stars 20 Forks Apache License 2.0 1.2K Commits 76 Opened issues

Description

Cylon is a fast, scalable, distributed memory, parallel runtime for DataFrames.

Services available

!
?

Need anything else?

Contributors list

Cylon

Build Status License

Cylon is a fast, scalable distributed memory data parallel library for processing structured data. Cylon implements a set of relational operators to process data. While ”Core Cylon” is implemented using system level C/C++, multiple language interfaces (Python and Java ) are provided to seamlessly integrate with existing applications, enabling both data and AI/ML engineers to invoke data processing operators in a familiar programming language. By default it works with MPI for distributing the applications.

Internally Cylon uses Apache Arrow to represent the data in a column format.

The documentation can be found at https://cylondata.org

Email - [email protected]

Mailing List - Join

Get Started

We can use Conda to install PyCylon. At the moment Cylon only works on Linux Systems. THe Conda binaries need Ubunut 16.04 or higher.

conda create -n cylon-0.4.0 -c cylondata pycylon python=3.7
conda activate cylon-0.4.0

Now lets run our first Cylon application. The following code creates two DataFrames and joins them.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig

df1 = DataFrame([[1, 2, 3], [2, 3, 4]]) df2 = DataFrame([[1, 1, 1], [2, 3, 4]])

local merge

df3 = df1.merge(right=df2, on=[0, 1]) print("Local Merge") print(df3)

Now lets run a parallel version of this program. Here if we create n processes (parallelism), n instances of the program will run. They will each load a two DataFrames in their memory and do a distributed join among all the DataFrames. The results will be created in the n processes as well.

from pycylon import DataFrame, CylonEnv
from pycylon.net import MPIConfig
import random

distributed join

env = CylonEnv(config=MPIConfig())

df1 = DataFrame([random.sample(range(10env.rank, 15(env.rank+1)), 5), random.sample(range(10env.rank, 15(env.rank+1)), 5)]) df2 = DataFrame([random.sample(range(10env.rank, 15(env.rank+1)), 5), random.sample(range(10env.rank, 15(env.rank+1)), 5)]) df2.set_index([0], inplace=True) print("Distributed Join") df3 = df1.join(other=df2, on=[0], env=env) print(df3)

You can run the above program in the Conda environment by using the following command. It uses

mpirun
command with 2 parallel processes.
mpirun -np 2 python 

Compiling Cylon

Refer to the documentation on how to compile Cylon

Compiling on Linux

Licence

Cylon uses the Apache Lincense Version 2.0

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.