Need help with peregrine?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

171 Stars 20 Forks GNU General Public License v3.0 80 Commits 1 Opened issues


Peregrine: A Pattern-Aware Graph Mining System

Services available


Need anything else?

Contributors list

# 329,677
75 commits
# 560,024
4 commits
# 379,190
Arch Li...
1 commit


Peregrine: A Pattern-Aware Graph Mining System


Peregrine is an efficient, single-machine system for performing data mining tasks on large graphs. Some graph mining applications include: * Finding frequent subgraphs * Generating the motif/graphlet distribution * Finding all occurrences of a subgraph

Peregrine is highly programmable, so you can easily develop your own graph mining applications using its novel, declarative, graph-pattern-centric API. To write a Peregrine program, you describe which graph patterns you are interested in mining, and what to do with each occurrence of those patterns. You provide the what and the runtime handles the how.

For full details, you can read our paper published in EuroSys 2020 or the longer version on arXiv. For a deeper look at optimizations under the hood, check out our article in the 2021 ACM Operating Systems Review.

For an in-depth summary, watch the video presentation:

EuroSys 2020 Presentation Thumbnail

TL;DR: compared to other state-of-the-art open-source graph mining systems, Peregrine: * executes up to 700x faster * consumes up to 100x less memory * scales to 100x larger data sets * on 8x fewer machines * with a simpler, more expressive API

Table of Contents

  1. Quick start
  2. Writing your own programs
  3. Data graphs
  4. Reproducing our EuroSys 2020 paper results
  5. Contributing
  6. Acknowledgements
  7. Resources

1. Quick start

Peregrine has been tested on Ubuntu 18.04 and Arch Linux but should work on any POSIX-y OS. It requires C++20 support (GCC version >= 10.1). Additionally, the tests require UnitTest++.

Ubuntu 18.04 prerequisites:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt install g++-10 libunittest++-dev

To build Peregrine:

$ git clone
$ cd Peregrine
$ source tbb2020/bin/ intel64
$ make -j CC=g++-10
$ bin/test

Several sample applications, query patterns, and a sample dataset are released with the code. Calling any of the applications without arguments will show you what they expect:

$ bin/count
USAGE: bin/count   [# threads]

These applications print their results in


For example, motif-counting:

$ bin/count data/citeseer 3-motifs 8
Counting 3-motifs
Finished reading datagraph: |V| = 3264 |E| = 4536
All patterns finished after 0.030265s
[2-3][1-3]: 23380
[1-2][1-3][2-3]: 1166

The string

encodes the pattern consisting of edges
(1, 3), (2, 3)
, and 23380 is the number of unique occurrences Peregrine found in the citeseer graph.

Other applications give similar output:

$ bin/count data/citeseer 4-clique 8
All patterns finished after 0.030265s
[3-4][1-2][1-3][1-4][2-3][2-4]: 255
$ bin/count data/citeseer query/p1.graph 8
All patterns finished after 0.003368s
[3-4][1-2][1-3][1-4][2-3]: 3730

FSM provides support values instead of counts:

$ bin/fsm data/citeseer 3 300 8 # size 3 FSM with support 300
Frequent patterns:
[1,0-2,0][1,0-3,0][2,0-4,0]: 303
[1,1-2,1][1,1-3,1][2,1-4,1]: 335
Finished in 0.078629s

The sample FSM app performs edge-induced FSM by default. For vertex-induced FSM, simply call it as

$ bin/fsm data/citeseer 3 300 8 v # vertex-induced

The existence-query application simply states whether the desired pattern exists or not:

$ bin/existence-query data/citeseer 14-clique 8
All patterns finished after 0.005509s
[pattern omitted due to length] doesn't exist in data/citeseer

The output application stores matches grouped by pattern in either CSV or packed binary format:

$ bin/output data/citeseer 3-motifs results bin 8
all patterns finished after 0.002905s
[1-3](1~2)[2-3]: 23380 matches stored in "results/[1-3](1~2)[2-3]"
[1-2][1-3][2-3]: 1166 matches stored in "results/[1-2][1-3][2-3]"

In the CSV format, for a pattern with

vertices each line will contain a match written in the form:

In the binary format, matches are written as sequences of

4-byte vertex IDs in binary, with no delimiters:

2. Writing your own programs

In Peregrine's programming model, you provide a data graph, a set of patterns you're interested in, and a callback the system will apply to each occurrence of these patterns in your data graph. We present a brief overview of the API here, beginning with constructing patterns.

For all of the following code snippets, assume we are

using namespace Peregrine

We have not released support for directed graphs yet; the code currently assumes all graphs are undirected.

2.1 Constructing patterns directly

Pattern graphs can constructed in two ways using the

data structure:

2.1.1 Statically, from a file

Given a file in the following edge-list format:

 [label]  [label]

where the

's are optional 32-bit integers and vertex ids are contiguous integers starting from 1. To indicate a vertex is unlabelled in a partially-labelled pattern, assign it label
. To indicate an anti-edge, any extra integer can be placed at the end of the line.

For example, a triangle:

1 2
1 3
2 3

A vertex-induced 3-star, notice the last edge is an anti-edge:

1 2
1 3
2 3 1

A partially-labelled triangle (vertex 3 is unlabelled):

1 100 2 101
1 100 3 -1
2 101 3 -1

Then, construct the

SmallGraph p("pattern_graph.txt");

2.1.2 Dynamically, using the builder methods

Construct an empty graph and add edges/anti-edges one by one:

SmallGraph p
  .add_edge(1, 2)
  .add_edge(1, 3)
  .add_anti_edge(2, 3)
  .set_label(1, 100)
  .set_label(2, 101)
  .set_label(3, -1);

2.2 Constructing patterns using the


class is useful for

Quickly generating common patterns:

SmallGraph triangle = PatternGenerator::clique(3);
SmallGraph wedge = PatternGenerator::star(3);

Quickly generating many patterns: ``` int size = 4; std::vector vertexinduced = PatternGenerator::all(size, PatternGenerator::VERTEXBASED, // 4 vertices PatternGenerator::INCLUDEANTIEDGES); // anti-edges are inserted between all unadjacent vertices

std::vector vertexinducededgebased = PatternGenerator::all(size, PatternGenerator::EDGEBASED, // 4 edges PatternGenerator::INCLUDEANTIEDGES); // anti-edges are inserted between all unadjacent vertices

std::vector edgeinduced = PatternGenerator::all(size, PatternGenerator::EDGEBASED, // 4 edges PatternGenerator::EXCLUDEANTIEDGES); // no extra anti-edges are inserted

std::vector edgeinducedvertexbased = PatternGenerator::all(size, PatternGenerator::VERTEXBASED, // 4 vertices PatternGenerator::EXCLUDEANTIEDGES); // no extra anti-edges are inserted ```

Extending existing patterns:

std::vector vertex_extensions = PatternGenerator::extend({triangle}, PatternGenerator::VERTEX_BASED);
std::vector edge_extensions = PatternGenerator::extend({triangle}, PatternGenerator::EDGE_BASED);

2.3 Importing data graphs

Data graphs can be constructed from a preprocessed edge-list (see Data graphs) or a

DataGraph dg("preprocessed_data/");
SmallGraph g("small_data_graph.txt");
DataGraph dg(g);

2.4 Matching patterns

Pattern matching is done through the

, and

Counting instances of patterns is very simple. Consider the following minimal motifs program:

#include "Peregrine.hh"
using namespace Peregrine;
void motifs(int size)
  int nthreads = 16;
  DataGraph g("data/citeseer/");

auto patterns = PatternGenerator::all(size, PatternGenerator::VERTEX_BASED, PatternGenerator::INCLUDE_ANTI_EDGES);

auto results = count(g, patterns, nthreads);

for (const auto &[pattern, count] : results) { std::cout << pattern << ": " << count << std::endl; } }

The arguments to

are straightforward: the data graph you wish to use, the patterns whose instances you wish to count, and the number of worker threads to use.

For arbitrary aggregations, use the

template functions. For example, you could replace
above with this snippet (note that using
will be faster):
  const auto callback = [](auto &&handle, auto &&match) {, 1); };
  auto results = match(g, patterns, nthreads, callback);

First, let's explain the template arguments to

: *
is the aggregation key type *
is the aggregation value type *
describes when aggregation happens: either
of execution or
indicates that you will not be using early termination

The regular parameters are the same as the

example except for
. This is a function that will be called on each match for a pattern. It takes two arguments: a handle to Peregrine's aggregator, and the match itself.


is mapping the pattern of the match to 1. Note that this lines up with the types passed to
: the aggregation key is
and the value is an integer. Peregrine automatically takes care of simple types like
, aggregating them with the built-in sum operator.

Using more complicated aggregation values requires them to implement a few methods and implement a view function. The sample FSM application is a great example: check out both

to see a view function and the aggregation value methods.

Users can call three methods on the

: -
map(key, value)
: maps
: returns a view of the value mapped by
: halts exploration early


function takes the same template parameters and arguments as
, but allows you to write matches to disk using the
  const auto callback = [](auto &&handle, auto &&match)
  {, 1);
    handle.template output(match.mapping); // this writes out matches
  auto results = output(g, patterns, nthreads, callback);

You can pass

an additional argument to specify where to put results (by default it writes them under "."):
  auto results = output(g, patterns, nthreads, callback, default_view, "/destination/path/");

In the simple case that you want to output all matches and don't require any other processing, you can call an overloaded definition of

auto results = output(g, patterns, nthreads, "/destination/path/");

You can output matches in CSV or binary formats (


3. Data graphs

Peregrine's data processor ingests graph edge-lists and stores them in binary adjacency-list format. For labeled data graphs, a separate file of vertex-label pairs is used. Both vertex ID's and labels should be unsigned 32-bit integers.

Edge-list file format:


Label file format:


Given such files

, the processor is used as follows:
$ cd Peregrine
$ mkdir data/my-graph
$ make convert_data
$ bin/convert_data edges.txt labels.txt data/my-graph/

To convert an unlabeled dataset,

can be omitted.

4. Reproducing our EuroSys 2020 paper results

See the guide.

5. Contributing

A sincere thank you for deciding to spend your valuable time on this project! PR's are welcome, and will be carefully considered so long as they do not degrade performance or compromise correctness.

If you want to understand the system, the most important bits are in: 1.

, which contains the main entry-points to the system 2.
, which contains the pattern matching engine 3.
, where pattern analysis happens

You can also contribute in several ways without having to dig into Peregrine internals: - Submit your applications to be included as samples - Point out or improve missing or unsatisfactory documentation - Complain about confusing errors - Suggest features you wish Peregrine had - Report correctness or performance bugs

6. Acknowledgements

We are grateful for the other open-source projects this codebase relies on: * bliss * Roaring

7. Resources

If you use this software project, please cite:

  author = {Jamshidi, Kasra and Mahadasa, Rakesh and Vora, Keval},
  title = {Peregrine: A Pattern-Aware Graph Mining System},
  year = {2020},
  isbn = {9781450368827},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {},
  doi = {10.1145/3342195.3387548},
  booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
  articleno = {Article 13},
  numpages = {16},
  location = {Heraklion, Greece},
  series = {EuroSys ’20}

@article{10.1145/3469379.3469381, author = {Jamshidi, Kasra and Vora, Keval}, title = {A Deeper Dive into Pattern-Aware Subgraph Exploration with PEREGRINE}, year = {2021}, issue_date = {July 2021}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, volume = {55}, number = {1}, issn = {0163-5980}, url = {}, doi = {10.1145/3469379.3469381}, journal = {SIGOPS Oper. Syst. Rev.}, month = jun, pages = {1–10}, numpages = {10} }

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.