Need help with skizze?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

skizzehq
776 Stars 62 Forks MIT License 439 Commits 8 Opened issues

Description

A probabilistic data structure service and storage

Services available

!
?

Need anything else?

Contributors list


Build Status license

Skizze ([ˈskɪt͡sə]: german for sketch) is a sketch data store to deal with all problems around counting and sketching using probabilistic data-structures.

Unlike a Key-Value store, Skizze does not store values, but rather appends values to defined sketches, allowing one to solve frequency and cardinality queries in near O(1) time, with minimal memory footprint.

Current status ==> Alpha (tagged v0.0.2)

Motivation

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like a number of unique visitor or most frequent items is more challenging and requires a lot of resources if implemented straightforwardly.

Skizze is a (fire and forget) service that provides a probabilistic data structures (sketches) storage that allows estimation of these and many other metrics, with a trade off in precision of the estimations for the memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more important, as a compact – sometimes astonishingly compact – replacement of raw data in stream-based computing.

Example use cases (queries)

  • How many distinct elements are in the data set (i.e. what is the cardinality of the data set)?
  • What are the most frequent elements (the terms “heavy hitters” and “top-k elements” are also used)?
  • What are the frequencies of the most frequent elements?
  • How many elements belong to the specified range (range query, in SQL it looks like
    SELECT count(v) WHERE v >= c1 AND v < c2)
    ?
  • Does the data set contain a particular element (membership query)?

How to build and run

make dist
./bin/skizze

Bindings

Two bindings are currently available:

Example usage:

Skizze comes with a CLI to help test and explore the server. It can be run via

./bin/skizze-cli

Commands

Create a new Domain (Collection of Sketches): ```{r, engine='bash', count_lines}

CREATE DOM $name $estCardinality $topk

CREATE DOM demostream 10000000 100 ```

Add values to the domain: ```{r, engine='bash', count_lines}

ADD DOM $name $value1, $value2 ....

ADD DOM demostream zod joker grod zod zod grod ```

Get the cardinality of the domain: ```{r, engine='bash', count_lines}

GET CARD $name

GET CARD demostream

returns:

Cardinality: 9

**Get** the *rankings* of the domain:
```{r, engine='bash', count_lines}
# GET RANK $name
GET RANK demostream

returns:

Rank: 1 Value: zod Hits: 3

Rank: 2 Value: grod Hits: 2

Rank: 3 Value: joker Hits: 1

Get the frequencies of values in the domain: ```{r, engine='bash', count_lines}

GET FREQ $name $value1 $value2 ...

GET FREQ demostream zod joker batman grod

returns

Value: zod Hits: 3

Value: joker Hits: 1

Value: batman Hits: 0

Value: grod Hits: 2

**Get** the *membership* of values in the domain:
```{r, engine='bash', count_lines}
# GET MEMB $name $value1 $value2 ...
GET MEMB demostream zod joker batman grod

returns

Value: zod Member: true

Value: joker Member: true

Value: batman Member: false

Value: grod Member: true

List all available sketches (created by domains): ```{r, engine='bash', count_lines} LIST

returns

Name: demostream Type: CARD

Name: demostream Type: FREQ

Name: demostream Type: MEMB

Name: demostream Type: RANK

**Create** a new sketch of type $type (CARD, MEMB, FREQ or RANK):
```{r, engine='bash', count_lines}
# CREATE CARD $name
CREATE CARD demosketch

Add values to the sketch of type $type (CARD, MEMB, FREQ or RANK): ```{r, engine='bash', count_lines}

ADD $type $name $value1, $value2 ....

ADD CARD demostream zod joker grod zod zod grod ```

License

Skizze is available under the Apache License, Version 2.0.

Authors

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.