======================================================================
C-Blosc2: A fast, compressed and persistent data store library for C
======================================================================

:Author: The Blosc Development Team
:Contact: [email protected]
:URL: http://www.blosc.org
:Gitter: |gitter|
:Actions: |actions|
:NumFOCUS: |numfocus|
:Code of Conduct: |Contributor Covenant|

.. |gitter| image:: https://badges.gitter.im/Blosc/c-blosc.svg
        :alt: Join the chat at https://gitter.im/Blosc/c-blosc
        :target: https://gitter.im/Blosc/c-blosc?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

.. |actions| image:: https://github.com/Blosc/c-blosc2/workflows/CI%20CMake/badge.svg
        :target: https://github.com/Blosc/c-blosc2/actions?query=workflow%3A%22CI+CMake%22

.. |appveyor| image:: https://ci.appveyor.com/api/projects/status/qiaxywqrouj6nkug/branch/master?svg=true
        :target: https://ci.appveyor.com/project/FrancescAlted/c-blosc2/branch/master

.. |numfocus| image:: https://img.shields.io/badge/powered%20by-NumFOCUS-orange.svg?style=flat&colorA=E1523D&colorB=007D8A
        :target: https://numfocus.org

.. |Contributor Covenant| image:: https://img.shields.io/badge/Contributor%20Covenant-v2.0%20adopted-ff69b4.svg
        :target: code_of_conduct.md

What is it?
===========

`Blosc <http://www.blosc.org>`_ is a high performance compressor optimized for binary data (i.e. floating point numbers, integers and booleans). It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a ``memcpy()`` call. Blosc's main goal is not just to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations.

C-Blosc2 is the new major version of C-Blosc, and tries hard to be backward compatible with both the C-Blosc1 API and its in-memory format. However, the reverse is generally not true: buffers generated with C-Blosc2 are not format-compatible with C-Blosc1 (i.e. forward compatibility is not supported).

See a 3-minute introductory video to Blosc2.

New features in C-Blosc2
========================

  • 64-bit containers: the first-class container in C-Blosc2 is the ``super-chunk`` or, for brevity, ``schunk``, which is made of smaller chunks that are essentially C-Blosc1 32-bit containers. The super-chunk may or may not be backed by another container called a ``frame`` (see later).
  • More filters: besides ``shuffle`` and ``bitshuffle``, already present in C-Blosc1, C-Blosc2 also implements:

    • ``delta``: the stored blocks inside a chunk are diff'ed with respect to the first block in the chunk. The idea is that, in some situations, the diff will have more zeros than the original data, leading to better compression.
    • ``trunc_prec``: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the ``shuffle`` or ``bitshuffle`` filter, this leads to more contiguous zeros, which compress better.

  • A filter pipeline: the different filters can be pipelined so that the output of one can be the input for the next. A possible example is ``delta`` followed by ``shuffle`` or, as described above, ``trunc_prec`` followed by ``bitshuffle``.
  • Prefilters: allow applying user-defined C callbacks prior to the filter pipeline during compression. See ``test_prefilter.c`` for an example of use.
  • Postfilters: allow applying user-defined C callbacks after the filter pipeline during decompression. The combination of prefilters and postfilters could be interesting for supporting e.g. encryption (via prefilters) and decryption (via postfilters). Also, a postfilter alone can be used to produce on-the-fly computations based on existing data (or other metadata, e.g. coordinates). See ``test_postfilter.c`` for an example of use.
  • SIMD support for ARM (NEON): this allows for faster operation on ARM architectures. Only ``shuffle`` is supported right now, but the idea is to implement ``bitshuffle`` for NEON too. Thanks to Lucian Marc.
  • SIMD support for PowerPC (ALTIVEC): this allows for faster operation on PowerPC architectures. Both ``shuffle`` and ``bitshuffle`` are supported; however, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer and ESRF for sponsoring the Blosc team in helping him in this task.
  • Dictionaries: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, it is only supported in the ``zstd`` codec, but it would be nice to extend it to ``lz4`` and ``blosclz`` at least.
  • Contiguous frames: allow storing super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported.

  • Sparse frames (on-disk): each chunk in a super-chunk is stored in a separate file, as is the metadata. This is the counterpart of the in-memory super-chunk, and allows for more efficient updates than in frames (i.e. avoiding 'holes' in monolithic files).

  • Partial chunk reads: there is support for reading just part of a chunk, avoiding the need to read the whole chunk and then discard the unnecessary data.

  • Parallel chunk reads: when several blocks of a chunk are to be read, this is done in parallel by the decompressing machinery. That means that every thread is responsible for reading, post-filtering and decompressing a block by itself, leading to an efficient overlap of I/O and CPU usage that optimizes reads to a maximum.

  • Meta-layers: optionally, the user can add meta-data for different uses and in different layers. For example, one may think of providing a meta-layer for NumPy so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter for adding more high-level info if desired (e.g. geo-spatial, meteorological...).
  • Variable length meta-layers: the user may want to add variable-length meta information that can be potentially very large (up to 2 GB). The regular meta-layer described above is very quick to read, but meant to store fixed-length and relatively small meta information. Variable length metalayers are stored in the trailer of a frame, whereas regular meta-layers are in the header.

  • Efficient support for special values: large sequences of repeated values can be represented with an efficient, simple and fast run-length representation, without the need to use regular codecs. With that, chunks or super-chunks with values that are the same (zeros, NaNs or any value in general) can be built in constant time, regardless of the size. This can be useful in situations where a lot of zeros (or NaNs) need to be stored (e.g. sparse matrices).

  • Nice markup for documentation: we are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C-API. See https://c-blosc2.readthedocs.io. Thanks to Alberto Sabater and Aleix Alcacer for contributing the support for this.

  • Plugin capabilities for filters and codecs: we have a plugin registration capability in place so that the info about new filters and codecs can be persisted and transmitted to different machines. See https://github.com/Blosc/c-blosc2/blob/main/examples/urfilters.c for a self-contained example. Thanks to the NumFOCUS foundation for providing a grant for doing this.

  • Pluggable tuning capabilities: this will allow users with different needs to define an interface so as to better tune different parameters like the codec, the compression level, the filters to use, the blocksize or the shuffle size. Thanks to ironArray for sponsoring us in doing this.

  • Support for I/O plugins: so that users can extend the I/O capabilities beyond the current filesystem support. Things like the use of databases or S3 interfaces should be possible by implementing these interfaces. Thanks to ironArray for sponsoring us in doing this.

  • Python wrapper: we have a preliminary wrapper in the works. You can have a look at our ongoing efforts in the python-blosc2 repo. Thanks to the Python Software Foundation for providing a grant for doing this.
  • Security: we are actively using OSS-Fuzz and ClusterFuzz for uncovering programming errors in C-Blosc2. Thanks to Google for sponsoring us in doing this.
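
To make the ``delta`` and ``trunc_prec`` filters from the list above more concrete, here is a minimal, self-contained sketch in C. It is not the C-Blosc2 implementation: the function names ``sketch_trunc_prec_f32`` and ``sketch_delta_xor`` are hypothetical, and using XOR as the "diff" operation is an assumption made here for simplicity (it has the convenient property of being its own inverse).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only (hypothetical name): keep the `keep_bits` most
 * significant bits of the 23-bit float32 mantissa and zero the rest,
 * in place. Sign and exponent bits are left untouched. */
void sketch_trunc_prec_f32(float *data, size_t n, int keep_bits) {
    assert(keep_bits >= 0 && keep_bits <= 23);
    /* Mask with zeros in the (23 - keep_bits) lowest bit positions. */
    uint32_t mask = ~((UINT32_C(1) << (23 - keep_bits)) - 1);
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &data[i], sizeof bits);   /* safe type punning */
        bits &= mask;
        memcpy(&data[i], &bits, sizeof bits);
    }
}

/* Illustrative only (hypothetical name): combine every block after the
 * first with the first (reference) block using XOR, in place. Because the
 * reference block is not modified, calling the function a second time
 * restores the original data. */
void sketch_delta_xor(uint8_t *chunk, size_t blocksize, size_t nblocks) {
    for (size_t b = 1; b < nblocks; b++) {
        for (size_t i = 0; i < blocksize; i++) {
            chunk[b * blocksize + i] ^= chunk[i];
        }
    }
}
```

In this sketch, blocks that resemble the first block turn into runs of zero bytes after ``sketch_delta_xor``, and truncated floats share zeroed low-order bytes; both are the kind of input on which a subsequent ``shuffle`` plus codec stage tends to do well. For example, keeping 8 mantissa bits turns ``3.14159274f`` into ``3.140625f``.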

More info about the improved capabilities of C-Blosc2 can be found in this talk.

After a long period of testing, C-Blosc2 entered production stage in 2.0.0. The API and format have been frozen, which means there is a guarantee that your programs will continue to work with future versions of the library, and that future releases will be able to read persistent storage generated by previous releases (as of 2.0.0).

Meta-compression and other advantages over existing compressors
===============================================================

C-Blosc2 is not like other compressors: it should rather be called a meta-compressor, because it can use different codecs (libraries that can reduce the size of inputs) and filters (libraries that generally improve compression ratio). At the same time, it can also be called a compressor, because it actually applies those codecs and filters to produce compressed output.

Currently C-Blosc2 comes with support for BloscLZ (a compressor heavily based on FastLZ), LZ4 and LZ4HC, Zstd, and Zlib (via zlib-ng), as well as highly optimized ``shuffle`` and ``bitshuffle`` filters, which can use SSE2, AVX2, NEON or ALTIVEC instructions if available (for info on how shuffling works, see slide 17 of http://www.slideshare.net/PyData/blosc-py-data-2014).
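
As a rough illustration of what a byte shuffle does, the following plain C sketch regroups the bytes of fixed-size elements so that all first bytes come first, then all second bytes, and so on. The function names are hypothetical and the code uses none of the SIMD optimizations mentioned above; it is meant as a sketch of the idea, not as the library's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only (hypothetical name): byte shuffle of `nelems`
 * elements of `typesize` bytes each. Byte j of element i moves to
 * position j * nelems + i in the output. */
void sketch_shuffle(size_t typesize, size_t nelems,
                    const uint8_t *src, uint8_t *dst) {
    assert(typesize > 0);
    for (size_t j = 0; j < typesize; j++) {
        for (size_t i = 0; i < nelems; i++) {
            dst[j * nelems + i] = src[i * typesize + j];
        }
    }
}

/* Inverse transform: put every byte back into its original element. */
void sketch_unshuffle(size_t typesize, size_t nelems,
                      const uint8_t *src, uint8_t *dst) {
    assert(typesize > 0);
    for (size_t j = 0; j < typesize; j++) {
        for (size_t i = 0; i < nelems; i++) {
            dst[i * typesize + j] = src[j * nelems + i];
        }
    }
}
```

For small integers stored in 4-byte elements, the high-order bytes are mostly zero, so after shuffling they end up in long contiguous runs that a codec can compress easily.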

Blosc is in charge of coordinating the codecs and filters so that they can leverage the blocking technique as well as multi-threaded execution (if several cores are available) automatically. As a result, every codec and filter works at very high speeds, even if it was not initially designed for blocking or multi-threading. For example, Blosc allows you to use the ``LZ4`` codec, but in a multi-threaded way.

Last but not least, C-Blosc2 comes with an easy-to-use plugin mechanism for codecs and filters, so anyone can inject their own code in the compression pipeline of Blosc2 and reap its benefits (like multi-threading and integration with other filters) for free (see a self-contained example). In addition, we have implemented a centralized plugin system too (see the docs in the plugins directory).

Multidimensional containers
===========================

As said, C-Blosc2 adds a powerful mechanism for adding different metalayers on top of its containers. Caterva is a sibling library that adds such a metalayer, specifying not only the dimensionality of a dataset, but also the dimensionality of the chunks inside the dataset. In addition, Caterva adds machinery for retrieving arbitrary multi-dimensional slices (aka hyper-slices) out of the multi-dimensional containers in the most efficient way. Hence, Caterva brings the convenience of multi-dimensional containers to your application very easily. For more info, check out the Caterva documentation.

Python wrapper
==============

We are officially supporting (thanks to the Python Software Foundation) a Python wrapper for Blosc2. Although this is still in early development, it already supports all the features of the venerable ``python-blosc`` package. As a bonus, the ``python-blosc2`` package comes with wheels and binary versions of the C-Blosc2 libraries, so anyone, even non-Python users, can install C-Blosc2 binaries easily with:

.. code-block:: console

    pip install blosc2

Compiling the C-Blosc2 library with CMake
=========================================

Blosc can be built, tested and installed using CMake. The following procedure describes a typical CMake build.

Create the build directory inside the sources and move into it:

.. code-block:: console

    git clone https://github.com/Blosc/c-blosc2
    cd c-blosc2
    mkdir build
    cd build

Now run CMake configuration and optionally specify the installation directory (e.g. '/usr' or '/usr/local'):

.. code-block:: console

    cmake -DCMAKE_INSTALL_PREFIX=your_install_prefix_directory ..

CMake allows configuring Blosc in many different ways, like preferring internal or external sources for compressors, or enabling/disabling them. Please note that configuration can also be performed using the UI tools provided by CMake (``ccmake`` or ``cmake-gui``):

.. code-block:: console

    ccmake ..      # run a curses-based interface
    cmake-gui ..   # run a graphical interface

Build, test and install Blosc:

.. code-block:: console

    cmake --build .
    ctest
    cmake --build . --target install

The static and dynamic versions of the Blosc library, together with the header files, will be installed into the specified ``CMAKE_INSTALL_PREFIX``.

Once you have compiled your Blosc library, you can easily link your apps with it as shown in the ``examples/`` directory.

Handling support for codecs (LZ4, LZ4HC, Zstd, Zlib) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

C-Blosc2 comes with full sources for LZ4, LZ4HC, Zstd, and Zlib. In general, you should not worry about not having (or CMake not finding) these libraries in your system, because by default the included sources are automatically compiled and included in the C-Blosc2 library. This means that you can be confident of having complete support for all the codecs in all Blosc deployments (unless you explicitly exclude support for some of them).

If you want to force Blosc to use external libraries instead of the included compression sources:

.. code-block:: console

    cmake -DPREFER_EXTERNAL_LZ4=ON ..

You can also disable support for some compression libraries:

.. code-block:: console

    cmake -DDEACTIVATE_ZSTD=ON ..

Supported platforms
~~~~~~~~~~~~~~~~~~~

C-Blosc2 is meant to support all platforms where a C99 compliant C compiler can be found. The most tested ones are Intel (Linux, Mac OSX and Windows), ARM (Linux, Mac), and PowerPC (Linux), but exotic ones such as the IBM Blue Gene Q embedded "A2" processor are reported to work too. More on ARM support in ``README_ARM.rst``.

For Windows, you will need VS2015 or higher on x86 and x64 targets (i.e. ARM is not supported on Windows).

For Mac OSX, make sure that you have installed the command line developer tools. You can always install them with:

.. code-block:: console

    xcode-select --install

For Mac OSX on arm64 architecture, you need to compile like this:

.. code-block:: console

    CC="clang -arch arm64" cmake ..

Support for the LZ4 optimized version in Intel IPP
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

C-Blosc2 comes with support for a highly optimized version of the LZ4 codec present in Intel IPP. Here is a way to easily install Intel IPP using Conda (https://docs.conda.io):

.. code-block:: console

    conda install -c intel ipp-static

With that, you can enable support for LZ4/IPP (it is disabled by default) with:

.. code-block:: console

    cmake .. -DDEACTIVATE_IPP=OFF

On some Intel CPUs, LZ4/IPP can be faster than regular LZ4, although in many cases you may see different compression ratios depending on which version you use. See #313 for some quick and dirty benchmarks.

Display error messages
~~~~~~~~~~~~~~~~~~~~~~

By default, error messages are disabled. To display them, you just need to activate the Blosc tracing machinery by setting the ``BLOSC_TRACE`` environment variable.
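
A minimal sketch of how an environment-variable-gated trace of this kind can work is shown below. The helper name ``sketch_trace`` is hypothetical and the code is not taken from the C-Blosc2 sources; it only assumes that the mere presence of the variable is what enables the messages.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative only (hypothetical helper): print `msg` to stderr when the
 * environment variable named `envvar` is set. Returns 1 if the message
 * was printed and 0 if tracing is disabled. */
int sketch_trace(const char *envvar, const char *msg) {
    assert(envvar != NULL && msg != NULL);
    if (getenv(envvar) == NULL) {
        return 0;   /* variable not set: stay silent */
    }
    fprintf(stderr, "[trace] %s\n", msg);
    return 1;
}
```

With this sketch, ``sketch_trace("BLOSC_TRACE", "something failed")`` prints only when the variable exists in the environment of the process, for instance when a program is launched from a POSIX shell as ``BLOSC_TRACE=1 ./myapp`` (``myapp`` being a placeholder name).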

Contributing
============

If you want to collaborate in this development, you are welcome. We need help in the different areas listed in the ROADMAP; also, be sure to read our DEVELOPING-GUIDE and our Code of Conduct. Blosc is distributed using the BSD license.

Twitter feed
============

Follow @Blosc2 to stay informed about the latest developments.

Acknowledgments
===============

See the THANKS document.

Enjoy data!
