Cantor is a persistent data abstraction layer; it provides functionality to query and retrieve data stored as key/value pairs, sorted sets, maps of key/value pairs, or multi-dimensional time-series data points.


Cantor can help simplify and shrink the data access layer implementation in an application.

The majority of applications require some form of persistence, and the data access object layer usually accounts for a considerable portion of the code in these applications. This layer typically contains code to initialize and connect to a storage system; a mapping between the layout of the data in storage and its representation in the application; and code to compose and execute queries against the storage, handle edge cases, handle exceptions, and so on. This is where Cantor can help reduce the amount of code and its complexity.

Some of the commonly used patterns to access data are:

Cantor tries to provide a set of simple yet powerful abstractions that address the essential needs of the above-mentioned use cases. The implementation focuses more on simplicity and usability than on completeness, performance, or scale. It is not a suitable solution for large-scale applications (data sets larger than a few terabytes), nor is it recommended for high-throughput applications (more than a hundred thousand operations per second).


The library allows users to persist and query data stored in one of the following forms:

These data structures can be used to address a variety of use cases, and they are straightforward to implement simply and efficiently on top of relational databases. Cantor provides this implementation. It also tries to eliminate some of the complexities of relational databases, such as joins, constraints, and stored procedures. The aim of the library is to provide a simple yet powerful set of abstractions, so that users can spend more time on the application's business logic rather than on data storage and retrieval.


There are three main interfaces exposed for users to interact with: the Objects interface for key/value pairs; the Sets interface for persisted sorted sets; and the Events interface for time-series data.

All methods expect a namespace parameter, which can be used to slice data into multiple physically separate databases. A namespace must first be created by calling the create(namespace) method. It is also possible to drop a whole namespace by calling drop(namespace), after which any call against that namespace results in an IOException.
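The lifecycle above can be sketched as follows. This is an illustrative in-memory model only, not the library's actual API; the class and method names here are hypothetical, and the real signatures are in the javadocs.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the namespace lifecycle: create, use, drop.
// Names and signatures are illustrative, not Cantor's actual API.
public class Namespaces {
    private final Set<String> created = new HashSet<>();

    public void create(String namespace) {
        created.add(namespace);
    }

    public void drop(String namespace) {
        created.remove(namespace);
    }

    // every data operation first verifies the namespace exists;
    // a call against a dropped or unknown namespace throws IOException
    public void check(String namespace) throws IOException {
        if (!created.contains(namespace)) {
            throw new IOException("unknown namespace: " + namespace);
        }
    }
}
```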


Internally, all objects are stored in a table with two columns: a string column for the key and a blob column for the value, similar to the table below:

Operations on key/value pairs are defined in the Objects interface:
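The typical operations can be sketched with an in-memory stand-in for the two-column table described above. This is a hedged illustration, assuming store/get/delete-style operations; the exact method names and signatures belong to the Objects interface and are documented in the javadocs.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative in-memory sketch of key/value operations: each namespace
// maps to a "table" of string keys and byte[] values, mirroring the
// string/blob columns described above. Not the library's actual API.
public class ObjectsSketch {
    private final Map<String, Map<String, byte[]>> data = new HashMap<>();

    public void create(String namespace) {
        data.putIfAbsent(namespace, new HashMap<>());
    }

    public void store(String namespace, String key, byte[] value) throws IOException {
        table(namespace).put(key, value);
    }

    public byte[] get(String namespace, String key) throws IOException {
        return table(namespace).get(key);
    }

    public boolean delete(String namespace, String key) throws IOException {
        return table(namespace).remove(key) != null;
    }

    private Map<String, byte[]> table(String namespace) throws IOException {
        Map<String, byte[]> t = data.get(namespace);
        if (t == null) throw new IOException("unknown namespace: " + namespace);
        return t;
    }
}
```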


Sorted sets are stored in a table with three columns: a string column for the set name, a string column for the entry, and a long column for the weight associated with the entry, similar to the table below:

Operations on sorted sets are defined in the Sets interface:

Most operations support ranges. A range is described as up to count entries with weights between a min and a max value, starting from the start index. For example, a get call such as get(namespace, 'sample-set-1', 1, Long.MAX_VALUE, 0, 3, true) against the above data set returns at most 3 entries from sample-set-1 whose weights are between 1 and Long.MAX_VALUE, starting from index 0, ordered ascending.
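The range semantics can be sketched in plain Java: filter entries by weight, order by weight, skip to the start index, and cap the result at count. This is purely illustrative of the min/max/start/count/ascending parameters; the real implementation runs as SQL queries inside the library.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of sorted-set range queries: from entries whose weight is between
// min and max (ordered by weight), skip `start` entries and return at most
// `count`. Illustrative only; Cantor executes this against the database.
public class SetRange {
    public static List<String> get(Map<String, Long> set,
                                   long min, long max,
                                   int start, int count,
                                   boolean ascending) {
        Comparator<Map.Entry<String, Long>> byWeight = Map.Entry.comparingByValue();
        return set.entrySet().stream()
                .filter(e -> e.getValue() >= min && e.getValue() <= max)
                .sorted(ascending ? byWeight : byWeight.reversed())
                .skip(start)
                .limit(count)
                .map(Map.Entry::getKey)
                .collect(java.util.stream.Collectors.toList());
    }
}
```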


Events represent multi-dimensional time-series data points, with arbitrary metadata key/value pairs and, optionally, a byte[] payload attached to each event.

Operations on events are defined in the Events interface:
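An event as described above can be modeled as a timestamp plus string metadata, numeric dimensions, and an optional payload. The sketch below is an in-memory illustration of storing events and querying them over a time range; the class and method names are hypothetical, and the real Events interface signatures are in the javadocs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: an event carries a timestamp, metadata key/values,
// numeric dimensions, and an optional byte[] payload. Not Cantor's API.
public class EventsSketch {
    public static final class Event {
        final long timestampMillis;
        final Map<String, String> metadata;
        final Map<String, Double> dimensions;
        final byte[] payload; // may be null

        Event(long ts, Map<String, String> metadata,
              Map<String, Double> dimensions, byte[] payload) {
            this.timestampMillis = ts;
            this.metadata = metadata;
            this.dimensions = dimensions;
            this.payload = payload;
        }
    }

    private final List<Event> events = new ArrayList<>();

    public void store(long ts, Map<String, String> metadata,
                      Map<String, Double> dimensions, byte[] payload) {
        events.add(new Event(ts, metadata, dimensions, payload));
    }

    // return events whose timestamp falls in [start, end]
    public List<Event> get(long start, long end) {
        List<Event> out = new ArrayList<>();
        for (Event e : events) {
            if (e.timestampMillis >= start && e.timestampMillis <= end) {
                out.add(e);
            }
        }
        return out;
    }
}
```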

Events are internally bucketed into 1-hour intervals and stored in separate tables based on the dimension and metadata keys associated with an event. For example, an event with dimension d1 and metadata m1 is stored in a separate table from one with dimension d2 and metadata m2.
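The 1-hour bucketing amounts to grouping events by the hour their timestamp falls into, which can be sketched as an integer division. The bucket id below is illustrative; the actual table-naming scheme is internal to the library.

```java
import java.util.concurrent.TimeUnit;

// Sketch of 1-hour bucketing: timestamps in the same hour share a bucket id.
public class EventBuckets {
    public static long bucket(long timestampMillis) {
        return timestampMillis / TimeUnit.HOURS.toMillis(1);
    }
}
```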

Note that an event cannot contain more than 100 metadata keys or 400 dimension keys.

How to use?

The project is divided into a number of sub-modules so that users only pull in the dependencies they need.

Embedded Cantor

To use the embedded Cantor library which is implemented on top of H2, include the following dependency:


Embedded Cantor on H2
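A Maven dependency for the embedded module would look roughly like the snippet below. The coordinates are an assumption based on the module names mentioned here; check the project's POM for the authoritative groupId, artifactId, and version.

```xml
<!-- coordinates are illustrative; verify against the project's POM -->
<dependency>
    <groupId>com.salesforce.cantor</groupId>
    <artifactId>cantor-h2</artifactId>
    <version>${cantor.version}</version>
</dependency>
```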

On top of MySQL

To use Cantor on top of MySQL, include the following dependency:


Cantor on MySQL
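A Maven dependency for the MySQL module would look roughly like the snippet below. As above, the coordinates are an assumption based on the module names; verify them against the project's POM.

```xml
<!-- coordinates are illustrative; verify against the project's POM -->
<dependency>
    <groupId>com.salesforce.cantor</groupId>
    <artifactId>cantor-mysql</artifactId>
    <version>${cantor.version}</version>
</dependency>
```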

Why MySQL?

MySQL (or MariaDB) is a stable, performant, and scalable open-source relational database with a very active community and a variety of tools and services built around it.

Client/Server Mode

To use Cantor in client/server mode, execute cantor-server.jar similar to this:

$ java -jar cantor-server.jar <path-to-cantor-server.conf>

And include the following dependency on the client:


Cantor Client/Server Model
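A Maven dependency for the client module would look roughly like the snippet below. The artifact name is an assumption based on the client/server mode described here; verify the coordinates against the project's POM.

```xml
<!-- coordinates are illustrative; verify against the project's POM -->
<dependency>
    <groupId>com.salesforce.cantor</groupId>
    <artifactId>cantor-grpc-client</artifactId>
    <version>${cantor.version}</version>
</dependency>
```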

More details on how to instantiate the various implementations can be found in the javadocs.


Clone the repository:

$ git clone

Compile like this:

$ cd cantor/
$ ./