Arabesque: Distributed graph mining made simple

Current Version: 1.0.1-BETA

Arabesque is a distributed graph mining system that enables quick and easy development of graph mining algorithms, while providing a scalable and efficient execution engine running on top of Hadoop.

Benefits of Arabesque:

Arabesque is open source and released under the Apache 2.0 license.

Requirements for running

Preparing your input

Arabesque currently takes as input graphs with the following formats:

Vertex ids are expected to be sequential integers between 0 and (total number of vertices - 1).
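
The id requirement above can be checked before uploading a graph. The following is a minimal sketch, not part of Arabesque: `check_sequential_ids` is a hypothetical helper that assumes the vertex ids have already been extracted to one id per line (how to extract them depends on the input format and is not shown here) and that there are no blank lines.

```shell
# Hypothetical helper (not shipped with Arabesque).
# Reads one vertex id per line on stdin and succeeds only if the
# distinct ids are exactly 0, 1, ..., n-1.
check_sequential_ids() {
    sort -n -u | awk '
        BEGIN { bad = 0 }
        # After a numeric, de-duplicated sort, the k-th line (1-based)
        # must hold the id k-1; anything else means a gap or a bad id.
        $1 != NR - 1 { bad = 1; exit }
        END { exit bad }
    '
}
```

For example, `printf '2\n0\n1\n' | check_sequential_ids && echo "ids ok"` succeeds, while an id list such as 0, 2, 3 fails because id 1 is missing.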

Test/Execute the included algorithms

You can find an execution-helper script and several configuration files for the different algorithms under the scripts folder in the repository.


  1. Compile Arabesque using

    mvn package

    You will find the jar file under target/

  2. Copy the newly generated jar file, the script and the desired YAML files into a folder on a computer with access to a Hadoop cluster.

  3. Upload the input graph to HDFS. Sample graphs are under the data directory. Make sure you have initialized HDFS first.

    hdfs dfs -put <input graph file> <destination graph file in HDFS>

  4. Configure the cluster.yaml file with the desired number of containers, threads per container and other cluster-wide configurations.

  5. Configure the algorithm-specific YAML files to reflect the HDFS location of your input graph as well as the parameters you want to use (for example, the maximum size for motifs and cliques, or the support for FSM).

  6. Run your desired algorithm by executing:

    ./ cluster.yaml <algorithm>.yaml

  7. Follow execution progress by checking the logs of the Hadoop containers.

  8. Check any output (generated with calls to the output function) in the HDFS path indicated by the output_path configuration entry.
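
Steps 3 and 6 above can be wrapped in a small script. This is only a sketch under stated assumptions: `run` and `arabesque_workflow` are hypothetical names introduced here, the helper script's path is passed in as an argument rather than assumed, and steps 1, 2, 4 and 5 are taken to be already done in the current folder. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
# Hypothetical wrapper (not shipped with Arabesque).
# Prints each command unless DRY_RUN=0, in which case it executes it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

arabesque_workflow() {
    local_graph=$1      # input graph on the local file system
    hdfs_graph=$2       # destination graph file in HDFS
    helper_script=$3    # path to the execution-helper script from the scripts folder
    algorithm_yaml=$4   # algorithm-specific configuration file

    # Step 3: upload the input graph to HDFS.
    run hdfs dfs -put "$local_graph" "$hdfs_graph" &&
    # Step 6: launch the algorithm with the cluster-wide and
    # algorithm-specific configuration files.
    run "$helper_script" cluster.yaml "$algorithm_yaml"
}
```

Calling `arabesque_workflow <input graph file> <destination graph file in HDFS> <helper script> <algorithm>.yaml` in dry-run mode prints the two commands in order, which makes it easy to review them before running against a real cluster.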

Implementing your own algorithms

The easiest way to start writing your own implementations on top of Arabesque is to fork our Arabesque Skeleton Project. You can do this via GitHub or manually by executing the following:

    git clone $PROJECT_PATH
    git remote rename origin upstream
    git remote add origin $YOUR_REPO_URL