Note: This project is no longer actively maintained by Pinterest.


Terrapin: Serving system for Hadoop generated data

Terrapin is a low latency serving system providing random access over large data sets, generated by Hadoop jobs and stored on HDFS clusters.

Terrapin can ingest data from S3, HDFS or directly from a MapReduce job. Terrapin is elastic, fault tolerant and performant enough to be used for various web-scale applications (such as serving personalized recommendations on a website). Terrapin exposes a key-value data model.

Terrapin achieves these goals by storing the output of MapReduce jobs on HDFS in a file format that allows fast random access. A Terrapin server process runs on every data node and serves the files stored on that node. This design gives us the scalability of HDFS and Hadoop while also achieving low latencies, since data is served from local disk. HDFS optimizations such as short-circuit local reads, the OS page cache and (where available) mmap reduce tail latency by avoiding round trips over a TCP socket or the network for HDFS reads. A Terrapin controller is responsible for ensuring data locality.

If you already have an HDFS cluster running, very little needs to be done to set up Terrapin. If you are interested in the detailed design, check out DESIGN.md.

Key Features

- Low latency random access over large data sets generated by Hadoop jobs
- Elastic and fault tolerant, scaling with the underlying HDFS cluster
- Data ingestion from S3, HDFS or directly from a MapReduce job
- Simple key-value data model
- Data locality: servers read from local disk, avoiding network round trips

Getting Started

Java 7 is required to build Terrapin, and Terrapin currently supports Hadoop 2. To build, run the following commands from the root of the git repository. Note that HBase compiled against Hadoop 2 is required for working with HFiles but is not available in the central Maven repository, so it must be installed locally first.

git clone [terrapin-repo-url]
cd terrapin

# Install HBase 0.94 artifacts compiled against Hadoop 2.
mvn install:install-file \
  -Dfile=thirdparty/hbase-hadoop2-0.94.7.jar \
  -DgroupId=org.apache.hbase \
  -DartifactId=hbase-hadoop2 \
  -Dversion=0.94.7 \
  -Dpackaging=jar

# Building against default Hadoop version - 2.7.1
mvn package

# Building against a custom Hadoop version (if different from 2.7.1)
mvn [-Dhadoop.version=X -Dhadoop.client.version=X] package

To set up a Terrapin cluster, follow the instructions in SETUP.md.

Usage

Terrapin can ingest data written to S3/HDFS or it can directly ingest data from a MapReduce job.

Once you have your cluster up and running, you can find several usage examples at USAGE.md.

Tools

Terrapin ships with tools for querying filesets and for administrative operations such as deleting a fileset, rolling back a fileset and checking cluster health.

Querying a Fileset

Run the following command from the root of your git repo.

java -cp client/target/*:client/target/lib/*        \
    -Dterrapin.config={properties_file}             \
    com.pinterest.terrapin.client.ClientTool {fileset} {key}

Deleting a Fileset
ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh deleteFileSet {PROPERTIES_FILE} {FILE_SET}

Note that deletion of a fileset is asynchronous: the fileset is marked for deletion and later garbage collected by the controller.

Rolling back a Fileset
ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh rollbackFileSet {PROPERTIES_FILE} {FILE_SET}

The tool will display the different versions (as HDFS directories) that you can roll back to. Select the appropriate version and confirm. To use this functionality, the fileset must have been uploaded with multiple versions, as described in USAGE.md.

Checking health of a cluster
ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh {PROPERTIES_FILE} checkClusterHealth

The tool will display any inconsistencies in ZooKeeper state or any filesets not serving properly.

Monitoring/Diagnostics

You can access the controller's web UI at http://{controller_host}:50030/status; the port can be changed by setting the "status_server_binding_port" property. The UI shows all the filesets on the cluster and their serving health. You can click a fileset to get more information about it, such as the current serving version and partition assignment.

You can also retrieve detailed metrics by running curl localhost:9999/stats.txt on the controller or on a server; the port can be changed by setting the "ostrich_metrics_port" property. These metrics are exported using Twitter's Ostrich library and are easy to parse. Among other useful metrics, the controller exports serving health across the whole cluster (the percentage of online shards for each fileset), while each server exports latency and value size percentiles for each fileset.
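As a rough illustration of how these text metrics can be consumed, here is a minimal sketch of a parser for the stats dump. The sample format (indented "name: value" lines grouped under section headers such as "counters:") is an assumption about Ostrich's text exporter, not code taken from Terrapin; adapt the parsing to the exact output your cluster produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: flatten an Ostrich-style stats.txt dump into a
// "section/name" -> "value" map. The input format is an assumption.
public class StatsParser {
  public static Map<String, String> parse(String statsText) {
    Map<String, String> stats = new LinkedHashMap<>();
    String section = "";
    for (String line : statsText.split("\n")) {
      if (line.trim().isEmpty()) continue;
      if (!line.startsWith(" ")) {
        // Unindented line: a section header like "counters:".
        section = line.replace(":", "").trim();
      } else {
        // Indented "name: value" line; key it as "section/name".
        int idx = line.indexOf(':');
        if (idx < 0) continue;
        String name = line.substring(0, idx).trim();
        String value = line.substring(idx + 1).trim();
        stats.put(section + "/" + name, value);
      }
    }
    return stats;
  }

  public static void main(String[] args) {
    String sample = "counters:\n  requests: 100\ngauges:\n  heap_used: 42\n";
    System.out.println(parse(sample));
  }
}
```

A flat map like this is convenient for shipping individual counters and percentiles into a time-series system or an alerting check.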

Maintainers

Help

If you have any questions or comments, you can reach us at [email protected]