Instaclone

Instaclone is a simple, configurable command-line tool to publish and later install snapshots of files or directories in S3 (or another store). It keeps a local cache of downloaded snapshots so switching between previously cached snapshots is almost instant -- just a symlink or local copy from the cache.

It works nicely when you might want to save things in Git, but can't, due to files or directories being large or sensitive, or because you have multiple variations of files for one Git revision (for example, Mac and Linux). You can git-ignore the original files, publish them with Instaclone, and instead check in the Instaclone configuration file that references them.

Note that if all you want to do is put big files in Git, Git LFS may be what you want. Instaclone is more flexible about backend storage and versioning schemes, and offers a local cache.

Basic idea

Every item (a file or directory) can be published as an immutable snapshot. A snapshot has a local path name, a published location (such as an S3 prefix), and a version string. You can assign a version string yourself and reference it directly, or -- and this is where it's more useful -- assign it implicitly from other sources, such as the hash of another file, the platform, or the output of an arbitrary command. The snapshot is then published with that version string.
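
To make the implicit versioning idea concrete, here is a minimal shell sketch that derives a version string from the SHA1 of a file plus the output of a command. The helper names (sha1_of, version_string) and the "hash-output" layout are assumptions made for illustration; the exact format Instaclone produces may differ.

```shell
# Hypothetical helpers, not part of Instaclone itself.

# Print the SHA1 hex digest of a file, using whichever tool is available.
sha1_of() {
  if command -v sha1sum >/dev/null 2>&1; then
    sha1sum < "$1" | cut -d ' ' -f 1
  else
    shasum -a 1 < "$1" | cut -d ' ' -f 1
  fi
}

# version_string FILE COMMAND [ARGS...]
# Combine the file's hash with the command's output. The "-" separator is
# an assumption for this sketch.
version_string() {
  hashable=$1
  shift
  printf '%s-%s\n' "$(sha1_of "$hashable")" "$("$@")"
}
```

For example, `version_string npm-shrinkwrap.json uname` yields one string per platform for a given shrinkwrap file, and a new string whenever that file changes.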

Another client can then install that snapshot whenever it requires the same version of the item. Installing the first time on a new client requires a download. Subsequent installs use the local cache.
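
The download-once, reuse-afterwards behavior can be sketched as follows. This is an illustration under stated assumptions, not Instaclone's actual implementation: the cache layout (cache/name/version), the install_snapshot function, and the fetch callback are all hypothetical.

```shell
# Hypothetical sketch of a cached install.
# install_snapshot CACHE_DIR NAME VERSION TARGET FETCH_FUNCTION
# Assumes TARGET is either absent or a symlink left by a previous install.
install_snapshot() {
  cache_dir=$1; name=$2; version=$3; target=$4; fetch=$5
  cached="$cache_dir/$name/$version"

  # First install of this version: populate the cache by downloading.
  if [ ! -e "$cached" ]; then
    mkdir -p "$cache_dir/$name"
    "$fetch" "$name" "$version" "$cached" || return 1
  fi

  # Later installs only repoint the symlink at the cached copy.
  if [ -L "$target" ]; then
    rm "$target"
  fi
  ln -s "$cached" "$target"
}
```

Switching back to a version that is already cached skips the fetch step entirely, which is why repeated installs are nearly instant.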

Exact, cached node_modules snapshots

This tool isn't only for use with Node, but this is a good motivating use case.

If you instaclone publish after you npm install and npm shrinkwrap, you can switch back and forth between Git branches and run instaclone install instantly instead of npm install and waiting around minutes for npm to download, copy dependencies, etc. Your colleagues can do this too -- after you publish, they can run instaclone install and get a byte-for-byte exact copy of your node_modules on their machines, more quickly and reliably than if they had done the npm install themselves. Finally, your CI builds will speed up most of the time -- possibly by a lot!

See below for more info on this.

Features

Installation

Requires Python 2.7+ on Mac or Linux. (Windows untested and probably broken.) Then (with sudo if desired):

pip install instaclone

It requires rsync for faster file operations, as well as s3cmd, aws, s4cmd, or any similar tool you put into your upload_command and download_command settings. These must be in your path.
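
If you want to confirm the prerequisites are on your path before a first run, a small check like the following may help. The missing_tools function is a hypothetical helper, not something Instaclone provides; substitute whichever transfer tool your upload_command and download_command actually use.

```shell
# Hypothetical helper: print each named tool that is not found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

Running `missing_tools rsync s4cmd` prints nothing when both tools are available.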

Configuration

Instaclone requires two things to run:

As an example, here is a marginally self-explanatory instaclone.yml configuration, which you can drop anywhere you want and should probably check into Git. You can create as many files like this as desired in different directories, taking care that their remote_paths are unique.

---
# You can have as many items as you like and all will be installed.
# You'll want to git-ignore the local_paths below.
items:
  # A big file lives in this directory. It takes a while to generate, so we're going to
  # reference it in this file by version, instaclone publish, and anyone can
  # instaclone install it. We update the version string manually when we regenerate it.
  - local_path: my-big-and-occasionally-generated-resource.bin
    remote_prefix: s3://my-bucket/instaclone-resources
    remote_path: some/big-resources
    upload_command: s4cmd put -f $LOCAL $REMOTE
    download_command: s4cmd get $REMOTE $LOCAL
    # This is an explicitly set version of the file. It can be any string.
    version_string: 42a

  - local_path: node_modules
    remote_prefix: s3://my-bucket/instaclone-resources
    remote_path: my-app/node-stuff
    upload_command: s4cmd put -f $LOCAL $REMOTE
    download_command: s4cmd get $REMOTE $LOCAL
    # We generate the version string as a hash of the npm-shrinkwrap.json plus the architecture we're on:
    version_hashable: npm-shrinkwrap.json
    version_command: uname

See below for more on the node_modules one.

Usage

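Once Instaclone is configured, run instaclone publish to publish the items listed in the config file, or instaclone install to install them.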

Run instaclone --help for a complete list of flags and settings.

If you have multiple items defined in the instaclone.yml file, you can list them as arguments to instaclone publish or instaclone install, e.g. instaclone install node_modules.

Finally, note that by default, installations are done with a symlink, but this can be customized in the config file to copy files. As a shortcut, if you run instaclone install --copy, it will perform a fast rsync-based copy of the files. You should use the --copy option if you plan to modify the files after installation.

Why you should Instaclone node_modules

This use case deserves a little more explanation.

While npm is amazingly convenient during development, managing the workflow around npm install can be a pain point in terms of speed, reliability, and reproducibility as you scale out builds and in production:

A simpler and more scalable solution to this is to archive the entire node_modules directory, and put it somewhere reliable, like S3. But it can be large and slow to manage if it's always published and then fetched every time you need it. It's also a headache to script, especially in a continuous integration environment, where builds run on all branches every few minutes and you want to reinstall only when the checked-in npm-shrinkwrap.json file changes. Oh, and also the builds are platform-dependent, so you need to publish separately on macOS and Linux.

Instaclone does all this for you. If you already have an npm shrinkwrap workflow, it's pretty easy. It lets you specify where to store your node_modules in S3, and version that entire tree by the SHA1 hash of the npm-shrinkwrap.json file together with the architecture. You can then work on multiple branches and swap them in and out -- a bit like how nvm caches Node installations.

Copy and edit the example config file to try it. On your CI system, you might want to have some sort of automation that tries to reuse pre-published versions, but if not, publishes automatically:

    echo "Running instaclone install or publish..."
    instaclone install || (rm -rf ./node_modules && npm install && instaclone publish)

Note that in normal scenarios, the installed files are symlinked to the read-only cache. If you want to npm install after doing an instaclone install, use instaclone install --copy, and all files will be copied instead.

Maturity

Mostly a one-day hack, but it should now be fairly workable. It performs well in at least one continuous build environment with quite large directories synced regularly on Mac and Linux.

Caveats

Running tests

Tests require s4cmd:

$ TEST_BUCKET=my-s3-bucket tests/run.sh

This is a bash-based harness that runs the test script at tests/tests.sh. Its output can then be git diffed with the previous output.

Contributing

Yes, please! File issues for bugs or general discussion. PRs welcome as well -- just figure out how to run the tests and document any other testing that's been done.

License

Apache 2.