This is a tool for transforming and processing VCF files in a scalable manner based on Apache Beam using Dataflow on Google Cloud Platform.
It can be used to load VCF files directly into BigQuery, supporting hundreds of thousands of files, millions of samples, and billions of records. It also provides a preprocessor to validate VCF files so that inconsistencies can be easily identified.
Please see this presentation for a high-level overview of BigQuery and how to use Variant Transforms and BigQuery effectively. Please also read the blog post about how a GCP customer used Variant Transforms for breakthrough clinical data science with BigQuery.
The easiest way to run the VCF to BigQuery pipeline is to use the docker image, as it has the binaries and all dependencies pre-installed. Please ensure you have the latest gcloud tool by running gcloud components update (more details here).
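After updating, you can confirm the installed version with:
gcloud version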
Use the following command to get the latest version of Variant Transforms.
docker pull gcr.io/cloud-lifesciences/gcp-variant-transforms
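To check that the image was pulled successfully, you can list your local copy of it (this simply filters your local images by repository name):
docker images gcr.io/cloud-lifesciences/gcp-variant-transforms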
Run the script below and replace the following parameters:
GOOGLE_CLOUD_PROJECT: This is your project ID that contains the BigQuery dataset.
GOOGLE_CLOUD_REGION: You must choose a geographic region for Cloud Dataflow to process your data, for example: us-west1. For more information please refer to Setting Regions.
TEMP_LOCATION: This can be any folder in Google Cloud Storage that your project has write access to. It's used to store temporary files and logs from the pipeline.
INPUT_PATTERN: A location in Google Cloud Storage where the VCF files are stored. You may specify a single file or provide a pattern to load multiple files at once. Please refer to the Variant Merging documentation if you want to merge samples across files. The pipeline supports gzip, bzip, and uncompressed VCF formats. However, it runs slower for compressed files as they cannot be sharded.
OUTPUT_TABLE: The full path to a BigQuery table to store the output.
#!/bin/bash
# Parameters to replace:
GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
TEMP_LOCATION=gs://BUCKET/temp
INPUT_PATTERN=gs://BUCKET/*.vcf
OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
COMMAND="vcf_to_bq \
--input_pattern ${INPUT_PATTERN} \
--output_table ${OUTPUT_TABLE} \
--job_name vcf-to-bigquery \
--runner DataflowRunner"
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}" \
"${COMMAND}"
--project, --region, and --temp_location are required inputs. You must set all of them, unless your project and region defaults are set in your local gcloud configuration. You may set the default project and region using the following commands:
gcloud config set project GOOGLE_CLOUD_PROJECT
gcloud config set compute/region REGION
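You can verify the active defaults at any time with:
gcloud config list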
The underlying pipeline uses Cloud Dataflow. You can navigate to the Dataflow Console to see a more detailed view of the pipeline (e.g. the number of records being processed, the number of workers, and more detailed error logs).
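You can also check job status from the command line; for example, to list jobs in the region you used:
gcloud dataflow jobs list --region="${GOOGLE_CLOUD_REGION}"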
In addition to using the docker image, you may run the pipeline directly from source. First install git, python, pip, and virtualenv:
sudo apt-get install -y git python-pip python-dev build-essential
sudo python -m pip install --upgrade pip
sudo python -m pip install --upgrade virtualenv
Run virtualenv, clone the repo, and install pip packages:
virtualenv venv
source venv/bin/activate
git clone https://github.com/googlegenomics/gcp-variant-transforms.git
cd gcp-variant-transforms
python -m pip install --upgrade .
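As a quick sanity check of the installation, you can print the pipeline's available flags (assuming the entry point exposes standard --help output):
python -m gcp_variant_transforms.vcf_to_bq --help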
You may use the DirectRunner (aka local runner) for small files (e.g. 10,000 records) or the DataflowRunner for larger files. Files should be stored on Google Cloud Storage if using Dataflow, but may be stored locally for the DirectRunner.
Example command for DirectRunner:
python -m gcp_variant_transforms.vcf_to_bq \
--input_pattern gcp_variant_transforms/testing/data/vcf/valid-4.0.vcf \
--output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
--job_name vcf-to-bigquery-direct-runner \
--temp_location "${TEMP_LOCATION}"
Example command for DataflowRunner:
python -m gcp_variant_transforms.vcf_to_bq \
--input_pattern gs://BUCKET/*.vcf \
--output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
--job_name vcf-to-bigquery \
--setup_file ./setup.py \
--runner DataflowRunner \
--project "${GOOGLE_CLOUD_PROJECT}" \
--region "${GOOGLE_CLOUD_REGION}" \
--temp_location "${TEMP_LOCATION}"
The VCF files preprocessor is used to validate datasets so that inconsistencies can be easily identified. It can be used as a standalone validator to check the validity of VCF files, or as a helper tool for the VCF to BigQuery pipeline. Please refer to VCF files preprocessor for more details.
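As a rough sketch, the preprocessor can be run through the same docker image and wrapper pattern shown above; the entry point name (vcf_to_bq_preprocess) and the --report_path flag below should be verified against the preprocessor documentation:
COMMAND="vcf_to_bq_preprocess \
  --input_pattern ${INPUT_PATTERN} \
  --report_path gs://BUCKET/report.tsv \
  --runner DataflowRunner"
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"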
The BigQuery to VCF pipeline is used to export variants stored in BigQuery to a single VCF file. Please refer to BigQuery to VCF pipeline for more details.
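Similarly, a minimal sketch of the export, reusing the docker pattern above; the bq_to_vcf entry point and its --input_table and --output_file flags are assumptions to be checked against the BigQuery to VCF documentation:
COMMAND="bq_to_vcf \
  --input_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
  --output_file gs://BUCKET/output.vcf \
  --job_name bq-to-vcf \
  --runner DataflowRunner"
docker run -v ~/.config:/root/.config \
  gcr.io/cloud-lifesciences/gcp-variant-transforms \
  --project "${GOOGLE_CLOUD_PROJECT}" \
  --region "${GOOGLE_CLOUD_REGION}" \
  --temp_location "${TEMP_LOCATION}" \
  "${COMMAND}"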