
GCP Census

A Google App Engine (GAE) Python application that regularly collects metadata about BigQuery tables and stores it in BigQuery.

GCP Census was created to answer common questions about BigQuery usage, e.g. how much data is stored, which tables are the largest, and how they grow over time.

Every such question can now be easily answered by querying the metadata tables in BigQuery or by looking at a dashboard created in Google Data Studio.

Query examples using Standard SQL dialect
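
As a sketch of what such a query could look like (the metadata table name bigquery.table_metadata and the column names projectId, datasetId and numBytes are assumptions here; check the schema that /createModels actually creates), the following script prints a Standard SQL query listing the largest datasets:

```python
# Sketch of an example Standard SQL query over the collected metadata.
# Table and column names (bigquery.table_metadata, projectId, datasetId,
# numBytes) are assumptions -- verify them against the deployed schema.

EXAMPLE_QUERY = """\
SELECT
  projectId,
  datasetId,
  ROUND(SUM(numBytes) / POW(2, 30), 2) AS total_gb,
  COUNT(*) AS table_count
FROM `YOUR-PROJECT-ID.bigquery.table_metadata`
GROUP BY projectId, datasetId
ORDER BY total_gb DESC
LIMIT 10
"""

def main():
    # With google-cloud-bigquery installed and credentials configured,
    # the same string could be submitted like this:
    #   from google.cloud import bigquery
    #   for row in bigquery.Client().query(EXAMPLE_QUERY):
    #       print(row.projectId, row.datasetId, row.total_gb)
    print(EXAMPLE_QUERY)

if __name__ == "__main__":
    main()
```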

Dataset cost estimation

Note that the views with costs provide only an estimation, as they are derived from table metadata and list prices rather than from actual billing data.
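
As a sketch of how such an estimate can be computed from table metadata (the prices below are assumed list prices of $0.02/GB/month for active storage and $0.01/GB/month for long-term storage of tables untouched for 90+ days; always check current BigQuery pricing):

```python
# Sketch: estimate the monthly storage cost of a table from its metadata.
# The prices are assumptions based on historical BigQuery list prices,
# not authoritative figures.

ACTIVE_PRICE_PER_GB = 0.02      # USD / GB / month, assumed
LONG_TERM_PRICE_PER_GB = 0.01   # USD / GB / month, assumed
LONG_TERM_THRESHOLD_DAYS = 90
GB = 2 ** 30

def estimate_monthly_storage_cost(num_bytes, days_since_modified):
    """Return an estimated monthly storage cost in USD."""
    price = (LONG_TERM_PRICE_PER_GB
             if days_since_modified >= LONG_TERM_THRESHOLD_DAYS
             else ACTIVE_PRICE_PER_GB)
    return num_bytes / GB * price

# A 500 GiB table untouched for half a year gets the long-term rate.
print(estimate_monthly_storage_cost(500 * GB, 180))  # 5.0
```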

How it works

Architecture diagram

GCP Census retrieves BigQuery metadata using the BigQuery REST API:

  1. A daily run is triggered by GAE cron (see cron.yaml for exact details)
  2. GCP Census iterates over all projects/datasets/tables to which it has access
  3. A task is created for each table and queued for execution in the GAE Task Queue
  4. Each task worker retrieves the table metadata and streams it into the bigquery.table_metadata table. For partitioned tables, GCP Census also retrieves a partition summary by querying the partitioned table and stores it in the bigquery.partition_metadata table
  5. Users can query the metadata using the BigQuery UI/API
  6. Optionally, you can create a Data Studio dashboard based on the metadata

GCP Census retrieves metadata only for tables it has read access to, so the scope is derived entirely from your GCP IAM settings.
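
The steps above can be sketched as follows (the BigQuery REST API discovery calls and the GAE Task Queue are stubbed out here, and the helper function names are illustrative, not the app's actual API):

```python
# Sketch of the GCP Census collection flow. The listing functions and the
# queue are stand-ins for the BigQuery REST API and the GAE Task Queue.

def list_projects():
    # Real app: project listing via the BigQuery REST API.
    return ["project-a"]

def list_datasets(project):
    return {"project-a": ["dataset-1"]}[project]

def list_tables(project, dataset):
    return {("project-a", "dataset-1"): ["table-x", "table-y"]}[(project, dataset)]

task_queue = []

def enqueue_table_task(project, dataset, table):
    # Real app: a GAE Task Queue task whose worker fetches the table
    # metadata and streams it into bigquery.table_metadata (and, for
    # partitioned tables, into bigquery.partition_metadata).
    task_queue.append((project, dataset, table))

def daily_run():
    # Triggered by GAE cron: iterate over everything readable and fan
    # out one task per table.
    for project in list_projects():
        for dataset in list_datasets(project):
            for table in list_tables(project, dataset):
                enqueue_table_task(project, dataset, table)

daily_run()
print(task_queue)  # [('project-a', 'dataset-1', 'table-x'), ('project-a', 'dataset-1', 'table-y')]
```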

Build instructions

  1. Create a GCP project and enable billing for it
  2. Clone the GCP Census repository
  3. Specify the metadata output BigQuery location in app.yaml (defaults to 'EU')
  4. Install dependencies (ideally using virtualenv); the second command vendors them into lib/ so that they are uploaded with the app:
    pip install -r requirements.txt
    pip install -t lib -r requirements.txt
  5. Deploy to App Engine using gcloud CLI tool:
    gcloud app deploy --project YOUR-PROJECT-ID -v v1 app.yaml config/cron.yaml config/queue.yaml 
  6. Grant the bigquery.dataViewer role to the YOUR-PROJECT-ID@appspot.gserviceaccount.com service account at the GCP organisation, folder, or individual project level.
  7. Enable BigQuery in the project. BigQuery is automatically enabled in new projects. To activate it in a pre-existing project, enable the BigQuery API.
  8. The GCP Census job will then be triggered daily by cron (see cron.yaml for exact details)
  9. Optionally, you can manually trigger the cron jobs in the Cloud Console:
    • run /createModels to create the BigQuery dataset and tables
    • run /bigQuery to start collecting BigQuery metadata

Security

GCP Census endpoints are accessible only to GAE administrators, i.e. all endpoints are secured with login: admin in app.yaml. Still, anyone may attempt to access your app and will be redirected to the Google Account login page.

That's why we strongly recommend enabling the GAE firewall on your project. You can do so with three simple gcloud commands:

gcloud app firewall-rules create 500 --action allow --source-range 0.1.0.1 --description "Allow GAE cron" --project YOUR-PROJECT-ID
gcloud app firewall-rules create 510 --action allow --source-range 0.1.0.2 --description "Allow GAE tasks" --project YOUR-PROJECT-ID
gcloud app firewall-rules update default --action deny --project YOUR-PROJECT-ID

Development

You can find the complete development setup and build steps in .travis.yml

Community

Subscribe or post to gcp-census-discuss@googlegroups.com to follow along and ask questions about GCP Census.

Contributing

Feel free to submit feature requests and bug reports under Issues.

If you would like to contribute to our development efforts, please review our contributing guidelines and submit a pull request.