Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
This README file is about building and running the refine.bio project source code.
Refine.bio currently has four sub-projects contained within this repo:
- common: Contains code needed by both foreman and workers.
- foreman: Discovers data to download/process and manages jobs.
- workers: Runs Downloader and Processor jobs.
- infrastructure: Manages infrastructure for Refine.bio.
refinebio uses a feature branch based workflow. New features should be developed on new feature branches, and pull requests should be sent to the dev branch for code review. Merges into master happen at the end of sprints, and tags in master correspond to production releases.
To run Refine.bio locally, you will need to have the prerequisites installed onto your local machine. This will vary depending on whether you are developing on a Mac or a Linux machine. Linux instructions have been tested on Ubuntu 16.04 or later, but other Linux distributions should be able to run the necessary services. Microsoft Windows is currently unsupported by this project.
Note: The install_all.sh script will configure a git pre-commit hook to auto-format your python code. This will format your code in the same way as the rest of the project, allowing it to pass our linting check.
The easiest way to run Refine.bio locally is to run ./scripts/install_all.sh to install all of the necessary dependencies. As long as you are using a recent version of Ubuntu or macOS it should work. If you are using another version of Linux it should still install most of the dependencies as long as you give the appropriate INSTALL_CMD environment variable, but some dependencies may be named differently in your package manager than in Ubuntu's.
The following services will need to be installed:
- Python3 and Pip: sudo apt-get -y install python3-pip
- Docker: Be sure to follow the post installation steps so Docker does not need sudo permissions.
- Terraform
- Nomad can be installed on Linux clients with sudo ./scripts/install_nomad.sh.
- pip3 can be installed on Linux clients with sudo apt-get install python3-pip
- black can be installed on Linux clients with pip3 install black
- git-crypt
- jq
- iproute2
- shellcheck
Instructions for installing Docker, Terraform, and Nomad can be found by following the link for each service. git-crypt, jq, iproute2, and shellcheck can be installed via sudo apt-get install git-crypt jq iproute2 shellcheck.
Once Homebrew is installed, the other required applications can be installed by running:
brew install iproute2mac git-crypt nomad terraform jq black shellcheck.
Many of the computational processes running are very memory intensive. You will need to raise the amount of virtual memory available to Docker from the default of 2GB to 12GB or 24GB, if possible.
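One way to confirm that the new limit took effect is to compare the memory Docker reports against the 12GB suggested above. This is a minimal sketch, assuming that docker info --format '{{.MemTotal}}' prints the total in bytes; the helper names bytes_to_gib and check_docker_memory are hypothetical and not part of this repo.

```shell
# Hypothetical helpers, not part of the refine.bio scripts.

# Convert a byte count to whole GiB (rounded down).
bytes_to_gib() {
    echo $(( $1 / 1024 / 1024 / 1024 ))
}

# Succeed if Docker reports at least the requested number of GiB (default 12).
# Assumes `docker info --format '{{.MemTotal}}'` prints total memory in bytes.
check_docker_memory() {
    wanted=${1:-12}
    total_bytes=$(docker info --format '{{.MemTotal}}') || return 2
    [ "$(bytes_to_gib "$total_bytes")" -ge "$wanted" ]
}

# Usage (needs a running Docker daemon):
#   check_docker_memory 12 || echo "Consider raising Docker's memory limit"
```

The total Docker reports can be slightly lower than the configured value, so a failed check near the threshold is worth confirming by hand.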
Run ./scripts/create_virtualenv.sh to set up the virtualenv. It will activate the dr_env for you the first time. This virtualenv is valid for the entire refinebio repo. Sub-projects each have their own environments managed by their containers. When returning to this project you should run source dr_env/bin/activate to reactivate the virtualenv.
refinebio also depends on Postgres and Nomad. Postgres can be run in a local Docker container, but Nomad must be run on your development machine.
To start a local Postgres server in a Docker container, use:
Then, to initialize the database, run:
If you need to access a psql shell for inspecting the database, you can use:
or if you have psql installed this command will give you a better shell experience:
source scripts/common.sh && PGPASSWORD=mysecretpassword psql -h $(get_docker_db_ip_address) -U postgres -d data_refinery
Similarly, you will need to run a local Nomad service in development mode.
However if you run Linux and you have followed the installation instructions, you can run Nomad with:
sudo -E ./scripts/run_nomad.sh
(Note: This step may take some time because it downloads lots of files.)
Nomad is an orchestration tool which Refine.bio uses to run Processor jobs. Jobs are queued by sending a message to the Nomad agent, which will then launch a Docker container which runs the job. If address conflicts emerge, old Docker containers can be purged with docker container prune -f.
The common sub-project contains code that the other sub-projects depend on. So before anything else you should prepare the distribution directory common/dist with this script:
(Note: This step requires the postgres container to be running and initialized.)
Note: there is a small chance this might fail with a can't stat error. If this happens, you have to manually change permissions on the volumes directory with sudo chmod -R 740 volumes_postgres and then re-run the migrations.
One of the API endpoints is powered by ElasticSearch. ElasticSearch must be running for this functionality to work. A local ElasticSearch instance in a Docker container can be executed with:
And then the ES Indexes (akin to Postgres 'databases') can be created with:
The end to end tests require a separate Nomad client to be running so that the tests can be run without interfering with local development. The second Nomad client can be started with:
sudo -E ./scripts/run_nomad.sh -e test
To run the entire test suite:
(Note: Running all the tests can take some time, especially the first time because it downloads a lot of files.)
You can use the following to get the current status of nomad when running in the test environment.
$ source scripts/common.sh
$ set_nomad_test_address
$ nomad status
Running the end to end tests is tricky because Nomad needs to pull Docker images that contain our code. We have a Docker image registry that runs locally, but you'll need to update it with different images in order to make the code run. The script ./scripts/prepare_image.sh can be used to prepare the images before pushing them.
$ ./scripts/prepare_image.sh -i downloaders -d localhost:5000
$ docker push localhost:5000/dr_downloaders:latest
$ ./scripts/prepare_image.sh -i no_op -d localhost:5000
$ docker push localhost:5000/dr_no_op:latest
Those commands cover the downloaders and no_op images; the same steps need to be executed for the other images:
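Because the same two steps repeat for every image, a small loop can print them for review before anything is run. This is only a sketch: print_image_commands and the REGISTRY variable are hypothetical, and the dr_<name> image naming is assumed from the two examples above.

```shell
# Hypothetical helper: print the prepare/push commands for each image name.
# REGISTRY defaults to the local registry used in the examples above.
# Assumes every image follows the dr_<name> pattern seen for downloaders and no_op.
print_image_commands() {
    registry=${REGISTRY:-localhost:5000}
    for image in "$@"; do
        echo "./scripts/prepare_image.sh -i ${image} -d ${registry}"
        echo "docker push ${registry}/dr_${image}:latest"
    done
}

# Usage: review the output first, then run the printed commands by hand.
#   print_image_commands downloaders no_op
```

Printing rather than executing keeps the helper safe to try, since nothing is built or pushed until you run the commands yourself.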
If you want to debug the status of a specific nomad job you can use:
$ nomad status NO_OP_0_2048/dispatch-1567796915-3d7c7c87
$ nomad status f9c1345b
$ nomad logs f9c1345b
f9c1345b is the allocation ID that is returned in
These tests will also be run continuously for each commit via CircleCI.
For more granular testing, you can just run the tests for specific parts of the system.
To just run the API tests:
To just run the common tests:
To just run the foreman tests:
To just run the workers tests:
If you only want to run tests with a specific tag, you can do that too. For example, to run just the salmon tests:
./workers/run_tests.sh -t salmon
All of our worker tests are tagged, generally based on the Docker image required to run them. Possible values for worker test tags are:
- affymetrix
- agilent
- downloaders
- illumina
- no_op
- qn (short for quantile normalization)
- salmon
- smasher
- transcriptome
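A mistyped tag is easy to miss, so it can help to check the tag against the list above before starting a run. The sketch below hard-codes the tags from this README; is_valid_worker_tag is a hypothetical helper, not something the test scripts provide.

```shell
# Tags copied from the list above; update this if the project adds more.
WORKER_TEST_TAGS="affymetrix agilent downloaders illumina no_op qn salmon smasher transcriptome"

# Hypothetical helper: succeed only if $1 is one of the known tags.
is_valid_worker_tag() {
    for tag in $WORKER_TEST_TAGS; do
        [ "$tag" = "$1" ] && return 0
    done
    return 1
}

# Usage:
#   is_valid_worker_tag salmon && ./workers/run_tests.sh -t salmon
```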
In addition to following pep8, python files must also conform to the formatting style enforced by black.
black is a highly opinionated auto-formatter. (black's highly opinionated style is a strict subset of pep8.) The easiest way to conform to this style is to run black . --line-length=100. This will auto-format your code. Running the ./scripts/install_all.sh script will install a pre-commit git hook that will run this formatter on every commit you make locally. Under the hood this uses pre-commit, which you can also install directly by running pip3 install pre-commit && pre-commit install. Then, if you want to run pre-commit without making a git commit, you can use pre-commit run --all-files. To install black see the installation instructions. Any Pull Requests that do not conform to the style enforced by black will be flagged by our continuous integration and will not be accepted until that check passes.
All user-facing scripts have been linted with shellcheck for common warnings and POSIX-correctness. If a script is user-facing, it should ideally be POSIX-compliant and have the extension .sh, but if bashisms are necessary it should have the extension .bash. To install shellcheck, you can run apt-get install shellcheck or brew install shellcheck. Then, you can lint scripts with
During development, you may encounter some occasional strangeness. Here are some things to watch out for:
Docker.raw file filling. You can prune old images with docker system prune -a.
We have created some utilities to help keep R stable and reliable, and to stop it from periodically causing build errors related to version incompatibilities. The primary goal of these is to pin the version for every R package that we have. The R package devtools is useful for this, but in order to be able to install a specific version of it, we've created the R script
There is another gotcha to be aware of should you ever need to modify versions of R or its packages. In Dockerfiles for images that need the R language, we install apt packages that look like
r-base-core=3.4.2-1xenial1. It's unclear why the version for these is so weird, but it was determined by visiting the package list here: https://cran.revolutionanalytics.com/bin/linux/ubuntu/xenial/ If it needs to be updated then a version should be selected from that list.
Additionally there are two apt packages, r-base and r-base-core, which seem to be very similar except that r-base-core is slimmed down some by not including some additional packages. For a while we were using r-base, but we switched to r-base-core when we pinned the version of the R language because the r-base package caused an apt error.
Once you've built the common/dist directory and have the Nomad and Postgres services running, you're ready to run jobs. To run the API you also need the elasticsearch service running.
There are three kinds of jobs within Refine.bio.
The API can be run with:
Surveyor Jobs discover samples to download/process along with recording metadata about the samples. A Surveyor Job should queue Downloader Jobs to download the data it discovers.
The Surveyor can be run with the ./foreman/run_surveyor.sh script. The first argument to this script is the type of Surveyor Job to run, which will always be
Details on these expected arguments can be viewed by running:
./foreman/run_surveyor.sh survey_all -h
The Surveyor can accept a single accession code from any of the source data repositories (e.g., Sequencing Read Archive, ArrayExpress, Gene Expression Omnibus):
./foreman/run_surveyor.sh survey_all --accession
Example for a GEO experiment:
./foreman/run_surveyor.sh survey_all --accession GSE85217
Example for an ArrayExpress experiment:
./foreman/run_surveyor.sh survey_all --accession E-MTAB-3050 # AFFY
./foreman/run_surveyor.sh survey_all --accession E-GEOD-3303 # NO_OP
Transcriptome indices are a bit special. For species within the "main" Ensembl division, the species name can be provided like so:
./foreman/run_surveyor.sh survey_all --accession "Homo sapiens"
However for species that are in other divisions, the division must follow the species name after a comma like so:
./foreman/run_surveyor.sh survey_all --accession "Caenorhabditis elegans, EnsemblMetazoa"
The possible divisions that can be specified are:
* Ensembl (this is the "main" division and is the default)
* EnsemblPlants
* EnsemblFungi
* EnsemblBacteria
* EnsemblProtists
* EnsemblMetazoa
If you are unsure what division a species falls into, unfortunately the only way to tell is to check ensembl.com. (Although googling the species name + "ensembl" may work pretty well.)
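The comma rule above is easy to get wrong by hand, so the accession string can be assembled by a small function. This is a sketch only: transcriptome_accession is a hypothetical helper, and it checks the division against the six names listed above rather than against Ensembl itself.

```shell
# Division names copied from the list above.
ENSEMBL_DIVISIONS="Ensembl EnsemblPlants EnsemblFungi EnsemblBacteria EnsemblProtists EnsemblMetazoa"

# Hypothetical helper: print the --accession value for a species name ($1)
# and an optional division ($2, defaulting to the main "Ensembl" division).
# Returns 1 if the division is not one of the names listed above.
transcriptome_accession() {
    species=$1
    division=${2:-Ensembl}
    for known in $ENSEMBL_DIVISIONS; do
        if [ "$known" = "$division" ]; then
            if [ "$division" = "Ensembl" ]; then
                printf '%s\n' "$species"
            else
                printf '%s, %s\n' "$species" "$division"
            fi
            return 0
        fi
    done
    echo "Unknown Ensembl division: $division" >&2
    return 1
}

# Usage:
#   ./foreman/run_surveyor.sh survey_all \
#       --accession "$(transcriptome_accession "Caenorhabditis elegans" EnsemblMetazoa)"
```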
You can also supply a newline-delimited file to survey_all which will dispatch survey jobs based on accession codes like so:
./foreman/run_surveyor.sh survey_all --file MY_BIG_LIST_OF_CODES.txt
The main foreman job loop can be started with:
This must actually be running for jobs to move forward through the pipeline.
When surveying SRA, you can supply either run accession codes (e.g., codes beginning in
ERR) or study accession codes (
Run example (single read):
./foreman/run_surveyor.sh survey_all --accession DRR002116
Run example (paired read):
./foreman/run_surveyor.sh survey_all --accession SRR6718414
./foreman/run_surveyor.sh survey_all --accession ERP006872
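When preparing a long list of codes it can be useful to sanity-check what kind of accession each one is. The sketch below is not how the Surveyor itself routes accessions; classify_accession is a hypothetical helper that relies only on the prefixes seen in the examples in this README plus the usual SRA convention that run accessions start with SRR, ERR, or DRR and study accessions with SRP, ERP, or DRP.

```shell
# Hypothetical helper: guess the accession type from its prefix.
# Species names used for transcriptome indices fall through to "unknown".
classify_accession() {
    case $1 in
        [SED]RR[0-9]*) echo "sra-run" ;;
        [SED]RP[0-9]*) echo "sra-study" ;;
        GSE[0-9]*)     echo "geo" ;;
        E-*)           echo "array-express" ;;
        *)             echo "unknown" ;;
    esac
}

# Usage: label every code in a newline-delimited file.
#   while read -r code; do
#       printf '%s\t%s\n' "$code" "$(classify_accession "$code")"
#   done < MY_BIG_LIST_OF_CODES.txt
```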
Building transcriptome indices used for quantifying RNA-seq data requires us to retrieve genome information from Ensembl. The Surveyor expects a species' scientific name in the main Ensembl division as the accession:
./foreman/run_surveyor.sh survey_all --accession "Homo Sapiens"
TODO: Update once this supports organisms from multiple Ensembl divisions
Downloader Jobs will be queued automatically when Surveyor Jobs discover new samples. However, if you just want to queue a Downloader Job yourself rather than having the Surveyor do it for you, you can use the
bash ./workers/tester.sh run_downloader_job --job-name= --job-id=
bash ./workers/tester.sh run_downloader_job --job-name=SRA --job-id=12345
./workers/tester.sh run_downloader_job --job-name=ARRAY_EXPRESS --job-id=1
Or for more information run:
bash ./workers/tester.sh -h
Processor Jobs will be queued automatically by successful Downloader Jobs. However, if you just want to run a Processor Job yourself without having a Downloader Job do it for you, the following command will do so:
./workers/tester.sh -i run_processor_job --job-name= --job-id=
bash ./workers/tester.sh -i affymetrix run_processor_job --job-name=AFFY_TO_PCL --job-id=54321
./workers/tester.sh -i no_op run_processor_job --job-name=NO_OP --job-id=1
./workers/tester.sh -i salmon run_processor_job --job-name=SALMON --job-id=1
./workers/tester.sh -i transcriptome run_processor_job --job-name=TRANSCRIPTOME_INDEX_LONG --job-id=1
Or for more information run:
bash ./workers/tester.sh -h
If you want to quantile normalize combined outputs, you'll first need to create a reference target for a given organism. This can be done in a production environment with the following:
nomad job dispatch -meta ORGANISM=DANIO_RERIO CREATE_QN_TARGET
To create QN targets for all organisms, do so with the dispatcher:
nomad job dispatch QN_DISPATCHER
This will at some point move to the foreman and then it will take a list of organisms to create QN targets for.
Creating species-wide compendia for a given species can be done in a production environment by running the following on the Foreman instance:
./run_management_command.sh create_compendia --organisms=DANIO_RERIO --svd-algorithm=ARPACK
or for a list of organisms:
./run_management_command.sh create_compendia --organisms=DANIO_RERIO,HOMO_SAPIENS --svd-algorithm=ARPACK
or for all organisms with sufficient data:
./run_management_command.sh create_compendia --svd-algorithm=ARPACK
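The organism keys in these commands look like scientific names upper-cased with underscores in place of spaces. Assuming that pattern holds in general (it is inferred only from DANIO_RERIO and HOMO_SAPIENS above), the --organisms value can be built from ordinary names. Both helper names here are hypothetical.

```shell
# Hypothetical helper: "Danio rerio" -> DANIO_RERIO.
# Assumes keys are the upper-cased scientific name with spaces replaced by underscores.
organism_key() {
    printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]' | tr ' ' '_'
}

# Hypothetical helper: join one or more names into a single --organisms=... argument.
organisms_arg() {
    result=""
    for name in "$@"; do
        key=$(organism_key "$name")
        result=${result:+$result,}$key
    done
    printf '%s\n' "--organisms=$result"
}

# Usage:
#   ./run_management_command.sh create_compendia \
#       "$(organisms_arg "Danio rerio" "Homo sapiens")" --svd-algorithm=ARPACK
```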
Alternatively a compendium can be created which only includes quant.sf files by using the create_quantpendia command:
./run_management_command.sh create_quantpendia --organisms=DANIO_RERIO
Compendia jobs run on the smasher instance. However they require a very large amount of RAM to be able to complete. Our smasher instance does not generally have enough RAM to be able to run them, so if you need to run a smasher job you should temporarily increase the size of the smasher instance. This can be done by changing the terraform variable smasher_instance_type which can be found in infrastructure/variables.tf. Select an AWS instance type that has enough RAM to run the compendia jobs. At the time of writing, compendia jobs require 180GB of RAM and m5.12xlarge has 192GB of RAM so it is sufficiently large to run the jobs.
Normally we wait until every sample in an experiment has had Salmon run on it before we run Tximport. However Salmon won't work on every sample, so some experiments are doomed to never make it to 100% completion. Tximport can be run on such an experiment with:
nomad job dispatch -meta EXPERIMENT_ACCESSION=SRP009841 TXIMPORT
Note that if the experiment does not have at least 25 samples with at least 80% of them processed, this will do nothing.
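The threshold above can be checked before dispatching. The sketch below encodes one reading of the rule (at least 25 samples in total, and at least 80% of them processed); the real job may count samples differently, and tximport_eligible is a hypothetical helper.

```shell
# Hypothetical helper. Args: total sample count, processed sample count.
# Succeeds when total >= 25 and processed / total >= 80%, using integer
# arithmetic so no floating point is needed.
tximport_eligible() {
    total=$1
    processed=$2
    [ "$total" -ge 25 ] && [ $(( processed * 100 )) -ge $(( total * 80 )) ]
}

# Usage:
#   tximport_eligible 40 33 && \
#       nomad job dispatch -meta EXPERIMENT_ACCESSION=SRP009841 TXIMPORT
```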
Note: The following instructions assume you have set the environment variable NOMAD_ADDR to include the IP address of your development machine. This can be done with:
source ./scripts/common.sh && export NOMAD_ADDR=http://$(get_ip_address):4646
To check on the status of a job, run:
It should output something like:
ID                                       Type                 Priority  Status   Submit Date
DOWNLOADER                               batch/parameterized  50        running  01/31/18 18:34:05 EST
DOWNLOADER/dispatch-1517441663-4b02e7a3  batch                50        dead     01/31/18 18:34:23 EST
PROCESSOR                                batch/parameterized  50        running  01/31/18 18:34:05 EST
The rows whose ID is DOWNLOADER or PROCESSOR are the parameterized jobs which are waiting to dispatch Refine.bio jobs. If you don't understand what that means, don't worry about it. All you really need to do is select one of the jobs whose ID contains dispatch and whose Submit Date matches the time when the job you want to check on was run, copy that full ID (in this case DOWNLOADER/dispatch-1517441663-4b02e7a3), and paste it after the previous command, like so:
nomad status DOWNLOADER/dispatch-1517441663-4b02e7a3
This will output a lot of information about that
Nomad Dispatch Job, of which we're mostly interested in the section titled Allocations. Here is an example:
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created At
b30e4edd  fda75a5a  jobs        0        run      complete  01/31/18 18:34:23 EST
If you paste that after the original nomad status command, like so:
nomad status b30e4edd
you'll see a lot of information about the allocation, which probably isn't what you're interested in. Instead, you should run:
nomad logs -verbose b30e4edd
This command will output both the stderr and stdout logs from the container which ran that allocation. The allocation is really a Refine.bio job.
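The copy-and-paste steps above can be partly scripted. The two filters below are a sketch that assumes nomad status prints tables shaped like the examples in this section; the layout may differ between Nomad versions, and both function names are hypothetical.

```shell
# Hypothetical helper: read `nomad status` output on stdin and print the IDs
# of dispatched jobs (rows whose ID contains "/dispatch-").
dispatched_job_ids() {
    awk 'NR > 1 && $1 ~ /\/dispatch-/ { print $1 }'
}

# Hypothetical helper: read `nomad status <job id>` output on stdin and print
# the allocation IDs found in the Allocations section.
allocation_ids() {
    awk '
        /^Allocations/       { in_allocs = 1; header = 1; next }
        in_allocs && header  { header = 0; next }    # skip the column header row
        in_allocs && NF == 0 { in_allocs = 0 }       # a blank line ends the section
        in_allocs            { print $1 }
    '
}

# Usage (needs NOMAD_ADDR set as described above):
#   nomad status | dispatched_job_ids
#   nomad status DOWNLOADER/dispatch-1517441663-4b02e7a3 | allocation_ids |
#       while read -r alloc; do nomad logs -verbose "$alloc"; done
```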
It can be useful to have an interactive Python interpreter running within the context of the Docker container. The scripts/run_shell.sh script has been provided for this purpose. It is in the top level directory so that if you wish to reference it in any integrations its location will be constant. However, it is configured by default for the Foreman project. The interpreter will have all the environment variables, dependencies, and Django configurations for the Foreman project. There are instructions within the script describing how to change this to another project.
Refine.bio requires an active, credentialed AWS account with appropriate permissions to create network infrastructure, users, compute instances and databases.
Deploys are automated to run via CircleCI whenever a signed tag starting with a v is pushed to either the dev or master branches (v as in version, e.g. v1.0.0). Tags intended to trigger a staging deploy MUST end with -dev, e.g. v1.0.0-dev. CircleCI runs a deploy on a dedicated AWS instance so that the Docker cache can be preserved between runs.
Instructions for setting up that instance can be found in the infrastructure/deployboxinstance_data.sh script.
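The tag rules described above (a leading v, with a -dev suffix for staging tags such as v1.0.0-dev) can be expressed as a short check to run before pushing a tag. This is a sketch of the naming rule only and does not reproduce what CircleCI actually evaluates; deploy_target is a hypothetical helper.

```shell
# Hypothetical helper: report which deploy a tag name would be expected to trigger.
deploy_target() {
    case $1 in
        v*-dev) echo "staging" ;;
        v*)     echo "production" ;;
        *)      echo "none" ;;
    esac
}

# Usage:
#   deploy_target v1.1.4-dev
```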
To trigger a new deploy, first see what tags already exist with git tag --list | sort --version-sort. We have two different version counters, one for dev and one for master, so you will see a list including things like:
* v1.1.2
* v1.1.2-dev
* v1.1.3
* v1.1.3-dev
However you may see that the dev counter is way ahead, because we often need more than one staging deploy to be ready for a production deploy. This is okay, just find the latest version of the type you want to deploy and increment that to get your version. For example, if you wanted to deploy to staging and the above versions were the largest that git tag --list output, you would increment
Once you know which version you want to deploy, say
v1.1.4-dev, you can trigger the deploy with these commands:
git checkout dev
git pull origin dev
git tag -s v1.1.4-dev
git push origin v1.1.4-dev
git tag -s v1.1.4-devwill prompt you to write a tag message; please try to make it descriptive.
We use semantic versioning for this project so the last number should correspond to bug fixes and patches, the middle number should correspond to minor changes that don't break backwards compatibility, and the first number should correspond to major changes that break backwards compatibility. Please try to keep the dev and master versions in sync for major and minor versions so only the patch version gets out of sync between the two.
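Finding the latest tag of one kind and bumping its patch number can be done mechanically. The sketch below assumes tags of the form vMAJOR.MINOR.PATCH with an optional -dev suffix and a sort that supports -V (version sort, as in GNU coreutils); next_tag is a hypothetical helper. It only increments the patch number, so major and minor bumps are still a manual decision.

```shell
# Hypothetical helper: read existing tags on stdin (one per line) and print
# the next patch-level tag. $1 is "dev" for staging or "master" for production.
next_tag() {
    if [ "$1" = "dev" ]; then
        latest=$(grep -e '-dev$' | sort -V | tail -n 1)
    else
        latest=$(grep -v -e '-dev$' | sort -V | tail -n 1)
    fi
    [ -n "$latest" ] || return 1

    base=${latest%-dev}      # v1.1.3-dev -> v1.1.3
    patch=${base##*.}        # v1.1.3     -> 3
    prefix=${base%.*}        # v1.1.3     -> v1.1
    next="$prefix.$(( patch + 1 ))"
    if [ "$1" = "dev" ]; then
        next="$next-dev"
    fi
    printf '%s\n' "$next"
}

# Usage:
#   git tag --list | next_tag dev
```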
Refine.bio uses a number of different Docker images to run different pieces of the system. By default, refine.bio will pull images from the Dockerhub repo ccdlstaging. If you would like to use images you have built and pushed to Dockerhub yourself you can pass the -d option to the
To make building and pushing your own images easier, the scripts/update_my_docker_images.sh script has been provided. The -d option will allow you to specify which repo you'd like to push to. If the Dockerhub repo requires you to be logged in, you should do so before running the script using docker login. The -v option allows you to specify the version, which will both end up on the Docker images you're building as the SYSTEM_VERSION environment variable and also will be the docker tag for the image.
scripts/update_my_docker_images.sh will not build the dr_affymetrix image, because this image requires a lot of resources and time to build. It can instead be built with `./scripts/prepare_image.sh -i affymetrix -d `. WARNING: The affymetrix image installs a lot of data-as-R-packages and needs a lot of disk space to build the image. It's not recommended to build the image with less than 60GB of free space on the disk that Docker runs on.
Secrets are stored using git-crypt. Team members can access secret files in the repo by running
An existing team member can add a new team member who provides a GPG key user.armor with the following:
1. gpg --import user.armor
2. git-crypt add-gpg-user --trusted KEYID
Running git-crypt lock && git-crypt unlock will reset the permissions on secret files. For the ssh key, this will require running chmod 600 infrastructure/data-refinery-key.pem before sshing onto AWS instances.
If you are adding a member to git-crypt, you should also consider adding their GPG key to the keys/ directory. This directory is used to validate signed tags so that we know only trusted members are pushing deploys.
Once you have Terraform installed, git-crypt unlocked, and your AWS account credentials installed, you're ready to deploy. The correct way to deploy to the cloud is by running the deploy.sh script. This script will perform additional configuration steps, such as setting environment variables, setting up Nomad job specifications, and performing database migrations. It can be used from the infrastructure directory like so:
./deploy.sh -u myusername -e dev -r us-east-1 -v v1.0.0 -d my-dockerhub-repo
This will spin up the whole system. It will usually take about 15 minutes, most of which is spent waiting for the Postgres instance to start. The command above would spin up a development stack in the us-east-1 region where all the resources' names would end with -myusername-dev. All of the images used in that stack would come from my-dockerhub-repo and would be tagged with
-e specifies the environment you would like to spin up. You may specify,
dev is meant for individuals to test infrastructure changes or to run large tests.
staging is to test the overall system before re-deploying to
To see what's been created at any time, you can:
terraform state list
If you want to change a single entity in the state, you can use
And then rerun deploy.sh with the same parameters you originally ran it with.
Jobs can be submitted via Nomad, either from a server/client or a local machine if you supply a server address and have an open network ingress.
To start a job with a file located on the foreman docker image:
nomad job dispatch -meta FILE=NEUROBLASTOMA.txt SURVEYOR_DISPATCHER
or to start a job with a file located in S3:
nomad job dispatch -meta FILE=s3://data-refinery-test-assets/NEUROBLASTOMA.txt SURVEYOR_DISPATCHER
All of the different Refine.bio subservices log to the same AWS CloudWatch Log Group. If you want to consume these logs, you can use the awslogs tool, which can be installed with:
pip install awslogs
or, for OSX El Capitan:
pip install awslogs --ignore-installed six
Once awslogs is installed, you can find your log group with:
Then, to see all of the logs in that group for the past day, watching as they come in:
awslogs get ALL --start='1 days' --watch
You can also apply a filter on these logs like so:
awslogs get ALL --start='1 days' --watch --filter-pattern="DEBUG"
Or, look at a named log stream (with or without a wildcard.) For instance: (Unfortunately this feature seems to be broken at the moment: https://github.com/jorgebastida/awslogs/issues/158)
awslogs get data-refinery-log-group-myusername-dev log-stream-api-nginx-access-* --watch
will show all of the API access logs made by Nginx.
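The log group in the example above appears to follow the same -myusername-dev suffix used for the other resources in a stack. Assuming that pattern (it is inferred from this single example, not confirmed by the Terraform configuration), the group name can be derived from the user and environment passed to deploy.sh. log_group_name is a hypothetical helper.

```shell
# Hypothetical helper: build the CloudWatch log group name for a stack.
# Args: the user (-u) and environment (-e) that were passed to deploy.sh.
# Assumes the data-refinery-log-group-<user>-<env> pattern seen in the example above.
log_group_name() {
    printf 'data-refinery-log-group-%s-%s\n' "$1" "$2"
}

# Usage:
#   awslogs get "$(log_group_name myusername dev)" ALL --start='1 days' --watch
```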
Snapshots are created automatically by RDS. Manual database dumps can be created by privileged users with these instructions. The Postgres version on the host (the PGBouncer instance is suggested) must match the RDS instance version:
sudo add-apt-repository "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -sc)-pgdg main"
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
sudo apt-get update
sudo apt-get install postgresql-9.6
Archival dumps can also be provided upon request.
Dumps can be restored locally by copying the backup.sql file to the volumes_postgres directory, then executing:
docker exec -it drdb /bin/bash
psql --user postgres -d data_refinery -f /var/lib/postgresql/data/backup.sql
This can take a long time (>30 minutes)!
A stack that has been spun up via deploy.sh -u myusername -e dev can be taken down with destroy_terraform.sh -u myusername -e dev -r us-east-1. The same username and environment must be passed into destroy_terraform.sh as were used to run deploy.sh, either via the -e and -u options or by specifying TF_VAR_user, so that the script knows which to take down. Note that this will prompt you for confirmation before actually destroying all of your cloud resources.
Refine.bio is supported by Alex's Lemonade Stand Foundation, with some initial development supported by the Gordon and Betty Moore Foundation via GBMF 4552 to Casey Greene.
The table of contents for this README is generated using doctoc, which can be installed with:
sudo npm install -g doctoc
Once doctoc is installed the table of contents can be re-generated with:
BSD 3-Clause License.