Airflow configuration for Telemetry
Apache Airflow is a platform to programmatically author, schedule and monitor workflows.
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow".
Some links relevant to users and developers of WTMO:
- the `dags` directory in this repository contains some custom DAG definitions
Add new Python dependencies into `requirements.in`. Run the following commands with the same Python version specified by the Dockerfile.

```bash
# As of time of writing, python 3.7
pip install pip-tools
pip-compile
```
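For example, adding and pinning a new dependency could look like the sketch below (the package name is purely illustrative, not part of this repository):

```bash
# Append the new dependency to requirements.in (illustrative package name)
echo "requests" >> requirements.in
# Regenerate the pinned requirements.txt from requirements.in
pip-compile
# Rebuild the container image so it picks up the updated requirements
make build
```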
An Airflow container can be built with `make build`.
Airflow database migration is no longer a separate step for dev but is run by the web container if necessary on first run. That means, however, that you should run the web container (and the database container, of course) and wait for the database migrations to complete before running individual test commands per below. The easiest way to do this is to run `make up` and let it run until the migrations complete.
A single task, e.g. `spark`, of an Airflow DAG, e.g. `example`, can be run with an execution date, e.g. `2018-01-01`, in the development environment with:

```bash
make run COMMAND="test example spark 20180101"
```
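Other Airflow CLI subcommands can be passed the same way. As a sketch, assuming the `make run` target simply forwards `COMMAND` to the Airflow 1.x CLI inside the container, listing the tasks of the `example` DAG might look like:

```bash
# List the tasks of the example DAG via the same wrapper (illustrative usage)
make run COMMAND="list_tasks example"
```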
The logs of the scheduler container can be followed with:

```bash
docker logs -f telemetryairflow_scheduler_1
```
Tasks often require credentials to access external services. For example, one may choose to store API keys in an Airflow connection or variable. These variables are sure to exist in production but are often not mirrored locally for logistical reasons. Providing a dummy variable is the preferred way to keep the local development environment up to date.
When adding a new variable to `bin/run`, please update `init_variables` with appropriate strings to prevent broken workflows. To test this, run `bin/test-parse` to check for errors. You may manually test this by restarting the orchestrated containers and checking for error messages within the main administration UI.
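As a rough sketch, a placeholder value can also be set by hand in a running local instance. This assumes the Airflow 1.x CLI used elsewhere in this README, and `my_api_key` is a hypothetical variable name:

```bash
# Set a dummy value for a variable that only exists in production
# (my_api_key is hypothetical; init_variables in bin/run remains the preferred place for this)
docker-compose exec web airflow variables -s my_api_key dummy
```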
Assuming you're using macOS and Docker for macOS, start the Docker service, click the Docker icon in the menu bar, open Preferences, and change the available memory to 4GB.
To deploy the Airflow container on the Docker engine, with its required dependencies, run `make up`.
You can now connect to your local Airflow web console in your browser.
All DAGs are paused by default for local instances and our staging instance of Airflow. In order to submit a DAG via the UI, you'll need to toggle the DAG from "Off" to "On". You'll likely want to toggle the DAG back to "Off" as soon as your desired task starts running.
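If you prefer the command line, the same toggle can be done from inside the web container. This is a sketch assuming the Airflow 1.x CLI and the `example` DAG name used earlier:

```bash
# Unpause (turn "On") the example DAG, then pause it again when finished
docker-compose exec web airflow unpause example
docker-compose exec web airflow pause example
```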
Users on Linux distributions will encounter permission issues with docker-compose. This is because the local application folder is mounted as a volume into the running container. The Airflow user and group in the container is set to `10001`.

To work around this, replace all instances of `10001` in `Dockerfile.dev` with the host user id:

```bash
sed -i "s/10001/$(id -u)/g" Dockerfile.dev
```
See https://go.corp.mozilla.com/wtmodev for more details.
```bash
make build && make up
make gke

# When done:
make clean-gke
```
From there, connect to Airflow and enable your job.
Dataproc jobs run on a self-contained Dataproc cluster, created by Airflow.
To test these jobs, you'll need a sandbox account and corresponding service account. For information on creating that, see "Testing GKE Jobs". Your service account will need Dataproc and GCS permissions (and BigQuery, if you're connecting to it). Note: Dataproc requires "Dataproc/Dataproc Worker" as well as Compute Admin permissions. You'll need to ensure that the Dataproc API is enabled in your sandbox project.
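As an illustrative sketch only (the project and service account names below are placeholders, not values from this repository), the API can be enabled and the roles granted with `gcloud`:

```bash
# Placeholder names -- substitute your own sandbox project and service account
PROJECT=my-sandbox-project
SA=airflow-dev@${PROJECT}.iam.gserviceaccount.com

# Ensure the Dataproc API is enabled in the sandbox project
gcloud services enable dataproc.googleapis.com --project=$PROJECT

# Grant the permissions mentioned above (add a BigQuery role if the job reads or writes BigQuery)
gcloud projects add-iam-policy-binding $PROJECT --member="serviceAccount:$SA" --role="roles/dataproc.worker"
gcloud projects add-iam-policy-binding $PROJECT --member="serviceAccount:$SA" --role="roles/compute.admin"
```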
Ensure that your Dataproc job has a configurable project to write to. Set the project in the DAG entry to be configured based on the development environment; see the `ltv.py` job for an example of that.
From there, run the following:
```bash
make build && make up
./bin/add_gcp_creds $GOOGLE_APPLICATION_CREDENTIALS google_cloud_airflow_dataproc
```
You can then connect to Airflow locally. Enable your DAG and see that it runs correctly.
Note: the canonical reference for production environment variables lives in a private repository.
When deploying to production make sure to set up the following environment variables:
- `AWS_ACCESS_KEY_ID` -- The AWS access key ID to spin up the Spark clusters
- `AWS_SECRET_ACCESS_KEY` -- The AWS secret access key
- `AIRFLOW_DATABASE_URL` -- The connection URI for the Airflow database, e.g.
- `AIRFLOW_BROKER_URL` -- The connection URI for the Airflow worker queue, e.g.
- `AIRFLOW_RESULT_URL` -- The connection URI for the Airflow result backend, e.g.
- `AIRFLOW_GOOGLE_CLIENT_ID` -- The Google Auth client id used for authentication.
- `AIRFLOW_GOOGLE_CLIENT_SECRET` -- The Google Auth client secret used for authentication.
- `AIRFLOW_GOOGLE_APPS_DOMAIN` -- The domain(s) to restrict Google Auth login to, e.g.
- `AIRFLOW_SMTP_HOST` -- The SMTP server to use to send emails, e.g.
- `AIRFLOW_SMTP_USER` -- The SMTP user name
- `AIRFLOW_SMTP_PASSWORD` -- The SMTP password
- `AIRFLOW_SMTP_FROM` -- The email address to send emails from, e.g.
- `URL` -- The base URL of the website, e.g.
- `DEPLOY_ENVIRONMENT` -- The environment currently running, e.g.
- `DEPLOY_TAG` -- The tag or branch to retrieve the JAR from, e.g. `tags`. You can specify the tag or travis build exactly as well, e.g. `tags/v2.2.1`. Not specifying the exact tag or build will use the latest from that branch, or the latest tag.
Also, please set:

- `AIRFLOW_SECRET_KEY` -- A secret key for Airflow's Flask-based webserver
- `AIRFLOW__CORE__FERNET_KEY` -- A secret key for saving connection passwords in the DB
Both values should be set by using the cryptography module's fernet tool that we've wrapped in a docker-compose call:
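The exact wrapper call isn't reproduced here; as a minimal sketch of the underlying invocation (assuming a running `web` container, whose image ships the `cryptography` package that Airflow depends on):

```bash
# Generate a random Fernet key; run once per key and use a different value for each
docker-compose exec web python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
```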
Run this for each key config variable, and don't use the same for both!
Some useful docker tricks for development and debugging:
```bash
# Stop all docker containers:
docker stop $(docker ps -aq)

# Remove any leftover docker volumes:
docker volume rm $(docker volume ls -qf dangling=true)

# Purge docker volumes (helps with mysql container failing to start)
# Careful as this will purge all local volumes not used by at least one container.
docker volume prune
```
Failing CircleCI 'test-environment' check:
```bash
# These commands are from the bin/test-parse script (get_errors_in_listing)
# If --detach is unavailable, make sure you are running the latest version of docker-compose
docker-compose up --detach
docker-compose logs --follow --tail 0 | sed -n '/\[testing_stage_0\]/q'
# Don't pipe to grep to see the full output including your errors
docker-compose exec web airflow list_dags
```
the `main_summary` DAG tree view.
Connect to a running container with `docker exec -it <container> bash`. The web server instance is a good choice.
```bash
$ airflow backfill main_summary -s 2018-05-20 -e 2018-05-26
```