Cookiecutter template for data scientists working with Docker containers
.. |travis| image:: https://travis-ci.org/docker-science/cookiecutter-docker-science.svg?branch=master
   :target: https://travis-ci.org/docker-science/cookiecutter-docker-science
.. contents:: This article consists of the following sections.
   :depth: 1
Cookiecutter Docker Science provides the following features.

- Edit code with your favorite editor (Atom, Vim, Emacs, etc.)
- ``make`` targets useful for data analysis (Jupyter Notebook, test, lint, Docker, etc.)

NOTE: Please visit the project home page before you get started.
Many researchers and engineers run machine learning or data mining experiments. For such data engineering tasks, they rely on various tools and system libraries that are constantly updated, and installing and updating them causes problems in local environments. Even when we work in hosted environments such as EC2, we are not free from this problem. An experiment may succeed on one instance but fail on another, since library versions can differ between EC2 instances.
By contrast, with one command we can create an identical Docker container in which the needed tools are already installed with the correct versions, without changing system libraries on the host machine. This aspect of Docker is important for the reproducibility of experiments and for keeping projects running in continuous integration systems.
Unfortunately, running experiments in a Docker container is troublesome. Adding a new library to ``Dockerfile`` does not install it the way it would on a local machine. We need to create the Docker image and container each time. We also need to forward ports to see server responses, such as the Jupyter Notebook UI launched in the Docker container, on our local PC. Cookiecutter Docker Science provides utilities to make working in a Docker container simple.
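To illustrate the manual steps described above, the following is a minimal sketch of the kind of commands one would otherwise type by hand. The image name, the port numbers, and the ``/work`` mount shown here are assumptions for illustration; this is not the Makefile that the template generates. The commands are only assembled and printed, so they can be reviewed before being run.

```shell
# Hypothetical values for illustration; not taken from the template.
IMAGE_NAME="my-experiment"
HOST_PORT=8888
CONTAINER_PORT=8888

# Rebuild the image after every change to the Dockerfile.
BUILD_CMD="docker build -t ${IMAGE_NAME} ."

# Start a container, forward the Jupyter port to the host, and mount
# the current directory so that host edits are visible in the container.
RUN_CMD="docker run -it --rm -p ${HOST_PORT}:${CONTAINER_PORT} -v $(pwd):/work ${IMAGE_NAME}"

# Print the commands instead of executing them.
echo "${BUILD_CMD}"
echo "${RUN_CMD}"
```

Repeating these two steps whenever a library changes is the overhead that the make targets are meant to hide.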
This project is a tiny template for machine learning projects developed in Docker environments. In machine learning tasks, projects grow uniquely to fit their target tasks, but in the initial state, most of the directory structure and the targets in ``Makefile`` are common. Cookiecutter Docker Science generates initial directories that fit simple machine learning tasks.

- Cookiecutter 1.6 or later
- Docker version 17 or later

To generate a project from the cookiecutter-docker-science template, run the following command.

::

    $ cookiecutter git@github.com:docker-science/cookiecutter-docker-science.git

Then the cookiecutter command asks several questions about the generated project, as follows.

::

    $ cookiecutter git@github.com:docker-science/cookiecutter-docker-science.git
    project_name [project_name]: food-image-classification
    project_slug [food_image_classification]:
    jupyter_host_port :
    description [Please Input a short description]: Classify food images into several categories
    Select data_source_type:
    1 - s3
    2 - nfs
    3 - url
    data_source [Please Input data source]: s3://research-data/food-images

Then you get the generated project directory.
The following is the initial directory structure generated in the previous section.
Cookiecutter Docker Science provides many Makefile targets to support experiments in a Docker container. Users can run a target with the ``make [TARGET]`` command.
After cookiecutter-docker-science generates the directories and files, users first run this command. ``init`` sets up the resources for experiments. Specifically, ``init`` runs the ``init-docker`` and ``sync-from-source`` commands.
The ``init-docker`` command first creates the Docker image based on
``sync-from-source`` downloads the input files that we specified during project generation. If you want to change the input files, modify this target to download from the new data source.
The ``create-container`` command creates a Docker container based on the created image and logs in to the container.
Users can start and log in to the Docker container with
``start-container`` created by the
The ``jupyter`` target launches the Jupyter Notebook server.
The ``profile`` target shows miscellaneous information about the project, such as the port number and the container name.
The ``clean`` target removes artifacts such as models and ``*.pyc`` files.
The ``clean-model`` command removes model files in
The ``clean-pyc`` command removes ``*.pyc`` and ``*.pyo`` files and ``__pycache__`` directories.
The ``clean-docker`` command removes the Docker image and container generated with
``make create-container``. When we update Python libraries in
``requirements.txt`` or system tools in
``Dockerfile``, we need to remove the Docker image and container with this target and create the updated image and container with
The ``distclean`` target removes all reproducible objects. Specifically, this target runs the ``clean`` target and removes all files in the data directory.
The ``clean-data`` command removes all datasets in
The ``lint`` target checks whether the code meets the coding standard.
The ``test`` target executes tests.
The ``sync-to-remote`` target uploads the local files stored in ``data`` to the specified data source, such as S3 or an NFS directory.
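As a rough illustration of what uploading to different data source types could look like, the sketch below picks an upload command from the form of the destination. The function name ``sync_command`` and the choice of ``aws s3 sync`` and ``rsync`` are assumptions made for this example; the generated Makefile may do this differently. The function only prints the command it would use.

```shell
# Hypothetical helper: print an upload command for a local directory,
# chosen by the form of the remote destination.
sync_command() {
    local_dir="$1"
    remote="$2"
    case "${remote}" in
        s3://*)
            # Assumes the AWS CLI is installed and configured.
            echo "aws s3 sync ${local_dir} ${remote}"
            ;;
        *)
            # Treat anything else as a mounted (for example NFS) directory.
            echo "rsync -a ${local_dir}/ ${remote}/"
            ;;
    esac
}

sync_command data s3://research-data/food-images
```

Printing the command first makes it easy to check the destination before any files are transferred.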
With Cookiecutter Docker Science, data scientists and software engineers do their development in the host environment. They open Jupyter Notebook in a browser on the host machine, connecting to the Jupyter server launched in the Docker container. They also write ML scripts and library classes on the host machine. Code modifications in the host environment are reflected in the container environment. In the container, they just launch the Jupyter server or start ML scripts with the make command.
Files and directories ~~~~~~~~~~~~~~~~~~~~~
When you log in to a Docker container with the ``make start-container`` command, the login directory is ``/work``. This directory contains the top-level project directories of the host computer, such as ``model``. The Docker container mounts the project directory at ``/work`` in the container, so you can edit the files in the host environment with your favorite editor such as Vim, Emacs, Atom or PyCharm. Changes in the host environment are reflected in the container environment.

Jupyter Notebook
~~~~~~~~~~~~~~~~~

We can run a Jupyter Notebook in the Docker container. Jupyter Notebook uses the default port ``8888`` in the Docker container (NOT the host machine), and that port is forwarded to the one you specify with ``JUPYTER_HOST_PORT`` in the cookiecutter command. You can see the Jupyter Notebook UI by accessing ``http://localhost:JUPYTER_HOST_PORT``. When you save notebooks the files are saved in the
Generate Docker Image for production ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The ``make init-docker`` command creates a Docker image based on ``docker/Dockerfile.dev``, which contains libraries for development. These libraries are not needed in production.
To create a Docker image for production that does not contain development libraries such as Jupyter, we run the ``make init-docker`` command with the ``MODE`` variable set: ``make init-docker MODE=release``.
Override port number for Jupyter Notebook ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When a project is generated with cookiecutter, the default host port for Jupyter Notebook is ``8888``. This number is common and could collide with other server processes.
If we already have a container, we first need to remove it with ``make clean-container``. Then we create the Docker container with a different port number by running the ``make create-container`` command with the Jupyter port parameter (``JUPYTER_HOST_PORT``). For example, the following command creates a Docker container that forwards the default Jupyter port to port 9900 on the host.

::

    make create-container JUPYTER_HOST_PORT=9900

Then, when you launch Jupyter Notebook in the Docker container, you can see it at http://localhost:9900.
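The two steps above can be wrapped in a small shell function. This is only a sketch: the function name ``recreate_container_with_port`` and the port validation are assumptions added for illustration, while the two make targets are the ones described in this section.

```shell
# Hypothetical helper: recreate the container with a new Jupyter host port.
recreate_container_with_port() {
    port="$1"
    # Reject empty or non-numeric input before touching any container.
    case "${port}" in
        ''|*[!0-9]*)
            echo "port must be a number: ${port}" >&2
            return 1
            ;;
    esac
    # Reject values outside the valid TCP port range.
    if [ "${port}" -lt 1 ] || [ "${port}" -gt 65535 ]; then
        echo "port out of range: ${port}" >&2
        return 1
    fi
    # Remove the current container, then create one with the new port.
    make clean-container && make create-container JUPYTER_HOST_PORT="${port}"
}
```

Running ``recreate_container_with_port 9900`` in the project directory would then have the same effect as the two make commands typed by hand.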
Specify suitable Dockerfile in stages ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some projects can have multiple Dockerfiles. ``Dockerfile.gpu`` contains the settings for GPU machines. ``Dockerfile.cpu`` contains settings that can be used in production for non-GPU machines.
To use one of these specific Dockerfiles, override the setting by adding a parameter to the make command. For example, to create a container from ``docker/Dockerfile.cpu``, we run ``make create-container DOCKERFILE=docker/Dockerfile.cpu``.
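When the same project is used on both kinds of machines, the choice could be scripted. The sketch below is an assumption-laden example: it treats the presence of the ``nvidia-smi`` command as a sign of a GPU machine, and it assumes that ``Dockerfile.gpu`` lives in the ``docker`` directory next to ``Dockerfile.cpu``. The function name ``select_dockerfile`` is hypothetical. The make command is printed, not executed.

```shell
# Hypothetical helper: pick a Dockerfile based on whether nvidia-smi exists.
select_dockerfile() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Assumed location of the GPU Dockerfile.
        echo "docker/Dockerfile.gpu"
    else
        echo "docker/Dockerfile.cpu"
    fi
}

DOCKERFILE="$(select_dockerfile)"

# Print the command so it can be checked before running it.
echo "make create-container DOCKERFILE=${DOCKERFILE}"
```

Checking for ``nvidia-smi`` is a simple heuristic; a project with other GPU vendors or stricter requirements would need a different test.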
Show target specific help ~~~~~~~~~~~~~~~~~~~~~~~~~
The ``help`` target shows the details of the specified target. For example, to get the details of the ``clean`` target:

::

    $ make help TARGET=clean
    target: clean
    dependencies: clean-model clean-pyc clean-docker
    description: remove all artifacts

As we can see, the dependencies and description of the specified target (``clean``) are shown.
Apache version 2.0