A starter template for Equinor data science / data engineering projects
This is a starter template for data science projects in Equinor, although it may also be useful for others. It contains many of the essential artifacts that you will need and presents a number of best practices including code setup, samples, MLOps using Azure, a standard document to guide and gather information relating to the data science process and more.
As it is impossible to create a single template that will meet every project's needs, this example should be considered a starting point and changed as your project works and evolves.
Before working with the contents of this template, or with data science projects in general, it is recommended to familiarise yourself with the Equinor Data Science Technical Standards (currently Equinor internal only).
This template is provided as a Cookiecutter template so you can quickly create an instance customised for your project. It assumes that you have a working Python installation.
To get running, first install the latest Cookiecutter if you haven't installed it yet (this requires Cookiecutter 1.4.0 or higher):
pip install -U cookiecutter
Then generate a new project for your own use based upon the template, answering the questions to customise the generated project:
The values you are prompted for are:
| Value | Description |
| :--- | --- |
| projectname | A name for your project. Used mostly within documentation |
| projectdescription | A description to include in the README.md |
| reponame | The name of the GitHub repository where the project will be held |
| condaname | The name of the conda environment to use |
| packagename | A name for the generated Python package. |
| mlopsname | Default name for Azure ML. |
| mlopscomputename | Default Azure ML compute cluster name to use. |
| author | The main author of the solution. Included in the setup.py file |
| opensourcelicense | What type of open source license the project will be released under |
| devops_organisation | An Azure DevOps organisation. Leave blank if you aren't using Azure DevOps |
If you are uncertain about what to enter for any value then just accept the defaults. You can always change the generated project later.
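For scripted or repeatable project creation, the prompts can also be answered from Python instead of interactively. The following is a minimal sketch, assuming Cookiecutter is installed and that the prompt names match the table above (the template's cookiecutter.json is authoritative and may spell them differently). The helper names `build_context` and `generate_project` are hypothetical and not part of the template.

```python
# Prompt names copied from the table above. This is an assumption: the
# template's cookiecutter.json is the authoritative list of names.
KNOWN_VALUES = {
    "projectname", "projectdescription", "reponame", "condaname",
    "packagename", "mlopsname", "mlopscomputename", "author",
    "opensourcelicense", "devops_organisation",
}


def build_context(**overrides):
    """Return the overrides as a dict, rejecting names not listed in the table."""
    unknown = set(overrides) - KNOWN_VALUES
    if unknown:
        raise ValueError(f"Unknown template values: {sorted(unknown)}")
    return dict(overrides)


def generate_project(template, output_dir=".", **overrides):
    """Generate a project without prompting; values not overridden keep their defaults."""
    # Imported lazily so build_context can be used without Cookiecutter installed.
    from cookiecutter.main import cookiecutter

    return cookiecutter(
        str(template),
        no_input=True,  # accept the defaults instead of prompting
        extra_context=build_context(**overrides),
        output_dir=str(output_dir),
    )
```

Calling `generate_project("path/to/data-science-template", reponame="myproject")`, where the path is a placeholder for wherever the template lives, would accept the defaults for everything except `reponame`.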
Having problems? You can always download this repository using the download button above and reference the local copy, e.g. cookiecutter c:\Downloads\data-science-template. Ideally, however, fix any git proxy or other issues that are causing the problem.
You are now ready to get started. First, however, create a new GitHub repository for your new project and add your project to it using the following commands (substitute myproject with the name of your project and REMOTE-REPOSITORY-URL with the remote repository URL):
cd myproject
git init
git add .
git commit -m "Initial commit"
git remote add origin REMOTE-REPOSITORY-URL
git remote -v
git push origin master
Continuous Integration (CI) increases quality by building, running tests and performing other validation whenever code is committed. The template contains a build pipeline for Azure DevOps; however, it requires a couple of manual steps to set up:
You are now set up for CI and automated testing and building. You should verify that the badge link in this README corresponds with your DevOps project and, as a further step, you might set up release pipelines for automated deployment.
At this stage the build pipeline doesn't include MLOps steps, although these can be added based upon your needs.
Depending upon the selected options when creating the project, the generated structure will look similar to the below:
├── .gitignore
├── requirements.txt    <- Might not be needed if using conda.
├── setup.py
Contributing to This Template
Contributions to this template are greatly appreciated and encouraged.
To contribute an update simply:
* Submit an issue describing your proposed change to the repo in question.
* The repo owner will respond to your issue promptly.
* Fork the desired repo, develop and test your code changes.
* Check that your code follows the PEP8 guidelines (line lengths up to 120 are ok) and other general conventions within this document.
* Ensure that your code adheres to the existing style. Refer to the Google Cloud Platform Samples Style Guide for the recommended coding standards for this organization.
* Ensure that as far as possible there are unit tests covering the functionality of any new code.
* Check that all existing unit tests still pass.
* Edit this document and the template README.md if needed to describe new files or other important information.
* Submit a pull request.
Template development environment
To develop this template further you might want to set up a virtual environment.
Setup using
cd data-science-template
python -m venv dst-env
Mac / Linux
source dst-env/bin/activate
Install Dependencies
pip install -r requirements.txt
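The environment can also be created from Python with the standard library `venv` module, which is the module that `python -m venv` runs. This is a minimal sketch under that assumption; the helper name `create_env` is hypothetical, and it only creates the environment (activation and dependency installation still happen as described above).

```python
import os
import venv
from pathlib import Path


def create_env(env_dir="dst-env", with_pip=True):
    """Create a virtual environment and return the expected path of its interpreter."""
    env_path = Path(env_dir)
    # venv.create defaults to with_pip=False, whereas `python -m venv` installs pip,
    # so pip is requested explicitly here by default.
    venv.create(env_path, with_pip=with_pip)
    # Windows places executables in Scripts/, other platforms in bin/.
    if os.name == "nt":
        return env_path / "Scripts" / "python.exe"
    return env_path / "bin" / "python"
```

The returned interpreter path can be used to run `pip install -r requirements.txt` inside the environment without activating it first, for example via `subprocess.run([str(python), "-m", "pip", "install", "-r", "requirements.txt"])`.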
To run the template tests, install pytest using pip or conda and then from the repository root run:
pytest tests
To verify that your code adheres to Python standards, run linting as shown below:
flake8 --max-line-length=120 *.py hooks/ tests/
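Before submitting a pull request it can be convenient to run the test and lint commands above in one step. The following is a minimal sketch, assuming pytest and flake8 are installed in the active environment; the helper names `check_commands` and `run_checks` are hypothetical and not part of the template. Because no shell is involved, the `*.py` pattern is expanded in Python.

```python
import subprocess
import sys
from pathlib import Path


def check_commands(root="."):
    """Build the pytest and flake8 command lines for the repository at root."""
    # A shell would expand *.py; without one we list the top-level files ourselves.
    top_level = sorted(p.name for p in Path(root).glob("*.py"))
    return [
        [sys.executable, "-m", "pytest", "tests"],
        [sys.executable, "-m", "flake8", "--max-line-length=120", *top_level, "hooks/", "tests/"],
    ]


def run_checks(root="."):
    """Run every check from the repository root; return True only if all succeed."""
    codes = [subprocess.run(cmd, cwd=root).returncode for cmd in check_commands(root)]
    return all(code == 0 for code in codes)
```

Both commands always run, so a failing test does not hide lint errors. Calling `run_checks()` from the repository root returns `False` if either tool reports a problem.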