The official example scripts for the Numerai Data Science Tournament.
```
pip install -U pip && pip install -r requirements.txt
python example_model.py
```
The example script will produce a `validation_predictions.csv` file, which you can upload at https://numer.ai/tournament to get model diagnostics.
TIP: The `example_model.py` script takes ~45-60 minutes to run. If you don't want to wait, you can upload `example_diagnostic_predictions.csv` to get diagnostics immediately.
If the current round is open (Saturday 18:00 UTC through Monday 14:30 UTC), you can submit your predictions and start getting results on live tournament data. You can create your submission by uploading the `tournament_predictions.csv` file at https://numer.ai/tournament.
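If you'd rather check the round status programmatically before submitting, numerapi can help. A minimal sketch, assuming the `check_new_round` method (which reports whether a new round opened within the last n hours) is available in your installed numerapi version:

```python
import numerapi

napi = numerapi.NumerAPI()  # public endpoints need no API keys

# check_new_round returns True if a new round opened within the last
# `hours` hours; rounds open Saturday 18:00 UTC per the schedule above
if napi.check_new_round(hours=24):
    print("A new round is open - time to submit")
else:
    print("No new round opened in the last 24 hours")
```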
Description: Labeled training data
Dimensions: ~2M rows x ~1K columns
Size: ~10GB CSV (float32 features), ~5GB CSV (int8 features), ~1GB Parquet (float32/int8 features)
Notes: Check out the analysis_and_tips notebook for a detailed walkthrough of this dataset.
Description: Labeled holdout set used to generate validation predictions and for computing validation metrics
Dimensions: ~540K rows x ~1K columns
Size: ~2.5GB CSV (float32 features), ~1.1GB CSV (int8 features), ~210MB Parquet (float32/int8 features)
Notes: It is highly recommended that you do not train on the validation set. This dataset is used to generate all validation metrics in the diagnostics API.
Description: Unlabeled feature data used to generate tournament predictions (updated weekly)
Dimensions: ~1.4M rows x ~1K columns
Size: ~6GB CSV (float32 features), ~2.1GB CSV (int8 features), ~550MB Parquet (float32/int8 features)
Notes: Use this file to generate your tournament submission. This file changes every week, so make sure to download the most recent version each round (see the download sketch after these file descriptions).
Description: Unlabeled feature data used to generate live predictions only (updated weekly)
Dimensions: 5.3K rows x ~1K columns
Size: ~24MB CSV (float32 features), ~11MB CSV (int8 features), ~3MB Parquet (float32/int8 features)
Notes: Use this file to generate only the live portion of your tournament submission, if your test predictions do not change from week to week and you have them saved. This file changes every week, so make sure to download the most recent version each round.
Description: The predictions generated by the example_model on the numerai_validation_data
Dimensions: ~540K rows x 1 column
Size: ~14MB CSV
Notes: Useful for ensuring you can get diagnostics and debugging your prediction file if you receive an error from the diagnostics API. This is what your uploads to diagnostics should look like (same ids and data types).
Description: The predictions generated by the example_model on the numerai_tournament_data
Dimensions: ~1.4M rows x 1 column
Size: ~37MB CSV
Notes: Useful for ensuring you can make a submission and debugging your prediction file if you receive an error from the submissions API. This is what your submissions should look like (same ids and data types).
Description: The legacy validation data mapped onto the new validation period
Dimensions: ~540K rows x ~310 columns
Size: ~69MB Parquet
Notes: Run your legacy models (models trained on the legacy dataset) against this file to generate validation predictions that are comparable to your new models (models trained on the new dataset).
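Since the tournament and live files change every week, it helps to script the download. A minimal sketch using numerapi's dataset helpers; the exact filenames below are assumptions, so check the output of `list_datasets()` for the current names:

```python
import numerapi

napi = numerapi.NumerAPI()

# see which dataset files are available for the current round
print(napi.list_datasets())

# filenames below are illustrative assumptions; pick the ones you need
# from list_datasets() (parquet + int8 keeps download size and memory low)
napi.download_dataset("numerai_training_data_int8.parquet")
napi.download_dataset("numerai_tournament_data_int8.parquet")
```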
The example model is a good baseline model, but we can do much better. Check out example_model_advanced for the best model made by Numerai's internal research team (takes ~2-3 hours to run!) and learn more about the underlying concepts used to construct the advanced example model in the analysis_and_tips notebook.
Check out the forums for in-depth discussions on model research.
Once you have a model you are happy with, you can stake NMR on it to start earning rewards.
To access the API, you must first create your API keys on your account page and provide them to the client:
```python
import numerapi

example_public_id = "somepublicid"
example_secret_key = "somesecretkey"
napi = numerapi.NumerAPI(example_public_id, example_secret_key)
```
After instantiating the NumerAPI client with API keys, you can then upload your submissions programmatically:
```python
# upload predictions
model_id = napi.get_models()['your_model_name']
napi.upload_predictions("tournament_predictions.csv", model_id=model_id)
```
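You can also upload validation predictions for diagnostics programmatically. A short sketch, assuming your numerapi version exposes `upload_diagnostics` (otherwise upload the file on the website as described above):

```python
# upload validation predictions for diagnostics (assumes upload_diagnostics
# is available in your installed numerapi version)
napi.upload_diagnostics("validation_predictions.csv", model_id=model_id)
```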
The recommended setup for a fully automated submission process is to use Numerai Compute. Please see the Numerai CLI documentation for instructions on how to deploy your models to AWS.
The Numerai Dataset contains decades of historical data on the global stock market. Each era represents a time period and each id represents a stock. The features are made from market and fundamental measures of the companies, and the targets are a measure of return.
The stock ids, features, and targets are intentionally obfuscated.
The historical portions of the dataset (training_data, validation_data) are relatively static and are updated about every 3-6 months, usually just with more rows.
The live portion of the dataset (tournament_data) is updated every week and represents the latest state of the global stock market.
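Because rows are grouped into eras, metrics should be computed per era rather than across the whole dataset. A minimal sketch of a per-era rank correlation; the column names `era` and `target` match the dataset description above, while the parquet filename and the placeholder predictions are assumptions (substitute your own model's output):

```python
import numpy as np
import pandas as pd

df = pd.read_parquet("numerai_validation_data.parquet")  # assumed filename
df["prediction"] = np.random.default_rng(0).random(len(df))  # use your model here

def rank_corr(sub: pd.DataFrame) -> float:
    # rank predictions within the era, then take the Pearson correlation
    # with the target (a Spearman-like, per-era score)
    return np.corrcoef(sub["target"], sub["prediction"].rank(pct=True))[0, 1]

per_era = df.groupby("era").apply(rank_corr)
print(f"mean corr: {per_era.mean():.4f}, sharpe: {per_era.mean() / per_era.std():.4f}")
```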
Parquet is an efficient and performant file format that is IO optimized for reading in subsets of columns at a time.
Use the parquet versions (instead of the standard CSV) of the dataset files to minimize time spent on IO (downloading and reading the file into memory).
Use the int8 version (features are stored as int8 instead of the standard float32) of the parquet file to further minimize memory usage (see the sketch below).
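For example, a minimal sketch of reading only a subset of columns from the int8 parquet file; the filename and the non-feature column names are assumptions, so adapt them to the files you actually download:

```python
import pandas as pd

# Parquet is columnar, so pandas reads only the requested columns from disk
cols = ["era", "target"]  # add just the feature columns your model needs
df = pd.read_parquet("numerai_training_data_int8.parquet", columns=cols)

print(df.dtypes)  # int8 feature columns use ~4x less memory than float32
print(df.memory_usage(deep=True).sum() / 1e9, "GB in memory")
```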
In September of 2021, Numerai released a new version of the dataset. Read more about it here.
Models trained on the legacy dataset will continue to work, but it is highly recommended that everyone upgrade to the new dataset because of the major performance improvements.
All example code in this repo has been updated to work with the new dataset only.
You can continue to download the legacy dataset from the website and the API, but it will eventually be deprecated.
You can also use the `dataset` query in the GraphQL API without passing a round number to download the legacy dataset zip.
The easiest way to get started with the new dataset is to check out the new example models and the analysis_and_tips notebook in this repo.
Also check out this deep dive on the new dataset in the forum.
If something in this repo doesn't work, please file an issue.