Kaggle | Instacart Market Basket Analysis🥕🥉

Instacart Market Basket Analysis

My solution for the Instacart Market Basket Analysis competition hosted on Kaggle.

The Task

The dataset is an open-source dataset provided by Instacart (source). Instacart describes it as follows:

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

Below is the full data schema (source)

Orders table (3.4m rows, 206k users):
* order identifier
* customer identifier
* evaluation set this order belongs to (one of the sets described below)
* order sequence number for this user (1 = first, n = nth)
* day of the week the order was placed on
* hour of the day the order was placed on
* days since the last order, capped at 30 (NA for a user's first order)

Products table (50k rows):
* product identifier
* name of the product
* foreign key to the aisles table
* foreign key to the departments table

Aisles table (134 rows):
* aisle identifier
* name of the aisle

Departments table (21 rows):
* department identifier
* name of the department

Order products table (30m+ rows):
* foreign key to the orders table
* foreign key to the products table
* order in which each product was added to cart
* 1 if this product has been ordered by this user in the past, 0 otherwise

Each order belongs to one of the following evaluation sets:
* prior: orders prior to that user's most recent order (~3.2m orders)
* train: training data supplied to participants (~131k orders)
* test: test data reserved for machine learning competitions (~75k orders)

The task is to predict which products a user will reorder in their next order. The evaluation metric is the F1-score between the set of predicted products and the set of true products.
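As a minimal sketch of the metric, the function below computes the F1-score between one predicted basket and one true basket. The function name, the product ids, and the convention that two empty baskets score 1.0 are assumptions made for illustration, not details taken from the competition.

```python
def basket_f1(predicted, actual):
    """F1-score between a predicted product set and the true product set."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        # Assumption: predicting nothing for an empty basket is a perfect match.
        return 1.0
    true_positives = len(predicted & actual)
    # 2 * TP / (|predicted| + |actual|) equals the harmonic mean of
    # precision (TP / |predicted|) and recall (TP / |actual|).
    return 2.0 * true_positives / (len(predicted) + len(actual))


# Hypothetical product ids: 2 of 3 predictions are correct, 2 of 4 true items are found.
score = basket_f1({1, 2, 3}, {2, 3, 4, 5})
```

Here precision is 2/3 and recall is 1/2, so the score is 4/7.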

The Approach

The task was reformulated as a binary prediction task: given a user, a product, and the user's prior purchase history, predict whether or not the given product will be reordered in the user's next order. In short, the approach was to fit a variety of generative models to the prior data and use the internal representations from these models as features for second-level models.
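The reformulation can be sketched on toy data as follows. The data layout, the names `prior_orders`, `next_order`, and `build_examples`, and the choice to generate one candidate row per previously purchased product are all assumptions for illustration; the actual feature pipeline is not described here.

```python
from collections import defaultdict

# Hypothetical toy data: (user, order_number, products) for prior orders,
# plus the set of products in each user's next order.
prior_orders = [
    ("u1", 1, ["milk", "eggs"]),
    ("u1", 2, ["milk", "bread"]),
    ("u2", 1, ["tea"]),
]
next_order = {"u1": {"milk", "jam"}, "u2": set()}


def build_examples(prior_orders, next_order):
    """Return one (user, product, label) row per product the user has bought before."""
    history = defaultdict(set)
    for user, _order_number, products in prior_orders:
        history[user].update(products)

    rows = []
    for user in sorted(history):
        for product in sorted(history[user]):
            # Label is 1 when a previously bought product appears in the next order.
            label = int(product in next_order.get(user, set()))
            rows.append((user, product, label))
    return rows


examples = build_examples(prior_orders, next_order)
```

Note that "jam" produces no row: it was never bought before, so it cannot be a reorder.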

First-level models

The first-level models vary in their inputs, architectures, and objectives, resulting in a diverse set of representations.

- Product RNN/CNN (code): a combined RNN and CNN trained to predict the probability that a user will order a product at each timestep. The RNN is a single-layer LSTM and the CNN is a 6-layer causal CNN with dilated convolutions.
- Aisle RNN (code): an RNN similar to the first model, but trained at the aisle level (predict whether a user purchases any products from a given aisle at each timestep).
- Department RNN (code): an RNN trained at the department level.
- Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model.
- Order size RNN (code): an RNN trained to predict the next order size, minimizing RMSE.
- Order size RNN mixture model (code): an RNN trained to predict the next order size, maximizing the likelihood of a Gaussian mixture model.
- Skip-Gram with Negative Sampling (SGNS) (code): SGNS trained on sequences of ordered products.
- Non-Negative Matrix Factorization (NNMF) (code): NNMF trained on a matrix of user-product order counts.
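To illustrate what "causal CNN with dilated convolutions" means, here is a small pure-Python sketch. The kernel size of 2 and the dilation schedule that doubles at each of the 6 layers are assumptions chosen for the example; the text above does not state the actual hyperparameters, and the real models are built with TensorFlow rather than loops like these.

```python
def causal_dilated_conv(sequence, weights, dilation):
    """1-D causal convolution over a list of floats.

    weights[0] multiplies the current step and weights[j] the step
    j * dilation in the past, so no output ever depends on future inputs.
    Positions before the start of the sequence are treated as zeros.
    """
    output = []
    for t in range(len(sequence)):
        total = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                total += w * sequence[idx]
        output.append(total)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input timesteps that can influence one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Assumed configuration: kernel size 2, dilation doubling at each of the 6 layers.
dilations = [1, 2, 4, 8, 16, 32]
field = receptive_field(2, dilations)

# An impulse at t=0 pushed through the stack stays nonzero for exactly `field` timesteps.
signal = [1.0] + [0.0] * 99
for d in dilations:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)
```

Under these assumptions the stack sees 64 past timesteps, which shows why dilation is attractive for long order histories: the receptive field grows exponentially with depth while the parameter count grows linearly.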

Second-level models

The second-level models use the internal representations from the first-level models as features.

- GBM (code): a lightgbm model.
- Feedforward NN (code): a feedforward neural network.

The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is the product subset that maximizes the expected F1-score under these probabilities.
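The blending and basket-selection steps can be sketched as below. This is an illustration under stated assumptions, not the method used in the solution: the blend weights and probabilities are made up, reorders are assumed independent, only top-k prefixes of the ranked products are searched, and the expected F1 is computed by brute-force enumeration, which is only feasible for a handful of candidates.

```python
from itertools import product as cartesian


def f1(predicted, actual):
    if not predicted and not actual:
        return 1.0  # assumption: empty prediction for an empty basket is perfect
    return 2.0 * len(predicted & actual) / (len(predicted) + len(actual))


def blend(prob_lists, weights):
    """Weighted average of per-model reorder probabilities."""
    total = float(sum(weights))
    return [sum(w * p for w, p in zip(weights, probs)) / total
            for probs in zip(*prob_lists)]


def expected_f1(chosen, probs):
    """Exact expected F1 of `chosen` (a set of indices), assuming independent reorders."""
    expectation = 0.0
    for outcome in cartesian([0, 1], repeat=len(probs)):
        weight = 1.0
        for bought, p in zip(outcome, probs):
            weight *= p if bought else 1.0 - p
        actual = {i for i, bought in enumerate(outcome) if bought}
        expectation += weight * f1(chosen, actual)
    return expectation


def best_basket(probs):
    """Try the top-k products for every k and keep the k with the highest expected F1."""
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    candidates = [set(ranked[:k]) for k in range(len(probs) + 1)]
    return max(candidates, key=lambda c: expected_f1(c, probs))


# Hypothetical outputs of two second-level models for four candidate products.
gbm_probs = [0.9, 0.5, 0.3, 0.05]
nn_probs = [0.7, 0.5, 0.1, 0.15]
blended = blend([gbm_probs, nn_probs], weights=[0.5, 0.5])
basket = best_basket(blended)
```

With these made-up numbers the blended probabilities are roughly 0.8, 0.5, 0.2, and 0.1, and the selected basket contains the first two products. The basket includes a product with only a 50% reorder probability because, in expectation, the recall it adds outweighs the precision it costs.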


Requirements

64 GB RAM and 12 GB GPU (recommended), Python 2.7

Python packages:

- lightgbm==2.0.4
- numpy==1.13.1
- pandas==0.19.2
- scikit-learn==0.18.1
- tensorflow==1.3.0
