by sjvasquez

Kaggle | Instacart Market Basket Analysis🥕🥉

Instacart Market Basket Analysis

My solution for the Instacart Market Basket Analysis competition hosted on Kaggle.

The Task

The dataset is an open-source dataset provided by Instacart (source)

This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.

Below is the full data schema (source)

orders (3.4m rows, 206k users):
* order_id: order identifier
* user_id: customer identifier
* eval_set: which evaluation set this order belongs in (see SET described below)
* order_number: the order sequence number for this user (1 = first, n = nth)
* order_dow: the day of the week the order was placed on
* order_hour_of_day: the hour of the day the order was placed on
* days_since_prior_order: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
* product_id: product identifier
* product_name: name of the product
* aisle_id: foreign key
* department_id: foreign key

aisles (134 rows):
* aisle_id: aisle identifier
* aisle: the name of the aisle

departments (21 rows):
* department_id: department identifier
* department: the name of the department

order_products__SET (30m+ rows):
* order_id: foreign key
* product_id: foreign key
* add_to_cart_order: order in which each product was added to cart
* reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the following three evaluation sets (eval_set in orders):
* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)

The task is to predict which products a user will reorder in their next order. The evaluation metric is the F1-score between the set of predicted products and the set of true products.
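As a rough illustration of the metric, the sketch below computes the F1-score between one predicted basket and one true basket. The function name and the sample product ids are made up for the example, and returning 0 when the two sets share no product is an assumption about edge cases that the description above does not cover.

```python
def f1_score(predicted, actual):
    """F1 between a predicted basket and the true basket (iterables of product ids)."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        # Assumption: no overlap (including an empty basket) scores 0.
        return 0.0
    precision = true_positives / float(len(predicted))
    recall = true_positives / float(len(actual))
    return 2.0 * precision * recall / (precision + recall)


# Two of three predicted products are correct, two of four true products are found.
print(f1_score([1, 2, 3], [2, 3, 4, 5]))
```

Precision here is 2/3 and recall is 1/2, so the score is their harmonic mean, 4/7.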

The Approach

The task was reformulated as a binary prediction task: Given a user, a product, and the user's prior purchase history, predict whether or not the given product will be reordered in the user's next order. In short, the approach was to fit a variety of generative models to the prior data and use the internal representations from these models as features to second-level models.
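One way to picture the reformulation is as a table with one row per (user, product) pair and a 0/1 label. The sketch below is only an illustration, not the repository's code: the function name, the in-memory dict layout, and the choice to restrict candidates to products the user has bought before are all assumptions.

```python
def build_examples(prior_orders, next_order):
    """Turn purchase histories into (user_id, product_id, label) rows.

    prior_orders: dict of user_id -> list of orders, oldest first,
                  each order a list of product ids (hypothetical layout).
    next_order:   dict of user_id -> product ids in the order to predict.
    """
    rows = []
    for user_id, orders in sorted(prior_orders.items()):
        # Assumption: only previously purchased products can be reordered,
        # so they are the only candidates.
        candidates = sorted({p for order in orders for p in order})
        target = set(next_order.get(user_id, []))
        for product_id in candidates:
            rows.append((user_id, product_id, int(product_id in target)))
    return rows


prior = {1: [[10, 20], [10, 30]]}
upcoming = {1: [10, 40]}
print(build_examples(prior, upcoming))
```

Product 40 never appears in the user's history, so it is not a reorder and produces no row; product 10 is the only positive example.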

First-level models

The first-level models vary in their inputs, architectures, and objectives, resulting in a diverse set of representations.
- Product RNN/CNN (code): a combined RNN and CNN trained to predict the probability that a user will order a product at each timestep. The RNN is a single-layer LSTM and the CNN is a 6-layer causal CNN with dilated convolutions.
- Aisle RNN (code): an RNN similar to the first model, but trained at the aisle level (predict whether a user purchases any products from a given aisle at each timestep).
- Department RNN (code): an RNN trained at the department level.
- Product RNN mixture model (code): an RNN similar to the first model, but instead trained to maximize the likelihood of a Bernoulli mixture model.
- Order size RNN (code): an RNN trained to predict the next order size, minimizing RMSE.
- Order size RNN mixture model (code): an RNN trained to predict the next order size, maximizing the likelihood of a Gaussian mixture model.
- Skip-Gram with Negative Sampling (SGNS) (code): SGNS trained on sequences of ordered products.
- Non-Negative Matrix Factorization (NNMF) (code): NNMF trained on a matrix of user-product order counts.
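To make "causal CNN with dilated convolutions" concrete, here is a minimal pure-Python sketch of the idea. It is not the repository's TensorFlow model: the kernel size of 2, the doubling dilation rates, the shared weights, and the absence of nonlinearities are all assumptions chosen to keep the example short.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: out[t] uses only x[t], x[t - d], x[t - 2d], ..."""
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # positions before the sequence start act as zero padding
                total += w * x[idx]
        out.append(total)
    return out


def causal_stack(x, weights, dilations):
    """Apply one causal layer per dilation rate (same weights reused for brevity)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of past timesteps (including the current one) each output can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


DILATIONS = [1, 2, 4, 8, 16, 32]  # assumed rates for a 6-layer stack

print(causal_dilated_conv([1, 2, 3], [1, 1], 1))
print(receptive_field(2, DILATIONS))
```

Because each layer only looks backwards, the prediction at a timestep cannot leak information from later orders, and dilation lets six layers cover a long history (64 timesteps under these assumed settings).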

Second-level models

The second-level models use the internal representations from the first-level models as features.
- GBM (code): a LightGBM model.
- Feedforward NN (code): a feedforward neural network.

The final reorder probabilities are a weighted average of the outputs from the second-level models. The final basket is the product subset with the maximum expected F1-score under these probabilities.
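The two steps above can be sketched as follows. This is a brute-force illustration for a handful of products, not the repository's implementation: the blending weight, the independence assumption between products, the restriction to top-k baskets, and scoring two empty baskets as a perfect match are all assumptions made for the example.

```python
from itertools import product as cartesian


def blend(prob_a, prob_b, weight_a=0.5):
    """Weighted average of two models' reorder probabilities (weight is hypothetical)."""
    return {p: weight_a * prob_a[p] + (1.0 - weight_a) * prob_b[p] for p in prob_a}


def f1(predicted, actual):
    if not predicted and not actual:
        return 1.0  # assumption: predicting nothing when nothing is reordered is a perfect match
    return 2.0 * len(predicted & actual) / (len(predicted) + len(actual))


def expected_f1(basket, probs):
    """Exact expectation over all 2^n outcomes, assuming independent products."""
    items = list(probs)
    total = 0.0
    for outcome in cartesian([0, 1], repeat=len(items)):
        likelihood = 1.0
        for item, bought in zip(items, outcome):
            likelihood *= probs[item] if bought else 1.0 - probs[item]
        actual = {item for item, bought in zip(items, outcome) if bought}
        total += likelihood * f1(basket, actual)
    return total


def best_basket(probs):
    """Search only baskets made of the k most probable products, k = 0..n."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    candidates = [set(ranked[:k]) for k in range(len(ranked) + 1)]
    return max(candidates, key=lambda basket: expected_f1(basket, probs))


probs = blend({"a": 1.0, "b": 0.5, "c": 0.1}, {"a": 0.8, "b": 0.3, "c": 0.1})
print(best_basket(probs))
```

Enumerating every outcome is exponential in the number of candidate products, so a real pipeline would need a more efficient expected-F1 computation. The sketch only shows why the chosen basket is not simply "every product with probability above 0.5".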


Requirements

64 GB RAM and 12 GB GPU (recommended), Python 2.7

Python packages:
- lightgbm==2.0.4
- numpy==1.13.1
- pandas==0.19.2
- scikit-learn==0.18.1
- tensorflow==1.3.0
