A general-purpose framework for solving problems with machine learning applied to predicting customer churn
This project demonstrates a three-step, general-purpose framework for solving problems with machine learning. The purpose of the framework is to provide scaffolding for rapidly developing machine learning solutions across industries and datasets.
The end result is both a specific solution to a customer churn use case, which reduces revenue lost to churn by more than 10%, and a general approach you can use to solve your own problems with machine learning.
Machine learning is currently an ad hoc process that requires a custom solution for each problem. Even for the same dataset, a slightly different prediction problem requires an entirely new pipeline built from scratch. This has made it too difficult for many companies to take advantage of machine learning. The standardized procedure presented here makes it easier to solve meaningful problems with machine learning, allowing more companies to harness this transformative technology.
The notebooks in this repository document a step-by-step application of the framework to a real-world use case and dataset - predicting customer churn. This is a critical need for subscription-based businesses and an ideal application of machine learning.
The dataset is provided by KKBOX, Asia's largest music streaming service, and can be downloaded here.
Within the overall scaffolding, several standard data science toolboxes are used to solve the problem:
The final results comparing several models are shown below:
| Model                                     | ROC AUC | Recall | Precision | F1 Score |
|-------------------------------------------|---------|--------|-----------|----------|
| Naive Baseline (no ml)                    | 0.5     | 3.47%  | 1.04%     | 0.016    |
| Logistic Regression                       | 0.577   | 0.51%  | 2.91%     | 0.009    |
| Random Forest Default                     | 0.929   | 65.2%  | 14.7%     | 0.240    |
| Random Forest Tuned for 75% Recall        | 0.929   | 75%    | 8.31%     | 0.150    |
| Auto-optimized Model                      | 0.927   | 2.88%  | 64.4%     | 0.055    |
| Auto-optimized Model Tuned for 75% Recall | 0.927   | 75%    | 9.58%     | 0.170    |
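
The F1 scores in the table follow from the precision and recall columns, since F1 is the harmonic mean of the two. The snippet below is a small illustrative check, not code from the notebooks: the function name `f1_score` and the dictionary `reported` are names chosen here, and the values are copied from the table as fractions.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the results table, as fractions.
reported = {
    "Random Forest Default": (0.147, 0.652),
    "Random Forest Tuned for 75% Recall": (0.0831, 0.75),
    "Auto-optimized Model": (0.644, 0.0288),
    "Auto-optimized Model Tuned for 75% Recall": (0.0958, 0.75),
}

# Round to three decimals, the precision used in the table's F1 column.
f1_by_model = {
    name: round(f1_score(precision, recall), 3)
    for name, (precision, recall) in reported.items()
}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1}")
```

Because F1 is dominated by the smaller of the two inputs, the untuned auto-optimized model scores low on F1 despite its high precision: its recall is under 3%.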
Final Confusion Matrix
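
Two rows of the results table are tuned for 75% recall, and a confusion matrix is always computed at one particular decision threshold. This README does not say how the tuning was done. One common approach, assumed here, is to keep the trained model fixed and lower the probability threshold until the desired share of true churners is caught. The sketch below uses only the standard library and made-up toy scores; `threshold_for_recall` and `confusion_counts` are hypothetical helper names, not functions from the notebooks.

```python
import math


def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall is at least target_recall.

    A customer is predicted to churn when score >= threshold.
    """
    # Scores of the true churners, highest first.
    positive_scores = sorted(
        (s for y, s in zip(y_true, scores) if y == 1), reverse=True
    )
    if not positive_scores:
        raise ValueError("y_true must contain at least one positive label")
    # Number of true churners that must be flagged to reach the target.
    needed = max(1, math.ceil(target_recall * len(positive_scores)))
    return positive_scores[needed - 1]


def confusion_counts(y_true, scores, threshold):
    """Count true/false positives and negatives at the given threshold."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for y, s in zip(y_true, scores):
        predicted_churn = s >= threshold
        if predicted_churn:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["fn" if y == 1 else "tn"] += 1
    return counts


# Toy labels (1 = churned) and model scores, invented for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

threshold = threshold_for_recall(y_true, scores, target_recall=0.75)
counts = confusion_counts(y_true, scores, threshold)
recall = counts["tp"] / (counts["tp"] + counts["fn"])
precision = counts["tp"] / (counts["tp"] + counts["fp"])
print(threshold, counts, recall, precision)
```

Lowering the threshold raises recall at the cost of precision, which is the same trade-off visible between the default and tuned rows of the table. In practice the threshold would be chosen on held-out validation data rather than on the data used to report the final metrics.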
To scale the feature engineering to a large dataset, the data was partitioned and automated feature engineering was run in parallel using Apache Spark with PySpark.
Featuretools natively supports scaling to multiple cores on one machine, or to multiple machines using a Dask cluster. This project shows that Spark can also be used to parallelize feature engineering, reducing run times even on large datasets.
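
The partition-then-parallelize idea can be sketched without Spark or Featuretools. The example below is a standard-library illustration under stated assumptions, not the project's pipeline: the column names `customer_id` and `amount`, the partition count, and the `build_features` function are all hypothetical. In the setup described above, Spark would distribute the partitions across workers and Featuretools would compute the features for each one; here a thread pool and a simple per-customer aggregate stand in for both.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_id(customer_id: str, n_partitions: int) -> int:
    """Map a customer ID to a stable partition number.

    crc32 is used instead of the built-in hash(), which is randomized
    per process for strings.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % n_partitions


def partition_rows(rows, n_partitions):
    """Group rows so that all rows for one customer share a partition."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_id(row["customer_id"], n_partitions)].append(row)
    return partitions


def build_features(rows):
    """Hypothetical stand-in for one per-partition feature engineering run."""
    features = defaultdict(lambda: {"n_transactions": 0, "total_paid": 0.0})
    for row in rows:
        customer = features[row["customer_id"]]
        customer["n_transactions"] += 1
        customer["total_paid"] += row["amount"]
    return dict(features)


# Toy transaction rows, invented for illustration only.
rows = [
    {"customer_id": "a", "amount": 100.0},
    {"customer_id": "b", "amount": 50.0},
    {"customer_id": "a", "amount": 150.0},
    {"customer_id": "c", "amount": 75.0},
]

partitions = partition_rows(rows, n_partitions=4)

# Each partition is independent, so the runs can execute in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(build_features, partitions.values()))

# Customers never span partitions, so merging is a plain union.
feature_matrix = {}
for partial in partial_results:
    feature_matrix.update(partial)

print(feature_matrix)
```

Partitioning on the customer ID is the design choice that makes this work: every feature is computed from a single customer's history, so no partition needs data held by another and the partial results can be combined without reconciliation.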
Featuretools is an open source project created by Feature Labs. To see the other open source projects we're working on visit Feature Labs Open Source. If building impactful data science pipelines is important to you or your business, please get in touch.
Any questions can be directed to [email protected]