

Landmark Papers in Machine Learning


This document attempts to collect the papers which developed important techniques in machine learning. Research is a collaborative process, discoveries are made independently, and the difference between the original version and a precursor can be subtle, but I've done my best to select the papers that I think are novel or significant.

My opinions are by no means the final word on these topics. Please create an issue or pull request if you have a suggestion.


| Icon | Meaning |
| ---- | ------------------------------------------------------------ |
| 🔒 | Paper behind paywall. In some cases, I provide an alternative link to the paper if it comes directly from one of the authors. |
| 🔑 | Freely available version of paywalled paper, directly from the author. |
| 💽 | Code associated with the paper. |
| 🏛️ | Precursor or historically relevant paper. This may be a fundamental breakthrough that paved the way for the concept in question to be developed. |
| 🔬 | Iteration, advancement, elaboration, or major popularization of a technique. |
| 📔 | Blog post or something other than a formal publication. |
| 🌐 | Website associated with the paper. |
| 🎥 | Video associated with the paper. |
| 📊 | Slides or images associated with the paper. |

Papers preceded by “See also” indicate either additional historical context or else major developments, breakthroughs, or applications.

Association Rule Learning

  • Mining Association Rules between Sets of Items in Large Databases (1993), Agrawal, Imielinski, and Swami, @CiteSeerX.

  • See also: The GUHA method of automatic hypotheses determination (1966), Hájek, Havel, and Chytil, @Springer 🔒 🏛️.

Datasets


  • The Enron Corpus: A New Dataset for Email Classification Research (2004), Klimt and Yang, @Springer 🔒 / @author 🔑.
  • See also: Introducing the Enron Corpus (2004), Klimt and Yang, @author.
  • ImageNet: A large-scale hierarchical image database (2009), Deng et al., @IEEE 🔒 / @author 🔑.
  • See also: ImageNet Large Scale Visual Recognition Challenge (2015), @Springer 🔒 / @arXiv 🔑 + @author 🌐.

Decision Trees

  • Induction of Decision Trees (1986), Quinlan, @Springer.

Deep Learning

AlexNet (image classification CNN)
  • ImageNet Classification with Deep Convolutional Neural Networks (2012), @NIPS.
Convolutional Neural Network
  • Gradient-based learning applied to document recognition (1998), LeCun, Bottou, Bengio, and Haffner, @IEEE 🔒 / @author 🔑.
  • See also: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position (1980), Fukushima, @Springer 🏛️.
  • See also: Phoneme recognition using time-delay neural networks (1989), Waibel, Hanazawa, Hinton, Shikano, and Lang, @IEEE 🏛️.
  • See also: Fully Convolutional Networks for Semantic Segmentation (2014), Long, Shelhamer, and Darrell, @arXiv.
DeepFace (facial recognition)
  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification (2014), Taigman, Yang, Ranzato, and Wolf, Facebook Research.
Generative Adversarial Network
  • Generative Adversarial Nets (2014), Goodfellow et al., @NIPS + @Github 💽.
Generative Pre-Training (GPT)
  • Improving Language Understanding by Generative Pre-Training (2018) aka GPT, Radford, Narasimhan, Salimans, and Sutskever, @OpenAI + @Github 💽 + @OpenAI 📔.
  • See also: Language Models are Unsupervised Multitask Learners (2019) aka GPT-2, Radford, Wu, Child, Luan, Amodei, and Sutskever, @OpenAI 🔬 + @Github 💽 + @OpenAI 📔.
  • See also: Language Models are Few-Shot Learners (2020) aka GPT-3, Brown et al., @arXiv + @OpenAI 📔.
Inception (classification/detection CNN)
  • Going Deeper with Convolutions (2014), Szegedy et al., @Github 💽.
  • See also: Rethinking the Inception Architecture for Computer Vision (2016), Szegedy, Vanhoucke, Ioffe, Shlens, and Wojna 🔬.
  • See also: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), Szegedy, Ioffe, Vanhoucke, and Alemi 🔬.
Long Short-Term Memory (LSTM)
  • Long Short-term Memory (1995), Hochreiter and Schmidhuber, @CiteSeerX.
Residual Neural Network (ResNet)
  • Deep Residual Learning for Image Recognition (2015), He, Zhang, Ren, and Sun, @arXiv.
Transformer (sequence to sequence modeling)
  • Attention Is All You Need (2017), Vaswani et al., @NIPS.
U-Net (image segmentation CNN)
  • U-Net: Convolutional Networks for Biomedical Image Segmentation (2015), Ronneberger, Fischer, and Brox, @Springer 🔒 / @arXiv 🔑.
VGG (image recognition CNN)
  • Very Deep Convolutional Networks for Large-Scale Image Recognition (2015), Simonyan and Zisserman, @arXiv + @author 🌐 + @ICLR 📊 + @YouTube 🎥.

Ensemble Methods

  • A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting (1997; published as abstract in 1995), Freund and Schapire, @CiteSeerX.

  • See also: Experiments with a New Boosting Algorithm (1996), Freund and Schapire, @CiteSeerX 🔬.

  • Bagging Predictors (1996), Breiman, @Springer.
Gradient Boosting
  • Greedy function approximation: A gradient boosting machine (2001), Friedman, @Project Euclid.
  • See also: XGBoost: A Scalable Tree Boosting System (2016), Chen and Guestrin, @arXiv 🔬 + @GitHub 💽.
Random Forest
  • Random Forests (2001), Breiman, @CiteSeerX.

Games


  • Mastering the game of Go with deep neural networks and tree search (2016), Silver et al., @Nature.
Deep Blue
  • IBM's deep blue chess grandmaster chips (1999), Hsu, @IEEE 🔒.
  • See also: Deep Blue (2002), Campbell, Hoane, and Hsu, @ScienceDirect 🔒.

Optimization


  • Adam: A Method for Stochastic Optimization (2015), Kingma and Ba, @arXiv.
Expectation Maximization
  • Maximum likelihood from incomplete data via the EM algorithm (1977), Dempster, Laird, and Rubin, @CiteSeerX.
Stochastic Gradient Descent
  • Stochastic Estimation of the Maximum of a Regression Function (1952), Kiefer and Wolfowitz, @ProjectEuclid.
  • See also: A Stochastic Approximation Method (1951), Robbins and Monro, @ProjectEuclid 🏛️.

Miscellaneous


Non-negative Matrix Factorization
  • Learning the parts of objects by non-negative matrix factorization (1999), Lee and Seung, @Nature 🔒.
PageRank
  • The PageRank Citation Ranking: Bringing Order to the Web (1998), Page, Brin, Motwani, and Winograd, @CiteSeerX.
DeepQA (Watson)
  • Building Watson: An Overview of the DeepQA Project (2010), Ferrucci et al., @AAAI.

Natural Language Processing

Latent Dirichlet Allocation
  • Latent Dirichlet Allocation (2003), Blei, Ng, and Jordan, @JMLR.
Latent Semantic Analysis
  • Indexing by latent semantic analysis (1990), Deerwester, Dumais, Furnas, Landauer, and Harshman, @CiteSeerX.
Word2vec
  • Efficient Estimation of Word Representations in Vector Space (2013), Mikolov, Chen, Corrado, and Dean, @arXiv + @Google Code 💽.

Neural Network Components

  • Autograd: Effortless Gradients in Numpy (2015), @ICML + @ICML 📊 + @Github 💽.
  • Learning representations by back-propagating errors (1986), Rumelhart, Hinton, and Williams, @Nature 🔒.
  • See also: Backpropagation Applied to Handwritten Zip Code Recognition (1989), LeCun et al., @IEEE 🔒 🔬 / @author 🔑.
Batch Normalization
  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (2015), Ioffe and Szegedy, @ICML via PMLR.
Dropout
  • Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014), Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, @JMLR.
Gated Recurrent Unit
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (2014), Cho et al., @arXiv.
Perceptron
  • The Perceptron: A Probabilistic Model for Information Storage and Organization in The Brain (1958), Rosenblatt, @CiteSeerX.

Recommender Systems

Collaborative Filtering
  • Using collaborative filtering to weave an information tapestry (1992), Goldberg, Nichols, Oki, and Terry, @CiteSeerX.
Matrix Factorization
  • Application of Dimensionality Reduction in Recommender System - A Case Study (2000), Sarwar, Karypis, Konstan, and Riedl, @CiteSeerX.
  • See also: Learning Collaborative Information Filters (1998), Billsus and Pazzani, @CiteSeerX 🏛️.
  • See also: Netflix Update: Try This at Home (2006), Funk, @author 📔 🔬.
Implicit Matrix Factorization
  • Collaborative Filtering for Implicit Feedback Datasets (2008), Hu, Koren, and Volinsky, @IEEE 🔒 / @author 🔑.

Regularization


Elastic Net
  • Regularization and variable selection via the Elastic Net (2005), Zou and Hastie, @CiteSeer.
Lasso
  • Regression Shrinkage and Selection Via the Lasso (1994), Tibshirani, @CiteSeerX.
  • See also: Linear Inversion of Band-Limited Reflection Seismograms (1986), Santosa and Symes, @SIAM 🏛️.

Software


  • MapReduce: Simplified Data Processing on Large Clusters (2004), Dean and Ghemawat.
  • TensorFlow: A system for large-scale machine learning (2016), Abadi et al., @author 🌐.
  • Torch: A Modular Machine Learning Software Library (2002), Collobert, Bengio, and Mariéthoz, @Idiap + @author 🌐.
  • See also: Automatic differentiation in PyTorch (2017), Paszke et al., @OpenReview 🔬 + @Github 💽.

Supervised Learning

k-Nearest Neighbors
  • Nearest neighbor pattern classification (1967), Cover and Hart, @IEEE 🔒.
  • See also: E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation (1989), Silverman and Jones, @JSTOR 🔒.
Support Vector Machine
  • Support Vector Networks (1995), Cortes and Vapnik, @Springer.

Statistics


The Bootstrap
  • Bootstrap Methods: Another Look at the Jackknife (1979), Efron, @Project Euclid.
  • See also: Problems in Plane Sampling (1949), Quenouille, @Project Euclid 🏛️.
  • See also: Notes on Bias in Estimation (1956), Quenouille, @JSTOR 🏛️.
  • See also: Bias and Confidence in Not-quite Large Samples (1958), Tukey, @Project Euclid 🔬.


Special thanks to Alexandre Passos for his comment on this Reddit thread, as well as the responders to this Quora post. They suggested many great papers that got this list off to a strong start.
