Need help with seaibib?
Click the “chat” button below for chat support from the developer who created it, or find similar developers for support.

About the developer

214 Stars 22 Forks Creative Commons Zero v1.0 Universal 129 Commits 1 Opened issues


Software Engineering for AI/ML -- An Annotated Bibliography

Services available


Need anything else?

Contributors list

# 706,356
4 commits

Software Engineering for AI/ML -- An Annotated Bibliography

PRs Welcome

Author: Christian Kästner, CMU


While the academic literature on solving software engineering problems with machine learning techniques (ML4SE or AI4SE) abounds and has a long history, there is far less academic research on how to improve the engineering of systems with AI/ML components (SE4ML or SE4AI).

With an interest primarily for teaching a software engineering class for AI/ML systems (course web page, paper about the course), I tried to read up on the field. I had several interesting discussions since and have shared my collection of papers on the topic repeatedly and received some pointers to other papers as well. In an effort to help the community I'm sharing here a list of papers that I found interesting and useful with some sparse notes.

All notes are my personal opinions.

If you have suggestions for other readings (with or without annotations) consider sending a pull request :)

Key Resources

Hulten, Geoff. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress. 2018

This is the best book that I'm aware of that covers the software engineering aspects of building ML systems, including coverage of requirements, architecture, quality assurance, and process. We used it as textbook for our course and assigned many chapters as reading. Coverage is often more broad than deep and there is an overwhelming amount of bulleted lists (not uncommon for SE textbooks), but it is a great introduction to the topic.

Sculley, David, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. "Hidden technical debt in machine learning systems." In Advances in neural information processing systems, pp. 2503-2511. 2015.

Position paper. Probably the most cited paper in this field (and the original title "Machine learning: The high interest credit card of technical debt" is one of my favorite paper titles), describes the challenges in building ML systems and how poor engineering choices can be very expensive. While the "technical debt" metaphor is a bit forced and does not align with how I'd teach technical debt and most of the description is rather abstract and not well grounded in the SE literature, the paper provides a great argument that a machine learning system is more than just the ML model and building and operating it is a serious undertaking.

Pro tip: A google scholar alert for citations to this paper is a good way to notice new SE4ML publications.

Amershi, Saleema, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. "Software engineering for machine learning: A case study." In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pp. 291-300. IEEE, 2019.

A good description of the SE4ML challenges at Microsoft, characterizing the challenges of different roles and grounded in interviews and a large scale survey. Motivating well how pervasive ML is in modern systems and describing best practices and some challenges.

Ozkaya, Ipek. "What Is Really Different in Engineering AI-Enabled Systems?" IEEE Software 37, no. 4 (2020): 3-6.

Concise opinion paper (editorial), discussing that building software system with AI components is harder but not necessarily that different from building traditional software systems. This very much mirrors my own opinion shared in my class and talks. Touches on many great points regarding specifications, requirements, and safety. Essentially a call to action that we need more software engineering.

Quality Assurance

There is a lot of work that covers testing ML systems in some form, most of it seems focused on fairly narrow properties of a model. Testing an ML-enabled system should be much broader, including system testing and testing of the infrastructure (e.g., learning pipeline and update mechanisms) and testing in production.

Breck, Eric, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. 2017. “The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction.” IEEE International Conference on Big Data (Big Data). 2017.

Nice position paper that discusses the many different aspects of quality assurance in an ML project, beyond just model and data quality. Includes some examples and a checklist of QA steps to consider. Good introduction to the problem and based on practical experience at Google. (Note this is a slightly extended version from a similarly titled paper)

Kaestner, Christian. "Machine Learning is Requirements Engineering — On the Role of Bugs, Verification, and Validation in Machine Learning." Medium Blog Post. 2020.

My own discussion of the role of specifications in machine learning. Many testing papers below that focus on model quality (rather than infrastructure quality, as Breck above) are rather vague and confusing to me with regard to specifications. I argue that machine learning corresponds to the requirements engineering phase of a project rather than the implementation phase and, as such, terminology that relates to validation (i.e., do we build the right system, given stakeholder needs) is more suitable than terminology that relates to verification (i.e., do we build the system right, given a specification). That is, machine learning suggests a specification (like specification mining and invariant detection) rather than provides an implementation for a known specification (like synthesis).

Siebert, Julien, Lisa Joeckel, Jens Heidrich, Koji Nakamichi, Kyoko Ohashi, Isao Namba, Rieko Yamamoto, and Mikio Aoyama. "Towards Guidelines for Assessing Qualities of Machine Learning Systems." In International Conference on the Quality of Information and Communications Technology, pp. 17-31. Springer, Cham, 2020.

The paper discusses quality attributes of a production ML system beyond just the ML components. It is explicit about considering multiple views of different components, of the entire system, and of the environment the system is embedded it. These views are useful for guiding a discussion of which qualities to consider. Importantly, the paper provides a concrete example of a system by Fujitsu and lists a large number of qualities considered in Table 1 -- very useful for teaching.

Tang, Diane, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. "Overlapping experiment infrastructure: More, better, faster experimentation." In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 17-26. 2010.

Bakshy, Eytan, Dean Eckles, and Michael S. Bernstein. "Designing and deploying online field experiments." In Proceedings of the 23rd international conference on World wide web, pp. 283-292. 2014.

Two papers discussing infrastructure for A/B testing (which is a good foundation and frequently used for testing ML systems in production) at Google and Facebook. More focused on infrastructure around it, but useful for getting into a discussion on how to test in production.

Nushi, Besmira, Ece Kamar, Eric Horvitz, and Donald Kossmann. "On human intellect and machine failures: troubleshooting integrative machine learning systems." In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 1017-1025. 2017.

The paper is really about interfaces and blame assignment between multiple ML models that are composed in a system. It discusses the problem with a concrete example (nice!) with three ML models for creating text labels for images, illustrating non-local and non-monotonic effects. The concrete solution requires that individual models can be evaluated and improved with crowd workers, but may be less relevant than the overall framework and example.


Ashmore, Rob, Radu Calinescu, and Colin Paterson. "Assuring the machine learning lifecycle: Desiderata, methods, and challenges." arXiv preprint arXiv:1905.04223. 2019.

Survey on testing in machine learning, going through the stages of an ML pipeline. Many pointers and reasonable organization. Seems more from an ML perspective than an SE perspective, but broadly covers many aspects including data aquisition, data quality, robustness, safety, monitoring, and so forth. A little vague on specifications as usual and little focus on the overall system quality. No information on the used research process.

Zhang, Jie M., Mark Harman, Lei Ma, and Yang Liu. "Machine learning testing: Survey, landscapes and horizons)." IEEE Transactions on Software Engineering (2020).

Another broad survey on testing in machine learning. Includes many pointers, including different test strategies and different kinds of testing. While the pointers are useful, I was frustrated with many descriptions, definitions, and classifications and find little synthesis in this paper; e.g., I was hoping for clearer definitions of "ML bug", "data bug" or a clear discussion of specifications. Grey literature is not discussed either.

Riccio, Vincenzo, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. "Testing machine learning based systems: a systematic mapping." Empirical Software Engineering (2020): 1-62.

Yet another survey paper on ML testing. It frames the problem as testing the entire system with an ML component, but rarely goes beyond quality assurance just for the model. Somewhat biased toward self-driving cars. Again, I have quite some quibbles with many claims and definitions, e.g. model vs system or what is a specification or what is a bug. As for the other surveys, I'd suggest to use it for an overview and skim for the pointers.

Braiek, Houssem Ben, and Foutse Khomh. "On testing machine learning programs." Journal of Systems and Software 164 (2020): 110542.

Yet another survey on ML model testing, possibly the earliest of these (submitted in 2018)

Huang, Xiaowei, Daniel Kroening, Wenjie Ruan, James Sharp, Youcheng Sun, Emese Thamo, Min Wu, and Xinping Yi. "A survey of safety and trustworthiness of deep neural networks: Verification, testing, adversarial attack and defence, and interpretability." Computer Science Review 37 (2020): 100270.

Yet another survey, this one extremely broad and ambitious in scope, covering robustness verification, test case generation, test coverage, adversarial attack and defense strategies, and even explanation techniques. This one focuses exclusively on the model and on invariants (mostly robustness) at the model level; it does not relate the techniques or invariants to actual safety concerns at the system level. Nice overview, even though I again have many issues with specific definitions and claims.

(Slicing Test Data)

Several papers focus on the idea of looking at subsets of test data evaluate the model on different subpopulations, rather than just looking at a single average accuracy result.

Barash, Guy, Eitan Farchi, Ilan Jayaraman, Orna Raz, Rachel Tzoref-Brill, and Marcel Zalmanovici. "Bridging the gap between ML solutions and their business requirements using feature interactions." In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1048-1058. 2019.

Discusses how to slice validation data into subsets to observe how the model is doing on different subpopulations. Simple yet useful idea. Nice analogy to blackbox testing (they frame it in terms of combinatorial testing, though equivalence classes may be a better fit). Approach is used to slice data rather than generate new test data, which aligns also with Hulten's recommendations and my lecture/blog post on testing.

Ré, Christopher, Feng Niu, Pallavi Gudipati, and Charles Srisuwananukorn. "Overton: A Data System for Monitoring and Improving Machine-Learned Products." arXiv preprint arXiv:1909.05372 (2019).

Overview of a system design (at Apple) that focuses on slicing and improving training data incrementally. The model building part is automated and the system focuses on the training and validation data, making it easy to add more data and labels (using semi-supervised learning ideas). Nice demonstration of the importance and effectiveness of slicing data for the evaluation (see above).

See also papers in the requirements section below and the Slice Finder paper in the debugging section or my blog post on the topic.

(Unit Testing / Testing Capabilities)

A number of different papers discuss (more or less explicitly) how to test specific capabilities or partial specifications of a model – which relates closely to traditional unit testing. Capabilities are usually concepts that a model shall learn that mirror domain knowledge or human strategies toward the problem, but that only capture a specific part of the solution. The approaches usually curate or generate test datasets specifically for these capabilities – i.e., unit tests consisting of multiple data points. Assuring capabilities is associated (more or less explicitly) with selecting more robust models that better generalize beyond the training distribution. Several capabilities also relate to invariants described below.

Kaestner, Christian. "Rediscovering Unit Testing: Testing Capabilities of ML Models." Medium Blog Post. 2021.

Shameless plug of my blogpost discussing this area and several papers in it more systematically, including how to identify capabilities and how to curate tests.

Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList." In Proceedings ACL, p. 4902–4912. (2020).

Paper suggests ways of generating test cases for capabilities of NLP models: First step is to list specific capabilities of a model (e.g., to handle negation or typos in sentences); for each capability then test cases are curated. Test cases are created by writing simple template-based generators (think: QuickCheck) of sentences with holes, where tool support is given for identifying values for those holes (synonyms etc). The model is then tested on those generated sentences, either by giving an expected value (e.g., all sentences generated from a template are positive) or by using invariants (metamorphic relations), see below. The paper shows many examples of tests and invariants; the evaluation shows that this approach of testing is useful for production teams to test whether the models actually learn basic capabilities.

Naik, Aakanksha, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. "Stress test evaluation for natural language inference." Proceedings of the 27th International Conference on Computational Linguistics, p. 2340--2353 (2018).

Similar to the CheckList paper, this paper identifies 6 capabilities for NLP models, after performing an error analysis of common wrong predictions (think: root cause analysis). Capability tests are separated as competency tests, distraction tests, and noise tests, each encoding tests for specific capabilities, such as understanding antonyms or ignoring spelling errors. Test data is generated following handwritten patterns. The term “stress test” is explicitly introduced, but I consider it very misleading from a software testing perspective.

McCoy, R. Thomas, Ellie Pavlick, and Tal Linzen. "Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference." Proc. ACL (2019).

Inverse strategy from testing capabilities: Testing for known common shortcuts (maybe anti-capabilities). They identify three shortcut heuristics often used by NLP models (no discussion how identified and how this would generalize) and then create test data using patterns, not unlike the stress testing and CheckList papers above. That is, instead of testing capabilities, the absence of shortcut heuristics is tested.

Kaushik, Divyansh, Eduard Hovy, and Zachary C. Lipton. "Learning the difference that makes a difference with counterfactually-augmented data." Proc. International Conference on Learning Representations (ICLR), (2020).

Another paper that (implicitly) discusses capabilities of NLP systems. In the context of a sentiment analysis, they instruct crowd-workers to minimally modify existing sentences to change the sentiment of the text. They similarly instruct crowd-workers to minimally modify sentences to change the outcome of inference tasks. By manually analyzing the kind of changes their participants make, they find 15 common patterns, which could be interpreted as capabilities they want the model to understand, such as sarcasm, distinguishing facts from hopes, and negation. While aimed at collecting additional training data rather than generating test data, the same strategy can be used to identify capabilities and to curate test data for those capabilities.

Talmor, Alon, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. "oLMpics-on what language model pre-training captures." Transactions of the Association for Computational Linguistics 8 (2020): 743-758.

Another paper that proposes generators for creating test data for 9 specific capabilities for NLP reasoning tasks, including numeric reasoning, distinguishing always from often, handling conjunctions of facts and multi-hop reasoning. Origin of the capabilities not discussed.

Barnes, Jeremy, Lilja Øvrelid, and Erik Velldal. "Sentiment Analysis Is Not Solved! Assessing and Probing Sentiment Classification." In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 12-23. 2019.

Systematic analysis of mistakes of mistakes of sentiment analysis models to identify capabilities (or patterns) with which the models struggle. Good example of identifying capabilities from the data (similar to Naik et al above). No testing proposed.

D'Amour, Alexander, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen et al. "Underspecification presents challenges for credibility in modern machine learning." arXiv preprint arXiv:2011.03395 (2020).

The paper argues strongly for evaluating models with tests of core abstractions of the task (similar to the checklist paper above, which they call stress tests) beyond just evaluating accuracy on data samples taken from the same population. While the paper focuses on showing that models with similar accuracy can differ in how they generalize to other distributions (is this really surprising?), it contains several pointers to papers where the community seems to pick up on the importance of additional tests.

Geirhos, Robert, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. "Shortcut learning in deep neural networks." Nature Machine Intelligence 2, no. 11 (2020): 665-673.

Excellent and easy to read survey of the problem of shortcut learning (i.e. not using intended capabilities) and some directions to address it. Good introduction and examples; very good (visual) description of i.i.d. vs out-of-distribution concerns.

Gardner, Matt, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi et al. "Evaluating Models’ Local Decision Boundaries via Contrast Sets." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1307-1323. 2020.

Well argued position paper that ML competitions should include challenging datasets at the decision boundary to test specific capabilities. Good visual illustration of i.i.d. challenges. No discussion of how to identify relevant capabilities though, left to the author of the dataset.

Weber, Noah, Leena Shekhar, and Niranjan Balasubramanian. "The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models." In Proceedings of the Workshop on Generalization in the Age of Deep Learning, pp. 24-27. 2018.

Very short paper showing with an illustrative experiment the importance of learning the right structure for a task, rather than surface learning. Very explicit about how the right linguistic structure (and nothing else) is needed to generalize beyond the training distribution (“linguistic generalization”).

(Model Invariants)

Segura, Sergio, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. "A survey on metamorphic testing." IEEE Transactions on software engineering 42, no. 9 (2016): 805-824.

This paper discusses the idea of metamorphic testing and summarizes much work in this area. The key idea is that in the absence of a specification about what the model should do (which is the whole point of using ML), we might still be able to give partial specifications in terms of invariants about how outputs should relate for input pairs. For example, robustness conditions might be expressed by indicating that similar inputs should produce the same output or that linearly scaling all features or adding irrelevant features should not affect the outcome for certain classifiers. This is an interesting direction that enables testing of a number of invariants for ML models. The paper gives many pointers to other work in this area.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Semantically equivalent adversarial rules for debugging NLP models." In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 856-865. 2018.

Example of testing for invariants in NLP models (in line with metamorphic testing, even though they do not refer to that concept). Here one expects certain linguistic invariants, for example, replacing "isn't" by "is not" should not change the outcome. These invariants are used to derive and prioritize tests and to then improve the model with more input data.

Singh, Gagandeep, Timon Gehr, Markus Püschel, and Martin Vechev. "An abstract domain for certifying neural networks." Proceedings of the ACM on Programming Languages 3, no. POPL (2019): 1-30.

This is an example of a large group of paper that proofs specific invariants about models. For example, the specification could be that a model is robust to certain changes to its inputs (e.g., changing the brightness of pixels by +/- 20%) on the training set or certain classes of inputs. In contrast to much testing work, here results are verified over all input changes covered by a specification.

Galhotra, Sainyam, Yuriy Brun, and Alexandra Meliou. "Fairness testing: testing software for discrimination." In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 498-510. 2017.

Another application of testing invariants: here, for fairness certain outputs should be independent of changes to sensitive attributes. To simple for practical use in my book; limited to "fairness through blindness" (or "unawareness" or "anti-classification") and group fairness (but seemingly without addressing correlations in the data, since samples are generated uniformly).

Ding, J., D. Zhang, and X. Hu. 2016. “A Framework for Ensuring the Quality of a Big Data Service.” In 2016 IEEE International Conference on Services Computing (SCC), 82–89.

Another example of using metamorphic testing, using domain knowledge about the problem to come up with invariants. Nice concrete example, though not the easiest read.

Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. "Anchors: High-precision model-agnostic explanations." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Approach for explaining black-box models that is essentially invariant mining, not that far from Daikon. The paper identifies rules that, with high probability, are sufficient to explain a specific prediction for a subset of all inputs. These invariants can then explain part of the model and could potentially be used as partial specifications or test cases.

Kang, Daniel, Deepti Raghavan, Peter Bailis, and Matei Zaharia. "Model Assertions for Monitoring and Improving ML Model." In Proceedings of MLSys 2020.

Discusses use cases of “soft” assertions or invariants that show inconsistencies, especially across time series data and across multiple classifiers, without having to have access to labels. Table 5 in the appendix has a very good overview of different invariants and the paper discusses several plausible examples. Note that these invariants are probabilistic in nature rather than hard tests. This can be used for testing and runtime monitoring (e.g., runtime adaptation and safety mechanisms) without the need for labels! Unfortunately, the paper is very vague about concrete interfaces and technical details and focuses primarily on ML details for additional active learning and weak supervision use cases.

(Testing ML Frameworks)

Srisakaokul, Siwakorn, Zhengkai Wu, Angello Astorga, Oreoluwa Alebiosu, and Tao Xie. "Multiple-implementation testing of supervised learning software." In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Good example of a paper that tests the implementation of the ML algorithm (not the model or resulting system). Here differential testing is used to compare multiple student implementations. Nice demonstration of the idea, though it is not clear how far this would be practical beyond fairly simple learning algorithms implemented in student projects.

Xie, Xiaoyuan, Joshua WK Ho, Christian Murphy, Gail Kaiser, Baowen Xu, and Tsong Yueh Chen. "Testing and validating machine learning classifiers by metamorphic testing." Journal of Systems and Software 84, no. 4 (2011): 544-558.

This paper applies metamorphic relations to the learning framework itself. That is, it defines invariants/assertions about how inputs (training data) and outputs (learned models) should relate.

Cheng, Dawei, Chun Cao, Chang Xu, and Xiaoxing Ma. "Manifesting bugs in machine learning code: An explorative study with mutation testing." In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 313-324. IEEE, 2018.

Using mutation testing to mutate the machine learning code (e.g. SVM) to see how robust the ML implementations are to subtle faults. Shows that many mutations to the learning code lead to small degradation of accuracy of the produced models without a crash and that metamorphic testing (as in the paper above) was also not useful to find these kinds of faults.

(Other Quality Assurance Work)

Renggli, Cedric, Bojan Karlaš, Bolin Ding, Feng Liu, Kevin Schawinski, Wentao Wu, and Ce Zhang. "Continuous integration of machine learning models with Towards a rigorous yet practical treatment." arXiv preprint arXiv:1903.00278 (2019).

This paper emphasizes an interesting aspect with regard to learning-validation-testing data splits and overfitting that I didn’t previously appreciate but that’s obvious in hindsight: When using a dataset to make decisions about a model (here with a CI framing) results from that evaluation will eventually leak into the model and lead to overfitting, hence new test data is needed. The core of the paper is about deciding how much test data is needed and when it needs to be replaced. The CI framing seems rather incidental or even misleading though, as the key problem does not relate to regression testing but toward deciding whether a model outperforms another.

Pei, Kexin, Yinzhi Cao, Junfeng Yang, and Suman Jana. "DeepXplore: Automated whitebox testing of deep learning systems." In proceedings of the 26th Symposium on Operating Systems Principles, pp. 1-18. 2017.

Differential testing between multiple models (e.g., learned from the same data with different parameters) and input generation that aims to identify inputs that have different outputs across the models. Unclear to me how useful this would be in practice.

Zhang, Yuhao, Yifan Chen, Shing-Chi Cheung, Yingfei Xiong, and Lu Zhang. "An empirical study on TensorFlow program bugs." In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 129-140. 2018.

Zhang, Tianyi, Cuiyun Gao, Lei Ma, Michael R. Lyu, and Miryung Kim. "An empirical study of common challenges in developing deep learning applications." In The 30th IEEE International Symposium on Software Reliability Engineering (ISSRE). 2019.

These two papers are examples of papers that analyzes public bug reports from issue trackers or question-answer sites (stackoverflow) for machine-learning frameworks. They characterize the kinds of problems developers and users tend to have, some some solutions. Many issues seem to be common framework issues, such as documentation issues and breaking APIs. Some challenges, such as probabilistic correctness and missing debuggers, seem more ML specific.

Seshia, Sanjit A., Dorsa Sadigh, and S. Shankar Sastry. "Towards verified artificial intelligence." arXiv preprint arXiv:1606.08514 (2016).

Good framing why formal verification (or really any form of testing) is so difficult in machine learning systems in the first two sections: We neither have a specification, nor a good grasp of the environment, and in addition the system is often evolving itself. After explaining why it is so difficult, the paper points to several potentially interesting research areas, but none of them seem to overcome the fundamental problems, especially that of missing specifications.

Tramer, Florian, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, Jean-Pierre Hubaux, Mathias Humbert, Ari Juels, and Huang Lin. "FairTest: Discovering unwarranted associations in data-driven applications." In 2017 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 401-416. IEEE, 2017.

Approach to explore correlations between protected attributes and prediction outcomes in subpopulations of the dataset. The key idea is that there may not be such a correlation (generalizes to various fairness measures on the confusion matrix) for the entire population, but it may well exists for certain subpopulations, e.g., only among low-income residence do we observe gender-based discrimination. The key contribution is an efficient search among subpopulations inspired by decision tree learning algorithms. Explicitly adopts testing and debugging terminology and goes beyond the simple invariants (anti-classification) in many other fairness testing papers, i.e., it can account for unfairness through correlated attributes.


Molnar, Christoph. "Interpretable machine learning. A Guide for Making Black Box Models Explainable", 2019.

Though primarily about interpretability, this book provides a great overview of techniques to understand and debug models. The book covers both inherently interpretable models and many techniques for ex-post explanations inferred from blackbox models. Explanations can be shown to the user (though the book is fairly shallow on the challenges of doing this and larger system design questions), but they also seem very useful in understanding model (mis-)behavior, understanding individual wrong predictions and outliers. The book covers both techniques for understanding the model and understanding characteristics of the training data. Written in a fairly accessible and pragmatic style, covering the important math but also providing good intutions.

Lipton, Zachary C. "The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery." Queue 16, no. 3 (2018): 31-57.

Good and compact overview of different goals and notions of explainablity, interpretability, trust, and transparency and a quick overview of different techniques in this area.

Ma, Shiqing, Yingqi Liu, Wen-Chuan Lee, Xiangyu Zhang, and Ananth Grama. "MODE: automated neural network model debugging via state differential analysis and input selection." In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 175-186. 2018.

Uses a SE mindset (delta debugging, slicing) to approach the problem of debugging deep neural networks, especially to identify features that are critical for misbehavior with the goal of providing better learning data (framed as "root cause identification" of "training bugs"). Fairly pragmatic.

Ma, Shiqing, Yousra Aafer, Zhaogui Xu, Wen-Chuan Lee, Juan Zhai, Yingqi Liu, and Xiangyu Zhang. "LAMP: data provenance for graph based machine learning algorithms through derivative computation." In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, pp. 786-797. 2017.

Interesting idea to identify which inputs to a machine-learning algorithm have large effects on the produced model (i.e., some form of sensitivity analysis). Focused on specific classes of graph-based algorithms like Pagerank.

Bhatt, Umang, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José MF Moura, and Peter Eckersley. "Explainable machine learning in deployment." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 648-657. 2020.

An interview study asking practioners about how they use explainability techniques. Finding that they are mostly used internally, mostly for debugging, and mostly using SHAP values. Also somewhat useful includes abstract examples/scenarios of how explainability techniques are used in 6 companies.

Chung, Yeounoh, Neoklis Polyzotis, Kihyun Tae, and Steven Euijong Whang. "Automated data slicing for model validation: A big data-AI integration approach." IEEE Transactions on Knowledge and Data Engineering (2019).

Approach to automatically identify subsets of the data for which the model accuracy is lower than average/other subsets. This automatically slices the validation data based on different features (like decision tree learning). Not that different from approaches in the Test Data Curation section, but not guided by humans, hence this seems more useful for debugging than for (regression) testing. Also briefly mentions applications to different quality functions, including fairness and accuracy difference after model update. Paper is heavily focused on underlying technical approach.

Amershi, Saleema, Max Chickering, Steven M. Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. "Modeltracker: Redesigning performance analysis tools for machine learning." In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 337-346. 2015.

Description and evaluation of a visual debugging tool to explore wrong predictions and possible causes. Supports use cases for detecting mislabled data, missing features, and outliers. Integrated into the larger ML pipeline for labeling and data management. Also nice overview of other visualization techniques in related work.

Zhang, Xuezhou, Xiaojin Zhu, and Stephen Wright. "Training Set Debugging Using Trusted Items." In Thirty-Second AAAI Conference on Artificial Intelligence. 2018.

Neat idea on debugging training data: Use a trusted subset of the data to see whether other untrusted training data is responsible for wrong predictions in this subset, or what’s the smallest change to the training data to make the model work on the trusted data.

Data Quality and Data Management

Schelter, Sebastian, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. 2018. “Automating Large-Scale Data Quality Verification.” Proceedings of the VLDB Endowment International Conference on Very Large Data Bases 11 (12): 1781–94.

Good paper discussing data validation, including schema validation and checking distributions. The key innovation in the paper is doing this at scale (at Amazon), whereas the specification mechanisms seem rather straightforward, but generally well written and teachable.

Polyzotis, Neoklis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. "Data validation for machine learning." Proceedings of Machine Learning and Systems 1 (2019): 334-347.

Similar in spirit to the Amazon paper above, describes data validation efforts at Google, mostly at a fairly high level. Mostly relates to schema enforcement and detecting drift between feature value distributions between batches of data. Also describes some efforts at suggesting schema inference (effectively specification/invariant mining) to help with adoption and evolution. Also suggests some simple fuzzing to detect crashing bugs in ML components for generated inputs that fit the schema. Can be read as support for schemas as interfaces between teams and components in ML-based systems.

Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. 2017. “Data Management Challenges in Production Machine Learning.” In Proceedings of the 2017 ACM International Conference on Management of Data, 1723–26. ACM.

Short tutorial notes, not very deep, but providing a decent overview of past work in the database community

Hynes, Nick, D. Sculley, and Michael Terry. "The data linter: Lightweight, automated sanity checking for ml data sets." In NIPS MLSys Workshop. 2017.

Neat idea to develop a linter that looks for common "code smells" in dataset and internal data structures (e.g., when integers are stored as strings). This is an early but great example of how some classic SE ideas (static analysis tools, dynamic invariant detection, code smells) can be translated to solve new problems in an ML context. It is not quite obvious how far one can take this idea, but it is a great starting point for discussions and a nice illustration of potential SE4ML work.

Ratner, Alexander, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. "Snorkel: Rapid training data creation with weak supervision." Proceedings of the VLDB Endowment, 11(3), 269-282, 2017.

Description of Snorkel, an approach to semi-automatically label training data, where humans provide partial training functions (e.g., labeling some instances based simply on some keywords). The system will learn automatically which of these partial labels to trust and will produce training labels at scale. This is an interesting strategy to gather training data that may work in a number of domains and has been used by a number of companies -- worth exploring. Many further discussions can be based on this, e.g., how to combine manual labels with different confidence, how to combine manual and automatic labels, which labels to trust, how much to involve crowdworkers in labeling and on which data points, etc.

Barrak, A., Eghan, E.E. and Adams, B. 2021. "On the Co-evolution of ML Pipelines and Source Code-Empirical Study of DVC Projects". In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (pp. 422-433). IEEE.

DVC ( is an open source tool for versioning data and ML workflows. Paper conducts empirical analysis of 391 open source projects which use DVC. The 25 most active projects are further analysed to understand prevalence of ML versioning tools and the challenges associated with them. The results indicate that ML versioning is a relatively young practise with data versioning being the most utilised feature. A high coupling between source code and DVC files is noted while coupling between test files and DVC files is negligible.

Sambasivan, Nithya, Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh, and Lora M. Aroyo. "“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI". In proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-15. 2021.

Emphasis is usually put on the model creation process in academia and education. Models are evaluated using toy datasets and are optimised to perform well on them. However, in practise the focus often shifts to the data since most problems require the data to be collected, curated and annotated from scratch. The authors conducted a survey with 53 AI practitioners from India, USA and East and West African countries. The authors introduce the notion of "data cascades" which are defined as compounding, negative events caused due to data issues that result in technical debt over time. The results indicate that data cascades are highly relevant amongst AI practitioners, especially in "high stakes" AI, promoting a shift from model to data goodness.

Requirements Engineering

I have the impression that many practical problems in building AI-enabled systems are really requirements engineering problems. For example: There are many different notions of fairness, but which one is the right one for the project? How does the model interact with the environment and what safeguards should be installed to detect feedback loops or assure safety in those interactions? Unfortunately, the literature explicitly on requirements engineering for AI-enabled systems seems rather sparse – but there are many papers that are implicitly mostly about requirements engineering.

Vogelsang, Andreas, and Markus Borg. "Requirements Engineering for Machine Learning: Perspectives from Data Scientists." In Proc. of the 6th International Workshop on Artificial Intelligence for Requirements Engineering (AIRE), 2019.

Strong argument for the importance of requirements engineering when building ML-enabled systems. Covering many qualities and what kind of requirements should be solicited, including requirements about data quality and quanitity, provenance, monitoring, and protected classes and attributes. Also makes a good case for the role of a requirements engineer to identify a suitable measure of accuracy to mediate between users and data scientists. Based on 4 interviews only and sometimes reads more like a well argued opinion paper.

Rahimi, Mona, Jin LC Guo, Sahar Kokaly, and Marsha Chechik. "Toward Requirements Specification for Machine-Learned Components." In 2019 IEEE 27th International Requirements Engineering Conference Workshops (REW), pp. 241-244. IEEE, 2019.

Idea paper that outlines a path of how requirements engineering can be useful in better understanding domain and context of a problem and how this helps in better curating a high-quality dataset for training and also how to drive validation of the model. Uses a running vision example of pedestriant detection.

Kulynych, Bogdan, Rebekah Overdorf, Carmela Troncoso, and Seda Gürses. "POTs: protective optimization technologies." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 177-188. 2020.

This is the only ML paper I have seen that explicitly discusses requirements engineering with Jackson's world vs machine framing. It nicely frames the fairness problem as a system problem and explicitly discusses the interface between the environment and the implementation's specification and how assuring fairness only at the model level is insufficient. The rest of the paper, then essentially suggests adversarial hacking through changes in the environment to drive the model to make fairer or globally better decisions -- that is, distributing the problem, affected people can increase the cost of certain outcomes thus making an ML system recommend other outcomes that are more benefitial to them. Interesting position paper, even if just for the framing in Sec 2.

Subbaswamy, Adarsh, Peter Schulam, and Suchi Saria. "Preventing failures due to dataset shift: Learning predictive models that transport." In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3118-3127. PMLR, 2019.

Interesting idea that: If we know how data is generated or what dependencies underly the system and the input features, and more importantly know which of those processes/relationships/features are likely to change (coming from domain knowledge elicited with requirements engineering I assume), we can integrate this information in the modeling process and make the model more robust with regard to these kind of changes. That is we do not learn on certain relationships and thus don’t need to update the model for data/concept drift as often. Unclear to me how practical and the evaluation seems fairly artificial, but I like the direction.

Wiens, Jenna, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung et al. "Do no harm: a roadmap for responsible machine learning for health care." Nature medicine 25, no. 9 (2019): 1337-1340.

Position paper with a fairly high-level overview of what it takes to build and deploy machine learning products in health care settings. Lot’s of emphasis on what are essentially all requirements engineering challenges of understanding the domain, involving stakeholders, integrating the product into practice, etc etc.


Much of the interesting fairness discussions are really about requirements engineering in my book. Here are a couple of papers that I liked.

Corbett-Davies, Sam, and Sharad Goel. "The measure and mismeasure of fairness: A critical review of fair machine learning." arXiv preprint arXiv:1808.00023 (2018).

A comprehensive an critical discussion of different fairness measures often discussed in the ML community. It's kind of depressing how many different notions and problems there are and how few solutions. Even though they don't discuss requirements engineering in the paper, I put it here because it shows the complicated mess that fairness makes when trying to identify the right fairness requirements for an ML-enabled system. A requirements engineer must work with stakeholders to identify suitable requirements and make a lot of decisions, which won't be easy but will be necessary. This paper will probably give the requirements engineer the right vocabulary to talk to ML experts who need to assure the requirement and will provide enough pointers to identify what issues to probe for when talking to stakeholders.

Holstein, Kenneth, Jennifer Wortman Vaughan, Hal Daumé III, Miro Dudik, and Hanna Wallach. "Improving fairness in machine learning systems: What do industry practitioners need?" In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-16. 2019.

An interview study with engineers that consider fairness in their ML-enabled products. This one looks at the system/software-engineering level, beyond narrow fairness properties at the model level and discusses problems and strategies broadly, including fairness considerations across all lifecycle stages, possible checklists, best practices, holistic auditing, possibly even resulting in changes in system design. It does not mention requirements, but many issues are deeply rooted in requirements engineering and system design work.

Madaio, Michael A., Luke Stark, Jennifer Wortman Vaughan, and Hanna Wallach. "Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI." In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2020.

Another great paper that looks at requirements engineering for fairness in production ML systems in practice. It looks at challenges and suggest a checklist. It argues that it's really all about the process integration and about taking a system level view (rather than focusing on model properties). Includes the actual checklist.

Bietti, Elettra. "From ethics washing to ethics bashing: a view on tech ethics from within moral philosophy." In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 210-219. 2020.

Arguing that most ethics discussions around AI and tech are too narrow as a tool for a purpose without allowing broader investigation involving all stakeholders, thus leading to a frustration with corporate ethics initiatives (ethics washing, ethics bashing). The paper then outlines the real role that moral philosophy can plan, which to me reads a lot like requirements engineering at the societal level: Stepping back, viewing all perspectives, trying to balance all views. May not be immediately useful, but I found the discussion interesting.

Binns, Reuben. "Fairness in machine learning: Lessons from political philosophy." In Conference on Fairness, Accountability and Transparency, pp. 149-159. PMLR, 2018.

Interesting overview of the philosophical arguments around discrimination and fairness. Useful introduction to the topic.

Yeom, Samuel, and Michael Carl Tschantz. "Avoiding Disparity Amplification under Different Worldviews." In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 273-283. 2021.

Another paper that points out how different views and goals of fairness lead to very different approaches. The key point is that bias can be difficult to assess if your observations may be biased, but different assumptions about whether observations are biased lead to very different fairness approaches.

Rakova, Bogdana, Jingying Yang, Henriette Cramer, and Rumman Chowdhury. "Where responsible AI meets reality: Practitioner perspectives on enablers for shifting organizational practices." Proceedings of the ACM on Human-Computer Interaction 5, no. CSCW1 (2021): 1-23.

Another excellent interview study on fairness practice, showing current practices but also wants and aspirations. Strong focus on embedding practices into processes and establishing clear responsibilities; also discusses the role of activists in an organization. Based on 26 interviews in 19 organizations.

Software Architecture and Design

Lakshmanan, Valliappa, Sara Robinson, and Michael Munn. Machine learning design patterns. O'Reilly Media, 2020.

Interesting book collecting a number of design patterns for building and deploying ML models. Patterns are described in lots of details with examples and concrete code snippets. It is very model focused and example driven, but covers the whole machine learning pipeline including feature engineering, training, and serving. Especially the coverage of MLOps (stateless serving function, transform, feature store, ...) seem useful and cover many modern tools and practices.

Washizaki, Hironori, Hiromu Uchida, Foutse Khomh, and Yann-Gaël Guéhéneuc. "Machine Learning Architecture and Design Patterns." Paper draft, 2019

A useful collection of design patterns, architectural patterns, and anti-patterns derived from a literature survey of 19 papers and 19 grey literature articles. Not a detailed discussion and no detailed pattern description, but pointers to the sources, a list of patterns names, and an overview of their relations, that are likely useful for further exploration and discussions, see their web site. Many of the patterns seem to relate more to big-data processing than ML, but are likely useful when building ML-enabled systems.

Daniel Smith. "Exploring Development Patterns in Data Science." TheoryLane Blog Post. 2017.

Short blog post arguing to decompose ML pipeline from a single notebook into multiple services using cloud infrastructure. Bit polemic but well argued, short opinion piece.

Yokoyama, Haruki. "Machine learning system architectural pattern for improving operational stability." In 2019 IEEE International Conference on Software Architecture Companion (ICSA-C), pp. 267-274. IEEE, 2019.

One of the few papers actually discussing architectural design and tradeoffs of an ML-enabled system explicitly, discussing the integration of ML and non-ML parts. While I feel the design is maybe fairly standard, the paper has actually a nice concrete example and corresponding architectural models that seems very useful for teaching.

Hazelwood, Kim, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy et al. "Applied machine learning at facebook: A datacenter infrastructure perspective." In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620-629. IEEE, 2018.

Paper discussing the infrastructure for learning and serving ML models at Facebook (including discussions of hardware design). Discusses many quality attributes and constraints that are relevant in operation, including cost, latency, model-updated frequency needs, large amounts of data, and ability to recover from outages. For example, it includes an explicit discussion of the cost of outdated models. Much of this is useful input for discussing design constraints and architectural tradeoffs.

Peng, Zi, Jinqiu Yang, Tse-Hsun Chen, and Lei Ma. "A first look at the integration of machine learning models in complex autonomous driving systems: a case study on Apollo." In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1240-1250. 2020.

Interesting paper analyzing the source code and architecture of the Baidu’s Apollo self-driving car software. Finding 28 models that are often chained and combined with nontrivial glue code and non-ML components. This makes a strong case to go beyond testing individual models but to test also the interactions with non-ML code and the interactions among multiple models.

Serban, Alex, and Joost Visser. "An Empirical Study of Software Architecture for Machine Learning." arXiv preprint arXiv:2105.12422 (2021).

Interview study with 10 software architects working on ML-based projects, combined with literature analysis and small-scale survey. Lists several challenges and possible solutions, though scope seems to drift beyond software architecture, describing many well known challenges around model development and deployment.


Studer, Stefan, Thanh Binh Bui, Christian Drescher, Alexander Hanuschkin, Ludwig Winkler, Steven Peters, and Klaus-Robert Mueller. "Towards CRISP-ML (Q): A Machine Learning Process Model with Quality Assurance Methodology." arXiv preprint arXiv:2003.05155 (2020).

Fairly detailed overview of an ML-development process, starting early with business needs and covering all steps to deployment. Not necessarily new or grounded in new data, but a very nice and well-written overview. Good early reading to understand the full scope of SE4ML, beyond just building models.

Serban, Alex, Koen van der Blom, Holger Hoos, and Joost Visser. "Adoption and Effects of Software Engineering Best Practices in Machine Learning." In Proc. ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (2020).

Collection of best practices, mostly focused on the model pipeline not the larger system. In a larger survey, practitioners indicate which practices they follow, resulting in interesting ranks in Table 2. Results show that MLOps topics like versioning, monitoring and experiment management are fairly broadly adopted, whereas quality assurance practices and hyperparameter tuning are less adopted. Comes with a nice list of the practices on their web page.

Haakman, Mark, Luís Cruz, Hennie Huijgens, and Arie van Deursen. "AI Lifecycle Models Need To Be Revised. An Exploratory Study in Fintech." arXiv preprint arXiv:2010.02716 (2020).

Nice interview study with 17 participants at ING bank about their process in building and deploying ML models. Shows many concerns about provenance, regulation, and security driving decisions and identifying extra steps in common data science process models. Mostly focused on the model and its deployment (data science view), not a larger system it may be integrated in.

Martínez-Plumed, Fernando, Lidia Contreras-Ochando, Cesar Ferri, José Hernández Orallo, Meelis Kull, Nicolas Lachiche, Maréa José Ramírez Quintana, and Peter A. Flach. "CRISP-DM Twenty Years Later: From Data Mining Processes to Data Science Trajectories." IEEE Transactions on Knowledge and Data Engineering (2019).

Discussion how today's data science differs from past data mining (less prescriptive, more exploratory) and how hence CRISP-DM is too rigid a model to cover the diverse processes. The paper essentially proposes a megamodel with different process steps, which projects might pass through in different orders (called trajectories). Gives examples of different trajectories that may be appropriate in different projects. Limited evaluation, but interesting discussion. Focused on data science side, not building entire system though.

Safety & Robustness

There is a significant amount of machine learning research on safety and security, especially on narrow properties of robustness, but also some system-level discussions on safeguard mechanisms. A software engineering professional should probably understand the tools and techniques.

Borg, Markus, Cristofer Englund, Krzysztof Wnuk, Boris Duran, Christoffer Levandowski, Shenjian Gao, Yanwen Tan, Henrik Kaijser, Henrik Lönn, and Jonas Törnqvist. "Safely entering the deep: A review of verification and validation for machine learning and a challenge elicitation in the automotive industry." Journal of Automotive Software Engineering. Volume 1, Issue 1, Pages 1 - 19. 2019

Summary of the current thinking about safety of ML components in self-driving cars, based on literature review, a survey, and several industry workshops. Provides a decent overview from an engineering perspective, providing pointers to a long history of ML-related safety work in aerospace and automotive research that is likely a good starting point for further exploration. Mostly fairly high-level; among others, emphasizes the system nature of the problem and ideas with regard to design strategies for safety, such as safety cages and fail-safe systems.

Salay, Rick, Rodrigo Queiroz, and Krzysztof Czarnecki. "An analysis of ISO 26262: Using machine learning safely in automotive software." arXiv preprint arXiv:1709.02435 (2017).

Salay, Rick, and Krzysztof Czarnecki. "Using machine learning safely in automotive software: An assessment and adaption of software process requirements in ISO 26262." arXiv preprint arXiv:1808.01614 (2018).

The first provides a decent overview of safety thinking at the system level in the automotive industry and how that thinking aligns with current safety standards. While it does not go into specific safety techniques, it shows the overall mindset and the relevant concerns. The second, longer report goes more in into details and provides several concrete architectural patterns (esp. Sec 7).

Singh, Gagandeep, Timon Gehr, Markus Püschel, and Martin Vechev. "An abstract domain for certifying neural networks." Proceedings of the ACM on Programming Languages 3, no. POPL (2019): 1-30.

(Discussed also above under model invariants) This is an example of a large group of paper that determines whether all pertubations of an input within a certain bound all produce the same prediction. The search space by these pertubations and their combinations is typically very large and formal methods are used to produce conservative results (i.e., when it reports robustness against these pertubations it actually guaranteed to always return the same prediction, but there can be false positives). This is one of the more readable papers in this area that I have seen, but I'm not aware of any paper that really discusses how to use these robustness guarantees in production (typical scenarios seem to be to test robustness for a training set, for a set of important test cases, or to test it at runtime to make sure that a specific result is not just caused by a pertubation) and whether the considered pertubations are relevant or interesing for practical cases.

Cohen, Jeremy M., Elan Rosenfeld, and J. Zico Kolter. "Certified adversarial robustness via randomized smoothing." In Proc. International Conference on Machine Learning, p. 1310--1320, 2019.

Discusses randomized smoothing, an interesting approach to evaluating robustness: Sample predictions not only about a specific input but also about perturbed versions of that input around that input and report the most common prediction as the result. As a consequence, the prediction will be less sensitive to minor changes in the input. Black box technique that works for any kind of model. If sampled frequently enough, as discussed in the paper, also probabilistic bounds can be provided, although at significant computational costs (they use 100K model inferences per input).

Shneiderman, Ben. "Bridging the gap between ethics and practice: Guidelines for reliable, safe, and trustworthy Human-Centered AI systems." ACM Transactions on Interactive Intelligent Systems (TiiS) 10, no. 4 (2020): 1-31.

Essay-style paper on the many layers of considerations necessary for building safe software systems, with a nice framing of software engineering practices framed by culture and regulation (see Fig 1!). The specific discussions seem somewhat shallow and ad-hoc, but there are several interesting pointers throughout.

Ishikawa, Fuyuki, and Yutaka Matsuno. "Continuous argument engineering: Tackling uncertainty in machine learning based systems." In International Conference on Computer Safety, Reliability, and Security, pp. 14-21. Springer, Cham, 2018.

Well argued short position paper on how to integrate uncertainty and monitoring in safety cases and goal modeling notations. Essentially acknowledges that not all problems can be foreseen, but instead provides means to express uncertainty and a mandate to monitor the system in production and add more concerns and evidence when new problem categories are surfaced.

Security & Privacy

Huang, Ling, Anthony D. Joseph, Blaine Nelson, Benjamin IP Rubinstein, and J. Doug Tygar. "Adversarial machine learning." In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pp. 43-58. 2011.

Broad overview of security and privacy concerns (with examples of SVNs, predates DNN). Illustrates concerns quite well with two running examples of spam detection and network traffic analysis. Focuses on the model only.

Liu, Qiang, Pan Li, Wentao Zhao, Wei Cai, Shui Yu, and Victor CM Leung. "A survey on security threats and defensive techniques of machine learning: A data driven view." IEEE access 6 (2018): 12103-12117.

Fairly accessible survey on security threats and defense techniques, providing a decent overview. Focus on the model itself.

McGraw, Gary, Harold Figueroa, Victor Shepardson, and Richie Bonett. "An architectural risk analysis of machine learning systems: Toward more secure machine learning." Technical report, Berryville Institute of Machine Learning, v 1.0 (2020).

Fairly systematic and comprehensive list of possible security problems and corresponding controls and high-level strategies (not explicitly grounded or evaluated). Contains a useful architectural breakdown of the ML parts of a system, especially with regards to data. Could very well be used as a checklist during threat modeling.

Reproducibility, Provenance

Halevy, Alon, Flip Korn, Natalya F. Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. "Goods: Organizing Google's datasets." In Proceedings of the 2016 International Conference on Management of Data, pp. 795-806. 2016.

This is a neat paper describing how data dependencies can be extracted automatically from log files at Google. While the solution probably does not generalize to many smaller organizations, the problem description is well done and the solution is interesting.

Cummaudo, Alex, Scott Barnett, Rajesh Vasa, John Grundy, and Mohamed Abdelrazek. "Beware the evolving 'intelligent' web service! An integration architecture tactic to guard AI-first components." arXiv preprint arXiv:2005.13186 (2020).

Paper discussing the problem of (silent) API evolution of webservices that provide ML predictions, such as image classification, which trigger problems in client applications: Models of those services are updated, both internally and with regard to the domain of possible outputs without any public versioning or communication. The paper focuses on a mechanism to detect this when one has no control over the service, but I think the larger picture may be even more important for thinking about interface design: How to version (public or internal) prediction services and how to communicate such changes.

Computational Notebooks

There is quite a bit of work on computational notebooks recently, mostly focused on the exploratory work that data scientists perform. Some of this is focused on poor software engineering practices, e.g., with regard to testing, modularity, reuse, versioning, dependency management, and determinism.

Pimentel, Joao Felipe, Leonardo Murta, Vanessa Braganholo, and Juliana Freire. "A large-scale study about quality and reproducibility of Jupyter notebooks." In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 507-517. IEEE, 2019.

Interesting mining repository exploration of Jupyter notebooks on GitHub, finding among others that many cannot be reproduced

Psallidas, Fotis, Yiwen Zhu, Bojan Karlas, Matteo Interlandi, Avrilia Floratou, Konstantinos Karanasos, Wentao Wu et al. "Data Science through the looking glass and what we found there." arXiv preprint arXiv:1912.09536 (2019).

Similar study of notebooks, but at much larger scales, including 6 million notebooks from GitHub and 2 million from Microsoft. Lots of statistics on usage and trends, sometimes a bit shallow but at a massive scale: for example, that most notebooks use only few libraries, that there are huge numbers of notebook authors, and that static analysis seems feasible.

Kery, Mary Beth, Marissa Radensky, Mahima Arya, Bonnie E. John, and Brad A. Myers. "The story in the notebook: Exploratory data science using a literate programming tool." In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1-11. 2018.

Interview study with Jupyter users to understand their practices and challenges, highlighting especially the exploratory nature and the challenges with regard to cleaning up work and versioning. Subsequent work by the same authors suggested versioning tooling.

Chattopadhyay, Souti, Ishita Prasad, Austin Z. Henley, Anita Sarma, and Titus Barik. "What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities." In Proceedings of the CHI Conference on Human Factors in Computing Systems, 2020.

An interesting and very useful study about pain points in using Notebooks, mostly through interviews and short-term field observations, followed up with a mid-sized survey. Lots of interesting insights about day-to-day problems, including large challenges due to poor tool support, scalability challenges, lack of refactoring, difficulty sharing and deploying work.

Kandel, Sean, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. "Enterprise data analysis and visualization: An interview study." IEEE Transactions on Visualization and Computer Graphics 18, no. 12 (2012): 2917-2926.

Earlier work on problems that data scientists have (here data scientists seem to work more on reporting, less on software teams). Not only covering notebooks, but similar results to the more recent studies: Tracking rationale in exploratory programming is a key challenge, no abstraction, little reuse, negative results get lost; data scientists have different profiles from developers but often work with them; data scientists rarely collaborate with other data scientists.

Head, Andrew, Fred Hohman, Titus Barik, Steven M. Drucker, and Robert DeLine. "Managing messes in computational notebooks." In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-12. 2019.

Nice project to provide tool support for Jupyter notebooks, creating dependency graphs and using static slicing to clean notebooks.

Liu, Jiali, Nadia Boukhelifa, and James R. Eagan. 2019. “Understanding the Role of Alternatives in Data Analysis Practices.” IEEE Transactions on Visualization and Computer Graphics, August.

Interview study to understand how data scientists explore variations when developing models. Useful context to understand how data scientists operate.

Matthew Seal, Kyle Kelley, and Michelle Ufford. "Part 2: Scheduling Notebooks at Netflix." Netflix Technology Blog. 2018

Personally, I find it rather surprising to use Notebooks for automation in production. Might be an interesting architectural and infrastructure decision worth discussing.

Human-AI Interaction

This area is interesting and has seen significant attention recently, but goes beyond my typical software engineering considerations. See the Human-AI Interaction class at CMU for more pointers.

Kocielnik, Rafal, Saleema Amershi, and Paul N. Bennett. "Will you accept an imperfect AI? Exploring designs for adjusting end-user expectations of AI systems." In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1-14. 2019.

Cool mturk-style study to explore different means of setting expectations for users that the ML predictions of the system might be occasionally wrong. This is useful to think at the system level about mitigations for incorrect predictions, here focusing on how to present them to users. Discusses three concrete UI designs for a realistic setting (scheduling assistant).

Stumpf, Simone, Adrian Bussone, and Dympna O’sullivan. "Explanations considered harmful? user interactions with machine learning systems." In Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (CHI). 2016.

Interesting short paper with a small experiment, illustrating that explanations of a model’s predictions can foster trust for the prediction up to the point that the users (here physicians for a medical diagnosis) trust the system over their own judgment and accept more wrong predictions. The sentiment is that “the model seems to know more than me” even if the prediction and explanation are wrong.

Amershi, Saleema, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh et al. "Guidelines for Human-AI Interaction." In Proceedings of the 2019 CHI conference on human factors in computing systems, pp. 1-13. 2019.

Curated collection of a number of high-level design guidelines extracted from (grey) literature survey and heavily workshoped and evaluated. Table 1 contains the main guidelines.

Kulesza, Todd, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. "Principles of explanatory debugging to personalize interactive machine learning." In Proceedings of the 20th international conference on intelligent user interfaces, pp. 126-137. 2015.

Nice example of a user interface with explanations (here Bayes model similar to spam filter) that helps users to build a mental model and adjust weights if needed to fix wrong predictions. Very explicit about design goals and decisions; lots of pointers to related work on mental models and debugging for end users.

Cai, Carrie J., Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. "’Hello AI’: Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1-24.

Interesting overall discussion about what information users (in this case medical practitioners) need to trust and adopt an ML model’s predictions, based on an study with pathologists. For example, need to convey a mental model of how the AI works, need to explain what data was used to train the model and what features it uses (visible and not visible in UI) to gain trust. Lots of design and data decisions here.

Jacovi, Alon, Ana Marasović, Tim Miller, and Yoav Goldberg. "Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in AI." Proc. FAccT (2021).

Interesting conceptual discussion trying to define trust and trustworthiness, including an important recognition that risk is necessary for trust. They distinguish intrinsic trust which requires that a user’s prior believe align with how the model reasons (here explanations and capability tests are useful) and extrinsic trust from a positive track record in production or a rigorous evaluation as the two key directions to increase trust. They specifically talk about how adhering to partial contracts may be sufficient to establish trust, but remain somewhat vague on the concept beyond some vague references to subpopulations and capabilities.

Springer, Aaron, Victoria Hollis, and Steve Whittaker. "Dice in the black box: User experiences with an inscrutable algorithm." AAAI Spring Symp. (2017).

Interesting small experiment: designing a ML-based system to rate text that is actually just random. Users will go to great length to interpret meaning into the system’s actions and seek confirmation bias for their own mental model, even if the system does not have any intelligence.

Interdisciplinary teams

Kim, Miryung, Thomas Zimmermann, Robert DeLine, and Andrew Begel. "Data scientists in software teams: State of the art and challenges." IEEE Transactions on Software Engineering 44, no. 11 (2017): 1024-1038.

Discusses the role of data scientists and their challenges through interviews and surveys. See also the subsequent Amershi paper for a more ML-specific view.

Ryan Orban. "Bridging the Gap Between Data Science & Engineer: Building High-Performance Teams." Presentation 2016

Short presentation about the different roles of data scientists and software engineers and how to build interdisciplinary teams. Matches very closely my view of software engineering for AI-enabled systems, in that we still need data scientists and software engineers (and operators, and ...) as separate disciplines, but that we need a Devops-like integration around a common understanding and a joint mission.

Yang, Qian, Jina Suh, Nan-Chen Chen, and Gonzalo Ramos. "Grounding interactive machine learning tool design in how non-experts actually build models." In Proceedings of the 2018 Designing Interactive Systems Conference, pp. 573-584. 2018.

Interesting view on how people without data science training (mostly software engineers) build machine learning models. In short, they often don’t check for generalization, don’t analyze their data or features much, and strongly prefer to write code. Points out communication and education gaps. Nice contrast between data scientists and software engineers, here focused on the data science tasks.

Wang, Dakuo, Justin D. Weisz, Michael Muller, Parikshit Ram, Werner Geyer, Casey Dugan, Yla Tausczik, Horst Samulowitz, and Alexander Gray. "Human-AI Collaboration in Data Science: Exploring Data Scientists' Perceptions of Automated AI." Proceedings of the ACM on Human-Computer Interaction 3, no. CSCW (2019): 1-24.

IBM-internal interview study of how data scientists and software engineers think of AutoML and similar tools. Anticipating that AutoML is here to stay and that it will lead to augmentation rather than replacement of data scientists and will provide teaching opportunities.

Lvov, Ilia. "Project Management in Social Data Science: integrating lessons from research practice and software engineering." PhD diss., University of St Andrews, 2019.

In-depth case studies of interdisciplinary data science projects (from a social science angle), discussing many observed challenges related to collaboration, process, and various tradeoffs. Lots of interesting observations. Afterward, an SE-inspired process model is suggested (and evaluated!) with checklists for all stages to foster planning and measuring progress. It seems that these checklists are useful to bring different stakeholders together and give them a joint planning document (boundary object).


Various recommended and useful resources that do not fit well into the other categories.

Hopkins, Aspen, and Serena Booth. "Machine Learning Practices Outside Big Tech: How Resource Constraints Challenge Responsible Development." In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (AIES ’21) (2021).

Very interesting interview studies with 17 participants from organizations building ML products outside of Big Tech companies, showing a wide range of challenges. The initial focus is on model fairness, but it shows how resource constraints influence many design decisions and tradeoffs and create many pain points in teams. Focused mostly on model development, not the larger system.

Martínez-Fernández, Silverio, Justus Bogner, Xavier Franch, Marc Oriol, Julien Siebert, Adam Trendowicz, Anna Maria Vollmer, and Stefan Wagner. "Software Engineering for AI-Based Systems: A Survey." arXiv preprint arXiv:2105.01984 (2021).

Incredibly ambitious and broad survey, covering all of software engineering for systems with AI components (not just model building and deployment). The scope is so wide that it won’t cover all papers in this area, often just selecting a few papers for a field with lots of research and tools (and they exclude grey literature), but it provides a very good overview and puts a lot of effort in organizing the literature.

Ameisen, Emmanuel. "Building Machine Learning Powered Applications: Going From Idea to Product." O'Reilly Media, 2020.

Nice book that takes a wider view of building ML applications than most, starting with discussing product goals (not just ML model goals) and going up to deployment and monitoring. It mostly stays fairly close to the model with more detailed content on the ML pipeline (including guidance for quality assurance and debugging), whereas the coverage of requirements engineering and deployment is more high-level and shallow. Overall a nice book with a healthy engineering attitude and a view that exceeds beyond the ML model.

Zinkevich, Martin. 2017 “Rules of Machine Learning: Best Practices for ML Engineering.”

Blog post. List of advice about engineering ML-enabled systems based on experience at Google. Many good pointers.

Akkiraju, Rama, Vibha Sinha, Anbang Xu, Jalal Mahmud, Pritam Gundecha, Zhe Liu, Xiaotong Liu, and John Schumacher. "Characterizing Machine Learning Processes: A Maturity Framework." In International Conference on Business Process Management, pp. 17-31. Springer, Cham, 2020.

Misleading title: Paper contains a list of advice for building AI-enabled systems based on experience at IBM. Not visibly grounded or validated, but some interesting parts, especially about pain points and collaboration points. Not actually discussing a maturity model (only a short draft excerpt shown at the very end).

Horneman, Angela, Andrew Mellinger, and Ipek Ozkaya. AI Engineering: 11 Foundational Practices. Carnegie Mellon University Pittsburgh United States, 2020.

Very short technical report from the SEI, with a list of all-plausible recommendations (somewhat similar to the Zinkevich post above).

Sculley, D., Matthew Eric Otey, Michael Pohl, Bridget Spitznagel, John Hainsworth, and Yunkai Zhou. 2011. “Detecting Adversarial Advertisements in the Wild.” Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’11.

Nice case study paper from a team at Google that explicitly discusses the various challenges of building an ML-based system that go beyond just the modeling part, including issues like automatic calibration, composing many models, and teamwork in interdisciplinary teams. Include an architecture diagram. Despite some vagueness, we use it in class as a case study to show the importance of software engineering in ML projects.

Sendak, Mark P., William Ratliff, Dina Sarro, Elizabeth Alderton, Joseph Futoma, Michael Gao, Marshall Nichols et al. "Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study." JMIR medical informatics 8, no. 7 (2020): e15182.

Another case study I like and which use in class to discuss the system nature of machine-learning project. The paper mostly focuses on organizational and requirements aspects of building and deploying an ML-enabled system, with relatively little details on the actual model training.

Géron, Aurélien. "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems". 2nd Edition, O'Reilly Media, 2019.

One of many books explaining how various machine-learning techniques (including a quite extensive treatment of deep learning), and also shows pragmatically how to use the corresponding techniques with various libraries. Spends about half a page talking very superficially about deployment and evaluation in production (Section "Launch, Monitor, and Maintain Your System"), but is otherwise squarely in the data science camp. Overall accessible and quite detailed introduction that invites to immediate play with some datasets and build some models.

Polyzotis, Neoklis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich. "Data lifecycle challenges in production machine learning: a survey." ACM SIGMOD Record 47, no. 2 (2018): 17-28.

Essay/position paper with many pointers from the Tensorflow team. Key message is that ML can learn from data management community. Interesting, but not very deep and the survey part seems ad-hoc.

Arpteg, Anders, Björn Brinne, Luka Crnkovic-Friis, and Jan Bosch. "Software engineering challenges of deep learning." In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 50-59. IEEE, 2018.

Discussion of SE challenges in deep learning projects. Useful list of challenges, apparently grounded in interviews with engineers in 7 projects. The discussion remains rather abstract with little explicit grounding in the interview data.

Hermann, Jeremy, Mike Del Balso, Rene Schmidt, and Jakob Holdgaard Thomsen. 2017. “Meet Michelangelo: Uber’s Machine Learning Platform.” Uber Engineering Blog. September 5, 2017. .

Interesting blog post about the ML infrastructure at Uber and the challenges of building production systems.

Patel, Kayur, James Fogarty, James A. Landay, and Beverly Harrison. "Investigating statistical machine learning as a tool for software development." In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 667-676. 2008.

Early paper on how data science practices are different from software engineering practices (pre notebook days), finding how iterative and exploratory programming are pervasive. Emphasizes the difficulty of understanding features, evaluating quality, and a lack of tooling for iterative/exploratory programming. Fig 3 is pretty cool, showing how participants in their experiments incrementally improved accuracy of their models.

O'Leary, Katie, and Makoto Uchida. "Common problems with Creating Machine Learning Pipelines from Existing Code." Proc. Third Conference on Machine Learning and Systems (MLSys) (2020).

Short paper describing experience from Google's customer workshops with where developers struggle building ML systems. Key results: Need a mindset that focuses on pipelines not models and need to develop best practices and reusable patterns/fragments/abstractions that can be composed in pipelines.

Rudin, Cynthia. "Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead." Nature Machine Intelligence 1, no. 5 (2019): 206-215.

Interesting (potentially controversial) position paper emphasizing the difference between interpretability and explainability in machine learning. Argues that in many cases simple and interpretable models may perform similarly well to opaque models.

Kleppmann, Martin. “Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems.” O'Reilly Media, Inc., 2017.

Excellent book about distributed and big data systems. Not directly related to machine learning, but covers fundamentals of data storage and data processing (batch, stream, ...) at scale. Excellent discussion of principles and tradeoffs.

Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645-3650. 2019.

Short paper illustrating very clearly the computation cost and environmental impact of training deep neural networks (esp. with hyperparameter optimization). Makes a strong argument that these costs should be considered when designing models or systems.

Baylor, Denis, Eric Breck, Heng-Tze Cheng, Noah Fiedel, Chuan Yu Foo, Zakaria Haque, Salem Haykal et al. "Tfx: A tensorflow-based production-scale machine learning platform." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1387-1395. 2017.

High-level description of the machine learning infrastructure for building pipelines and model deployment by Google. Discusses engineering concerns and integrated tools across the entire pipeline, including among others automation and tracking in data transformation, training, and deployment.

Wagstaff, Kiri. "Machine learning that matters." In Proceedings of the 29 th International Conference on Machine Learning, (2012).

Nice essay about how much of ML research focuses on improving ML algorithms and evaluating on benchmarks with simple metrics, rather than focusing on impact and deployments. The latter requires a much broader scope of thinking in terms of the end-users' goals and the larger system. Good motivation for studying SE4AI/production ML.

Humbatova, Nargiz, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, Andrea Stocco, and Paolo Tonella. "Taxonomy of real faults in deep learning systems." In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 1110-1121. 2020.

Extensive study of problems data scientists face when using deep learning frameworks, using interviews, issue analysis (of projects using these frameworks!), and stackoverflow questions. Problems mostly related to network architecture, hyperparameters, and data quality.

Jiang, Ray, Silvia Chiappa, Tor Lattimore, András György, and Pushmeet Kohli. "Degenerate feedback loops in recommender systems." In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pp. 383-390. 2019.

Very formal paper using simulation to study the effects of feedback loops of recommender systems on filter bubbles and echo chambers. Helps understanding the problem and thinking of solutions as part of the system’s design.

Others with notes

Not necessarily papers and books I recommend, but they may be of interest for specific concerns.

Smith, Jeff. Machine Learning Systems: Designs that Scale. Manning Publications Co., 2018.

Book on building scalable machine learning systems. Focuses on elastic/reactive system design and suggests specific design decisions, such as working with immutable data. Many concrete implementation examples given in Scala. The book is very readable but opinionated and somewhat narrow and low level. It describes specific implementation patterns for design solutions it suggests, but it definitively reflects on tradeoffs and software engineering concerns.

Stoica, Ion, Dawn Song, Raluca Ada Popa, David Patterson, Michael W. Mahoney, Randy Katz, Anthony D. Joseph et al. "A Berkeley view of systems challenges for AI." arXiv:1712.05855 (2017).

High-level position paper, often focused on robotics.

Kanewala, Upulee, and James M. Bieman. 2014. “Testing Scientific Software: A Systematic Literature Review.” Information and Software Technology 56 (10): 1219–32.

Survey on testing techniques, covering also approaches for testing systems without oracles, some of which apply to ML.

Zhang, Mengshi, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. "DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems." In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pp. 132-142. 2018.

Practical application of metamorphic testing for generating test cases

Zhou, Hucheng, Jian-Guang Lou, Hongyu Zhang, Haibo Lin, Haoxiang Lin, and Tingting Qin. "An Empirical Study on Quality Issues of Production Big Data Platform." IEEE/ACM 37th IEEE International Conference on Software Engineering. 2015

Study of operational issues of large map reduce cluster. Mostly hardware failures not captured by redundancy mechanisms. No direct ML aspect, but useful for discussing large distributed ML jobs at scale and some discussion of debugging from logging data.

Chen, Zhenpeng, Yanbin Cao, Yuanqiang Liu, Haoyu Wang, Tao Xie, and Xuanzhe Liu. "Understanding Challenges in Deploying Deep Learning Based Software: An Empirical Study." Proc. FSE (2020).

Another StackOverflow study, this time focused on questions that developers ask regarding deploying deep learning models. Mostly API/documentation style questions close to the model rather than broader system deployment or MLOps questions and fairly shallow insights, but still potentially useful to see what developers struggle with current infrastructure at the API level.

Lwakatare, Lucy Ellen, Aiswarya Raj, Jan Bosch, Helena Holmström Olsson, and Ivica Crnkovic. "A taxonomy of software engineering challenges for machine learning systems: An empirical investigation." In International Conference on Agile Software Development, pp. 227-243. Springer, 2019.

Brief discussion of 7 case studies, mostly finding MLOps-style problems. The paper concludes with a potentially interesting maturity model (Sec 5), that, while not visibly grounded in data, may be a good starting point for discussions for framing different kinds of ML projects.

Wan, Zhiyuan, Xin Xia, David Lo, and Gail C. Murphy. "How does machine learning change software development practices?" IEEE Transactions on Software Engineering (2019).

Discussion of challenges in developing ML systems based on interviews and a survey. Unfortunately, the paper does not distinguish consistently between the ML model, the framework used to learn the model, and the the system with an ML component, nor does it clearly distinguish between roles of software engineers or data scientists, leading to rather murky and questionable claims.

Ishikawa, Fuyuki, and Nobukazu Yoshioka. "How do engineers perceive difficulties in engineering of machine-learning systems? Questionnaire survey." In 2019 IEEE/ACM Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP), pp. 2-9. IEEE, 2019.

Fairly large-scale survey about pain points in developing ML-based systems. Mostly results as expected, but nice plots summarizing the most common difficulties.

Bernardi, Lucas, Themistoklis Mavridis, and Pablo Estevez. "150 successful machine learning models: 6 lessons learned at" In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1743-1751. 2019.

Experience report of building various models in a production system, examples of ML-driven products (seems to start with "what could we predict" rather than with "what does this product need"). Good illustration with concrete examples how model accuracy may not align with business success measures of the product (sales) and some speculation why. Strong focus on A/B testing with business metrics rather than model accuracy.

Cambo, Scott Allen, and Darren Gergle. "User-centred evaluation for machine learning." In Human and Machine Learning, pp. 315-339. Springer, Cham, 2018.

Book chapter on how to think about the design and evaluation of an ML component in a fitness tracker. Good argument why model accuracy alone may not be a good metric and how one should probably start much earlier in requirements and design to consider how model accuracy interacts with other parts of the system and user experience.

Ishikawa, Fuyuki, and Yutaka Matsuno. "Evidence-driven Requirements Engineering for Uncertainty of Machine Learning-based Systems." In 2020 IEEE 28th International Requirements Engineering Conference (RE), pp. 346-351. IEEE, 2020.

Proposes a goal model extension for requirements engineering that explicitly considers alternative designs dependenting on how well a ML model performs (yet to be determined, deferring the decision during design). The paper has an interesting discussion about the challenge of prototyping in ML, where it may be very difficult to establish feasbility until very late in the project.

Lwakatare, Lucy Ellen, Aiswarya Raj, Ivica Crnkovic, Jan Bosch, and Helena Holmström Olsson. "Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions." Information and Software Technology 127 (2020): 106368.

Literature survey of mostly published industrial experience reports about building ML-based systems, analyzing reported challenges and recommendations from those papers. Very useful for pointers to a number of experience reports and some summary of common concerns, see Table 4. Identify the common concers around availability, scalability, privacy, and safety, whereas security and usability are rarely mentioned in their papers. The solution part seems maybe somewhat shallow or obvious (Table 6). Grey literature not included and intermixing experience reports with academic studies and position papers; some papers and corresponding problems are also seem a bit dated now with recent advances.

Bosch, Nathan, and Jan Bosch. "Software Logging for Machine Learning." arXiv preprint arXiv:2001.10794 (2020).

Good overview of problems with analyzing log files produced by software systems (including ML-enabled systems) and some collection of pointers to related work on log file parsing etc. The specific solution essentially requires some schema management and writing log files in a compact-machine readable format and more centralized planning.

Islam, Md Johirul, Hoan Anh Nguyen, Rangeet Pan, and Hridesh Rajan. "What do developers ask about ML libraries? A large-scale study using stack overflow." arXiv preprint arXiv:1906.11940 (2019).

Large scale of stackoverflow questions related to 10 machine learning frameworks. Largely quantitative reporting, for example, which stages are most covered in those questions.

de Souza Nascimento, Elizamary, Iftekhar Ahmed, Edson Oliveira, Márcio Piedade Palheta, Igor Steinmacher, and Tayana Conte. "Understanding Development Process of Machine Learning Systems: Challenges and Solutions." In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1-6. IEEE, 2019.

Small-scale interview study in 3 small Brazilian companies eliciting the ML process and three problems: identifying business goals, data definition, and undefined process. Suggest a checklist for each.

Haq, Fitash Ul, Donghwan Shin, Shiva Nejati, and Lionel Briand. "Can Offline Testing of Deep Neural Networks Replace Their Online Testing?" Empirical Software Engineering (2021).

Paper compares model accuracy with a safety property assessed in a simulator in the context of determining the steering angle in a self-driving car, showing that accuracy is not a meaningful predictor for assuring the safety property. Not considering much of the system beyond a single model, and not really a fair comparison, and results shouldn’t be surprising, but another good reminder that prediction accuracy evaluation evaluations alone are usually not meaningful to establish system quality.

Bosch, Jan, Helena Holmström Olsson, and Ivica Crnkovic. "Engineering AI systems: A research agenda." In Artificial Intelligence Paradigms for Smart Cyber-Physical Systems, pp. 1-19. IGI Global, 2021.

Position paper listing a number of open research challenges for AI engineering in several application domains. These observations are based on consulting and interviews with companies (small start-ups to large corporations) in various domains, though no clear traceability is provided. Main focus is on ML components rather than the whole system.

Other Lists

Miryung Kim has been teaching seminars at UCLA on debugging and data science/ML that is a great source for recommended readings as well, see and

Pooyan Jamschidi is teaching "Machine Learning Systems" a UofSC and recommended several readings to me when preparing for my course:

Larysa Visengeriyeva has assembled an amazing list of resources on MLOps and related topics (books, talks, blog posts):

Awesome Software Engineering for Machine Learning:

Annotated bibliography on ML security papers:

We use cookies. If you continue to browse the site, you agree to the use of cookies. For more information on our use of cookies please see our Privacy Policy.