Machine learning is a popular choice for building complex systems and is often marketed as a quick-win solution. Unfortunately, building production-grade systems that integrate machine learning is quite complicated. What makes deploying an ML system hard can be broken down into the following reasons:

  • ML systems are prone to breaking numerous software engineering paradigms and are rightly called the ‘high-interest credit card of technical debt’ by Sculley.
  • ML systems require significant infrastructure and data engineering. Moreover, the infrastructure is leveraged by multiple teams.
  • ML systems also come with the inherent issues of data privacy, security and data loss prevention.
  • Writing test cases for ML systems is still an open engineering problem; while there is good advice around writing such ‘test cases’, they can be quite difficult to implement and are not comprehensive.

In this post, I’ll discuss some common anti-patterns and look into guidelines and frameworks that can help us steer clear of these problems. The intended audience of this post is data science managers and machine learning engineers alike. To be honest, you’ll probably need to compromise on some of these aspects depending on your team composition, resources and experience.

Production Grade ML

Let me qualify what I mean by production-grade machine learning systems. If you are building a machine learning system that requires scalability of its components, continuous training and monitoring, and has known downstream applications, then you’ve been tasked with building a production-grade ML system. From a data scientist’s perspective, this is quite different from the usual routine of wrangling through the data once, applying transformations and iterating over ML models with only one objective: maximizing a performance metric.

Broadly speaking, a production-grade machine learning system will comprise a subset of the following components:

  1. Data Engineering pipeline
  2. Data loss prevention pipeline
  3. Data validation system
  4. Machine learning model distributed training and hyperparameter tuning resources
  5. Connection to business applications (REST, gRPC, message queues)
  6. Throughput and latency requirements for the ML application
  7. Re-training of models and Continuous deployment
  8. ML canary testing
  9. Continuous monitoring, health checks and recovery
  10. Logging
  11. Auditing/Versioning
  12. Monitoring
  13. CI/CD
  14. Tracking metrics and concept drifts


Anti-patterns

The anti-patterns of ML systems can be divided into the following three broader categories:

  1. Development Anti-patterns
  2. Deployment Anti-patterns
  3. Operations Anti-patterns

Development Anti-patterns

The development anti-patterns can be further sub-divided into two main categories: project management anti-patterns and ML algorithm anti-patterns. This post only discusses anti-patterns from the perspective of project management and software engineering paradigms. I’ll write a separate post on ML algorithm anti-patterns, as covering them here would deviate from the goal of this post. For the curious, by ML algorithm anti-patterns I mean the common mistakes data scientists make while working on different ML models (sampling issues; assumptions about the data required for different clustering methods; hyperparameter tuning; splitting data; among others).

Depending on the Data Science Unicorn (Superhero)

Anyone with some solid industry experience will have seen this scenario: a heavy dependence on a single resource to take the system from prototype to production. The characteristics of this superhero (or, as the industry likes to call them, a data science unicorn) are a combination of:

  1. ML researcher
  2. Data Engineer
  3. Infrastructure and ops engineer
  4. Product Manager
  5. Site reliability engineer
  6. Cloud expert

Whereas it would be great to have one of these on every team, such a resource is called a unicorn for good reason. Moreover, depending on such a resource will cause problems in the long run. So if you are lucky enough to have one on your team, my suggestion would be to narrow down a primary role for her and build the rest of the team around her. This way you will automatically cross-train the team, and the project will hopefully not be binned if the unicorn ever flies away.


Ideally, the team should be divided into the following roles:

  1. Machine learning engineer: Building ML models
  2. Devops engineer: Infrastructure set up and operations
  3. Data Engineer: Setting up the data pipelines
  4. Data Science Manager: Leading the team
  5. Business Decision Makers: Executives+clients+ML engineers

Black box scenario

This anti-pattern can be considered the reverse of the previous scenario: the segregation of roles is so strong that, by the end of your machine learning pipeline, no one has complete knowledge of how the end-to-end system works. For example, in one of our projects, the researcher created a complicated ensemble of machine learning models as the prototype; the software engineer refactored and engineered the code without completely understanding the ML model; and the DevOps engineer containerized the application for deployment. By the end of this exercise, the ML researcher was unable to understand the code, and debugging and feature enhancement were a nightmare. So, while the segregation of roles is important, the team should work closely, not in silos, so that everyone understands all the engineering components.

Technical Debt and Machine learning Systems

This section summarizes the issues discussed in detail by Sculley et al. in their paper; readers are encouraged to consult it for further details.

Technical debt refers to the implied cost of additional rework caused by choosing an easy (limited) solution now instead of a better approach that would take longer. If technical debt is not repaid, it accumulates ‘interest’, making it harder to implement changes later on. Unaddressed technical debt increases software entropy. In software engineering, some of the common reasons for accruing technical debt are system complexity, entanglement, created dependencies, poor readability, lack of test cases and anti-patterns of software design. Most technical debt can be reduced by following these steps:

  • Freeze on additional features
  • Refactoring of infrastructure/architecture
  • Refactoring of Code
  • Reducing complexity
  • Writing tests
  • Enforcing abstractions and barriers
  • Increasing readability and documentation

Machine learning systems are especially prone to accruing technical debt because, in addition to the usual software engineering debt scenarios, there are numerous ML-specific issues.

Boundary erosion

Whereas software engineering places a strong focus on creating abstraction boundaries through encapsulation and modular design, it is difficult to enforce the same kind of boundaries in machine learning systems. Since machine learning models are essentially driven by data, the software logic cannot be effectively segregated from the external data. The rest of this section looks at some examples of how this erosion causes technical debt.

  • Entanglement: A machine learning package is essentially a data blender. For example, feature engineering often combines existing features to produce new ones. Similarly, changing the composition of one feature will change the weights on all the remaining features. Even if the feature engineering is kept the same, a change in the distribution of a particular feature can cause all the model weights to change. In addition to features, a change in hyperparameters can also affect all the weights. This is referred to as Changing Anything Changes Everything (CACE). The net result of such changes is that prediction behavior may alter, either subtly or dramatically. The issue of entanglement is in some sense innate to machine learning, regardless of the particular learning algorithm being used. In practice, this all too often means that shipping the first version of a machine learning system is easy, but making subsequent improvements is unexpectedly difficult. A minimal sketch after this list illustrates the effect.

  • Hidden feedback loops: Real-world machine learning systems are also prone to hidden feedback loops. An easy case to understand involves recommendation engines. For example, if an ML model provides recommendations on the first page of a website, there is a high probability that users click a product because of its placement on the page. If clicks on the recommended links are then used as training data to re-train the recommendation engine without accounting for the position of the item, a hidden feedback loop is created. It is recommended that the team discuss any such possible feedback loops and address them accordingly.
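
To make CACE concrete, here is a minimal sketch with scikit-learn on synthetic data: the model is retrained after an upstream change that rescales only one feature, yet the other weights typically move as well. The data, features and model choices are illustrative assumptions, not a recipe.

```python
# Minimal, synthetic illustration of CACE (Changing Anything Changes Everything).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5_000

# Three mildly correlated features; the label depends on all of them.
cov = [[1.0, 0.6, 0.3],
       [0.6, 1.0, 0.4],
       [0.3, 0.4, 1.0]]
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0], cov=cov, size=n)
y = (0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2]
     + rng.normal(scale=0.5, size=n)) > 0

model_v1 = LogisticRegression().fit(X, y)

# "Change anything": an upstream team rescales only feature 0; the labels and
# the other features are untouched.
X_shifted = X.copy()
X_shifted[:, 0] = 2.0 * X_shifted[:, 0] + 1.0
model_v2 = LogisticRegression().fit(X_shifted, y)

# With correlated features and regularization, the weights on features 1 and 2
# typically shift too, not just the weight on feature 0.
print("v1 coefficients:", model_v1.coef_.round(3))
print("v2 coefficients:", model_v2.coef_.round(3))
```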

Dependency Debt

In machine learning systems, data dependencies introduce debt similar to what code dependencies introduce in traditional software engineering.

  • Unstable data dependencies: In production systems, ML services are downstream applications that consume data from other sources, often owned by other engineering teams. The data-owning team might regularly roll out changes to the input signals, which are then consumed by the ML services without the model being changed. As we saw in the entanglement section, this can have dire consequences for the output of ML models. Another common scenario is slowly changing dimensions: the ML model may have been trained against a certain state of the input data, whereas the data goes through various changes (DML operations) over its lifecycle. If the engineering ownership for these pieces sits with different teams, this can cause serious issues in the expected metrics of the ML model.

  • Underutilized data dependencies: The most common underutilized data dependencies are caused by epsilon improvements. Machine learning researchers often focus so much on small improvements that they over-engineer the system. For example, an additional derived feature (that takes tens of signals as input) might provide an uplift of 1% accuracy. A common mitigation strategy for underutilized dependencies is to regularly evaluate the effect of removing individual features from a given model and act on this information whenever possible (see the sketch after this list). More broadly, it may be important to build cultural awareness about the long-term benefit of cleaning up underutilized dependencies.
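
As a sketch of the leave-one-feature-out evaluation mentioned above, the snippet below compares cross-validated accuracy with and without each feature. The dataset, model and feature names are synthetic placeholders; in practice you would run this against your own training pipeline.

```python
# Leave-one-feature-out check for underutilized data dependencies (synthetic example).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, n_features=8, n_informative=4,
                           random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
print(f"baseline accuracy: {baseline:.3f}")

for i, name in enumerate(feature_names):
    X_reduced = np.delete(X, i, axis=1)
    score = cross_val_score(RandomForestClassifier(random_state=0),
                            X_reduced, y, cv=5).mean()
    # A negligible drop suggests the feature (and the upstream pipeline that
    # produces it) may not be worth the dependency it creates.
    print(f"without {name}: {score:.3f} (delta {score - baseline:+.3f})")
```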

Deployment Anti-patterns

Deployment anti-patterns can mostly be avoided nowadays by working within an ML production framework. Most of these anti-patterns arise from glue code and a lack of test cases around data. If your team includes data scientists from a non-programming background (PhDs, statisticians, mathematicians, engineers, among others), I’d strongly recommend teaming them up with experienced software engineers. The watchpoints for machine learning in production are quite different from the ones required in research.

Lack of ML lifecycle management

Data scientists tend to develop general-purpose solutions as self-contained packages. Using self-contained solutions results in the glue code design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages. Let us walk through an ML pipeline and see if it sounds familiar. The data scientist launches a notebook, works from POC to production, and performs data preparation, transformation, training and validation all in the notebook to provide results and performance metrics to the executives. The data scientist thus solves the ML problem quickly, using an existing ML library (scikit-learn, caret, …), with the data wrangling done in non-scalable tools like Pandas and R data frames. By the end of the ML journey, you get a complicated series of jobs that must run in a special sequence. Common smells of such a scenario are many disparate jobs, multiple languages and multiple frameworks. Now, if the client or an executive asks the team to scale this solution, re-train it, set it up for canary testing or automate the pipeline, the technical debt of such a pipeline will be quite high. Unfortunately, I’ve seen this happen quite often while working with inexperienced data scientists. To correct this, my suggestion is for data science managers to ask the team to follow a framework, like Kubeflow, Kubeflow Pipelines, AWS SageMaker or GCP’s Cloud AI. Following such paradigms and frameworks will ensure that a lot of issues are automatically resolved.
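
Even before adopting a full framework, a lot of glue code disappears if the preprocessing and the model are packaged as a single pipeline artifact. The sketch below uses scikit-learn’s Pipeline and ColumnTransformer; the column names and estimator choices are illustrative assumptions.

```python
# One versionable artifact instead of a chain of notebook cells and ad-hoc scripts.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["age", "income"]          # hypothetical column names
categorical_features = ["country", "device"]  # hypothetical column names

preprocess = ColumnTransformer(transformers=[
    ("num", Pipeline([("impute", SimpleImputer()),
                      ("scale", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# The exact same transformations run at training and at inference time.
model = Pipeline([("preprocess", preprocess),
                  ("classifier", LogisticRegression(max_iter=1000))])

# model.fit(train_df, train_df["label"])  # then persist, e.g. with joblib
```

The fitted pipeline can then be persisted and wrapped as a single unit in a Kubeflow Pipelines component or a SageMaker endpoint.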

Data Validation

For traditional IT systems, behavior is defined through code and validated through unit tests. For ML systems, behavior is largely determined by data. Therefore, the data should be treated just like code, and it is imperative to write test cases for data validation. I’d suggest looking into TFX Data Validation, as it can help write quite a few such test cases for ML pipelines. Common checks that can run at serving time include detecting anomalies, inferring and testing the schema, and computing descriptive statistics. It uses Apache Beam to perform these computations, so the work can be pushed onto Cloud Dataflow, Spark or Flink as the runner.
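
A minimal sketch of what such validation might look like with tensorflow_data_validation is shown below. The file paths are placeholders, and the exact calls should be checked against the TFDV documentation for your version.

```python
# Validate serving-time data against a schema inferred from training data (TFDV sketch).
import pandas as pd
import tensorflow_data_validation as tfdv

train_df = pd.read_csv("train.csv")      # placeholder path
serving_df = pd.read_csv("serving.csv")  # placeholder path

# Descriptive statistics and an inferred schema from the training data.
train_stats = tfdv.generate_statistics_from_dataframe(train_df)
schema = tfdv.infer_schema(train_stats)

# Compare serving-time statistics against the training schema.
serving_stats = tfdv.generate_statistics_from_dataframe(serving_df)
anomalies = tfdv.validate_statistics(statistics=serving_stats, schema=schema)

# Missing columns, type mismatches and out-of-domain values show up here.
print(anomalies)
```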

Operations Anti-patterns

From an operations perspective, there are a few concepts to be mindful of. The first one is training-serving skew. As Martin Zinkevich puts it:

Training-serving skew is the difference between performance during training and performance during serving. This skew can be caused by:

  • A discrepancy between how you handle data in the training and serving pipelines.
  • A change in the data between when you train and when you serve.
  • A feedback loop between your model and your algorithm.

The best solution is to explicitly monitor the system and the data, and watch for any changes in the data statistics; as mentioned before, TFX Data Validation is a great tool for gauging these changes. Because of this skew, model accuracy can drop over time, so a thorough monitoring strategy should be deployed to understand the model’s freshness requirements and track its accuracy. My recommendations are to look into TFX Model Analysis to gauge the freshness requirements, and into a canary deployment environment, such as one configured with Seldon.io, to tackle these challenges.
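
If adopting the TFX tooling is not an option right away, even a simple hand-rolled check that compares serving-time feature statistics against training-time statistics will catch a lot of drift. The sketch below is one such heuristic; the threshold and the scheduling around it are assumptions to tune for your system.

```python
# Naive drift check: flag numeric features whose serving-time mean has moved
# far (in training standard deviations) from the training-time mean.
import numpy as np
import pandas as pd

def drift_report(train: pd.DataFrame, serving: pd.DataFrame,
                 threshold: float = 3.0) -> pd.DataFrame:
    rows = []
    for col in train.select_dtypes(include=np.number).columns:
        mu, sigma = train[col].mean(), train[col].std()
        shift = abs(serving[col].mean() - mu) / (sigma + 1e-9)
        rows.append({"feature": col, "shift_in_std": shift, "drifted": shift > threshold})
    return pd.DataFrame(rows)

# Run this on a schedule against a rolling window of serving logs and alert
# (or trigger re-training) when any feature is flagged.
```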

General guidelines to avoid Anti-patterns

What follows are some of my favorite rules, which I’d suggest readers adopt so that most of these anti-patterns can be avoided during development.

  1. Don’t be afraid of starting with heuristics
  2. First define objective and metrics
  3. Choose ML over complex heuristics
  4. Keep the first model simple and focus on infrastructure
  5. Test the infrastructure independently from ML
  6. Turn heuristics into features
  7. Be mindful of ML model life (freshness of model)
  8. Canary test before exporting model
  9. Use proxy objectives instead of direct ones
  10. Start with interpretable models
  11. Always keep in mind the policy layer filtering (pre and post)
  12. Plan to launch and iterate
  13. Make the first-layer features understandable for humans
  14. Measure the delta between models
  15. Utilitarian performance trumps predictive performance
  16. Be wary of long-term and short-term behavior; training-serving skew.
  17. Use the same code for serving and training
  18. If you train a model on data from June, test it on data from July.
  19. Avoid feedback loops with positional features (see the sketch below)
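
For rule 19 (and the hidden feedback loop discussed earlier), one common mitigation is to let the model learn the positional bias explicitly during training and then neutralize it at serving time. The sketch below assumes hypothetical column names, a placeholder log file and a simple logistic regression; it illustrates the idea rather than prescribing an implementation.

```python
# Train with the display position as a feature, then hold it constant at serving.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("click_logs.csv")  # placeholder path with a 'position' column

features = ["relevance_score", "price", "position"]  # 'position' soaks up placement bias
model = LogisticRegression(max_iter=1000).fit(train[features], train["clicked"])

def score_for_serving(candidates: pd.DataFrame) -> pd.Series:
    """Score candidates with the position held at a neutral constant (slot 1)."""
    fixed = candidates.assign(position=1)
    return pd.Series(model.predict_proba(fixed[features])[:, 1],
                     index=candidates.index)
```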