Overview of Testing in MLOps

Saksham Gulati
8 min read · Jan 6, 2022


This article is meant to give the reader a thorough understanding of the underlying principles of MLOps testing and to underline its importance in the machine learning lifecycle.

MLOps Principles

MLOps Lifecycle [source]

As machine learning and AI propagate in software products and services, we need to establish best practices and tools to test, deploy, manage, and monitor ML models in real-world production. In short, with MLOps we strive to avoid “technical debt” in machine learning applications.

The level of automation of the Data, ML Model, and Code pipelines determines the maturity of the ML process. With increased maturity, the velocity of training new models also increases. The objective of an MLOps team is to automate the deployment of ML models into the core software system or as a service component. In other words, the goal is to automate the complete ML workflow without any manual intervention. Triggers for automated model training and deployment can be calendar events, messaging, monitoring events, as well as changes in data, model training code, and application code.

To adopt MLOps, we see three levels of automation, starting from the initial level with manual model training and deployment, up to running both ML and CI/CD pipelines automatically.

  1. Manual process- This is a typical data science process, which is performed at the beginning of implementing ML. This level has an experimental and iterative nature. Every step in each pipeline, such as data preparation and validation, model training and testing, is executed manually. The common way of working is to use Rapid Application Development (RAD) tools, such as Jupyter Notebooks.
  2. ML pipeline automation- The next level includes executing model training automatically. Here we introduce continuous training of the model: whenever new data is available, model retraining is triggered. This level of automation also includes data and model validation steps.
  3. CI/CD pipeline automation- In the final stage, we introduce a CI/CD system to perform fast and reliable ML model deployments in production. The core difference from the previous step is that we now automatically build, test, and deploy the Data, ML Model, and the ML training pipeline components.
    Automated testing helps discover problems quickly and at an early stage. This enables fast fixing of errors and learning from mistakes.

Testing

The complete development pipeline includes three essential components: the data pipeline, the ML model pipeline, and the application pipeline. In accordance with this separation, we distinguish three scopes for testing in ML systems: tests for features and data, tests for model development, and tests for ML infrastructure.

Features and Data Tests

• Data validation: Automatic checks of the data and feature schema/domain.
• Action: To build a schema (domain values), calculate statistics from the training data. This schema can then be used as an expectation definition or semantic role for input data during the training and serving stages (a minimal sketch follows this list).
• Feature importance tests to understand whether new features add predictive power.
• Action: Compute correlation coefficients on feature columns.
• Action: Train a model with only one or two features.
• Action: Use subsets of features ("one of k left out") and train a set of different models, each with one feature removed.
• Measure data dependencies, inference latency, and RAM usage for each new feature, and compare them with the predictive power of the newly added features.
• Drop unused/deprecated features from your infrastructure and document the removal.
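To make the schema-building action concrete, here is a minimal sketch in Python. The column handling, the pandas-based approach, and the usage at the bottom are illustrative assumptions, not a prescribed implementation; in practice, libraries such as TensorFlow Data Validation or Great Expectations offer richer schema inference and checks.

```python
import pandas as pd

def build_schema(train_df: pd.DataFrame) -> dict:
    """Derive a simple schema (dtype plus observed value range/domain) from training data."""
    schema = {}
    for col in train_df.columns:
        col_info = {"dtype": str(train_df[col].dtype)}
        if pd.api.types.is_numeric_dtype(train_df[col]):
            col_info["min"] = train_df[col].min()
            col_info["max"] = train_df[col].max()
        else:
            col_info["domain"] = set(train_df[col].dropna().unique())
        schema[col] = col_info
    return schema

def validate(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of schema violations found in new (serving) data."""
    violations = []
    for col, info in schema.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            continue
        if "domain" in info:
            unknown = set(df[col].dropna().unique()) - info["domain"]
            if unknown:
                violations.append(f"{col}: unseen values {unknown}")
        elif df[col].min() < info["min"] or df[col].max() > info["max"]:
            violations.append(f"{col}: values outside training range")
    return violations

# Hypothetical usage in a data test:
# schema = build_schema(train_df)
# assert not validate(serving_df, schema), "schema violations detected"
```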

Tests for Reliable Model Development

We need to provide specific testing support for detecting ML-specific errors.
• Testing ML training should include routines that verify that algorithms make decisions aligned with the business objective. This means that ML loss metrics (MSE, log-loss, etc.) should correlate with business impact metrics (revenue, user engagement, etc.).
• Action: The relationship between loss metrics and impact metrics can be measured in small-scale A/B tests using an intentionally degraded model.
• Model staleness test. A model is considered stale if it was not trained on up-to-date data and/or no longer satisfies the business impact requirements. Stale models can affect the quality of prediction in intelligent software.
• Action: Run A/B experiments with older models, covering a range of model ages, to produce an age vs. prediction quality curve and to understand how often the ML model should be retrained.
• Assessing the cost of more sophisticated ML models.
• Action: ML model performance should be compared to a simple baseline model (e.g., a linear model vs. a neural network); see the sketch after this list.
• Validating the performance of a model.
• It is recommended to separate the teams and procedures collecting the training and test data, to remove dependencies and avoid a flawed methodology propagating from the training set to the test set (source).
• Action: Use an additional test set, which is disjoint from the training and validation sets. Use this test set only for a final evaluation.
• Fairness/bias/inclusion testing of ML model performance.
• Action: Collect more data that includes potentially under-represented categories.
• Action: Examine whether input features correlate with protected user categories.
• Conventional unit testing for any feature-creation code, ML model specification code (training), and testing code.
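As a hedged illustration of the baseline-comparison action, the sketch below uses scikit-learn. The dataset, the two model choices, and the printed comparison are assumptions made only for this example, not a recommendation from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset and split; a real pipeline would use its own held-out test set.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Simple baseline vs. a more sophisticated model.
baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)
complex_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
).fit(X_train, y_train)

print(f"baseline accuracy:      {baseline.score(X_test, y_test):.3f}")
print(f"complex model accuracy: {complex_model.score(X_test, y_test):.3f}")

# A real test would assert that the extra complexity buys a meaningful improvement,
# e.g. `assert complex_acc >= baseline_acc + margin`, where the margin justifies
# the additional training and serving cost.
```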

ML Infrastructure Tests

• Training the ML models should be reproducible, which means that training the ML model on the same data should produce identical ML models.
• Diff-testing of ML models relies on deterministic training, which is hard to achieve due to non-convexity of the ML algorithms, random seed generation, or distributed ML model training.
• Action: Determine the non-deterministic parts in the model training code base and try to minimize non-determinism.
• Test ML API usage. Stress testing.
• Action: Unit tests that randomly generate input data and train the model for a single optimization step (e.g., gradient descent).
• Action: Crash tests for model training. The ML model should restore from a checkpoint after a mid-training crash.
• Test the algorithmic correctness.
• Action: A unit test that is not intended to complete the ML model training, but trains for a few iterations and verifies that the loss decreases (see the sketch after this list).
• Avoid: Diff-testing against previously built ML models, because such tests are hard to maintain.
• Integration testing: The full ML pipeline should be integration tested.
• Action: Create a fully automated test that regularly triggers the entire ML pipeline. The test should validate that the data and code successfully finish each stage of training and the resulting ML model performs as expected.
• All integration tests should be run before the ML model reaches the production environment.
• Validating the ML model before serving it.
• Action: Setting a threshold and testing for slow degradation in model quality over many versions on a validation set.
• Action: Setting a threshold and testing for sudden performance drops in a new version of the ML model.
• ML models are canaried before serving.
• Action: Testing that an ML model successfully loads into production serving and the prediction on real-life data is generated as expected.
• Testing that the model in the training environment gives the same score as the model in the serving environment.
• Action: Measure the difference between the performance on the holdout data and the "next-day" data. Some difference will always exist. Pay attention to large differences between holdout and "next-day" performance, because they may indicate that time-sensitive features are causing ML model degradation.
• Action: Avoid result differences between training and serving environments. Applying a model to an example in the training data and the same example at serving should result in the same prediction. A difference here indicates an engineering error.
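The "train for a few iterations and check that the loss decreases" action above can be written as a small pytest unit test. The tiny hand-rolled gradient-descent model, the synthetic data, and the iteration count below are illustrative assumptions; the same pattern applies to any training framework.

```python
import numpy as np

def mse_loss(w, X, y):
    """Mean squared error of a linear model with weights w."""
    return float(np.mean((X @ w - y) ** 2))

def train_step(w, X, y, lr=0.01):
    """One gradient-descent step on the MSE loss."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def test_loss_decreases_over_a_few_iterations():
    # Arrange: a small synthetic regression problem with a fixed seed.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.0, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    w = np.zeros(3)

    # Act: run a handful of optimization steps (not full training).
    losses = [mse_loss(w, X, y)]
    for _ in range(5):
        w = train_step(w, X, y)
        losses.append(mse_loss(w, X, y))

    # Assert: the loss goes down, a cheap check of algorithmic correctness.
    assert losses[-1] < losses[0]
```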

Types of tests
There are five major types of tests, which are utilized at different points in the development cycle:

  1. Unit tests: Tests on individual components that each have a single responsibility (ex. function that filters a list).
  2. Integration tests: Tests on the combined functionality of individual components (ex. data processing).
  3. System tests: Tests on the design of a system for expected outputs given inputs (ex. training, inference, etc.).
  4. Acceptance tests: Tests to verify that requirements have been met, usually referred to as User Acceptance Testing (UAT).
  5. Regression tests: Tests based on errors we’ve seen before, to ensure new changes don’t reintroduce them.

How should we test?

The framework to use when composing tests is the Arrange-Act-Assert methodology (a minimal example follows the list below).
• Arrange: set up the different inputs to test on.
• Act: apply the inputs on the component we want to test.
• Assert: confirm that we received the expected output.
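A minimal sketch of a test written in this style; the `filter_even` function is a hypothetical component introduced only for illustration:

```python
def filter_even(numbers):
    """Hypothetical component under test: keep only even numbers."""
    return [n for n in numbers if n % 2 == 0]

def test_filter_even():
    # Arrange: set up the input to test on.
    numbers = [1, 2, 3, 4, 5, 6]

    # Act: apply the input to the component we want to test.
    result = filter_even(numbers)

    # Assert: confirm that we received the expected output.
    assert result == [2, 4, 6]
```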

In Python, there are many tools, such as unittest, pytest, etc., that allow us to easily implement our tests while adhering to the Arrange Act Assert framework above. These tools come with powerful built-in functionality such as parametrization, filters, and more, to test many conditions at scale.

Pytest

We’re going to be using pytest as our testing framework for its powerful built-in features such as parametrization, fixtures, markers, etc.

Configuration

Pytest expects tests to be organized under a tests directory by default. However, we can use our pyproject.toml file to configure other test path directories as well. Inside those directories, pytest looks for Python scripts named test_*.py, but we can configure it to collect other file patterns too.
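As a hedged illustration, such a pyproject.toml configuration (supported by pytest 6.0 and later) might look like the following; the directory name and file pattern are assumptions:

```toml
[tool.pytest.ini_options]
testpaths = ["tests"]       # directories pytest should search for tests
python_files = "test_*.py"  # file pattern pytest should collect
```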

Assertions
Let’s see what a sample test and its results look like. Assume we have a simple function that determines whether a fruit is crisp or not (notice: single responsibility):
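As a sketch, such a function might look like this (the `is_crisp` name and the fruit lists are illustrative assumptions):

```python
def is_crisp(fruit: str = "") -> bool:
    """Return True if the given fruit is considered crisp."""
    crisp_fruits = {"apple", "watermelon", "cherries"}   # illustrative lists
    soft_fruits = {"orange", "mango", "strawberry"}
    fruit = fruit.lower()
    if fruit in crisp_fruits:
        return True
    if fruit in soft_fruits:
        return False
    raise ValueError(f"{fruit} not in known list of fruits.")
```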

To test this function, we can use assert statements to map inputs with expected outputs:
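A sketch of such a test for the hypothetical is_crisp function above, using plain assert statements and pytest.raises for the error case:

```python
import pytest  # assumes is_crisp is importable from the module under test

def test_is_crisp():
    assert is_crisp(fruit="apple")        # known crisp fruit
    assert is_crisp(fruit="Apple")        # case-insensitive handling
    assert not is_crisp(fruit="orange")   # known non-crisp fruit
    with pytest.raises(ValueError):       # unknown fruits raise an error
        is_crisp(fruit="durian")
```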

Execution
We can execute our tests above using several different levels of granularity:
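For example, assuming the test above lives in a hypothetical tests/test_fruits.py file, pytest can be invoked at increasingly narrow levels of granularity:

```bash
python -m pytest                                        # run all discovered tests
python -m pytest tests                                  # run tests under a directory
python -m pytest tests/test_fruits.py                   # run tests in a single file
python -m pytest tests/test_fruits.py::test_is_crisp    # run a single test function
```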

Please note that this article was collated from multiple sources. I would like to give a huge shoutout to -
