Introduction

DevOps practices over the past few years have brought incredible advances to organizations’ ability to speed up and secure application delivery. It’s now possible to release application software to production tens or even hundreds of times per day. This allows enterprises to react faster to changing market conditions, drive superior customer experiences, and achieve greater enterprise success.

Kubernetes platforms such as Red Hat OpenShift have played a central role in the growth and success of DevOps in many ways:

  • Such platforms plug into continuous integration and continuous delivery (CI/CD) engines such as Jenkins, Tekton, and others to automate and speed up the application delivery process.
  • As well as speeding up software delivery, CI/CD approaches enable best practices such as shifting left on security, as espoused by Forsgren, Humble, and Kim in Accelerate, their seminal book on DevOps. Shifting left on security moves security enforcement earlier in the application life cycle, such as into the CI/CD pipeline phase, which is more effective and less costly than applying manual security audits later. Beyond security, quality control can be built into the CI/CD pipeline through code analysis, automated testing, and other techniques.
  • Hybrid and multi-cloud Kubernetes distributions such as OpenShift extend this speed, quality, and security beyond a single infrastructure to any public or private cloud, to physical or virtual environments, and increasingly to the edge (see the Hybrid Cloud reference below).

So what does all of this DevOps and Kubernetes goodness have to do with Artificial Intelligence and Machine Learning (AI/ML)? As it turns out, a lot. Many of the solutions that Kubernetes-based DevOps brings to classic software delivery can also be applied to AI/ML model delivery. Machine learning operations (MLOps) refers to DevOps applied to AI/ML.

In fact, the value that MLOps brings to AI/ML can potentially be even greater than the value DevOps brings to software delivery, as AI/ML tends to involve more silos, more personas, and frequently a more iterative process than software delivery does.

In 2019, Gartner famously predicted that through 2020, 80% of AI projects would remain "alchemy, run by wizards whose talents will not scale in the organization." Gartner argues that one of the primary reasons AI projects fail to deliver their business potential is that AI skills don't scale, and asserts that combining new skills with AI-based automation will unlock that scale potential.

I subscribe to this viewpoint and assert that MLOps can go some way toward filling this automation void.

The Case for MLOps

Google, in its paper Hidden Technical Debt in Machine Learning Systems, highlights that the actual ML code is only a small element of an effective workflow. Success in delivering AI/ML requires a coordinated approach across a range of disciplines, as depicted in this diagram:

Those elements colored green can be automated using MLOps practices and tools.

This diagram depicts a typical AI/ML workflow from problem definition to production:

MLOps covers the final three phases of the AI/ML workflow:

  • Model validation
  • Model deployment
  • Monitoring and validation

Prior to the MLOps stages, the data scientist extracts important features, then trains and tunes the model, typically splitting the data into a training set and a testing set. Once a satisfactory accuracy level is reached, they check the model and its code into source control, triggering the MLOps flow.
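
As a minimal sketch of this pre-MLOps stage, the following Python example (using scikit-learn, with a hypothetical engineered-feature file) splits the data, trains a model, records its test accuracy, and serializes the artifact that would be checked into source control:

```python
# A minimal sketch of the data scientist's pre-MLOps workflow.
# The file name, feature table, and "label" column are illustrative.
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("features.csv")  # hypothetical engineered-feature table
X, y = df.drop(columns=["label"]), df["label"]

# Hold out 20% of the data as a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")

# Serialize the model; checking this artifact and its code into
# source control is what triggers the MLOps flow described below.
joblib.dump(model, "model.joblib")
```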

The flow proceeds as follows:

Model Validation

This stage is typically executed in an automated fashion by a CI/CD engine.

Typical checks include:

  1. Model accuracy is verified against a different dataset than the one the data scientist used, providing independent quality assurance prior to production deployment (a sketch of such an accuracy gate follows this list).
  2. Security vulnerability checking can take place at this stage, in accordance with shifting-left-on-security best practices.
  3. Automated checks on model code quality can be applied by CI/CD-based tools to ensure adherence to enterprise standards.
  4. Toward the end of this stage, the container housing the model is often signed, and that signature is checked prior to running in production.
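
To make check 1 concrete, here is a hedged sketch of an automated accuracy gate a CI/CD stage might run; the file names and the 0.90 threshold are assumptions for illustration, not part of any specific product:

```python
# Illustrative CI/CD validation gate: fail the pipeline stage if the
# candidate model underperforms on an independent holdout dataset.
import sys

import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # enterprise-defined quality bar (illustrative)

model = joblib.load("model.joblib")
holdout = pd.read_csv("independent_holdout.csv")  # data the scientist never saw
X, y = holdout.drop(columns=["label"]), holdout["label"]

accuracy = accuracy_score(y, model.predict(X))
print(f"Holdout accuracy: {accuracy:.3f}")

# A non-zero exit code fails the CI/CD stage and blocks deployment.
if accuracy < ACCURACY_THRESHOLD:
    sys.exit(f"Accuracy {accuracy:.3f} is below threshold {ACCURACY_THRESHOLD}")
```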

Model Deployment

Next, the model is typically pushed to a runtime model-serving component that exposes it through an HTTP-based RESTful API, allowing intelligent applications to easily call the model for inference and thereby add value to those applications.
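
From the application's point of view, inference then becomes a simple HTTP call. The sketch below is illustrative: the URL and payload schema are hypothetical, since each model server (Seldon, TensorFlow Serving, and so on) defines its own prediction protocol:

```python
# Calling a deployed model over its RESTful API from an intelligent
# application. The route and payload shape here are hypothetical.
import requests

MODEL_URL = "https://models.example.com/traffic-model/predict"  # illustrative

payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # one row of input features

response = requests.post(MODEL_URL, json=payload, timeout=5)
response.raise_for_status()

print("Model prediction:", response.json())
```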

Monitoring and Validation

Post-deployment, continuous analysis of model performance and behavior is critical, so visualization and alerting tools that highlight and report on deviations over time are essential.

As mentioned, there are striking parallels between the benefits and activities that make up DevOps flows and MLOps flows. If anything, the requirement for an effective and robust automation system is even stronger with AI/ML, largely because MLOps flows are more iterative, owing to what's known as model drift.

At a high level, model drift occurs as the data used to call the model in production deviates further and further from the dataset used to train it. For example, consider the effect of COVID-19 on city traffic prediction models: new traffic patterns, driven by increased working from home and a preference for one's own vehicle over public transport, rapidly decreased the accuracy of such models.
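
One simple way to quantify this kind of drift is to compare a feature's distribution at training time against the values recently observed in production, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses synthetic stand-in data and is not tied to any particular tool:

```python
# Sketch of feature-drift detection: compare a feature's training-time
# distribution with recent production values using a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for real data: training-time values of one feature versus
# the values observed in recent production traffic.
training_values = rng.normal(loc=50.0, scale=10.0, size=5_000)
production_values = rng.normal(loc=58.0, scale=12.0, size=5_000)

statistic, p_value = ks_2samp(training_values, production_values)

# A small p-value means the two distributions differ significantly,
# which is a signal to investigate or trigger retraining.
if p_value < 0.01:
    print(f"Drift detected (KS statistic={statistic:.3f}, p={p_value:.2g})")
else:
    print("No significant drift detected")
```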

What value do Kubernetes and OpenShift bring to MLOps?

McKinsey, in its State of AI 2020 survey, found that enterprises realizing above-average business value from AI/ML are much more likely than others to say their companies have built a standardized end-to-end platform for AI-related data science. This matches our experience, and the business value achievable using an enterprise-grade Kubernetes container platform, such as Red Hat OpenShift, is the subject of this blog series.

So how can OpenShift add business value through MLOps? OpenShift provides a flexible, unopinionated platform that lets you bring your tools of choice to MLOps workflows. Furthermore, the platform provides reference architectures for AI/ML, such as the Open Data Hub and Kubeflow, which expose many tools that facilitate MLOps at various stages.

Model Validation

As discussed, model validation normally encompasses a CI/CD pipeline flow, which can also enforce security, code quality, and speed of deployment through automation. OpenShift facilitates and simplifies these concerns by providing access to many Operator-backed open source tools, such as:

  • CI/CD solutions such as Jenkins, Tekton, and Kubeflow Pipelines (a Kubeflow Pipelines sketch follows this list)
  • GitOps Continuous Delivery solutions such as ArgoCD
  • Much more through Red Hat’s partner and independent software vendor (ISV) ecosystem
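
As an example of expressing the validate-then-deploy flow in one of these engines, here is a rough sketch using the Kubeflow Pipelines (kfp v2) SDK. The component bodies are placeholders, and the exact decorator and condition syntax varies between SDK versions:

```python
# Sketch of a validate-then-deploy flow with the Kubeflow Pipelines SDK.
# Component bodies are placeholders; real steps would pull the model
# artifact, run the validation checks, and push to the model server.
from kfp import compiler, dsl

@dsl.component
def validate_model() -> float:
    # Placeholder: load the candidate model, score it, return accuracy.
    return 0.93

@dsl.component
def deploy_model(accuracy: float):
    # Placeholder: push the signed model image to the serving runtime.
    print(f"Deploying model with accuracy {accuracy}")

@dsl.pipeline(name="mlops-validate-and-deploy")
def mlops_pipeline():
    validation = validate_model()
    # Only deploy when the accuracy gate passes (0.9 is illustrative).
    with dsl.Condition(validation.output >= 0.9):
        deploy_model(accuracy=validation.output)

# Compile to a YAML definition for submission to a Kubeflow Pipelines
# (or OpenShift-hosted) pipeline runner.
compiler.Compiler().compile(mlops_pipeline, "mlops_pipeline.yaml")
```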

Model Deployment

We covered the capability to deploy the model and expose it behind an HTTP-based RESTful API. This type of service is available and frequently implemented on OpenShift through open source tools such as Seldon, TensorFlow Serving, and others. Some of these tools now offer explainability as well as pure inference. For example, Alibi from Seldon can output the reasons the model came to the prediction it did, which can be useful for auditing and legal purposes (see the sketch below).
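
As a taste of what explainability looks like in practice, here is a hedged sketch using Alibi's anchor explanations; it assumes a fitted scikit-learn classifier `model`, a NumPy training matrix `X_train`, and the illustrative feature names below:

```python
# Sketch of model explainability with Seldon's Alibi library: anchor
# explanations report the feature conditions that "anchor" a prediction.
from alibi.explainers import AnchorTabular

feature_names = ["speed", "volume", "hour_of_day", "is_weekend"]  # illustrative

explainer = AnchorTabular(model.predict, feature_names=feature_names)
explainer.fit(X_train)  # learn feature distributions from training data

# Explain a single prediction, e.g. for an audit trail.
explanation = explainer.explain(X_train[0])
print("Prediction anchored by:", explanation.anchor)
print("Anchor precision:", explanation.precision)
```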

Other OpenShift-based tools, such as Service Mesh, can add value to model hosting and deployment through end-to-end encryption and A/B and canary releases, where new models are released on a staged basis, possibly to a specific client subset, until quality and performance are firmly established.

Monitoring and Validation

Open source tools such as Prometheus and Grafana provide visualization, monitoring, and alerting based on the state of the model and its client applications in production. Furthermore, solutions such as Apache Airflow can be employed to automatically retrain models on current datasets, configured so that a new model is released to production only when a given accuracy threshold is met.
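
As a sketch of the monitoring side, a Python model server can expose Prometheus metrics with the prometheus_client library; the metric names here are illustrative, and Grafana dashboards or Prometheus alert rules would then build on the scraped time series:

```python
# Instrumenting a model-serving process with Prometheus metrics.
# Prometheus scrapes the /metrics endpoint this process exposes.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served")
LATENCY = Histogram("model_prediction_latency_seconds", "Prediction latency")

@LATENCY.time()  # record how long each prediction takes
def predict(features):
    PREDICTIONS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    return 1

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        predict([1.0, 2.0, 3.0])
        time.sleep(1)
```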

In summary, there are many ways organizations can realize business value through an effective OpenShift MLOps strategy, including:

  • A fast and streamlined CI/CD pipeline-based mechanism to move models into production, allowing rapid response to changing market conditions.
  • An inbuilt feedback and workflow loop, allowing models to be retrained rapidly and often automatically, maximizing value from intelligent applications while maximizing the productivity of expensive data science resources.
  • Built-in security and quality, reducing potentially expensive problems in production.
  • A proven set of open source tools for model serving and inference in production, and for visualization, performance monitoring, and alerting. These capabilities allow rapid remedial action in response to changing conditions and preserve continuity of an excellent customer experience.

Follow the link at the end of this post to learn how ExxonMobil uses OpenShift as an end-to-end AI platform, including its use for MLOps.

Conclusion

DevOps practices, including CI/CD, have positively transformed the process of application delivery in recent years. The MLOps equivalent for AI/ML workflows offers even greater payoffs, due to the highly transitory and iterative nature of AI/ML models and workflows. When integrated into a platform-based approach to AI/ML on an enterprise-grade Kubernetes container platform, such as Red Hat OpenShift, MLOps represents yet another area in which the platform drives superior business value.

References

Kubernetes: The Savior of AI/ML Business Value?

Business Centric AI/ML With Kubernetes - Part 2: Data Preparation

Business Centric AI/ML With Kubernetes - Part 3: GPU Acceleration

Forsgren, Humble & Kim - Accelerate

Gartner - Top Strategic Predictions for 2019

Gartner - Predicts 2020: Artificial Intelligence — the Road to Production

McKinsey - The State of AI in 2020

The Open Data Hub

Kubeflow

Google - Hidden Technical Debt in Machine Learning Systems

OpenShift and Machine Learning at ExxonMobil

Hybrid Cloud Ubiquity with OpenShift