Artificial Intelligence and Machine Learning (AI/ML) will play an increasingly important role in all our lives: self-driving cars, better medical diagnoses, improved fraud detection, superior investment advice; the list goes on. AI/ML's scope of influence will be far and wide.

Until now, however, the realization of all this promise has been hampered by several obstacles on the path from idea inception to a fully functioning AI/ML model in the real world. In fact, such is the effect of these challenges that Gartner recently claimed, “Through 2020, 80% of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization.” Some of these challenges include:

  1. Difficulty in sifting through the huge rafts of data now available, then cleansing and preparing that data for AI/ML model training.
  2. Silos and difficult workflow handoffs between the many personas involved in AI/ML, compounded by the slow, risky movement of AI/ML models, data, and applications through testing environments to production.
  3. Weak visualization, analysis, and retraining of models and intelligent applications after deployment, another contributor to disappointing AI/ML business value.
  4. Inefficiencies in provisioning compute and GPU-accelerated hardware, which often lead to extended wait times for those resources.
  5. IT bottlenecks in provisioning the software tooling required for AI/ML, resulting in yet more idle time for AI/ML professionals.
  6. Inconsistencies in AI/ML tooling: whenever disparate IT systems, including ad hoc servers and isolated laptops, are used to deliver AI/ML, version mismatches between tooling across infrastructures follow.

Furthermore, VentureBeat predicted in 2019 that 87% of AI projects would never make it into production.

In this article, we provide an overview of how Kubernetes can help address these challenges and cite examples of how Red Hat’s industry-leading Kubernetes distribution, OpenShift Container Platform, has capabilities designed to alleviate each one.

How Can Kubernetes Help?

Red Hat has identified a Kubernetes-based AI/ML workflow that addresses many of the pain points we have described, as shown in Figure 1 [1]:

Figure 1: Kubernetes-based AI/ML workflow


First, business leadership identifies and defines the business objectives and the metrics by which the success of the AI/ML venture will be assessed.


Then data engineers collect the available and relevant data, and process and refine it so that it is consumable by the data scientists who train the model; that means properly formatted and labeled. This is an absolutely critical stage whose importance is often overlooked: without high-quality data, you cannot produce a high-quality model.
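To make this concrete, here is a minimal sketch, using pandas, of the kind of cleansing and labeling work involved; the file names, column names, and labeling rule are illustrative assumptions, not part of any specific workflow:

```python
# A minimal data-preparation sketch; "transactions.csv", its columns, and the
# labeling rule are hypothetical stand-ins for a real data engineering job.
import pandas as pd

raw = pd.read_csv("transactions.csv")

clean = (
    raw.drop_duplicates()
       .dropna(subset=["amount", "merchant"])  # discard incomplete rows
)
clean["amount"] = clean["amount"].astype(float)

# Illustrative labeling rule: flag high-value transactions for a fraud model.
clean["label"] = (clean["amount"] > 10_000).astype(int)

# Write in a columnar format that training jobs can consume efficiently.
clean.to_parquet("transactions_clean.parquet")
```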

The next step is the actual training of the model. The aim here is to produce predictable and useful responses to new incoming data.
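As a simple illustration, a training step with scikit-learn might look like the sketch below; it uses synthetic data where a real pipeline would load the prepared data from the previous stage:

```python
# A minimal training sketch on synthetic data, not any specific pipeline.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the prepared, labeled training data.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)  # learn patterns intended to generalize to new incoming data
```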


Following the training stage, the data scientist needs to verify the model and their hypotheses using a different set of data from the one used for training.
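Continuing the illustrative sketch above, verification against data the model never saw during training might look like this; the 80/20 split and accuracy metric are arbitrary choices for the example:

```python
# A minimal verification sketch: hold out data unseen during training and
# check that performance holds up on it.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"Validation accuracy: {accuracy_score(y_valid, model.predict(X_valid)):.3f}")
```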


Next, the model is deployed into a real-world production scenario, likely integrating with intelligent applications that make calls to the model for predictions in real time.
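For illustration, an intelligent application might request a real-time prediction from the deployed model over REST, as sketched below; the endpoint URL and JSON schema are hypothetical, since each model server defines its own API:

```python
# A hypothetical real-time prediction call; the URL and payload shape are
# assumptions, not a specific model server's API.
import requests

features = {"amount": 12_500.0, "merchant": "acme-online", "hour_of_day": 2}

resp = requests.post(
    "http://fraud-model.apps.example.com/predict",  # hypothetical endpoint
    json={"instances": [features]},
    timeout=2.0,  # real-time callers need a tight latency budget
)
resp.raise_for_status()
print(resp.json())  # e.g. {"predictions": [0.97]}
```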


And finally, we need monitoring and validation of models post-deployment. One of the big challenges of AI/ML models is concept drift: the nature of the data we see in the real world can gradually drift away from the data we trained the model with, until the model no longer reflects current reality. This necessitates frequent validation and, quite possibly, retraining.
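One simple way to watch for such drift is to statistically compare the distribution of a live feature against its training-time distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test; the sample sizes and significance threshold are illustrative assumptions:

```python
# A minimal drift check: compare a feature's production distribution against
# its training distribution with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training-time data
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)       # recent production data

statistic, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:  # the distributions differ significantly
    print(f"Drift detected (KS statistic {statistic:.3f}); consider retraining.")
```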

For this workflow to be truly successful, all of these stages need to be accompanied by self-service infrastructure, enabling both software tools and hardware infrastructure to be provisioned on demand without bottlenecks from IT Ops.

The workflow should be automated, enabling fast and safe movement of workloads and handoffs between parties: a DevOps approach to AI/ML; MLOps, if you will.

A Proven Approach

So how does this workflow actually address the obstacles to business success we outlined?

Let’s walk through the challenges we introduced earlier and see how the workflow addresses each one, with examples taken from Red Hat’s enterprise Kubernetes distribution, OpenShift Container Platform.

1. Data

We have outlined the importance of preparing useful, high-quality data for model training. Enterprise Kubernetes systems provide quick, self-service access to powerful data management capabilities, for example Apache Spark and Apache Kafka, and to storage capabilities such as Ceph. These radically simplify the management of the huge quantities of data from disparate sources that AI/ML requires.
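As a taste of what this looks like in practice, the sketch below uses PySpark for distributed cleansing of a large raw dataset; the bucket paths and column names are hypothetical, and it assumes a reachable Spark cluster configured for S3-style storage:

```python
# A minimal distributed data-cleansing sketch; paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-telemetry").getOrCreate()

raw = spark.read.json("s3a://raw-bucket/telemetry/")  # hypothetical raw source
clean = (
    raw.dropDuplicates()
       .filter(F.col("speed_kmh").isNotNull())          # drop incomplete records
       .withColumn("speed_ms", F.col("speed_kmh") / 3.6)
)
clean.write.mode("overwrite").parquet("s3a://curated-bucket/telemetry/")
```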

Autonomous vehicles require vast quantities of data to train AI models.


Reference:

See how BMW's experience with Red Hat OpenShift in their autonomous car initiatives provides them with massively scalable data collection, processing, and storage capabilities, simulating up to 240 million kilometers of test data.


2. Overcoming Silos and Accelerating Delivery to Production

DevOps involves the elimination of silos between development and IT operations in traditional software delivery. Silos and difficult workflow and artifact handoffs can be an even more confounding issue in AI/ML, as there are even more personas involved.

From data analysis, engineering, and science, to IT operations, to application development, to lines of business, Kubernetes can be used to break down these barriers and accelerate collaboration between parties through automation and workflow-driven approaches such as common storage, continuous integration and continuous delivery (CI/CD), and shared image registries.


Reference:

See how ExxonMobil created a self-service collaborative and workflow-driven AI/ML platform that enabled them to overcome silos and rapidly accelerate the pace of model delivery using OpenShift. 


3. Analysis, Validation, and Retraining Post-Deployment

Kubernetes offers a swathe of container-based visualization and analytics tools that provide ongoing feedback on, and verification of, model accuracy and drift. These include open source metrics and visualization tools such as Prometheus and Grafana, available from the Open Data Hub.
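As a sketch of how that feedback loop can be wired up, a model server can expose custom metrics for Prometheus to scrape and Grafana to chart; the metric names and the randomly generated values below are purely illustrative:

```python
# A minimal sketch of exposing model-quality metrics for Prometheus; in a real
# service the values would come from live evaluation, not a random generator.
import random
import time

from prometheus_client import Gauge, start_http_server

model_accuracy = Gauge("model_accuracy", "Rolling accuracy of the deployed model")
drift_score = Gauge("feature_drift_score", "Drift score vs. the training distribution")

start_http_server(8000)  # Prometheus scrapes http://<pod-ip>:8000/metrics

while True:
    model_accuracy.set(random.uniform(0.85, 0.95))  # placeholder values
    drift_score.set(random.uniform(0.0, 0.2))
    time.sleep(30)
```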

4. Hardware Provisioning

Kubernetes is inherently suited to, and efficient at, provisioning hardware resources for AI/ML workloads. Kubernetes claims the desired or allocated resources, including memory, CPU, GPU, and storage, and reserves them for the duration of the particular workload, be it training the model, testing through CI/CD, or serving the model at runtime. When no longer required, these resources are freed up and made consumable by other workloads. This effective pooling leads to enormous hardware efficiency, both in utilization and in on-demand availability.
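To illustrate, the sketch below uses the official Kubernetes Python client to request a GPU alongside CPU and memory for a training pod; the image, namespace, and resource figures are assumptions for the example:

```python
# A minimal sketch of claiming a GPU for a training workload via the
# Kubernetes Python client; image, namespace, and sizes are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # authenticate using the local kubeconfig

container = client.V1Container(
    name="train",
    image="registry.example.com/ml/train:latest",  # hypothetical training image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        limits={"nvidia.com/gpu": "1"},  # GPUs are requested via limits
    ),
)

pod = client.V1Pod(
    api_version="v1",
    kind="Pod",
    metadata=client.V1ObjectMeta(name="gpu-train", namespace="ml-jobs"),
    spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
)

client.CoreV1Api().create_namespaced_pod(namespace="ml-jobs", body=pod)
# When the pod completes, the scheduler frees the GPU for other workloads.
```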

Reference:

Read about how RBC in Canada partnered with Red Hat and NVIDIA to produce a Kubernetes-based, on-demand GPU AI platform on OpenShift.


5. Software and Tool Provisioning

Self-service access to data and machine-learning tools for data engineers and data scientists leads to extraordinary efficiencies, such as the elimination of IT bottlenecks and higher utilization of the time of these expensive and scarce professionals. Kubernetes can expose container catalogs and operator-backed data and AI/ML services, such as those available from the Open Data Hub, facilitating rapid uptake and consumption.

Reference:

See how ExxonMobil created a self-service, collaborative, workflow-driven AI/ML platform through OpenShift, bringing them enormous efficiencies and cost savings.


6. Inconsistencies in AI/ML Tooling

Using a common set of tools and services across the organization, such as those in OpenShift’s container catalogs and operator-backed data and AI/ML services, can help eliminate version mismatches and the inconsistencies that come with them.

Reference:

Learn about KPMG Ignite, an AI/ML platform built on Red Hat OpenShift that provides consistency across the entire lifecycle.


In Conclusion

The business challenges besetting AI/ML are varied and significant, yet we have seen that Kubernetes-based container orchestration platforms such as Red Hat OpenShift can simplify and alleviate many of them. We have also highlighted how organizations are exploring these solutions firsthand and moving toward much greater capitalization on the promise of AI/ML: sustained and enhanced problem solving, customer satisfaction, and, ultimately, business value.

For more on how we can help you, check out openshift.com/learn/topics/ai-ml, or feel free to reach out to me directly on LinkedIn.