Introduction

You may have heard the phrase "data is the new oil." Such is the value now attributed to data, particularly with the emergence of Artificial Intelligence and Machine Learning (AI/ML).

In this blog, the second in a series exploring challenges and solutions in unlocking AI/ML business value, we address data management and how an enterprise Kubernetes platform such as Red Hat OpenShift can streamline the retrieval and preparation of data, as well as its delivery to Data Scientists for AI model creation.

Why Is Data Now So Important?

So what’s driving the increased importance of data in AI? AI has existed for more than 50 years; hasn’t it always needed data?

To answer this, consider the forces driving not only the demand for data itself but also the demand for the preparation and cleansing of that data:

  1. Big Data
    Over the past decade, there has been an explosion in the volume of data produced. This includes data from newer sources such as webcams, smartphones, sensors, and enterprise IT systems in the cloud and at the edge, as well as from more traditional sources such as the data center. McKinsey has recently reported that we're generating 2 exabytes of data daily. That's 2 billion gigabytes. IDC estimates that 90% of the data in the world has been generated in the past 2 years.

  2. Most ML Algorithms Require Structured and Often Numeric Data
    ML is the subset of AI most commonly used today, and it requires large amounts of organized data to train its algorithms. The raw, unstructured data that comprises Big Data, typically stored in repositories known as Data Lakes, is generally unusable for ML. Furthermore, ML algorithms often require input data to be numerical, so even when the data is tabulated into the correct dimensions and shape, some of it will likely require conversion to a purely numerical representation (a minimal example follows this list).
    These demands also create the need for skills and tools to perform Extract-Transform-Load (ETL) tasks on data, preparing and cleansing it for consumption by ML.
  3. Deep Learning

    There is a newer form of ML, known as Deep Learning, that uses Deep Neural Networks. Neural Networks are so called because they mirror the structure of the human brain: they consist of many connected neurons whose outputs feed the inputs of other neurons and, eventually, the output of the model, yielding a value, classification, prediction, or other finding.
    Deep Neural Networks are very powerful and can perform feature engineering, the extraction of distinguishing and meaningful information from the data set (for example, what is it in this data that distinguishes a fraudulent transaction from a valid one, or precisely which features of this image lead to the conclusion that it's a dog, not a cat?).

    This is tremendously powerful and is a strong factor in the increasing democratisation of ML: as the model takes on more of the domain-analysis role, the requirement for expert human domain knowledge is reduced.

    However, Deep Learning tends to require vast amounts of data at the training phase.
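To make the numeric-representation requirement concrete, here is a minimal sketch of converting categorical data to a purely numerical form using one-hot encoding in pandas; the column names and values are invented for illustration.

```python
import pandas as pd

# Toy transaction records: one numeric column and two categorical ones.
# (Column names and values are hypothetical.)
df = pd.DataFrame({
    "amount":  [120.00, 9800.00, 45.50],
    "channel": ["web", "atm", "web"],
    "country": ["CA", "BR", "CA"],
})

# One-hot encode the categorical columns so every feature is numeric,
# the representation most ML algorithms expect as input.
encoded = pd.get_dummies(df, columns=["channel", "country"])
print(encoded)
```

Each categorical value becomes its own 0/1 column (for example, channel_web), leaving a table that an ML algorithm can consume directly.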


Neural Networks mimic the layered, interconnected structure of neurons in the human brain. They often require vast amounts of data to train.
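To illustrate the structure the caption describes, here is a minimal sketch of a single forward pass through a tiny two-layer network in NumPy; the layer sizes, weights, and fraud-scoring framing are arbitrary, and in practice the weights would be learned from large volumes of training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny network: 4 input features -> 3 hidden neurons -> 1 output.
# Real Deep Neural Networks have many more layers and neurons.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    # Each neuron's output feeds the inputs of the next layer's neurons...
    hidden = np.tanh(x @ W1 + b1)
    # ...and the final layer yields the model's output, e.g. a probability.
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))

x = np.array([0.2, -1.3, 0.7, 0.0])  # one hypothetical input record
print(forward(x))  # e.g. a fraud score between 0 and 1
```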

So the ever-increasing ubiquity of data, the preparation it requires, and the huge data volumes demanded by modern Deep Learning algorithms have resulted in:

  • a large and growing demand for data preparation and management
  • a massive value placed on that data for those who can capitalize on it.

Challenges to an Effective Data Preparation and Management Strategy

Many organizations struggle in the data preparation and management areas, however. Andrew Ng, chief scientist at Baidu Research, Stanford lecturer, and AI thought leader, cites the unavailability of meaningful, structured data as one of the two big challenges obstructing widespread AI/ML adoption (the other, related factor, he argues, is the shortage of talent).

The following are some of the challenges to producing high-quality data that is consumable by ML algorithms:

  • IT Bottlenecks
    Requests by Data Engineers, the professionals responsible for ETL tasks, for data management tools can take weeks or months to fulfill.
  • Contention
    Competition for scarce data management tools, such as Apache Spark servers, Kafka messaging and data streaming clusters, and centralized or restricted databases.
  • Inconsistencies
    Inconsistent versions of these tools across the organisation, leading to difficult-to-identify bugs and instabilities.
  • Challenges in Collaboration and Sharing of Data
    Lack of a smooth and effective mechanism for handing over prepared data to the data scientists who consume it.
  • Wasted Talent
    The above factors can lead to inefficient use of data professionals' time and energy, to frustration, and to opportunities lost through slower work output.

How Kubernetes and OpenShift Can Address Many of These Challenges

Using an enterprise-grade Kubernetes container platform, such as Red Hat OpenShift, is one of the most effective steps organisations can take to address these challenges. OpenShift can not only help address data management challenges; it can tackle issues faced at all stages of the AI/ML workflow. McKinsey, in The State of AI in 2020, found that companies achieving their AI/ML business objectives are much more likely to have established a standardized end-to-end platform for AI-related data science, data engineering, and application development.

Here are some of the ways an enterprise Kubernetes system such as OpenShift can solve data-related problems:

  • IT Bottlenecks
    OpenShift provides a self-service catalog of certified tools for Data Engineers, Data Scientists, and Application Developers, including tools from cutting-edge open source projects such as Kubeflow and Open Data Hub (Apache Kafka, Apache Spark, and myriad other databases and ETL tools). Because these are available on demand, without any IT involvement, they provide a massive boost to the agility and productivity of data professionals.
  • Contention
    Contention for scarce resources and tools is all but eliminated: tools are provisioned on demand, and the OpenShift scheduler allocates resources to each instance, for example a dedicated Spark cluster (see the provisioning sketch after this list). This again provides a powerful boost to productivity.
  • Inconsistencies
    The OpenShift administrator can centrally mandate enterprise-wide versions of tools, thereby eliminating problems caused by version mismatches.
  • Challenges in Collaboration and Sharing of Data
    OpenShift provides a range of essential functional add-ons to raw Kubernetes, many of which facilitate a workflow-driven approach supporting the handover of assets between personas. Examples include shared object storage and Continuous Integration/Continuous Delivery (CI/CD) tools (Ceph, Argo CD, Tekton, Jenkins, and many others from OpenShift's large ecosystem of partners); a minimal handover sketch follows this list.
  • Wasted Talent
    The agility and efficiency facilitated by the above can lead to more productive, energized, and fulfilled data professionals, in turn improving talent retention.
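As referenced in the Contention item above, here is a minimal sketch of what on-demand provisioning can look like: a Data Engineer requests a dedicated Spark cluster for a job by creating a SparkApplication custom resource. This assumes the Kubeflow spark-operator (one of the components used for Spark in projects such as Open Data Hub) is installed on the cluster; the namespace, image, and file paths are hypothetical.

```python
from kubernetes import client, config

config.load_kube_config()  # use the engineer's own kubeconfig / oc login

# A SparkApplication custom resource describing a short-lived, dedicated
# Spark cluster for one ETL job. All names here are illustrative.
spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "etl-job", "namespace": "data-eng"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "quay.io/example/spark-py:latest",   # hypothetical image
        "mainApplicationFile": "local:///opt/jobs/etl.py",
        "sparkVersion": "3.5.0",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 2, "instances": 3, "memory": "2g"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="data-eng",
    plural="sparkapplications",
    body=spark_app,
)
```

The operator spins up the driver and executors for the job and tears them down on completion, returning the resources to the shared pool: exactly the on-demand, contention-free pattern described above.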
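And here is the handover sketch mentioned in the Collaboration item: because Ceph exposes an S3-compatible API, a Data Engineer can publish a prepared data set to shared object storage with the standard boto3 client, and a Data Scientist can later pull exactly the same artifact. The endpoint, credentials, bucket, and object names are invented for illustration.

```python
import boto3

# Ceph's RADOS Gateway speaks the S3 protocol, so the ordinary boto3 S3
# client works against it. Endpoint and credentials are hypothetical.
s3 = boto3.client(
    "s3",
    endpoint_url="https://ceph-rgw.apps.example.internal",
    aws_access_key_id="ENGINEER_ACCESS_KEY",
    aws_secret_access_key="ENGINEER_SECRET_KEY",
)

# The Data Engineer publishes the cleansed, versioned training set...
s3.upload_file("transactions_clean.parquet",
               "prepared-data", "fraud/v3/transactions_clean.parquet")

# ...and a Data Scientist later retrieves the very same artifact.
s3.download_file("prepared-data", "fraud/v3/transactions_clean.parquet",
                 "transactions_clean.parquet")
```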

In summary, there are many ways organisations can expect to realize business value from data management on OpenShift, including:

  • Greater Productivity
    Data Engineers' tooling requirements are fulfilled through self-service, with no lengthy and expensive waits for IT to provision hardware and software.
  • More Efficient Hardware Utilization and Lower Costs
    These benefits are enabled through OpenShift’s on-demand job scheduling and fast return of resources to the central resource pool on completion of jobs.
  • A Faster Data Pipeline Flow
    This results in quicker realization of value from AI/ML models and intelligent applications in production.
  • Less Downtime
    Less time is lost investigating difficult-to-solve issues arising from tool version mismatches.

Follow the links below to find out how Royal Bank of Canada and HCA Healthcare used Red Hat OpenShift as an innovative Kubernetes data platform for AI/ML.

Conclusion

The proliferation of data coinciding with more mainstream adoption of Machine Learning and Deep Learning has elevated the importance of effective data management strategies. Data now represents a strong competitive opportunity and advantage for those who get it right; equally, it can be a competitive threat for those who don’t.

McKinsey, in The State of AI in 2020, places effective data strategies amongst the key differentiators for those who are profiting most from AI/ML.

Kubernetes as a platform provides enormous potential to alleviate challenges around the scheduling, scaling, and movement of data workloads. That potential is realized through OpenShift, with its catalogs of self-service data management and preparation tools. This enables organizations to achieve, and indeed commoditize, high-quality data pipeline production for downstream consumption by data scientists.

References

Andrew Ng - Why AI Is the New Electricity

McKinsey - Analytics Insights

Forbes - Is data more important than algorithms?

McKinsey - The State of AI in 2020

Will Ramey - Nvidia Webinar: Deep Learning and Beyond

Royal Bank of Canada's new AI private cloud platform, developed with Red Hat and NVIDIA

HCA Healthcare uses innovative data platform to save lives

Kubernetes: The Savior of AI/ML Business Value? (Part 1)