In this blog, we describe the text-to-speech demo that was showcased at Red Hat Summit 2022. Following an introduction to machine learning in text-to-speech applications and a closer look at the model development process, we will show how Red Hat OpenShift Data Science (RHODS) helped in the development of this demo.
What is Natural Language Processing?
Businesses are increasingly turning to artificial intelligence (AI) to evaluate and improve their practices, customer satisfaction, organizational efficiency, and more. Machine learning (ML) is a type of AI in which computers build models and make predictions without explicit instruction. We used natural language processing, an application of ML, to demonstrate how RHODS supports data scientists with model development and deployment.
Natural language processing (NLP) is the ability of machines to process and analyze natural language, usually written or spoken language. A user dictating text to their device and having the device read a reply back aloud are examples of speech-to-text (STT) and text-to-speech (TTS), respectively. In gaming, AI-powered non-playable characters (NPCs) can engage players in a wider range of more natural dialog. Smart assistants, like Apple’s Siri or Amazon’s Alexa, are services that use NLP and a speech synthesizer to engage with the user. Accessibility is one of the largest applications of STT/TTS, broadening the spectrum of how users interact with technology: screen readers, for example, use TTS to speak on-screen images and text aloud or produce braille output.
Red Hat OpenShift Data Science's Role in Text-to-Speech Development
To develop the TTS demo, we used Coqui TTS as a toolkit library and RHODS to train and deploy the model. RHODS is a managed cloud service that gives data scientists and engineers the infrastructure to create intelligent applications. It provides an easily managed, standard environment for data scientists to build and train their models within Jupyter using packages such as TensorFlow or PyTorch. OpenShift is the underlying application platform, powered by Kubernetes, that allows for portability via containers and scalability via autoscaling pods. This gives data scientists a streamlined process for creating their own custom environments, while enabling IT operations to manage those environments more easily. Jupyter is a good example: it is integrated into RHODS, giving data scientists self-service access to notebooks. Because it is deployed on OpenShift, it also benefits from many features of the container platform, such as resource quotas, isolation, and role-based access control. Simply by adopting OpenShift, you can eliminate many of the technical hurdles data scientists face in getting started and performing their key functions.
Using RHODS to Create a TTS Model
Machine learning includes a wide range of applications and technologies. Natural language processing, a subset of machine learning, covers the AI techniques focused on language, whether written or spoken. For a piece of code to comprehend a sentence the way a human does, we transform the text into a format computers understand: numbers. There are many useful models, such as BERT, that can create “embeddings” from words and sentences. These numeric representations are then passed along to the next machine learning task.
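As a rough illustration of the text-to-numbers idea (a toy sketch only; real systems use learned models like BERT, while the hash-based vectors here are invented for demonstration):

```python
import hashlib

import numpy as np


def toy_embedding(word: str, dim: int = 8) -> np.ndarray:
    """Map a word to a deterministic pseudo-random vector.

    This hash-based stand-in is NOT a learned embedding like BERT's;
    it only shows the idea of turning text into numeric vectors.
    """
    seed = int.from_bytes(hashlib.sha256(word.lower().encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)


def sentence_embedding(sentence: str, dim: int = 8) -> np.ndarray:
    """Average the word vectors to get one vector for the whole sentence."""
    vectors = [toy_embedding(w, dim) for w in sentence.split()]
    return np.mean(vectors, axis=0)


vec = sentence_embedding("text to speech")
print(vec.shape)  # (8,)
```

With a real model, the downstream task receives vectors of this same kind, just learned from data rather than hashed.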
In our case, the encoded speech waves and their corresponding text are passed to a decoder. Our decoder, GlowTTS, is a specialized model architecture that learns how to generate audio representations from text input. You can train such a model on any voice, language, or accent, so long as you have enough data! We chose the LJSpeech dataset because it contains many hours of recordings. Since these are complex models with gigabytes of data to train on, training can take hours, even days. The final step in synthesizing speech is a vocoder, the same class of technology behind auto-tune effects. From the raw acoustic representation produced by the decoder model, out comes an audible, human-like voice. Each of these models has a specialized machine learning architecture to optimize our results. Together, they form a seamless pipeline that transforms written text into spoken language.
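The three-stage flow above can be sketched end to end. This is a toy pipeline whose function bodies are invented stand-ins (not GlowTTS or a real neural vocoder); it only mirrors the shape of the data as it moves from text to waveform:

```python
import numpy as np

SAMPLE_RATE = 22050  # a common sample rate for TTS corpora


def encode(text: str) -> np.ndarray:
    """Encoder: map characters to integer IDs (stand-in for a learned encoder)."""
    return np.array([ord(c) for c in text.lower()], dtype=np.int64)


def decode(ids: np.ndarray, frames_per_char: int = 5, n_mels: int = 80) -> np.ndarray:
    """Decoder: produce a mel-spectrogram-shaped array from the IDs.

    GlowTTS *learns* this mapping; here we fabricate frames deterministically
    just to show the intermediate representation's shape.
    """
    rng = np.random.default_rng(int(ids.sum()))
    return rng.random((len(ids) * frames_per_char, n_mels))


def vocode(mel: np.ndarray, hop: int = 256) -> np.ndarray:
    """Vocoder: turn spectrogram frames into a waveform.

    A real vocoder inverts the spectrogram with a neural network or an
    algorithm like Griffin-Lim; this toy version just emits a sine tone
    per frame, with pitch driven by the frame's mean energy.
    """
    freqs = 100.0 + 400.0 * mel.mean(axis=1)  # one frequency per frame
    t = np.arange(hop) / SAMPLE_RATE
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in freqs])


wave = vocode(decode(encode("hello world")))
print(len(wave))  # one waveform chunk per spectrogram frame
```

In the actual demo, the Coqui TTS toolkit supplies trained models for each of these stages; the interfaces between them, text IDs in, spectrogram frames through, waveform out, follow this same pattern.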
Think NLP is interesting and want to try it for yourself? Master NLP using Red Hat OpenShift Data Science is now available, along with other RHODS-related learning paths. You can also find this demo in our Git repo. These Python notebooks, along with the guide, will take you through the entire data science process. Plus, we demonstrate how to containerize and deploy your application on OpenShift. To watch the Summit session, please visit this page. In the end, we have our speech synthesis available on a Streamlit website for anyone to use. This same framework could be applied to any intelligent application: instead of running your model within an IDE, you can add interactive elements and analytics to a web page. At Red Hat, we encourage all data scientists and developers to explore other languages and learn the entire model lifecycle.