For the first time, Typecast users can transfer emotional style from one individual to a target speaker while preserving the target speaker’s unique voice identity and controlling emotion intensity
Typecast, the startup behind the AI-powered virtual actor service of the same name, which uses AI to create synthetic voices and videos, today announced its latest text-to-speech innovation: Cross-speaker Emotion Transfer. Based on the work of Typecast researchers and building on its unique emotional style control feature, Cross-speaker Emotion Transfer lets users apply emotions recorded from another voice to their own voice. The approach is detailed in the paper Cross-speaker Emotion Transfer by Manipulating Speech Style Latents, which was accepted by the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
The technology will be available to consumers exclusively through Typecast. Typecast has also recently introduced the My Voice Maker feature, which enables users to clone their voices using minimal data. Availability of the new technology will be tailored to consumers’ needs. Together, these advancements add a new level of depth, richness, and possibility to AI actors.
“AI actors are the future of content creation, with their capacity to accelerate production timelines, dramatically reduce costs, and expand distribution possibilities. They also empower everyday people to bring a script to life,” said Taesu Kim, co-founder and CEO of Typecast. “But AI actors have yet to fully capture the emotional range of humans, which is their biggest limiting factor. Typecast has unlocked the secret with cross-speaker emotion transfer, and we’ve productized it through the My Voice Maker feature, so that anyone can use AI actors with real emotional depth based on only a small sample of their voice.”
Breaking New Ground
In traditional emotional speech synthesis, which is used to power AI actors, all training data must carry an emotion label. This is challenging because data must exist for every emotion to be expressed by every speaker in the training set, and emotions are often mislabeled because the boundaries between them are vague.
Cross-speaker emotion transfer, which borrows emotional style from a different speaker, avoids the need for such data, but to this point it has performed poorly, limiting the range of AI actors. Emotional speech generated for an emotion-neutral speaker often sounds unnatural compared with speech from the original emotional speaker. Additionally, emotion intensity control is often not possible, further limiting practical uses.
Typecast’s latest approach to emotional speech synthesis solves these problems, achieving cross-speaker emotion transfer, emotion intensity control, and few-shot emotion transfer at the same time. It is now possible to achieve emotion transfer with very high naturalness and emotion similarity without changing the speaker’s identity. A user can record a basic snippet of their voice and apply a full range of emotions, at a chosen intensity, from another speaker, preserving the unique qualities of a human voice for AI actors.
Transcending Emotional Boundaries
Typecast’s pioneering technology has shattered the limitations of emotional expression. Leveraging the power of artificial intelligence, this cutting-edge system learns from vast amounts of data, eliminating the need for laborious hours of voice recordings. By analyzing diverse sources like audiobooks and other available resources, AI algorithms gain a comprehensive understanding of a wide range of emotional expressions.
This achievement holds immense significance. Previously, capturing the subtleties of different emotions through voice recordings was a time-consuming and arduous task. With the introduction of Cross-speaker Emotion Transfer, Typecast has ushered in a new era in which emotional boundaries are transcended. This breakthrough empowers individuals who may have lacked the opportunity or resources to record their voices expressing various emotions.
One of the most remarkable aspects of Typecast’s innovation is its ability to preserve the unique identity of the target speaker. Each person possesses a distinct vocal signature that conveys their individuality and personality. With Cross-speaker Emotion Transfer, individuals can now infuse their voices with diverse emotional styles while retaining their authentic sound. This ensures that the transferred emotions do not overshadow or distort the target speaker’s natural voice, creating a seamless and harmonious experience.
Typecast’s utilization of big data to train its AI models exemplifies the incredible potential of machine learning. By harnessing vast quantities of recorded human voices, the system can analyze and understand emotional patterns, tones, and inflections. This enables the AI to accurately emulate and transfer emotions, adapting them to the specific characteristics of the target speaker’s voice. The result is a highly personalized emotional expression that feels natural and genuine.
People can use virtual AI actors for a wide range of purposes, from a YouTube short or a company presentation to voiceover for a feature film, and they can do so at a fraction of the cost and time associated with using a human actor.
With the new My Voice Maker feature available in the Typecast service, users can select various types of emotional speech recorded by someone else and apply that emotional style to their own voice while preserving their unique voice identity. Typecast can achieve this transfer even if the user hasn’t recorded different types of emotional speech. By recording just five minutes of their voice in a normal tone, users can sound happy, sad, angry, and much more.
Similarly, imagine a renowned voice actor who records her voice in a single tone. With this technology, Typecast can transfer someone else’s emotions to her voice, which can then be used to bring a script to life with minimal demand on the actor.