Amazon is continually evolving its Alexa digital assistant, both in how it responds to queries and commands and in the actual voice that delivers those answers and confirmations. Now, through neural text-to-speech (TTS) technology, Amazon can not only make Alexa's voice sound more natural but also adjust that voice to suit the context and content of a request. This is one way the company keeps Alexa competitive with rivals such as Google Assistant and Samsung's Bixby.
The Amazon scientists working on neural TTS also experimented with a new approach to speech synthesis called direct waveform modeling. It uses deep learning to generate the speech signal directly, producing better intonation, correct emphasis on specific words in a sentence, and better segmental quality.
They are now introducing the first practical application of this approach. If you ask your Alexa-enabled smart speaker or mobile device, "Alexa, what's the latest?", it will change its speaking style to sound like a professional newscaster. To make the delivery more realistic, its speech pattern is designed to place emphasis on the right words and phrases.
If you instead ask, "Alexa, Wikipedia David Beckham", it will read back the information from the site in a neutral speaking style. These are just two examples of how direct waveform modeling will be able to adjust Alexa's speaking style to the context and content of your question.
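Amazon also exposes a newscaster speaking style through its neural TTS engine in Amazon Polly, the public text-to-speech service, where the style is selected with an SSML domain tag. As a rough sketch (the voice name and sample text here are illustrative, not taken from the article), a newscaster-style request could be assembled like this:

```python
# Sketch: building a newscaster-style neural TTS request in the shape that
# Amazon Polly's synthesize_speech API expects. The <amazon:domain name="news">
# SSML tag and the "neural" engine are part of Polly's documented interface;
# the voice and text below are illustrative examples.

def build_newscaster_request(text: str, voice_id: str = "Matthew") -> dict:
    """Return keyword arguments suitable for a Polly synthesize_speech call."""
    ssml = (
        "<speak>"
        '<amazon:domain name="news">'  # switch to the newscaster speaking style
        f"{text}"
        "</amazon:domain>"
        "</speak>"
    )
    return {
        "Engine": "neural",       # the newscaster style requires the neural engine
        "OutputFormat": "mp3",
        "TextType": "ssml",
        "Text": ssml,
        "VoiceId": voice_id,      # assumed example voice; not all voices support the style
    }

request = build_newscaster_request("Here is the latest headline.")
print(request["Text"])
```

The dictionary above would be passed to a boto3 Polly client's `synthesize_speech(**request)`; building it separately keeps the SSML logic testable without AWS credentials.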
Customers in the US will be the first to experience this Alexa update, and if the initial rollout goes well, Amazon will expand it to other territories. We look forward to seeing more practical applications of this approach to neural TTS.