Alexa meets AWS Polly

This project demonstrates how to integrate AWS Polly into an Alexa skill that translates phrases into different languages. Polly is Amazon's text-to-speech cloud service and is a good fit for Alexa skills that need to play back speech in foreign languages.

This project combines the Alexa Skills Kit, AWS Polly and a Translator API to translate common phrases into 17 different languages.

Important note: Polly now provides a dynamic range compression SSML tag and audio stream bitrates that are aligned with Alexa's requirements. This removes the burden of manually converting Polly MP3 output with ffmpeg to comply with Alexa's audio settings and volume requirements. As a result, steps 7 to 9 are no longer necessary.
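As a minimal sketch of the dynamic range compression tag mentioned above: the helper below wraps a translated phrase in SSML using Polly's `<amazon:effect name="drc">` element. The function name `wrap_with_drc` is hypothetical and not taken from this repo, whose Lambda code is a Java Speechlet; Python is used here only for illustration.

```python
from xml.sax.saxutils import escape


def wrap_with_drc(text: str) -> str:
    """Wrap plain text in SSML that requests dynamic range compression.

    Hypothetical helper for illustration; not part of this repo.
    """
    # escape() replaces &, < and > so the phrase cannot break the SSML markup.
    safe_text = escape(text)
    return f'<speak><amazon:effect name="drc">{safe_text}</amazon:effect></speak>'


if __name__ == "__main__":
    print(wrap_with_drc("Dzień dobry"))
```

The resulting string would presumably be sent to Polly as SSML rather than plain text (in boto3, `synthesize_speech` accepts `TextType="ssml"`), with the voice chosen to match the target language.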

  1. The user speaks to an Alexa device and asks, for example, "What is 'Good Morning' in Polish?"

  2. Alexa's NLU triggers the Translate intent and passes in a language slot with the value Polish and a term slot with the value Good Morning. An AWS Lambda function, whose code is contained in this repo, implements a Speechlet that handles the request and returns the translation.

  3. Before the skill calls the translation API and Polly's TTS service, it first looks in its own dictionary, where all previous translations are stored. If it finds a record for Good Morning in Polish in the database, it skips the entire round trip (steps 4 to 9) and uses the S3 audio file referenced in the Dynamo record (step 10 explains how it got there).

  4. However, if Good Morning in Polish has never been translated before, the skill requests the translation from the Microsoft Translator API (or, interchangeably, from Google Translate).

  5. The returned translation is then passed to AWS Polly. Polly responds with an MP3 bitstream with the spoken translation.

  6. The stream is persisted in AWS S3 as an MP3 file.

7.-9. Custom conversion of the Polly MP3 is no longer necessary, as Polly's output now meets Alexa's requirements.

  10. Finally, a record is created for Good Morning in Polish in the Dynamo dictionary. Another record that references the new dictionary entry is created for the user so that Alexa remembers the last translation. This is how a user can ask Alexa to repeat the most recent translation.

  11. The skill creates the output-speech text and embeds an SSML audio tag that points to the MP3 URL.

  12. The output speech is returned to the Alexa device. Alexa speaks and plays back the translated text with one of Polly's voices. A card with the written translation is returned to the Alexa app.
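
The cache-first flow in the steps above can be sketched roughly as follows. This is an illustrative Python outline, not the repo's Java Speechlet code: `translate_phrase`, `build_output_speech`, and the injected `translate`, `synthesize`, and `store_audio` callables are hypothetical names, and plain dicts stand in for the Dynamo dictionary and the per-user record.

```python
from typing import Callable, Dict, Tuple
from xml.sax.saxutils import escape, quoteattr

Key = Tuple[str, str]  # (term, language), both lower-cased


def translate_phrase(
    term: str,
    language: str,
    user_id: str,
    dictionary: Dict[Key, Dict[str, str]],   # stand-in for the Dynamo dictionary
    last_by_user: Dict[str, Key],            # stand-in for the per-user record
    translate: Callable[[str, str], str],    # hypothetical translator client
    synthesize: Callable[[str, str], bytes], # hypothetical Polly wrapper
    store_audio: Callable[[str, bytes], str],  # hypothetical S3 upload, returns URL
) -> Dict[str, str]:
    """Return the dictionary record for a phrase, creating it on a cache miss."""
    key = (term.lower(), language.lower())
    record = dictionary.get(key)
    if record is None:
        translated = translate(term, language)       # step 4
        audio = synthesize(translated, language)     # step 5
        object_key = f"{language}/{term}.mp3".lower().replace(" ", "-")
        url = store_audio(object_key, audio)         # step 6
        record = {"translation": translated, "audio_url": url}
        dictionary[key] = record                     # step 10: dictionary entry
    # Step 10: remember the user's most recent translation so it can be repeated.
    # Assumption: the reference is also refreshed on a cache hit (step 3).
    last_by_user[user_id] = key
    return record


def build_output_speech(term: str, language: str, record: Dict[str, str]) -> str:
    """Step 11: embed the stored MP3 in the response via an SSML audio tag."""
    # quoteattr() escapes the URL and adds the surrounding attribute quotes.
    audio_tag = f"<audio src={quoteattr(record['audio_url'])}/>"
    return f"<speak>{escape(term)} in {escape(language)} is {audio_tag}</speak>"
```

Passing the external services in as callables keeps the sketch runnable without AWS credentials: in a test, simple stub functions can replace the translator, Polly, and S3, which also makes it easy to check that a repeated request skips steps 4 to 6.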