Speech Synthesis Markup Language (SSML) reference (Edify Console)


Speech Synthesis Markup Language (SSML) is a markup syntax you can use to customize text to speech (TTS) vocalization across Edify. 

This article provides a brief overview of SSML, explains how to use it, and includes reference documentation for specific tags and elements that make up SSML.

Overview

Speech Synthesis Markup Language (SSML) is a markup language used to instruct a text to speech (TTS) system how to vocalize text.

The first example below shows a TTS prompt configured with SSML. The second example shows how that text would be vocalized.

SSML markup example

This is an example of <say-as interpret-as="characters">SSML</say-as> used in a <say-as interpret-as="characters">TTS</say-as> prompt.

Result

This is an example of S S M L used in a T T S prompt

In the example above, the <say-as> tags surrounding ‘SSML’ and ‘TTS’ instruct the text-to-speech vocalizer to pronounce each letter individually instead of trying to pronounce the strings as words.

In an Edify workflow, several types of modules use text-to-speech (TTS) technology to vocalize text to a customer.

In the prompts of any of these workflow modules, you can use SSML to change how the text-to-speech vocalization interprets and speaks the words.

Edify supports a wide range of SSML markup tags that you can use to customize how text is vocalized. You can use this markup anywhere Edify uses text-to-speech capabilities.

<break>

The <break> tag is an empty tag that controls pausing or other prosodic boundaries between words. 

Use of this tag is optional. If this element is not present between words, the break is automatically determined based on the linguistic context.

time: Optional. Specifies the length of the break in seconds or milliseconds.

Example

A time break specified in milliseconds:

<break time="200ms"/>

A time break specified in seconds:

<break time="2s"/>

strength: Optional. Specifies the strength of the output’s prosodic break in relative terms.

Each strength value indicates a monotonically non-decreasing (conceptually increasing) break strength between tokens.

Strong boundaries are typically accompanied by pauses.

Valid values include the standard SSML strengths: none, x-weak, weak, medium (the default), strong, and x-strong.

Example

A prosodic strength break:

<break strength="weak"/>

<say-as>

The <say-as> tag lets you indicate information about the type of text construct that is contained within the element. 

It also helps to specify the level of detail for rendering the contained text.

interpret-as: Required. Specifies how the text-to-speech synthesizer should pronounce the tagged text.

The following values are valid:

cardinal: Configures the <say-as> tag to vocalize a number as a cardinal number.

In the example below, the number would be vocalized as “Twelve thousand three hundred forty five” (for US English) or “Twelve thousand three hundred and forty five” (for UK English).

<say-as interpret-as="cardinal">12345</say-as>

characters: Configures the <say-as> tag to spell out the individual letters and numbers (but not symbols) of a word.

In the example below, the word would be spoken as E D I F Y.

<say-as interpret-as="characters">Edify</say-as>

currency: Configures the <say-as> tag to interpret a number as a monetary value. If the language attribute is omitted, it uses the current locale.

In the example below, the word would be spoken as “fifteen dollars and sixty one cents.”

<say-as interpret-as='currency' language='en-US'>$15.61</say-as>

date: Configures the <say-as> tag to say a date with a specified format. The format attribute is a sequence of date field character codes. The supported field character codes are y, m, and d, for year, month, and day (of the month), respectively.

If the field code appears once for year, month, or day, then the number of digits expected are 4, 2, and 2 respectively. If the field code is repeated then the number of expected digits is the number of times the code is repeated.

Fields in the date text may be separated by punctuation and/or spaces.

The detail attribute controls the spoken form of the date. For detail="1", only the day field and one of the month or year fields are required (although both may be supplied). This is the default when fewer than all three fields are given. The spoken form is “The {ordinal day} of {month}, {year}.”

In the example below, the date would be spoken as “The sixth of May, nineteen seventy seven.”

<say-as interpret-as="date" format="yyyymmdd" detail="1">1977-05-06</say-as>

In the example below, the date would be spoken as “The eleventh of August.”

<say-as interpret-as="date" format="dm">11-8</say-as>

For detail="2", the day, month, and year fields are required. This is the default when all three fields are supplied. The spoken form is “{month} {ordinal day}, {year}.”

In the example below, the date would be spoken as “August eleventh, nineteen seventy seven.”

<say-as interpret-as="date" format="dmy" detail="2">11-8-1977</say-as>

expletive: Configures the <say-as> tag to vocalize the tagged text as a bleep, as though it has been censored.

In the example below, the string would be vocalized as a bleep.

<say-as interpret-as="expletive">censor this</say-as>

fraction: Configures the <say-as> tag to vocalize a string as a fraction.

In the example below, the string would be spoken as “five and a half.”

<say-as interpret-as="fraction">5+1/2</say-as>

ordinal: Configures the <say-as> tag to vocalize a number as an ordinal number.

In the example below, the number would be spoken as “First.”

<say-as interpret-as="ordinal">1</say-as>

telephone: Configures the <say-as> tag to vocalize a number as a telephone number.

The telephone value can be accompanied by the optional format attribute, which can be used to indicate a country code. You can retrieve a list of valid country codes from the List of ITU-T Recommendation E.164 Assigned Country Codes published by the International Telecommunication Union.

See the examples below.

<say-as interpret-as="telephone" format="1">(555) 555-5555</say-as>

Note that the inclusion of a country code in the format does not preclude the use of the country code in the phone number.

<say-as interpret-as="telephone" format="1">1-555-555-5555</say-as>

time: Configures the <say-as> tag to vocalize a time value.

In the example below, the text would be spoken as “Two thirty P.M.”

<say-as interpret-as="time" format="hms12">2:30pm</say-as>

unit: Configures the <say-as> tag to use singular or plural units depending on the number.

In the example below, the text would be spoken as “10 feet.”

<say-as interpret-as="unit">10 foot</say-as>

verbatim: Configures the <say-as> tag to spell out the individual letters, numbers, and symbols of a word.

In the example below, the word would be spoken as “E D I F Y dash one”.

<say-as interpret-as="verbatim">Edify-1</say-as>

<p>, <s>

The <p> and <s> tags indicate paragraph and sentence elements, respectively.

The use of these tags is optional. The text-to-speech processor is capable of determining the structure of plain text that doesn’t include explicit paragraph or sentence tags.

However, these tags are useful when used in conjunction with other tags that might change prosody (like <break>, <emphasis>, or <say-as>). Providing these tags helps to explicitly instruct Edify how to process the elements.

In the example below, the two <s> elements represent two sentences organized in a single paragraph <p>.

<p><s>This is sentence one.</s><s>This is sentence two.</s></p>
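These tags also give <break> and other prosody tags an explicit boundary to work against. As an illustrative sketch (the pause length is an arbitrary value, not a recommendation), an explicit break between two tagged sentences might look like this:

<p><s>Your account balance is two hundred dollars.</s><break time="500ms"/><s>Is there anything else I can help you with?</s></p>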

<sub>

The <sub> tag is used to indicate that text contained in the alias attribute should replace the tagged text.

You can also use this tag to provide a simplified pronunciation for a difficult-to-read word.

<sub alias="California">CA</sub>
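You can use the same pattern for the pronunciation use case mentioned above. The respelling in the alias attribute below is an illustrative sketch, not an official pronunciation guide:

<sub alias="cash">cache</sub>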

<prosody>

The <prosody> tag modifies the rate, pitch, and volume of the tagged text.

Note: The <prosody> tag should only be used around a full sentence. Enclosing individual words within a sentence may cause unwanted pauses in speech.

Optional. The rate attribute modifies the speaking rate of the contained text. 

The following standard SSML values are valid: x-slow, slow, medium (the default), fast, and x-fast.

Optional. The volume attribute modifies the volume of the contained text.

The following standard SSML values are valid: silent, x-soft, soft, medium (the default), loud, and x-loud, or a relative change in decibels (for example, +6dB or -6dB).

Optional. The pitch attribute modifies the pitch of the contained text.

There are three ways to modify pitch: with the standard SSML relative terms (x-low, low, medium, high, x-high), with a relative change in semitones (for example, +2st or -2st), or with a relative change as a percentage (for example, +10% or -10%).
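For illustration, a minimal sketch that combines the three attributes around a full sentence, assuming the standard SSML values noted above:

<prosody rate="slow" pitch="-2st" volume="loud">Please listen carefully, because our menu options have changed.</prosody>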

<emphasis>

The <emphasis> tag provides or removes emphasis from the contained text. This functions similarly to <prosody>, but it doesn’t require any individual speech attributes.

Note: The <emphasis> tag should only be used around a full sentence. Enclosing individual words within a sentence may cause unwanted pauses in speech.

The following values are valid: strong, moderate, none, and reduced.

In the example below, the sentence would have moderate emphasis applied to it when vocalized.

<emphasis level="moderate">This is an important announcement</emphasis>

<par>

The <par> tag defines parallel media elements to play simultaneously.

This tag creates a parallel media container that allows you to play multiple media elements at once. The only allowed content is a set of one or more <par>, <seq>, and <media> elements. The order of the <media> elements is not significant.

Unless a child element specifies a different begin time, the implicit begin time for the element is the same as that of the <par> container. If a child element has an offset value set for its begin or end attribute, the element's offset will be relative to the beginning time of the <par> container. 

For the root <par> element, the begin attribute is ignored, and the beginning time is when the SSML speech synthesis process starts generating output for the root <par> element (that is, effectively time "zero").

In the example below, the <par> element contains two <media> elements. Each has an xml:id attribute that uniquely identifies the element. Each also has a begin attribute that defines when to start speaking the text.

In the second element, the begin time is “question.end+2.0s”, meaning that it should start speaking two seconds after the end of the media element with the id “question.”

   <par>

       <media xml:id="question" begin="0.5s">

           <speak>What is the largest land animal?</speak>

       </media>

       <media xml:id="answer" begin="question.end+2.0s">

           <speak>The largest land animal is the African elephant.</speak>

       </media>

   </par>

<seq>

The <seq> tag defines sequential media that plays one after another.

The only allowed content is a set of one or more <seq>, <par>, and <media> elements. The order of the media elements is the order in which they are rendered.

The begin and end attributes of child elements can be set to offset values (see Time Specification below). Those child elements' offset values will be relative to the end of the previous element in the sequence or, in the case of the first element in the sequence, relative to the beginning of its <seq> container.

<seq>

   <media begin="0.5s">

       <speak>What is the largest land animal?</speak>

   </media>

   <media begin="1.0s">

       <speak>The largest land animal is the African elephant</speak>

   </media>

   <media begin="2.0s">

       <speak>What is the second largest land animal?</speak>

   </media>

   <media begin="3.0s">

       <speak>The Asian elephant.</speak>

   </media>

</seq>

<media>

The <media> tag represents a media layer within a <par> or <seq> element. 

Edify supports <speak> elements within <media> tags.

xml:id: Optional. A unique XML identifier for this element. Encoded entities are not supported. The allowed identifier values match the regular expression "([-_#]|\p{L}|\p{D})+".

begin: Optional. The beginning time for this media container. Ignored if this is the root media container element (treated the same as the default of "0"). See the Time specification section below for valid string values.

By default, this value is 0.

end: Optional. A specification for the ending time for this media container. See the Time specification section below for valid string values.

repeatCount: Optional. A real number specifying how many times to insert the media. Fractional repetitions aren't supported, so the value is rounded to the nearest integer. Zero is not a valid value; it is treated as unspecified, and the default value is used in that case.

By default, this value is 1.

repeatDur: Optional. A TimeDesignation that is a limit on the duration of the inserted media. If the duration of the media is less than this value, then playback ends at that time.

soundLevel: Optional. Adjusts the sound level of the audio by soundLevel decibels. The maximum range is +/-40dB, but the actual range may be effectively less, and output quality may not yield good results over the entire range.

By default, this value is +0dB.

fadeInDur: Optional. A TimeDesignation over which the media will fade in from silent to the optionally-specified soundLevel. If the duration of the media is less than this value, the fade in will stop at the end of playback and the sound level will not reach the specified sound level.

By default, this value is 0s.

fadeOutDur: Optional. A TimeDesignation over which the media will fade out from the optionally-specified soundLevel until it is silent. If the duration of the media is less than this value, the sound level is set to a lower value to ensure silence is reached at the end of playback.

By default, this value is 0s.
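For illustration, the sketch below combines several of these attributes inside a <par> container. The attribute names follow the descriptions above, and the xml:id value and timings are arbitrary placeholders:

<par>
    <!-- Speak a greeting slightly louder than normal -->
    <media xml:id="greeting" begin="0s" soundLevel="+2dB">
        <speak>Thank you for calling.</speak>
    </media>
    <!-- Repeat the hold message twice, fading it out at the end -->
    <media begin="greeting.end+1.0s" repeatCount="2" fadeOutDur="0.5s">
        <speak>Please hold for the next available agent.</speak>
    </media>
</par>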

Time specification

A time specification, used for the value of `begin` and `end` attributes of <media> elements and media containers (<par> and <seq> elements), is either an offset value (for example, +2.5s) or a syncbase value (for example, foo_id.end-250ms).

An offset value consists of an optional sign, a digit string with an optional decimal fraction, and an optional unit (h, min, s, or ms). The first digit string is the whole part of the decimal number and the second digit string is the decimal fractional part. The default sign (i.e. "(+|-)?") is "+". The unit values correspond to hours, minutes, seconds, and milliseconds, respectively. The default for the units is "s" (seconds).

A syncbase value references the begin or end time of another element by its xml:id (for example, foo_id.end-250ms), optionally followed by a signed offset. The digits and units in the offset are interpreted in the same way as in an offset value.
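Putting both forms together, the sketch below (reusing a hypothetical xml:id of "question") shows an offset value and a syncbase value on the begin attribute:

<par>
    <!-- Offset value: begin half a second after the <par> container begins -->
    <media xml:id="question" begin="+0.5s">
        <speak>What is the largest land animal?</speak>
    </media>
    <!-- Syncbase value: begin 250 milliseconds before the "question" element ends -->
    <media begin="question.end-250ms">
        <speak>The African elephant.</speak>
    </media>
</par>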

<phoneme>

The <phoneme> tag produces custom pronunciations of words inline. Text-to-Speech accepts the IPA and X-SAMPA phonetic alphabets. 

The example below shows how this tag is used for IPA and X-SAMPA:

<phoneme alphabet="ipa" ph="ˌmænɪˈtoʊbə">manitoba</phoneme>

<phoneme alphabet="x-sampa" ph='m@"hA:g@%ni:'>mahogany</phoneme>

<voice>

The <voice> tag allows you to use more than one voice in a single SSML request. 

In the following example, the default voice is an English male voice. All words will be synthesized in this voice except for "qu'est-ce qui t'amène ici", which will be verbalized in French using a female voice instead of the default language (English) and gender (male).

<speak>And then she asked, <voice language="fr-FR" gender="female">qu'est-ce qui

   t'amène ici</voice><break time="250ms"/> in her sweet and gentle voice.</speak>

Alternatively, you can use a <voice> tag to specify an individual voice rather than specifying a language and/or gender. You can find the BCP-47 code for your language with the List Voices route of the Edify API.

<speak>The dog is friendly <voice name="fr-CA-Wavenet-B">mais le chat est

   mignon</voice><break time="250ms"/> said a pet shop

   owner

</speak>

When you use the <voice> tag, Text-to-Speech expects to receive either a name (the name of the voice you want to use) or a combination of the language, gender, and variant attributes. All three attributes are optional, but you must provide at least one if you don't provide a name.

You can also control the relative priority of the gender, variant, and language attributes using two additional attributes: required and ordering.

Examples of configurations using the required and ordering tags:

<speak>And there it was <voice language="en-GB" gender="male" required="gender"

   ordering="gender language">a flying bird </voice>roaring in the skies for the

   first time.</speak>

  

<speak>Today is supposed to be <voice language="en-GB" gender="female"

   ordering="language gender">Sunday Funday.</voice></speak>

<lang>

You can use the <lang> tag to include text in multiple languages within the same SSML request. All languages will be synthesized in the same voice unless you use the <voice> tag to explicitly change the voice. 

The xml:lang string must contain the target language in BCP-47 format. You can find the BCP-47 code for your language with the List Voices route of the Edify API. In the following example, "chat" will be verbalized in French instead of the default language (English):

<speak>The french word for cat is <lang xml:lang="fr-FR">chat</lang></speak>

Text-to-Speech supports the <lang> tag on a best-effort basis. Not all language combinations produce the same quality results when specified in the same SSML request. In some cases, a language combination might produce an effect that is detectable but subtle or perceived as negative. Known issues: