Fwd: [Accessibility] TTS API document + introduction

Peter Korn Peter.Korn at Sun.COM
Wed Mar 8 10:13:59 PST 2006

Hi Bill,
> Janina Sajka wrote On 03/06/06 15:34,:
>> Thanks Olaf--and thanks also to Hynek.
>> Just one item ... We have not set a teleconference for this week
>> Wednesday 8 March. The time slot is certainly available, but I cannot
>> participate as I will be flying back to the U.S. at that time.
> Does this mean we're moving this to March 15?
I think we should.  I believe Willie is at an appointment and may not be 
back in time for our meeting slot today.

I also wonder about whether we should meet on the 15th, or the 22nd.  
The CSUN conference is March 21-25, and at least one of us will be at 
the "What's new in GNOME & Java Accessibility" talk from 10:40 to 
11:40am PT on the 22nd (that'd be me, who is co-presenting that talk).  
And the week before I know some of us will still be busy in conference 

> Bill
>> Olaf Jan Schmidt writes:
>>> Hi!
>>> For those who are not subscribed to accessibility at freedesktop.org I 
>>> am forwarding the latest draft for the joint TTS API that we need 
>>> for reworking kttsd and SpeechDispatcher.
>>> Hynek has written an introduction that summarises our approach. I 
>>> hope it helps our discussion on Wednesday.
>>> Please cc the freedesktop.org list in your comments, because I want 
>>> to make sure that there is at least one place where all the email 
>>> discussion goes.
>>> Olaf
>>> -- 
>>> Olaf Jan Schmidt, KDE Accessibility co-maintainer, open standards 
>>> accessibility networker, Protestant theology student and webmaster 
>>> of http://accessibility.kde.org/ and http://www.amen-online.de/
>> Content-Description: Hynek Hanke <hanke at brailcom.org>: 
>> [Accessibility] TTS API document + introduction
>>> From: Hynek Hanke <hanke at brailcom.org>
>>> To: "Accessibility, Freedesktop" <accessibility at freedesktop.org>
>>> Hello,
>>> here is the latest version of the TTS API document with a new
>>> introduction section trying to summarize the previous private and
>>> public discussions on this topic. Comments are welcomed.
>>> With regards,
>>> Hynek Hanke
>>> Changes
>>> =======
>>> * Introduction was written (clarification of intent, scope)
>>> * Clarification of the meaning of MUST HAVE, SHOULD HAVE
>>> * Point (4.11) was removed as not directly important for accessibility
>>> (after discussions with Willie Walker who requested the point)
>>> * Point (4.13) was removed because its purpose is not clear.
>>>  Even if this functionality is needed, the 's' SSML element is
>>>  not a good way to do it.
>>> * Reformulation of (1.4), added 'temporarily' to (3.2), 'software
>>>  synthesizers' in (4.4), terminology in (4.13),
>>>  clarification in (B.1.3/2) and (B.1.3/3)
>>> Common TTS Driver Interface
>>> ============================
>>> Document version: 2006-03-06
>>> The purpose of this document is to define a common low-level interface
>>> to access the various speech synthesizers on Free Software and Open
>>> Source platforms. It is designed to be used by applications that do
>>> not need the advanced functionality like message management and by
>>> applications providing high-level interfaces (such as Speech
>>> Dispatcher, Gnome Speech, KTTSD etc.)  The purpose of this document is
>>> not to define and force an API on the speech synthesizers. The
>>> synthesizers might use different interfaces that will be handled by
>>> their drivers.
>>> This interface will be implemented by a simple layer integrating
>>> available speech synthesis drivers and in some cases emulating some of
>>> the functionality missing in the synthesizers themselves.
>>> Advanced capabilities not directly related to speech, like message
>>> management, priorities, synchronization etc. are left out of scope for
>>> this low-level interface. They will be dealt with by higher-level
>>> interfaces. (It is desirable to be able to agree on a common
>>> higher-level interface too, but agreeing first on a low-level
>>> interface is an easier task to accomplish.) Such a high-level interface
>>> (not necessarily limited to speech) will make good use of the already
>>> existing low-level interface.
>>> It is desirable that simple applications can use this API in a simple
>>> way. However, the API must also be complex enough so that it doesn't
>>> limit more advanced applications in use of the synthesizers.
>>> The first part (A) of this document describes the requirements
>>> gathered between projects like Gnome Speech, Speech Dispatcher, KTTSD,
>>> Emacspeak and SpeakUp of what they might reasonably expect from speech
>>> synthesis on a system. These requirements are not meant to be the
>>> requirements on the synthesizers, although they might be a guide to
>>> synthesizer authors as they plan future features and capabilities for
>>> their products. Parts (B) and (C) describe the XML/SSML markup in use
>>> and part (D) defines the interface.
>>> Temporary note: The goal of this interface is real implementation in
>>> the foreseeable future.  The next step will be merging the available
>>> engine drivers in the various accessibility projects under this
>>> interface and using this interface. For this reason, we need all
>>> accessibility projects who want to participate in this common effort
>>> to make sure all their requirements on a low-level speech output
>>> interface are met and that such an interface is defined that it is
>>> suitable for their needs.
>>> Temporary note: Any comments about this draft are welcome and
>>> useful. But since the goal of these requirements is real
>>> implementation, we need to avoid endless discussions and keep the
>>> comments focused and to the point.
>>> A. Requirements
>>>  This section defines a set of requirements on the interface and on
>>>  speech synthesizer drivers that need to support assistive
>>>  technologies on free software platforms.
>>>  1. Design Criteria
>>>    The Common TTS Driver Interface requirements will be developed
>>>    within the following broad design criteria:
>>>    1.1. Focus on supporting assistive technologies first.  These
>>>      assistive technologies can be written in any programming language
>>>      and may provide specific support for particular environments such
>>>      as KDE or GNOME.
>>>    1.2. Simple and specific requirements win out over complex and
>>>      general requirements.
>>>    1.3. Use existing APIs and specs when possible.
>>>    1.4. All language dependent functionality with respect to text
>>>     processing for speech synthesis should be covered in the
>>>     synthesizers or synthesis drivers, not in applications.
>>>    1.5. Requirements will be categorized in the following priority
>>>      order: MUST HAVE, SHOULD HAVE, and NICE TO HAVE.
>>>      The priorities have the following meanings with respect
>>>      to the drivers available under this API:
>>>      MUST HAVE: All drivers must satisfy this requirement.
>>>      SHOULD HAVE: The driver will be usable without this feature, but
>>>        it is expected the feature is implemented in all drivers
>>>        intended for serious use.
>>>      NICE TO HAVE: Optional features.
>>>      Regardless of the priority, the full interface will be provided
>>>      by the API, even when the given functionality is actually not
>>>      implemented behind the interface.
>>>    1.6. Requirements outside the scope of this document will be
>>>      labelled as OUTSIDE SCOPE.
>>>    1.7. An application must be able to determine if SHOULD HAVE
>>>      and NICE TO HAVE features are supported for a given driver.
>>>  2. Synthesizer Discovery Requirements
>>>    2.1. MUST HAVE: An application will be able to discover all speech
>>>      synthesizer drivers available to the machine.
>>>    2.2. MUST HAVE: An application will be able to discover all possible
>>>      voices available for a particular speech synthesizer driver.
>>>    2.3. MUST HAVE: An application will be able to determine the
>>>      supported languages, possibly including also a dialect or a
>>>      country, for each voice available for a particular speech
>>>      synthesizer driver.
>>>      Rationale: Knowledge about available voices and languages is
>>>      necessary to select the proper driver and to be able to select a
>>>      supported language or different voices in an application.
>>>    2.4. MUST HAVE: Applications may assume their interaction with the
>>>      speech synthesizer driver doesn't affect other operating system
>>>      components in any unexpected way.
>>>    2.5. OUTSIDE SCOPE: Higher level communication interfaces to the
>>>      speech synthesizer drivers. Exact form of the communication
>>>      protocol (text protocol, IPC etc).
>>>      Note: It is expected they will be implemented by particular
>>>      projects (Gnome Speech, KTTSD, Speech Dispatcher) as wrappers
>>>      around the low-level communication interface defined below.
>>>  3. Synthesizer Configuration Requirements
>>>    3.1. MUST HAVE: An application will be able to specify the default
>>>      voice to use for a particular synthesizer, and will be able to
>>>      change the default voice in between `speak' requests.
>>>    3.2. SHOULD HAVE: An application will be able to specify the default
>>>      prosody and style elements for a voice.  These elements will match
>>>      those defined in the SSML specification, and the synthesizer may
>>>      choose which attributes it wishes to support.  Note that prosody,
>>>      voice and style elements specified in SSML sent as a `speak'
>>>      request will temporarily override the default values.
>>>    3.3. SHOULD HAVE: An application should be able to provide the
>>>      synthesizer with an application-specific pronunciation lexicon
>>>      addenda.  Note that using `phoneme' element in SSML is another way
>>>      to accomplish this on a very localized basis, and will override
>>>      any pronunciation lexicon data for the synthesizer.
>>>      Rationale: This feature is necessary so that the application is
>>>      able to speak artificial words or words with explicitly modified
>>>      pronunciation (e.g. "the word ... is often mispronounced as ...
>>>      by foreign speakers").
>>>    3.4. MUST HAVE: Applications may assume they have their own local
>>>      copy of a synthesizer and voice.  That is, one application's
>>>      configuration of a synthesizer or voice should not conflict with
>>>      another application's configuration settings.
>>>    3.5. MUST HAVE: Changing the default voice or voice/prosody element
>>>      attributes does not affect a `speak' in progress.
>>>  4. Synthesis Process Requirements
>>>    4.1. MUST HAVE: The speech synthesizer driver is able to process
>>>      plain text (i.e. text that is not marked up via SSML) encoded in
>>>      the UTF-8 character encoding.
>>>    4.2. MUST HAVE: The speech synthesizer driver is able to process
>>>      text formatted using extended SSML markup defined in part B of
>>>      this document and encoded in UTF-8.  The synthesizer may choose
>>>      to ignore markup it cannot handle or even to ignore all markup
>>>      as long as it is able to process the text inside the markup.
>>>    4.3. SHOULD HAVE: The speech synthesizer driver is able to properly
>>>      process the extended SSML markup defined in part B of this
>>>      document as SHOULD HAVE. Analogously, properly processing the
>>>      markup defined there as NICE TO HAVE is itself NICE TO HAVE.
>>>    4.4. MUST HAVE: An application must be able to cancel a synthesis
>>>      operation in progress.  In case of hardware synthesizers, or
>>>      synthesizers that produce their own audio, this means cancelling
>>>      the audio output as well.
>>>    4.5. MUST HAVE: The speech synthesizer driver must be able to
>>>      process long input texts in such a way that the audio output
>>>      starts to be available for playing as soon as possible.  An
>>>      application is not required to split long texts into smaller
>>>      pieces.
>>>    4.6. SHOULD HAVE: The speech synthesizer driver should honor the
>>>      Performance Guidelines described below.
>>>    4.7. NICE TO HAVE: It would be nice if a synthesizer were able to
>>>      support "rewind" and "repeat" functionality for an utterance (see
>>>      related descriptions in the MRCP specification).
>>>      Rationale: This allows moving over long texts without the need to
>>>      synthesize the whole text and without losing context.
>>>    4.8. NICE TO HAVE: It would be nice if a synthesizer were able to
>>>      support multilingual utterances.
>>>    4.9. SHOULD HAVE: A synthesizer should support notification of
>>>      `mark' elements, and the application should be able to align
>>>      these events with the synthesized audio.
>>>    4.10. NICE TO HAVE: It would be nice if a synthesizer supported
>>>      "word started" and "word ended" events and allowed alignment of
>>>      the events similar to that in 4.9.
>>>      Rationale: This is useful to update cursor position as a displayed
>>>      text is spoken.
>>>    4.11. REMOVED (not directly important for accessibility)
>>>      The former version: It would be nice if a synthesizer supported
>>>      timing information at the phoneme level and allowed alignment of
>>>      the events similar to that in 4.9.  Rationale: This is useful
>>>      for talking heads.
>>>    4.12. SHOULD HAVE: The application must be able to pause and resume
>>>      a synthesis operation in progress while still being able to handle
>>>      other synthesis requests in the meantime.  In case of hardware
>>>      synthesizers, this means pausing and if possible resuming the
>>>      audio output as well.
>>>    4.13. REMOVED (not clear purpose, the SSML specs do not require
>>>      the 's' element to work this way)
>>>      The synthesizer should not try to split the
>>>      contents of the `s' SSML element into several independent pieces,
>>>      unless required by a markup inside.
>>>      Rationale: An application may have better information about the
>>>      synthesized text and perform its own splitting of sentences.
>>>    4.14. OUTSIDE SCOPE: Message management (queueing, ordering,
>>>      interleaving, etc.).
>>>    4.15. OUTSIDE SCOPE: Interfacing software synthesis with audio
>>>      output.
>>>    4.16. OUTSIDE SCOPE: Specifying the audio format to be used by a
>>>     synthesizer.
>>>   5. Performance Guidelines
>>>     In order to make the speech synthesizer driver actually usable with
>>>     assistive technologies, it must satisfy certain performance
>>>     expectations.  The following text provides a clue to the driver
>>>     implementors to get a rough idea about what is needed in practice.
>>>     Typical scenarios when working with a speech enabled text editor:
>>>     5.1. Typed characters are spoken (echoed).
>>>       Reading of the characters and cancelling the synthesis must be
>>>       very fast, to catch up with a fast typist or even with
>>>       autorepeat.  Consider a typical autorepeat rate of 25 characters
>>>       per second.  Ideally, within each of the 40 ms intervals synthesis
>>>       should begin, produce some audio output and stop.  To perform
>>>       all these actions within 100 ms (considering a fast typist and
>>>       some overhead of the application and the audio output) on
>>>       common hardware is very desirable.
>>>       Appropriate character reading performance may be difficult to
>>>       achieve with contemporary software speech synthesizers, so it may
>>>       be necessary to use techniques like caching of the synthesized
>>>       characters.  Also, it is necessary to ensure there is no initial
>>>       pause ("breathing in") within the synthesized character.
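To make the caching idea in 5.1 concrete, here is a minimal Python sketch. It is only an illustration: `synthesize` is a hypothetical stand-in for a slow driver call (the draft defines no such function), and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

AUTOREPEAT_RATE = 25                    # characters per second, from 5.1
INTERVAL_MS = 1000 / AUTOREPEAT_RATE    # time budget per echoed character


def synthesize(text: str, voice: str) -> bytes:
    """Hypothetical slow driver call; returns fake audio bytes."""
    synthesize.calls += 1
    return f"{voice}:{text}".encode("utf-8")


synthesize.calls = 0  # counts how often the "slow" path is taken


@lru_cache(maxsize=512)  # cache size is an assumption, not from the draft
def cached_char_audio(char: str, voice: str) -> bytes:
    """Return audio for one character, synthesizing only on first use."""
    return synthesize(char, voice)


# Autorepeat of the same key: only the first request reaches the synthesizer.
for ch in "aaaa":
    cached_char_audio(ch, "default")
```

With a cache like this, repeated characters cost a dictionary lookup rather than a synthesis run, which is one way a driver might stay inside the per-character budget.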
>>>    5.2. Moving over words or lines, each of them is spoken.
>>>      The sound sample needn't be available as quickly as in case of the
>>>      typed characters, but it still should be available without clearly
>>>      noticeable delay.  As the user moves over the words or lines, he
>>>      must hear the text immediately.  Cancelling the synthesis of the
>>>      previous word or line must be instant.
>>>    5.3. Reading a large text file.
>>>      In such a case, it is not necessary to start speaking instantly,
>>>      because reading a large text is not a very frequent operation.
>>>      One second long delay at the start is acceptable, although not
>>>      comfortable.  Cancelling the speech must still be instant.
>>> B. XML (extended SSML) Markup in Use
>>>  This section defines the set of XML markup and special
>>>  attribute values for use in input texts for the drivers.
>>>  The markup consists of two namespaces: 'SSML' (default)
>>>  and 'tts', where 'tts' introduces several new attributes
>>>  to be used with the 'say-as' element and a new element
>>>  'style'.
>>>  If an SSML element is supported, all its mandatory attributes
>>>  by the definition of SSML 1.0 must be supported even if they
>>>  are not explicitly mentioned in this document.
>>>  This section also defines which functions the API
>>>  needs to provide for default prosody, voice and style settings,
>>>  according to (3.2).
>>>  Note: According to available information, SSML is not known
>>>  to suffer from any IP issues.
>>>  B.1. SHOULD HAVE: The following elements are supported
>>>     speak
>>>     voice
>>>     prosody
>>>     say-as
>>>  B.1.1. These SPEAK attributes are supported
>>>     1 (SHOULD HAVE): xml:lang
>>>  B.1.1. These VOICE attributes are supported
>>>     1 (SHOULD HAVE):  xml:lang
>>>     2 (SHOULD HAVE):  name
>>>     3 (NICE TO HAVE): gender
>>>     4 (NICE TO HAVE): age
>>>     5 (NICE TO HAVE): variant
>>>  B.1.2. These PROSODY attributes are supported
>>>     1 (SHOULD HAVE): pitch  (with +/- %, "default")
>>>     2 (SHOULD HAVE): rate   (with +/- %, "default")
>>>     3 (SHOULD HAVE): volume (with +/- %, "default")
>>>     4 (NICE TO HAVE): range  (with +/- %, "default")
>>>     5 (NICE TO HAVE): 'pitch', 'rate', 'range'
>>>              with absolute value parameters
>>>   Note: The corresponding global relative prosody settings
>>>   commands (not markup) in TTS API represent the percentage
>>>   value as a percentage change with respect to the default
>>>   value for the given voice and parameter, not with respect
>>>   to previous settings.
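The note on relative prosody settings can be illustrated with a short Python sketch. The class and method names are hypothetical; the only point shown is that each percentage is applied to the voice default, so repeated calls do not compound.

```python
class ProsodySetting:
    """Sketch of one prosody parameter (e.g. rate) for a given voice."""

    def __init__(self, default: float):
        self.default = default   # default value for this voice and parameter
        self.current = default

    def set_relative(self, percent_change: float) -> None:
        # Always computed from the default, never from the previous setting.
        self.current = self.default * (1.0 + percent_change / 100.0)


rate = ProsodySetting(default=200.0)
rate.set_relative(50)   # 50% above the default
rate.set_relative(50)   # still 50% above the default, not 125% above
```

Under this reading, sending "+50%" twice leaves the rate at 300.0 rather than 450.0.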
>>>  B.1.3. The SAY-AS attribute 'interpret-as'
>>>     is supported with the following values
>>>     1 (SHOULD HAVE) characters
>>>         The format 'glyphs' is supported.
>>>     Rationale: This provides capability for spelling.
>>>     2 (SHOULD HAVE) tts:char
>>>         Indicates the content of the element is a single
>>>     character and it should be pronounced as a character.
>>>     The element's contents (CDATA) should only contain
>>>     a single character.
>>>     This is different than the interpret-as value "characters"
>>>     described in B.1.3.1. While "characters" is intended
>>>     for spelling words and sentences, "tts:char" means
>>>     pronouncing the given character (which might be subject
>>>     to different settings, as for example using sound icons to
>>>     represent symbols).   
>>>     If more than one character is present as the contents
>>>     of the element, this is considered an error.
>>>     Example:
>>>     <speak>
>>>     <say-as interpret-as="tts:char">@</say-as>
>>>     </speak>       
>>>     Rationale: It is useful to have a separate attribute
>>>     for "single characters" as this can be used in TTS
>>>     configuration to distinguish the situation when
>>>     the user is moving the cursor over characters
>>>     from the situation of spelling, as well as in other
>>>     situations where the concept of "single character"
>>>     has some logical meaning.
>>>     3 (SHOULD HAVE) tts:key
>>>         The content of the element should be interpreted
>>>     as the name of a keyboard key or combination of keys. See
>>>     section (C) for possible string values of content of this
>>>     element. If a string is given which is not defined in section
>>>     (C), the behavior of the synthesizer is undefined.
>>>     Example:
>>>     <speak>
>>>     <say-as interpret-as="tts:key">shift_a</say-as>
>>>     </speak>
>>>     4 (NICE TO HAVE) tts:digits
>>>         Indicates the content of the element is a number.
>>>     The attribute "detail" is supported and can take a numerical
>>>     value, meaning how many digits the synthesizer should group
>>>     for reading. The value of 0 means the number should be
>>>     pronounced as a whole appropriate for the language, while any
>>>     non-zero value means that groups of that many digits should be
>>>     formed for reading, starting from the left.
>>>     Example: The string "5431721838" would normally be read
>>>     as "five billion four hundred thirty one million ..." but
>>>     when enclosed in the above say-as with detail set to 3, it
>>>     would be read as "five hundred forty three, one hundred
>>>     seventy two etc." or "five, four, three, one etc." with
>>>     detail 1.
>>>     Note: This is an extension to SSML not defined in the
>>>     format itself, introduced under the namespace 'tts' (as
>>>     allowed in SSML 'say-as' specifications).
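The grouping rule for the "detail" attribute of tts:digits can be sketched in a few lines of Python. The function name is hypothetical, and the sketch only forms the groups; how each group is then spoken is left to the synthesizer.

```python
def group_digits(number: str, detail: int) -> list[str]:
    """Split a digit string into groups of `detail` digits from the left.

    detail == 0 means "read the number as a whole", so one group is returned.
    """
    if not (number.isascii() and number.isdigit()):
        raise ValueError("content must consist of digits only")
    if detail < 0:
        raise ValueError("detail must be zero or positive")
    if detail == 0:
        return [number]
    return [number[i:i + detail] for i in range(0, len(number), detail)]


groups_of_three = group_digits("5431721838", 3)
single_digits = group_digits("5431721838", 1)
```

Because grouping starts from the left, the last group may be shorter than the others, as with the trailing "8" when detail is 3.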
>>>  B.2. NICE TO HAVE: The following elements are supported
>>>     mark
>>>     s
>>>     p
>>>     phoneme
>>>     sub
>>>  B.2.1. NICE TO HAVE: These P attributes are supported:
>>>     1 xml:lang
>>>  B.2.2. NICE TO HAVE: These S attributes are supported:
>>>     1 xml:lang
>>>  B.3. SHOULD HAVE: An element `tts:style' (not defined in SSML 1.0)
>>>     is supported.
>>>     This element can occur anywhere inside the SSML document.
>>>     It may contain all SSML elements except the element 'speak'
>>>     and it may also contain the element 'tts:style'.
>>>     It has two mandatory attributes 'field'
>>>     and 'mode' and an optional string attribute 'detail'. The
>>>     attribute 'field' can take the following values
>>>         1) punctuation
>>>         2) capital_letters
>>>     defined below.
>>>     If the attribute 'field' is set to 'punctuation',
>>>     the 'mode' attribute can take the following values
>>>         1) none
>>>         2) all
>>>         3) (NICE TO HAVE) some
>>>     When set to 'none', no punctuation characters are explicitly
>>>     indicated. When it is set to 'all', all punctuation characters
>>>     in the text should be indicated by the synthesizer.  When
>>>     set to 'some', the synthesizer will pronounce those
>>>     punctuation characters enumerated in the additional attribute
>>>     'detail' or, if no 'detail' attribute is specified, only
>>>     those characters selected by its own settings.
>>>     The attribute detail takes the form of a string containing
>>>     the punctuation characters to read.
>>>     Example:
>>>     <tts:style field="punctuation" mode="some" detail=".?!">
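As a rough illustration of the three punctuation modes, here is a small Python sketch. The function name and the `driver_default` parameter are hypothetical, and using `string.punctuation` (ASCII only) to stand for "all punctuation characters" is a simplifying assumption; a real driver would need a language-aware set.

```python
import string


def punctuation_to_indicate(text, mode, detail=None, driver_default=".?!"):
    """Return the punctuation characters in `text` that should be indicated."""
    if mode == "none":
        return []
    if mode == "all":
        chosen = set(string.punctuation)   # assumption: ASCII punctuation only
    elif mode == "some":
        # Without 'detail', fall back to the driver's own settings.
        chosen = set(detail if detail is not None else driver_default)
    else:
        raise ValueError(f"unknown punctuation mode: {mode!r}")
    return [ch for ch in text if ch in chosen]


indicated = punctuation_to_indicate("Hi, there!", "some", detail=".?!")
```

Here only the exclamation mark is indicated, because the comma is not listed in 'detail'.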
>>>     If the attribute 'field' is set to 'capital_letters',
>>>     the 'mode' attribute can take the following values
>>>         1) no
>>>         2) spelling
>>>         3) (NICE TO HAVE) icon
>>>         4) (NICE TO HAVE) pitch
>>>     When set to 'no', capital letters are not explicitly
>>>     indicated. When set to 'spelling', capital letters are
>>>     spelled (e.g. "capital a"). When set to 'icon', a sound
>>>     is inserted before the capital letter, possibly leaving
>>>     the letter/word/sentence intact. When set to 'pitch',
>>>     the capital letter is pronounced with a higher pitch,
>>>     possibly leaving the letter/word/sentence intact.
>>>     Rationale: These are basic capabilities well established
>>>     in accessibility. However, SSML does not support them. 
>>>     Introducing this additional element does not break the
>>>     possibility of outside applications to send valid SSML
>>>     into TTS API.
>>>  B.4. NICE TO HAVE: Support for the rest of elements and attributes
>>>     defined in SSML 1.0. However, this is of lower priority than
>>>     the enumerated subset above.
>>>  Open Issue: In many situations, it will be desirable to
>>>   preserve whitespace characters in the incoming document.
>>>   Should we require the application to use the 'xml:space'
>>>   attribute for the speak element or should we state 'preserve'
>>>   is the default value for 'xml:space' in the root 'speak'
>>>   element in this case?
>>> C. Key names
>>> A key name may contain any character excluding control characters (the
>>> characters in the range 0 to 31 in the ASCII table and other
>>> ``invisible'' characters), spaces, dashes and underscores.
>>>  C.1 The recognized key names are:
>>>   1) Any single UTF-8 character, excluding the exceptions defined
>>>      above.
>>>   2) Any of the symbolic key names defined below.
>>>   3) A combination of key names defined below using the
>>>     '_' (underscore) character for concatenation.
>>>   Examples of valid key names:
>>>     A
>>>     shift_a
>>>     shift_A
>>>     $
>>>     enter
>>>     shift_kp-enter
>>>     control
>>>     control_alt_delete
>>>  C.2 List of symbolic key names
>>>  C.2.1 Escaped keys
>>>     space
>>>     underscore
>>>     dash
>>>  C.2.2 Auxiliary Keys
>>>     alt
>>>     control
>>>     hyper
>>>     meta
>>>     shift
>>>     super
>>>  C.2.3 Control Character Keys
>>>     backspace
>>>     break
>>>     delete
>>>     down
>>>     end
>>>     enter
>>>     escape
>>>     f1
>>>     f2 ... f24
>>>     home
>>>     insert
>>>     kp-*
>>>     kp-+
>>>     kp--
>>>     kp-.
>>>     kp-/
>>>     kp-0
>>>     kp-1
>>>     kp-2 ... kp-9
>>>     kp-enter
>>>     left
>>>     menu
>>>     next
>>>     num-lock
>>>     pause
>>>     print
>>>     prior
>>>     return
>>>     right
>>>     scroll-lock
>>>     space
>>>     tab
>>>     up
>>>     window
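The key name rules in section C could be checked along these lines. This is a Python sketch under stated assumptions: the function names are hypothetical, and "invisible" characters are approximated by the Unicode "C" (control/format/unassigned) categories plus whitespace.

```python
import unicodedata

ESCAPED = {"space", "underscore", "dash"}
AUXILIARY = {"alt", "control", "hyper", "meta", "shift", "super"}
CONTROL = {
    "backspace", "break", "delete", "down", "end", "enter", "escape",
    "home", "insert", "left", "menu", "next", "num-lock", "pause",
    "print", "prior", "return", "right", "scroll-lock", "tab", "up",
    "window", "kp-enter",
}
CONTROL |= {f"f{n}" for n in range(1, 25)}            # f1 ... f24
CONTROL |= {f"kp-{c}" for c in "*+-./0123456789"}     # keypad keys
SYMBOLIC = ESCAPED | AUXILIARY | CONTROL


def is_single_key_char(part: str) -> bool:
    """One visible character that is not a space, dash or underscore."""
    if len(part) != 1 or part in " -_" or part.isspace():
        return False
    return not unicodedata.category(part).startswith("C")


def is_valid_key_name(name: str) -> bool:
    """Accept a key name or an underscore-joined combination of key names."""
    return all(p in SYMBOLIC or is_single_key_char(p)
               for p in name.split("_"))
```

Since space, dash and underscore cannot appear as single characters, they have to be written with their escaped names, for example `control_dash`.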
>>> D. Interface Description
>>>  This section defines the low-level TTS driver interface for use by
>>>  all assistive technologies on free software platforms.
>>>  1. Speech Synthesis Driver Discovery
>>>    ...
>>>  2. Speech Synthesis Driver Interface
>>>  ...
>>>  Open Issue: There is still no clear consensus on how to return the
>>>     synthesized audio data (if at all).  The main issue here is
>>>     mostly with how to align marker and other time-related events
>>>     with the audio being played on the audio output device.
>>>  Proposal: There will be 2 possible ways to do it. The synthesized
>>>     data can be returned to the application (case A) or the
>>>     application can ask for the data to be played on the audio
>>>     device (which will not be the task of TTS API, but will be
>>>     handled by another API) (case B).
>>>     In (case A), each time the application gets a piece of audio
>>>     data, it also gets a time-table of index marks and events
>>>     in that piece of data. This will be done on a separate socket
>>>     in asynchronous mode. (This is possible for software
>>>     synthesizers only, however.)
>>>     In (case B), the application will get asynchronous callbacks
>>>     (they might be realized by sending a defined string over
>>>     a socket, by calling a callback function or in some other
>>>     way -- the particular way of doing it is considered an
>>>     implementation detail).
>>>     Rationale: Both approaches are useful in different situations
>>>     and each of them provides some capability that the other one
>>>     doesn't.
>>>  Open Issue: Will the interaction with the driver be synchronous
>>>     or asynchronous?  For example, will a call to `speak'
>>>     wait to return until all the audio has been processed?  If
>>>     not, what happens when a call to "speak" is made while the
>>>     synthesizer is still processing a prior call to "speak?"
>>>  Proposal: With the exception of events and index marks signalling,
>>>     the communication will be synchronous. When a speak request
>>>     is issued while the synthesizer is still processing a prior call to speak
>>>     and the application didn't call pause before, this is
>>>     considered an error.
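One possible reading of this proposal, written as a small Python state sketch. All names are hypothetical (section D does not define the interface yet), and no real synthesis or audio is involved; it only shows when a second `speak` would be rejected.

```python
class SpeakInProgressError(RuntimeError):
    """Raised when speak() is called while a prior speak is still active."""


class SketchDriver:
    def __init__(self):
        self.active = None   # text currently being synthesized, if any
        self.paused = None   # text set aside by pause(), if any

    def speak(self, text: str) -> None:
        if self.active is not None:
            raise SpeakInProgressError(
                "prior speak still in progress; call pause() first")
        self.active = text

    def pause(self) -> None:
        if self.active is None:
            raise RuntimeError("nothing to pause")
        self.paused, self.active = self.active, None

    def finish(self) -> None:
        """Simulate the synthesizer completing the active request."""
        self.active = None


driver = SketchDriver()
driver.speak("first message")
driver.pause()                     # required before a new speak request
driver.speak("second message")     # accepted because the first is paused
```

This matches 4.12 in that a paused request does not block other synthesis requests, while an unpaused one does.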
>>> E. Related Specifications
>>>    SSML: http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/
>>>          (see requirements at the following URL:
>>> http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/#ref-reqs)
>>>    SSML 'say-as' element attribute values:
>>>       http://www.w3.org/TR/2005/NOTE-ssml-sayas-20050526/
>>>    MRCP: http://www.ietf.org/html.charters/speechsc-charter.html
>>> F. Copying This Document
>>>  Copyright (C) 2006 ...
>>>  This specification is made available under a BSD-style license ...
