"Voice Browser" Activity — Voice enabling the Web!

NEWS

Answers for some frequently asked questions can be found at the end of this page.

Introduction

W3C is working to expand access to the Web to allow people to interact via key pads, spoken commands, listening to prerecorded speech, synthetic speech and music. This will allow any telephone to be used to access appropriately designed Web-based services, and will be a boon to people with visual impairments, or to anyone who needs Web access while keeping their hands and eyes free for other things. It will also allow effective interaction with display-based Web content in cases where the mouse and keyboard may be missing or inconvenient.

To fulfill this goal, the W3C Voice Browser Working Group (Members only) is defining a suite of markup languages covering dialog, speech synthesis, speech recognition, call control and other aspects of interactive voice response applications. Specifications such as the Speech Synthesis Markup Language, Speech Recognition Grammar Specification, and Call Control XML are core technologies for describing speech synthesis, recognition grammars, and call control constructs respectively. VoiceXML is a dialog markup language that leverages the other specifications for creating dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key (touch tone) input, recording of spoken input, telephony, and mixed initiative conversations.

These specifications bring the advantages of web-based development and content delivery to interactive voice response applications. Further work is anticipated on enabling their use with other W3C markup languages such as XHTML, XForms and SMIL. This will be done in conjunction with other W3C Working Groups, including the Multimodal Interaction Activity.

Possible applications range from telephone access to business and public information services, through next-generation call centers and voice portals, to hands-free and eyes-free use of the Web, for instance in automobiles.

We have set up a public mailing list for discussion of voice browsers and our work in this area. To subscribe send an email to www-voice-request@w3.org with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). The archive for the list is accessible online. Note: to post a message to the list, you first need to subscribe. This is an anti-spam measure.

This page will give you an introduction to each of the areas the Working Group is addressing and plans for future work.

Current Situation

W3C's work on voice browsers originally started in the context of making the Web accessible to more of us, more of the time. In October 1998, W3C organized a workshop on "Voice Browsers". The workshop brought together people involved in developing voice browsers for accessing Web based services. The workshop concluded that the time was ripe for W3C to bring together interested parties to collaborate on the development of joint specifications for voice browsers. As a response, an activity proposal and charter were written to establish a W3C "Voice Browser" Activity and Working Group (members only).

Following review by W3C members, this activity was initially established on 26 March 1999, and was subsequently rechartered on 25 September 2002 as a royalty-free working group under the terms of the W3C's Current Patent Practice note. The W3C staff contact and activity lead is Dave Raggett (W3C/Openwave). The Working Group is chaired by Jim Larson (Intel) and Scott McGlashan (PipeBeach).

Work under development

This is intended to give you a brief summary of each of the major work items under development by the Voice Browser working group. The suite of specifications is known as the W3C Speech Interface Framework.

The top priority work items cover dialog (VoiceXML 2.0), speech recognition grammar (SRGS), speech synthesis (SSML), semantic interpretation, and call control (CCXML).

Lower priority work items are currently inactive. Work on these is likely to be resumed as higher priority items are completed. The lower priority items cover: pronunciation lexicon, stochastic grammars (N-Grams) and voice browser interoperation.

In late 2002 we intend to begin collecting requirements for the next version of the dialog markup language. We will also work closely with the Multimodal Interaction activity to provide support for speech within multimodal applications.

VoiceXML 2.0

VoiceXML 2.0 is designed, based upon extensive industry experience, for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed initiative conversations. For an introduction, here is a tutorial. Further tutorials and other resources can be found on the VoiceXML Forum website. W3C and the VoiceXML Forum have signed a memorandum of understanding setting out mutual goals.
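
As a rough illustration of the style of the language, the sketch below shows a minimal VoiceXML 2.0 document that asks the caller a single question, constrains recognition with an external grammar, and echoes the answer back. The prompt wording, grammar URI and server URL are invented for this example.

    <?xml version="1.0" encoding="UTF-8"?>
    <vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
      <form id="order">
        <field name="drink">
          <!-- Synthesized prompt played to the caller -->
          <prompt>Would you like coffee or tea?</prompt>
          <!-- External SRGS grammar constraining what the recognizer listens for -->
          <grammar src="drink.grxml" type="application/srgs+xml"/>
          <filled>
            <!-- Echo the recognized value back, then submit it to the server -->
            <prompt>You said <value expr="drink"/>.</prompt>
            <submit next="http://www.example.com/order" namelist="drink"/>
          </filled>
        </field>
      </form>
    </vxml>

The <form> and <field> elements reflect VoiceXML's form filling metaphor: the interpreter prompts for each unfilled field in turn and executes the <filled> block once a value has been recognized.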

Speech Recognition Grammar (SRGS)

The speech recognition grammar specification covers both speech and DTMF (touch tone) input. DTMF is valuable in noisy conditions or when the social context makes it awkward to speak. Grammars can be specified in either an XML or an equivalent augmented BNF (ABNF) syntax, which some authors may find easier to deal with. Speech recognition is an inherently uncertain process. Some speech engines may be able to ignore "um's" and "aah's", and to perform partial matches. Recognizers may report confidence values. If the utterance has several possible parses, the recognizer may be able to report the most likely alternatives (n-best results).
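
For illustration, here is a sketch of the same tiny grammar in both syntaxes. The rule name and word list are invented for this example, and the header details follow the SRGS drafts. In the XML syntax:

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="drink" mode="voice">
      <rule id="drink" scope="public">
        <one-of>
          <item>coffee</item>
          <item>tea</item>
          <item>milk</item>
        </one-of>
      </rule>
    </grammar>

and in the equivalent ABNF syntax:

    #ABNF 1.0;
    language en-US;
    mode voice;
    root $drink;
    public $drink = coffee | tea | milk;

Either form can be referenced from the <grammar> element of a VoiceXML document.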

Speech Synthesis (SSML)

The Speech Synthesis specification defines a markup language for prompting users via a combination of prerecorded speech, synthetic speech and music. You can select voice characteristics (name, gender and age) and the speed, volume, pitch, and emphasis. There is also provision for overriding the synthesis engine's default pronunciation.
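
As a sketch of how these controls appear in markup (the voice attributes, audio file name and substitution text are invented for this example, and the details follow the SSML drafts):

    <?xml version="1.0" encoding="UTF-8"?>
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xml:lang="en-US">
      <!-- Select a voice by gender and age -->
      <voice gender="female" age="30">
        <!-- Adjust speaking rate and volume, and emphasize the time -->
        <prosody rate="slow" volume="loud">
          Your flight departs at <emphasis>nine thirty</emphasis>.
        </prosody>
      </voice>
      <!-- Mix in a prerecorded audio clip -->
      <audio src="chime.wav"/>
      <!-- Override the engine's default rendering of an abbreviation -->
      Welcome to the <sub alias="World Wide Web Consortium">W3C</sub>.
    </speak>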

The Voice Browser working group is collaborating with the CSS working group to develop a CSS3 module for speech synthesis based upon SSML for use in rendering XML documents to speech. This is intended to replace the aural cascading style sheet properties in CSS2. The first working draft is expected in late 2002.

Semantic Interpretation

The semantic interpretation specification describes annotations to grammar rules for extracting the semantic results from recognition. The annotations are expressed in a syntax based upon a subset of ECMAScript, and when evaluated, yield a result represented either as XML or as a value that can be held in an ECMAScript variable. The target for the XML output is EMMA (Extensible Multimodal Annotation Markup Language) which is being developed in the Multimodal Interaction activity.
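
As a sketch, the annotations appear as <tag> elements inside grammar rules; the exact tag syntax has varied between drafts, so the details below are illustrative only:

    <rule id="drink">
      <one-of>
        <!-- Both "coffee" and the colloquial "java" yield the same semantic value -->
        <item>coffee <tag>out = "coffee";</tag></item>
        <item>java <tag>out = "coffee";</tag></item>
        <item>tea <tag>out = "tea";</tag></item>
      </one-of>
    </rule>

When the rule matches, the evaluated result can be returned to the dialog as an ECMAScript value or serialized as XML for consumption by EMMA-aware components.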

Call Control (CCXML)

W3C is working on markup to enable fine-grained control of speech (signal processing) resources and telephony resources in a VoiceXML telephony platform. The scope of these language features is controlling resources in a platform on the network edge, not building network-based call processing applications in a telephone switching system, nor controlling an entire telecom network. These components are designed to integrate naturally with existing language elements for defining applications which run in a voice browser framework. This will enable application developers to use markup to perform call screening, whisper call waiting, call transfer, and more. Users can be offered the ability to place outbound calls, to answer calls conditionally, and to initiate or receive further communications such as another call.
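
For illustration, the sketch below answers an incoming call, hands it to a VoiceXML dialog, and ends the session when the caller hangs up. Element and event names follow the CCXML working drafts and may differ in detail, and the dialog document name is invented.

    <?xml version="1.0" encoding="UTF-8"?>
    <ccxml version="1.0">
      <eventprocessor>
        <!-- An incoming call is alerting: answer it -->
        <transition event="connection.alerting">
          <accept/>
        </transition>
        <!-- The call has been answered: hand it to a VoiceXML dialog -->
        <transition event="connection.connected">
          <dialogstart src="'greeting.vxml'"/>
        </transition>
        <!-- The caller hung up: end the CCXML session -->
        <transition event="connection.disconnected">
          <exit/>
        </transition>
      </eventprocessor>
    </ccxml>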

Future work on dialog markup

Work is expected to start in late 2002 on collecting requirements for the next version of the dialog markup language.

Lower Priority Work Items

The Pronunciation Lexicon specification will describe phonetic information for use in speech recognition and synthesis. The requirements were published on 12 March 2001.

Stochastic Grammars (N-Grams) are used for open-ended prompts ("How can I help?") where context-free grammars would be unwieldy. N-Gram models cover the likelihood that one word will occur after certain other words. Such models are widely used for dictation systems, but can also be combined with word-spotting rules that determine how to route a help desk call, etc. The current draft N-Gram specification was published on 3 January 2001.

Voice Browser Interoperation describes the means to convey the context when transferring the user from one voice browser to another. In a related scenario, the user could start with a visual interaction on a cell phone and follow a link to switch to a VoiceXML application. The ability to transfer a session identifier makes it possible for the Voice Browser application to pick up user preferences and other data entered into the visual application. Finally, the user could transfer from a VoiceXML application to a customer service agent. The agent needs the ability to use their console to view information about the customer, as collected during the preceding VoiceXML application. The ability to transfer a session identifier can be used to retrieve this information from the customer database. The requirements were published on 8 August 2002.

Frequently asked questions

Far more people today have access to a telephone than have access to a computer with an Internet connection. In addition, sales of cellphones are booming, so that many of us have already or soon will have a phone within reach wherever we go. Voice Browsers offer the promise of allowing everyone to access Web-based services from any phone, making it practical to access the Web any time and anywhere, whether at home, on the move, or at work.

It is common for companies to offer services over the phone via menus traversed using the phone's keypad. Voice Browsers offer a great fit for the next generation of call centers, which will become Voice Web portals to the company's services and related websites, whether accessed via the telephone network or via the Internet. Users will be able to choose whether to respond by a key press or a spoken command. Voice interaction holds the promise of naturalistic dialogs with Web-based services.

Voice browsers allow people to access the Web using speech synthesis, pre-recorded audio, and speech recognition. This can be supplemented by keypads and small displays. Voice may also be offered as an adjunct to conventional desktop browsers with high resolution graphical displays, providing an accessible alternative to using the keyboard or screen, for instance in automobiles where hands-free and eyes-free operation is essential. Voice interaction can escape the physical limitations on keypads and displays as mobile devices become ever smaller.

Hitherto, speech recognition and spoken language technologies have, for the most part, had to be handcrafted into applications. The Web offers the potential to vastly expand the opportunities for voice-based applications. The Web page provides the means to scope the dialog with the user, limiting interaction to navigating the page, traversing links and filling in forms. In some cases, this may involve the transformation of Web content into formats better suited to the needs of voice browsing. In others, it may prove effective to author content directly for voice browsers.

Information supplied by authors can increase the robustness of speech recognition and the quality of speech synthesis. Text to speech can be combined with pre-recorded audio material in an analogous manner to the use of images in visual media, drawing upon experience with radio broadcasting. The lessons learned in designing for accessibility can be applied to the broader voice browsing marketplace, making it practical to deliver services to a wide range of platforms.

Q1. Why not just use HTML instead of inventing a new language for voice-enabled web applications?

A1. HTML was designed as a visual language, with emphasis on visual layout and appearance. Voice interfaces are much more dialog oriented, with emphasis on verbal presentation and response. Rather than bloating HTML with additional features and elements, new markup languages were designed specifically for speech dialogs.

Q2. How does the W3C Voice Browser Working Group relate to the VoiceXML Forum?

A2. The VoiceXML Forum developed the dialog language VoiceXML 1.0, which it submitted to the W3C Voice Browser Working Group. The Voice Browser Working Group used that specification as the basis for VoiceXML 2.0. In addition, the Voice Browser Working Group has augmented VoiceXML 2.0 with the Speech Recognition Grammar Specification and the Speech Synthesis Markup Language. The VoiceXML Forum provides educational, marketing, and conformance testing services. The two groups have a good working relationship, and work closely together to enhance the ability of developers to create web-based voice applications. Both organizations have signed a memorandum of understanding setting out the goals of both parties.

Q3. What are the differences between VoiceXML, VXML, VoXML, and all the other voice mark up languages?

A3. Historically, different speech companies created their own voice markup languages with different names. As companies integrated languages together, new names were given to the integrated languages. IBM's original language was SpeechML. AT&T and Lucent both had a language called PML (Phone Markup Language), but each with a different syntax. Motorola's original language was VoxML. IBM, AT&T, Lucent, and Motorola formed the VoiceXML Forum and created VoiceXML (briefly known as VXML). HP's Corporate Research Labs created TalkML. The World Wide Web Consortium's Voice Browser Working Group has specified VoiceXML 2.0, based upon extensive industry experience with VoiceXML 1.0.

Q4. What is the relationship between VoiceXML 2.0 and SALT?

A4. VoiceXML and SALT are both markup languages. VoiceXML incorporates speech interface, data and control flow. SALT focuses on the speech interface. VoiceXML is a standalone language while SALT must be embedded into other languages, such as XHTML. VoiceXML provides a form filling metaphor as the basis for spoken dialogs. SALT enables application developers to write their own dialog flow using lower level primitives.

Q5. Will W3C be pursuing SALT?

A5. The SALT 1.0 specification was contributed to the Voice Browser and Multimodal Interaction working groups on 31 July 2002. Which ideas are taken up will depend on the outcome of the consensus process used by the Working Groups. W3C Members can view the contribution letter.

Q6. Will WAP and VoiceXML ever be integrated into a single language for specifying a combined verbal/visual interface?

A6. Different standards bodies defined the Wireless Markup Language (WML) and VoiceXML 2.0. A joint W3C/WAP workshop was held in September 2000 to address this question. Some difficult obstacles to integration were identified, including differences in architecture (WAP is a client-based browser, VoiceXML a server-based browser), as well as differences in language philosophy and style. The workshop adopted the "Hong Kong Manifesto", which basically states that a new W3C working group should be created to address this area and coordinate activities to develop multimodal specifications supporting both visual and verbal user interfaces. The "Hong Kong Manifesto" was subsequently approved by the W3C Voice Browser Working Group. W3C has now started a Multimodal Interaction Activity to work in this area.

Q7. What is the difference between VoiceXML 2.0 and SMIL?

A7. Synchronized Multimedia Integration Language (SMIL, pronounced "smile") is a presentation language that coordinates the presentation of multiple visual and audio outputs to the user. VoiceXML 2.0 coordinates input from the user and output to the user. Eventually the presentation capabilities of SMIL should be integrated with the output capabilities of VoiceXML 2.0.

Q8. Where can I find specifications of the W3C Voice Browser Activity and how do I provide feedback to the W3C Voice Browser Working Group?

A8. The page to look at is http://www.w3.org/Voice/. You can find links to all of the published drafts and additional background material. Comments and feedback may be e-mailed to www-voice@w3.org, but you have to subscribe first (an anti-spam measure).

Q9. What speech applications cannot currently be supported by the W3C Speech Interface Framework?

A9. While the W3C Speech Interface Framework and its associated languages support a wide range of speech applications in which the user and computer speak with each other, there are several specialized classes of applications requiring greater control of the speech synthesizer and speech recognizer than the current languages support. The Speech Recognition Grammar Specification does not currently support the fine granularity necessary for detecting speech disfluencies in disabled or non-native speakers, which may be required for "learn to speak" applications. There are currently no mechanisms to synchronize a talking head with synthesized speech. The Speech Synthesis Markup Language is not able to specify melodies for applications in which the computer sings. We consider the Natural Language Semantics work a first step towards specifying the semantics of dialogs; because no context or dialog history databases are defined, extra mechanisms must be supplied to do advanced natural language processing. Speaker identification and verification and advanced telephony commands are not yet supported in the W3C Speech Interface Framework. Developers are encouraged to define objects that support these features.

Q10. When developing an application, what functions and features belong in the application and what functions and features belong in the browser?

A10. A typical browser implements a specific set of features. We discourage developers from reimplementing these features within the application. New features should be implemented in the application. If and when several applications implement a new feature, the Working Group will consider placing the feature in a markup language specification and encouraging updates to browsers to incorporate the new feature. We discourage developers from creating downloadable browser enhancements because some browsers may not be able to accept downloads, especially browsers embedded into small devices and appliances.

Q11. What is the relationship between VoiceXML 2.0 and programming languages such as Java and C++?

A11. Objects may be implemented using any programming language.

Q12. How has the voice browser group addressed accessibility?

A12. The voice browser group's work on speech synthesis markup language brings the same level of richness to synthesized aural presentations that users have come to expect with visual presentations driven by HTML. In this respect, our work picks up from the prior W3C work on Aural CSS. Next, our work on making speech interfaces pervasive on the WWW has an enormous accessibility benefit; speech interaction enables information access to a significant percentage of the population that is currently disenfranchised.

As the voice browser group, our focus has naturally been on auditory interfaces, and hence all of our work has a positive impact on the user group facing the most access challenges on the visual WWW today, namely blind and low-vision users. At the same time, we are keenly aware of the fact that the move to information access via the auditory channel raises access challenges for users with hearing or speaking impairments. For a hearing-impaired user, synthesized text should be displayed visually. For a speaking-impaired user, verbal responses may instead be entered via a keyboard.

Finally, we realize that every individual is unique in terms of his or her abilities; this is likely to become key as we move towards multimodal interfaces, which will need to adjust themselves to the user's current environment and functional abilities. Work on multimodal browsing will address this in the context of user and device profiles.

Q13. Are there IP issues associated with VoiceXML 2.0?

A13.  Yes. Some members of the Voice Browser Working Group may have IP claims.  Every member of the Voice Browser Working Group is required to make a disclosure statement regarding its IP claims relevant to essential technology for Voice Browsers.  You can review these statements at http://www.w3.org/2001/09/voice-disclosures.html

Q14. How will patent policy issues affect future work on VoiceXML?

A14. The Voice Browser working group has been rechartered as a royalty free group under the terms of W3C's Current Patent Practice note, with the goal of producing a W3C Recommendation for VoiceXML and related specifications that can be implemented and distributed without the need to pay any royalties.

Note: W3C does not take a position regarding the validity or scope of any intellectual property right or other rights that might be claimed to pertain to the implementation or use of the technology, nor the extent to which any license under such rights might or might not be available. Copyright of WG deliverables is vested in the W3C.

Q15. How can I find out about VoiceXML related talks, conferences and seminars?

A15. See the Speech Technology magazine website, the VoiceXML Forum website, and Ken Rehor's list of VoiceXML events.

Q16. Who has implemented VoiceXML interpreters?

A16. Several vendors have implemented VoiceXML 1.0 and are extending their implementations to conform with the markup languages in the W3C Speech Interface Framework. To be listed here, the implementation must be working and available for use by developers. Vendors are listed in alphabetical order:

The BeVocal Café is a web-based VoiceXML development environment providing a carrier-grade VoiceXML 2.0 (and 1.0) interpreter, and the tools necessary to debug and test the usability of applications. The BeVocal VoiceXML Interpreter supports the latest W3C Working Draft and adds many enhancements such as Speaker Verification, Voice Enrollment, XML data, pre-tuned grammars and professional audio.

Conita has implemented a VoiceXML interpreter based upon OpenVXI (see below).

General Magic, http://www.generalmagic.com, has also implemented a version of VoiceXML 1.0.

HeyAnita's implementation of VoiceXML 1.0 is now available for use by developers, and offers full interactive debugging support. For more detail see HeyAnita's FreeSpeech Developer Network.

IBM Voice Server SDK Beta Program is based on VoiceXML Version 1.0. The IBM WebSphere Voice Server SDK is available for free. IBM has also provided a tutorial for learning VoiceXML.

Motorola has the Mobile Application Development Kit (MADK), a freely downloadable software development kit that supports VoiceXML 1.0 (as well as WML and VoxML). See http://www.motorola.com/MIMS/ISG/spin/mix/.

Nuance offers graphical VoiceXML development tools, a Voice Site Staging Center for rapid prototyping and testing, and a VoiceXML-based voice browser to developers at no cost. See the Nuance Developer Network at http://extranet.nuance.com/developer/ to get started.

OpenVXI is a portable open source VoiceXML interpreter developed by SpeechWorks. It may be used free of charge in commercial applications and allows the addition of proprietary modifications if desired. OpenVXI closely follows the VoiceXML 2.0 draft specification and can be downloaded from the Carnegie Mellon University speech software site.

PIPEBEACH offers speechWeb, a carrier-class VoiceXML platform, and a speechWeb Application Partner Program including developer's site, tutorials and access to speechWeb systems for application verification. For more information visit http://www.pipebeach.com.

PublicVoiceXML is an open source implementation of VoiceXML, and also the name of a project aimed at European community radio stations.

SpeechWorks offers a modular toolkit that simplifies adding VoiceXML 2.0 support to any platform. It includes a VoiceXML interpreter, speech recognition and TTS engines, data fetching and caching management, telephony hardware integration, and other components plus source code to accelerate development. For more information, please take a look at the OpenSpeech Browser Platform Integration Kit.

Telera offers DeVXchange, a Web-based community dedicated to making your development efforts a success. Based on VoiceXML standards, it provides the entire spectrum of tools and resources to take your phone-based business applications from concept to deployment. Visit DeVXchange today at http://www.telera.com/devxchange.html.

Tellme Studio allows anyone to develop their own voice applications and access them over the phone just by providing a URL to their content. Visit http://studio.tellme.com to begin. The Tellme Networks voice service is built entirely with VoiceXML. Call 1-800-555-TELL to try this service.

VoiceGenie has sponsored a developer challenge in association with VoiceXMLCentral, a VoiceXML virtual community and search engine. For details on Voice Genie's developer support, see: http://developer.voicegenie.com.

VoxPilot offers a free VoiceXML online development environment, with a multilingual implementation of VoiceXML including local access numbers in all major European markets, provided over a robust VoIP infrastructure.

W3C Staff Contact

Dave Raggett <dsr@w3.org>, W3C, (on assignment from Openwave)

Copyright © 1995-2002 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements. This page was last updated on 9 October 2002.