Speech Technologies White Paper
Backgrounder
Tim Berners-Lee, the inventor of the World Wide Web, said:
“Speech technologies bridge the gap between computer language and human language; it helps computers to figure out what people are thinking,
and people to figure out what computers are thinking.”
The goal of natural language interaction is to communicate concepts, not words: "It's not how you say it, it's what you mean."
Building machines that communicate using human languages has proven tricky; now decades of research and development are finally paying off,
delivering a suite of tools and products that enable the construction of effective broad-based applications. The constituent technology categories are described below:
Speech Processing
Speech processing enables computers to recognize, and to some extent understand, spoken language. Speech is "eyes free" and "hands free",
allowing a device to be used virtually anywhere. This technology has given rise to two types of software products: continuous-speech recognition and command and control.
For a broad-use application, context-free grammars are the most reliable. Because a context-free grammar allows a speech recognition engine
to restrict the words it will recognize to a predefined list, high recognition accuracy can be achieved in a speaker-independent environment.
Context-free grammars work well without voice training, and with inexpensive microphones and average CPUs.
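The effect of restricting recognition to a predefined list can be sketched in a few lines of Python. This is a minimal illustration, not a recognition engine: it assumes a hypothetical toy grammar (GRAMMAR), expands it into every phrase the grammar allows, and snaps a noisy text hypothesis to the closest allowed phrase using the standard difflib module. The grammar, the function names, and the 0.6 cutoff are all invented for illustration.

```python
import difflib
from itertools import product

# Hypothetical toy grammar: each rule maps a symbol to its alternatives,
# and each alternative is a sequence of symbols or literal words.
GRAMMAR = {
    "COMMAND": [["ACTION", "OBJECT"]],
    "ACTION": [["open"], ["close"], ["print"]],
    "OBJECT": [["file"], ["window"], ["report"]],
}

def expand(symbol):
    """Return every word sequence the grammar allows for `symbol`."""
    if symbol not in GRAMMAR:  # not a rule name, so treat it as a literal word
        return [[symbol]]
    phrases = []
    for alternative in GRAMMAR[symbol]:
        # Expand each part of the alternative, then combine the parts in order.
        parts = [expand(part) for part in alternative]
        for combo in product(*parts):
            phrases.append([word for part in combo for word in part])
    return phrases

# The complete, finite list of phrases the "recognizer" will accept.
ALLOWED = [" ".join(words) for words in expand("COMMAND")]

def recognize(hypothesis, cutoff=0.6):
    """Snap a noisy hypothesis to the closest allowed phrase, or return None."""
    matches = difflib.get_close_matches(
        hypothesis.lower(), ALLOWED, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

With only nine allowed phrases, a slightly garbled hypothesis such as "pint report" still resolves to "print report", while input that resembles nothing in the grammar is rejected. A real engine applies the same constraint to acoustic scores rather than to text.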
Although speech recognition technologies are not new, accuracy rates are only now becoming acceptable for natural language discourse. According to Microsoft,
speech recognition accuracy is improving 10% per year and, as of 2004, the error rate sits at about 8%. Microsoft expects that to drop to 6% in 2005
and expects error rates comparable to human error rates by 2011.
Speech Synthesis
The ability to synthesize the sound of speech is useful for applications that require spontaneous interaction, or in situations where
reading isn't practical (giving instructions to a driver, for example). In products aimed at the general public, it's critical that
the output sound pleasant and human enough to encourage regular use.
Microsoft is now a major player in the speech technology marketplace. A 10-year, billion-dollar effort has resulted in a compelling suite of services and tools
based on open standards; it brings inexpensive and effective conversational access to information applications and accelerates the acceptance of speech as a user interface
alternative for Web and mobile applications. Its impact will be most significant in attacking problems in the small and medium enterprise (SME) market.
At SpeechTek 2004, the company's new Microsoft Speech Server product received the "Best Product Making the Most Impact" award.
It supports eight languages (Microsoft plans full international language support in the future) and is based on the open-standard
Speech Application Language Tags (SALT) specification, which extends familiar mark-up languages
and leverages the existing Web development paradigm.
Natural Language Processing
NLP systems interpret written rather than spoken language. In fact, NLP modules can be found in speech-processing systems that start
by converting spoken input into text. Using lexicons and grammar rules, NLP parses sentences, determines underlying meanings, and
retrieves or constructs responses. This technology's main use is to enable databases to answer queries entered in the form of a question.
A newer application is handling high-volume email. NLP performance can be improved by incorporating a common sense knowledge base -
that is, an encyclopedia of real-world rules.
NLP with Microsoft English Query
Almost all database query languages tend to be rigid and difficult to learn, and it is often difficult
even for an experienced user to get the desired information out of a database. A natural language interface to SQL
removes the need for users to master the complexities of SQL.
English Query (EQ) is a component of Microsoft SQL Server 2000 that lets users query databases in plain English.
The EQ engine creates a database query from the user's question; that query can then be executed under program control to return a formatted answer.
The development process is at a higher level than traditional programming and can be mastered by non-programmers with some database background.
To implement natural language searching, you first use the authoring tool to provide domain knowledge to the engine. The authoring tool is
used to relate database entities to objects in the domain.
For example, you create a verb relationship between a salespeople table and a products table by indicating that "salespeople sell products."
EQ uses these relationships to perform natural language parsing of users' questions, which provides better search results than keyword-based
technology.
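The idea of relating database entities through a verb can be sketched as follows. This is not the English Query authoring tool or its API; it is a minimal Python illustration under stated assumptions: a hypothetical schema (salespeople, products, and a sales join table), a hand-written semantic model (ENTITIES and RELATIONSHIPS), and a toy parser that only understands questions of the form "Which <subject> <verb> <object name>?". It uses the standard sqlite3 module so the generated query can actually be run.

```python
import re
import sqlite3

# Hypothetical semantic model: entity words map to tables, and a verb
# relationship ("salespeople sell products") maps to the join that realizes it.
ENTITIES = {"salespeople": "salespeople", "products": "products"}
RELATIONSHIPS = {
    ("salespeople", "sell", "products"): {
        "join_table": "sales",
        "subject_key": "salesperson_id",
        "object_key": "product_id",
    }
}

def question_to_sql(question):
    """Translate 'Which <subject> <verb> <object name>?' into (sql, params)."""
    match = re.fullmatch(r"which (\w+) (\w+) (.+?)\??", question.strip().lower())
    if not match:
        raise ValueError("question not covered by this toy model")
    subject, verb, object_name = match.groups()
    # Look for a modeled relationship with the same subject and verb.
    for (subj, rel_verb, obj), join in RELATIONSHIPS.items():
        if subj == subject and rel_verb == verb:
            sql = (
                f"SELECT DISTINCT s.name FROM {ENTITIES[subj]} AS s "
                f"JOIN {join['join_table']} AS j ON j.{join['subject_key']} = s.id "
                f"JOIN {ENTITIES[obj]} AS o ON o.id = j.{join['object_key']} "
                "WHERE lower(o.name) = ? ORDER BY s.name"
            )
            return sql, (object_name,)
    raise ValueError(f"no relationship modeled for '{subject} {verb}'")

def demo(question="Which salespeople sell gadgets?"):
    """Build a small in-memory database and answer one question against it."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE salespeople (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE sales (salesperson_id INTEGER, product_id INTEGER);
        INSERT INTO salespeople VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO products VALUES (1, 'Widgets'), (2, 'Gadgets');
        INSERT INTO sales VALUES (1, 1), (2, 2), (2, 1);
    """)
    sql, params = question_to_sql(question)
    return [row[0] for row in db.execute(sql, params)]
```

Because the question is resolved through the modeled relationship rather than by matching keywords, a question that uses an unmodeled verb (for example "buy") is rejected instead of returning loosely related rows.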
Although your initial goal in an English Query project might be to answer the most common questions your users will ask, the
ultimate goal is to identify and model all the relationships between entities in your database. You want to have a semantic
model that defines the knowledge domain of your application, thus enabling EQ to provide answers to a wide range of questions without having to identify those questions ahead of time.
Input Devices
If you add speech recognition capability and a microphone to your English Query application, users can either type or speak their questions
to the application. Without stretching much further, you could put that speech interface on a smart phone or handheld Personal Digital
Assistant (PDA) with wireless Internet capability.
The combination of speech recognition and English Query represents a powerful way for a user to access information in a SQL Server
database very quickly. For users who work in an environment where speed and ease of access are critical, it holds enormous promise
for future applications. As hardware continues to become more powerful and cheaper, speech recognition should continue to become
more accurate and useful to increasingly wide audiences.
Multimodality
Multimodality seamlessly combines graphics, text, audio and avatar output with speech, text, ink, body attitude, gaze, RFID, GPS and touch input to enable a greatly enhanced
user experience. It is made possible by the convergence of voice, data and content, and by multimedia, IP, speech and wireless technologies hosted on a diversity of
devices and device combinations. When compared to single-mode voice and visual applications, multimodal applications are easier and more intuitive to use. The user can
pick how best to interact with an application, which is especially helpful with newer, small-form-factor devices. When modalities are used contemporaneously,
mutual disambiguation (MD) among the inputs lowers error rates, improving accuracy, performance and robustness.
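Mutual disambiguation can be illustrated with a small fusion sketch. The Python below is a simplified illustration, not a description of any particular product: it assumes each recognizer returns a hypothetical n-best list of (label, confidence) pairs, that a hand-written table says which speech and gesture interpretations are compatible, and that the joint score is simply the product of the two confidences. The labels and scores are made up.

```python
def fuse(speech_nbest, gesture_nbest, compatible):
    """Pick the best-scoring pair of compatible speech and gesture hypotheses.

    Returns (speech_label, gesture_label, joint_score), or None if no
    pair of hypotheses is compatible.
    """
    best = None
    for s_label, s_score in speech_nbest:
        for g_label, g_score in gesture_nbest:
            if (s_label, g_label) not in compatible:
                continue  # this pair makes no sense together, so skip it
            joint = s_score * g_score
            if best is None or joint > best[2]:
                best = (s_label, g_label, joint)
    return best

# Hypothetical recognizer outputs: speech slightly prefers the wrong command.
speech_nbest = [("remove that", 0.52), ("move that", 0.48)]
gesture_nbest = [("drag", 0.70), ("tap", 0.30)]

# Hypothetical compatibility table: moving needs a drag, removing needs a tap.
compatible = {("move that", "drag"), ("remove that", "tap")}

result = fuse(speech_nbest, gesture_nbest, compatible)
```

Taken alone, the speech recognizer would have chosen "remove that". Combined with the gesture evidence, the compatible pair ("move that", "drag") scores highest, so the second-ranked speech hypothesis is pulled up to the top.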
Radio Frequency Identification (RFID)
Radio frequency identification, or RFID, is a generic term for technologies that automatically identify one or more objects via radio waves, using a unique serial number stored in an RFID tag. The tag's antenna, tuned to receive the reader's electromagnetic waves in real time, is able to transmit the identification information to the reader. The reader converts the radio waves received from the RFID tag into digital information which, in turn, can be passed on to a business system for processing or storage.
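The hand-off from reader to business system can be sketched in Python. This is a minimal illustration under assumptions, not a real reader interface: it assumes the reader delivers each tag's serial number as raw bytes, and it uses hypothetical names (TagRead, decode_read, deduplicate) and a hypothetical two-second window to collapse the repeated reads that a tag produces while it stays in range.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TagRead:
    serial: str       # the tag's unique serial number, as uppercase hex
    reader_id: str    # which reader saw the tag
    timestamp: float  # time of the read, in seconds

def decode_read(raw_id: bytes, reader_id: str, timestamp: float) -> TagRead:
    """Convert the raw bytes reported by a reader into a digital record."""
    return TagRead(raw_id.hex().upper(), reader_id, timestamp)

def deduplicate(reads, window=2.0):
    """Collapse repeat reads of the same tag by the same reader.

    A read is forwarded only if the tag has not been seen by that reader
    during the previous `window` seconds. Every read refreshes the
    last-seen time, so a tag that stays in range yields a single event.
    """
    last_seen = {}
    events = []
    for read in sorted(reads, key=lambda r: r.timestamp):
        key = (read.serial, read.reader_id)
        previous = last_seen.get(key)
        if previous is None or read.timestamp - previous >= window:
            events.append(read)
        last_seen[key] = read.timestamp
    return events

# Hypothetical stream: tag E2001A lingers at the dock, leaves, then returns.
reads = [
    decode_read(bytes.fromhex("e2001a"), "dock-1", 0.0),
    decode_read(bytes.fromhex("e2001b"), "dock-1", 0.2),
    decode_read(bytes.fromhex("e2001a"), "dock-1", 0.5),
    decode_read(bytes.fromhex("e2001a"), "dock-1", 1.0),
    decode_read(bytes.fromhex("e2001a"), "dock-1", 5.0),
]
events = deduplicate(reads)
```

The resulting events list is what would be passed on to the business system for processing or storage; here the five raw reads collapse into three events.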
RFID reader technology can be integrated with PDAs via a PC Card or CF Card implementation. Recent Wal-Mart and Department of Defense RFID mandates have intensified overall interest in using this technology in business applications and stimulated further hardware- and software-related advancements.
RFID tags tend to be small and lightweight and can be read through nonmetallic materials. A reader does not have to touch a tag, making RFID ideal for cluttered, dirty, wet, and harsh environments. Unlike bar code scanners, RFID readers can read tags through mud, dirt, paint, grease, wood, cement, plastic, water, and steam. RFID does not require line-of-sight between tag and reader, so tags can be hidden under skin, inside clothes hems, and within the pages of a book, preserving the item's usability and aesthetics.
RFID tags come in two forms: passive (low power, short range, inexpensive) and active (high power, longer range, more expensive). Our target implementation requires active tags, which run on their own power and can transmit over long distances. The battery life of a typical active tag is five years.