Windows Speech Recognition is a speech recognition component developed by Microsoft and introduced in the Windows Vista operating system that enables the use of voice commands to perform operations, such as the dictation of text, within applications and the operating system itself.
Windows Speech Recognition relies on the Microsoft Speech API and is also present in Windows 7, Windows 8, Windows 8.1, and Windows 10.
History
Precursors
Microsoft has been involved in speech recognition and speech synthesis research for many years. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts. The company's research ultimately led to the development of the Speech API, introduced in 1994. Speech recognition technology had been used in some of Microsoft's products prior to Windows Speech Recognition. Versions of Microsoft Office, including Office XP and Office 2003, included support for speech recognition among Office applications and other applications such as Internet Explorer. Installation of Office would enable limited speech functionality in Windows NT 4.0, Windows 98, and Windows Me. The 2002 edition of Windows XP Tablet PC Edition also included support within the Tablet PC Input Panel feature, and the Microsoft Plus! for Windows XP expansion package enabled voice commands to be used in Windows Media Player. However, this support was limited to individual applications, and prior to Windows Vista, the Windows operating system did not include integrated support for speech recognition.
Development
At the Windows Hardware Engineering Conference of 2002, Microsoft announced that Windows Vista, then known by its codename "Longhorn," would include advances in speech recognition technology and features such as support for microphone arrays. Bill Gates expanded upon this information during the Professional Developers Conference of 2003 where he stated that the company would "build speech capabilities into the system -- a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time." Further reports said that the operating system would include integrated support for speech recognition, and certain pre-release builds throughout development of the operating system would include a speech engine with training features. In 2003, Microsoft clarified the extent of its intended integration for Windows Vista when the company stated within a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide" in the operating system.
During WinHEC 2004, Microsoft listed speech recognition as part of its "Longhorn" mobile PC strategy to improve productivity and listed microphone arrays as a hardware opportunity for the operating system. At WinHEC 2005, Microsoft released additional details pertaining to speech recognition in Windows Vista with a focus on accessibility, new mobility scenarios, and improvements to the speech user experience. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between dictation and command modes, Windows Vista would separate the feature from the Tablet PC Input Panel by introducing a dedicated interface for speech input on the desktop and would also unify the previously separate dictation and command modes. In previous versions of Windows, speech recognition would not allow a user to speak a command after dictation or vice versa without first switching between these two modes. Microsoft also stated that speech recognition in Windows Vista would improve dictation accuracy, and support additional languages and microphone arrays. A demonstration of the feature at WinHEC 2005 focused on e-mail dictation with correction and editing commands, and a set of slides dedicated to microphone arrays was also released. Windows Vista Beta 1 would include an integrated speech recognition application. In an effort to persuade company employees to interact with Windows Speech Recognition during its development, Microsoft offered an opportunity to win a Premium model of its Xbox 360 video game console.
On July 27, 2006, before the operating system's release to manufacturing (RTM), a notable incident pertaining to speech recognition occurred during a demonstration by Microsoft at its annual Financial Analyst Meeting. Speech recognition initially failed to function correctly: several attempts to dictate produced consecutive output errors, resulting in the unintended output "Dear aunt, let's set so double the killer delete select all." The incident was a subject of significant derision among analysts and journalists in the audience. Microsoft later revealed that the errors during the demonstration were due to an audio gain glitch that caused speech recognition to distort the dictated commands. The glitch was fixed prior to the operating system's release to manufacturing on November 8, 2006.
Security report
Reports surfaced in early 2007 that Windows Speech Recognition might be vulnerable to an attack that could allow attackers to take advantage of its capabilities to perform undesired operations on a targeted computer by playing audio through the targeted computer's speakers; it was the first vulnerability discovered after the operating system's general retail availability. While Microsoft stated that such an attack is theoretically possible, it would have to meet a number of prerequisites in order to be successful: the targeted system would need to have the speech recognition feature previously activated and configured, the speakers and microphone(s) connected to the targeted system would need to be turned on, and the exploit would require the software to interpret commands without the user noticing. This is an unlikely scenario, as the affected system would perform visible user interface operations and produce audible feedback through the active speakers. Moreover, mitigating factors would include dictation clarity and microphone feedback and placement. An exploit of this nature would also be unable to perform privileged operations for users or protected administrators without explicit user consent, because of User Account Control.
Overview and features
Windows Speech Recognition allows a user to control a computer, including the operating system desktop user interface, through voice commands. Applications, including most of those bundled with Windows, can also be controlled through voice commands. By using speech recognition, users can dictate text within documents and e-mail messages, fill out forms, control the operating system user interface, perform keyboard shortcuts, and move the mouse cursor.
Speech recognition uses a speech profile to store information about a user's voice. Accuracy of speech recognition increases through use, which helps the feature adapt to a user's grammar, speech patterns, vocabulary, and word usage. Speech recognition also includes a tutorial to improve accuracy, and can optionally review a user's personal documents, including e-mail messages, to improve its command and dictation accuracy. In Windows 7 and later versions, an additional option is available that allows users to send speech information to Microsoft. Individual speech profiles can be created on a per-user basis, and backups of profiles can be performed via Windows Easy Transfer or through a downloadable utility developed by Microsoft. Profiles archived through this utility carry the WSRPROFILE filename extension. Windows Speech Recognition relies on the Microsoft Speech API, and third-party applications must support the Text Services Framework. Speech Recognition currently supports the following languages: Chinese (Traditional), Chinese (Simplified), English (U.S.), English (U.K.), French, German, Japanese, and Spanish.
Interface
The interface for Windows Speech Recognition primarily consists of a status area that displays instructions, information about commands (e.g., if a command could not be heard by the speech recognizer), and information about the state of the speech recognizer; a voice meter also provides visual feedback to the user about voice volume levels. The status area represents the current state of Windows Speech Recognition in one of three modes, listed below with their respective meanings:
- Listening: The speech recognizer is active and waiting for user input
- Sleeping: The speech recognizer will not listen for or respond to commands other than "Start listening"
- Off: The speech recognizer will not listen for or respond to any commands; this mode can be enabled by speaking "Stop listening"
In addition to the three modes listed above, the status area can also display information about messages that users can customize as part of their own Windows Speech Recognition Macros.
Alternates panel
A disambiguation interface, referred to as the alternates panel, displays a list of items interpreted by the recognizer as being relevant to a user's spoken word(s). If the word or phrase that the user intended to insert into an application is listed among the results, the user can speak the corresponding number of that word and confirm the choice by speaking "OK" to insert it into the application.
The alternates panel will also appear when launching programs or speaking commands that may refer to more than one item (e.g., speaking "Start Internet Explorer" may list the web browser and an alternate version of the web browser with add-ons disabled). However, a Windows Registry entry, ExactMatchOverPartialMatch, can limit commands to programs or commands with exact names if there is more than one instance of that item included among results.
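As a sketch, such an entry could be set with a Registry Editor (.reg) fragment like the one below. The source names only the ExactMatchOverPartialMatch value; the key path shown here is an assumption and may differ between Windows versions.

```
Windows Registry Editor Version 5.00

; Assumed key path; the source names only the ExactMatchOverPartialMatch value.
; A value of 1 limits results to exact name matches.
[HKEY_CURRENT_USER\Software\Microsoft\Speech\Preferences]
"ExactMatchOverPartialMatch"=dword:00000001
```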
Common commands
Listed below are common commands available for Windows Speech Recognition. Words in italics indicate a variable that can be substituted for a desired item (e.g., the word "direction" in the "scroll direction" command can be substituted with the word "down" to scroll down). A "start typing" command enables Windows Speech Recognition to interpret dictation commands as keyboard shortcuts.
- Dictation commands: "New line," "new paragraph," "tab," "literal word," "numeral number," "go to word," "go after word," "no space," "go to start of sentence," "go to end of sentence," "go to start of paragraph," "go to end of paragraph," "go to start of document," "go to end of document," "go to field name" (e.g., go to address, cc, or subject). Special characters, such as a comma, can be dictated simply by stating the name of the special character.
- Navigation commands:
  - Keyboard shortcuts: "Press keyboard key," "press ⇧ Shift plus a," "press capital b." The NATO phonetic alphabet is also supported. Keys that can be pressed without first giving the press command include: ← Backspace, Delete, End, ↵ Enter, Home, Page Down, Page Up, and Tab ↹.
  - Mouse commands: "Click," "click that," "double-click," "double-click that," "mark," "mark that," "right-click," "right-click that," "mousegrid."
  - Window management commands: "Close (alternatively maximize, minimize, or restore) window," "close that," or "close application name," "switch applications," "switch to program name," "scroll direction," "scroll direction in number of pages," "show desktop," "show numbers."
- Speech recognition commands: "Start listening," "stop listening," "show speech options," "open speech dictionary," "move speech recognition," "minimize speech recognition." A list of applicable commands can be shown by speaking "What can I say?" This command is currently available only in English. Users can also query the recognizer about tasks in Windows by speaking "How can I task name," which opens the Help Pane that displays related information.
Mousegrid
A mousegrid command enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions narrow as the user speaks the number of the region to focus on, until a desired interface element to interact with is reached. Entire regions can be interacted with by speaking "click number of region," which moves the mouse cursor to the desired region and then clicks it. An individual item within a region, such as a computer icon, can also be selected by speaking "mark number of region" where the item appears. A user can then specify where to move the marked item by speaking "click number of region." These commands also work for multiple regions of the mousegrid.
Show numbers
Applications and operating system user interface elements that do not present obvious commands can still be controlled by asking the system to overlay numbers on top of them through a show numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. The command was designed so that users could interact with items that are not readily identifiable.
Dictation
Windows Speech Recognition enables dictation of text in the operating system and applications. For applications that do not automatically support dictation, an option to enable dictation everywhere is available. If a mistake in dictation occurs, a user can correct it by saying "correct word" or "correct that"; the alternates panel then appears with suggestions for correction, which can be selected by speaking the number of the desired suggestion and confirming with "OK." If the desired word is not listed among the suggestions, a user can speak the desired word again so that it might appear. Alternatively, a "spell it" command allows a user to speak the desired word on a per-letter basis so that it will appear among the suggestions. Multiple words in a sentence can also be corrected at once. As an example, if a user states "dictating" but speech recognition recognizes this word as "the thing," the user can state "correct the thing" to correct both words.
Speech dictionary
WSR includes a personal dictionary that allows users to include or exclude certain words or expressions from dictation. By default, this dictionary includes over 100,000 words in the English language. When adding a word beginning with a capital letter to the dictionary, a user can specify whether it should always be capitalized during dictation or whether capitalization depends on the context in which the word is spoken. Users can also record pronunciations for words added to the dictionary to increase recognition accuracy, and words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Most of the information stored within the dictionary is included as part of a user's speech profile.
Macros
Windows Speech Recognition supports custom macros through a separate utility released by Microsoft that enables the use of commands based on natural language processing. As an example of this functionality, an e-mail macro released by Microsoft enables a natural language command where a user can state "send e-mail to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted within the application. Microsoft has also released sample macros for the speech dictionary, for Windows Media Player, for Microsoft PowerPoint, for speech synthesis, to switch between multiple microphones, to customize various aspects of audio device configuration such as volume levels, and for general natural language queries such as, "What is the weather forecast?" "What time is it?" and "What's the date?" Answers to these queries are spoken to the user via a speech synthesizer.
Users and developers can create their own custom macros based on text transcription and substitution, program execution (with support for command-line arguments), keyboard shortcuts, emulation of existing voice commands, or a combination of these items. XML, JScript, and VBScript are supported. Macros can be limited to individual applications if desired, and rules for macros can be defined programmatically. In order for a macro to be loaded, it must be stored within a Speech Macros folder within the current user's Documents directory. By default, all macros are digitally signed if a user certificate is available, in order to ensure that created commands are not loaded or tampered with by third parties; if one is not available, an administrator can create a certificate for use. The macros utility also includes security levels to prohibit unsigned macros from being loaded, to prompt users to sign macros, or to load unsigned macros if a user so desires.
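A minimal sketch of the XML format used by the macros utility is shown below. The element names follow the schema commonly associated with the Windows Speech Recognition Macros utility, but this particular macro, its phrase, and its inserted text are purely illustrative.

```xml
<speechMacros>
  <!-- Illustrative text-substitution macro: speaking the phrase in
       listenFor inserts the text in insertText into the active
       application. Phrase and text are examples, not from the source. -->
  <command>
    <listenFor>insert my signature</listenFor>
    <insertText>Best regards, Example User</insertText>
  </command>
</speechMacros>
```

Per the description above, a file such as this would be saved in the Speech Macros folder of the user's Documents directory and digitally signed before the utility loads it.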
Performance
As of 2017, Windows Speech Recognition uses Microsoft Speech Recognizer 8.0, which has not changed since Windows Vista. In a test by Mark Hachman, a senior editor at PCWorld, dictation of an article (as opposed to speaking defined voice commands) was found to be 93.6% accurate without training the application, a result below that of other voice recognition software; Microsoft employees have said that, properly trained, accuracy reaches 99%. Hachman commented that speech recognition was a feature Microsoft did not like to talk about, with few users knowing that documents could be dictated within Windows.
See also
- List of speech recognition software
- Microsoft Cortana
- Microsoft Narrator
- Microsoft Voice Command
- Technical features new to Windows Vista
External links
- Windows Vista Speech Recognition demonstration at Microsoft Financial Analyst Meeting
Source of the article : Wikipedia