Speech recognition is a sub-field of computational linguistics that develops methodologies and technologies enabling the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition, or speech to text (STT). It combines knowledge and research in linguistics, computer science, and electrical engineering.

Some speech recognition systems require "training" (also called "enrollment"), where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person's specific voice and uses it to fine-tune the recognition of that person's speech, resulting in increased accuracy. Systems that do not use training are called "speaker-independent" systems. Systems that use training are called "speaker-dependent".

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, search (e.g. finding a podcast where particular words were spoken), simple data entry (e.g. entering a credit card number), preparation of structured documents (e.g. a radiology report), speech-to-text processing (e.g. word processors or email), and aircraft (usually termed direct voice input).

The term voice recognition or speaker identification refers to identifying the speaker, rather than what they are saying. Recognizing the speaker can simplify the task of translating speech in systems that have been trained on a specific person's voice, or it can be used to authenticate or verify the identity of a speaker as part of a security process.

From a technology perspective, speech recognition has a long history with several waves of major innovations. Most recently, the field has benefited from advances in deep learning and big data. These advances are evidenced not only by the surge of academic papers published in the field, but more importantly by the worldwide industry adoption of a variety of deep learning methods in designing and deploying speech recognition systems. These speech industry players include Google, Microsoft, IBM, Baidu, Apple, Amazon, Nuance, SoundHound, and iFLYTEK, many of which have publicized that the core technology in their speech recognition systems is based on deep learning.





History

Early work

In 1952, three Bell Labs researchers built a system for single-speaker digit recognition. Their system worked by locating the formants in the power spectrum of each utterance. 1950s-era technology was limited to single-speaker systems with vocabularies of around ten words.

Gunnar Fant developed the source-filter model of speech production and published it in 1960, which proved to be a useful model of speech production.

Unfortunately, funding at Bell Labs dried up for several years when, in 1969, the influential John Pierce wrote an open letter that was critical of speech recognition research. Pierce defunded speech recognition at Bell Labs, where no research on speech recognition was done until Pierce retired and James L. Flanagan took over.

Raj Reddy was the first person to take on continuous speech recognition as a graduate student at Stanford University in the late 1960s. Previous systems required users to pause after each word. Reddy's system was designed to issue spoken commands for playing chess.

Also around this time Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to create a recognizer capable of operating on a 200-word vocabulary. DTW processed speech signals by dividing them into short frames, e.g. 10 ms segments, and processing each frame as a single unit. Although DTW would be superseded by later algorithms, the technique of dividing the signal into frames carried on. Achieving speaker independence remained unsolved and was a major goal of researchers during this period.

In 1971, DARPA funded five years of speech recognition research through its Speech Understanding Research program, with ambitious end goals including a minimum vocabulary size of 1,000 words. It was thought that speech understanding would be key to making progress in speech recognition, although that later proved untrue. BBN, IBM, Carnegie Mellon, and Stanford Research Institute all participated in the program. The government funding revived speech recognition research, which had been largely abandoned in the United States after John Pierce's letter.

Despite the fact that CMU's Harpy system met the original goals of the program, many of the predictions turned out to be nothing more than hype, disappointing the DARPA administrators. This disappointment led to DARPA not continuing the funding. Several innovations happened during this time, such as the invention of beam search for use in CMU's Harpy system. The field also benefited from the discovery of several algorithms in other fields, such as linear predictive coding and cepstral analysis.

In 1972, the IEEE Acoustics, Speech, and Signal Processing group held a conference in Newton, Massachusetts. Four years later, the first ICASSP was held in Philadelphia, which since then has been a major venue for the publication of research on speech recognition.

During the late 1960s Leonard Baum developed the mathematics of Markov chains at the Institute for Defense Analysis. A decade later, at CMU, Raj Reddy's students James Baker and Janet M. Baker began using the Hidden Markov Model (HMM) for speech recognition. James Baker had learned about HMMs from a summer job at the Institute for Defense Analysis during his undergraduate education. The use of HMMs allowed researchers to combine different sources of knowledge, such as acoustics, language, and syntax, in a unified probabilistic model.

Under Fred Jelinek's lead, IBM created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary by the mid-1980s. Jelinek's statistical approach put less emphasis on emulating the way the human brain processes and understands speech, in favor of statistical modeling techniques such as HMMs. (Jelinek's group independently discovered the application of HMMs to speech.) This was controversial with linguists, since HMMs are too simplistic to account for many common features of human languages. However, the HMM proved to be a highly useful way of modeling speech, and it replaced dynamic time warping to become the dominant speech recognition algorithm of the 1980s. IBM had a few competitors, including Dragon Systems, founded by James and Janet M. Baker in 1982. The 1980s also saw the introduction of the n-gram language model. Katz introduced the back-off model in 1987, which allowed language models to use n-grams of multiple lengths. During the same period, CSELT also used HMMs (with diphones, studied since 1980) to recognize languages such as Italian. At the same time, CSELT led a series of European projects (Esprit I, II) and summarized the state of the art in a book, later (2013) reprinted.

Much of the progress in the field is owed to the rapidly increasing capabilities of computers. At the end of the DARPA program in 1976, the best computer available to researchers was the PDP-10 with 4 MB of RAM. Using such computers, it could take up to 100 minutes to decode just 30 seconds of speech. Decades later, researchers had access to tens of thousands of times as much computing power. As technology advanced and computers got faster, researchers began tackling harder problems such as larger vocabularies, speaker independence, noisy environments, and conversational speech. In particular, this shift to more difficult tasks has characterized DARPA funding of speech recognition since the 1980s. For example, progress was made on speaker independence, first by training on a larger variety of speakers and later by performing explicit speaker adaptation during decoding. Further reductions in word error rate came as researchers shifted acoustic models to be discriminative instead of using maximum likelihood estimation.

In the mid-eighties, new speech recognition microprocessors were released: for example RIPAC, a speaker-independent recognizer (for continuous speech) designed for telephone services, was presented in the Netherlands in 1986. It was designed by CSELT/Elsag and manufactured by SGS.

Practical speech recognition

The 1990s saw the first commercially successful speech recognition technologies. The earliest two products were Dragon Dictate, a consumer product released in 1990 and initially priced at $9,000, and a recognizer from Kurzweil Applied Intelligence released in 1987. AT&T deployed the Voice Recognition Call Processing service in 1992 to route telephone calls without the use of a human operator. The technology was developed by Lawrence Rabiner and others at Bell Labs. By this point, the vocabulary of the typical commercial speech recognition system was larger than the average human vocabulary. Raj Reddy's former student Xuedong Huang developed the Sphinx-II system at CMU. The Sphinx-II system was the first to do speaker-independent, large-vocabulary, continuous speech recognition, and it had the best performance in DARPA's 1992 evaluation. Handling continuous speech with a large vocabulary was a major milestone in the history of speech recognition. Huang went on to found the speech recognition group at Microsoft in 1993. Raj Reddy's student Kai-Fu Lee joined Apple where, in 1992, he helped develop a speech interface prototype for the Apple computer known as Casper.

Lernout & Hauspie, a Belgium-based speech recognition company, acquired several other companies, including Kurzweil Applied Intelligence in 1997 and Dragon Systems in 2000. L&H speech technology was used in the Windows XP operating system. L&H was an industry leader until an accounting scandal brought the company to an end in 2001. The speech technology from L&H was bought by ScanSoft, which became Nuance in 2005. Apple originally licensed software from Nuance to provide speech recognition capability to its digital assistant Siri.

In the 2000s DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE). Four teams participated in the EARS program: IBM, a team led by BBN with LIMSI and the University of Pittsburgh, Cambridge University, and a team composed of ICSI, SRI, and the University of Washington. EARS funded the collection of the Switchboard telephone speech corpus, containing 260 hours of recorded conversations from over 500 speakers. The GALE program focused on Arabic and Mandarin broadcast news speech. Google's first effort at speech recognition came in 2007, after hiring some researchers from Nuance. The first product was GOOG-411, a telephone-based directory service. The recordings from GOOG-411 produced valuable data that helped Google improve its recognition systems. Google Voice Search is now supported in over 30 languages.

In the United States, the National Security Agency has made use of a type of speech recognition for keyword spotting since at least 2006. This technology allows analysts to search through large volumes of recorded conversations and isolate mentions of keywords. Recordings can be indexed, and analysts can run queries over the database to find conversations of interest. Some government research programs have focused on intelligence applications of speech recognition, e.g. DARPA's EARS program and IARPA's Babel program.

Modern systems

In the early 2000s, speech recognition was still dominated by traditional approaches such as Hidden Markov Models combined with feedforward artificial neural networks. Today, however, many aspects of speech recognition have been taken over by a deep learning method called Long short-term memory (LSTM), a recurrent neural network published by Sepp Hochreiter and Jürgen Schmidhuber in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn "Very Deep Learning" tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech. Around 2007, LSTM trained by Connectionist Temporal Classification (CTC) started to outperform traditional speech recognition in certain applications. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users.

The use of deep feedforward (non-recurrent) networks for acoustic modeling was introduced during the later part of 2009 by Geoffrey Hinton and his students at the University of Toronto and by Li Deng and colleagues at Microsoft Research, initially in collaborative work between Microsoft and the University of Toronto which was subsequently expanded to include IBM and Google (hence the "shared views of four research groups" subtitle in their 2012 review paper). A Microsoft research executive called this innovation "the most dramatic change in accuracy since 1979". In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%. This innovation was quickly adopted across the field. Researchers have begun to use deep learning techniques for language modeling as well.

In the long history of speech recognition, both shallow and deep forms (e.g. recurrent nets) of artificial neural networks had been explored for many years during the 1980s, 1990s, and a few years into the 2000s. But these methods never won over the non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. A number of key difficulties had been methodologically analyzed in the 1990s, including diminishing gradients and weak temporal correlation structure in the neural predictive models. All these difficulties were in addition to the lack of big training data and big computing power in those early days. Most speech recognition researchers who understood such barriers subsequently moved away from neural nets to pursue generative modeling approaches, until the recent resurgence of deep learning, starting around 2009-2010, that overcame all these difficulties. Hinton et al. and Deng et al. reviewed part of this recent history, describing how their collaboration with each other and then with colleagues across four groups (University of Toronto, Microsoft, Google, and IBM) ignited a renaissance of deep feedforward neural network applications for speech recognition.




Models, methods, and algorithms

Both acoustic modeling and language modeling are important parts of modern statistically based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems. Language modeling is also used in many other natural language processing applications, such as document classification or statistical machine translation.
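To make the language modeling component above concrete, the sketch below scores word sequences with a bigram model using add-one smoothing. It is a minimal illustration in Python; the tiny training corpus and the smoothing choice are assumptions for this example, not part of any particular recognizer.

```python
# A minimal sketch of a bigram language model with add-one (Laplace) smoothing.
# The tiny corpus below is an illustrative assumption.
from collections import Counter
import math

corpus = ["recognize speech with a computer", "wreck a nice beach"]
tokens = [["<s>"] + s.split() for s in corpus]
unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in tokens for i in range(len(sent) - 1))
vocab_size = len(unigrams)

def sentence_logprob(sentence):
    words = ["<s>"] + sentence.split()
    logp = 0.0
    for prev, word in zip(words, words[1:]):
        # Smoothing gives unseen bigrams a small, non-zero probability.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        logp += math.log(p)
    return logp

# A fluent word order scores higher than a scrambled one.
print(sentence_logprob("recognize speech"), sentence_logprob("speech recognize"))
```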

Hidden Markov Model

Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal. On a short time scale (e.g., 10 milliseconds), speech can be approximated as a stationary process. Speech can be thought of as a Markov model for many stochastic purposes.

Another reason HMMs are popular is that they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model outputs a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), emitting one of these every 10 milliseconds. The vectors consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model tends to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, has a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.
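As a rough illustration of the cepstral front end described above, the following Python sketch windows a signal, takes a Fourier transform, applies a logarithm and a cosine transform, and keeps the first coefficients. It deliberately omits the mel filterbank used in standard MFCC pipelines; all parameter values and names are assumptions for illustration.

```python
# A minimal sketch of cepstral feature extraction: window the signal, take a
# Fourier transform, compress, decorrelate with a cosine transform, and keep
# the first coefficients. Parameter values are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_coeffs=13):
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per analysis window
    hop_len = int(sample_rate * hop_ms / 1000)       # one feature vector every 10 ms
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2           # power spectrum
        log_spectrum = np.log(spectrum + 1e-10)              # compress dynamic range
        cepstrum = dct(log_spectrum, type=2, norm='ortho')   # decorrelate (cosine transform)
        features.append(cepstrum[:n_coeffs])                 # keep most significant coefficients
    return np.array(features)

# Example: one second of random noise stands in for real audio.
feats = cepstral_features(np.random.randn(16000))
print(feats.shape)   # (frames, n_coeffs)
```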

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques to improve results over the basic approach described above. A typical large-vocabulary system needs context dependency for the phonemes (so phonemes with different left and right contexts have different realizations as HMM states); it uses cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features have so-called delta and delta-delta coefficients to capture speech dynamics and, in addition, might use heteroscedastic linear discriminant analysis (HLDA); or the system might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection, followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE), and minimum phone error (MPE).

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combined hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand (the finite state transducer, or FST, approach).
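A minimal sketch of the Viterbi search mentioned above, written for a toy HMM with log-probability tables; the model, scores, and state layout are made up for illustration.

```python
# Viterbi decoding over a small HMM: the highest-scoring state sequence is
# found by dynamic programming and recovered with back-pointers.
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """log_trans[i, j]: log P(state j | state i)
       log_emit[t, j]:  log P(observation at frame t | state j)
       log_init[j]:     log P(start in state j)"""
    n_frames, n_states = log_emit.shape
    score = np.full((n_frames, n_states), -np.inf)
    backptr = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, n_frames):
        for j in range(n_states):
            candidates = score[t - 1] + log_trans[:, j]
            backptr[t, j] = np.argmax(candidates)
            score[t, j] = candidates[backptr[t, j]] + log_emit[t, j]
    # Trace the best path back from the final frame.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return list(reversed(path))

# Toy example: a 3-state left-to-right model and 5 frames of made-up scores.
rng = np.random.default_rng(0)
log_trans = np.log(np.array([[0.7, 0.3, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]) + 1e-12)
log_emit = np.log(rng.dirichlet(np.ones(3), size=5))
log_init = np.log(np.array([1.0, 1e-12, 1e-12]))
print(viterbi(log_trans, log_emit, log_init))
```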

A possible improvement to decoding is to keep a set of good candidates instead of keeping only the best one, and to use a better scoring function (re-scoring) to rate these candidates so that the best match according to the refined score can be picked. The set of candidates can be kept either as a list (the N-best list approach) or as a subset of the models (a lattice). Re-scoring is usually done by trying to minimize the Bayes risk (or an approximation thereof): instead of taking the source sentence with maximal probability, we take the sentence that minimizes the expectation of a given loss function over all possible transcriptions (i.e., the sentence that minimizes the average distance to other possible sentences, weighted by their estimated probabilities). The loss function is usually the Levenshtein distance, though it can be a different distance for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability. Efficient algorithms have been devised to re-score lattices represented as weighted finite state transducers, with the edit distances themselves represented as a finite state transducer verifying certain assumptions.
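The following sketch illustrates N-best re-scoring in its simplest form: each hypothesis's acoustic score is combined with a language-model score and the best total is kept. The hypothetical lm_logprob function, the weight, and the hypotheses are assumptions for this example; real systems re-score lattices with far richer models.

```python
# A minimal sketch of N-best re-scoring: combine each hypothesis's acoustic
# log-score with a language-model log-score and keep the best total.
def lm_logprob(sentence):
    # Stand-in unigram "language model" for illustration only.
    vocab = {"recognize": -2.0, "speech": -1.5, "wreck": -4.0, "a": -1.0,
             "nice": -3.0, "beach": -3.5}
    return sum(vocab.get(w, -8.0) for w in sentence.split())

def rescore_nbest(nbest, lm_weight=0.8):
    """nbest: list of (hypothesis_text, acoustic_log_score)."""
    rescored = [(hyp, ac + lm_weight * lm_logprob(hyp)) for hyp, ac in nbest]
    return max(rescored, key=lambda item: item[1])

nbest = [("recognize speech", -12.1),
         ("wreck a nice beach", -11.9)]   # slightly better acoustic score
best, score = rescore_nbest(nbest)
print(best, round(score, 2))   # the language model tips the balance
```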

Dynamic time warping (DTW)-based speech recognition

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach.

Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and in another walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics - indeed, any data that can be turned into a linear representation can be analyzed with DTW.

A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g., time series) subject to certain restrictions. That is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
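A minimal dynamic-programming sketch of DTW, aligning two sequences of different lengths; the toy "fast" and "slow" sequences stand in for real speech features.

```python
# Dynamic time warping between two feature sequences via plain dynamic
# programming. The toy sequences at the bottom are illustrative assumptions.
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a, seq_b: arrays of shape (length, dims). Returns the DTW cost."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            # Either frame may "stretch" (repeat), or both advance together.
            cost[i, j] = local + min(cost[i - 1, j],      # stretch seq_a
                                     cost[i, j - 1],      # stretch seq_b
                                     cost[i - 1, j - 1])  # match both
    return cost[n, m]

# The same "word" spoken at two speeds: one sequence is a slowed-down copy.
fast = np.sin(np.linspace(0, 3, 30)).reshape(-1, 1)
slow = np.sin(np.linspace(0, 3, 60)).reshape(-1, 1)
print(dtw_distance(fast, slow))   # small cost despite the different lengths
```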

Neural networks

Neural networks emerged as an attractive acoustic modeling approach in ASR in the late 1980s. Since then, neural networks have been used in many aspects of speech recognition, such as phoneme classification, isolated word recognition, audiovisual speech recognition, audiovisual speaker recognition, and speaker adaptation.

In contrast to HMMs, neural networks make no assumptions about feature statistical properties and have several qualities that make them attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. Few assumptions about the statistics of input features are made with neural networks. However, despite their effectiveness in classifying short-time units such as individual phonemes and isolated words, neural networks were rarely successful for continuous recognition tasks, largely because of their limited ability to model temporal dependencies.

However, recently LSTM Recurrent Neural Networks (RNNs) and Time Delay Neural Networks (TDNNs) have been used, which have been shown to be able to identify latent temporal dependencies and use this information to perform the task of speech recognition.

Deep Neural Networks and Denoising Autoencoders have also been investigated to tackle this problem in an effective manner.

Due to the inability of feedforward neural networks to model temporal dependencies, an alternative approach is to use neural networks as a pre-processing step, e.g. for feature transformation or dimensionality reduction, before HMM-based recognition.

Deep feedforward and recurrent neural networks

A deep feedforward neural network (DNN) is an artificial neural network with multiple hidden layers of units between the input and output layers. Similar to shallow neural networks, DNNs can model complex non-linear relationships. DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.
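As a rough sketch of such a DNN acoustic model, the following code (assuming PyTorch is available) stacks a few fully connected layers that map a window of spliced feature frames to log-posteriors over context-dependent HMM states. The layer sizes, context width, and number of output states are illustrative assumptions, not values from any published system.

```python
# A minimal feedforward DNN acoustic model: spliced feature frames in,
# log-posteriors over HMM states out. All sizes are made-up examples.
import torch
import torch.nn as nn

n_context = 11        # e.g. 5 frames of left and right context plus the center frame
n_features = 40       # per-frame acoustic features (e.g. filterbank outputs)
n_states = 3000       # context-dependent HMM states (senones), a made-up number

acoustic_model = nn.Sequential(
    nn.Linear(n_context * n_features, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, n_states), nn.LogSoftmax(dim=-1),
)

# One batch of 32 spliced frames -> log-posteriors over HMM states.
frames = torch.randn(32, n_context * n_features)
log_posteriors = acoustic_model(frames)
print(log_posteriors.shape)   # torch.Size([32, 3000])
```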

The success of DNNs in large-vocabulary speech recognition occurred in 2010 when industrial researchers, in collaboration with academic researchers, adopted large DNN output layers based on context-dependent HMM states constructed by decision trees. See comprehensive reviews of this development and of the state of the art as of October 2014 in the recent Springer book from Microsoft Research. See also the related background of automatic speech recognition and the impact of various machine learning paradigms, notably including deep learning, in recent overview articles.

One fundamental principle of deep learning is to do away with hand-crafted feature engineering and to use raw features. This principle was first explored successfully in the architecture of a deep autoencoder on the "raw" spectrogram or linear filter-bank features, demonstrating its superiority over the Mel-Cepstral features, which contain a few stages of fixed transformation from spectrograms. The truly "raw" features of speech, waveforms, have more recently been shown to produce excellent larger-scale speech recognition results.

End-to-end automatic speech recognition

Since 2014, there has been much research interest in "end-to-end" ASR. Traditional phonetic-based (i.e., all HMM-based) approaches require separate components and training for the pronunciation, acoustic, and language models. End-to-end models jointly learn all the components of the speech recognizer. This is valuable because it simplifies the training and deployment processes. For example, an n-gram language model is required for all HMM-based systems, and a typical n-gram language model often takes several gigabytes of memory, making it impractical to deploy on mobile devices. Consequently, modern commercial ASR systems from Google and Apple (as of 2017) are deployed in the cloud and require a network connection, as opposed to running locally on the device.

The first attempt at end-to-end ASR was with Connectionist Temporal Classification (CTC)-based systems, introduced by Alex Graves of Google DeepMind and Navdeep Jaitly of the University of Toronto in 2014. The model consisted of recurrent neural networks and a CTC layer. Jointly, the RNN-CTC model learns the pronunciation and acoustic models together; however, it is incapable of learning the language because of conditional independence assumptions similar to those of an HMM. Consequently, CTC models can learn to map speech acoustics directly to English characters, but they make many common spelling mistakes and must rely on a separate language model to clean up the transcripts. Later, Baidu expanded on the work with extremely large datasets and demonstrated some commercial success in both Mandarin Chinese and English. In 2016, the University of Oxford presented LipNet, the first end-to-end sentence-level lip-reading model, using spatiotemporal convolutions coupled with an RNN-CTC architecture, surpassing human-level performance in a restricted grammar dataset.
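The character-level mapping that CTC learns can be illustrated with greedy decoding: take the most likely symbol in each frame, collapse repeats, and remove the blank. The alphabet and the frame outputs below are made up for illustration.

```python
# Greedy CTC decoding: best symbol per frame, collapse repeats, drop the blank.
import numpy as np

BLANK = 0
alphabet = {1: "c", 2: "a", 3: "t"}   # index 0 is reserved for the CTC blank

def ctc_greedy_decode(frame_log_probs):
    """frame_log_probs: array of shape (frames, symbols) of log-probabilities."""
    best = np.argmax(frame_log_probs, axis=-1)          # best symbol per frame
    collapsed = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
    return "".join(alphabet[s] for s in collapsed if s != BLANK)

# Made-up per-frame outputs for the word "cat": c c <blank> a a t
frames = np.log(np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]))
print(ctc_greedy_decode(frames))   # "cat"
```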

An alternative approach to CTC-based models is the attention-based model. Attention-based ASR models were introduced simultaneously by Chan et al. of Carnegie Mellon University and Google Brain and Bahdanau et al. of the University of Montreal in 2016. The model named "Listen, Attend and Spell" (LAS) literally "listens" to the acoustic signal, pays "attention" to different parts of the signal, and "spells" out the transcript one character at a time. Unlike CTC-based models, attention-based models do not have conditional-independence assumptions and can learn all the components of a speech recognizer, including the pronunciation, acoustic, and language models, directly. This means that, during deployment, there is no need to carry around a language model, making it very practical for applications with limited memory. By the end of 2016, attention-based models had seen considerable success, including outperforming the CTC models (with or without an external language model). Various extensions have been proposed since the original LAS model. Latent Sequence Decompositions (LSD) was proposed by Carnegie Mellon University, MIT, and Google Brain to directly emit sub-word units, which are more natural than English characters; the University of Oxford and Google DeepMind extended LAS to "Watch, Listen, Attend and Spell" (WLAS) to handle lip reading, surpassing human-level performance.
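The "attend" step of such models can be sketched as a dot-product attention over encoder outputs: each acoustic frame is scored against the current decoder state, the scores are normalized, and a weighted summary is produced. The dimensions and the scoring function below are simplifying assumptions; LAS itself uses learned projections.

```python
# A minimal dot-product attention step over encoded acoustic frames.
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,) ; encoder_states: (frames, d)."""
    scores = encoder_states @ decoder_state          # one score per acoustic frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over frames
    context = weights @ encoder_states               # weighted summary ("attention")
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.standard_normal((50, 8))        # 50 encoded acoustic frames
decoder_state = rng.standard_normal(8)               # state while emitting one character
context, weights = attend(decoder_state, encoder_states)
print(context.shape, weights.argmax())               # the frame attended to most
```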



Applications

In-car system

Typically a manual control input, for example by means of a finger control on the steering wheel, enables the speech recognition system, and this is signaled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations, or play music from a compatible smartphone, MP3 player, or music-loaded flash drive. Voice recognition capabilities vary between car make and model. Some of the most recent car models offer natural-language speech recognition in place of a fixed set of commands, allowing the driver to use full sentences and common phrases. With such systems there is, therefore, no need for the user to memorize a set of fixed command words.

Health care

Medical documentation

In the health care sector, speech recognition can be implemented in the front end or back end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. Back-end or deferred speech recognition is where the provider dictates into a digital dictation system, the voice is routed through a speech recognition engine, and the recognized draft document is routed along with the original voice file to the editor, where the draft is edited and the report finalized. Deferred speech recognition is widely used in the industry today.

One of the major issues relating to the use of speech recognition in health care is that the American Recovery and Reinvestment Act of 2009 (ARRA) provides substantial financial benefits to physicians who utilize an EMR according to "Meaningful Use" standards. These standards require that a substantial amount of data be maintained by the EMR (now more commonly referred to as an Electronic Health Record or EHR). The use of speech recognition is more naturally suited to the generation of narrative text, as part of a radiology/pathology interpretation, progress note, or discharge summary: the ergonomic gains of using speech recognition to enter structured discrete data (e.g., numeric values or codes from a list or a controlled vocabulary) are relatively minimal for people who are sighted and who can operate a keyboard and mouse.

A more significant issue is that most EHRs have not been expressly tailored to take advantage of voice-recognition capabilities. A large part of the clinician's interaction with the EHR involves navigating through the user interface using menus and tab/button clicks, and is heavily dependent on keyboard and mouse: voice-based navigation provides only modest ergonomic benefits. By contrast, many highly customized systems for radiology or pathology dictation implement voice "macros", where the use of certain phrases - for example, "normal report" - automatically fills in a large number of default values and/or generates boilerplate, which varies with the type of exam - for example, a chest X-ray vs. a gastrointestinal contrast series for a radiology system.

As an alternative to this navigation by hand, the cascaded use of speech recognition and information extraction has been studied as a way to fill out a handover form for clinical proofing and sign-off. The results are encouraging, and the paper also opens the data, together with the related performance benchmarks and some processing software, to the research and development community for studying clinical documentation and language processing.

Therapeutic use

Prolonged use of speech recognition software in conjunction with word processors has shown benefits for short-term-memory restrengthening in brain AVM patients who have been treated with resection. Further research needs to be conducted to determine the cognitive benefits for individuals whose AVMs have been treated using radiological techniques.

Military

High-performance fighter

Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note have been the US program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program in France for Mirage aircraft, and other programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling the flight display.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found recognition deteriorated with increasing g-loads. The report also concluded that adaptation greatly improved the results in all cases and that introducing models for breathing was shown to improve recognition scores significantly. Contrary to what might have been expected, no effect of the speakers' broken English was found. It was evident that spontaneous speech caused problems for the recognizer, as might have been expected. A restricted vocabulary and, above all, a proper syntax could thus be expected to improve recognition accuracy substantially.

The Eurofighter Typhoon, currently in service with the UK RAF, employs a speaker-dependent system, requiring each pilot to create a template. The system is not used for any safety-critical or weapon-critical tasks, such as weapon release or lowering of the undercarriage, but is used for a wide range of other cockpit functions. Voice commands are confirmed by visual and/or aural feedback. The system is seen as a major design feature in the reduction of pilot workload, and even allows the pilot to assign targets to his aircraft with two simple voice commands, or to any of his wingmen with only five commands.

Speaker-independent systems are also being developed and are being tested for the F-35 Lightning II (JSF) and the Alenia Aermacchi M-346 Master lead-in fighter trainer. These systems have produced word accuracy scores in excess of 98%.

Helicopter

The problems of achieving high recognition accuracy under stress and noise pertain strongly to the helicopter environment as well as to the jet fighter environment. The acoustic noise problem is actually more severe in the helicopter environment, not only because of the high noise levels but also because the helicopter pilot, in general, does not wear a facemask, which would reduce acoustic noise in the microphone. Substantial test and evaluation programs have been carried out in the past decade in speech recognition applications for helicopters, notably by the US Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace Establishment (RAE) in the United Kingdom. Work in France has included speech recognition in the Puma helicopter. There has also been much useful work in Canada. Results have been encouraging, and voice applications have included control of communication radios, setting of navigation systems, and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot effectiveness. Encouraging results are reported for the AVRADA tests, although these represent only a feasibility demonstration in a test environment. Much remains to be done, both in speech recognition and in speech technology overall, in order to consistently achieve performance improvements in operational settings.

Training of air traffic controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many ATC training systems currently require a person to act as a "pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog that the controller would have to conduct with pilots in a real ATC situation. Speech recognition and synthesis techniques offer the potential to eliminate the need for a person to act as a pseudo-pilot, thus reducing training and support personnel. In theory, air controller tasks are also characterized by highly structured speech as the primary output of the controller, so reducing the difficulty of the speech recognition task should be possible. In practice, this is rarely the case. The FAA document 7110.65 details the phrases that should be used by air traffic controllers. While this document gives fewer than 150 examples of such phrases, the number of phrases supported by one of the simulation vendors' speech recognition systems is in excess of 500,000.

The USAF, USMC, US Army, US Navy, and FAA, as well as a number of international ATC training organizations such as the Royal Australian Air Force and the Civil Aviation Authorities in Italy, Brazil, and Canada, currently use ATC simulators with speech recognition from a number of different vendors.

Telephony and other domains

ASR is now commonplace in the field of telephony and is becoming more widespread in the field of computer gaming and simulation. Despite the high degree of integration with word processing in general personal computing, ASR in the field of document production has not seen the expected increases in use.

The improvement of mobile processor speeds has made speech recognition practical on smartphones. Speech is used mostly as part of a user interface, for creating predefined or custom speech commands. Leading software vendors in this field are: Google, Microsoft Corporation (Microsoft Voice Command), Siphon Digital (Sonic Extractor), LumenVox, Nuance Voice Control, Voci Technologies, VoiceBox Technology, Speech Technology Center, Vito Technologies (VITO Voice2Go), Speereo Software (Speereo Voice Translator), Verbyx VRX, and SVOX.

Use in education and everyday life

Speech recognition can be useful for learning a second language. It can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills.

Students who are blind (see Blindness and education) or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding with their voice, instead of having to look at the screen and keyboard.

Students who are physically disabled or suffer from repetitive strain injuries or other injuries to the upper extremities can be relieved from having to worry about handwriting, typing, or working with a scribe on school assignments by using speech-to-text programs. They can also use speech recognition technology to freely enjoy searching the Internet or using a computer at home without having to physically operate a mouse and keyboard.

Speech recognition can allow students with learning disabilities to become better writers. By saying the words aloud, they can increase the fluidity of their writing and be relieved of concerns regarding spelling, punctuation, and other mechanics of writing. Also, see Learning disability.

The use of voice recognition software, in conjunction with a digital audio recorder and a personal computer running word-processing software, has proven to be positive for restoring damaged short-term-memory capacity in stroke and craniotomy individuals.

People with disabilities

People with disabilities can benefit from speech recognition programs. For individuals who are Deaf or Hard of Hearing, speech recognition software is used to automatically generate closed captioning of conversations such as discussions in conference rooms, classroom lectures, and/or religious services.

Speech recognition is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to disabilities that preclude using conventional computer input devices. In fact, people who used keyboards heavily and developed RSI became an urgent early market for speech recognition. Speech recognition is used in deaf telephony, such as voicemail to text, relay services, and captioned telephone. Individuals with learning disabilities who have problems with thought-to-paper communication (essentially they think of an idea but it is processed incorrectly, causing it to end up differently on paper) can possibly benefit from the software, but the technology is not bug proof. In addition, the whole idea of speaking to text can be hard for intellectually disabled people, since it is rare that anyone tries to learn the technology in order to teach it to the person with the disability.

This type of technology can help those with dyslexia, but whether it helps with other disabilities is still in question. The effectiveness of the product is the problem that hinders it from being effective. Although a child may be able to say a word, depending on how clearly they say it the technology may think they are saying another word and input the wrong one, giving them more work to fix and causing them to take more time correcting the wrong word.

Further applications

  • Aerospace (e.g. space exploration, spacecraft, etc.); NASA's Mars Polar Lander used speech recognition technology from Sensory, Inc. in the Mars Microphone on the Lander
  • Automatic subtitling with speech recognition
  • Automatic translation
  • Court reporting (Realtime Speech Writing)
  • eDiscovery (Legal discovery)
  • Hands-free computing: speech recognition computer user interface
  • Home automation
  • Interactive voice response
  • Mobile phones, including mobile email
  • Multimodal interactions
  • Pronunciation evaluation in computer-aided language learning applications
  • Robotics
  • Speech-to-text reporting (transcription of speech into text, video captioning, court reporting)
  • Telematics (e.g. Vehicle Navigation System)
  • Transcription (digital speech-to-text)
  • Video games, with Tom Clancy's EndWar and Lifeline as working examples
  • Virtual assistants (e.g. Apple's Siri)



Performance

The performance of speech recognition systems is usually evaluated in terms of accuracy and speed. Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real time factor. Other measures of accuracy include the Single Word Error Rate (SWER) and the Command Success Rate (CSR).
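Word error rate is the word-level edit distance (substitutions, insertions, and deletions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, with made-up example sentences:

```python
# Word error rate (WER) via the classic edit-distance dynamic program.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))   # 1 deletion / 6 words ~ 0.17
```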

Speech recognition by machine is a very complex problem. Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed. Speech is distorted by background noise and echoes, and by electrical characteristics. The accuracy of speech recognition may vary with the following:

  • Vocabulary size and confusability
  • Speaker dependence versus independence
  • Isolated, discontinuous or continuous speech
  • Tasks and language constraints
  • Read versus spontaneous speech
  • Adverse conditions

Accuracy

As mentioned earlier in this article, the accuracy of speech recognition may vary depending on the following factors:

  • The error rate increases with increasing vocabulary size:

e.g. the 10 digit words "zero" to "nine" can be recognized essentially perfectly, but vocabulary sizes of 200, 5,000 or 100,000 may have error rates of 3%, 7% or 45% respectively.

  • Vocabulary is hard to recognize if it contains confusable words:

e.g. the 26 letters of the English alphabet are difficult to discriminate because they are confusable words (most notoriously, the E-set: "B, C, D, E, G, P, T, V, Z"); an 8% error rate is considered good for this vocabulary.

  • Speaker dependence vs. independence:

A speaker-dependent system is intended for use by a single speaker.
A speaker-independent system is intended for use by any speaker (more difficult).

  • Isolated, discontinuous or continuous speech

With isolated speech, single words are used, so it becomes easier to recognize the speech. With discontinuous speech, full sentences separated by silence are used, so it also becomes easier to recognize the speech, much as with isolated speech. With continuous speech, naturally spoken sentences are used, so it becomes harder to recognize the speech, unlike both isolated and discontinuous speech.

  • Tasks and language constraints

e.g. a querying application may dismiss the hypothesis "The apple is red."; constraints may be semantic, rejecting "The apple is angry.", or syntactic, rejecting "Red is apple the." Constraints are often represented by a grammar.

  • Read vs. Spontaneous Speech

When a person reads, it is usually in a context that has been prepared beforehand, but when a person uses spontaneous speech it is difficult to recognize the speech because of disfluencies (such as "uh" and "um", false starts, incomplete sentences, stuttering, coughing, and laughter) and limited vocabulary.

  • Adverse conditions

Environmental noise (e.g. noise in a car or a factory)
Acoustic distortions (e.g. echoes, room acoustics)
Speech recognition is a multi-leveled pattern recognition task.

  • The acoustic signal is organized into a hierarchy of units;

e.g. phonemes, words, phrases, and sentences;

  • Each level provides additional constraints;

e.g. known word pronunciations or legal word sequences, which can compensate for errors or uncertainties at a lower level;

  • This hierarchy of constraints is exploited.

By combining decisions probabilistically at all lower levels, and making more deterministic decisions only at the highest level, speech recognition by a machine becomes a process broken down into several phases. Computationally, it is a problem in which a sound pattern has to be recognized or classified into a category that represents a meaning to a human. Every acoustic signal can be broken into smaller, more basic sub-signals. As the more complex sound signal is broken into smaller sub-sounds, different levels are created: at the top level we have complex sounds made of simpler sounds at the lower levels, and going further down we arrive at more basic, shorter, and simpler sounds. At the lowest level, where the sounds are the most fundamental, the machine checks simple, more probabilistic rules of what the sound should represent. Once these sounds are put together into more complex sounds at the upper level, a new set of more deterministic rules should predict what the new complex sound should represent, and the uppermost level of deterministic rules should figure out the meaning of complex expressions. In order to expand our knowledge of speech recognition, we need to take neural networks into consideration. There are four steps in a neural network approach (a compact sketch tying these steps together follows the list below):

  • Digitize the speech we want to recognize

For telephone speech, sampling rate is 8000 samples per second;

  • Compute spectral-domain features of the speech (with a Fourier transform);

computed every 10 ms, with one 10 ms section called a frame;

The four-step neural network approach can be explained with further information. Sound is produced by the vibration of air (or some other medium), which we register with our ears but a machine registers with a receiver. A basic sound creates a wave which has two descriptions: amplitude (how strong it is) and frequency (how often it vibrates per second).

Sound waves can be digitized: the strength of the wave is sampled at short intervals to obtain numbers that approximate the wave's strength at each step. The collection of these numbers represents the analog wave; this new wave is digital. Sound waves are complicated because they superimpose on one another, like travelling waves, and in this way they create odd-looking waves. For example, if there are two waves that interact with each other, we can add them, which creates a new, odd-looking wave.

  • The neural network classifies the features into phonetic-based categories;

Given the basic sound blocks that the machine has digitized, one has a set of numbers that describe a wave, and waves describe words. Each frame holds a unit block of sound, which is broken into basic sound waves and represented by numbers that, after a Fourier transform, can be statistically evaluated to determine which class of sound it belongs to. The nodes of the network represent sound features: a wave feature from the first layer of nodes passes to the second layer of nodes based on statistical analysis. This analysis depends on the programmer's instructions. At this point, the second layer of nodes represents higher-level features of the sound input, which are again statistically evaluated to see which class they belong to. The last level of nodes should be output nodes that tell us, with high probability, what the original sound really was.

  • Search to match the neural network's output scores against the best word, to determine which word was most likely uttered.
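A compact sketch tying the four steps above together, with an untrained stand-in "network" and a made-up two-word vocabulary; every name and number here is an assumption for illustration only.

```python
# (1) a digitized signal, (2) frame-by-frame spectral features via a Fourier
# transform, (3) a tiny stand-in "classifier" scoring phonetic classes per
# frame, and (4) picking the best-matching word.
import numpy as np

rng = np.random.default_rng(0)
sample_rate = 8000                            # step 1: telephone-quality sampling rate
signal = rng.standard_normal(sample_rate)     # one second of stand-in "speech"

frame_len, hop = 200, 80                      # 25 ms frames, one every 10 ms
frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, hop)]
spectra = np.array([np.abs(np.fft.rfft(f * np.hamming(frame_len))) for f in frames])  # step 2

n_classes = 4                                 # pretend phonetic categories
weights = rng.standard_normal((spectra.shape[1], n_classes))   # untrained stand-in network
scores = spectra @ weights                    # step 3: per-frame class scores
class_per_frame = scores.argmax(axis=1)

# Step 4: score each "word" by how well the frames match its expected class sequence.
vocabulary = {"yes": [0, 1, 2], "no": [3, 2]}
def word_score(classes, frame_classes):
    return sum(np.mean(frame_classes == c) for c in classes) / len(classes)
best_word = max(vocabulary, key=lambda w: word_score(vocabulary[w], class_per_frame))
print(best_word)
```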

Security concerns

Speech recognition can become a means of attack, theft, or accidental operation. For example, activation words like "Alexa" spoken in an audio or video broadcast can cause devices in homes and offices to start listening for input inappropriately, or possibly take an unwanted action. Voice-controlled devices are also accessible to visitors to the building, or even to those outside the building if they can be heard inside. Attackers may be able to gain access to personal information, such as calendars, address book contents, private messages, and documents. They may also be able to impersonate the user to send messages or make purchases online.

Two attacks have been demonstrated that use artificial sounds. One transmits ultrasound and attempts to send commands without nearby people noticing. The other adds small, inaudible distortions to other speech or music that are specially crafted to confuse the specific speech recognition system into recognizing music as speech, or to make what sounds like one command to a human sound like a different command to the system.



Further information

Conferences and journals

Popular speech recognition conferences held each year or two include SpeechTEK and SpeechTEK Europe, ICASSP, Interspeech/Eurospeech, and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (later renamed IEEE Transactions on Audio, Speech and Language Processing and, since September 2014, IEEE/ACM Transactions on Audio, Speech and Language Processing after merging with an ACM publication), Computer Speech and Language, and Speech Communication.

Books

Books such as Lawrence Rabiner's "Fundamentals of Speech Recognition" can be useful for acquiring basic knowledge but may not be fully up to date (1993). Other good sources are "Statistical Methods for Speech Recognition" by Frederick Jelinek and "Spoken Language Processing (2001)" by Xuedong Huang et al. More up to date are "Computer Speech" by Manfred R. Schroeder, second edition published in 2004, and "Speech Processing: A Dynamic and Optimization-Oriented Approach", published in 2003 by Li Deng and Doug O'Shaughnessey. The updated textbook "Speech and Language Processing (2008)" by Jurafsky and Martin presents the basics and the state of the art for ASR. Speaker recognition also uses the same features, most of the same front-end processing, and the same classification techniques as speech recognition. A recent comprehensive textbook, "Fundamentals of Speaker Recognition", is an in-depth source for up-to-date details on theory and practice. A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored evaluations such as those organized by DARPA (the largest speech recognition-related project ongoing as of 2007 was the GALE project, which involved both speech recognition and translation components).

A good and accessible introduction to speech recognition technology and its history is provided by the general-audience book "The Voice in the Machine" by Roberto Pieraccini (2012).

A recent book on speech recognition is "Automatic Speech Recognition: A Deep Learning Approach" (publisher: Springer), written by D. Yu and L. Deng and published near the end of 2014, with highly mathematically oriented technical detail on how deep learning methods are derived and implemented in modern speech recognition systems based on DNNs and related deep learning methods. A related book, published earlier in 2014, "Deep Learning: Methods and Applications" by L. Deng and D. Yu, provides a less technical but more methodology-focused overview of DNN-based speech recognition during 2009-2014, placed within the more general context of deep learning applications, including not only speech recognition but also image recognition, natural language processing, information retrieval, multimodal processing, and multitask learning.

Software

In terms of freely available resources, Carnegie Mellon University's Sphinx toolkit is one place to start, both for learning about speech recognition and for beginning to experiment. Another resource (free but copyrighted) is the HTK book (and the accompanying HTK toolkit). For more recent and state-of-the-art techniques, the Kaldi toolkit can be used.

An on-line speech recognition demo is available on the Cobalt web page.

Source of the article: Wikipedia
