Synthesizer Properties
Synthesizers use five properties to control and fine-tune output speech. They are voice, volume, speaking rate, pitch, and pitch range.
- Voice: This decides the type of voice the synthesizer uses to render the speech. Synthesizers provide a variety of voice options by simulating the age of the speaker (infant, teenager, adult, etc.), gender (male, female, neutral, etc.), and style (casual, business, etc.). Combining these attributes yields different kinds of voices.
- Volume: This value ranges from 0.0 (silent) to 1.0 (loudest).
- Speaking Rate: This decides the speed of speech output in words per minute.
- Pitch: This decides the baseline (minimum) pitch of the voice.
- Pitch range: This decides the range of pitches that can be used starting from the baseline pitch.
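As a rough illustration, the five properties can be modeled as a small value class. The class below is a hypothetical sketch, not a JSAPI type: the name `SpeechSettings`, the field names, the use of hertz for pitch, and the choice to clamp volume rather than reject it are all assumptions made for the example.

```java
/** Hypothetical holder for the five synthesizer properties (not a JSAPI class). */
final class SpeechSettings {
    final String voice;           // e.g. "adult-female-casual" (made-up naming scheme)
    final float volume;           // 0.0 (silent) to 1.0 (loudest)
    final float wordsPerMinute;   // speaking rate
    final float baselinePitchHz;  // lowest pitch of the voice (unit is an assumption)
    final float pitchRangeHz;     // how far above the baseline the pitch may rise

    SpeechSettings(String voice, float volume, float wordsPerMinute,
                   float baselinePitchHz, float pitchRangeHz) {
        if (wordsPerMinute <= 0 || baselinePitchHz <= 0 || pitchRangeHz < 0) {
            throw new IllegalArgumentException(
                "rate and baseline pitch must be positive, pitch range non-negative");
        }
        this.voice = voice;
        // Out-of-range volumes are clamped to the 0.0..1.0 scale.
        this.volume = Math.max(0.0f, Math.min(1.0f, volume));
        this.wordsPerMinute = wordsPerMinute;
        this.baselinePitchHz = baselinePitchHz;
        this.pitchRangeHz = pitchRangeHz;
    }

    /** Highest pitch the voice may reach: baseline plus range. */
    float maxPitchHz() {
        return baselinePitchHz + pitchRangeHz;
    }
}
```

Keeping the pitch range relative to the baseline mirrors the description above: changing the baseline shifts the whole band of usable pitches.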
Synthetic speech is usually generated using either Concatenative Synthesis or Formant Synthesis.
- Concatenative Synthesis: Libraries of phonemes (unique theoretical units of sound in a language that enable the differentiation of words) are arranged together to form words and sentences. The generated sentence is then rendered as a waveform (a sound signal). Speech generated using this method is usually highly intelligible. However, differences in speech patterns between the joined units keep the result from sounding natural.
- Formant Synthesis: This method generates speech artificially. Phonemes are associated with certain frequency ranges called Formants. Each formant is defined by its pitch, frequency range, and noise level. Varying the frequency range and pitch of each formant generates waveforms. The speech thus generated sounds machine-like and does not have human speech quality.
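The core idea of concatenative synthesis can be sketched in a few lines. The example below is a minimal illustration, not how any particular synthesizer works: the class name `ConcatenativeSketch`, the phoneme labels, and the representation of a recording as a `float[]` of samples are assumptions. Real systems also smooth the joins between units, which this sketch omits.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: builds a waveform by joining stored phoneme recordings. */
final class ConcatenativeSketch {

    /** Joins the stored waveform of each phoneme, in order, into one signal. */
    static float[] synthesize(List<String> phonemes, Map<String, float[]> library) {
        // First pass: check every phoneme is available and size the output.
        int total = 0;
        for (String p : phonemes) {
            float[] unit = library.get(p);
            if (unit == null) {
                throw new IllegalArgumentException("No recording for phoneme: " + p);
            }
            total += unit.length;
        }
        // Second pass: copy each unit end to end into the output buffer.
        float[] out = new float[total];
        int pos = 0;
        for (String p : phonemes) {
            float[] unit = library.get(p);
            System.arraycopy(unit, 0, out, pos, unit.length);
            pos += unit.length;
        }
        return out;
    }
}
```

Because the units are copied unchanged, any mismatch in pitch or loudness at the boundaries stays audible, which is one way to picture the loss of naturalness described above.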
Speech Recognition
Speech recognition is the process of converting speech to text. This is more difficult than synthesis, as it requires interpreting what the user has spoken and converting that speech into meaningful sentences.
The process of speech recognition can be divided into four steps:
- Speech is converted to digital signals. Noise, microphone position, quality of audio hardware, etc. have a major impact on the generated digital signals.
- Actual speech sounds are separated from the rest of the signal, based on the energy of the sounds.
- The extracted sounds are put together into 'speech frames.'
- The speech frames are compared with words from the grammar file to determine the word that was spoken.
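The energy-based extraction and framing steps can be sketched as follows. This is a simplified illustration under stated assumptions: it cuts the signal into fixed-size frames first and then keeps the frames whose mean squared amplitude exceeds a threshold, folding the second and third steps into one pass. The class name `EnergyFramer`, the frame size, and the threshold are hypothetical; real recognizers use more elaborate endpoint detection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of energy-based speech extraction and framing. */
final class EnergyFramer {

    /** Mean squared amplitude of samples[from, to). */
    static double energy(float[] samples, int from, int to) {
        double sum = 0;
        for (int i = from; i < to; i++) {
            sum += (double) samples[i] * samples[i];
        }
        return sum / (to - from);
    }

    /**
     * Splits the signal into consecutive frames of frameSize samples and
     * keeps only the frames whose energy is above the threshold.
     * A trailing partial frame is dropped.
     */
    static List<float[]> speechFrames(float[] samples, int frameSize, double threshold) {
        if (frameSize <= 0) {
            throw new IllegalArgumentException("frameSize must be positive");
        }
        List<float[]> frames = new ArrayList<>();
        for (int start = 0; start + frameSize <= samples.length; start += frameSize) {
            if (energy(samples, start, start + frameSize) > threshold) {
                frames.add(Arrays.copyOfRange(samples, start, start + frameSize));
            }
        }
        return frames;
    }
}
```

Low-energy frames are treated as silence or background noise, which is why noise and microphone quality in the first step matter so much: a noisy signal raises the energy of frames that contain no speech.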
These are the types of speech recognizers:
- Speaker-independent: Speaker-independent systems are recognizers that can be used by anybody without the need for training. These systems are usually deployed in environments where the system cannot be trained, for instance, telephony applications.
- Speaker-dependent: Speaker-dependent systems are those systems that need to be first trained for use by specific speakers. Voice samples of each speaker are taken, analyzed and stored. Speech is then matched with these samples to accurately determine the words spoken.
- Continuous Speech Recognition: Continuous speech recognizers allow users to speak naturally and continuously.
- Isolated or Discrete Speech Recognition: Isolated speech recognizers require the speaker to pause (about a fifth of a second) between each word that is uttered, so that the recognizer can buffer the word before processing the next one.
- Vocabulary Constrained Systems: These systems have a limited vocabulary they can understand. Small vocabulary means that users will have to restrict speech to only words that recognizers understand.
Speech recognition is usually done using Grammar Constrained Recognition or Natural Language Recognition:
- Grammar Constrained Recognition: This method is used by applications that need short responses to definite questions. Probable responses are stored as a grammar. The recognizer then uses the grammar to 'recognize' user answers. Software is programmed so that appropriate action is taken when answers are not found in the grammar. For instance, the question can be repeated so that the user can provide an answer as specified in the grammar.
- Natural Language Recognition: This method allows users to speak in a 'natural way'. Statistical models are developed to map normal responses to their inferred meanings (what the response meant). These models are then used to match user answers to a 'what the user meant' concept and thereby provide suitable responses.
Natural speech varies widely: it has alternative pronunciations, context-based pronunciations, and phrases with more than one meaning. For instance, 'turn on' could mean either to arouse or to operate (as in to turn on the charm or to turn on the TV). Building applications that understand such diversity is therefore a very complex process.
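By contrast, the grammar constrained approach is simple enough to sketch. The example below is a minimal illustration that works on already-transcribed text rather than audio, which a real recognizer would not do; the class name `GrammarMatcher` and the normalization rule (trim and lowercase) are assumptions. An empty result signals that the answer is not in the grammar and the question should be asked again.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: accepts only responses listed in a fixed grammar. */
final class GrammarMatcher {
    private final Set<String> grammar = new HashSet<>();

    GrammarMatcher(Set<String> allowedResponses) {
        for (String response : allowedResponses) {
            grammar.add(normalize(response));
        }
    }

    /** Ignores surrounding whitespace and letter case when comparing. */
    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the matched response, or empty when the caller should re-prompt. */
    Optional<String> match(String heard) {
        String key = normalize(heard);
        return grammar.contains(key) ? Optional.of(key) : Optional.empty();
    }
}
```

A caller would loop on the question until `match` returns a value, for example accepting "Yes" against a grammar of "yes" and "no" and repeating the question after "maybe".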
Two Java Speech API Implementations
A Synthesizer Implementation: FreeTTS
FreeTTS is an open source implementation of JSAPI written entirely in Java. The implementation is based on Flite, a speech synthesizer built at Carnegie Mellon University. FreeTTS is not a full implementation of JSAPI, since it does not implement javax.speech.recognition.
Features:
- Standard download comes with three different voices: two male and one female
- More voices (for US English) based on the FestVox project can be imported
- Supports MBROLA voices (MBROLA is a speech synthesizer from the MBROLA project)
- Support for JSAPI (subset of javax.speech.synthesis only)
- Performs better than Flite on a couple of platforms
The main limitation of FreeTTS is that it does not render JSML speech markup. It parses JSML input, but discards the markup and speaks the content as plain text. You can download FreeTTS from sourceforge.net.
A Recognizer Implementation: Sphinx 4
Sphinx 4 is a sophisticated speech recognition system built using Java. It was also developed at Carnegie Mellon University. Previous versions were developed using C.
Features:
- Supports a wide range of grammar formats including: Java Speech Grammar Format, SimpleWordListGrammar, LMGrammar, FSTGrammar
- Continuous speech and grammar constrained recognition
- Large vocabulary
- Partial support for JSAPI
- Complete support for Java Speech Grammar Format (JSGF)
- Built using JDK 1.4
- Allows the training of new acoustic models
- Provides for the use of custom language models
- Provides for the addition of custom dictionaries
Sphinx 4 can be downloaded from sourceforge.net.