Audio Mining: Where We Are and What’s Next

01Jan2009

By Alice Barrett Mack

Some documents consider audio mining to be only the indexing phase. For purposes of this document, audio mining is considered to be both the searching and indexing phases.

Audio mining is a speaker-independent speech recognition technique used to search audio or video files for occurrences of spoken words or phrases. The speech recognition search engine identifies the words or phonemes spoken within a file and generates a searchable index that records, for each important word or phoneme, a time stamp for every location where it occurs. Essentially, it creates indices of words or speech sounds. Users can then search audio and video files in much the same way they use search engines to search the Internet.
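To make the idea of a time-stamped index concrete, here is a minimal sketch in Python. It assumes a recognizer has already produced a list of (word, start time) pairs; that input format and the names `build_index` and `search` are illustrative assumptions, not part of any vendor's product.

```python
from collections import defaultdict

def build_index(recognized):
    """Map each lowercased word to the start times (in seconds) where it was recognized."""
    index = defaultdict(list)
    for word, start in recognized:
        index[word.lower()].append(start)
    return dict(index)

def search(index, term):
    """Return every time stamp at which the term occurs (empty list if it never does)."""
    return index.get(term.lower(), [])

# Hypothetical recognizer output: (word, start time in seconds).
recognized = [
    ("the", 0.0), ("witness", 0.4), ("signed", 1.1),
    ("the", 1.6), ("contract", 1.8),
]
index = build_index(recognized)
print(search(index, "the"))       # [0.0, 1.6]
print(search(index, "Contract"))  # [1.8]
```

Once the index exists, a search is a simple lookup, so no audio has to be replayed to find where a word was spoken.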

Audio mining software uses two basic methods, both consisting of indexing and searching phases. Large Vocabulary Continuous Speech Recognition (LVCSR) analyzes words or phrases to generate an index; phonetic audio mining analyzes sound patterns to generate an index and stores the phonetic content of the speech, rather than information about words. The audio mining software makes a rough transcript of the important words in the file (the index) and then searches one or more index files for all matches to a specified search term.
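The phonetic method can be sketched in the same spirit. The example below assumes the index is a list of (phoneme, start time) pairs and that a search term is converted to phonemes through a small pronunciation lexicon; the lexicon, the phoneme symbols, and the function name are hypothetical. It also uses exact sequence matching for simplicity, whereas a real phonetic engine would presumably have to tolerate recognition errors by scoring approximate matches.

```python
# Hypothetical pronunciation lexicon; a real system would use a far larger one.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "the": ["DH", "AH"],
}

def phonetic_search(phoneme_index, term):
    """Return start times where the term's phoneme sequence occurs in the index."""
    target = LEXICON[term.lower()]
    symbols = [phoneme for phoneme, _ in phoneme_index]
    n = len(target)
    return [
        phoneme_index[i][1]
        for i in range(len(symbols) - n + 1)
        if symbols[i:i + n] == target
    ]

# Hypothetical phonetic index for the spoken words "the cat": (phoneme, start time).
phoneme_index = [("DH", 0.0), ("AH", 0.1), ("K", 0.3), ("AE", 0.4), ("T", 0.5)]
print(phonetic_search(phoneme_index, "cat"))  # [0.3]
```

Because the index stores sounds rather than words, the search term only needs a pronunciation at search time; it does not need to have been in the recognizer's vocabulary when the index was built.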

Each technique has advantages and disadvantages that vendors tout as critical. The specific merits of each method (e.g., whether one is faster or more accurate) are beyond the scope of this document. What is important is that the major advantage of audio mining is its ability to process and search audio data many times faster than a human could (up to many thousands of times faster for large files), because a human would have to listen to every word of the audio. In some situations, the sheer volume of material would make listening to all of it virtually impossible.

Status of the Technology

Audio mining has been used for searching television captions and other media content because the audio mining search is able to locate the speech content associated with the text for each caption. Used in this environment, audio mining can then produce a rough caption text file that includes the words that were recognized. In addition, it contains start and end timestamps for each word indicating exactly where that word was spoken in the source. Audio mining could also be used to automatically convert an existing transcript into a caption. This capability uses the audio mining file as a source of timestamps to add to the existing transcript. Simply by matching the words within the audio mining file to those within the transcript, audio mining can transfer timestamps from the file to the words in the transcript, thereby creating synchronized caption text.(2)
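The timestamp-transfer step described above can be sketched as a word alignment. This is only an illustration under stated assumptions: the audio mining file is modeled as a list of (word, start, end) tuples, the alignment uses Python's standard `difflib.SequenceMatcher` rather than whatever matching a commercial product performs, and the function name is hypothetical.

```python
from difflib import SequenceMatcher

def transfer_timestamps(recognized, transcript_words):
    """Copy start/end times from recognized words onto matching transcript words.

    Transcript words with no recognized counterpart keep None for both times.
    """
    rec_words = [word.lower() for word, _, _ in recognized]
    ref_words = [word.lower() for word in transcript_words]
    matcher = SequenceMatcher(a=rec_words, b=ref_words, autojunk=False)

    timed = [(word, None, None) for word in transcript_words]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = recognized[block.a + k]
            timed[block.b + k] = (transcript_words[block.b + k], start, end)
    return timed

# Hypothetical audio mining output: the recognizer missed "The" and misheard "signed".
recognized = [
    ("witness", 0.4, 0.9), ("sign", 1.1, 1.5),
    ("the", 1.6, 1.7), ("contract", 1.8, 2.4),
]
transcript = ["The", "witness", "signed", "the", "contract"]

for entry in transfer_timestamps(recognized, transcript):
    print(entry)
```

Words left without times could be filled in by interpolating between their timed neighbors; the sketch leaves them as `None` so the gaps stay visible.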

According to Dr. John Makhoul, chief scientist at BBN Technologies, audio mining works well for the location function; it is not yet good enough for the text creation function.

Audio mining development is still in its early stages, with research ongoing at a number of universities and companies. Many disadvantages and obstacles remain to be overcome. One of the principal disadvantages is low accuracy in a real-time environment; in an offline mode, where it can analyze the audio more slowly, accuracy improves. When multiple microphones are in use and multiple people are speaking, it has trouble distinguishing between speakers. Time stamps are not always accurate. In addition, it has problems transcribing function words (such as “the”), frequently inserting words that make the text read peculiarly.

Audio mining is domain specific, i.e., it is trained for specific applications or categories of speakers. For example, it does a good job on news programs because anchor newspeople and reporters are trained to speak distinctly. Its accuracy rate is lower on talk shows where the speakers talk more spontaneously and less distinctly.

Because it is speaker-independent, audio mining has potential in a legal environment for examining the contents of depositions, testimony, and other audio and audio-visual files. In a courtroom situation where a video exists, for example, audio mining could be used to search for specific words or phrases in a transcript; the time stamps allow each match to be synchronized with the video so that the passage can be located easily. Low accuracy rates currently preclude this as a viable application in courts.

Audio mining is costly and has not yet been proven to be financially justifiable; as a result, there are few deployments. It is still too early in its development to have had any impact on reporters.
Primary vendors in the audio mining market and the method each uses are:

What’s Next?

Some view audio mining as a way to create text as a replacement for a transcript. Should court reporters and captioners fear that they will be replaced? The use of audio mining for this purpose is well into the future. Research will continue but seems to be going slowly. One such project, EARS (Effective, Affordable, Reusable Speech-to-Text), had as one of its objectives turning radio and television broadcasts or telephone speech into text with 90 to 95 percent accuracy. The project was cancelled before its completion. A subsequent project will focus more on converting foreign-language documents into English text with high accuracy rates.

Glossary of Terms

Audio Mining: A speaker-independent speech recognition technique used to search audio or video files for occurrences of spoken words or phrases.

Phoneme: The smallest unit of sound that distinguishes one word from another in a language.

Utterance: An entity that represents a single meaning to the computer; an utterance can be a single word, multiple words, a sentence, or even multiple sentences.

  1. Some documents consider audio mining to be only the indexing phase. For purposes of this document, audio mining is considered to be both the searching and indexing phases.
  2. Adapted from the ScanSoft audio mining documentation on the ScanSoft Web site.

This article was originally posted on the National Court Reporters Association website.
