A speech corpus, also known as a spoken corpus, is a collection of speech preserved as audio recordings, as text transcriptions, or as both. These collections are useful in developing speech software and in conducting linguistic studies. The two varieties of speech corpus are spontaneous speech and read speech.
It helps to define what the words "speech" and "corpus" mean. Speech, in the broad sense, is any spoken utterance, whether it conveys thoughts, facts, or casual conversation. A corpus, in turn, refers to a structured collection of language samples.
A speech corpus is generally built from audio recordings, text-based transcriptions, or both. Recordings are captured with sound recording equipment and stored, often as MP3 files in electronic databases, to create a corpus. A transcriber, on the other hand, converts speech into written form, and the resulting transcription is compiled with others.
Any type of speech may be found in a speech corpus, but such databases are generally divided into two categories. The first, spontaneous speech, contains unscripted speech, such as that found in conversations or in oral storytelling. Read speech, by contrast, has a more formal, pre-planned structure. Examples include political speeches, news broadcasts, and audiobook readings. Some types, such as interviews, may fall into either category depending on the context.
One major advantage of a speech corpus is its practical usefulness in creating speech-based software. For example, many computers and other electronic devices offer speech-processing features, such as reading typed text aloud (speech synthesis), transforming spoken words into text (speech recognition), or identifying a speaker by unique vocal traits (speaker recognition). Samples drawn from a speech corpus can help improve this technology by supplying the data for acoustic models, which are statistical descriptions of the individual sounds of speech. In addition, the databases can assist with developing language learning audio tapes.
These functions tie in with another application for a speech corpus. Namely, scholars can take these preserved audio or written files and study the subtle grammatical variations that make up a language. A speech corpus can thus serve as a valuable tool for learning about pronunciation, word order, and other linguistic patterns. Researchers can further compare similarities and differences among regional dialects and languages if they create a collection with multiple languages, known as a multilingual corpus. The study of language through corpora is a specialized research concentration known as corpus linguistics, and it overlaps with computational linguistics, which applies computational methods to language.
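As a rough illustration of the kind of comparison described above, the following Python sketch contrasts how often each word occurs in two small sets of transcribed utterances. The utterances, the variable names `dialect_a` and `dialect_b`, and both helper functions are invented for this example and do not come from any real corpus or library; real studies would use far larger samples and more careful tokenization.

```python
from collections import Counter


def relative_frequencies(utterances):
    """Return each word's share of all words in a list of transcribed utterances."""
    words = [w for u in utterances for w in u.lower().split()]
    counts = Counter(words)
    total = len(words)
    # If there are no words, counts is empty and this yields an empty dict.
    return {w: c / total for w, c in counts.items()}


def frequency_differences(corpus_a, corpus_b):
    """Return, per word, its relative frequency in corpus_a minus that in corpus_b."""
    freq_a = relative_frequencies(corpus_a)
    freq_b = relative_frequencies(corpus_b)
    vocabulary = set(freq_a) | set(freq_b)
    return {w: freq_a.get(w, 0.0) - freq_b.get(w, 0.0) for w in vocabulary}


# Toy transcriptions, invented purely for illustration.
dialect_a = ["you all come back soon", "you all are welcome"]
dialect_b = ["you guys come back soon", "you guys are welcome"]

diffs = frequency_differences(dialect_a, dialect_b)

# Words with the largest absolute difference are candidates for dialect markers.
for word, diff in sorted(diffs.items(), key=lambda item: -abs(item[1]))[:2]:
    print(word, round(diff, 3))
```

Positive values mark words that are relatively more common in the first collection, negative values those more common in the second, and values near zero mark shared vocabulary.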
Many transcript databases include notations, or tags, that carry information about the individual components of a piece of text, such as a word's part of speech. Adding these tags is called annotation. In the process of abstraction, linguists document the terms in a speech and translate them into the terms of an analytical framework. Such work may be useful to someone who wishes to learn about little-known civilizations through their texts. The final step of corpus study is analysis, which means deriving comparisons and theoretical generalizations from a collection of speech components.
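To make annotation and analysis more concrete, here is a minimal Python sketch of a tagged transcript and a simple count over its tags. The `Token` class, the tag labels (`INTJ`, `PRON`, `VERB`), the speaker labels, and the sample words are all assumptions made for this example; actual corpora follow their own annotation schemes and file formats.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One transcribed word plus its annotations (hypothetical scheme)."""
    text: str
    pos: str      # part-of-speech tag
    speaker: str  # label of the person speaking


# Toy annotated transcript, invented purely for illustration.
transcript = [
    Token("well", "INTJ", "A"),
    Token("I", "PRON", "A"),
    Token("saw", "VERB", "A"),
    Token("it", "PRON", "A"),
    Token("did", "VERB", "B"),
    Token("you", "PRON", "B"),
]


def tag_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.pos for token in tokens)


def tag_counts_by_speaker(tokens):
    """Count part-of-speech tags separately for each speaker."""
    result = {}
    for token in tokens:
        result.setdefault(token.speaker, Counter())[token.pos] += 1
    return result


overall = tag_counts(transcript)
per_speaker = tag_counts_by_speaker(transcript)
print(overall.most_common())
print(per_speaker)
```

Here the tags stand in for the annotation step, and the counting functions stand in for a very simple form of analysis; the abstraction step would decide what the tags mean within a chosen linguistic framework.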