Finally, notice that even though TIMIT is a speech corpus, its transcriptions and associated data are just text, and can be processed using programs just like any other text corpus.

Therefore, many of the computational methods described in this book are applicable.
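Because the transcriptions are plain text, ordinary string handling is enough to read them. The sketch below assumes a time-aligned transcription stored as one `start end label` triple per line; the sample lines and the function name `parse_aligned` are invented for illustration and are not taken from the corpus.

```python
def parse_aligned(text):
    """Parse lines of the form 'start end label' into (label, start, end) tuples."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end, label = line.split(maxsplit=2)
        segments.append((label, int(start), int(end)))
    return segments

# Hypothetical word-level transcription (the offsets are made up).
sample = """\
0 1200 she
1200 2500 had
2500 4100 your
"""

words = parse_aligned(sample)
# Duration of each segment, in the same units as the offsets.
durations = {label: end - start for label, start, end in words}
print(words)
print(durations)
```

Once the segments are ordinary tuples, counting, filtering, and sorting them works exactly as it would for any other text-derived data.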

First, the corpus contains two layers of annotation, at the phonetic and orthographic levels.
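One way to relate the two layers is through their time offsets. The following sketch assumes both layers are available as `(label, start, end)` tuples and that each phone lies entirely within one word's span; the data and the helper `phones_by_word` are hypothetical.

```python
def phones_by_word(words, phones):
    """Group phone segments under the word whose time span contains them."""
    aligned = []
    for word, w_start, w_end in words:
        inside = [p for p, p_start, p_end in phones
                  if p_start >= w_start and p_end <= w_end]
        aligned.append((word, inside))
    return aligned

# Made-up offsets, for illustration only.
words = [("she", 100, 300), ("had", 300, 600)]
phones = [("h#", 0, 100), ("sh", 100, 200), ("iy", 200, 300),
          ("hv", 300, 400), ("ae", 400, 500), ("d", 500, 600)]

for word, segment in phones_by_word(words, phones):
    print(word, segment)
```

Segments that fall outside every word span, such as the initial `h#` in this sample, are simply left unassigned.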

In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.

A lexical resource could also be a phrasal lexicon, where the key field is a phrase rather than a single word.

A thesaurus also consists of record-structured data, where we look up entries via non-key fields that correspond to topics.
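A record structure of this kind maps naturally onto a dictionary of records. The sketch below uses invented entries and field names (`pos`, `topics`) to show lookup by key, and then builds a reverse index so that entries can be reached through a non-key field, as in a thesaurus.

```python
from collections import defaultdict

# Hypothetical records: the key is a word or phrase, the rest are fields.
lexicon = {
    "dog": {"pos": "noun", "topics": ["animal", "pet"]},
    "cat": {"pos": "noun", "topics": ["animal", "pet"]},
    "kick the bucket": {"pos": "verb phrase", "topics": ["death"]},
}

# Lookup by key field.
print(lexicon["dog"]["pos"])

# Build an index over a non-key field so entries can be found by topic.
by_topic = defaultdict(list)
for key, record in lexicon.items():
    for topic in record["topics"]:
        by_topic[topic].append(key)

print(by_topic["pet"])
```

The same container serves a conventional lexicon and a phrasal one; only the kind of string used as the key differs.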

Finally, TIMIT includes demographic data about the speakers, permitting fine-grained study of vocal, social, and gender characteristics.

Additionally, the design strikes a balance between having multiple speakers say the same sentence, which permits comparison across speakers, and covering a large range of sentences, which maximizes coverage of diphones.

As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization.

Despite its complexity, the TIMIT corpus contains only two fundamental data types, namely lexicons and texts.

As we saw in Chapter 2, most lexical resources can be represented using a record structure, i.e. a key plus one or more fields.

A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated.

A fourth feature of TIMIT is the hierarchical structure of the corpus.

Figure: Structure of the Published TIMIT Corpus. The CD-ROM contains doc, train, and test directories at the top level; the train and test directories both have 8 sub-directories, one per dialect region; each of these contains further subdirectories, one per speaker; the contents of the directory for female speaker [...]
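A hierarchical layout means that a file's location already encodes metadata about it. The sketch below splits a path into its levels; the example path, the file name `sa1.wav`, and the convention that the first letter of the speaker directory marks sex (`f` or `m`) are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def describe(path):
    """Split a corpus path of the form split/dialect/speaker/file into fields."""
    split, dialect, speaker, filename = PurePosixPath(path).parts
    return {
        "split": split,
        "dialect_region": dialect,
        "speaker": speaker,
        # Assumed convention: speaker IDs start with 'f' or 'm'.
        "sex": {"f": "female", "m": "male"}.get(speaker[0], "unknown"),
        "utterance": PurePosixPath(filename).stem,
    }

info = describe("train/dr1/fvmh0/sa1.wav")  # hypothetical path
print(info)
```

Records like these can then be grouped by dialect region or speaker without opening any of the files.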

