Recording Corpus Design Specification

This speech corpus was collected and built to provide training and test data for automatic speech recognition (ASR) systems and material for phonetic research. The speech materials therefore cover both read and spontaneous speaking styles and as many phonetic and lexical phenomena as possible.

In 1998, with funding from the national 863 program, we built a read-speech text corpus based on linguistic and phonetic rules, which greatly advanced speech engineering. With the progress of speech technology in recent years, however, new speech materials were needed, and RASC863 was built in response.

The design of RASC863 has the following characteristics. First, the phonetically balanced sentences are selected mainly from spoken-language materials, so they better match the conditions that speech recognition faces in real applications such as spoken dialogue systems. Second, each sentence is complete in content and meaning, so it reflects the prosodic information of a sentence as fully as possible. Third, triphones are selected without being clustered into classes, that is, a full triphone set is used, which helps to address the sparsity of training data. Fourth, the maximum length of the selected sentences is raised to 35 syllables (35 characters), so that a sentence can to some extent be regarded as a small discourse, which increases the complexity of the prosodic structure.

 

Principles for selecting recording materials

The corpus consists mainly of colloquial materials and covers as many phonetic phenomena as possible, including segmental combinations and suprasegmental combinations.
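The coverage principle above can be illustrated with a greedy selection loop. This is a minimal sketch, not the procedure the corpus builders used: the source does not describe their algorithm. The function names `triphones` and `greedy_select`, the toy phone labels, and the representation of a sentence as a flat list of phone labels are all assumptions made for illustration, and any sentence-length limit is assumed to have been applied beforehand.

```python
from typing import Dict, List, Set, Tuple

Triphone = Tuple[str, ...]


def triphones(phones: List[str]) -> Set[Triphone]:
    """Return the set of adjacent phone triples in one sentence."""
    return {tuple(phones[i:i + 3]) for i in range(len(phones) - 2)}


def greedy_select(corpus: Dict[str, List[str]]) -> Tuple[List[str], Set[Triphone]]:
    """Repeatedly pick the sentence that adds the most unseen triphones.

    `corpus` maps a hypothetical sentence ID to its phone sequence.
    Ties are broken by sentence ID so the result is deterministic.
    """
    remaining = {sid: triphones(p) for sid, p in corpus.items()}
    covered: Set[Triphone] = set()
    chosen: List[str] = []
    while remaining:
        best = max(sorted(remaining), key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining sentence adds a new triphone
        chosen.append(best)
        covered |= gain
        del remaining[best]
    return chosen, covered


# Toy example with made-up phone labels.
toy_corpus = {
    "s1": ["a", "b", "c", "d"],
    "s2": ["a", "b", "c"],
    "s3": ["x", "y", "z"],
}
selected, covered_set = greedy_select(toy_corpus)
```

In the toy example, `s2` is never selected because its only triphone is already covered by `s1`. A real selection would also have to weigh the tone combinations and sentence completeness mentioned in this document, which this sketch ignores.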

 

Original materials

- novels
- textbooks
- film scripts
- chats and informal interviews
- examples of modern Chinese

 

Materials

Each dialect site (city) has 20 sets of materials. For the full text, see the per-site subfolders under the folder “TXT”. (Note: during the production of RASC863-G2, the actual pronunciation of some speakers may differ slightly from this reference text. The correct text can be found in the PRAAT annotation files.)

Each set of recording materials includes two styles, spontaneous (colloquial) speech and read speech. The composition of the materials for each speaker is given in Table 1.

Monologue, 3-5 minutes. Each speaker freely chose one suitable topic from 160 topics and then talked about it in natural spontaneous speech. The files are named (CS/LY/NC/NJ/TY/WZ)+(F/M)xxx.wav, where xxx is the serial number. For example, the data of female speaker No. 1 from Changsha is CSF001.wav.

Answers to questions. Each speaker was asked to answer some questions about work affiliation, personal hobbies, contact telephone number, website addresses, numerals, and so on. The corresponding files are named Axxxx, where xxxx is the serial number.

Daily colloquial sentences. We collected 460 sentences, of which each speaker read 23. The corresponding files are named Qxxxx, where xxxx is the serial number.

Sentences for information and communication applications. These include digits, characters and SMS texts. The corresponding files are named Xxxxx, where xxxx is the serial number.

Phonetically balanced sentences. These were selected from interviews, spoken dialogues and People’s Daily, are shorter than 35 syllables, and cover as many inter-syllabic triphone junctures as possible. In total, 1895 sentences were selected, covering almost all syllables and inter-syllabic diphones and most triphone combinations; 96% of the diphones and 84% of the triphones in the original materials are covered. The tone combinations of 2-3 syllable words were also taken into account. The corresponding files are named Sxxxx, where xxxx is the serial number.

For details of the materials in each part, see the file “all.txt” in the folder “TXT”.

The 160 common reference topics for the spontaneous monologue can be found in the file “topics.doc” in the folder “TXT”.

Table 1: Prompt sheet for each speaker

| Composition of the prompt list | Style | Content |
|---|---|---|
| (CS/LY/NC/NJ/TY/WZ)+(F/M)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic chosen freely by the speaker |
| a0001-a0015 | Spontaneous | Answers to 23 questions |
| qxxxx | Reading | Daily colloquial sentences, 23 per speaker |
| xxxxx | Reading | Digits, letters, SMS texts, etc., 5 sentences |
| sxxxx | Reading | About 95 phonetically balanced sentences |
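The figures quoted in this section (20 sets of materials per site, 460 daily colloquial sentences, 1895 phonetically balanced sentences) are consistent with an even split of each pool across the 20 sets. The arithmetic can be checked with the short sketch below. It rests on assumptions: the source does not say how sentences were actually assigned to sets, and both the round-robin helper `split_into_sets` and the sentence IDs it generates are hypothetical.

```python
from typing import List


def split_into_sets(items: List[str], n_sets: int) -> List[List[str]]:
    """Deal items round-robin into n_sets lists of near-equal size."""
    return [items[i::n_sets] for i in range(n_sets)]


N_SETS = 20  # sets of materials per site, as stated in the text

# Hypothetical IDs standing in for the real sentence lists.
daily = [f"q{i:04d}" for i in range(1, 461)]      # 460 daily sentences
balanced = [f"s{i:04d}" for i in range(1, 1896)]  # 1895 balanced sentences

daily_sets = split_into_sets(daily, N_SETS)
balanced_sets = split_into_sets(balanced, N_SETS)

daily_sizes = sorted({len(s) for s in daily_sets})
balanced_sizes = sorted({len(s) for s in balanced_sets})

print("daily sentences per set:", daily_sizes)
print("balanced sentences per set:", balanced_sizes)
```

Under this assumption each set holds 23 daily sentences (460 / 20), and the balanced sentences come out at 94 or 95 per set (1895 = 20 × 94 + 15), which agrees with the "about 95" given for the CS/LY/NC/NJ/TY/WZ prompt sheet.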

 

Table 1: Prompt sheet

| Composition of the prompt list | Style | Content |
|---|---|---|
| (CQ/GZ/SH/XM)Spon(f/m)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic selected by the speaker |
| a0001-a0015 | Spontaneous | Answers to 15 elicited questions |
| qxxxx | Reading | Daily colloquial sentences, 20 per speaker |
| dxxxx | Reading | A number of locally common words (dialect) |
| xxxxx | Reading | Digits, characters and SMS texts |
| sxxxx | Reading | About 110 phonetically balanced sentences |
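The file-naming conventions in the prompt sheets can be turned into a small parser, which may help when indexing the recordings. This is a sketch under stated assumptions: the digit widths (three for monologues, four for prompt files) are read off the "xxx" and "xxxx" placeholders, the ".wav" extension is treated as optional, and the function name `parse_name` and the returned dictionary layout are invented for illustration rather than taken from any tool shipped with the corpus.

```python
import re
from typing import Dict, Optional, Union

# Monologue names, e.g. CSF001 (site code + F/M + 3 digits).
MONOLOGUE_A = re.compile(r"^(?P<site>CS|LY|NC|NJ|TY|WZ)(?P<sex>[FM])(?P<num>\d{3})$")
# Monologue names of the form (CQ/GZ/SH/XM)Spon(f/m)xxx.
MONOLOGUE_B = re.compile(r"^(?P<site>CQ|GZ|SH|XM)Spon(?P<sex>[fm])(?P<num>\d{3})$")
# Prompt files: one letter (a/q/x/s/d, either case) + 4 digits.
PROMPT = re.compile(r"^(?P<part>[aqxsd])(?P<num>\d{4})$", re.IGNORECASE)

PART_NAMES = {
    "a": "answers to questions",
    "q": "daily colloquial sentences",
    "x": "digits, characters and SMS texts",
    "s": "phonetically balanced sentences",
    "d": "local common words (dialect)",
}


def parse_name(filename: str) -> Optional[Dict[str, Union[str, int]]]:
    """Classify a recording by its file name; return None if it matches no pattern."""
    stem = filename[:-4] if filename.lower().endswith(".wav") else filename
    for pattern in (MONOLOGUE_A, MONOLOGUE_B):
        m = pattern.match(stem)
        if m:
            return {
                "part": "monologue",
                "site": m.group("site"),
                "sex": m.group("sex").upper(),
                "number": int(m.group("num")),
            }
    m = PROMPT.match(stem)
    if m:
        return {
            "part": PART_NAMES[m.group("part").lower()],
            "number": int(m.group("num")),
        }
    return None


example = parse_name("CSF001.wav")
```

Here `example` describes the monologue of female speaker No. 1 from Changsha, the case given in the text. Names that fit none of the documented patterns return `None` rather than raising, so unexpected files can be reported and skipped.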