The aim of building and collecting this speech corpus is to provide training and test data for automatic speech recognition (ASR) systems, and to give phonetic research a database of read and spontaneous speech that covers as many phonetic and lexical phenomena as possible.
In 1998, funded by the national 863 Program, we built a read-speech text corpus based on linguistic/phonetic rules. That corpus greatly advanced speech engineering. With the progress of speech technology in recent years, however, new speech material is needed, and RASC863 was built to meet that need.
The design of RASC863 has the following characteristics. First, the phonetically balanced sentences in the corpus are mainly selected from colloquial materials, so they better match the conditions faced by real applications such as spoken dialogue systems. Second, every sentence in the corpus is complete in content and meaning, so it reflects the prosodic information of a sentence as fully as possible. Third, sentences were selected against the full tri-phone set, without first clustering the tri-phones into classes, which helps to address the sparsity of training data. Fourth, the maximum length of the selected sentences was extended to 35 syllables (characters), so that a sentence can to some extent be regarded as a small discourse, which increases the complexity of its prosodic structure.
【Principles in selection of recording materials】 The corpus consists mainly of colloquial materials and covers as many phonetic phenomena as possible, including segmental combinations and suprasegmental combinations.
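The selection principle above (cover as many phonetic combinations as possible with a limited number of sentences) can be illustrated with a greedy set-cover heuristic. This is only a sketch under stated assumptions: the source does not say which algorithm was used to select the RASC863 sentences, and the helper names `ngrams`, `greedy_select` and `coverage`, as well as the toy pinyin sentences, are hypothetical. Syllables stand in for phones to keep the example short.

```python
from typing import Dict, List, Set, Tuple

Unit = Tuple[str, ...]


def ngrams(units: List[str], n: int = 3) -> Set[Unit]:
    """Return the set of runs of n consecutive units in one sentence."""
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}


def greedy_select(candidates: Dict[str, List[str]],
                  max_len: int = 35, n: int = 3) -> List[str]:
    """Greedy set cover (an assumed technique, not the documented one):
    repeatedly take the sentence that adds the most unseen n-grams."""
    # Keep only sentences shorter than max_len units.
    pool = {sid: ngrams(s, n) for sid, s in candidates.items()
            if len(s) < max_len}
    covered: Set[Unit] = set()
    chosen: List[str] = []
    while pool:
        # sorted() makes ties deterministic: the smallest id wins.
        best = max(sorted(pool), key=lambda sid: len(pool[sid] - covered))
        gain = pool.pop(best) - covered
        if not gain:  # nothing new can be added any more
            break
        covered |= gain
        chosen.append(best)
    return chosen


def coverage(chosen: List[str], candidates: Dict[str, List[str]],
             n: int = 3) -> float:
    """Fraction of all n-grams in the candidate pool hit by the chosen set."""
    all_units: Set[Unit] = set()
    for s in candidates.values():
        all_units |= ngrams(s, n)
    got: Set[Unit] = set()
    for sid in chosen:
        got |= ngrams(candidates[sid], n)
    return len(got) / len(all_units) if all_units else 1.0


# Toy candidate pool (hypothetical data, one list of syllables per sentence).
toy = {
    "s1": ["wo", "men", "qu", "xue", "xiao"],
    "s2": ["wo", "men", "qu", "xue"],
    "s3": ["ta", "men", "qu", "xue", "xiao"],
}
selected = greedy_select(toy)
ratio = coverage(selected, toy)
```

In this toy pool the second sentence adds nothing once the first is chosen, so the heuristic skips it. A real selection would also have to weigh the tone combinations of 2-3 syllable words, which this sketch ignores.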
【Original materials】
- novels
- textbooks
- film scripts
- chats and interviews
- examples of modern Chinese
【Materials】
Each dialect site (city) has 20 sets of materials. For the texts, please refer to the per-city subfolders under the folder TXT. (Note: during the production of RASC863-G2, the actual pronunciation of individual speakers may differ slightly from the reference text. The correct text can be found in the PRAAT annotation files.)
Each set of recording materials includes two styles, spontaneous (colloquial) speech and read speech. The materials recorded by each speaker are listed in Table 1.
Monologue: each speaker freely chose one suitable topic out of 160 and talked about it in natural spoken style for 3-5 minutes. The files are named (CS/LY/NC/NJ/TY/WZ)+(F/M)xxx.wav, where xxx stands for the serial number. For example, the data of female speaker No. 1 from Changsha is CSF001.wav.
Answers to questions: each speaker was asked to answer some questions about work affiliation, personal hobbies, contact telephone number, website address, numbers, and so on. The corresponding files are named Axxxx, where xxxx stands for the serial number.
Common colloquial sentences: we collected 460 sentences used in daily life, of which each speaker read 23. The corresponding files are named Qxxxx, where xxxx stands for the serial number.
Sentences for information and communication applications: these include digits, characters and SMS texts. The corresponding files are named Xxxxx, where xxxx stands for the serial number.
Phonetically balanced sentences: these were selected from interviews, spoken dialogues, People's Daily and similar materials. Each sentence is shorter than 35 syllables, and the set was chosen to cover as many inter-syllable tri-phone junctures as possible. In total 1895 sentences were selected. Together they cover almost all syllables and inter-syllable di-phones (96% of di-phones) and most tri-phone combinations (84% of tri-phones). The tone combinations of 2-3 syllable words were also taken into account. The corresponding files are named Sxxxx, where xxxx stands for the serial number.
For the texts of each part, please refer to the file all.txt in the folder TXT.
The 160 common reference topics for the colloquial monologue can be found in the file topics.doc in the folder TXT.
Table 1: Materials recorded by each speaker (prompt sheet)
Composition of each speaker's materials | Style | Content
(CS/LY/NC/NJ/TY/WZ)+(F/M)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic freely chosen by the speaker
a0001-a0015 | Spontaneous | Answers to 23 questions
qxxxx | Read | Common colloquial sentences, 23 per speaker
xxxxx | Read | 5 sentences of digits, letters, SMS texts, etc.
sxxxx | Read | About 95 phonetically balanced sentences
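The file-naming scheme described in this section can be turned into a small parser, which may help when indexing the recordings. The sketch below is an assumption-laden illustration, not a tool shipped with the corpus: `parse_name` and `PROMPT_TYPES` are hypothetical names, the digit counts (three for monologues, four for prompts) are inferred from the placeholders and the CSF001 example, and the `.wav` extension on prompt files is assumed to be optional because the text does not state it. City codes are returned unexpanded.

```python
import re
from typing import Dict, Optional

# Monologue files: city code + gender + three-digit serial, e.g. CSF001.wav
MONOLOGUE = re.compile(r"^(CS|LY|NC|NJ|TY|WZ)([FM])(\d{3})(?:\.wav)?$",
                       re.IGNORECASE)
# Prompted files: one-letter prefix + four-digit serial, e.g. a0001
PROMPT = re.compile(r"^([AQXS])(\d{4})(?:\.wav)?$", re.IGNORECASE)

# Prefix meanings as described in the text (labels are paraphrased).
PROMPT_TYPES = {
    "A": "answers to questions (spontaneous)",
    "Q": "common colloquial sentences (read)",
    "X": "digits, characters and SMS texts (read)",
    "S": "phonetically balanced sentences (read)",
}


def parse_name(name: str) -> Optional[Dict[str, str]]:
    """Split a file name into its parts, or return None if it does not
    follow either naming pattern."""
    m = MONOLOGUE.match(name)
    if m:
        city, gender, serial = m.groups()
        return {"kind": "monologue", "city": city.upper(),
                "gender": gender.upper(), "serial": serial}
    m = PROMPT.match(name)
    if m:
        prefix, serial = m.groups()
        return {"kind": PROMPT_TYPES[prefix.upper()],
                "prefix": prefix.upper(), "serial": serial}
    return None


example = parse_name("CSF001.wav")
```

Matching is case-insensitive because the running text writes the prefixes in upper case (Axxxx) while the prompt sheet writes them in lower case (a0001).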
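Table 1: Materials recorded by each speaker (prompt sheet)
Composition of each speaker's materials | Style | Content
(CQ/GZ/SH/XM)Spon(f/m)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic freely chosen by the speaker
a0001-a0015 | Spontaneous | Answers to 15 questions
qxxxx | Read | Common colloquial sentences, 20 per speaker
dxxxx | Read | A number of common local words (dialect)
sxxxx | Read | About 110 phonetically balanced sentences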
Table 1: Prompt sheet
Composition of the prompt list | Style | Content
(CQ/GZ/SH/XM)Spon(f/m)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic selected by the speakers themselves
a0001-a0015 | Spontaneous | Answers to 15 elicited questions
qxxxx | Reading | Daily colloquial sentences, 20 for each speaker
xxxxx | Reading | Digits, characters and SMS texts
sxxxx | Reading | 110 or so phonetically balanced sentences