录音语料设计规范

Recording Material Design Specification

建立和收集语音库目的是为语音识别系统提供训练库和测试库,为语音研究提供朗读和口语风格,覆盖尽可能多地语音、词汇的语音库。

863项目资金的资助下,在1998年建立了一个基于语言学/语音学规则的朗读文本语料库。这一语料库的建立,极大的推动了语音工程的发展。但是随着这些年来的语音工程进步,需要设计新的语料。

RASC863语料库设计有如下的特点:第一,这个语料库中的语音平衡的句子主要是从口语语料挑选来的,所以更符合语音识别面对的真实情形;第二,语料库中的句子在内容和语义上都是完整的,所以能够尽可能的反映一个句子的韵律信息;第三,我们对三音子不进行归类的挑选,这样可以有效的解决训练数据稀疏的问题。第四,将被挑选句子的最大长度增加到35字,从某种程度上可以算是一个小篇章,增加韵律结构的复杂度。

This speech corpus was collected to provide training and test data for automatic speech recognition (ASR) systems and material for phonetic research. The speech materials therefore cover both read and spontaneous speaking styles and as many phonetic and lexical phenomena as possible.

In 1998, with funding from the national 863 Program, we built a read-speech corpus based on linguistic and phonetic rules, which greatly advanced speech engineering. With the progress of speech technology in recent years, however, new speech material became necessary, and RASC863 was built to meet that need.

RASC863 has the following characteristics. First, its phonetically balanced sentences were selected mainly from spoken-language material, so they better match the conditions faced by real applications such as spoken dialogue systems. Second, each sentence is complete in content and meaning, so it reflects the prosodic information of a sentence as fully as possible. Third, tri-phones were selected without being grouped into classes, that is, a full tri-phone set was used, which helps to address the sparsity of training data. Fourth, the maximum length of a selected sentence was extended to 35 syllables, so that a sentence can amount to a short discourse, which increases the complexity of the prosodic structure.

 

【语料挑选原则】 口语为主。尽量覆盖语音现象,包括音段搭配和超音段的组合。

Principles for selecting recording materials

The corpus consists mainly of colloquial material and covers as many phonetic phenomena as possible, including segmental and suprasegmental combinations.

 

【原始语料】

- 小说
- 课本
- 电影剧本
- 聊天访谈
- 现代汉语示例

Original materials

- novels
- textbooks
- film scripts
- informal interviews
- examples of modern Chinese

 

【语料】

每个方言点包含20套语料,具体语料文本参见“TXT”目录下各地语料子目录。(注:RASC863-G2制作过程中,发音人个别实际发音可能会和本参考语料有微小差异。)

每套录音语料包括口语和朗读两种体裁。每个发音人的录音语料具体组成内容见表1。

独白3-5分钟,由发音人从160个话题中任意选择一个适合自己的话题,然后用自然的口语讲述。文件名为:(CS/LY/NC/NJ/TY/WZ)+(F/M)xxx.wav(xxx代表具体数字编号)。例如长沙地区1号女发音人数据为:CSF001.wav。

回答问题是让发音人回答一些问题,包括工作单位、个人爱好、联系电话、网址、数字等问题。对应文件名称为:Axxxx(xxxx代表具体数字编号)。

常用口语句子,我们收集了460个,每个发音人读23个。对应文件名称为:Qxxxx(xxxx代表具体数字编号)。

面向信息和通讯应用的语句包括数字,字符和手机短信内容等, 对应文件名称为:Xxxxx(xxxx代表具体数字编号)。

语音平衡的句子,选自访谈对话、口语对话以及人民日报等语料,句长小于35个音节,尽量覆盖所有的音节间的三音子音联。整个挑选的句子有1895个,覆盖几乎所有音节、音节间两音子和大部分三音子组合。同时兼顾2-3音节词的声调搭配。对应文件名称为:Sxxxx(xxxx代表具体数字编号)。

各部分具体语料请参见“TXT”目录下的“all.txt”文件。

口语独白部分的160个常见参考话题请参见“TXT”目录下的“topics.doc”文件。

表1:每个发音人的发音语料(prompt sheet)

每个发音人语料的组成 | 发音方式 | 内容说明
(CS/LY/NC/NJ/TY/WZ)+(F/M)xxx | 自然独白口语 | 发音人自由挑选一个话题口述:3-5分钟
a0001-a0015 | 自然口语 | 回答23个问题
qxxxx | 朗读 | 常用口语句子 每人23个
xxxxx | 朗读 | 数字,字母,短信等5句
sxxxx | 朗读 | 语音平衡的句子 95句左右

Materials

We used 20 sets of materials at each dialect site. For the full texts, see the per-site subfolders under the folder TXT. (During recording, a speaker's actual pronunciation may differ slightly from the reference text in these files. The corrected text is given in the PRAAT annotation files.)

Each set of materials includes two styles, spontaneous speech and read speech. The materials for each speaker are detailed in Table 1.

For the monologue, each speaker chose one of 160 topics and talked about it in natural spontaneous speech for 3-5 minutes. The files are named (CS/LY/NC/NJ/TY/WZ)+(F/M)xxx.wav, where xxx is the serial number. For example, the file for female speaker No. 1 from Changsha is CSF001.wav.

Each speaker answered a set of questions about work affiliation, personal hobbies, contact telephone number, website address, numerals, and so on. The corresponding files are named Axxxx, where xxxx is the serial number.

We collected 460 sentences commonly used in daily conversation, of which each speaker read 20. The corresponding files are named Qxxxx, where xxxx is the serial number.

The sentences oriented to information and communication applications include digits, characters, and SMS texts. The corresponding files are named Xxxxx, where xxxx is the serial number.

The phonetically balanced sentences were selected from interviews, spoken dialogues, and People’s Daily. Each sentence is shorter than 35 syllables, and together they cover as many inter-syllable tri-phones as possible. In total, 1895 sentences were selected, covering almost all syllables and inter-syllable di-phones and most tri-phone combinations: 96% of di-phones and 84% of tri-phones are covered. The tone combinations of 2-3 syllable words were also taken into account. The corresponding files are named Sxxxx, where xxxx is the serial number.
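The coverage figures quoted for di-phones and tri-phones can be thought of as the share of a target unit inventory that occurs in at least one selected sentence. The following is a minimal sketch of that calculation, not the procedure actually used for RASC863: the phone labels, the toy sentences, the target inventory, and the function names are all hypothetical, and each sentence is assumed to be already converted to a list of phone units.

```python
from typing import Iterable, List, Set, Tuple

Unit = Tuple[str, ...]


def ngrams(units: List[str], n: int) -> Set[Unit]:
    """Return the set of contiguous n-grams in one sequence of phone units."""
    return {tuple(units[i:i + n]) for i in range(len(units) - n + 1)}


def coverage(sentences: Iterable[List[str]], targets: Set[Unit], n: int) -> float:
    """Fraction of the target n-grams found in at least one sentence."""
    if not targets:
        return 0.0
    seen: Set[Unit] = set()
    for units in sentences:
        seen |= ngrams(units, n) & targets
    return len(seen) / len(targets)


# Toy data (hypothetical phone labels, for illustration only).
toy_sentences = [["n", "i", "h", "ao"], ["h", "ao", "m", "a"]]
toy_targets = {
    ("n", "i", "h"), ("i", "h", "ao"), ("h", "ao", "m"),
    ("ao", "m", "a"), ("m", "a", "n"),
}
toy_coverage = coverage(toy_sentences, toy_targets, 3)
print(f"tri-phone coverage: {toy_coverage:.0%}")
```

With n set to 2 the same function gives di-phone coverage. Restricting the units to those that straddle a syllable boundary, as the selection criterion above describes, would need syllable boundary marks in the input, which this sketch leaves out.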

For details of the materials in each part, please refer to the file ALL.xls in the folder TXT.

The 160 daily topics for colloquial monologue can be found in the file TOPICS.doc in the folder TXT.
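The file naming scheme described above can be checked mechanically. The sketch below is one possible way to do so and is not part of the corpus distribution. It assumes that the serial number has exactly three digits for monologue files and four digits for the other types, as the placeholders xxx and xxxx and the example CSF001.wav suggest, and that letter case is not significant because Table 1 writes the prefixes in lower case. The function name `parse_name` and the returned keys are hypothetical.

```python
import os
import re
from typing import Dict, Optional

# Monologue files: site code + sex + three-digit serial, e.g. CSF001.wav
MONOLOGUE = re.compile(r"(CS|LY|NC|NJ|TY|WZ)([FM])(\d{3})", re.IGNORECASE)
# Other files: type letter + four-digit serial, e.g. A0001.wav
UTTERANCE = re.compile(r"([AQXS])(\d{4})", re.IGNORECASE)

KINDS = {
    "A": "answers to questions",
    "Q": "daily colloquial sentences",
    "X": "digits, characters and SMS texts",
    "S": "phonetically balanced sentences",
}


def parse_name(filename: str) -> Optional[Dict[str, str]]:
    """Split a corpus file name into its parts, or return None if it does not fit."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    m = MONOLOGUE.fullmatch(stem)
    if m:
        return {
            "kind": "spontaneous monologue",
            "site": m.group(1).upper(),
            "sex": m.group(2).upper(),
            "serial": m.group(3),
        }
    m = UTTERANCE.fullmatch(stem)
    if m:
        return {"kind": KINDS[m.group(1).upper()], "serial": m.group(2)}
    return None


print(parse_name("CSF001.wav"))
print(parse_name("s0042.TextGrid"))
```

Because the extension is stripped before matching, the same function also accepts the names of the annotation files that accompany the sound files.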

 

表1:每个发音人的发音语料(prompt sheet)

每个发音人语料的组成 | 发音方式 | 内容说明
(CQ/GZ/SH/XM)Spon(f/m)xxx | 自然独白口语 | 发音人自由挑选一个话题口述:3-5分钟
a0001-a0015 | 自然口语 | 回答15个问题
qxxxx | 朗读 | 常用口语句子 每人20个
dxxxx | 朗读 | 本地常用词汇若干(方言)
sxxxx | 朗读 | 语音平衡的句子 110句左右

 

 

Table 1: Prompt sheet

Composition of the prompt list | Style | Content
(CQ/GZ/SH/XM)Spon(f/m)xxx | Spontaneous monologue | A 3-5 minute monologue on a topic selected by the speaker
a0001-a0015 | Spontaneous | Answers to 15 elicited questions
qxxxx | Reading | Daily colloquial sentences, 20 for each speaker
xxxxx | Reading | Digits, characters and SMS texts
sxxxx | Reading | About 110 phonetically balanced sentences

发音人规范

Speaker Specification

每个地区发音人200个,没有发音障碍,听力正常。年龄、性别以及口音和文化程度分布如下, 允许误差5%。

口音按照普通话水平测试标准分级,分为三级,每级又分甲乙两等。首先由录音人判断发音人的普通话级别,最终由专家抽样检查。

每个方言点的发音人信息请参见“..\METADATA\SPECSPK\SPKINFO”目录中的WORD文件,其中有每个发音人的序号,所用本方言点语料编号、姓名、年龄、性别、文化程度、联系电话、录音时间、口语独白的话题、录音场所面积、录音场所噪音、普通话等级等信息。

Each dialect site had 200 speakers, none of whom had speech or hearing impairments. The distribution of age, sex, accent, and educational background is listed below; a deviation of up to 5% is allowed.

Accent was graded into three levels according to the standard of the Putonghua Proficiency Test (PSC), with each level subdivided into two grades, A and B. Each speaker's Putonghua level was first judged by the recording staff and finally verified by experts through sample checks.

For detailed information on the speakers at each dialect site, see the Word documents in the folder “..\METADATA\SPECSPK\SPKINFO”. They list each speaker's sequence number, the number of the material set used at that site, name, age, sex, educational background, contact telephone number, recording time, monologue topic, area of the recording room, noise level of the recording room, and Putonghua proficiency level.

 

发音人要求和分布

年龄 | 青年 50% | 中年 40% | 老年 10%
性别 | 男女各一半 | 男女各一半 | 男女各一半
口音 | 中度二级口音80%,一级乙等5%,三级15%
文化程度 | 90% 高中以上学历,10%高中以下学历

 

Requirements and distribution of speakers

Age | Young 50% | Middle-aged 40% | Old 10%
Sex | male 50%, female 50% | male 50%, female 50% | male 50%, female 50%
Accent (Putonghua proficiency level) | Level 2 (intermediate) 80%; Level 1, grade B (more fluent in Putonghua) 5%; Level 3 (heavier local accent) 15%
Educational background | Senior middle school and higher 90%; junior middle school and lower 10%
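For a site with 200 speakers, the percentages above translate directly into head counts, and the permitted deviation can be checked per category. The sketch below illustrates this under one assumption that the specification does not settle: the 5% tolerance is read here as 5 percentage points on each category share. The names `TARGETS`, `target_counts`, and `within_tolerance` are hypothetical.

```python
from typing import Dict

SPEAKERS_PER_SITE = 200

# Target shares taken from the distribution table.
TARGETS: Dict[str, Dict[str, float]] = {
    "age": {"young": 0.50, "middle-aged": 0.40, "old": 0.10},
    "accent": {"level 2": 0.80, "level 1": 0.05, "level 3": 0.15},
    "education": {"senior middle school and higher": 0.90,
                  "junior middle school and lower": 0.10},
}


def target_counts(total: int, shares: Dict[str, float]) -> Dict[str, int]:
    """Convert target shares into speaker counts for a site of the given size."""
    return {name: round(total * share) for name, share in shares.items()}


def within_tolerance(observed: Dict[str, int], shares: Dict[str, float],
                     tolerance: float = 0.05) -> bool:
    """True if every observed share is within `tolerance` of its target.

    Assumption: the tolerance is an absolute difference between shares.
    """
    total = sum(observed.values())
    if total == 0:
        return False
    return all(abs(observed.get(name, 0) / total - share) <= tolerance
               for name, share in shares.items())


age_counts = target_counts(SPEAKERS_PER_SITE, TARGETS["age"])
print(age_counts)
print(within_tolerance({"young": 104, "middle-aged": 78, "old": 18}, TARGETS["age"]))
```

If the 5% is instead meant relative to each target value, the comparison in `within_tolerance` would have to divide the difference by the target share.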

 

录音和存储规范

Recording and Storage Specification

【录音设备】

 

我们配置两套录音设备,每套包括:笔记本电脑一台,USB 声卡 (M-Audio mobilepre),头戴式话筒,CR722电容话筒。

声音文件采用双通道录制。两通道信号分别采用头戴式话筒和797厂生产的CR722电容传声器(20-20000Hz)录制。

录音时,记录录音的声学空间面积和背景噪音大小。

 

Recording equipment

We set up two sets of recording equipment, each consisting of a notebook computer, a USB sound card (M-Audio mobilepre), a headset microphone, and a CR722 condenser microphone. Sound files were recorded in two channels, one through a Sennheiser headset microphone made in Germany and the other through a CR722 condenser microphone (20-20000 Hz) produced by 797 Factory.

The area of the acoustic space and the level of background noise were noted at recording time.

 

【录音软件】

4-5分钟的口语语料用Cooledit pro2录制。

语句用我们编制的录音软件录制,同时录制近距话筒(离嘴角2-8cm)和中距离话筒(20-50cm)两个通道语音信号。

 

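Recording tools

The 4-5 minute spontaneous speech was recorded with Cooledit pro2.

Sentences were recorded with CASSRecorder, a recording program we developed, which captured two channels simultaneously: a close-talking microphone (2-8 cm from the corner of the mouth) and a mid-distance microphone (20-50 cm).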

 

【数据存储】

 

    以16KHz采样16bit精度,Wave格式存储。每个文件至少存贮在不同的两种存储介质上。

Data storage

The data were stored in Wave format at a sampling rate of 16 kHz with 16-bit resolution. Every file was stored on at least two different storage media.
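A sound file can be compared against these storage parameters with Python's standard `wave` module. The sketch below is an illustration rather than a tool shipped with the corpus; the function name `check_wav` and the layout of the returned dictionary are hypothetical. The channel count is reported but not judged, since this section does not say whether the distributed files keep both recording channels.

```python
import wave
from typing import Dict, Union


def check_wav(path: str, expected_rate: int = 16000,
              expected_width_bytes: int = 2) -> Dict[str, Union[bool, int]]:
    """Compare a Wave file's header with the 16 kHz / 16-bit storage format."""
    with wave.open(path, "rb") as w:
        return {
            "sample_rate_ok": w.getframerate() == expected_rate,
            # 16-bit resolution corresponds to 2 bytes per sample.
            "sample_width_ok": w.getsampwidth() == expected_width_bytes,
            "channels": w.getnchannels(),
        }
```

For example, `check_wav("CSF001.wav")` would be expected to report both checks as true for a conforming file.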


语音语料库标注规范

Annotation Specification for the Speech Corpus

一、标注说明

1. Annotation Notes

1.1 标注软件以及标注文件格式

标注软件使用Praat语音分析软件(http://www.fon.hum.uva.nl/praat/)。

标注文件名对应声音文件号 + “.TextGrid”后缀,如A0001.TextGrid是A0001.wav对应的标注文件,可以用praat调用。标注文件和声音文件存于同一个目录下面。

 

1.1 Annotation software and file format

Annotation was done with PRAAT, a free speech analysis and annotation program that can be downloaded from http://www.fon.hum.uva.nl/praat/.

Each annotation file has the same base name as its sound file, with the extension “.TextGrid”. For example, A0001.TextGrid is the annotation file for the sound file A0001.wav and can be opened in PRAAT. The annotation files and their corresponding sound files are stored in the same folder.
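Since each annotation file shares its base name and folder with the sound file, the pairing can be derived from the path alone. The following sketch shows one way to do this and to list sound files that lack an annotation; the function names `textgrid_for` and `missing_annotations` are hypothetical and not part of the corpus tools.

```python
from pathlib import Path
from typing import List


def textgrid_for(wav_path: str) -> Path:
    """Return the expected annotation path for a sound file."""
    return Path(wav_path).with_suffix(".TextGrid")


def missing_annotations(folder: str) -> List[str]:
    """List the .wav files in a folder that have no .TextGrid next to them."""
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.suffix.lower() == ".wav" and not textgrid_for(str(p)).exists()
    )


print(textgrid_for("A0001.wav"))
```

Running `missing_annotations` on a speaker folder gives a quick completeness check before the files are loaded into PRAAT or processed further.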

 

 

1.2 语音标注内容

    尽管录音过程中我们尽量控制发音人的发音错误,但是也很难避免个别的实际发音与所给发音语料的细小的差别。因此,我们对每个发音人的文本按照实际的发音进行了修正,准确文本见每句对应的praat标注文件。

 

标注包括以下内容:

 

1           对所有发音人的口语独白进行了语音到文字的转写,包括口语中出现的副语言学和非语言学信息的转写(见音字转写规范说明)。

2           所有朗读、常用方言词汇和回答问题的汉字的转写。如果是出现数字,那么用汉字标注,如“五十二”。如果是英文网址用英语表示,chinarencom 字母单读时,字母之间用空格隔开。

3           对所有朗读、常用方言词汇和回答问题进行了正则拼音的转写,并且标注分词信息。如果是英文网址用英语表示,chinarencom 字母单读时,字母之间用空格隔开。

 

1.2 Phonetic annotation

Although we tried to keep speakers' pronunciation errors under control during recording, minor differences between the actual pronunciation and the prompt text were hard to avoid. We therefore corrected each speaker's text according to what was actually said; the accurate text is in the PRAAT annotation file of each utterance.

The annotation includes the following:

 

1. Chinese character transcription of all speakers’ spontaneous monologues, including paralinguistic and non-linguistic information (see the transcription specification).

2. Chinese character transcription of all read speech, daily local dialectal words, and answers to the questions. Numerals are written in Chinese characters, for example “五十二”. English words, such as those in website addresses, are written in English. English letters, when read one by one, are separated by spaces.

3. Canonical Pinyin transcription of all read speech, daily local dialectal words, and answers to the questions, together with word segmentation information. English words, such as those in website addresses, are written in English. English letters, when read one by one, are separated by spaces.
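Two of the transcription conventions above, numerals written in Chinese characters and individually read letters separated by spaces, are simple enough to illustrate in code. The sketch below is a hypothetical helper, not the tool used by the annotators. It only handles integers from 0 to 99 read as a quantity and does not cover digit-by-digit readings, such as telephone numbers may require.

```python
DIGITS = "零一二三四五六七八九"


def number_to_hanzi(n: int) -> str:
    """Write an integer from 0 to 99 as a Chinese numeral, e.g. 52 as 五十二."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0-99")
    if n < 10:
        return DIGITS[n]
    tens, ones = divmod(n, 10)
    # 10-19 are read without a leading 一 (十, 十一, ...).
    head = "十" if tens == 1 else DIGITS[tens] + "十"
    return head + (DIGITS[ones] if ones else "")


def spell_letters(letters: str) -> str:
    """Separate letters that are read one by one with single spaces."""
    return " ".join(letters)


print(number_to_hanzi(52))
print(spell_letters("com"))
```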