RASC863: Putonghua Speech Corpus of Four Major Dialect Regions
RASC863 -- 863 annotated 4 regional accent speech corpus
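Funded by the National 863 High-Tech Project, RASC863 is a Putonghua (Mandarin) speech corpus with four representative regional accents, namely Shanghai, Guangzhou, Chongqing and Xiamen.
The 1996 863 speech recognition database consisted mainly of read speech and was designed for segmental balance. As speech recognition technology developed, building accented and colloquial speech databases became important. For this reason, when collecting Putonghua speech with Shanghai, Guangzhou, Chongqing and Xiamen accents, the RASC863 project emphasized colloquial speech and broadened the coverage of the material.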
Phonetics Lab, Institute of Linguistics, Chinese Academy of Social Sciences
January 2003 to June 2004
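Each dialect site has 200 speakers (100 male and 100 female), 800 in total. Speakers at each site were recruited according to a predesigned distribution of age, sex and educational background. (See "Speaker Specification" for details.)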
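RASC863 consists of two main parts: spontaneous speech (monologues and answers to common questions) and read speech (phonetically balanced sentences, common colloquial sentences and common dialect words). The spontaneous part is divided into topic-based monologues and question answering. For the monologue, each speaker freely chose one of 160 topics we had prepared and spoke on it for 3-5 minutes. For the question answering, each speaker answered 15 common questions. The read part comprises more than 2,200 selected phonetically balanced sentences, 460 common colloquial sentences, and a number of dialect words commonly used in each region. (See "Corpus Design Specification" for details.)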
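RASC863 recordings were captured on two channels: close-talk (2-8 cm from the corner of the mouth) and mid-distance (20-50 cm). Each channel is stored as a WAV file with a 16,000 Hz sampling rate, 16-bit samples and a single (mono) channel. (See "Recording and Storage Specification" for details.)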
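As an illustration of the storage format described above, the following minimal sketch checks whether a WAV file is 16 kHz, 16-bit, mono, using Python's standard `wave` module. The function and constant names (`matches_rasc863_format`, `make_silent_wav`, `EXPECTED_*`) are hypothetical and not part of the corpus distribution. The sketch also assumes the files are plain PCM WAV, which the description implies but does not state outright.

```python
import io
import wave

# Format stated in the recording description: 16 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1


def matches_rasc863_format(wav_file):
    """Return True if a WAV file (path or file object) is 16 kHz, 16-bit, mono."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def make_silent_wav(rate_hz, sample_width_bytes, channels, seconds=1):
    """Build an in-memory WAV of silence, used only to exercise the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * (rate_hz * sample_width_bytes * channels * seconds))
    buffer.seek(0)
    return buffer


# Uncompressed data rate implied by the format, per channel recording.
BYTES_PER_SECOND = (
    EXPECTED_RATE_HZ * EXPECTED_SAMPLE_WIDTH_BYTES * EXPECTED_CHANNELS
)
```

At this format each recording channel takes 32,000 bytes per second of audio, so a 5-minute monologue occupies roughly 9.6 MB per channel before any header overhead.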
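All RASC863 recordings carry phonetic annotation, and data from 20 speakers at each dialect site were selected for fine-grained annotation. (See "Corpus Annotation Specification" for details.)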
The corpus consists of spontaneous speech, read speech and selected dialectal words. For the spontaneous speech, each speaker was asked to choose a topic, either their own or one from our prepared sheet of 160 topics, and then to give a 4-5 minute spontaneous speech on it. In addition, each speaker was asked to answer 15 questions spontaneously. The read speech consists of 2200 automatically selected, phonetically balanced sentences and 460 sentences frequently used in daily life. For each dialectal region, we prepared words that are frequently used in daily life and differ from Standard Chinese, and each speaker was asked to read 15 dialectal words. A total of 800 speakers (200 from each region, balanced in terms of age, sex and educational background) were recruited for the project.
To date, Chinese character transcription, together with paralinguistic and non-linguistic labeling, has been completed for both the spontaneous and the read speech. In addition, phonetic annotation has been completed for the read speech of up to 80 speakers.