Resource name

Tsinghua Corpus of Speech Synthesis (TH-CoSS; 清华大学语音合成语料库)

Resource description

As speech synthesis technology matures and moves to market, the speech corpus on which it is built plays an increasingly important role. A well-designed corpus with high-quality recordings is the material basis for both speech synthesis and speech analysis, so building one has considerable research and practical value. The Tsinghua Corpus of Speech Synthesis (TH-CoSS) was built by the Institute of Human-Computer Interface and Multimedia at Tsinghua University. It is large in scale, serves multiple purposes, is of a high standard, and provides well-formed data that is convenient to read and use. TH-CoSS is designed to provide speech data for acoustic-phonetic studies and for the research, development, and evaluation of automatic speech synthesis systems, and it aims to meet the needs of research, development, and the market as fully as possible.

Institution

Institute of Human-Computer Interface and Multimedia, Tsinghua University (清华大学人机交互与媒体集成研究所)

Development period

January to December 2003

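Scale

The speech synthesis corpus consists of four parts:

1. Mandarin TTS system corpus: read Mandarin sentences, one male and one female speaker, about 10,000 sentences in total.

2. Mandarin TTS system test corpus: read Mandarin sentences, one male and one female speaker, about 2,000 sentences.

3. Data for Mandarin intonation analysis: natural conversational sentences, more than 1,000 sentences, covering a variety of intonations and moods.

4. Continuous-speech discourse database: material selected from radio or television broadcasts, mainly in standard Mandarin.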

 

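Data contents of TH-CoSS

    TH-CoSS 03MR00 and 03FR00 are both read-speech corpora. Each comprises the main declarative sentences, test sentences, special syllables, and Mandarin intonation-analysis material. Together they occupy five CDs. The detailed contents are listed in Table 1.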

 

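Category                       File                                                      Count                              CD distribution
                                                                                         03FR00           03MR00            03FR00     03MR00
TTS system corpus              Declarative sentences, 5-25 syllables long (Monologue)    5406 sentences   4535 sentences    CD1, CD2   CD3, CD4
Test sentences                 Declarative sentences, 5-25 syllables long                959              959               CD2        CD4
Special syllables              Neutral-tone syllable groups (Neutralized)                772              772               CD5        CD5
                               Retroflexed (erhua) syllable groups (Retroflexed)         340              415               CD5        CD5
                               Third-tone monosyllable table (F_r tone)                  311 syllables    314 syllables     CD5        CD5
Mandarin intonation analysis   Interrogative sentences (Question)                        485 sentences    485 sentences     CD5        CD5
                               Exclamatory sentences (Exclamation)                       598 sentences    466 sentences     CD5        CD5
Total                          Speech files (Utterance)                                  8871             7946
Documentation                  TH-CoSS readme.doc, Technical Report.doc,                                                    CD5
                               20031107-ch.dtd, transcripts of the speech data

                                        Table 1. Data contents of 03FR00 and 03MR00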
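
The totals row of Table 1 can be cross-checked against the per-category counts. The following is a minimal sketch that simply sums the figures transcribed from the table; the dictionary keys and the function name `column_totals` are illustrative assumptions and are not part of the TH-CoSS distribution.

```python
# Per-category utterance counts transcribed from Table 1.
# Each value is (03FR00 count, 03MR00 count). Key names are illustrative only.
counts = {
    "monologue": (5406, 4535),
    "test_sentences": (959, 959),
    "neutralized": (772, 772),
    "retroflexed": (340, 415),
    "f_r_tone": (311, 314),
    "question": (485, 485),
    "exclamation": (598, 466),
}

# Totals as stated in the "Total" row of Table 1.
stated_totals = (8871, 7946)


def column_totals(rows):
    """Sum the 03FR00 and 03MR00 columns over all categories."""
    female = sum(f for f, _ in rows.values())
    male = sum(m for _, m in rows.values())
    return female, male


if __name__ == "__main__":
    computed = column_totals(counts)
    print("computed:", computed, "stated:", stated_totals)
    print("consistent:", computed == stated_totals)
```

Summing the seven categories reproduces the stated totals for both corpora, which suggests that the "Total" row counts every utterance type in the table, including the special-syllable and intonation material.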