
Ancient Chinese Corpus of the Zuozhuan (《左传》) Released on LDC on October 19

6676 views | 2017-10-21 11:26 | Personal category: Computational Linguistics | System category: Blog News

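After years of accumulated research on Ancient Chinese, the Chinese information processing research team at Nanjing Normal University, led by Professor Xiaohe Chen (陈小荷), has released an Ancient Chinese corpus: a word-segmented and POS-tagged corpus of the Zuozhuan (《左传》). The corpus contains about 180,000 characters, uses a self-designed set of 17 POS tags, and went through four rounds of proofreading. As early as 2010 we published a paper reporting lexical analysis experiments on it (Min Shi, Bin Li, Xiaohe Chen. 基于CRF的先秦汉语分词标注一体化研究 [Joint word segmentation and POS tagging of Pre-Qin Chinese based on CRF]. 中文信息学报 (Journal of Chinese Information Processing), 2010, No. 2), and many colleagues have since emailed us to request the corpus. We have therefore officially released it through the Linguistic Data Consortium (LDC) for use by the research community. The price is $50, which is very low; we make no profit from it, and the fee only covers the platform's operating costs.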

Corpus URL: https://catalog.ldc.upenn.edu/LDC2017T14

Ancient Chinese Corpus (ACC) V1.0

Author(s): Xiaohe Chen, Bin Li, Minxuan Feng, Chao Xu, Runhua Xu, Min Shi, Lili Yu, Lei Xiao, Qingqing Wang

Introduction

The Ancient Chinese Corpus (ACC) V1.0 contains the word-segmented, POS-tagged text of the Zuozhuan (a classic work of ancient Chinese history). It has 180,000 Chinese characters and 195,000 segmented units (including words and punctuation marks). It is divided into two parts: training data (166,138 words) and test data (28,131 words). The POS tag set has 17 tags. Details of the tag set are shown in Table 1.
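As a quick sanity check on the figures above, the two split sizes can be added up and compared with the rounded total of 195,000 units. This is only a sketch: it assumes the training and test counts are measured in the same units as that total (words plus punctuation marks), and the variable names are illustrative.

```python
# Split sizes as quoted in the corpus description.
train_units = 166_138
test_units = 28_131

# Assumption: both counts use the same unit as the "195,000" total.
total_units = train_units + test_units
train_share = train_units / total_units

print(total_units)             # 194269, i.e. roughly 195,000
print(f"{train_share:.1%}")    # share of the data in the training part
```

Under that assumption, the split is roughly 85.5% training data and 14.5% test data.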

The Ancient Chinese Corpus project began at Nanjing Normal University in 2009. The project's goal is to provide a large, part-of-speech tagged corpus of Ancient Chinese. This first delivery, ACC 1.0, contains only one book, the Zuozhuan. We will continue to release more data.

Data

There are two text files in this release, containing 268 paragraphs and 10,560 lines. Each line is one sentence or one utterance by a speaker. Paragraphs are separated by one empty line. Each word is tagged with its part of speech, and words are separated by a space.

Example: 夏/n 四月/t ，/w 費伯/nr 帥/v 師/n 城/v 郎/ns 。/w
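The `word/tag` format in the example line is simple to read programmatically. The following is a minimal sketch, not part of the release: the function name `parse_tagged_line` is hypothetical, and it assumes every token is written as `word/tag` with tokens separated by whitespace.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split one corpus line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the last "/" so the tag is whatever follows it.
        word, sep, tag = token.rpartition("/")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# The example line from the corpus description.
example = "夏/n 四月/t ，/w 費伯/nr 帥/v 師/n 城/v 郎/ns 。/w"
pairs = parse_tagged_line(example)
words = [word for word, _ in pairs]
tags = [tag for _, tag in pairs]

print(pairs[:3])
print("".join(words))  # the unsegmented sentence
```

Joining the words back together recovers the raw sentence, which is handy when the same file is used both as segmentation and as tagging data.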

We designed the POS tag set, which has 17 tags, shown in Table 1. Users may refer to the following paper or Chinese-language book for further information.

The data is provided in UTF-8 encoding. All files were automatically verified and manually checked.

• Xiaohe Chen, Minxuan Feng, Runhua Xu, et al. Information Processing of Pre-Qin Chinese. World Publishing Corporation, Beijing, 2013. (陈小荷, 冯敏萱, 徐润华, 等. 先秦文献信息处理. 世界图书出版公司, 2013)

• Bin Li, Minxuan Feng, Xiaohe Chen. Corpus Based Lexical Statistics of Pre-Qin Chinese. Lecture Notes in Computer Science, Volume 7717, 2013, pp. 145-153.
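The file layout described in this section (UTF-8 text, one sentence per line, paragraphs separated by an empty line) could be loaded roughly as follows. This is a sketch under assumptions: the function names and the path argument are hypothetical, since the actual file names in the release are not given here, and the small sample string simply repeats the example line to illustrate the layout.

```python
from collections import Counter
from typing import List


def read_paragraphs(text: str) -> List[List[str]]:
    """Group non-empty lines into paragraphs split on empty lines."""
    paragraphs, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line:
            current.append(line)
        elif current:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs


def load_corpus(path: str) -> List[List[str]]:
    """Read a corpus file; the path is supplied by the caller."""
    with open(path, encoding="utf-8") as handle:
        return read_paragraphs(handle.read())


def tag_counts(paragraphs: List[List[str]]) -> Counter:
    """Count how often each POS tag occurs."""
    counts = Counter()
    for paragraph in paragraphs:
        for line in paragraph:
            for token in line.split():
                counts[token.rpartition("/")[2]] += 1
    return counts


# Illustrative layout only: the example line, then part of it again
# as a second paragraph after an empty line.
sample = "夏/n 四月/t ，/w 費伯/nr 帥/v 師/n 城/v 郎/ns 。/w\n\n夏/n 四月/t ，/w\n"
paragraphs = read_paragraphs(sample)
counts = tag_counts(paragraphs)
print(len(paragraphs), counts.most_common(3))
```

On the real files, `len(counts)` gives the number of distinct tags that actually occur, which can be compared against the 17 tags of the tag set.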

Samples

Please view the following sample file:

example.txt

Acknowledgement

This work was supported in part by the Ministry of Education of China (16YJC740034) and the National Social Science Foundation of China (15ZDB127).

Updates

We will continue to release more annotated data of Ancient Chinese.

Copyright

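To repost this article, please contact the original author for permission, and please state that it comes from Bin Li's ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-39714-1081830.html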


随园厚学分享 http://blog.sciencenet.cn/u/gothere | A PhD in computational linguistics, hoping to leave an academic footprint here
