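章成志's blog: http://blog.sciencenet.cn/u/timy (motto: unmoved by favour or disgrace, idly watching the flowers bloom and fall in the courtyard; indifferent to staying or leaving, casually following the clouds as they gather and scatter across the sky)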


Large Corpora used in CTS

Read 4990 times | 2008-6-11 16:55 | Category: Toolbox

 

This page is taken from: http://corpus.leeds.ac.uk/list.html


Centre for Translation Studies
The website http://corpus.leeds.ac.uk/ was originally designed to host comparable English and Russian corpora, but over time we have accumulated a variety of large corpora supported by a uniform search interface, "Leeds CQP", which is a CGI Perl front end to the IMS Corpus Workbench. Tools developed to work with the corpora are listed on a separate page.

Monolingual corpora

English

  1. English Internet Corpus, a corpus of about 110 million words. It was compiled automatically from the Internet in 2005, along with other Internet corpora (for Chinese, French, German, Italian, Spanish, Polish and Russian).
  2. The British National Corpus (BNC), a classic collection of samples of modern British English, 100 million words.
  3. The Reuters corpus, a collection of newswires from Reuters covering one year, from 1996-08-20 to 1997-08-19, 90 million words.
  4. A corpus of British News, a collection of news stories from 2004 from each of the four major British newspapers (Guardian/Observer, Independent, Telegraph and Times), 200 million words.
Since the BNC and Reuters require an agreement to monitor the users of their corpora, the interface requires a password: http://corpus.leeds.ac.uk/protected/query.html

Russian

  1. The Russian National Corpus, a collection of texts comparable to the BNC in its design; its pilot version has 50 million words (a more detailed description of the project is available in Russian from http://ruscorpora.ru).
  2. Russian Internet Corpus, a corpus of about 90 million words. It was compiled automatically from the Internet in February-April 2005, along with other Internet corpora.
  3. A corpus of Russian newspapers, 78 million words (Izvestia, Trud and Strana.ru).
  4. The Russian Standard, a corpus of modern Russian fiction with manual disambiguation of morphological categories, 1.6 million words.
The interface to Russian corpora is available from http://corpus.leeds.ac.uk/ruscorpora.html

Chinese

  1. Chinese Internet Corpus, a corpus of about 90 million words. It was compiled automatically from the Internet in February-April 2005, along with other Internet corpora.
  2. A fragment of the LDC Chinese Gigaword corpus, 35 million words, tokenised and lemmatised using the NEUCSP tool from NLP Lab, North-Eastern University, China. The selection covers newswires for one year (2001), which makes it comparable to the Reuters corpus.
  3. Guo Jin's Chinese PH corpus, which is based on XINHUA news from 1990; segmentation done by Chris Brew and Julia Hockenmaier, 2.5 million words.
  4. Lancaster Corpus of Mandarin Chinese, a corpus of about 1 million words, which is comparable in its design to Brown and LOB type corpora. Created by Tony McEnery and Richard Xiao, distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).
The interface to Chinese corpora is available from http://corpus.leeds.ac.uk/query-zh.html

Multilingual aligned corpora

  1. English-Russian, Russian-English fiction: a small parallel corpus of English and Russian fiction from the 19th century (aligned by A. Kretov, Voronezh).
  2. English-German corpus of European Parliament Proceedings; source texts were taken from Phil Köhn's page.
  3. German-English Parallel Corpus "de-news"; also taken from Phil Köhn's page.
  4. English-Japanese corpus of Yomiuri data (available in-house only).

Internet corpora

Few large general corpora of the size of the BNC (100 million words) are available. Within the Wacky (Web as Corpus) project we developed a set of procedures for collecting corpora from the Internet and built large representative corpora for Chinese, French, German, Italian, Spanish, Polish and Russian, with the search interface available from http://corpus.leeds.ac.uk/internet.html.

The query interface to all corpora is powered by the IMS Corpus Workbench, but it has been extended to simplify some frequent cases, in particular querying for lemmas and for exact word forms (all corpora have word, pos and lemma attributes, even if the latter is redundant for Chinese). Other possibilities include calculating the most significant collocations (using MI, T and log-likelihood scores) and searching for similar contexts in the English, German and Russian corpora.
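The page does not say exactly how the interface computes its MI, T and log-likelihood scores, so the following is only a minimal sketch using common textbook definitions over a 2x2 contingency table. The function names, the argument names (`f_xy` for the co-occurrence count, `f_x` and `f_y` for the individual word counts, `n` for the corpus size) and the sample counts are all assumptions made for illustration, and the sketch ignores the collocation window size.

```python
import math


def expected_cooccurrences(f_x: int, f_y: int, n: int) -> float:
    """Co-occurrences expected if the two words were independent."""
    return f_x * f_y / n


def mi_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Mutual information: log2 of observed over expected co-occurrences."""
    if f_xy <= 0:
        raise ValueError("MI is undefined when the pair never co-occurs")
    return math.log2(f_xy / expected_cooccurrences(f_x, f_y, n))


def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """T-score: (observed - expected) / sqrt(observed)."""
    if f_xy <= 0:
        raise ValueError("T-score is undefined when the pair never co-occurs")
    return (f_xy - expected_cooccurrences(f_x, f_y, n)) / math.sqrt(f_xy)


def log_likelihood(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Log-likelihood ratio (G2) over the 2x2 contingency table."""
    observed = [
        f_xy,                   # x with y
        f_x - f_xy,             # x without y
        f_y - f_xy,             # y without x
        n - f_x - f_y + f_xy,   # neither x nor y
    ]
    expected = [
        f_x * f_y / n,
        f_x * (n - f_y) / n,
        (n - f_x) * f_y / n,
        (n - f_x) * (n - f_y) / n,
    ]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


# Made-up counts, for illustration only.
print(mi_score(8, 16, 16, 1024))
print(t_score(8, 16, 16, 1024))
print(log_likelihood(8, 16, 16, 1024))
```

MI tends to favour rare but exclusive pairings, while the T-score and log-likelihood give more weight to frequent ones, which is one reason an interface might offer all three.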

The interface was developed by Serge Sharoff; contact me at s.sharoff@leeds.ac.uk if you have further queries.

Extra resources

For some corpora I have also computed frequency lists (all lists use UTF-8 encoding):
  • English (tokenisation, lemmatisation and POS tagging by
  • Russian (tokenisation and tagging done by

There is also a frequency list of Georgian produced by Garold Shmaltsel and Givi Nozadze.

The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:

[word rank] [normalised frequency] [lemma, word form or POS]

Note that the frequency has been normalised to ipm (instances per million): the number of instances of an individual word or POS tag per million words in the respective corpus. Normalisation makes it possible to compare frequencies in the BNC against those in the Internet corpus. If you want to know the actual number of occurrences of a word listed there, multiply the frequency by the corpus size in millions of words (the size of a corpus is shown at the top of its frequency list). For instance, "browser" is used about 8556 times in the English Internet Corpus (47.17 × 181.376).
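The list template and the ipm conversion described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than a description of the actual files: it assumes the three fields on each line are separated by whitespace, the names `FrequencyEntry`, `parse_entry` and `estimated_count` are hypothetical, and the rank in the sample line is invented (only the ipm value 47.17 and the corpus size 181.376 come from the text).

```python
from typing import NamedTuple


class FrequencyEntry(NamedTuple):
    rank: int    # position of the item in the list
    ipm: float   # normalised frequency, instances per million words
    item: str    # lemma, word form or POS tag


def parse_entry(line: str) -> FrequencyEntry:
    """Parse one '[word rank] [normalised frequency] [item]' line."""
    rank, ipm, item = line.split(maxsplit=2)
    return FrequencyEntry(int(rank), float(ipm), item.strip())


def estimated_count(ipm: float, corpus_size_millions: float) -> float:
    """Convert an ipm value back to an approximate raw count."""
    return ipm * corpus_size_millions


# The rank 1234 is a placeholder; the page does not give the real rank.
entry = parse_entry("1234 47.17 browser")
print(entry.item, round(estimated_count(entry.ipm, 181.376)))
```

Because ipm values are already normalised, entries from two lists can be compared directly, and the conversion back to raw counts is only needed when absolute numbers matter.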

Finally, we have lists of distributionally similar words for English, German and Russian (words are said to be distributionally similar if they share a significant number of collocates in the corpus). The lists have been produced by Reinhard Rapp using Singular Value Decomposition (SVD).

TreeTagger)

https://blog.sciencenet.cn/blog-36782-28665.html
