Large Corpora used in CTS

4990 reads 2008-6-11 16:55 | Category: Toolbox

This page is sourced from http://corpus.leeds.ac.uk/list.html.

The website http://corpus.leeds.ac.uk/ was originally designed to host comparable English and Russian corpora, but in time we have accumulated a variety of large corpora supported by a uniform search interface: "Leeds CQP", which is a CGI Perl frontend to IMS Corpus Workbench. Tools developed to work with corpora are listed on a separate page.

Monolingual corpora

English

English Internet Corpus, a corpus of about 110 million words. This corpus was compiled automatically from the Internet in 2005 along with other Internet corpora (for Chinese, French, German, Italian, Spanish, Polish and Russian).
The British National Corpus (BNC), a classic collection of samples of modern British English, 100 million words.
The Reuters corpus, a collection of newswires from Reuters covering one year, from 1996-08-20 to 1997-08-19, 90 million words.
A corpus of British news, a collection of news stories from 2004 from each of the four major British newspapers: Guardian/Observer, Independent, Telegraph and Times, 200 million words.

Because the agreements for the BNC and Reuters require that users of these corpora be monitored, the interface requires a password: http://corpus.leeds.ac.uk/protected/query.html

Russian

The Russian National Corpus, a collection of texts comparable to the BNC in its design; its pilot version has 50 million words (a more detailed description of the project is available in Russian from http://ruscorpora.ru).
Russian Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
A corpus of Russian newspapers, 78 million words (Izvestia, Trud and Strana.ru).
The Russian Standard, a corpus of modern Russian fiction with manual disambiguation of morphological categories, 1.6 million words.

The interface to Russian corpora is available from http://corpus.leeds.ac.uk/ruscorpora.html

Chinese

Chinese Internet Corpus, a corpus of about 90 million words. This corpus was compiled automatically from the Internet in February-April 2005 along with other Internet corpora.
A fragment of the LDC Chinese Gigaword corpus, 35 million words, tokenised and lemmatised using the NEUCSP tool from NLP Lab, North-Eastern University, China; the selection includes newswires for one year (2001), which makes it comparable to the Reuters corpus.
Guo Jin's Chinese PH corpus, which is based on XINHUA news from 1990; segmentation done by Chris Brew and Julia Hockenmaier, 2.5 million words.
Lancaster Corpus of Mandarin Chinese, a corpus of about 1 million words, which is comparable in its design to Brown and LOB type corpora. Created by Tony McEnery and Richard Xiao, distributed by the European Language Resources Association (Cat. No ELRA-W0039) and the Oxford Text Archive (Cat. No 2474).

The interface to Chinese corpora is available from http://corpus.leeds.ac.uk/query-zh.html

Multilingual aligned corpora

English-Russian, Russian-English fiction; a small parallel corpus of English and Russian fiction from the 19th century (aligned by A. Kretov, Voronezh);
English-German corpus of European Parliament Proceedings; source texts were taken from Phil Köhn's page
German-English Parallel Corpus "de-news"; also taken from Phil Köhn's page
English-Japanese corpus of Yomiuri data (it is available in-house only)

Internet corpora

Few large general corpora of the size of the BNC (100 million words) are available. Within the Wacky (Web as Corpus) project we developed a set of procedures for collecting corpora from the Internet and compiled large representative corpora for Chinese, French, German, Italian, Spanish, Polish and Russian, with the search interface available from http://corpus.leeds.ac.uk/internet.html.

The query interface to all corpora is powered by the IMS Corpus Workbench, but it has been extended to simplify some frequent cases, in particular querying for lemmas and for exact word forms (all corpora have word, pos and lemma attributes, even if the latter is redundant for Chinese). Other possibilities include calculating the most significant collocations (using MI, T and log-likelihood scores) and searching for similar contexts in the English, German and Russian corpora.
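The three collocation scores mentioned above can be sketched as follows. This is a minimal illustration using the common textbook definitions (pointwise MI, the t-score, and the log-likelihood ratio over a 2x2 contingency table); the exact variants used by the interface are not stated here, and the function name, parameter names and the counts in the usage line are all hypothetical.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Return (MI, t-score, log-likelihood) for a word pair.

    f_xy: how often the two words co-occur; f_x, f_y: frequency of each
    word on its own; n: corpus size in tokens. Names are illustrative.
    """
    if f_xy <= 0:
        raise ValueError("the pair must co-occur at least once")
    expected = f_x * f_y / n          # co-occurrences expected by chance
    mi = math.log2(f_xy / expected)
    t_score = (f_xy - expected) / math.sqrt(f_xy)

    # Observed 2x2 contingency table: rows = x present/absent,
    # columns = y present/absent.
    observed = [
        [f_xy, f_x - f_xy],
        [f_y - f_xy, n - f_x - f_y + f_xy],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n
            if o > 0:                 # treat 0 * log(0) as 0
                g2 += o * math.log(o / e)
    return mi, t_score, 2 * g2

# Invented counts, for illustration only.
mi, t_score, log_likelihood = association_scores(30, 1000, 500, 1_000_000)
```

MI tends to favour rare pairs, while the t-score and log-likelihood reward pairs with more evidence behind them, which is one reason an interface might offer all three.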

The interface was developed by Serge Sharoff; contact me at s.sharoffleeds.ac.uk, if you have further queries.

Extra resources

For some corpora I have also computed frequency lists (all lists use UTF-8 encoding):

English (tokenisation, lemmatisation and POS tagging by TreeTagger)
Russian (tokenisation and tagging done by mystem)
- various frequency lists from the Russian National Corpus (discussed on a separate page)
- lemmas from the Internet corpus (detected by mystem)
- word forms from the Internet corpus
- POS frequencies from the Internet corpus (detected by mystem)
- lemmas from the newspaper corpus
- word forms from the newspaper corpus
- POS frequencies from the newspaper corpus
- frequencies of personal names from the newspaper corpus. Note that the lemmatisation of names produced by mystem is consistently wrong: either female forms are produced for men's names (e.g. Putina, Xodorkovskaja) or names are treated as verbs or adjectives, especially non-Slavonic ones (Saddam Husejnyj, Garry Pottiratq); take this into account when making queries.
- bigrams of lemmas from the Internet corpus
- bigrams of word forms from the Internet corpus
Chinese (tokenisation and POS tagging by NEUCSP)
Japanese frequency lists (tokenisation, lemmatisation and POS tagging by ChaSen)
French frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)
Portuguese frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)
Spanish frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)
German frequency lists (tokenisation, lemmatisation and POS tagging by TreeTagger)

There is also a frequency list of Georgian produced by Garold Shmaltsel and Givi Nozadze.

The structure of the lists follows the template of the lemmatised BNC lists produced by Adam Kilgarriff, namely:

[word rank] [normalised frequency] [lemma, word form or POS]

Note that the frequency has been normalised to ipm: the number of instances of an individual word or POS tag per million words in the respective corpus. Normalisation makes it possible to compare frequencies in the BNC against the Internet corpus. If you want to know the actual number of occurrences of a word listed there, multiply the frequency by the corpus size in millions of words (the size of a corpus is shown at the top of its frequency list). For instance, browser is used about 8556 times in the English Internet Corpus (47.17*181.376).
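The conversion described above can be sketched in a few lines. This assumes each line of a list holds three whitespace-separated fields in the order given by the template; the function names are illustrative, and the rank 1234 in the sample line is made up (only the ipm value 47.17 and the corpus size 181.376 come from the example above).

```python
def parse_frequency_line(line):
    """Split one '[rank] [ipm] [item]' line into (rank, ipm, item)."""
    rank, ipm, item = line.split(maxsplit=2)
    return int(rank), float(ipm), item

def ipm_to_count(ipm, corpus_size_millions):
    """Approximate raw count: ipm times the corpus size in millions of words."""
    return ipm * corpus_size_millions

# The rank is hypothetical; ipm and corpus size are taken from the text.
rank, ipm, item = parse_frequency_line("1234 47.17 browser")
print(item, round(ipm_to_count(ipm, 181.376)))  # prints: browser 8556
```

Because ipm values are already scaled to a common base, they can be compared directly across lists, while raw counts should only be compared within a single corpus.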

Finally, we have lists of distributionally similar words for English, German and Russian (words are said to be distributionally similar if they share a significant number of collocates in the corpus). The lists have been produced by Reinhard Rapp using Singular Value Decomposition (SVD).
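The idea of shared collocates can be illustrated with a small sketch. Note that this is not the method used to build the lists: it leaves out the SVD step entirely and simply compares raw collocate counts with cosine similarity. The collocate table is invented toy data and all names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse collocate-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collocate counts, invented for illustration only.
collocates = {
    "car": Counter({"drive": 8, "road": 5, "engine": 4}),
    "truck": Counter({"drive": 6, "road": 4, "load": 3}),
    "poem": Counter({"read": 7, "verse": 5, "write": 4}),
}

def most_similar(target, table):
    """Rank all other words by cosine similarity to the target, best first."""
    scores = [(cosine(table[target], vec), word)
              for word, vec in table.items() if word != target]
    return sorted(scores, reverse=True)

ranking = most_similar("car", collocates)
```

In this toy table "car" and "truck" share two collocates and so rank as similar, while "car" and "poem" share none. On real data, a dimensionality reduction such as SVD is typically applied first so that words can be similar even when their collocates differ but are themselves related.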

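To repost this article, please contact the original author for permission and state that it comes from Zhang Chengzhi's (章成志) ScienceNet blog.
Link: https://blog.sciencenet.cn/blog-36782-28665.html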
