Europarl: A Parallel Corpus for Statistical Machine Translation (21 European Languages)

1.46G

NLP Classification

README.md

For a detailed description of this corpus, please read:

Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005, pdf.

Please cite this paper if you use the corpus in your work. An extended (but earlier) version of the report is also available (ps, pdf).

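The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.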

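The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. To this end, we extracted matching items and labeled them with corresponding document IDs. We identified sentence boundaries with a preprocessor, then sentence-aligned the data using a tool based on the Church and Gale algorithm.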
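To make the alignment step concrete, here is a minimal sketch of length-based sentence alignment in the spirit of the Church and Gale approach. It is not the aligner shipped with the corpus: the function names (`bead_cost`, `align`), the restriction to 1-1, 1-0, 0-1, 2-1, and 1-2 beads, and the fixed expected length ratio of 1 are simplifying assumptions. The priors and variance are the values commonly quoted from the published algorithm and are used here only as illustrative constants.

```python
import math

# Illustrative constants (assumed, not read from the Europarl tooling).
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099, (2, 1): 0.089, (1, 2): 0.089}
VARIANCE = 6.8  # assumed variance of the length difference per character


def bead_cost(src_len, tgt_len, prior):
    """Negative log probability of pairing src_len characters with tgt_len characters."""
    mean = (src_len + tgt_len) / 2
    if mean == 0:
        return -math.log(prior)
    delta = (tgt_len - src_len) / math.sqrt(mean * VARIANCE)
    # Two-sided tail probability of a standard normal variable.
    tail = math.erfc(abs(delta) / math.sqrt(2))
    return -math.log(max(tail, 1e-300)) - math.log(prior)


def align(src, tgt):
    """Return a list of beads: (source sentence indices, target sentence indices)."""
    n, m = len(src), len(tgt)
    src_lens = [len(s) for s in src]
    tgt_lens = [len(t) for t in tgt]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    # Forward dynamic programme over prefixes of both sentence lists.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                step = bead_cost(sum(src_lens[i:ni]), sum(tgt_lens[j:nj]), prior)
                if cost[i][j] + step < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + step
                    back[ni][nj] = (i, j)
    # Trace the cheapest path back from the end of both documents.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(english_sentences, german_sentences)` on two sentence lists from the same document returns index groups that can be joined into aligned lines. The sketch uses only character lengths, so it needs no dictionary, but it also cannot recover from large insertions or deletions.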

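Release v7

On 15 May 2012, we released a further expanded and improved version of the corpus. Earlier versions are also available. The corpus is distributed as a source release containing the document files and a sentence aligner, and as parallel corpora for language pairs that include English.

Changes since v6

Added data from January 2011 to November 2011, bringing the corpus to up to around 60 million words per language.
Further improved preprocessing and cleaning.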

All formats contain document (<CHAPTER>), speaker (<SPEAKER>), and paragraph (<P>) markup, each on a separate line. The data is stored in one file per day, with smaller units for newer data.

Some documents carry a LANGUAGE attribute on the SPEAKER tag, which indicates the language originally used by the speaker.
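As an illustration of reading this markup, the sketch below pulls the attributes out of a speaker line. The exact attribute layout (`ID`, `LANGUAGE`, `NAME`, with quoted or bare values) is an assumption for the example, and `speaker_attributes` is a hypothetical helper name, so check it against the actual files before relying on it.

```python
import re

# Assumed tag shape: <SPEAKER ID=2 LANGUAGE="EN" NAME="Evans, Robert J">
SPEAKER_RE = re.compile(r"<SPEAKER\b([^>]*)>")
ATTR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def speaker_attributes(line):
    """Return the attributes of a SPEAKER line as a dict, or None for other lines."""
    match = SPEAKER_RE.match(line.strip())
    if match is None:
        return None
    attrs = {}
    for name, quoted, bare in ATTR_RE.findall(match.group(1)):
        # Exactly one of the two value groups is non-empty for a given attribute.
        attrs[name.upper()] = quoted if quoted else bare
    return attrs


def original_language(line):
    """Return the LANGUAGE attribute of a SPEAKER line, or None if it is absent."""
    attrs = speaker_attributes(line)
    return attrs.get("LANGUAGE") if attrs else None
```

Because only some documents carry the attribute, `original_language` returns `None` rather than guessing a default.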

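To use the parallel corpus with tools such as GIZA++, you need to:

tokenize the text (required)
lowercase the text (recommended)
remove empty lines and their corresponding lines (required)
remove lines that start with an XML tag, i.e. lines starting with "<" (required)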
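The four steps above can be sketched as a single pass over two line-aligned files. This is only an illustration: the regular-expression tokenizer is a naive stand-in for a real tokenizer, and the names `tokenize` and `prepare_pair` are hypothetical.

```python
import re

# Naive tokenizer (an assumption): runs of word characters or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Separate words and punctuation with single spaces."""
    return " ".join(TOKEN_RE.findall(text))


def prepare_pair(src_lines, tgt_lines, lowercase=True):
    """Filter and normalise two line-aligned lists, keeping them aligned."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    out_src, out_tgt = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Drop a pair if either side is empty, so the two files stay in step.
        if not src or not tgt:
            continue
        # Drop markup lines such as chapter, speaker, and paragraph tags.
        if src.startswith("<") or tgt.startswith("<"):
            continue
        src, tgt = tokenize(src), tokenize(tgt)
        if lowercase:
            src, tgt = src.lower(), tgt.lower()
        out_src.append(src)
        out_tgt.append(tgt)
    return out_src, out_tgt
```

Filtering both sides together is the important design point: removing an empty line from only one file would shift every later sentence pair out of alignment.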

Corpus size

Size of the monolingual data after removing XML.

Language	Sentences	Words
Bulgarian	411,636	-
Czech	668,595	13,195,311
Danish	2,323,099	47,761,381
German	2,176,537	47,236,849
Greek	1,517,141	-
English	2,218,201	53,974,751
Spanish	2,123,835	54,806,927
Estonian	692,210	11,358,009
Finnish	2,119,515	33,708,706
French	2,190,579	54,202,850
Hungarian	658,824	12,606,986
Italian	2,081,669	50,259,169
Lithuanian	678,665	11,512,131
Latvian	666,026	12,085,228
Dutch	2,333,816	53,487,257
Polish	387,490	7,087,016
Portuguese	2,121,889	52,300,149
Romanian	402,904	9,663,544
Slovak	674,359	13,116,301
Slovene	634,488	12,665,974
Swedish	2,241,386	45,665,947
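One quick way to read the table is to derive the average sentence length per language. The sketch below does this for a few rows copied from the table; the dictionary name `MONOLINGUAL` and the helper `words_per_sentence` are made up for the example, and a missing word count ("-") is represented as `None`.

```python
# (sentences, words) for a few rows of the table above; None marks a missing count.
MONOLINGUAL = {
    "English": (2_218_201, 53_974_751),
    "German": (2_176_537, 47_236_849),
    "Finnish": (2_119_515, 33_708_706),
    "Bulgarian": (411_636, None),
}


def words_per_sentence(sentences, words):
    """Average number of words per sentence, or None when it cannot be computed."""
    if words is None or sentences == 0:
        return None
    return words / sentences


for language, (sentences, words) in MONOLINGUAL.items():
    average = words_per_sentence(sentences, words)
    shown = "n/a" if average is None else f"{average:.1f}"
    print(f"{language}: {shown} words per sentence")
```

The differences between languages reflect word segmentation as much as content: a language that builds long compound or inflected words, such as Finnish, needs fewer words for the same sentences.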

Size of the parallel corpora after sentence alignment and XML removal.

Parallel corpus (L1 - L2)	Sentences	L1 words	English words
Bulgarian - English	406,934	-	9,886,291
Czech - English	646,605	12,999,455	15,625,264
Danish - English	1,968,800	44,654,417	48,574,988
German - English	1,920,209	44,548,491	47,818,827
Greek - English	1,235,976	-	31,929,703
Spanish - English	1,965,734	51,575,748