Wikipedia sentences: 7.8 million sentences collected from a dump of the English Wikipedia

891.28M


NLP,Text Mining Classification


README.md

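The Wikipedia dump is a huge XML file that contains a large amount of content that is not very useful. I needed English text for unsupervised learning, so I spent a considerable amount of time extracting and cleaning the text.

Content

Each line of the txt files is a "sentence". I put "sentence" in quotation marks because the contents of these files have not been fully checked for errors. Here is what I did:

Extracted the lead text from non-disambiguation and table-of-contents pages.
Removed sentences flagged as needing a citation, because such sentences are often grammatically poor.
Split each text block into sentences with SpaCy, then checked that parentheses and quotation marks were correct and filtered out sentences where they did not match exactly.
Removed sentences shorter than 3 characters or longer than 255 characters. This covers 97% of the data.
Removed duplicate sentences, which as a result left the sentences sorted alphabetically.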
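
The filtering steps above (bracket and quote check, length limits, deduplication and sorting) can be sketched roughly as follows. This is a minimal illustration, not the author's actual code: the function names `brackets_balanced`, `quotes_balanced`, and `clean_sentences` are hypothetical, the exact rules the author used for "correct" parentheses and quotes are not given in the README, and the sentence splitting done with SpaCy is assumed to have happened already.

```python
from typing import Iterable, List

# Hypothetical helper names; the README does not publish the original code.
PAIRS = {")": "(", "]": "[", "}": "{"}
OPENERS = set(PAIRS.values())


def brackets_balanced(sentence: str) -> bool:
    """Return True if (), [] and {} are properly nested and all closed."""
    stack = []
    for ch in sentence:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack


def quotes_balanced(sentence: str) -> bool:
    """Assumption: 'correct' quotes means an even number of straight double quotes."""
    return sentence.count('"') % 2 == 0


def clean_sentences(
    sentences: Iterable[str], min_len: int = 3, max_len: int = 255
) -> List[str]:
    """Apply the length filter and balance checks, then deduplicate and sort."""
    kept = set()  # a set drops duplicate sentences
    for s in sentences:
        s = s.strip()
        if not (min_len <= len(s) <= max_len):
            continue  # shorter than 3 or longer than 255 characters
        if not (brackets_balanced(s) and quotes_balanced(s)):
            continue  # unmatched parentheses or quotes
        kept.add(s)
    # sorted() orders by Unicode code point, which only approximates
    # "alphabetical" order for mixed-case or non-ASCII text.
    return sorted(kept)
```

Calling `clean_sentences` on the sentences produced by a sentence splitter returns a sorted list without duplicates. The length limits default to the 3 and 255 characters stated above and can be changed through `min_len` and `max_len`.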


Data usage instructions:

I. Data Source and Display Explanation:

1. The data originates from internet data collection or provided by service providers, and this platform offers users the ability to view and browse datasets.

2. This platform serves only as a basic information display for datasets, including but not limited to image, text, video, and audio file types.

3. Basic dataset information comes from the original data source or the information provided by the data provider. If there are discrepancies in the dataset description, please refer to the original data source or service provider's address.

II. Ownership Explanation:

1. All datasets on this site are copyrighted by their original publishers or data providers.

III. Data Reposting Explanation:

1. If you need to repost data from this site, please retain the original data source URL and related copyright notices.

IV. Infringement and Handling Explanation:

1. If any data on this site involves infringement, please contact us promptly, and we will arrange for the data to be taken offline.


©2020-2023 www.payititi.com All Rights Reserved Sitemap 京ICP备19000450号