Open Dataset

Multimodal hate speech: 150,000 tweets with text and images, for hate speech detection

6.55G

NLP,Online Communities,Image Data,Multiclass Classification,Social Networks Classification

Existing hate speech datasets contain only textual data. We created a new manually annotated multimodal hate speech dataset. It consists of 150,000 tweets, each of which ......

README.md

Existing hate speech datasets contain only textual data. We created a new manually annotated multimodal hate speech dataset. It consists of 150,000 tweets, each of which contains text and an image. We call this dataset MMHS150K.

Tweet collection

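We used the Twitter API to collect real-time tweets from September 2018 to February 2019, selecting those that contain any of 51 hate terms that are most common in hate speech tweets. We filtered out retweets, tweets with fewer than three words, and tweets containing pornography-related terms. From that selection, we kept the tweets that included images and downloaded them. Twitter applies hate speech filters and other kinds of content control based on its policy, but that moderation is driven by user reports. Therefore, because we collect tweets from real-time posting, the content we obtain has not yet passed through any filter.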
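The selection rules described above can be sketched as a single predicate. This is a minimal illustration, not the authors' collection code: the `Tweet` record, the `keep_tweet` function, the tokenizer, and the example term sets are all hypothetical, and the actual 51 hate terms and the pornography-related term list are not given in this README.

```python
import re
from dataclasses import dataclass


@dataclass
class Tweet:
    """Hypothetical minimal tweet record; not the Twitter API schema."""
    text: str
    is_retweet: bool = False
    image_urls: tuple = ()


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keep_tweet(tweet, hate_terms, blocked_terms):
    """Return True if the tweet passes every filter described in the README."""
    if tweet.is_retweet:
        return False  # retweets are dropped
    words = tokenize(tweet.text)
    if len(words) < 3:
        return False  # tweets with fewer than three words are dropped
    word_set = set(words)
    if not word_set & set(hate_terms):
        return False  # must contain at least one of the selected hate terms
    if word_set & set(blocked_terms):
        return False  # tweets with pornography-related terms are dropped
    return len(tweet.image_urls) > 0  # only tweets with an image are kept
```

Matching here is on whole tokens only; multi-word terms or substring matching would need a different check, and the README does not say which one was used.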

Annotation

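We annotated the collected tweets using the crowdsourcing platform Amazon Mechanical Turk. There, we gave the workers the definition of hate speech and showed some examples to make the task clearer. We then showed them the tweet text and image and asked them to classify it into one of six categories: no attack on any community, racist, sexist, homophobic, religion-based attack, or attack on another community. Each of the 150,000 tweets was labeled by three different workers to mitigate discrepancies between workers. The original annotations obtained from AMT can be downloaded together with the dataset.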
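Because each tweet carries three worker labels, a consumer of the raw annotations has to decide how to combine them. The README does not prescribe an aggregation rule, so the sketch below is one possible choice (majority voting), with hypothetical category identifiers standing in for the six categories listed above.

```python
from collections import Counter

# Hypothetical identifiers for the six categories named in the README.
CATEGORIES = ("no_attack", "racist", "sexist", "homophobic", "religion", "other")


def majority_label(votes):
    """Return the category chosen by at least two of three workers, else None."""
    if len(votes) != 3:
        raise ValueError("expected exactly three worker labels")
    for vote in votes:
        if vote not in CATEGORIES:
            raise ValueError(f"unknown category: {vote}")
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


def is_hate(votes):
    """Binary view: True if at least two of the three workers flagged any attack."""
    hate_votes = sum(vote != "no_attack" for vote in votes)
    return hate_votes >= 2
```

The two views can disagree: three workers who pick three different attack categories give no majority category, yet the tweet still counts as hate under the binary rule.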

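We received a lot of valuable feedback from the annotators. Most of them understood the task correctly, but they were uneasy about its subjectivity. This is indeed a subjective task that depends heavily on the annotator's convictions and sensitivity. However, we expect annotations to be cleaner the stronger the attack is, and those strong attacks are the posts we are most interested in detecting. Below are the percentage of tweets labeled in each category and the percentage of hate and non-hate tweets for the most frequent keywords.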
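The two statistics mentioned above (share of tweets per category, and share of hate tweets per keyword) can be recomputed from the annotations along these lines. This is a sketch with hypothetical function names and input shapes; it assumes each tweet has already been reduced to a single label and a hate or non-hate flag, which the README does not specify.

```python
from collections import Counter


def category_percentages(labels):
    """Map each label to the percentage of tweets that carry it."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


def hate_share_by_keyword(records):
    """records: iterable of (keyword, is_hate) pairs, one per tweet.

    Returns the percentage of tweets flagged as hate for each keyword.
    """
    totals = Counter()
    hateful = Counter()
    for keyword, flagged in records:
        totals[keyword] += 1
        if flagged:
            hateful[keyword] += 1
    return {kw: 100.0 * hateful[kw] / totals[kw] for kw in totals}
```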

Data usage instructions:

I. Data Source and Display Explanation:

1. The data originates from internet data collection or is provided by service providers, and this platform offers users the ability to view and browse datasets.

2. This platform serves only as a basic information display for datasets, including but not limited to image, text, video, and audio file types.

3. Basic dataset information comes from the original data source or the information provided by the data provider. If there are discrepancies in the dataset description, please refer to the original data source or service provider's address.

II. Ownership Explanation:

1. All datasets on this site are copyrighted by their original publishers or data providers.

III. Data Reposting Explanation:

1. If you need to repost data from this site, please retain the original data source URL and related copyright notices.

IV. Infringement and Handling Explanation:

1. If any data on this site involves infringement, please contact us promptly, and we will arrange for the data to be taken offline.
