News headlines dataset for sarcasm detection: a high-quality dataset for sarcasm and fake news detection tasks

11.13M

NLP,Deep Learning,Classification,Earth and Nature,Computer Science,Programming Classification


README.md

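Past studies in sarcasm detection mostly use Twitter datasets collected with hashtag-based supervision, but such datasets are noisy in terms of both labels and language. Furthermore, many tweets are replies to other tweets, so detecting sarcasm in them requires the contextual tweets to be available.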

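To overcome the limitations caused by noise in Twitter datasets, this news headlines dataset for sarcasm detection was collected from two news websites. TheOnion aims to produce sarcastic versions of current events, and we collected all the headlines (which are sarcastic) from its News in Brief and News in Photos categories. From HuffPost, we collected real (non-sarcastic) news headlines.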

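This new dataset has the following advantages over existing Twitter datasets:

Because news headlines are written by professionals in a formal manner, they contain no spelling mistakes or informal usage. This reduces sparsity and also increases the chance of finding pre-trained embeddings.
Furthermore, because the sole purpose of TheOnion is to publish sarcastic news, the labels are high quality, with much less noise than in Twitter datasets.
Unlike tweets that are replies to other tweets, the news headlines we obtained are self-contained. This helps in teasing apart the genuinely sarcastic elements.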

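Content

Each record consists of three attributes:

is_sarcastic: 1 if the record is sarcastic, otherwise 0
headline: the headline of the news article
article_link: link to the original news article. Useful for collecting supplementary data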
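
To make the record layout concrete, the following is a minimal sketch of a single record and a check against the three documented attributes. The headline text and URL are made-up placeholders, not taken from the dataset, and `is_valid_record` is a hypothetical helper introduced only for illustration.

```python
import json

# Hypothetical record for illustration only; the headline and URL are made up.
sample_line = (
    '{"is_sarcastic": 1, '
    '"headline": "example headline text", '
    '"article_link": "https://www.example.com/article"}'
)

record = json.loads(sample_line)

# The three attributes documented for every record.
EXPECTED_FIELDS = {"is_sarcastic", "headline", "article_link"}


def is_valid_record(rec):
    """Return True if rec has exactly the three documented fields and a 0/1 label."""
    return set(rec) == EXPECTED_FIELDS and rec["is_sarcastic"] in (0, 1)


print(record["headline"], record["is_sarcastic"])
print(is_valid_record(record))
```

A check like this can be run over every parsed line before training to catch malformed records early.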

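General statistics of the data, instructions for reading the data in Python, and a basic exploratory analysis can be found in this GitHub repository. A hybrid NN architecture trained on this dataset can be found in this GitHub repository.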

Citation

If you use this dataset in your research, please cite the following articles:

Citation in text format:

1. Misra, Rishabh and Prahal Arora. "Sarcasm Detection using News Headlines Dataset." AI Open (2023).
2. Misra, Rishabh and Jigyasa Grover. "Sculpting Data for ML: The first act of Machine Learning." ISBN 9798585463570 (2021).

Citation in BibTeX format:

@article{misra2023Sarcasm,
  title = {Sarcasm Detection using News Headlines Dataset},
  journal = {AI Open},
  volume = {4},
  pages = {13-18},
  year = {2023},
  issn = {2666-6510},
  doi = {https://doi.org/10.1016/j.aiopen.2023.01.001},
  url = {https://www.sciencedirect.com/science/article/pii/S2666651023000013},
  author = {Rishabh Misra and Prahal Arora},
}
@book{misra2021sculpting,
  author = {Misra, Rishabh and Grover, Jigyasa},
  year = {2021},
  month = {01},
  pages = {},
  title = {Sculpting Data for ML: The first act of Machine Learning},
  isbn = {9798585463570}
}

Please link to rishabhmisra.github.io/publications as the source of this dataset. Thanks!

Inspiration

Can you identify sarcastic sentences? Can you distinguish between fake news and legitimate news?

Reading the data

The following code snippet can be used to read the data:

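import json

def parse_data(file):
    # The file holds one JSON object per line, so parse it line by line.
    # UTF-8 is assumed here; the with-block closes the file when iteration ends.
    with open(file, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line)

# Load every record into memory as a list of dicts.
data = list(parse_data('./Sarcasm_Headlines_Dataset.json'))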
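
Once the records are loaded, a quick sanity check is to count the labels and the source hosts. The sketch below is one possible way to do that with the standard library. `label_counts` and `source_domains` are hypothetical helper names, and the `toy` records are made-up stand-ins so the example runs without the dataset file; in practice the `data` list from the snippet above would be passed instead.

```python
from collections import Counter
from urllib.parse import urlparse


def label_counts(records):
    """Count how many records carry each is_sarcastic label."""
    return Counter(rec["is_sarcastic"] for rec in records)


def source_domains(records):
    """Count records per host name extracted from article_link."""
    return Counter(urlparse(rec["article_link"]).netloc for rec in records)


# Hypothetical stand-in records; the headlines and URLs are placeholders.
toy = [
    {"is_sarcastic": 1, "headline": "placeholder a",
     "article_link": "https://satire.example.com/a"},
    {"is_sarcastic": 1, "headline": "placeholder b",
     "article_link": "https://satire.example.com/b"},
    {"is_sarcastic": 0, "headline": "placeholder c",
     "article_link": "https://news.example.com/c"},
]

print(label_counts(toy))
print(source_domains(toy))
```

Because the labels are assigned by source website, comparing the two counts is a simple way to confirm that each host maps to a single label.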

Data usage instructions:

I. Data Source and Display Explanation:

1. The data originates from internet data collection or provided by service providers, and this platform offers users the ability to view and browse datasets.

2. This platform serves only as a basic information display for datasets, including but not limited to image, text, video, and audio file types.

3. Basic dataset information comes from the original data source or the information provided by the data provider. If there are discrepancies in the dataset description, please refer to the original data source or service provider's address.

II. Ownership Explanation:

1. All datasets on this site are copyrighted by their original publishers or data providers.

III. Data Reposting Explanation:

1. If you need to repost data from this site, please retain the original data source URL and related copyright notices.

IV. Infringement and Handling Explanation:

1. If any data on this site involves infringement, please contact us promptly, and we will arrange for the data to be taken offline.