Public Dataset for "Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior"
Publication Date
2-21-2020
Abstract
Dataset for the "Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior" paper, published in ICWSM 2018. The full text of the paper can be found here.
The dataset provided here includes an updated version of the original dataset, with ~100k tweets annotated using the CrowdFlower platform:
- hatespeech_id_label_PUBLIC_100K.csv: contains ~100K rows, where every row consists of a unique Tweet ID.
- hatespeech_text_label_vote_RESTRICTED_100K.csv: contains ~100K rows, where every row consists of the tweet text, its label according to the majority annotation, and the number of annotators who voted for that majority label. Available only here.
- retweets.csv: contains ~2K rows. Each row starts with the row number in hatespeech_text_label_vote_RESTRICTED_100K.csv of the first occurrence of a tweet text, followed by the comma-separated row numbers of all other occurrences of the same tweet text in that file. There are ~8K such other occurrences, which are due to retweets. Available only here.
UPDATE: We have learned that a number of the tweets are no longer available for download from Twitter. Therefore, we provide here the hatespeech_text_label_vote_RESTRICTED_100K file with the full ~100K tweet texts, their associated majority label, and the number of votes for the majority label. The tweets are shuffled so that there is no connection between tweet IDs and texts, in line with Twitter's T&C.
Please cite the paper in any published work that uses any of these resources.
@inproceedings{founta2018large,
title={Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior},
author={Founta, Antigoni-Maria and Djouvas, Constantinos and Chatzakou, Despoina and Leontiadis, Ilias and Blackburn, Jeremy and Stringhini, Gianluca and Vakali, Athena and Sirivianos, Michael and Kourtellis, Nicolas},
booktitle={11th International Conference on Web and Social Media, ICWSM 2018},
year={2018},
organization={AAAI Press}
}
For any further questions, contact a.m.founta at gmail dot com and markos.charalambous at eecei dot cut dot ac dot cy.
Repository
Zenodo
Distribution License
This work is licensed under a Creative Commons Attribution 4.0 International License.
Access Instructions and Link
This data is available under the CC-BY 4.0 License.
Funder
Funder: European Commission
Funder DOI: 10.13039/501100000780
EnhaNcing seCurity And privacy in the Social wEb: a user centered approach for the protection of minors
691025
Recommended Citation
Founta, Antigoni-Maria; Djouvas, Constantinos; Chatzakou, Despoina; Leontiadis, Ilias; Blackburn, Jeremy; Stringhini, Gianluca; Vakali, Athena; Sirivianos, Michael; and Kourtellis, Nicolas, "Public Dataset for 'Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior'" (2020). UAB Research Data Catalog. 41.
https://digitalcommons.library.uab.edu/datasets/41