YT-30M: A multi-lingual multi-category dataset of YouTube comments
Abstract
This paper introduces two large-scale multilingual comment datasets from YouTube: YT-30M and YT-100K. The analysis in this paper is performed on YT-100K, a smaller sample of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample of YT-30M), are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted on YouTube channels that belong to different YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
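As a rough illustration of the per-comment fields listed above, the record could be modeled as follows. This is only a sketch: the class name `Comment`, the attribute names, the field types, and the example values are assumptions made for illustration, not the column names or formats of the released files. The last lines compute the share of YT-30M that YT-100K covers, using the two counts given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    """One comment record. Attribute names and types are hypothetical."""
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    comment_text: str
    upvotes: int
    original_channel_id: str  # channel the comment was collected from
    category: str             # e.g. 'News & Politics'


# Placeholder values only; these are not taken from the dataset.
example = Comment(
    video_id="VIDEO_ID",
    comment_id="COMMENT_ID",
    commenter_name="some user",
    commenter_channel_id="COMMENTER_CHANNEL_ID",
    comment_text="Great explanation!",
    upvotes=3,
    original_channel_id="ORIGINAL_CHANNEL_ID",
    category="Science & Technology",
)

# Comment counts as stated in the abstract.
YT_30M_SIZE = 32_236_173
YT_100K_SIZE = 108_694

# Fraction of the full dataset covered by the sample.
sample_fraction = YT_100K_SIZE / YT_30M_SIZE
print(f"YT-100K covers {sample_fraction:.2%} of YT-30M")
```

The sample is therefore about a third of one percent of the full dataset, which is worth keeping in mind when generalizing statistics computed on YT-100K.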
- Publication: arXiv e-prints
- Pub Date: December 2024
- arXiv: arXiv:2412.03465
- Bibcode: 2024arXiv241203465S
- Keywords:
  - Computer Science - Social and Information Networks
  - Computer Science - Artificial Intelligence
  - Computer Science - Computation and Language
  - Computer Science - Information Retrieval
  - Computer Science - Machine Learning