BotArtist: Generic approach for bot detection in Twitter via semi-automatic machine learning pipeline
Abstract
Twitter, as one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, resulting in the spread of false information and manipulation. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges correlated with machine learning model development. Through this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each utilizing a diverse range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across nine public datasets under standardized conditions, reveals that the proposed model outperforms existing solutions by almost 10%, in terms of F1-score, achieving an average score of 83.19 and 68.5 over specific and general approaches respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions over the 10.929.533 Twitter user profiles, collected via Twitter API during the 2022 Russo-Ukrainian War, over a 16-month period. This dataset was created in collaboration with [Shevtsov et al., 2022a] where the original authors share anonymized tweets on the discussion of the Russo-Ukrainian war with a total amount of 127.275.386 tweets. The combination of the existing text dataset and the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
- Publication:
-
arXiv e-prints
- Pub Date:
- May 2023
- DOI:
- 10.48550/arXiv.2306.00037
- arXiv:
- arXiv:2306.00037
- Bibcode:
- 2023arXiv230600037S
- Keywords:
-
- Computer Science - Social and Information Networks;
- Computer Science - Artificial Intelligence;
- Computer Science - Machine Learning