An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification

doi:10.48550/arXiv.2412.17361

This study investigates the performance of three popular tokenization tools (MeCab, Sudachi, and SentencePiece) when applied as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens closely aligned with dictionary definitions, while MeCab and SentencePiece demonstrate faster processing speeds. The combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the other combinations in terms of classification performance.
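The pipeline described above (tokenize, weight with TF-IDF, classify) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `char_bigrams` is a hypothetical stand-in for a real tokenizer such as MeCab, Sudachi, or SentencePiece, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 is one common variant and may differ from the paper's, the training sentences are invented toy examples, and only the Multinomial Naive Bayes classifier is shown (Logistic Regression is omitted).

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    # Hypothetical stand-in tokenizer: overlapping character bigrams.
    # It is NOT MeCab, Sudachi, or SentencePiece; swap in a real tokenizer here.
    return [text[i:i + 2] for i in range(len(text) - 1)]


def fit_idf(tokenized_docs):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1 (assumed variant).
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf(tokens, idf):
    # Raw term count times IDF; terms unseen at fit time are dropped.
    tf = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def train_nb(texts, labels, tokenize=char_bigrams, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with additive smoothing.
    docs = [tokenize(t) for t in texts]
    idf = fit_idf(docs)
    weights = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        for term, w in tfidf(tokens, idf).items():
            weights[label][term] += w
    n_terms = len(idf)
    model = {"idf": idf, "tokenize": tokenize, "classes": {}}
    for label, n in sorted(Counter(labels).items()):
        denom = sum(weights[label].values()) + alpha * n_terms
        model["classes"][label] = {
            "log_prior": math.log(n / len(labels)),
            "log_lik": {t: math.log((weights[label][t] + alpha) / denom) for t in idf},
        }
    return model


def predict(model, text):
    # Pick the class with the highest log prior plus weighted log likelihood.
    vec = tfidf(model["tokenize"](text), model["idf"])

    def score(label):
        params = model["classes"][label]
        return params["log_prior"] + sum(w * params["log_lik"][t] for t, w in vec.items())

    return max(model["classes"], key=score)


# Invented toy sentences for illustration only (not the paper's dataset).
train_texts = [
    "この映画は最高でした",
    "とても楽しい時間でした",
    "この映画は最悪でした",
    "とても退屈な時間でした",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = train_nb(train_texts, train_labels)
print(predict(model, "最高の映画"))  # pos
print(predict(model, "退屈で最悪"))  # neg
```

Because the tokenizer is passed in as a plain function, comparing tokenizers in this sketch amounts to calling `train_nb` with a different `tokenize` argument and measuring accuracy and runtime on held-out data.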

Publication:

arXiv e-prints

Pub Date:

December 2024

DOI:

10.48550/arXiv.2412.17361

arXiv:

arXiv:2412.17361

Bibcode:

2024arXiv241217361R

Keywords:

Computer Science - Computation and Language

E-Print:

Accepted at The 27th Annual Meeting of the Association for Natural Language Processing (NLP2021). Published version available at: https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/D3-1.pdf
