Rank-frequency distribution of natural languages: A difference of probabilities approach
Abstract
In this paper we investigate the time variation of the rank k of words in six Indo-European languages using the Google Books N-gram Dataset. Based on numerical evidence, we regard k as a random variable whose dynamics may be described by a Fokker-Planck equation, which we solve analytically. For low ranks the languages behave differently, possibly because of syntax rules, whereas for k > 50 the law of large numbers predominates. We analyze the frequency distribution of words in the data and fit it with time-dependent probability density distributions. We find small differences between the data and the fits due to conflicting dynamic mechanisms, but the data behave consistently with our general approach. For the lower ranks the behavior of the data varies among languages, presumably, again, because of distinct dynamic mechanisms. We discuss a possible origin of these differences and assess the novel features and limitations of our work.
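As an illustration of the quantity studied, the rank k of a word in a given year can be obtained by sorting that year's word counts in descending order. The sketch below is a minimal example under stated assumptions, not the authors' pipeline: the function names `rank_words` and `rank_trajectory`, the alphabetical tie-breaking rule, and the toy counts are all hypothetical and are not taken from the Google Books N-gram Dataset.

```python
from collections import Counter


def rank_words(counts):
    """Map each word to its rank k (1 = most frequent).

    Ties are broken alphabetically; this rule is an assumption
    made for the example, not something stated in the abstract.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: k for k, (word, _) in enumerate(ordered, start=1)}


def rank_trajectory(yearly_counts, word):
    """Return {year: rank of `word`}, with None for years where it is absent."""
    return {
        year: rank_words(counts).get(word)
        for year, counts in sorted(yearly_counts.items())
    }


# Toy counts, invented purely for illustration.
yearly_counts = {
    1900: Counter({"the": 100, "of": 60, "and": 50, "ship": 5}),
    1950: Counter({"the": 120, "and": 70, "of": 65, "ship": 3}),
}

if __name__ == "__main__":
    for w in ("the", "of", "and"):
        print(w, rank_trajectory(yearly_counts, w))
```

Tracking k for a fixed word across years in this way yields the kind of time series that the abstract treats as a random variable.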
- Publication:
- Physica A: Statistical Mechanics and its Applications
- Pub Date:
- October 2019
- DOI:
- arXiv:
- arXiv:1811.09451
- Bibcode:
- 2019PhyA..53221795C
- Keywords:
- Rank dynamics;
- Languages;
- Master equation;
- Fokker-Planck equation;
- Physics - Physics and Society;
- Statistics - Applications
- E-Print:
- 11 pages