Peer-Reviewed

A New Stylometry Method Basing on the Numerals Statistic

Received: 22 March 2017     Accepted: 25 April 2017     Published: 22 May 2017
Abstract

A new method for the statistical analysis of texts is proposed. It considers the frequency distribution of the first significant digits of numerals in connected English-language texts by individual authors. Benford's law is found to hold approximately for these frequencies, with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant and characteristic of the author; under certain conditions they make it possible to address questions of authorship and to distinguish between texts by different authors. Toward the end of the digit row {1, 2, …, 8, 9}, the distribution is subject to strong fluctuations and is therefore unrepresentative for this purpose. The approach and its conclusions are supported by computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson, and others. The results are confirmed by the non-parametric rank-based Mann-Whitney and Kruskal-Wallis tests as well as by the parametric Pearson's chi-squared test.
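The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that only numbers written with digits are counted (numerals spelled out in words are ignored), and the names `first_digit_counts` and `chi_squared_vs_benford` are hypothetical helpers introduced here for illustration. The sketch tallies first significant digits and computes Pearson's chi-squared statistic against the Benford probabilities P(d) = log10(1 + 1/d).

```python
import math
import re
from collections import Counter

# Benford's law: probability that the first significant digit equals d.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_counts(text):
    """Return counts of first significant digits 1..9 as a list of 9 ints.

    Assumption: only digit-written numbers are considered; numerals
    spelled out in words ("twenty") are not detected by this sketch.
    """
    counts = Counter()
    # A token is a run of digits, optionally with '.' or ',' separators.
    for token in re.findall(r"\d+(?:[.,]\d+)*", text):
        for ch in token:
            if ch in "123456789":  # skip leading zeros and separators
                counts[int(ch)] += 1
                break
    return [counts.get(d, 0) for d in range(1, 10)]


def chi_squared_vs_benford(counts):
    """Pearson's chi-squared statistic of observed counts vs. Benford."""
    n = sum(counts)
    if n == 0:
        raise ValueError("no numerals found")
    stat = 0.0
    for d, observed in zip(range(1, 10), counts):
        expected = n * BENFORD[d]
        stat += (observed - expected) ** 2 / expected
    return stat


sample = "In 1938 he paid 12 pounds, then 0.05 more, and 300 later; 1,200 in all."
counts = first_digit_counts(sample)
print(counts)
print(chi_squared_vs_benford(counts))
```

With nine digit classes the statistic has 8 degrees of freedom, for which the 5% critical value is about 15.51. The approximation is only meaningful for samples far larger than the toy sentence above, and, as the abstract notes, the counts for the highest digits fluctuate strongly.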

Published in International Journal on Data Science and Technology (Volume 3, Issue 2)
DOI 10.11648/j.ijdst.20170302.11
Page(s) 16-23
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2017. Published by Science Publishing Group

Keywords

Benford’s Law, Statistic of Numerals, Text Attribution, Text Processing, English-Language Fiction, Mann-Whitney U Test, Pearson's Chi-Squared Test

Cite This Article
  • APA Style

    Andrei Viacheslavovich Zenkov, Larisa Anatolievna Sazanova. (2017). A New Stylometry Method Basing on the Numerals Statistic. International Journal on Data Science and Technology, 3(2), 16-23. https://doi.org/10.11648/j.ijdst.20170302.11

    ACS Style

    Andrei Viacheslavovich Zenkov; Larisa Anatolievna Sazanova. A New Stylometry Method Basing on the Numerals Statistic. Int. J. Data Sci. Technol. 2017, 3(2), 16-23. doi: 10.11648/j.ijdst.20170302.11

    AMA Style

    Andrei Viacheslavovich Zenkov, Larisa Anatolievna Sazanova. A New Stylometry Method Basing on the Numerals Statistic. Int J Data Sci Technol. 2017;3(2):16-23. doi: 10.11648/j.ijdst.20170302.11

  • BibTeX

    @article{10.11648/j.ijdst.20170302.11,
      author = {Andrei Viacheslavovich Zenkov and Larisa Anatolievna Sazanova},
      title = {A New Stylometry Method Basing on the Numerals Statistic},
      journal = {International Journal on Data Science and Technology},
      volume = {3},
      number = {2},
      pages = {16-23},
      doi = {10.11648/j.ijdst.20170302.11},
      url = {https://doi.org/10.11648/j.ijdst.20170302.11},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdst.20170302.11},
      abstract = {A new method of statistical analysis of texts is suggested. The frequency distribution of the first significant digits in numerals of connected authorial English-language texts is considered. Benford's law is found to hold approximately for these frequencies with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant author peculiarities that allow, under certain conditions, to consider the problem of authorship and distinguish between texts by different authors. At the end of {1, 2,…, 8, 9} row, the digits distribution is subject to strong fluctuations and thus unrepresentative for our purpose. The approach suggested and the conclusions are backed by the examples of the computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson et al. The results are confirmed on the basis of non-parametric range Mann-Whitney and Kruskal-Wallis tests as well as the parametric Pearson's chi-squared test.},
      year = {2017}
    }
    

  • RIS

    TY  - JOUR
    T1  - A New Stylometry Method Basing on the Numerals Statistic
    AU  - Andrei Viacheslavovich Zenkov
    AU  - Larisa Anatolievna Sazanova
    Y1  - 2017/05/22
    PY  - 2017
    N1  - https://doi.org/10.11648/j.ijdst.20170302.11
    DO  - 10.11648/j.ijdst.20170302.11
    T2  - International Journal on Data Science and Technology
    JF  - International Journal on Data Science and Technology
    JO  - International Journal on Data Science and Technology
    SP  - 16
    EP  - 23
    PB  - Science Publishing Group
    SN  - 2472-2235
    UR  - https://doi.org/10.11648/j.ijdst.20170302.11
    AB  - A new method of statistical analysis of texts is suggested. The frequency distribution of the first significant digits in numerals of connected authorial English-language texts is considered. Benford's law is found to hold approximately for these frequencies with a marked predominance of the digit 1. Deviations from Benford's law are statistically significant author peculiarities that allow, under certain conditions, to consider the problem of authorship and distinguish between texts by different authors. At the end of {1, 2,…, 8, 9} row, the digits distribution is subject to strong fluctuations and thus unrepresentative for our purpose. The approach suggested and the conclusions are backed by the examples of the computer analysis of works by W. M. Thackeray, M. Twain, R. L. Stevenson et al. The results are confirmed on the basis of non-parametric range Mann-Whitney and Kruskal-Wallis tests as well as the parametric Pearson's chi-squared test.
    VL  - 3
    IS  - 2
    ER  - 

Author Information
  • Department “Modelling of Controllable Systems”, Ural Federal University, Ekaterinburg, Russia

  • Department of Statistics, Econometrics and Computer Science, Ural State University of Economics, Ekaterinburg, Russia
