Unveiling the Anonymous Author: Stylometry Techniques
One of the things that fascinates me most about writing is textual analysis, which covers both semantics (the meaning of words) and syntax (the rules by which words and punctuation are combined into sentences).
Throughout history, both classical and modern, there are numerous examples of authors who wanted to keep their identity hidden. At first glance, anonymity might seem simple to achieve: you write the text and invent a name for its author. All one needs to do is find a name that piques the reader's interest and make sure it has not already been taken by someone else. However, there is something more important to consider: text conceals many details. Sentence construction, language style, and word usage can reveal a lot about an author and their cultural and linguistic background. Therefore, to be anonymous and remain shrouded in mystery, it is not enough to have a pseudonym; one also needs a writing style that reveals as little as possible about one's culture, age, biases, and so on.
Statistical analysis applied to writing style has already unraveled some famous literary mysteries. For example, researchers at a Swiss university investigated the authorship of the novels of Elena Ferrante ― a popular writer whose books, such as L'amica geniale (My Brilliant Friend), have been translated into more than 30 languages. By analyzing characteristic terms in Ferrante's works, along with her use of certain character names, and comparing them against texts by known authors, they concluded that Domenico Starnone may be the true author ― in other words, that the same person may be writing under a pseudonym.
The statistical analysis of writing style in any type of text is called stylometry. It can also be applied to computer code and to intrinsic plagiarism detection, which involves detecting plagiarism based on changes in writing style within a document. Stylometry can even be used to predict whether someone is a native or non-native speaker by analyzing sentence structure and grammar.
As a method of analysis, stylometry can also be applied (and historically has been) to classical cryptanalysis, in an effort to find the keys to ciphers and decode messages. A well-known example is the decoding of the Zimmermann Telegram during the First World War, in which the British, using primitive stylometric techniques such as character-frequency analysis, managed to decipher part of the message. The telegram was a proposal from Germany to Mexico to form an alliance in the event that the United States entered the war. Deciphering it enabled the British to pre-empt the alliance and anticipate their enemies' moves, even though Mexico ultimately declined the proposal.
Another oft-cited application of stylometry is determining the authorship of the Federalist Papers ― a series of articles, published between 1787 and 1788, written to promote the ratification of the new U.S. Constitution. The articles were written by three authors, Alexander Hamilton, James Madison, and John Jay, under the shared pseudonym "Publius." While the primary author of some articles was already known, the authorship of others is still in question. In the early 1960s, researchers Frederick Mosteller and David Wallace used stylometric methods in an attempt to resolve this uncertainty, and further research is underway today to determine the original author(s) with certainty.
From the examples discussed above, along with countless others, it is clear that stylometry can be a great tool for examining and comparing the writing styles of different authors. While historically it was difficult to compare texts (both because the comparison was performed manually by humans and because of the small number of samples), computer science and the Internet have opened the door to new, faster, and more accurate textual analysis techniques. Today, it is possible to compare many texts at once with minimal error, and to access a practically unlimited number of texts without wasting time retrieving books from the dusty shelves of libraries.
As we all know, there are still many unsolved mysteries, not only in literature but everywhere. A popular case is the identity of Satoshi Nakamoto: at the moment, the identity of this individual (or group of individuals) who wrote the Bitcoin whitepaper is unknown. Several people have tried to analyze Satoshi's texts (including the Bitcoin whitepaper) in an attempt to establish who is truly behind the identity, but no one has yet been able to convincingly link Satoshi to a physical person. This article is meant as further encouragement toward unraveling that mystery, for instance by applying stylometric techniques to the messages Satoshi left on the Bitcointalk forum.
You might be wondering how this science works in practice and what concepts it is based upon. In the following paragraphs, we illustrate some simple techniques that can be used to analyze texts and provide an overview of possible indices for comparing different texts.
Stylometry: Basic Techniques
The idea behind stylometric analysis is very simple. Given an input text, you start by deriving some statistical characteristics regarding word usage, punctuation marks, and transcription errors, and then compare these scores with those of other texts. When the input text turns out to be similar to another text we have already analyzed, we continue by computing further statistics for both texts.
Intuitively, when the characteristics of two texts are similar in many respects, there is a fair chance that the texts share an author, that the style of one author influenced the other, or that one is a plagiarism of the other. Choosing the stylometric features of the text is the most important phase of the study: researchers have catalogued on the order of a thousand features at different levels of analysis: lexical (including the character and letter levels), syntactic, semantic, structural, and subject-specific.
Stylometric methods, however, have one major drawback: while two texts that share the same writing style are likely to be by the same person, we cannot be certain of it. Underlying the method is the idea that every author has a characteristic style, and that consequently texts with very similar stylistic characteristics are by the same author. Furthermore, it is assumed that stylistic features resulting from unconscious choices cannot easily be altered on purpose: an anonymous author often does not realize that he or she is leaving hidden traces of his or her writing style.
Like any science, stylometry requires great patience in finding the unique features of each text and comparing them. Moreover, the sample of texts against which we compare authors may be idiosyncratic and share no characteristics with the text under study. Some of the metrics that are tracked include punctuation usage, frequency of errors, arcane words, and even previously unexamined features shared between two or more authors. In the following sections, we discuss three metrics: n-grams, hapax legomena, and readability indices.
N-grams
An n-gram is a contiguous sequence of n items from a given textual sample. The items can be characters, as in the Italian word "amico" ("friend"), which yields the five character unigrams 'a' 'm' 'i' 'c' 'o'; syllables ('a' 'mi' 'co'); whole words; or sequences of words.
The size of the n-gram, and which details to focus on, vary from study to study. When particular words or lexical formulae stand out, larger n-grams that include those words are identified.
For the frequency of errors, such as typographical mistakes, it is better to focus the search on smaller n-grams (bigrams or trigrams). Since the most common letter combinations are fairly well known, misspellings can be detected almost immediately without comparing word for word against a dictionary, which is more expensive in computational terms.
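As a minimal illustration (sketched here in Python for brevity, even though the follow-up article will use C and Go), extracting character and word n-grams takes only a few lines:

```python
def char_ngrams(text, n):
    """All contiguous character sequences of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    """All contiguous word sequences of length n in text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(char_ngrams("amico", 2))   # ['am', 'mi', 'ic', 'co']
print(word_ngrams("my brilliant friend", 2))
```

Counting how often each n-gram occurs in two texts, and comparing the resulting frequency profiles, is the basis of many attribution methods.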
A well-known application of n-grams was the attribution of the Bixby letter to John Hay, Lincoln's secretary. The analysis performed by several researchers consisted of dividing the texts into n-grams of different sizes; measuring the percentage of n-gram types in the queried document that occur at least once in each candidate author's writing sample; and finally attributing the queried document to the candidate whose sample shares the highest percentage of these n-grams.
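That procedure can be sketched as follows; the texts below are hypothetical toy samples for illustration, not the real Bixby corpus, and real studies use much larger samples and several n-gram sizes:

```python
def ngram_types(text, n):
    """Set of distinct character n-grams in a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_score(query, sample, n=3):
    """Percentage of the query's n-gram types that also occur in the sample."""
    q = ngram_types(query, n)
    return 100.0 * len(q & ngram_types(sample, n)) / len(q)

# Toy writing samples (hypothetical, for illustration only).
query = "the solemn pride that must be yours"
samples = {
    "Hay": "a solemn duty and a quiet pride were his themes",
    "Other": "four score and seven years ago our fathers brought forth",
}
best = max(samples, key=lambda author: overlap_score(query, samples[author]))
```

The candidate with the highest overlap score is the attribution; the score itself indicates how confident (or not) that attribution is.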
Hapax Legomenon
A simpler analysis that can help attribute two different texts to the same author relies on hapax legomena. A hapax legomenon is a word (or form) that occurs only once in a text. Literally translated from Greek, it means "something said only once."
The use of word frequency analysis focuses on the author’s vocabulary, which is variable over time and is heavily influenced by cultural, economic, and social factors.
There are many factors that can explain the number of hapax legomena in a work:
- Length of the text: it directly affects the expected number and percentage of hapax legomena.
- Topic of the text: if the author is writing about different topics, obviously many topic-specific words recur only in limited contexts.
- Textual audience: if the author is writing to a peer instead of a student or a spouse instead of an employer, completely different vocabulary will appear.
- Time: over the years, both an author’s knowledge and the use of language will evolve.
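Counting hapax legomena is straightforward; here is a small Python sketch (the tokenization is deliberately naive, splitting on anything that is not a letter or apostrophe):

```python
from collections import Counter
import re

def hapax_legomena(text):
    """Sorted list of words that occur exactly once in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)

sample = "the cat sat on the mat while the dog slept"
print(hapax_legomena(sample))
# ['cat', 'dog', 'mat', 'on', 'sat', 'slept', 'while']
```

Because of the factors listed above, the raw count of hapax legomena is only meaningful when the texts being compared are similar in length, topic, audience, and period.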
To examine the lexical richness of a text, the type-token ratio (TTR) is used: the number of unique words (called types) divided by the total number of words (tokens) in a given segment of language. The TTR gives an estimate of a text's reading complexity, which is strongly related to the presence of many or few unique words.
A text is dense if it is full of words that appear only once, and a denser text is often more difficult to understand, especially if it is a specialized text. To give a numerical example, the lexical density of everyday speech is around 0.3 or 0.4, while more technical texts (academic and non-academic papers) have a lexical density of around 0.7.
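The TTR itself is one line of arithmetic; this sketch uses naive whitespace tokenization, which is enough to show the idea:

```python
def type_token_ratio(text):
    """Unique words (types) divided by total words (tokens)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens)

# 4 types ('to', 'be', 'or', 'not') over 6 tokens:
print(round(type_token_ratio("to be or not to be"), 2))  # 0.67
```

Note that the TTR falls as texts get longer (common words repeat more and more), so it should only be compared across segments of similar length.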
Readability Indices
A readability score is a number that tells you how easy or difficult a text is to read. The idea behind it is that people read at different levels, and something that is perfectly readable for a Ph.D. can make undergraduate students' heads spin.
Professional writing/editing firms, which employ ghostwriters and editors, routinely use readability indices to standardize the readability of each paragraph. Calculating the readability index of each sentence or paragraph lets you determine whether the text was actually designed to be easily readable and whether it mixes different styles of writing.
Among the most common readability indices are: Flesch Reading Ease, Flesch-Kincaid Grade Level, and the Gunning Fog Index.
Flesch Reading Ease
The Flesch Reading Ease, created by Rudolf Flesch in 1948, tells us approximately what level of education someone needs to read a piece of text easily.
The formula generates a score typically between 0 and 100, although scores below and above this range are possible. A conversion table is then used to interpret the score: for example, a score of 70-80 corresponds roughly to a U.S. 7th-grade (Italian sixth-grade) level, so the text should be fairly easy for an average adult to read.
The formula is as follows:
FR Score = 206.835 - 1.015 * (Total Words/Total Sentences) - 84.6 * (Total Syllables/Total Words)
|FR Score||Comprehension|
|90-100||Very easy to read, easily understood by an average 11-year-old student|
|80-90||Easy to read|
|70-80||Fairly easy to read|
|60-70||Easily understood by 13-15 year old students|
|50-60||Fairly difficult to read|
|30-50||Difficult to read, best understood by high school students|
|0-30||Very difficult to read, better understood by college graduates|
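The formula is easy to implement once you can count sentences, words, and syllables. The syllable counter below is a rough vowel-group approximation (real implementations use dictionaries or more careful heuristics), so treat the scores as estimates:

```python
import re

def syllables(word):
    """Rough syllable estimate: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Flesch Reading Ease score of a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (total_syllables / len(words)))

# Short sentences of short words score very high (above 100 is possible):
print(flesch_reading_ease("The cat sat. The dog ran."))
```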
Flesch-Kincaid Grade Level
In the mid-1970s, the U.S. Navy was looking for a way to measure the difficulty of technical manuals used in training. The Flesch Reading Ease test was revisited and, along with other readability tests, the formula was modified to be more suitable for use in the Navy. The new calculation was called the Flesch-Kincaid Grade Level. Grade levels are based on the scores of participants in a test group.
FKG Level = 0.39 * (Total Words/Total Sentences) + 11.8 * (Total Syllables/Total Words) - 15.59
|FKG Score||School Level||Comprehension|
|5.0-5.9||5th Grade||Very easy to read|
|6.0-6.9||6th Grade||Easy to read|
|7.0-7.9||7th Grade||Fairly easy to read|
|8.0-9.9||8th and 9th Grade||Colloquial English|
|10.0-12.9||10th to 12th Grade||Medium difficulty|
|13.0-15.9||College||Difficult to read|
|16.0-17.9||College Graduate||Very difficult to read, requires medium level skills.|
|18.0+||Professional (academic)||Complex reading, requires specific skills.|
For those like me who cannot think in terms of U.S. school grades, just add 5 to the grade level to estimate the age of the reader.
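Using the same rough vowel-group syllable approximation as before (an assumption of this sketch, not part of the official formula), the grade level looks like this; note that the output is a school grade rather than a 0-100 score:

```python
import re

def fk_grade_level(text):
    """Flesch-Kincaid grade level with an approximate syllable count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    total_syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (total_syllables / len(words))
            - 15.59)

grade = fk_grade_level("The cat sat. The dog ran.")
print(grade)       # well below grade 1: trivially easy text
print(grade + 5)   # approximate reader age
```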
Gunning Fog Index
The Gunning Fog Index is another readability index for English writing. It estimates the years of formal education a person needs to comprehend a text on first reading: for example, a Gunning Fog index of 12 requires the reading level of a U.S. high school graduate (about 18 years old). The test was developed in 1952 by Robert Gunning, an American businessman who had been involved in newspaper and textbook publishing.
Texts for a broad audience generally require a Gunning Fog index of less than 12. Texts requiring near-universal comprehension generally require an index of less than 8.
G = 0.4 * [(Words/Sentences) + 100 * (Complex Words/Words)]
In the formula, "Complex Words" are words of three or more syllables, excluding prefixes and suffixes, while "Sentences" is the number of grammatically complete sentences.
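Here is a Python sketch of the index; it again approximates syllables with vowel groups and, for simplicity, ignores the prefix/suffix exclusion, so its notion of "complex word" is an assumption of this example:

```python
import re

def gunning_fog(text):
    """Gunning Fog index; 'complex' words have 3+ vowel groups here."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    complex_words = [w for w in words
                     if len(re.findall(r"[aeiouy]+", w.lower())) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(gunning_fog("The cat sat on the mat."))  # no complex words: index is 0.4 * 6
```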
Conclusion: Stylometry as a Probabilistic Science
In this article, we reviewed the basic metrics used in stylometry, citing several examples from literature. Among the many readability and writing parameters, we have seen how important it is, during the analysis phase, to choose the right metrics for identifying an anonymous author. Although stylometry is a fascinating science and can help determine the authorship of texts, we must remember that these techniques are all probabilistic and can therefore also produce false positives.
In the next article, we are going to focus our study on writing a simple program that can analyze a multitude of texts using two imperative programming languages: C and Go.