While writing an upcoming post about syllables, I realized that there is an important topic I must discuss first: type-token ratio (TTR). In particular, I want to test a more advanced technique called MATTR, which may prove useful in Voynich research.

Tokens, Types and TTR: introduction

In short, tokens are all the words in a text, while types are all the different words. If you increase a text’s number of tokens, it becomes longer. If you increase its number of types, its vocabulary becomes more diverse. By dividing the number of types in a text by its number of tokens, you get its type-token ratio (TTR).
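
As a minimal sketch of this definition in Python: the function name and the example sentence are my own illustration, and the text is assumed to be already lowercased and split into words.

```python
def type_token_ratio(tokens):
    """Number of distinct words (types) divided by the number of words (tokens)."""
    if not tokens:
        raise ValueError("cannot compute TTR of an empty text")
    return len(set(tokens)) / len(tokens)


# Hypothetical example sentence: 11 tokens, 8 types.
tokens = "the cat sat on the mat and the dog sat too".split()
print(len(tokens), len(set(tokens)), round(type_token_ratio(tokens), 3))
# prints: 11 8 0.727
```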

TTR is mostly used in linguistics to determine the richness of a text’s or speaker’s vocabulary. This has applications in literature and in the study of children’s language development. But TTR also has applications that might be of more interest to us; it can provide an indication of the type of language we are dealing with. This is discussed in Kimmo Kettunen’s paper Can Type-Token Ratio be Used to Show Morphological Complexity of Languages? [1], which I will regularly refer to in this post.

A difficulty in researching TTR is that it is affected by text length. The longer a text runs on, the less new vocabulary is introduced. Hence, longer texts will increasingly lean towards the “tokens” side of the equation: more words (tokens) are added, but fewer of those words represent new types. In a single sentence, on the other hand, almost every word is new and TTR is typically 1 or close to it. Or, to put it more simply, the number of tokens you can add is infinite, while type diversity will approach a ceiling after a while. Tokens increase linearly, while types do not.
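
The effect of length can be illustrated with a toy text. This is only a sketch: the artificial "text" below cycles through a fixed vocabulary of 50 invented words, which exaggerates the ceiling that real texts approach more gradually.

```python
def ttr_curve(tokens, checkpoints):
    """TTR of the first n tokens, for every n in checkpoints (single pass)."""
    wanted = set(checkpoints)
    seen = set()
    curve = {}
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position in wanted:
            curve[position] = len(seen) / position
    return curve


# Hypothetical toy text: 1000 tokens drawn from only 50 types.
toy_text = [f"w{i % 50}" for i in range(1000)]
print(ttr_curve(toy_text, [10, 50, 100, 1000]))
# prints: {10: 1.0, 50: 1.0, 100: 0.5, 1000: 0.05}
```

Once all 50 types have appeared, every further token only lowers the ratio.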

TTR and Voynichese

TTR has a huge advantage for the study of the Voynich manuscript: it is possible. For contrast, one way the complexity of languages is compared is to count the number of nominal case forms. We can’t do this for Voynichese, because we don’t know what its nouns are. But we can compare its total number of “words” to its number of different “words”. Additionally, TTR is not bothered by many of the parsing questions we face with Voynichese. It doesn’t matter whether we transcribe the common “aiin” ending as four letters or two, as long as we are consistent.

One possible problem is the presence of different “languages” in the manuscript, and the way they may or may not be divided. Traditionally, the manuscript is considered to contain two “languages”, Currier A and B. Currier B introduces other words (or rather word parts), which might cause an abnormal bump in the type graph. While several researchers have argued that matters are not so simple and there might be more of a continuum of change, it is certain that the division between A and B is a meaningful one (for a clear example, see this post by Julian Bunn).

So while I acknowledge that a hard split between Currier A and B is not the ideal approach, it is the best I can do for now, and for this particular study it is to be preferred over using the corpus as a whole.

Kettunen demonstrates that TTR is “able to show morphological complexity of a language reliably enough regardless of the used material.” In other words, TTR should give an indication of the language’s morphological complexity – which can be compared to that of other languages.

MATTR

Kettunen also discusses a more advanced form of TTR called MATTR: Moving Average Type-Token Ratio. Covington and McFall (2010) [2] describe it as follows:

We cut the Gordian knot by computing and averaging the moving average type–token ratio (MATTR). … We choose a window length (say 500 words) and then compute the TTR for words 1–500, then for words 2–501, then 3–502, and so on to the end of the text. The mean of all these TTRs is a measure of the lexical diversity of the entire text and is not affected by text length or by any statistical assumptions.

MATTR slides a window of fixed length (typically 500 words) through the text one word at a time and calculates the TTR at each position. The mean of these values is then taken as, in a way, the most typical TTR value for 500 words of this language. This means that, theoretically, a text of 10k words should have a MATTR similar to that of a 100k-word text in the same language. With simple TTR, on the other hand, texts of such unequal sizes cannot be meaningfully compared.
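
The procedure in the Covington and McFall quote, with a window that advances one word at a time, can be sketched in Python as follows. This is an illustration only and not the code behind the measurements in this post; the function name and the choice to raise an error for texts shorter than the window are my own assumptions.

```python
from collections import Counter


def mattr(tokens, window=500):
    """Mean TTR over all windows tokens[i:i+window], moving one token at a time."""
    n = len(tokens)
    if window < 1:
        raise ValueError("window must be at least 1")
    if n < window:
        raise ValueError("text is shorter than the window")
    counts = Counter(tokens[:window])      # word frequencies inside the current window
    total = len(counts) / window           # TTR of the first window
    for i in range(window, n):
        outgoing = tokens[i - window]      # word leaving the window on the left
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]           # keep len(counts) equal to the type count
        counts[tokens[i]] += 1             # word entering the window on the right
        total += len(counts) / window
    return total / (n - window + 1)


print(mattr("a b a b".split(), window=2))  # every 2-word window has 2 types: 1.0
```

When the window is as long as the text, there is only one window and the result is the plain TTR.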

MATTR has an additional advantage for Voynichese. If it is true that the “language” in the manuscript fluctuates from page to page rather than just between Currier A and B sections, the potential increase in vocabulary caused by these minor fluctuations is negated by MATTR, which only measures TTR one window at a time. For this reason, it should be interesting to compare both TTR and MATTR data for the Voynich Manuscript.

For reference, I’ll summarize Kettunen’s results with MATTR (500 word window). He compared two opposite text types:

  1. The EU constitution in all available languages. The EU constitution is a modern legal text about a relatively limited subject. Languages are all European, but include various families and types. A language like Finnish has 26 noun forms, while Spanish only has 2. The vocabulary in this text is expected to be very narrow, given the genre.
  2. Unrelated sentences from a corpus (Leipzig corpora). All sentences are natural and grammatical, but there is no thematic connection between them. The vocabulary is expected to be broad, resulting in higher MATTR values.

For the EU constitution, MATTR percentages ranged from 39% for English to 60% for Finnish. For the Leipzig corpus (unrelated sentences) they ranged from 61% for Spanish to a whopping 86% for Finnish.

These values are just to give you an idea of the possibilities though. There is no sense in comparing the EU constitution to anything written in the 15th century.

The window size we choose for MATTR is important. This is explained in Covington & McFall (2010). They suggest a window of 500 words for the type of analysis we will be doing (Kettunen also uses 500). But then they add something which might be very interesting for our purpose:

A short window, perhaps as short as 10 words, is appropriate if the goal is to detect repetition of immediately preceding words or phrases due to dysfluent production. In fact, the ratio of MATTRs with two different window sizes is a potentially useful indication of whether repetition occurs over short or long spans.

Seasoned Voynich researchers will understand where I’m going with this: it is often claimed that the VM displays unusual repetition patterns, with words being repeated a few times within a short distance. If we compare a standard-window MATTR (500 words) with a shorter one for each language, we can test whether Voynichese is indeed unusual in its short-span repetition.
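
Here is a small sketch of the two-window comparison. Covington and McFall only say that the ratio of two MATTRs is potentially useful; dividing the short-window value by the long-window value, the tiny window sizes and both toy texts below are my own assumptions for illustration.

```python
def mattr(tokens, window):
    """Plain (unoptimised) MATTR: mean TTR over all overlapping windows."""
    if not 1 <= window <= len(tokens):
        raise ValueError("window must be between 1 and the text length")
    starts = range(len(tokens) - window + 1)
    types_per_window = (len(set(tokens[i:i + window])) for i in starts)
    return sum(types_per_window) / (window * len(starts))


def mattr_ratio(tokens, short=10, long=500):
    """Short-window MATTR divided by long-window MATTR."""
    return mattr(tokens, short) / mattr(tokens, long)


# Hypothetical toy texts over the same 8 invented words.
fluent = [f"w{i % 8}" for i in range(64)]           # w0 w1 ... w7 w0 w1 ...
stutter = [f"w{(i // 2) % 8}" for i in range(64)]   # w0 w0 w1 w1 ... (immediate repeats)

for name, text in [("fluent", fluent), ("stutter", stutter)]:
    print(name, round(mattr(text, 4), 3), round(mattr(text, 16), 3),
          round(mattr_ratio(text, short=4, long=16), 3))
```

Both toy texts have the same long-window value, but the one with immediate repetitions scores lower in the short window, so its ratio is lower.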

Corpora

As an initial test I decided to compare Voynichese to a few texts which are easily accessible to me and existed in the Middle Ages.

For Latin I selected the first 10k words from Pliny the Elder’s Natural History, Liber II (on astronomy). The text was cleaned with punctuation removed, but elements like Greek words and Roman numerals within the text were kept.

For Greek, I took the first 10k words of the Iliad, from books 1 and 2 as found here.

For Middle Dutch, I selected two texts. On the one hand, the first 10k words of Van den vos Reynaerde. This edition is from the Comburg manuscript, written between 1380 and 1425 in the area of Ghent. On the other hand, to study the effect of text type within the same language, I included the first 10k words of Maerlant’s Der Naturen Bloeme, a didactic poem on nature. The chapters I selected were regular trees, spice trees and healing herbs. Text found here.

More languages and text types must be studied at a later stage, but these will suffice for an initial demonstration.
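
The preparation of the samples described above (punctuation removed, first 10k words kept) can be approximated as follows. The exact cleaning steps are not spelled out in this post, so the details here are assumptions: Unicode normalisation, optional lowercasing, and treating every run of letters or digits as one word, which would split words written with an apostrophe.

```python
import re
import unicodedata


def first_n_words(raw_text, n=10_000, lowercase=True):
    """Strip punctuation and return the first n words of raw_text."""
    text = unicodedata.normalize("NFC", raw_text)  # compose accented letters (e.g. Greek)
    if lowercase:
        text = text.lower()                        # assumption: case is not distinctive
    words = re.findall(r"\w+", text)               # runs of letters/digits; drops punctuation
    return words[:n]


print(first_n_words("Mundum, et hoc; quodcumque!", n=3))
# prints: ['mundum', 'et', 'hoc']
```

Roman numerals and foreign words inside the text survive this step, in line with the description of the Pliny sample.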

MATTR on Ancient and Medieval texts

Let’s first have a look at the MATTR values for the Latin, Greek and Middle Dutch texts.

[Graph: MATTR by window size for the Latin, Greek and Middle Dutch texts]

Two things stand out in the above graph. First of all, it is clear that Latin and Greek (blue and red) show very similar behavior. This is not entirely unexpected, because the languages are similar in the number of noun cases. The same is true for both Dutch texts (yellow and green). Here, the text type seems to have caused a small difference, with Reynaert being more “repetitive”. This makes perfect sense, since the Reynaert is one narrative text about a limited number of characters, while the Maerlant fragment deals with various trees and herbs.

Secondly, MATTR values converge as the window becomes smaller. When taking the average over 2000 word windows (left side of the graph), the differences between language types really stand out. This is still the case at the 500 window. But when the window is shrunk all the way to 25 and the minimum of 5, values rush upwards. This is normal: for a window of only five words, an average of 99% unique words can be expected.

Let’s zoom in on the rightmost part of this graph, MATTR window size 25 to 5. We see the expected convergence, but no lines crossing. This means that the diversity of vocabulary is evenly distributed in all texts. Using a larger or smaller window does not affect their relative position.

[Graph: zoom on MATTR window sizes 25 to 5 for the same texts]

In conclusion, these texts behave completely as expected and MATTR appears to be an effective tool also for non-modern texts.

Enter the Voynich

Since we have seen that Latin and Greek behave similarly and both Middle Dutch texts do as well, I will limit the following to one example of each type: Pliny and Maerlant.

Introducing the Voynich into the mix gives an interesting result. On the large scale, things look normal, but zooming in, an anomaly can be found. [Note: this is a transcription of the entire VM as one continuous text]. Let’s look at the big picture first.

[Graph: MATTR by window size for Pliny, the VM and Maerlant]

The red line represents the Voynich data. It sits nicely in between Latin and Middle Dutch, and the line evolves in precisely the same way. What this graph tells us is the following: as far as vocabulary variation goes, the VM behaves exactly like a Medieval text.

However, there is one caveat. Let us zoom in on the narrower MATTR windows, just like we did for the previous graph. Note how the yellow Maerlant line and the red VM line cross.

[Graph: zoom on the narrower MATTR windows for Pliny, the VM and Maerlant]

This means that, averaged over 15-word windows, Maerlant is still slightly more repetitive than the VM. But reduce the window to 10 words, and the VM takes over. These statistics confirm the suspicion that the VM likes to repeat words to an unusual degree.

It must be noted that the differences are tiny, and the VM is still at 98.5% average word uniqueness per 5-word window.

Window   Maerlant   VM      Pliny
5        0.99       0.985   0.995
10       0.97       0.969   0.984
25       0.917      0.931   0.956
100      0.777      0.827   0.87
200      0.687      0.757   0.812
300      0.634      0.713   0.778
400      0.596      0.681   0.752
500      0.567      0.655   0.732
1000     0.483      0.575   0.662
2500     0.376      0.474   0.561

So in my opinion we must really consider both graphs simultaneously: the VM behaves normally, but has a slightly higher tendency to repeat words in close proximity. At the smallest windows the difference with Middle Dutch is at most half a percentage point (0.985 versus 0.99 at the 5-word window), but this is enough to make the graphs cross.
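
The crossing can also be read directly from the table. The sketch below uses the tabulated values for the VM, Maerlant and Pliny and looks for neighbouring window sizes where the sign of the difference flips; the helper name is my own. Since the table has no entry between 10 and 25, it can only place the crossing somewhere in that interval.

```python
windows = [5, 10, 25, 100, 200, 300, 400, 500, 1000, 2500]
maerlant = [0.99, 0.97, 0.917, 0.777, 0.687, 0.634, 0.596, 0.567, 0.483, 0.376]
vm = [0.985, 0.969, 0.931, 0.827, 0.757, 0.713, 0.681, 0.655, 0.575, 0.474]
pliny = [0.995, 0.984, 0.956, 0.87, 0.812, 0.778, 0.752, 0.732, 0.662, 0.561]


def crossings(xs, a, b):
    """Pairs of neighbouring x values between which curve a crosses curve b."""
    found = []
    for i in range(len(xs) - 1):
        before = a[i] - b[i]
        after = a[i + 1] - b[i + 1]
        if before * after < 0:          # the difference changes sign
            found.append((xs[i], xs[i + 1]))
    return found


print(crossings(windows, vm, maerlant))  # prints: [(10, 25)]
print(crossings(windows, vm, pliny))     # prints: []
```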

What about Currier languages?

It has long been observed that various sections and subsections of the VM exhibit a different “vocabulary”. Traditionally, the two main varieties are called Currier A and Currier B, although it is likely that the situation is much more complex. For the purpose of this post, however, I will stick to both Currier languages. To my best understanding, these are best thought of as dialects of the same “language”, with many similarities but also consistent differences and preferences.

Since these differences can be observed at the level of vocabulary, TTR and MATTR might be useful tools in their study. I intend the following only as a demonstration of the concept – it is well beyond my skill to solve the age-old question of variation in Voynichese.

In order to test this, I isolated a few sections: Herbal A, which is Currier A; Herbal B (relatively small) and Quire 20 (Currier B). This is how their MATTR data compare.

Window   VM      Herbal A   Herbal B   Q20
5        0.985   0.982      0.988      0.987
10       0.969   0.963      0.973      0.975
25       0.931   0.921      0.942      0.943
100      0.827   0.812      0.844      0.843
200      0.757   0.743      0.773      0.771
500      0.655   0.641      0.659      0.666
1000     0.575   0.555      0.565      0.588

Here’s the graph from these measurements. The outer lines are still Pliny and Maerlant, the central red line is still the entire VM.

[Graph: MATTR by window size for Pliny, Maerlant, the entire VM, Herbal A, Herbal B and Quire 20]

What we see is that Herbal A has a slightly lower degree of vocabulary variation than the B-sections. Herbal B is like the entire manuscript for large windows, but from around the 300-word window it aligns perfectly with Quire 20, which is indeed in the same “dialect”. Overall, there appears to be more repetition of vocabulary in Herbal A than there is in the B-sections. Here’s the zoom:

[Graph: zoom on the narrower MATTR windows for the same texts and VM sections]

The yellow line, Maerlant, catches up with Herbal A around the 20-word window. For the B-sections, the window needs to be smaller to achieve the same density of repetition as Maerlant. [Side note: Reynaert does the same, so this is not some quirk of Maerlant’s text].

Still, all in all the differences remain small. I am surprised to see how both B-sections coincide, and how each VM section separately still behaves like a normal language. It is hard to know what exactly these data imply, but my subjective feeling is that it would be hard to achieve these numbers if the VM did not have some “text” at its basis.

Conclusion

I have shown that the lexical diversity in the VM sits well within the range of that of normal texts. Pliny and Homer have a higher lexical diversity, which I attribute mainly to the larger number of noun forms in Latin and Greek. On the other hand, two different Middle Dutch texts showed a lower lexical diversity than Voynichese. This can be explained by the lower number of distinct noun forms in Middle Dutch, but perhaps also by the text types. More classical and medieval texts and languages must be compared.

Secondly, it appeared that Currier A sections have a lower per-window lexical diversity than Currier B sections. Still, this difference remained much smaller than the difference between Latin and Middle Dutch. In other words, the difference in lexical diversity between Currier A and B is what we would expect from related languages or from slightly different text types, not from vastly different languages.
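
To put numbers on this comparison, the sketch below takes the values from the two tables above for the window sizes they share and prints the Pliny-Maerlant gap next to the Quire 20-Herbal A gap. Using Quire 20 as the representative B-section is my own choice for the illustration.

```python
windows = [5, 10, 25, 100, 200, 500, 1000]
pliny = [0.995, 0.984, 0.956, 0.87, 0.812, 0.732, 0.662]
maerlant = [0.99, 0.97, 0.917, 0.777, 0.687, 0.567, 0.483]
herbal_a = [0.982, 0.963, 0.921, 0.812, 0.743, 0.641, 0.555]
q20 = [0.987, 0.975, 0.943, 0.843, 0.771, 0.666, 0.588]


def gaps(high, low):
    """Per-window difference between two MATTR series, rounded to 3 decimals."""
    return [round(h - l, 3) for h, l in zip(high, low)]


latin_vs_dutch = gaps(pliny, maerlant)
b_vs_a = gaps(q20, herbal_a)

for w, big, small in zip(windows, latin_vs_dutch, b_vs_a):
    print(f"{w:>5}  Pliny-Maerlant {big:.3f}   Q20-Herbal A {small:.3f}")
```

At the smallest windows all values converge and the two gaps are about equal, so the contrast only shows from roughly the 100-word window upwards, where the A/B gap stays below half of the Latin/Dutch gap.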

There is no limit to the data we can compare using TTR and MATTR. There is room for the comparison of additional languages and text types, as well as VM-internal study.


[1] Kimmo Kettunen (2014) Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?, Journal of Quantitative Linguistics, 21:3, 223-245, DOI: 10.1080/09296174.2014.911506

[2] Covington, M., & McFall, J. D. (2010). Cutting the Gordian knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100.

— Software for calculating MATTR is available here: http://ai1.ai.uga.edu/caspr/

— Special thanks to Marco Ponzi for his valuable assistance during this research, and to Rene Zandbergen and bi3mw for providing the required text files.