This is the third post about moving average type-token ratio (MATTR). If you are not familiar with this concept, I highly advise reading the other installments first:

Type-Token Ratio
Type-Token Ratio II

Language Clouds

Different texts and languages have different profiles over increasing TTR windows. One text repeats much vocabulary within the same paragraph, while another may spread out repeated words evenly throughout the entire text. Therefore, it is useful to compare several values. For this post, I opted for a 50- and a 1000-word window. Fifty-word chunks are large enough to be unaffected by Voynichese’s odd small-scale behavior. And 1000 words (after pre-processing) is the minimum length for a text to be included in the corpus.
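For readers who want to try this at home, here is a minimal Python sketch of the measurement. It is not the Java code used for the calculations in this post; the function names `mattr` and `ttr_point` are my own, and the input is assumed to be a list of already pre-processed word tokens.

```python
from collections import Counter


def mattr(tokens, window):
    """Mean type-token ratio over every full window of `window` tokens."""
    if window < 1 or len(tokens) < window:
        raise ValueError("need at least one full window")
    counts = Counter(tokens[:window])      # word counts inside the current window
    total = len(counts) / window           # TTR of the first window
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]       # token that slides out on the left
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1             # token that slides in on the right
        total += len(counts) / window
    return total / (len(tokens) - window + 1)


def ttr_point(tokens, small=50, large=1000):
    """Return the (small-window, large-window) MATTR pair for one text."""
    return mattr(tokens, small), mattr(tokens, large)


# Toy input (not real data): 1000 tokens cycling through 100 distinct "words".
toy = [f"w{i % 100}" for i in range(1000)]
print(ttr_point(toy))  # (1.0, 0.1)
```

The toy text shows why two windows are informative: every 50-word stretch contains only new words, yet over 1000 words the vocabulary turns out to be tiny.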

Measuring two windows for each text allows me to represent the evolving density of vocabulary in a two-dimensional field. The following graph looks chaotic at first, but it illustrates the method well. Each dot represents a medieval text, and each color is a language or language group. The green arrow points at Voynichese.


Some points of interest

  • Languages overlap, but they do form clear clouds.
    • English (dark blue) sits at low m1000-values and low-to-mid m50. German (light blue) overlaps with English but can go higher. Taking both Germanic languages together, they form a continuum in the lower m1000 regions.
    • Latin (red) is at the top of the spectrum. However, as I’ve noted before, medieval Latin was written by people of very different nationalities and backgrounds, so it is represented all over the graph.
    • Greek sits somewhat between Latin and Germanic.
  • Voynichese dots (green arrow) are not outliers at these window sizes. They blend into the pack.
  • Voynichese dots are situated to the right of the trend line. This means that the smaller window (m50) has relatively high TTR values compared to the large window. In the previous posts, we saw that Voynichese values become abnormally low for small windows, making them outliers to the left of the trend line.
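To make "right of the trend line" concrete, here is a rough Python sketch. It rests on assumptions of mine: m50 is on the horizontal axis, m1000 on the vertical axis, and the trend line is an ordinary least-squares fit, which may not be how the line in the graph was drawn. The function names and the toy numbers are made up for illustration.

```python
def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def side_of_trend(point, slope, intercept):
    """Say whether a (m50, m1000) point lies left or right of the line."""
    x, y = point
    residual = y - (slope * x + intercept)   # vertical distance to the line
    if residual == 0 or slope == 0:
        return "on" if residual == 0 else "undefined"
    below = residual < 0
    # For a rising line, "below" and "to the right" are the same thing:
    # m50 is high relative to what m1000 would predict.
    return "right" if below == (slope > 0) else "left"


# Toy (m50, m1000) pairs, not measurements from the corpus.
corpus = [(0.80, 0.40), (0.85, 0.45), (0.90, 0.50)]
slope, intercept = fit_line(corpus)
print(side_of_trend((0.90, 0.42), slope, intercept))  # right
```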

This and previous tests [1] suggest that Voynichese’s tendency to immediately repeat repeat words repeat words does not have a noticeable impact on large-window values. This means that we can safely treat Voynichese’s overall TTR and its small-window patterns of repetition as two separate problems.

In the next figures, I will isolate the various corpora we have collected so far to get a better idea of how Voynichese compares to other medieval texts.


Let us get Latin out of the way first, since its data stretch the graph on both sides. TTR-wise, medieval Latin is a fickle beast: it’s all over the place. The green Voynichese dots don’t look out of place in the Latin herd.


There are outliers at the bottom, top and left, but those are all Latin texts, not Voynichese. One green dot does drift far away from the rest: Quire 13 has significantly lower TTR values than any other Voynichese subsection in any of the major transcriptions. But more on that later. If Voynichese were a Latin text, it would be among the more repetitive ones.



Our Voynichese measurements land in the periphery of the Greek cloud, but again don’t look like complete outliers. Unlike what we saw with Latin, most Greek texts have a lower TTR than Voynichese. Again, the exception is Quire 13, which appears to fall out of the bottom of the cloud.

German and English


Above, the light blue dots represent medieval German texts, with dark blue for English. Since these are related languages, it is no wonder that they form one cloud. Here, we see a different picture than with Latin and Greek. Most Voynichese measurements are at the high end, while Quire 13 feels right at home in the middle of the Germanic cloud.

Celtic and Slavic


Celtic texts (light grey) appear rather variable. I have collected a dozen medieval Irish texts, and they don’t form a clear cloud like other languages do. These first results suggest that I might better focus on other languages first.

Slavic languages (dark grey / black) on the other hand, do show promise. Three out of five collected texts fall right into Voynichese territory. Collecting more medieval Slavic texts should be a priority in further TTR research so we can build a complete cloud.

French and Italian


Collecting medieval texts, checking them for quality and pre-processing them takes time, and I have not yet been able to expand a corpus for a Romance language. I spent some time looking for Italian texts, but apart from Dante and some other well-known authors, I haven’t found many yet. Let me know if you can find a collection.

So far, the few entries for Italian (yellow) show that its cloud might get close to the main Voynichese pack. This can only be tested by gathering some two dozen more medieval Italian texts.


The French corpus is also still limited for now. Based on these very provisional data, it looks like French might sit around the Quire 13 value. This also requires expanding.

What’s up with Quire 13?

The following graph shows MATTR values over 500-word windows. Blue bars represent the median value of all texts for a language. Keep in mind that for Slavic, French and Italian these are still limited.


Here, it becomes clear how different Quire 13 is from the other VM text sections. It’s lower than the median German value, and just a hair’s breadth above (preliminary) French and English. On the other hand, the VM Herbal B section rivals (preliminary) Slavic for second place after Latin. Quire 13 and the other VM sections, whether in Currier language A or B, exist on opposite ends of the TTR spectrum. They allow languages as diverse as German, Celtic, Italian and Greek in between them.

The next graph illustrates this differently. It contains normalized TTR data for all measured window sizes. It shows the median values for German (blue), Latin (red) and English (orange), as well as Quire 13 (light green) and the rest of the VM without Q13 (dark green).


What probably stands out most is how the VM graphs don’t entirely behave like a regular language. If you read the graph from right to left, you’ll notice that they slump between 50 and 10, but really plummet for lower windows. In fact, VM values in lower windows deviate so much that in the normalized dataset all other values are compressed. Between 50 and 1000, the right-hand side of the graph, there is no such drastic variation.

Both green VM curves have a similar shape. They both escape the pull of extreme small-window repetition at around 50, where they peak. Between 50 and 1000, the small-scale effect is lost and they proceed parallel to the German line (blue).

However, they do so at drastically different levels. At its ~50 peak, the VM-without-Q13 line approaches Latin, the highest value of the set. Meanwhile, at the same 50-word window, Quire 13 is still between English and German, two bottom languages. In fact, at the 1000-word window, Quire 13’s TTR becomes even lower than English.

On the one hand, Quire 13 performs exactly the same weird tricks as all other VM sections. This frequent doubling or even tripling of words is one of the main arguments against the possibility that Voynichese is a normal text obscured by simple substitution. But on the other hand, vocabulary in Quire 13 as a whole is significantly less diverse than in the rest of the VM, while other individual sections stick together.

What does this all mean?

For TTR values to be reliable, we need two conditions to be true:

  • The VM text has some form of real language at its base. This could mean it’s a new creation, or that it was “generated” from a real text through some encoding mechanism, simple or complex, conservative or destructive.
  • Unique words in the source text or language broadly still correspond to unique “words” in Voynichese. If “apple” and “dog” both become [daiin], TTR is useless. But if “apple” becomes [dain] and “dog” [daiin], there is no problem. This is, in fact, a huge advantage of TTR as a method of comparison. Even if “apple” becomes the most horrendous string of characters in Voynichese, this doesn’t matter as long as this happens consistently.
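The second condition is easy to check with a toy example. The Python sketch below uses an invented sentence and two invented substitution tables (only [dain] and [daiin] come from the example above; the other "Voynichese" strings are arbitrary placeholders). A one-to-one table leaves TTR untouched, while a table that merges "apple" and "dog" lowers it.

```python
def ttr(tokens):
    """Plain type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)


source = "the dog ate the apple and the dog slept".split()

# Hypothetical one-to-one table: every source word gets its own string.
one_to_one = {
    "the": "t1", "dog": "daiin", "ate": "t2",
    "apple": "dain", "and": "t3", "slept": "t4",
}
# Hypothetical many-to-one table: "apple" and "dog" both become [daiin].
collapsing = dict(one_to_one, apple="daiin")

consistent = [one_to_one[word] for word in source]
merged = [collapsing[word] for word in source]

print(ttr(source) == ttr(consistent))  # True: 6 types over 9 tokens in both
print(ttr(merged) < ttr(source))       # True: only 5 types are left
```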

So let’s assume those conditions to be true. In that case, TTR suggests two possible solutions for Quire 13’s deviant values:

  1. Quire 13 is in a different language than the rest of the VM. Not different like English and German, but more like English and Latin – different language families.
  2. Quire 13 is a very different text type than the rest of the VM. You might say “of course, it’s not about plants.” But subject alone is not enough to explain this huge shift. The difference is like that between a varied prose text and a repetitive hymn.

It should be possible to eliminate the first option by comparing the actual vocabulary. If Quire 13 uses lots of vocabulary that’s absent from the rest of the MS, this might point towards a different language family. However, my guess is that we’re looking at option two, since the vocabulary of Quire 13 is not so different from that of the rest of the MS.
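One simple way to run that vocabulary comparison is to count which share of a section's word types never occurs elsewhere. The sketch below is only an illustration under my own assumptions: the function name `vocabulary_overlap` is hypothetical, the token lists are toy placeholders rather than transcription data, and a real test would need a threshold for what counts as "lots".

```python
def vocabulary_overlap(section, rest):
    """Return the share of word types in `section` that are absent from
    `rest`, together with those section-only types."""
    section_types = set(section)
    if not section_types:
        raise ValueError("section is empty")
    only_here = section_types - set(rest)
    return len(only_here) / len(section_types), sorted(only_here)


# Toy token lists standing in for Quire 13 and the rest of the MS.
section = "a b c a d".split()
rest = "a b e f".split()

share, words = vocabulary_overlap(section, rest)
print(share, words)  # 0.5 ['c', 'd']
```

A share close to zero would fit option two: same vocabulary, used more repetitively.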

As for the frequent reduplication of words, I’m obviously not the first one to write about this. Based on what I’ve seen in TTR statistics, I would think small-window effects are a phenomenon isolated from the text as a whole. Perhaps words were sometimes duplicated to pad a paragraph or to confuse a would-be decipherer, but that is extremely speculative.

I plan on further expanding the corpus to create some “clouds” for other languages. Medieval Slavic must certainly be expanded to at least two dozen texts. Italian and French are other targets. And at a later stage, non-European languages. If you know of any pages where I can gather texts in these or other under-represented languages, do let me know. [2]

To be continued…




As usual, I owe thanks to the competent forum members who helped me out and especially nablator, whose Java code makes these calculations possible.

[1] Voynichese likes to repeat words within 5-10 word windows. I wondered whether this reduplication impacts larger-window TTR. If a lot of duplicate words appear, one would expect MATTR values to go down no matter the window size. I noticed, however, that this is not the case. Bringing m5 values to a normal level by deleting words that repeat within a five-word distance hardly affects m500.
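The test in note [1] can be sketched in Python as follows. This is my own reconstruction, not the code that produced the results: I assume a word is deleted when the same word occurs among the five tokens before it in the original sequence, and the names `drop_near_repeats` and `mattr` are hypothetical.

```python
def mattr(tokens, window):
    """Mean type-token ratio over every full window (simple, unoptimized)."""
    n_windows = len(tokens) - window + 1
    if window < 1 or n_windows < 1:
        raise ValueError("need at least one full window")
    ratios = (len(set(tokens[i:i + window])) / window for i in range(n_windows))
    return sum(ratios) / n_windows


def drop_near_repeats(tokens, distance=5):
    """Delete every token that already occurs among the `distance`
    tokens immediately before it in the original sequence."""
    kept = []
    for i, token in enumerate(tokens):
        if token not in tokens[max(0, i - distance):i]:
            kept.append(token)
    return kept


# Toy sequence: the last "a" survives because it is more than five
# tokens away from the previous one.
toy = "a a b a c d e f g a".split()
print(drop_near_repeats(toy))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'a']

# On a real token list, compare the small and large windows before and after:
# cleaned = drop_near_repeats(tokens)
# print(mattr(tokens, 5), mattr(cleaned, 5))
# print(mattr(tokens, 500), mattr(cleaned, 500))
```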

[2] Keep in mind that from a TTR perspective, there is not much difference between closely related languages. I could easily expand the Dutch corpus but this would just thicken the English-German Germanic cloud.