In the previous post, I introduced the concept of Type-Token Ratio (TTR) and the more advanced Moving-Average Type-Token Ratio (MATTR) as a possible method to gain more insight into the text of the Voynich manuscript (VM). I strongly recommend reading that post first, especially if you are new to these concepts. In the current post, I will assume that TTR and MATTR are understood.

Type-Token Ratio

After my first exploration of the subject, I took some time to learn better ways to process and visualize the data. Since I did not have much experience working with text statistics, I asked for help over at the Voynich.ninja forum and many people were kind enough to assist. I especially want to thank Marco Ponzi and Rene Zandbergen for their help with calculations and graphs, and for their general input and interest in the project.

Current goal

What I want to do now is to find texts which match the vocabulary density of Voynichese as closely as possible. I have no idea what I’m looking for and expect that it will be difficult to find anything with Voynichese’s strange statistics, so this will be like looking for the proverbial needle in a haystack, without even knowing whether there is a needle to begin with.

Two tests

After experimenting for a while, I decided on two tests to determine whether or not a text is similar to the VM in its vocabulary density and distribution.

1. m500-test

The first and most straightforward test, which I will call the m500-test, measures the average type-token ratio over 500-word windows. In the literature (see post Type-Token Ratio) 500-word windows are suggested as a standard. After some first tests on the forum, we found out that the m500 value is noticeably influenced by two factors: language and text type.
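As a rough sketch, an m500 value could be computed along these lines in Python. The whitespace tokenization, the lowercasing and the function names `mattr` and `m500` are assumptions made for illustration; they are not necessarily how the values in this post were produced.

```python
from collections import Counter


def mattr(tokens, window):
    """Moving-average type-token ratio.

    Slides a window of `window` tokens over the text one token at a time
    and returns the mean of (distinct types / window) over all positions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")

    counts = Counter(tokens[:window])
    total_types = len(counts)  # distinct types in the first window
    n_windows = 1

    for i in range(window, len(tokens)):
        # Drop the token that leaves the window on the left.
        outgoing = tokens[i - window]
        counts[outgoing] -= 1
        if counts[outgoing] == 0:
            del counts[outgoing]
        # Add the token that enters the window on the right.
        counts[tokens[i]] += 1
        total_types += len(counts)
        n_windows += 1

    return total_types / (n_windows * window)


def m500(text):
    """m500 value of a text, assuming words are separated by whitespace."""
    return mattr(text.lower().split(), 500)
```

Calling `m500(open("sample.txt", encoding="utf-8").read())` on a plain-text file (the file name is hypothetical) would return a value between 0 and 1, where higher means a more varied vocabulary.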

If you use a parallel corpus (the same text in different languages), you will notice that the values nicely reflect the languages’ inflectional variety. The graph below compares m500 values for the first 30 chapters of Genesis for a variety of languages:

genesis.gif

The more noun forms (cases) and verb forms a language has, the higher its TTR value tends to be. This is why Latin scores high, English low and German in between.

Unfortunately though, we don’t know what the VM text is, so we can’t use a parallel corpus; we will have to include a variety of text types, and this complicates matters. The Genesis text from the example is in itself very repetitive (relatively low TTR values) regardless of the language. In the graph below, I added a sample of Ovid’s Latin to the Genesis graph. Notice that its vocabulary is much richer than that of the Latin Genesis sample.

ovidadded

So language and text type together determine the TTR value of a text. This is why we need to combine the m500 test with a second one.

2. m5/20-test

Something I noticed in the initial post is that Voynichese starts behaving strangely at very small windows (5 words). Down to a certain window size it ranks among the more vocabulary-rich texts, but over 5-word windows it shifts to the more repetitive group. In other words, Voynichese repeats identical words within 5-word stretches to an abnormal degree. This is not a new observation, and is in fact often used as an argument that Voynichese is not a real language.

Initial tests suggested that Voynichese shifts most abruptly somewhere between the 20-word and 5-word windows. This means that if we compare these two values for the selected texts, we can try to find one that behaves precisely like Voynichese.

Together, the two tests should paint a more sophisticated picture than a single value. They show a text’s overall vocabulary density (for which the 500-word window should be a good indication) as well as its behavior in smaller windows.
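A minimal sketch of the m5/20 test, assuming it boils down to computing the moving-average TTR twice per text, once with a 5-word window and once with a 20-word window. The helper names and the toy word lists are made up for illustration.

```python
def mattr(tokens, window):
    """Mean type-token ratio over all windows of `window` consecutive tokens."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    types = sum(len(set(tokens[i:i + window])) for i in range(n_windows))
    return types / (n_windows * window)


def m5_m20(tokens):
    """Return the (m5, m20) pair used to place a text on the scatter plot."""
    return mattr(tokens, 5), mattr(tokens, 20)


# Toy text in which every word is immediately repeated once (illustrative only).
doubled = [f"w{i // 2}" for i in range(40)]
# Toy text without any repetition at all.
varied = [f"w{i}" for i in range(40)]

print(m5_m20(doubled))
print(m5_m20(varied))
```

The doubled toy text scores lower than the varied one on both windows. A text of interest for this post would be one that stays high on m20 while dropping on m5.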

First stage: variety of languages and texts

At first, I gathered random Medieval and Classical texts in various languages, based on what I could find and suggestions people gave me. Since I’d need a large amount of text, availability in a copy-pasteable format was a major factor. A second requirement was that texts needed to be at the very least a few thousand words long (10,000 or more is a plus).

Initially, I gathered the following texts (or the first ca. 10,000 words thereof):

  1. Chaucer, Canterbury Tales (fragment 1)
  2. Boetius (480-525), Peri hermeneias liber Aristotelis latine versus
  3. Pliny’s Natural History
  4. Van den Vos Reynaerde (Middle Dutch)
  5. Maerlant’s Der naturen bloeme (Middle Dutch)
  6. Iliad (ancient Greek)
  7. Machiavelli, Il Principe
  8. Dante, Inferno
  9. Roman de la Rose
  10. Ovid’s Metamorphoses
  11. Caesar’s De bello Gallico
  12. Legenda Aurea
  13. Manilius
  14. Navigatio Brendani (c.800)
  15. Cronica Catalan
  16. Vita Caroli German
  17. Vita Caroli Czech
  18. Vita Caroli Latin
  19. Chronica Boemorum German
  20. Chronica Boemorum Czech
  21. Chronica Boemorum Latin
  22. Haytonus Armenus (-c.1310), Flos historiarum terrae orientis
  23. Thomas Morus, Utopia (1518)
  24. Ps-Galenus, Ad Glauconem liber tertius
  25. Alexander Nequam (1157-1217), Tractatus super mulierem fortem
  26. ვისრამიანი, 12th century Middle Georgian epic (chapter 1-20)
  27. Amirandar, 12th century Middle Georgian romance (chapter 1-10)
  28. Codex Wormianus, Old Icelandic
  29. 14th century Welsh medical text
  30. 14th century Welsh natural history
  31. Old Church Slavonic
  32. The Prussian Enchiridion (1561)
  33. Aelfric Old Testament (Old English)
  34. Veldeke, Sente Servas
  35. Bestiary (Middle English)
  36. Orpheus (Scottish Middle English)
  37. The Equatorie of the Planetis (Middle English)
  38. Br̥hajjātakam (Sanskrit), version with more spaces
  39. Br̥hajjātakam (Sanskrit), version with fewer spaces
  40. A short fragment of Bactrian
  41. Classical Armenian NT (first chapters)
  42. Classical Armenian Hagiography
  43. Vita Constantini in Old Church Slavonic
  44. Zadonshchina, Old Russian

I also tried with medieval Hebrew and Arabic, but had difficulties processing these scripts and (given my complete ignorance of the script and languages) assessing the quality of the transcription.

One thing to understand is that similar languages (like the various versions of Spanish) will score similarly as long as there are no huge grammatical differences. This is why I color-coded by language family rather than individual languages.

500test

In the above graph, the pale-green bars represent various VM measurements: different sections and transcriptions. This is a topic worth its own post, but for now I just tried to establish a range of VM-relevant values, trying to capture both extremes: a transcription which results in many identical words on the low end, and one which encodes as much variation as possible on the other.

In this random sample, the VM values occupy the middle to middle-high spots. The Voynich text tends to be more varied in its vocabulary than medieval Germanic (blue) and Romance (orange) languages. Latin (red) tends to score higher than Voynichese. An important note here is that medieval Latin varies strongly, producing texts all over the spectrum. Since Voynichese sits on average above Germanic and Romance but below Latin, it matches relatively well with the Slavic language group.

To put it very simply: if you take a random 500-word block of Voynichese text, this block will contain more unique words than a 500-word block of German (or French, Dutch, Italian…) text, but fewer than a classical Greek or Latin text.

The following texts fell entirely within Voynichese limits for the 500-test:

A first conclusion we can draw is that over large text parts, Voynichese vocabulary density is completely normal. It matches best with languages that have slightly more morphological forms than Medieval Germanic languages. In this limited test, Slavic languages and the mid-lower ends of Medieval Latin scored best.

However, the 500-window gives us just one value. Using MATTR, it is also possible to find out whether vocabulary repeats over short or long distances. It is important to involve smaller windows in this study, since the repetition of words over short distances is precisely one of the arguments often used to demonstrate the non-linguistic nature of Voynichese. And indeed, as I explained earlier, initial tests suggested that Voynichese values for 5-word windows were unexpectedly low (i.e. repetition happens more within 5-word windows than expected).

I went with Marco’s suggestion of using a scatter plot to visualize this phenomenon:

Naamloos-2 kopiëren

Notice how the pale green Voynich values are located to the left of the trend line. If their m5 values were in line with the other values, the Voynichese dots would be at the same height, but shifted to the right. The Voynichese values occupy their own section of the graph and no other text comes close.
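One way to put a number on "left of the trend line" is to measure how far a point sits from the line along the m5 axis. The sketch below assumes the trend line is an ordinary least-squares fit of m20 on m5, which may not be how the line in the graph was drawn. The reference points are made-up values for illustration, not measurements from this post.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def horizontal_offset(m5, m20, slope, intercept):
    """Distance of a point from the line along the m5 axis.

    Negative means the point lies left of the line: its m5 is lower
    than the line predicts for its m20.
    """
    return m5 - (m20 - intercept) / slope


# Hypothetical reference texts as (m5, m20) pairs; illustrative only.
reference = [(0.90, 0.80), (0.92, 0.84), (0.94, 0.88), (0.96, 0.92)]
slope, intercept = fit_line([p[0] for p in reference],
                            [p[1] for p in reference])

# A hypothetical text with a high m20 but a modest m5.
offset = horizontal_offset(0.91, 0.88, slope, intercept)
print(round(offset, 3))
```

A clearly negative offset, compared with the spread of offsets among the reference texts, would correspond to a dot sitting to the left of the main cluster.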

There are outliers in various directions though. So if we used enough texts, could we find one with the same values as Voynichese? What is required to replicate these results? Or does such a text simply not exist?

More texts

The first suitable corpus I found was a collection of around 300 Medieval Greek texts. They were in a very convenient format, so I included all texts of sufficient size. I also collected about 70 Medieval Latin texts, again with size as the only criterion. And finally, about 45 Medieval German texts were collected in the same fashion.

At this point the reader might wonder “and why not [this or that language or text]?” and the answer is that all of this takes a lot of time and effort and I had to draw the line somewhere for this post. Of course, given the preliminary results, it would make sense to later include a truckload of Medieval Slavic texts too, as well as explore other languages. But for now, I did Greek, Latin and German.

Greek

Sorting all TTR values for 500-word windows from low to high, the various VM measurements occupy the following spots (out of 312): 182, 234, 263, 268, 279, 298. Their averaged position is 254/312. This means that roughly 80% of the examined medieval Greek texts have a lower vocabulary density than Voynichese. So if the VM were a Greek text, it would belong to the 20% with the most diverse vocabulary.
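The arithmetic behind these figures can be checked in a few lines of Python. The variable names are made up, and the sketch assumes that position 1 is the text with the lowest m500 value.

```python
# Positions of the six VM measurements in the sorted Greek list (from the text).
vm_ranks = [182, 234, 263, 268, 279, 298]
total = 312

mean_rank = sum(vm_ranks) / len(vm_ranks)
share_at_or_below = mean_rank / total

print(mean_rank, round(share_at_or_below, 3))  # 254.0 0.814
```

So the averaged position sits at about 81% of the way up the list, which is where the rounded "80% lower, top 20%" summary comes from.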

The 5/20 test, however, tells a different story:

Greek

We see a large bunch of hits around the trend line, with various clear outliers. Two or three dots are in the region of Voynichese. On the graph it is clear how Voynichese still behaves as expected on the vertical axis, ranking among the top quarter of Greek hits. On the horizontal axis (m5) it’s only around the centre of the main cluster. This exercise does provide us with a few potentially interesting texts, but I’ll get to those later.

Latin

For the 500-test, Voynichese is situated in the lower half of Medieval Latin:

lat500.jpg

This is the result of Latin’s 5/20 test:

Naamloos-3 kopiëren

The Latin results stick close to the trend line in the rich-vocabulary section at the top right, but are more unpredictable in the lower values. This can be explained by the fact that this is Medieval Latin, which was used by people of all different backgrounds, nationalities and education levels, and for many different purposes. Texts produced in Medieval Latin are highly variable. Voynichese is surrounded by some dots, and there are even two cases where Latin and Voynichese overlap. More on those later.

German

Comparing Voynichese to German texts for its m500 value gives pretty conclusive results. I found it amusing that one of the two closest matches was the Willehalm, since I’ve written about its imagery before.

german

Naamloos-5 kopiëren

The German results illustrate well what the problem with Voynichese is. On the vertical axis (m20) Voynichese ranks among the richest German texts (as in the m500 results), but on the horizontal axis (m5) it sinks to the lower half. I was not able to find a German text that matches Voynichese behavior, though one might exist.

Let me put this another way: if you take a block of anything between 20 and 500 words of Voynichese, its vocabulary will appear rich and diverse compared to a similar block of German. But Voynichese blocks of five words or less are on average as repetitive as a relatively poor German text.

Best Matches

In the Greek and especially the Latin corpus, a few texts did appear to pass the critical 5/20 test. Let’s have a look at these matches in isolation.

The three Latin candidates are all collections of poems. One by Arrigo da Settimello, a 12th century Italian writer. A second by Walter Mapes, a 12th century English writer. The third was a number of Medieval poems I gathered myself as an experiment – they were too short to serve my purpose individually.

I measured their type-token ratios for windows of 5, 10, 20, 50, 100, 500 and 1000 words. These values were then normalized using specific formulas provided by Rene Zandbergen, and plotted on a logarithmic scale. Let’s compare their evolution to that of three different VM transcriptions:
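The raw values behind such a line graph could be gathered roughly as follows. This sketch only computes the unnormalized moving-average TTR per window size. The normalization formulas are not given in this post, so they are deliberately left out, and the function names are assumptions.

```python
WINDOWS = (5, 10, 20, 50, 100, 500, 1000)


def mattr(tokens, window):
    """Mean type-token ratio over all windows of `window` consecutive tokens."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    types = sum(len(set(tokens[i:i + window])) for i in range(n_windows))
    return types / (n_windows * window)


def ttr_profile(tokens, windows=WINDOWS):
    """Map each window size to its MATTR value.

    Windows longer than the text are skipped instead of raising an error.
    """
    return {w: mattr(tokens, w) for w in windows if w <= len(tokens)}


# Toy input: two alternating words, 60 tokens in total (illustrative only).
profile = ttr_profile(["a", "b"] * 30)
for window, value in profile.items():
    print(window, round(value, 3))
```

Plotting one such profile per text, with the window size on a logarithmic horizontal axis, gives one line per text. The values from this sketch would still need the normalization step before they are comparable to the graph.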

latlines

As detected by the 5/20 test, the lines bundle nicely on the left (smaller windows). However, they diverge as the window gets larger. The three Latin texts stay together, and reach significantly higher values than Voynichese for the larger windows. This can be explained easily by the fact that these are collections of different poems. The overall vocabulary is varied, but within one poem there are patterns of repetition.

Even though I was not able to find a Latin text that completely behaves like Voynichese, it might be of interest that so far I’ve only been able to find Voynichese’s short-window behavior paralleled in poetry/song. Make of that what you wish.

Now, the same for the two best Greek texts. These are precisely the same values for Voynichese, but with two Greek lines (green and orange) instead of the Latin ones.

Greeklines

The difference with the previous graph is remarkable. The Greek lines zig-zag a bit, but they generally stay within Voynichese limits and never deviate much. Especially the orange line, Hymn 15, behaves quite nicely.

Unfortunately I don’t read Greek, and I have not been able to study these texts well yet. They are Christian religious hymns, so again in the realm of poetry and song. They are full of words that repeat over short distances, as well as semi-repetition:

τω μεν ποταμώ τω βήματι προσεγγίζων,
τω δε Προδρόμω
το φως το απρόσιτον.

Conclusion

One of the most fundamental questions we may ask ourselves about the text in the Voynich manuscript is whether or not it is linguistic. Are its words words, its spaces spaces, its language language? If we were able to decipher it, would it contain a text (a story, scientific information, instructions…) or something more abstract like a set of coordinates?

One of several issues with the text-as-text is its large amount of short-distance repetition. I have attempted to find a number of known texts which behave the same from the perspective of type-token ratio, in an attempt to gain a better understanding of this phenomenon. The preliminary results are that yes, it is possible to find similar texts, but not (yet) in prose. The best short-window matches in Latin and Greek were all in genres of poetry and song.

Looking at TTR more generally, it appears that Voynichese’s vocabulary is more varied than that of many medieval European vernacular languages. It does find matches in medieval Greek texts (where it is among the richest ones) and medieval Latin (where it would rank among the poorer half).

Finally, I’d like to add that this research is far from finished, and it raises several questions.

  • What makes the Greek hymns’ TTR so like Voynichese?
  • Slavic languages make a natural match for Voynichese’s large-window TTR. If I assemble a larger corpus, will I be able to find more overall hits? Will these also be in poetry, or in texts more generally?
  • Which other languages need to be examined?
  • What can we learn about the various Voynichese sections using the MATTR technique?

And so on. But this is enough for this post.