In the previous post, I introduced the concept of Type-Token Ratio (TTR) and the more advanced Moving-Average Type-Token Ratio (MATTR) as a possible method to gain more insight into the text of the Voynich manuscript (VM). I strongly recommend reading that post first, especially if you are new to these concepts. In the current post, I will assume that TTR and MATTR are understood.
After my first exploration of the subject, I took some time to learn better ways to process and visualize the data. Since I did not have much experience working with text statistics, I asked for help over at the Voynich.ninja forum and many people were kind enough to assist. I especially want to thank Marco Ponzi and Rene Zandbergen for their help with calculations and graphs, and their general input and interest in the project.
What I want to do now is to find texts which match the vocabulary density of Voynichese as closely as possible. I have no idea what I’m looking for and expect that it will be difficult to find anything with Voynichese’s strange statistics, so this will be like looking for the proverbial needle in a haystack, without even knowing whether there is a needle to begin with.
After experimenting for a while, I decided on two tests to determine whether or not a text is similar to the VM in its vocabulary density and distribution.
The first and most straightforward test, which I will call the m500-test, measures the average type-token ratio over 500-word windows. In the literature (see post Type-Token Ratio) 500-word windows are suggested as a standard. After some first tests on the forum, we found out that the m500 value is noticeably influenced by two factors: language and text type.
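For readers who want to try this themselves, here is a minimal Python sketch of the measurement. It is not the software I used. The function name `mattr` and the plain whitespace tokenization are assumptions made for illustration. It slides a window over the text one word at a time and averages the type-token ratio of every window.

```python
from collections import Counter

def mattr(tokens, window):
    """Moving-average type-token ratio: mean TTR over every window of `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    counts = Counter(tokens[:window])   # word counts inside the current window
    total = len(counts) / window        # TTR of the first window
    n_windows = 1
    for i in range(window, len(tokens)):
        # Slide one word to the right: drop the oldest word, add the newest.
        old = tokens[i - window]
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]
        counts[tokens[i]] += 1
        total += len(counts) / window
        n_windows += 1
    return total / n_windows

# Toy example (assumed tokenization: split on whitespace).
words = "a b a c a b d a".split()
print(mattr(words, 4))  # five windows with 3, 3, 3, 4 and 3 types -> 0.8
```

For the m500 value, the same function would be called with `window=500` on the full word list of a text.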
If you use a parallel corpus (the same text in different languages), you will notice that the values nicely reflect the languages’ inflectional variety. The graph below compares m500 values for the first 30 chapters of Genesis for a variety of languages:
The more noun forms (cases) and verb forms a language has, the higher its TTR value tends to be. This is why Latin scores high, English low and German in between.
Unfortunately though, we don’t know what the VM text is, so we can’t use a parallel corpus; we will have to include a variety of text types, and this complicates matters. The Genesis text from the example is in itself very repetitive (relatively low TTR values) regardless of the language. In the graph below, I added a sample of Ovid’s Latin to the Genesis graph. Notice that its vocabulary is much richer than that of the Latin Genesis sample.
So language and text type together determine the TTR value of a text. This is why we need to combine the m500 test with a second one.
Something I noticed in the initial post is that Voynichese starts behaving strangely at very small windows (5 words). Down to a certain point it ranks among the more vocabulary-rich texts, but over 5-word windows it shifts to the more repetitive group. In other words, Voynichese likes to repeat identical words within 5-word stretches to an abnormal degree. This is not a new observation, and is in fact often used as an argument in support of Voynichese not being a real language.
Initial tests suggested that Voynichese shifts most abruptly somewhere between the 20-word and the 5-word window. This means that if we compare these two values (m20 and m5) for the selected texts, we can try to find one that behaves precisely like Voynichese. I will call this the 5/20 test.
Both tests together should paint a more sophisticated picture than a single value. We will get to know a text's overall vocabulary density (for which the 500-word window should be a good indication) but also its behavior in smaller windows.
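To make the idea of comparing the 5-word and 20-word windows concrete, here is a small Python sketch. The helper names `mattr` and `five_twenty` are my own, and the two "texts" are synthetic toys, not real data. One cycles through ten words, the other doubles every word. They end up with almost the same m20 but very different m5, which is the kind of shift described above.

```python
def mattr(tokens, window):
    """Mean type-token ratio over all windows of `window` consecutive words."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    return sum(len(set(tokens[i:i + window])) / window
               for i in range(n_windows)) / n_windows

def five_twenty(tokens):
    """Return the (m5, m20) pair: MATTR over 5-word and 20-word windows."""
    return mattr(tokens, 5), mattr(tokens, 20)

# Synthetic toy "texts" (not real data), 200 words each:
cycling = [f"w{i % 10}" for i in range(200)]         # w0 w1 ... w9 w0 w1 ...
doubled = [f"w{(i // 2) % 20}" for i in range(200)]  # w0 w0 w1 w1 w2 w2 ...

for name, text in (("cycling", cycling), ("doubled", doubled)):
    m5, m20 = five_twenty(text)
    print(f"{name}: m5={m5:.3f} m20={m20:.3f}")
# cycling: m5=1.000 m20=0.500
# doubled: m5=0.600 m20=0.525
```

The doubled text would sit at roughly the same height as the cycling one in an m5/m20 plot, but far to the left.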
First stage: variety of languages and texts
At first, I gathered random Medieval and Classical texts in various languages, based on what I could find and suggestions people gave me. Since I’d need a large amount of text, availability in a copy-pasteable format was a major factor. A second requirement was that texts needed to be at the very least a few thousand words long (10,000 or more is a plus).
Initially, I gathered the following texts (or the first ca. 10,000 words thereof):
- Chaucer, Canterbury Tales (fragment 1)
- Boetius (480-525), Peri hermeneias liber Aristotelis latine versus
- Pliny’s Natural History
- Van den Vos Reynaerde (Middle Dutch)
- Maerlant’s Der naturen bloeme (Middle Dutch)
- Iliad (ancient Greek)
- Machiavelli, Il Principe
- Dante, Inferno
- Roman de la Rose
- Ovid’s Metamorphoses
- Caesar’s De bello Gallico
- Legenda Aurea
- Navigatio Brendani (c.800)
- Cronica Catalan
- Vita Caroli German
- Vita Caroli Czech
- Vita Caroli Latin
- Chronica Boemorum German
- Chronica Boemorum Czech
- Chronica Boemorum Latin
- Haytonus Armenus (-c.1310), Flos historiarum terrae orientis
- Thomas Morus, Utopia (1518)
- Ps-Galenus, Ad Glauconem liber tertius
- Alexander Nequam (1157-1217), Tractatus super mulierem fortem
- ვისრამიანი, 12th century Middle Georgian epic (chapter 1-20)
- Amirandar, 12th century Middle Georgian romance (chapter 1-10)
- Codex Wormianus, Old Icelandic
- 14th century Welsh medical text
- 14th century Welsh natural history
- Old Church Slavonic
- The Prussian Enchiridion (1561)
- Aelfric Old Testament (Old English)
- Veldeke, Sente Servas
- Bestiary (Middle English)
- Orpheus (Scottish Middle English)
- The Equatorie of the Planetis (Middle English)
- Br̥hajjātakam (Sanskrit) more spaces
- Br̥hajjātakam (Sanskrit) fewer spaces
- A short fragment of Bactrian
- Classical Armenian NT (first chapters)
- Classical Armenian Hagiography
- Vita Constantini in Old Church Slavonic
- Zadonshchina, Old Russian
I also tried medieval Hebrew and Arabic, but had difficulties processing these scripts and (given my complete ignorance of the scripts and languages) assessing the quality of the transcriptions.
One thing to understand is that similar languages (like the various versions of Spanish) will score similarly as long as there are no huge grammatical differences. This is why I color-coded by language family rather than individual languages.
In the above graph, the pale-green bars represent various VM measurements: different sections and transcriptions. This is a topic worth its own post, but for now I just tried to establish a range of VM-relevant values, trying to capture both extremes: a transcription which results in many identical words on the low end, and one which encodes as much variation as possible on the other.
In this random sample, the VM values occupy the middle to middle-high spots. The Voynich text tends to be more varied in its vocabulary than medieval Germanic (blue) and Romance (orange) languages. Latin (red) tends to score higher than Voynichese. An important note here is that medieval Latin varies strongly, producing texts all over the spectrum. Since Voynichese sits on average above Germanic and Romance but below Latin, it matches relatively well with the Slavic language group.
To put it very simply: if you take a random 500-word block of Voynichese text, this block will contain more unique words than a 500-word block of German (or French, Dutch, Italian…) text, but fewer than a classical Greek or Latin text.
The following texts fell entirely within Voynichese limits for the m500-test:
- Vita Constantini in Old Church Slavonic
- Chronica Boemorum in Medieval Czech
- Vita Caroli in Medieval Latin
- Zadonshchina, Old Russian
- The Navigatio Brendani in Medieval Latin
- Haytonus Armenus in Medieval Latin
- An Armenian Hagiography
- The Legenda Aurea in Medieval Latin
A first conclusion we can draw is that over large text parts, Voynichese vocabulary density is completely normal. It matches best with languages that have slightly more morphological forms than Medieval Germanic languages. In this limited test, Slavic languages and the mid-lower ends of Medieval Latin scored best.
However, the 500-window gives us just one value. Using MATTR, it is also possible to find out whether vocabulary repeats over short or long distances. It is important to involve smaller windows in this study, since the repetition of words over short distances is precisely one of the arguments often used to demonstrate the non-linguistic nature of Voynichese. And indeed, as I explained earlier, initial tests suggested that Voynichese values for 5-word windows were unexpectedly low (i.e. repetition happens more within 5-word windows than expected).
I went with Marco’s suggestion of using a scatter plot to visualize this phenomenon:
Notice how the pale green Voynich values are located to the left of the trend line. If their m5 values were in line with the other texts, the Voynichese dots would be at the same height, but shifted to the right. The Voynichese values occupy their own section of the graph and no other text comes close.
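"Left of the trend line" can also be put into a number. The sketch below is one possible way to do it, assuming the trend line is an ordinary least-squares fit of m20 on m5 (an assumption, since the plot comes from a spreadsheet). The points are invented purely to illustrate the arithmetic and are not measurements. The function names are hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

def horizontal_offset(point, slope, intercept):
    """Observed m5 minus the m5 the trend line predicts for this m20.

    Negative means the text lies to the left of the line, i.e. it is
    more repetitive in 5-word windows than its m20 would suggest.
    """
    m5, m20 = point
    return m5 - (m20 - intercept) / slope

# Invented (m5, m20) points, for illustration only:
reference = [(0.90, 0.70), (0.92, 0.74), (0.94, 0.78), (0.96, 0.82)]
slope, intercept = fit_line(reference)

candidate = (0.91, 0.80)  # also invented
print(round(horizontal_offset(candidate, slope, intercept), 3))  # -0.04
```

A text whose offset is close to that of the Voynich samples would be a candidate worth a closer look.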
There are outliers in various directions though. So if we used enough texts, could we find one with the same values as Voynichese? What is required to replicate these results? Or does such a text simply not exist?
The first suitable corpus I found was a collection of around 300 Medieval Greek texts. They were in a very convenient format, so I included all texts of sufficient size. I also collected about 70 Medieval Latin texts, again the only criterion being size. And finally, about 45 Medieval German texts were collected in the same fashion.
At this point the reader might wonder “and why not [this or that language or text]?” and the answer is that all of this takes a lot of time and effort and I had to draw the line somewhere for this post. Of course, given the preliminary results, it would make sense to later include a truckload of Medieval Slavic texts too, as well as explore other languages. But for now, I did Greek, Latin and German.
Sorting all TTR values for 500-word windows, the various VM measurements occupy the following spots (out of 312): 182, 234, 263, 268, 279, 298. Their averaged position is 254/312. This means that roughly 80% of the examined medieval Greek texts have a lower vocabulary density than Voynichese. So if the VM were a Greek text, it would belong to the 20% with the most diverse vocabulary.
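The arithmetic behind the averaged position is simple enough to check in a few lines of Python. The only inputs are the six rank positions and the total of 312 given above. The variable names are mine.

```python
vm_ranks = [182, 234, 263, 268, 279, 298]  # positions of the VM measurements
total = 312                                # size of the sorted list

mean_rank = sum(vm_ranks) / len(vm_ranks)
share_at_or_below = mean_rank / total

print(mean_rank)                        # 254.0
print(round(100 * share_at_or_below))   # 81 (percent), i.e. roughly 80%
```

Note that the individual VM measurements range from position 182 to 298, so the average hides a fair amount of spread between transcriptions and sections.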
The 5/20 test, however, tells a different story:
We see a large bunch of hits around the trend line, with various clear outliers. Two or three dots are in the region of Voynichese. On the graph it is clear how Voynichese still behaves as expected on the vertical axis, ranking among the top quarter of Greek hits. On the horizontal axis (m5) it’s only around the centre of the main cluster. This exercise does provide us with a few potentially interesting texts, but I’ll get to those later.
For the m500-test, Voynichese is situated in the lower half of Medieval Latin:
This is the result of Latin’s 5/20 test:
The Latin results stick close to the trend line in the rich-vocabulary section at the top right, but are more unpredictable in the lower values. This can be explained by the fact that this is Medieval Latin, which was used by people of all different backgrounds, nationalities and education levels, and for many different purposes. Texts produced in Medieval Latin are highly variable. Voynichese is surrounded by some dots, and there are even two cases where Latin and Voynichese overlap. More on those later.
Comparing Voynichese to German texts for its m500 value gives pretty conclusive results. I found it amusing that one of the two closest matches was the Willehalm, since I’ve written about its imagery before.
The German results illustrate well what the problem with Voynichese is. On the vertical axis (m20) Voynichese ranks among the richest German texts (as in the m500 results), but on the horizontal axis (m5) it sinks to the lower half. I was not able to find a German text that matches Voynichese behavior, though they might exist.
Let me put this another way; if you take a block of anything between 20 and 500 words of Voynichese, its vocabulary will appear rich and diverse compared to a similar block of German. But Voynichese blocks of five words or less are on average as repetitive as a relatively poor German text.
In the Greek and especially the Latin corpus, a few texts did appear to pass the critical 5/20 test. Let’s have a look at these matches in isolation.
The three Latin candidates are all collections of poems. One by Arrigo da Settimello, a 12th century Italian writer. A second by Walter Mapes, a 12th century English writer. The third was a number of Medieval poems I gathered myself as an experiment – they were too short to serve my purpose individually.
I measured their type-token ratios for windows of 5, 10, 20, 50, 100, 500 and 1000 words. These values were then normalized using specific formulas provided by Rene Zandbergen, and plotted on a logarithmic scale. Let’s compare their evolution to that of three different VM transcriptions:
As detected by the 5/20 test, the lines bundle nicely on the left (smaller windows). However, they diverge as the window gets larger. The three Latin texts stay together, and reach significantly higher values than Voynichese for the larger windows. This can be explained easily by the fact that these are collections of different poems. The overall vocabulary is varied, but within one poem there are patterns of repetition.
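The sweep over window sizes behind such a graph can be sketched as follows. This Python example only computes the raw MATTR values for the seven windows mentioned above. It does not include Rene's normalization, since those formulas are not reproduced in this post. The names `mattr` and `mattr_profile` and the synthetic toy text are assumptions for illustration.

```python
WINDOWS = (5, 10, 20, 50, 100, 500, 1000)

def mattr(tokens, window):
    """Mean type-token ratio over all windows of `window` consecutive words."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    return sum(len(set(tokens[i:i + window])) / window
               for i in range(n_windows)) / n_windows

def mattr_profile(tokens, windows=WINDOWS):
    """Raw (un-normalized) MATTR per window size; windows longer than the text are skipped."""
    return {w: mattr(tokens, w) for w in windows if w <= len(tokens)}

# Synthetic toy text (not real data): 50 different words cycling for 1200 words.
toy = [f"w{i % 50}" for i in range(1200)]
for w, value in mattr_profile(toy).items():
    print(w, round(value, 3))
# Windows up to 50 give 1.0; 100, 500 and 1000 give 0.5, 0.1 and 0.05.
```

Plotting these values against the window size on a logarithmic x-axis gives one line per text, as in the graph.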
Even though I was not able to find a Latin text that completely behaves like Voynichese, it might be of interest that so far I’ve only been able to find Voynichese’s short-window behavior paralleled in poetry/song. Make of that what you wish.
Now, the same for the two best Greek texts. These are precisely the same values for Voynichese, but with two Greek lines (green and orange) instead of the Latin ones.
The difference with the previous graph is remarkable. The Greek lines zig-zag a bit, but they generally stay within Voynichese limits and never deviate much. Especially the orange line, Hymn 15, behaves quite nicely.
Unfortunately I don’t read Greek, and I have not been able to study these texts well yet. They are Christian religious hymns, so again in the realm of poetry and song. They are full of words that repeat over short distances, as well as semi-repetition:
τω μεν ποταμώ τω βήματι προσεγγίζων,
τω δε Προδρόμω
το φως το απρόσιτον.
One of the most fundamental questions we may ask ourselves about the text in the Voynich manuscript is whether or not it is linguistic. Are its words words, its spaces spaces, its language language? If we were able to decipher it, would it contain a text (a story, scientific information, instructions…) or something more abstract like a set of coordinates?
One of several issues with the text-as-text is its large amount of short-distance repetition. I have attempted to find a number of known texts which behave the same from the perspective of type-token ratio, in an attempt to gain a better understanding of this phenomenon. The preliminary results are that yes, it is possible to find similar texts, but not (yet) in prose. The best short-window matches in Latin and Greek were all in genres of poetry and song.
Looking at TTR more generally, it appears that Voynichese’s vocabulary is more varied than that of many medieval European vernacular languages. It does find matches in medieval Greek texts (where it is among the richest ones) and medieval Latin (where it would rank among the poorer half).
Finally, I’d like to add that this research is far from finished, and it raises several questions.
- What makes the Greek hymns’ TTR so like Voynichese?
- Slavic languages make a natural match for Voynichese’s large-window TTR. If I assemble a larger corpus, will I be able to find more overall hits? Will this also be in poetry or more general?
- Which other languages need to be examined?
- What can we learn about the various Voynichese sections using the MATTR technique?
And so on. But this is enough for this post.
Speaking of poems. Some time ago, or to be precise on Jan. 12th, 2017, I offered a hypothesis that folio 49v might represent a thirteen-line rondeau and that the Arabic numerals written vertically might record the pattern of the Rondeau cinquain.
It was just a hypothesis, and quite discardable, but now I wonder about the greater number of repetitions you’d get with poetry and whether those make a substantial difference to the statistics.
As you know it was a common practice to separate the first letter of songs and poems from the rest of the line, and the same occurs on f.66r and f.76r.
Would it be worth it, even for curiosity’s sake, to test just those three folios against verse/lyrics in the various languages?
It may be a stupid question; I’ve no background in historical linguistics.
My next project, before including more languages, is to separate various VM sections and compare their stats. So this is an excellent suggestion. It’s all pretty new so I don’t think any questions are stupid at this point 🙂
For lack of time, I was not able to perform the statistical analysis I’d like to do. However, I can answer your first question (what makes the Greek hymns’ TTR so like Voynichese?) with a good amount of certainty: aliasing. The statistical analysis employed is not “deep” enough to detect the differences among those texts. In the case of the Greek hymn, the strange behaviour happens because its author “abused” the (perfectly fine in Greek) repetition of the definite article when using adjectives or appositions: e.g. το φως το απρόσιτον (literally something like “the light the inaccessible one”), instead of το φως απρόσιτον (“the inaccessible light”). However, Hymn 15 never repeats the same word twice in a row, which is instead a common feature of the VM.
This is apparent from the half-finished (more like half-started, to tell the truth) variance analysis I prepared with various texts, including the first 25 folios of the VM (dubbed Voynich A in the plot), and now Hymn 15. It shows how regular the TTR is along the text:
When the window is small, the VM is the most variable by far. When the window is very small, let’s say below 5 words, the VM is exceptionally variable and a clear outlier. As I said before, Hymn 15 is quite repetitive, but it never uses the same word twice in a row, unlike the Voynich. A few texts occasionally have the same word twice in a row, usually because their authors employed a good old chiasmus with no conjunctions, but this is very rare. The Voynich could sport those strange repetitions because they are numbers (I do not think so), onomatopoeic words, gibberish or I-do-not-know-what.
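One simple way to look at how regular the TTR is along a text is to keep the TTR of every individual window instead of only their average, and then take the variance. The Python sketch below does that. It is an illustration under that assumption and not necessarily the calculation behind the plot. The function names and the two toy word lists are made up.

```python
from statistics import mean, pvariance

def window_ttrs(tokens, window):
    """TTR of each window of `window` consecutive words, in text order."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    return [len(set(tokens[i:i + window])) / window for i in range(n_windows)]

def ttr_mean_and_variance(tokens, window):
    """Mean (this is the MATTR) and population variance of the per-window TTRs."""
    values = window_ttrs(tokens, window)
    return mean(values), pvariance(values)

# Toy word lists (not real data), 40 words each:
steady = "a b c d".split() * 10           # every 4-word window has 4 types
bursty = "a a a a b c d e".split() * 5    # runs of one word, then fresh words

print(ttr_mean_and_variance(steady, 4))   # (1.0, 0.0)
print(ttr_mean_and_variance(bursty, 4))   # lower mean, clearly non-zero variance
```

Two texts can share the same average TTR while one of them swings much more from window to window, which is what a variance measure picks up.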
Thanks Stefano, I agree. Marco ran some tests on the forum and came to the same conclusion; the real holy grail is to find a text which does immediate reduplication as often as Voynichese. This means that, as you suggest, my window of 5 is still too large.
Now theoretically, if I set my window to two or three, I should be able to catch texts which do these things as well. The software I was using has a lower limit of 5, but some helpful forum members have been teaching me how to do this in Java. So I should be able to add a reduplication test in the next posts.
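Why a window of two catches immediate reduplication: a 2-word window has a TTR of 0.5 when both words are identical and 1.0 otherwise. So if r is the share of adjacent word pairs that are identical, the 2-word MATTR is exactly 1 - r/2. The Python sketch below checks this on a made-up token list. The function names are hypothetical, and Python is used here only for brevity.

```python
def reduplication_rate(tokens):
    """Share of adjacent word pairs in which both words are identical."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two words")
    return sum(a == b for a, b in pairs) / len(pairs)

def mattr2(tokens):
    """MATTR with a window of two words, computed directly from the definition."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two words")
    return sum(len({a, b}) / 2 for a, b in pairs) / len(pairs)

toy = "x x y z z z w".split()   # made-up tokens, not a transcription
r = reduplication_rate(toy)
print(r)              # 0.5  (3 of the 6 adjacent pairs are identical)
print(mattr2(toy))    # 0.75
print(1 - r / 2)      # 0.75, the same value
```

In other words, a window of two adds no information beyond a plain count of immediately repeated words, so that count can be used directly.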
Indeed, I agree with Stefano’s observations. Hymn 15 does not feature consecutive reduplication of words, but it features 19 (!) occurrences of a 4-word phrase (the one mentioned by Stefano), while in the much longer VMS phrases that repeat more than twice are basically absent.
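Counting repeated phrases of this kind is straightforward. Here is a minimal Python sketch that lists every 4-word phrase occurring more than twice. The function name `repeated_phrases`, its parameters and the toy input are assumptions for illustration, and this is not the tool used for the counts above.

```python
from collections import Counter

def repeated_phrases(tokens, n=4, min_count=3):
    """Map each n-word phrase that occurs at least `min_count` times to its count."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {" ".join(g): c for g, c in grams.items() if c >= min_count}

# Made-up toy input: the block "p q r s t" three times, then "u v w x".
toy = ("p q r s t " * 3 + "u v w x").split()
print(repeated_phrases(toy))
# {'p q r s': 3, 'q r s t': 3}
```

With `min_count=3` the function reports phrases that repeat more than twice, which is the threshold mentioned above.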
About reduplication, this is a plot based on the Universal Declaration of Human Rights corpus I shared on the ninja forum a few months ago:
The Y axis corresponds to the frequency of the most frequent word-initial character. The plot is relevant for the current subject because the X axis is the average number of exact word reduplications per 1000 characters. Voynich samples are represented by the blue diamonds. The graph illustrates the fact that several (non-European) languages have many more occurrences of reduplication than the VMS. The highlighted circles correspond to:
hms Hmong, Southern Qiandong (China)
auc Waorani (Ecuador)
pam Pampangan (Philippines)
cot Caquinte (Peru)
rar Rarotongan (Polynesia)
plt Malagasy, Plateau (Madagascar)
flm Chin, Falam (South-East Asia)
njo Naga, Ao (North-East India)
gla Gaelic, Scottish
gle Gaelic, Irish
ydd Yiddish, Eastern
eus Basque Euskara
The only European languages in which I could spot some reduplication (about one order of magnitude less frequent than in the VMS) are Basque and some forms of Gaelic: together with the Slavic languages mentioned by Emma, these could be other candidates for further investigation. But looking at languages from other continents will also be interesting.
The UDHR is basically legalese: one can expect that texts written in a different style might have different reduplication values.
While decreasing the MATTR window down to 2 will measure exact reduplication, the TTR is by definition unable to address another impressive feature of the VMS: quasi-reduplication (“semi repetition” in Koen’s words).
qokedy.okedy.qokeedy.okeedy (f84v) will be seen as something perfectly normal, with TTR=100%. Quasi-reduplication is also ignored in my plot, but of course it is possible to define ad-hoc measures to further analyse the phenomenon.
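As an illustration of what such an ad-hoc measure could look like, the Python sketch below counts adjacent word pairs that differ by at most a given number of single-character edits (Levenshtein distance). The threshold and the function names are hypothetical choices, not an established measure. The input is the f84v sequence quoted above.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute, or keep if equal
        prev = cur
    return prev[-1]

def quasi_reduplications(tokens, max_dist=1):
    """Count adjacent pairs that are different words within `max_dist` edits."""
    return sum(a != b and edit_distance(a, b) <= max_dist
               for a, b in zip(tokens, tokens[1:]))

words = "qokedy.okedy.qokeedy.okeedy".split(".")
print(len(set(words)) / len(words))              # 1.0, TTR sees nothing unusual
print(quasi_reduplications(words, max_dist=1))   # 2 of the 3 adjacent pairs
print(quasi_reduplications(words, max_dist=2))   # 3 of the 3 adjacent pairs
```

Exact repeats are excluded on purpose, so this count is complementary to a plain reduplication count.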
BTW, I want to congratulate Koen again for opening this line of investigation. Like other statistical methods, MATTR allows us to focus on some specific features of Voynichese and other languages. There is no single “silver bullet”, but all new angles add to the general picture and increase our understanding of the text.
Indeed, I had no idea beforehand that m5 would still be too large, but this appears to be the case.
I will be experimenting with lower windows and see what comes up.
Apart from that, I still believe the larger windows (m500) might be useful in classifying languages (and thus categorizing Voynichese). See my reply to Emma’s comment.
Excellent work Koen. I think this is well evidenced and argued. The outcomes are reasonable and specific: the Voynich text could be linguistic, but it still isn’t typical (at least for prose). I hope that you can find some corpus material to explore the Slavic languages.
About the Slavic languages, they naturally occupy the spot between Germanic/Romance vernacular on the one hand and Classical Latin on the other because of linguistic properties: the amount of noun cases and so forth. Medieval Latin is all over the place, but there will be other languages that, like Slavic, have their centre of gravity in the same zone.
It would be useful as a preliminary selection to find out which medieval languages have a similar number of noun and verb forms. For Old Church Slavonic for example, this is *a lot*. Although many of them are formally identical, there is still a lot of variation.
I recall reading Stolfi’s response to someone challenging him about some repetitious string, like dain, daiin, qokedy dain (or something of the sort), and he came back immediately with an exactly comparable string from some south-east Asian language, if I recall. If I can, I could try to find that again (or if anyone is in touch with him, they could put the question again). Would that be helpful?
About Marco’s list –
Pre-1438 contact between Mediterranean peoples and Ecuador or Polynesia is not recorded, and I’m not suggesting Marco meant to imply it was. But some readers may not have encountered the documentary and archaeological evidence for pre-da Gama contact between Mediterranean peoples and other regions on his list, such as Southern China, the Philippines, Madagascar, southeast Asia and north-East India. Before the period 1404-1438, Armenian Christians were established in (mod.) Malaysia (which Ptolemy knew as the ‘Golden peninsula’). By the late 13thC/early 14thC a number of Italians were in Guangdong, or rather its foreign traders’ port, Guangzhou. One Sicilian friar, Montecorvino, is now (I hope) pretty well known. And so on. I’m not trying to make a case for any language from Marco’s list, but in terms of the historical record, many are perfectly feasible. (And I’m sorry if this is o.t., Koen.)
I might also mention that the Basque mariners are much neglected in historical studies of the medieval Mediterranean, and especially their role in creating some of our earliest among the ‘new’ sort of cartes marine (often, if wrongly, called “portolan” charts). If you feel it worthwhile, and can find enough material, I’d be very interested to know how Basque might appear in your comparative charts. Sorry to say that I have no means to see or credit Marco’s work on this point.
Diane: you are right, I should certainly include Basque. Even from a purely linguistic standpoint this would be interesting.