This is the third post about moving average type-token ratio (MATTR). If you are not familiar with this concept, I highly advise reading the other installments first:
Type-Token Ratio
Type-Token Ratio II
Language Clouds
Different texts and languages have different profiles over increasing TTR windows. One text repeats much vocabulary within the same paragraph, while another may spread out repeated words evenly throughout the entire text. Therefore, it is useful to compare several values. For this post, I opted for a 50- and 1000-word window. Fifty-word chunks are large enough to be ignorant of Voynichese’s odd small-scale behavior. And 1000 words (after pre-processing) is the minimum requirement for texts to be included in the corpus.
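For readers who want to try this themselves, here is a minimal Python sketch of how the two window values could be computed. It is not the Java code used for this post; the function names `mattr` and `profile` are my own, and it assumes the text has already been pre-processed into a list of word tokens.

```python
from collections import Counter


def mattr(tokens, window):
    """Mean type-token ratio over every run of `window` consecutive tokens."""
    n = len(tokens)
    if window <= 0 or n < window:
        raise ValueError("window must be positive and no longer than the text")
    counts = Counter(tokens[:window])  # word counts inside the current window
    total_types = len(counts)          # running sum of distinct words per window
    for i in range(window, n):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]        # keep len(counts) equal to the type count
        counts[tokens[i]] += 1
        total_types += len(counts)
    return total_types / (window * (n - window + 1))


def profile(tokens, small=50, large=1000):
    """Return the (m50, m1000) pair that places one text in the 2D field."""
    return mattr(tokens, small), mattr(tokens, large)


if __name__ == "__main__":
    # Toy input only: a short sentence repeated until it passes 1000 tokens.
    toy = ("the cat saw the dog and the dog saw the cat " * 200).split()
    m50, m1000 = profile(toy)
    print(f"m50={m50:.3f} m1000={m1000:.3f}")
```

The sliding `Counter` avoids rebuilding a set for every window, which matters little at these sizes but keeps the sketch usable on long texts.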
Measuring two windows for each text allows me to represent the evolving density of vocabulary in a two-dimensional field. The following graph looks chaotic at first, but it illustrates the method well. Each dot represents a medieval text, and each color is a language or language group. The green arrow points at Voynichese.
Some points of interest
- Languages overlap, but they do form clear clouds.
- English (dark blue) sits at low m1000-values and low-to-mid m50. German (light blue) overlaps with English but can go higher. Taking both Germanic languages together, they form a continuum in the lower m1000 regions.
- Latin (red) is at the top of the spectrum. However, as I’ve noted before, medieval Latin was written by people of very different nationalities and backgrounds, so it is represented all over the graph.
- Greek sits somewhat between Latin and Germanic.
- Voynichese dots (green arrow) are no outliers at these window sizes. They blend into the pack.
- Voynichese dots are situated to the right of the trend line. This means that the smaller window (m50) has relatively high TTR values compared to the large window. In the previous posts, we saw that Voynichese values become abnormally low for small windows, making them outliers to the left of the trend line.
This test and previous ones [1] suggest that Voynichese’s tendency to immediately repeat repeat words repeat words does not have a noticeable impact on large-window values. This means that we can safely treat Voynichese’s overall TTR and its small-window patterns of repetition as two separate problems.
In the next figures, I will isolate the various corpora we have collected so far to get a better idea of how Voynichese compares to other medieval texts.
Latin
Let us get Latin out of the way first, since its data stretch the graph on both sides. TTR-wise, medieval Latin is a fickle beast: it’s all over the place. The green Voynichese dots don’t look out of place in the Latin herd.
There are outliers at the bottom, top and left, but those are all Latin texts, not Voynichese. One green dot does drift far away from the rest: Quire 13 has significantly lower TTR values than any other Voynichese subsection in any of the major transcriptions. But more on that later. If Voynichese were a Latin text, it would be among the more repetitive ones.
Greek
Our Voynichese measurements land in the periphery of the Greek cloud, but again don’t look like complete outliers. As opposed to Latin, most Greek texts have a lower TTR than Voynichese. The exception, again, is Quire 13, which appears to fall out of the bottom of the cloud.
German and English
Above, the light blue dots represent medieval German texts, with dark blue for English. Since these are related languages, it is no wonder that they form one cloud. Here, we see a different picture than with Latin and Greek. Most Voynichese measurements are at the high end, while Quire 13 feels right at home in the middle of the Germanic cloud.
Celtic and Slavic
Celtic texts (light grey) appear rather variable. I have collected a dozen medieval Irish texts, and they don’t form a clear cloud like other languages do. These first results suggest that I might better focus on other languages first.
Slavic languages (dark grey / black) on the other hand, do show promise. Three out of five collected texts fall right into Voynichese territory. Collecting more medieval Slavic texts should be a priority in further TTR research so we can build a complete cloud.
French and Italian
Collecting medieval texts, checking for quality and pre-processing takes time, and I have not yet been able to expand a corpus for a Romance language. I spent some time looking for Italian texts, but apart from Dante and some other well-known authors, I haven’t found them yet. Let me know if you can find a collection.
So far, the few entries for Italian (yellow) show that its cloud might get close to the main Voynichese pack. This can only be tested by gathering some two dozen more medieval Italian texts.
The French corpus is also still limited for now. Based on these very provisional data, it looks like French might sit around the Quire 13 value. This also requires expanding.
What’s up with Quire 13?
The following graph shows MATTR values over 500-word windows. Blue bars represent the median value of all texts for a language. Keep in mind that for Slavic, French and Italian these are still limited.
Here, it becomes clear how different Quire 13 is from the other VM text sections. It’s lower than the median German value, and just a hair’s breadth above (preliminary) French and English. On the other hand, the VM Herbal B section rivals (preliminary) Slavic for second place after Latin. Quire 13 and the other VM sections, whether in Currier language A or B, exist on opposite ends of the TTR spectrum. They allow languages as diverse as German, Celtic, Italian and Greek in between them.
The next graph illustrates this differently. It contains normalized TTR data for all measured window sizes, showing the median values for German (blue), Latin (red) and English (orange) as well as Quire 13 (light green) and the rest of the VM without Q13 (dark green).
What probably stands out most is how the VM graphs don’t entirely behave like a regular language. If you read the graph from right to left, you’ll notice that they slump between 50 and 10, but really plummet for lower windows. In fact, VM values in lower windows deviate so much that in the normalized dataset all other values are compressed. Between 50 and 1000, the right-hand side of the graph, there is no such drastic variation.
Both green VM curves have a similar shape. They both escape the pull of extreme small-window repetition at around 50, where they peak. Between 50 and 1000, the small-scale effect is lost and they proceed parallel to the German line (blue).
However, they do so at drastically different levels. At its ~50 peak, the VM-without-Q13 line approaches Latin, the highest value of the set. Meanwhile, at the same 50-word window, Quire 13 is still between English and German, two bottom languages. In fact, at the 1000-word window, Quire 13’s TTR becomes even lower than English.
On the one hand, Quire 13 performs exactly the same weird tricks as all other VM sections. This frequent doubling or even tripling of words is one of the main arguments against the possibility that Voynichese is a normal text obscured by simple substitution. But on the other hand, vocabulary in Quire 13 as a whole is significantly less diverse than in the rest of the VM, while other individual sections stick together.
What does this all mean?
For TTR values to be reliable, we need two conditions to be true:
- The VM text has some form of real language at its base. This could mean it’s a new creation, or that it was “generated” from a real text through some encoding mechanism, simple or complex, conservative or destructive.
- Unique words in the source text or language broadly still correspond to unique “words” in Voynichese. If “apple” and “dog” both become [daiin], TTR is useless. But if “apple” becomes [dain] and “dog” [daiin], there is no problem. This is, in fact, a huge advantage of TTR as a method of comparison. Even if “apple” becomes the most horrendous string of characters in Voynichese, this doesn’t matter as long as this happens consistently.
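The second condition can be illustrated with a small Python sketch. The sentence and both code tables are invented for the example (only [dain] and [daiin] come from the text above); the point is merely that a consistent one-to-one word replacement leaves TTR untouched, while a mapping that merges two source words lowers it.

```python
def ttr(tokens):
    """Plain type-token ratio: distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)


def encode(tokens, table):
    """Replace every word by its code word (hypothetical cipher tables)."""
    return [table[word] for word in tokens]


source = "the apple and the dog and the cat".split()

# Made-up table in which every source word gets its own code word.
one_to_one = {"the": "chol", "apple": "dain", "and": "ol",
              "dog": "daiin", "cat": "shedy"}

# Same table, except that "apple" and "dog" now collapse into [daiin].
merging = dict(one_to_one, apple="daiin")

print(ttr(source))                       # 5 types / 8 tokens
print(ttr(encode(source, one_to_one)))   # identical to the source value
print(ttr(encode(source, merging)))      # one type lost, so a lower value
```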
So let’s assume those conditions to be true. In that case, TTR suggests two possible solutions for Quire 13’s deviant values:
- Quire 13 is in a different language than the rest of the VM. Not different like English and German, but more like English and Latin – different language families.
- Quire 13 is a very different text type than the rest of the VM. You might say “of course, it’s not about plants.” But subject alone is not enough to explain this huge shift. The difference is like that between a varied prose text and a repetitive hymn.
It should be possible to eliminate the first option by comparing the actual vocabulary. If Quire 13 uses lots of vocabulary that’s absent from the rest of the MS, this might point towards a different language family. However, my guess is that we’re looking at option two, since the vocabulary of Quire 13 is not so different from that of the rest of the MS.
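One rough way to run that vocabulary comparison is sketched below in Python. The function name `absent_share` and the toy word lists are my own placeholders, not real transcription data; a real test would feed in the tokens of Quire 13 and of the rest of the MS from one consistent transcription.

```python
def absent_share(section, rest):
    """Fraction of the word types in `section` that never occur in `rest`."""
    section_types = set(section)
    if not section_types:
        raise ValueError("section is empty")
    return len(section_types - set(rest)) / len(section_types)


if __name__ == "__main__":
    # Toy stand-ins for the two token lists.
    q13 = "qokeey shedy qokeey ol".split()
    rest = "chol daiin ol shedy".split()
    print(f"{absent_share(q13, rest):.2f} of Q13 types are absent elsewhere")
```

A high share would be compatible with option one, a low share with option two. Since longer texts always contain more rare words, the number only means something when compared with the same measure between other pairs of VM sections.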
As for the frequent reduplication of words, I’m obviously not the first one to write about this. Based on what I’ve seen in TTR statistics, I would think small-window effects are a phenomenon isolated from the text as a whole. Perhaps words were sometimes duplicated to pad a paragraph or to confuse a would-be decipherer, but that is extremely speculative.
I plan on further expanding the corpus to create some “clouds” for other languages. Medieval Slavic must certainly be expanded to at least two dozen texts. Italian and French are other targets, followed at a later stage by non-European languages. If you know of any pages where I can gather texts in these or other under-represented languages, do let me know. [2]
To be continued…
NOTES
As usual, I owe thanks to the competent voynich.ninja forum members who helped me out and especially nablator, whose Java code makes these calculations possible.
[1] Voynichese likes to repeat words within 5-10 word windows. I wondered whether this reduplication impacts larger-window TTR. If a lot of duplicate words appear, one would expect MATTR values to go down no matter the window size. I noticed, however, that this is not the case. Bringing m5 values to a normal level by deleting words that repeat within a five-word distance hardly affects m500.
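A simplified Python sketch of the test in note [1] follows. It is blunter than what I actually did: it deletes every word that already occurs among the previous four kept words, which pushes m5 all the way to 1.0 instead of merely to a normal level. The helper names are assumptions made for the example.

```python
def mattr(tokens, window):
    """Mean type-token ratio over all windows of `window` consecutive tokens."""
    n_windows = len(tokens) - window + 1
    if window <= 0 or n_windows <= 0:
        raise ValueError("window must be positive and no longer than the text")
    types = sum(len(set(tokens[i:i + window])) for i in range(n_windows))
    return types / (window * n_windows)


def drop_near_repeats(tokens, distance=5):
    """Drop any word already present among the last `distance - 1` kept words."""
    kept = []
    for tok in tokens:
        recent = kept[max(0, len(kept) - (distance - 1)):]
        if tok not in recent:
            kept.append(tok)
    return kept


def compare(tokens, small=5, large=500):
    """Return (small, large) MATTR pairs before and after the deletion."""
    cleaned = drop_near_repeats(tokens, small)
    before = (mattr(tokens, small), mattr(tokens, large))
    after = (mattr(cleaned, small), mattr(cleaned, large))
    return before, after
```

Calling `compare(tokens)` on a section of at least 500 words after cleaning gives the two pairs to put side by side; if the large-window value barely moves while the small one jumps, the two effects are indeed separable.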
[2] Keep in mind that from a TTR perspective, there is not much difference between closely related languages. I could easily expand the Dutch corpus but this would just thicken the English-German Germanic cloud.
Interesting post, particularly concerning Q13. I wonder whether this is equally true for the two ‘half-quires’ of Q13, Q13A and Q13B, or if one of those is even more skewed than the other.
Koen,
Nice to see your research come together in this post!
Nick,
I asked exactly the same question on the forum, here is a link to the replies I received.
https://www.voynich.ninja/thread-2818-post-28511.html#pid28511
All very interesting. I don’t think you would see differences between different parts of the quire but it would be an interesting exercise nonetheless. I am surprised how well it fits with the various languages other than that one behaviour. I see a few languages missing though, do you have plans to check against Arabic, Spanish, Portuguese, etc?
Hi Koen,
I have read your three posts about MATTR and I would like to ask you some questions.
First, you did not include Spanish because “similar languages (like the various versions of Spanish) will score similarly as long as there are no huge grammatical differences”. Correct me if I’m wrong, do you mean dialects or the other three Romance languages (Galician, Catalan and Aranese)?
Second, if you apply MATTR to the Voynichese language in the same way as you do to readable (non standardized) languages, does the MATTR behave as if the Voynichese language were a single substitution code?
Third, is it possible to apply MATTR to several manuscripts which contain common Latin abbreviations (personal documents)?
Finally, some days ago I was reading the theory defended by James R. Child and David T. Bernholz and they state the Voynichese language consists of different dialects of German and Gothic. Could MATTR be applied to a Gothic handwritten text like Ulphilas’ Bible? I don’t think the Gothic language has anything to do with the Voynichese manuscript but it would be worth using Gothic for MATTR linguistic comparisons.
Thanks
Hi Carmen
Sorry again for the delay, the WordPress spam filter has been eating a lot of good comments lately.
About not including languages, I mean it would make little sense to include Spanish *and* its various dialects since those are very similar in syntax and grammar. But I’d certainly like to include one or more Romance languages – French, Italian, Spanish…
What I do believe is that there will be greater gains in gathering more diverse languages first. My prediction is that a Spanish cloud won’t differ much from English even. But Czech, Basque or Arabic would certainly open up entirely different sections of the graphs. All work in progress… anyway it’s probably worth it since the collection can be used for different tests later as well.
About Voynichese behaving like a substitution code, MATTR can’t tell because it doesn’t really look inside words. What I mean is, if you have a code where you consistently replace “apple” by “udididdivbividds”, that’s perfectly fine for MATTR, as long as the same word is consistently replaced with the same code word.
At the same time, this is a strength of the method since internal word parsing doesn’t matter. We transcribe “daiin” with five letters, but we don’t know if these are actually meant as five individual glyphs. It might as well be “dam”. For MATTR this doesn’t matter (no pun intended), as long as your transcription is consistent. A unique word in the source text must be a unique word in your transcription.
What MATTR can say is that *if* Voynichese is a relatively straightforward representation of an existing language, its lexical richness (and hence its amount of noun cases etc) is like that of language x or y. It can also quantify and compare problems with Voynichese we still wrestle with, like the unusual amount of exact reduplication.
Third question: I can apply it to any copy-pastable text 🙂 So if abbreviations are transcribed in a specific way, the method will take this into account. But I’d need a good transcription first. Whenever it is possible, I use the most true-to-manuscript transcription available. But this isn’t always possible.
Now on the other hand, if a scribe always abbreviates a word the same way and the transcription always writes this word in full, MATTR data will be the same since transcription is consistent.
On Gothic, again the same. I can very easily include any text as long as it’s available in a format I can copy-paste. For Gothic I would carefully predict that its data are closer to Voynichese than other Germanic languages since it is morphologically richer.
Thanks for your reply, Koen.
My last question is about the concept introduced by Captain Prescott Currier. In “Proceedings of a Seminar”, he made the observation of “the line as a functional unit”. Could the MATTR be applied to an isolated line of any text (no matter the language)? Or does this not make any difference in the final results? Thanks.
Carmen, I’m not sure if I understand what you mean, but it sounds like something that’s worth investigating, so please explain if you meant something else.
If you mean to take one line of Voynichese and compare it to one line of something else, then the sample size is just too small. Those are not enough words to generate any meaningful stats.
But if you mean that I should try the method using lines of text as fluctuating window sizes, that could work. So average the TTR for each line across the text. It could be an interesting exercise, but difficult to interpret the results. This is because TTR depends on the size of your text fragments. For example, if two identical words occur in a line of 10 words, type-token ratio of that line will be 90%. But if those two words appear in a line of four words, TTR of that line will be 75%.
The interesting part of this test would be that TTR will also be affected by layout. And as you say, Currier’s research showed that this affects Voynichese as well. But please explain if you meant something else.
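The line-based variant discussed in this reply could be sketched roughly as follows in Python. The function names are made up for the example, and the two toy lines simply reproduce the 90% and 75% cases from the arithmetic above.

```python
def line_ttrs(lines):
    """Type-token ratio of each non-empty line, treating a line as one window."""
    ratios = []
    for line in lines:
        words = line.split()
        if words:
            ratios.append(len(set(words)) / len(words))
    return ratios


def mean_line_ttr(lines):
    """Average of the per-line ratios across the whole text."""
    ratios = line_ttrs(lines)
    if not ratios:
        raise ValueError("no non-empty lines")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    ten_words = "a b c d e f g h i a"   # one repeated word in ten: 9/10
    four_words = "a b c a"              # the same repeat in four words: 3/4
    print(line_ttrs([ten_words, four_words]))
    print(mean_line_ttr([ten_words, four_words]))
```

Because the same single repetition weighs more in a short line, texts with different line lengths are not directly comparable with this measure.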
Now that I read my question again, I see why you didn’t understand what I meant. Sorry for that.
As you say, my idea would be “to take one line of Voynichese and compare it to one line of something else”. At first, I also thought the samples are too small or irrelevant to be compared. Anyway, as Statistics is not my area of expertise 😢, I thought I had to ask about it.
Currier said the line as a functional unit was really significant and it seemed to me it could be interesting to use the MATTR as a tool to check that. But…if the sample size is too small to be meaningful, well definitely there’s no point in trying it.
Thanks.
Koen,
Could you explain how you define ‘English’ in the study? I mean, does it begin with Chaucer, or do you include e.g. Anglo-Norman?
Nick: good question. I’ll check this later today and share some graphs in a new post.
Linda: the more the merrier. But a single text is basically worthless, you’d need at least ten to get some idea and 20 to get a decent cloud. To find a variety of copy-pastable medieval texts is harder than you’d think.
Additionally, as my familiarity with the language decreases, so does my ability to judge the quality of the text and to properly pre-process it (remove punctuation, editor’s additions etc.) But both Spanish and Arabic should certainly be added at some point.
Diane: I aim for 14th-15th century but there are some much earlier texts included as well. I’m not sure how intuitive this is, but TTR is quite forgiving when it comes to different stages of a language’s evolution. This is because TTR values are mostly determined by grammar, and grammar is very slow to evolve.
For the same reason it makes little sense to check multiple dialects of the same language, because those will mainly differ in vocabulary and the way they pronounce words, while TTR-determining factors hardly change.
VViews: going by Rene’s earlier graph, I thought there would be little difference, but to my surprise it appears there might be some after all.
Hi Koen,
I checked the Q13 / rest_of_VMS difference when removing all cases of perfect reduplication. It seems that things are almost unaffected, the 0.14 difference for a 500-word window is still there:
texts/Q13 0.505
texts/VMS-noQ13 0.646
texts/Q13_no_redup 0.509
texts/VMS-noQ13_no_redup 0.6503
Removing reduplication only slightly reduces the MATTR difference for a 5-word window:
texts/Q13 0.973
texts/VMS-noQ13 0.985
texts/Q13_no_redup 0.981
texts/VMS-noQ13_no_redup 0.990
The difference at 5 words is mostly due to alternating patterns:
X Y X
X Y1 Y2 X
X Y1 Y2 Y3 X
These appear to be twice as frequent in Q13 as in the rest of the manuscript.
Hi Marco
If I understand correctly, this confirms the results of the similar test I did, right?
Is there a way you can see how the 5-word patterns affect larger windows?
I am sorry Koen, at the moment, I don’t remember the details of your similar experiment. If you have a link handy, I’ll check how close our results are.
My impression is that the 5-word window does not contribute much to the 500-word results. I mean that Q13 appears to have a more systematic lexicon and this shows at all window sizes. I cannot think of a simple way to check this, but the fact that Q13 has a lower TTR also with very large windows (e.g. 5000) seems to confirm that it just uses fewer word-types in general.
An unrelated point. I am not sure I understand what you wrote here:
“Based on what I’ve seen in TTR statistics, I would think small-window effects are a phenomenon isolated from the text as a whole. That words were sometimes duplicated for padding a paragraph or confusing a would-be decipherer. But that is extremely speculative.”
Both sentences are unclear to me: what does it mean that small-window effects are isolated from the text as a whole? What makes you think that reduplicated words are not significant?
I guess the two answers are related, but I must have lost some step in your line of reasoning.
Do you think that these ideas also apply to the phenomenon of quasi-reduplication? It is at least as frequent as exact reduplication and totally transparent to TTR. Sequences like “chody schody” “okeey qokeey” “chol cholor” “cheody cheeody”…
Yeah, in retrospect that was a dumb thing of me to add, especially in the light of the equally pervasive quasi-reduplication. It’s not like you get a completely normal text when you remove only exact reduplication.
My previous experiment was to remove duplicate words within 5-word windows at random from Q13, until its m5 was just as high as that of the rest of the VM. This had little effect on higher window values.
Thanks to the details you added at 18:10, I think I found the post where you described the experiment you mention:
https://www.voynich.ninja/thread-2818-post-28520.html#pid28520
As you say, my results confirm what you found. Actually, your experiment is more informative, since you directly address co-occurrences within a 5-word window, rather than immediate reduplication only.
According to Rene’s analysis here:
http://www.voynich.nu/extra/curabcd.html
the Biological section is one of the two extremes of the Currier A/B drift. If this drift is caused by an “evolution” in the writing system (whatever its nature), could it be that this more regular lexicon is the result of some kind of standardization in how words are written?
Koen, thanks for the reply.
You can find all the Italian medieval texts you want on http://www.liberliber.it (also known as Progetto Manuzio), it is the Italian version of Project Gutenberg.
Thank you Stefano, that looks like exactly what I needed. Italian will be included in the next TTR post.