While writing an upcoming post about syllables, I came to realize there is an important topic I must discuss first: the type-token ratio (TTR). In particular, I want to test a more advanced technique called MATTR, which may prove useful in Voynich research.
Tokens, Types and TTR: introduction
In short, tokens are all the words in a text, while types are all the different words. If you increase a text’s number of tokens, it becomes longer. If you increase its number of types, its vocabulary becomes more diverse. By dividing the number of types in a text by its number of tokens, you get its type-token ratio (TTR).
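To make this concrete, here is a minimal sketch in Python, assuming whitespace tokenization and punctuation already stripped:

```python
# A minimal sketch of the type-token ratio: types divided by tokens.
# Assumes a lowercased text with punctuation already removed.
def ttr(text: str) -> float:
    tokens = text.lower().split()   # all words (tokens)
    types = set(tokens)             # all different words (types)
    return len(types) / len(tokens)

print(ttr("the cat saw the dog and the dog saw the cat"))  # 5 types / 11 tokens ≈ 0.45
```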
TTR is mostly used in linguistics to determine the richness of a text’s or speaker’s vocabulary. This has applications in literature and in the study of children’s language development. But TTR also has applications that might be of more interest to us; it can provide an indication of the type of language we are dealing with. This is discussed in Kimmo Kettunen’s paper Can Type-Token Ratio be Used to Show Morphological Complexity of Languages? [1], which I will regularly refer to in this post.
A difficulty in researching TTR is that it is affected by text length. The longer a text runs on, the less novel vocabulary will be introduced. Hence, longer texts increasingly lean towards the “tokens” side of the equation: more words (tokens) are added, but fewer of those words represent new types. In a single sentence, on the other hand, almost every word is new and TTR is typically 1 or close to it. To put it more simply: the number of tokens you can add is unlimited, while type diversity approaches a ceiling after a while. Tokens increase linearly, while types do not.
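This length effect is easy to see by computing the plain TTR of ever longer prefixes of the same token list (a small sketch; the prefix sizes are arbitrary):

```python
# Sketch of the length effect: the TTR of a growing prefix of the same text
# keeps falling, because tokens accumulate faster than new types appear.
def ttr_by_length(tokens: list[str], steps=(100, 500, 1000, 5000, 10000)) -> None:
    for n in steps:
        if n <= len(tokens):
            prefix = tokens[:n]
            print(f"{n:6d} tokens: TTR = {len(set(prefix)) / n:.3f}")
```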
TTR and Voynichese
TTR has a huge advantage for the study of the Voynich manuscript: it is possible. For contrast, one way the complexity of languages is compared is by counting the number of nominal case forms. We can’t do this for Voynichese, because we don’t know what its nouns are. But we can compare its total number of “words” to its number of different “words”. Additionally, TTR is not bothered by many of the parsing questions we face with Voynichese: it doesn’t matter whether we transcribe the common “aiin” ending as four letters or two, as long as we are consistent.
One possible problem is the presence of different “languages” in the manuscript, and the way they may or may not be divided. Traditionally, the manuscript is considered to contain two “languages”, Currier A and B. Currier B introduces other words (or rather word parts), which might cause an abnormal bump in the type graph. While several researchers have argued that matters are not so simple and that there might be more of a continuum of change, it is certain that the division between A and B is a meaningful one (for a clear example, see this post by Julian Bunn).
So while I acknowledge that a hard split between Currier A and B is not the ideal approach, it is the best I can do for now, and for this particular study it is to be preferred over using the corpus as a whole.
Kettunen demonstrates that TTR is “able to show morphological complexity of a language reliably enough regardless of the used material.” In other words, TTR should give an indication of the language’s morphological complexity – which can be compared to that of other languages.
MATTR
Kettunen also discusses a more advanced form of TTR called MATTR: Moving Average Type-Token Ratio. Covington and McFall (2010) [2] describe it as follows:
We cut the Gordian knot by computing and averaging the moving average type–token ratio (MATTR). … We choose a window length (say 500 words) and then compute the TTR for words 1–500, then for words 2–501, then 3–502, and so on to the end of the text. The mean of all these TTRs is a measure of the lexical diversity of the entire text and is not affected by text length or by any statistical assumptions.
MATTR slides a window of fixed size (typically 500 words) over the text and calculates the TTR at every position. The mean value is then taken as, in a way, the most typical TTR value for 500 words of this language. This means that, theoretically, a text of 10k words should have a similar MATTR to one of 100k words in the same language. With simple TTR, on the other hand, it is impossible to compare texts of such unequal sizes.
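For those who want to reproduce the figures, the calculation looks roughly like this (a sketch of the moving-average idea, not the CASPR software linked at the end of this post):

```python
# Sketch of MATTR: slide a fixed-size window over the token list, take the
# TTR at every position, and average. The window counts are updated
# incrementally so the whole pass stays linear in text length.
from collections import Counter

def mattr(tokens: list[str], window: int = 500) -> float:
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)      # fall back to plain TTR
    counts = Counter(tokens[:window])              # word counts in the first window
    type_sum = len(counts)                         # running sum of per-window type counts
    n_windows = 1
    for i in range(window, len(tokens)):
        counts[tokens[i]] += 1                     # word entering the window
        leaving = tokens[i - window]               # word leaving the window
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        type_sum += len(counts)
        n_windows += 1
    return type_sum / (window * n_windows)         # mean TTR over all positions
```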
MATTR has an additional advantage for Voynichese. If it is true that the “language” in the manuscript fluctuates from page to page rather than just between Currier A and B sections, the potential increase in vocabulary caused by these minor fluctuations is negated by MATTR, which only measures TTR one window at a time. For this reason, it should be interesting to compare both TTR and MATTR data for the Voynich Manuscript.
For reference, I’ll summarize Kettunen’s results with MATTR (500-word window). He compared two opposite text types:
- The EU constitution in all available languages. The EU constitution is a modern legal text about a relatively limited subject. Languages are all European, but include various families and types. A language like Finnish has 26 noun forms, while Spanish only has 2. The vocabulary in this text is expected to be very narrow, given the genre.
- Unrelated sentences from a corpus (Leipzig corpora). All sentences are natural and grammatical, but there is no thematic connection between them. The vocabulary is expected to be broad, resulting in higher MATTR values.
For the EU constitution, MATTR percentages ranged from 39% for English to 60% for Finnish. For the Leipzig corpus (unrelated sentences) they ranged from 61% for Spanish to a whopping 86% for Finnish.
These values are just to give you an idea of the possibilities though. There is no sense in comparing the EU constitution to anything written in the 15th century.
The window size we choose for MATTR is important. This is explained in Covington & McFall (2010). They suggest a window of 500 words for the type of analysis we will be doing (Kettunen also uses 500). But then they add something which might be very interesting for our purpose:
A short window, perhaps as short as 10 words, is appropriate if the goal is to detect repetition of immediately preceding words or phrases due to dysfluent production. In fact, the ratio of MATTRs with two different window sizes is a potentially useful indication of whether repetition occurs over short or long spans.
Seasoned Voynich researchers will understand where I’m going with this: it is often claimed that the VM displays unusual repetition patterns over short distances, with words being repeated several times in close succession. If we compare a standard-window MATTR (500 words per window) with a shorter one for each language, we can test whether Voynichese is indeed unusual in its short-span repetition.
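A sketch of that comparison, using the mattr() function sketched above; the file names are placeholders for whatever cleaned texts are at hand:

```python
# Sketch: compare a long-window and a short-window MATTR per text, following
# Covington & McFall's suggestion that the ratio of the two indicates whether
# repetition happens over short or long spans. File names are placeholders;
# mattr() is the sketch given earlier.
def load_tokens(path: str) -> list[str]:
    with open(path, encoding="utf-8") as f:
        return f.read().lower().split()

for name in ("pliny_la.txt", "maerlant_dutch.txt", "voynich_eva.txt"):
    tokens = load_tokens(name)[:10_000]            # first 10k words, as in this post
    m500, m10 = mattr(tokens, 500), mattr(tokens, 10)
    print(f"{name}: MATTR500={m500:.3f}  MATTR10={m10:.3f}  ratio={m10 / m500:.2f}")
```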
Corpora
As an initial test I decided to compare Voynichese to a few texts which are easily accessible to me and existed in the Middle Ages.
For Latin I selected the first 10k words from Pliny the Elder’s Natural History, Liber II (on astronomy). The text was cleaned with punctuation removed, but elements like Greek words and Roman numerals within the text were kept.
For Greek, I took the first 10k words of the Iliad, from books 1 and 2 as found here.
For Middle Dutch, I selected two texts. On the one hand, the first 10k words of Van den vos Reynaerde. This edition is from the Comburg manuscript, written between 1380 and 1425 in the area of Ghent. On the other hand, to study the effect of text type within the same language, I included the first 10k words of Maerlant’s Der Naturen Bloeme, a didactic poem on nature. The chapters I selected were those on regular trees, spice trees and healing herbs. Text found here.
More languages and text types must be studied at a later stage, but these will suffice for an initial demonstration.
MATTR on Ancient and Medieval texts
Let’s first have a look at the MATTR values for the Latin, Greek and Middle Dutch texts.
Two things stand out in the above graph. First of all, it is clear that Latin and Greek (blue and red) show very similar behavior. This is not entirely unexpected, because the languages are similar in the number of noun cases. The same is true for both Dutch texts (yellow and green). Here, the text type seems to have caused a small difference, with Reynaert being more “repetitive”. This makes perfect sense, since the Reynaert is one narrative text about a limited number of characters, while the Maerlant fragment deals with various trees and herbs.
Secondly, MATTR values converge as the window becomes smaller. When taking the average over 2000-word windows (left side of the graph), the differences between language types really stand out. This is still the case at the 500-word window. But when the window is shrunk all the way down to 25 and to the minimum of 5, values rush upwards. This is normal: in a window of only five words, an average of 99% unique words can be expected.
Let’s zoom in on the rightmost part of this graph, MATTR window sizes 25 down to 5. We see the expected convergence, but no lines crossing. This means that vocabulary diversity is distributed evenly throughout each text: using a larger or smaller window does not affect their relative positions.
In conclusion, these texts behave completely as expected and MATTR appears to be an effective tool also for non-modern texts.
Enter the Voynich
Since we have seen that Latin and Greek behave similarly and both Middle Dutch texts do as well, I will limit the following to one example of each type: Pliny and Maerlant.
Introducing the Voynich into the mix gives an interesting result. On the large scale, things look normal, but zooming in, an anomaly can be found. [Note: this is a transcription of the entire VM as one fluent text]. Let’s look at the big picture first.
The red line represents the Voynich data. It sits nicely in between Latin and Middle Dutch, and the line evolves in precisely the same way. What this graph tells us is the following: as far as vocabulary variation goes, the VM behaves exactly like a Medieval text.
However, there is one caveat. Let us zoom in on the narrower MATTR windows, just like we did for the previous graph. Note how the yellow Maerlant line and the red VM line cross.
This means that, averaged over 15-word windows, Maerlant is still slightly more repetitive than the VM. But reduce the window to 10 words, and the VM takes over. These statistics confirm the suspicion that the VM likes to repeat words to an unusual degree.
It must be noted that the differences are tiny, and the VM is still at 98.5% average word uniqueness per 5-word window.
| Window | Maerlant | VM | Pliny |
|---|---|---|---|
| 5 | 0.99 | 0.985 | 0.995 |
| 10 | 0.97 | 0.969 | 0.984 |
| 25 | 0.917 | 0.931 | 0.956 |
| 100 | 0.777 | 0.827 | 0.87 |
| 200 | 0.687 | 0.757 | 0.812 |
| 300 | 0.634 | 0.713 | 0.778 |
| 400 | 0.596 | 0.681 | 0.752 |
| 500 | 0.567 | 0.655 | 0.732 |
| 1000 | 0.483 | 0.575 | 0.662 |
| 2500 | 0.376 | 0.474 | 0.561 |
So in my opinion we must really consider both graphs simultaneously: the VM behaves normally, but has a slightly higher tendency to repeat words in close proximity. The difference with Middle Dutch is only one percent, but this does cause the graphs to cross.
What about Currier languages?
It has long been observed that various sections and subsections of the VM exhibit a different “vocabulary”. Traditionally, the two main varieties are called Currier A and Currier B, although it is likely that the situation is much more complex. For the purpose of this post, however, I will stick to the two Currier languages. To the best of my understanding, these are best thought of as dialects of the same “language”, with many similarities but also consistent differences and preferences.
Since these differences can be observed at the level of vocabulary, TTR and MATTR might be useful tools in their study. I intend the following only as a demonstration of the concept – it is well beyond my skill to solve the age-old question of variation in Voynichese.
In order to test this, I isolated a few sections: Herbal A (Currier A), plus Herbal B (relatively small) and Quire 20 (both Currier B). This is how their MATTR data compare.
| Window | VM | Herbal A | Herbal B | Q20 |
|---|---|---|---|---|
| 5 | 0.985 | 0.982 | 0.988 | 0.987 |
| 10 | 0.969 | 0.963 | 0.973 | 0.975 |
| 25 | 0.931 | 0.921 | 0.942 | 0.943 |
| 100 | 0.827 | 0.812 | 0.844 | 0.843 |
| 200 | 0.757 | 0.743 | 0.773 | 0.771 |
| 500 | 0.655 | 0.641 | 0.659 | 0.666 |
| 1000 | 0.575 | 0.555 | 0.565 | 0.588 |
Here’s the graph from these measurements. The outer lines are still Pliny and Maerlant, the central red line is still the entire VM.
What we see is that Herbal A has a slightly lower degree of vocabulary variation than the B-sections. Herbal B is like the entire manuscript for large windows, but from around the 300-word window it aligns perfectly with Quire 20, which is indeed in the same “dialect”. Overall, there appears to be more repetition of vocabulary in Herbal A than there is in the B-sections. Here’s the zoom:
The yellow line, Maerlant, catches up with Herbal A around the 20-word window. For the B-sections, the window needs to be smaller to achieve the same density of repetition as Maerlant. [Side note: Reynaert does the same, so this is not some quirk of Maerlant’s text].
Still, all in all the differences remain small. I am surprised to see how both B-sections coincide, and how each VM section separately still behaves like a normal language. It is hard to know what exactly these data imply, but my subjective feeling is that it would be hard to achieve these numbers if the VM did not have some “text” at its basis.
Conclusion
I have shown that the lexical diversity of the VM sits well within the range of normal texts. Pliny and Homer have a higher lexical diversity, which I attribute mainly to the larger number of noun forms in Latin and Greek. On the other hand, two different Middle Dutch texts showed a lower lexical diversity than Voynichese. This can be explained by the lower number of distinct noun forms in Middle Dutch, but perhaps also by the text types. More classical and medieval texts and languages must be compared.
Secondly, it appeared that Currier A sections have a lower per-window lexical diversity than Currier B sections. Still, this difference remained much smaller than the difference between Latin and Middle Dutch. In other words, the differences in lexical diversity between Currier A and B are what we would expect from related languages or from slightly different text types, not from vastly different languages.
The comparisons we can make using TTR and MATTR are virtually unlimited. There is room for the comparison of additional languages and text types, as well as for VM-internal study.
[1] Kimmo Kettunen (2014) Can Type-Token Ratio be Used to Show Morphological Complexity of Languages?, Journal of Quantitative Linguistics, 21:3, 223-245, DOI: 10.1080/09296174.2014.911506
[2] Covington, M., & McFall, J. D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100.
— Software for calculating MATTR is available here: http://ai1.ai.uga.edu/caspr/
— Special thanks to Marco Ponzi for his valuable assistance during this research, and to Rene Zandbergen and bi3mw for providing the required text files.
Hi Koen,
I haven’t read the papers you mentioned yet, but I find this kind of analysis both original and informative: thank you for sharing these results!
In my opinion, your measures for the VMS are likely to be an upper bound for the language. A handwritten text might appear to be more varied than it really is, because of phenomena that are frequent in medieval manuscripts, e.g. abbreviations and inconsistencies in word spacing and spelling. These can cause what should be a single type to appear in different forms. So I am not sure that the language underlying the VMS really is (so much) more varied than Middle Dutch, but I think the evidence you provide suggests we should look for something less inflected than Latin and Greek.
It would now be interesting to find texts that compare well both with the overall measures for the VMS (the 500 tokens window) and for the frequent local repetitions you observed. BTW, I initially thought that what you observed was due to reduplication, i.e. the consecutive occurrence of tokens of the same type (like “daiin daiin”), but my preliminary checks suggest that this is not the case: there is a high number of occurrences of identical tokens that are separated by a few words. Of course, reduplication also contributes, as well as the “alternating pattern” X Y X we discussed on the forum:
https://www.voynich.ninja/thread-2357-post-20258.html#pid20258
But I think the short-range repetitions you observed are an area where very little has been done yet.
Thanks, Marco. I think you are right about the upper bound. Unfortunately, there is not much we can do for Voynichese other than take it as it is presented to us. The trick will be to find better texts to compare it with. As I answered Rene though, the Reynaert does maintain spelling variation, so that’s one factor eliminated already.
Abbreviations, that’s something else… I wonder how they would impact MATTR 500. If one abbreviation, say “domin9”, usually corresponds to one particular word within the 500-word range, then the impact is minimal. If within 500 words you typically get abbreviations that look the same but “develop” into different types, then it could have a notable impact.
The MATTR approach could possibly be used as yet another way of identifying the different Currier languages in the text. Basically, you could apply the method as a sort of convolution and then blur it to get a nice graph of how the repetitiveness changes throughout the entire text.
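A minimal sketch of that idea, assuming a whitespace-tokenized transcription: keep the full per-window TTR sequence instead of averaging it, and smooth it to make the drift visible.

```python
# Sketch of the "convolution" idea: compute the TTR for every window position
# (the values MATTR would average) and then blur the resulting curve with a
# simple moving average, so changes in repetitiveness along the text stand out.
from collections import Counter

def ttr_profile(tokens: list[str], window: int = 100) -> list[float]:
    counts = Counter(tokens[:window])
    profile = [len(counts) / window]
    for i in range(window, len(tokens)):
        counts[tokens[i]] += 1
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        profile.append(len(counts) / window)
    return profile

def blur(values: list[float], width: int = 50) -> list[float]:
    # plain moving average over the raw per-window TTR curve
    return [sum(values[i:i + width]) / width for i in range(len(values) - width + 1)]
```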
Some of the graphs are equivalent with graphs on an old web page of mine, which, by coincidence, I recently re-did:
http://www.voynich.nu/extra/wordent.html
especially Figure 1 and Figure 5.
For a hypothetical text that exactly follows Zipf’s law, one can also predict exactly the TTR.
For all texts that follow Zipf’s law “to some extent”, the TTR graphs will then be similar to some extent.
Gabriel Landini analysed a text that did not follow Zipf’s law – it was a thesaurus. It’s easy to see how such a text would have a different TTR, as new words appear at a high rate all the time. And we can already conclude with some confidence that the Voynich MS is not a thesaurus.
I fully agree with Marco that the results for the Voynich MS (and for any old handwritten text) need to be treated with care, since spelling variations introduce a bias towards a higher TTR.
For the quoted example of the EU constitution in various languages, all texts were undoubtedly following national spelling rules, and had been spell-checked.
Yes, I understand the EU constitution is largely irrelevant for the VM, that is why I didn’t compare the data directly. I just wanted to illustrate the spread caused by the various languages’ inflectional complexity.
Spelling variations do increase TTR, but I think that especially with smaller MATTR windows this effect will be reduced (though still present). Looking at the limited data I collected so far, I think we must not overestimate their effect.
The Pliny and Iliad texts are standardized, but their MATTR values are much higher than those of both Middle Dutch texts. And in Reynaert, spelling variations are kept: for “Grimbeert” I have the spellings Grimbeert, Grimbert and Grinbert – ignoring variations that might be caused by inflection like “Grimbeerte”, those are still three different spellings.
So the non-standardized Middle Dutch texts achieve a much lower score than the VM. This suggests to me that the effects of language type are large enough to overcome erratic scribal behavior.
By analogy with Rene’s graphs, I expect Dante’s Italian to have a TTR similar to Middle Dutch (but of course this is a dubious speculation). Italian only has two noun forms (singular and plural): is this the case also for Middle Dutch?
I would not have dared to predict the size of the impact of these spelling variations, and it is very useful to have such a text to measure it.
Slightly OT, but such variations will also have a negative impact on the number of repeating phrases.
I have long had in mind to experiment with that, but not yet found the time.
Very interesting research Koen! I don’t have much to say other than I look forward to reading more like it.
If you have texts from other tongues, especially non-Indo-European ones, it would be great to see the MATTR for them also.
In theory there are four cases in Middle Dutch, but since the case system is in decline, many noun forms have become formally the same. I believe that most nouns can appear in four forms: base, base+s, base+e, base+n.
If Dante turns out to have higher TTR numbers than Reynaert (which I believe likely) then maybe text type is to blame? Many more texts need to be checked in order to get a picture of the Medieval situation. But in an ideal situation (i.e. without complicating interpretation) I would think Dante’s stats a tiny bit lower than Middle Dutch. I have some time now, I’ll check it right away 🙂
So I used the text from this page, which was very easy to prepare for analysis:
http://www.filosofico.net/ladivinacommedia.htm
Dante sits somewhere between Voynichese and Middle Dutch, which makes it the closest text to the VM so far. With only 4% between Dante and Reynaert, they are very distant from the form-rich Latin.
MATTR500 in descending order:
Pliny .732
VM .655
Dante .593
Reynaert .553
Thank you, Koen!
So Dante really turns out to be close to Middle Dutch; and the Voynich ms falls between Pliny and Dante, as in Rene’s graphs.
As you say, the higher TTR of Dante wrt Middle Dutch may be due to poetic style. For instance, Dante often truncates words in order to fit them into his verses (e.g. ‘cammin’ for ‘cammino’ in the first line).
A prose Italian text that is sometimes used as comparison for Voynichese is Il Principe by Machiavelli (1520 ca). It is available on wikisource (with no hyphenation), but it is split into individual chapters, so collecting a large enough section would require some copy-and-paste:
https://it.wikisource.org/wiki/Il_Principe/Capitolo_I
You are right! Il Principe (first 10k words) is the lowest one I’ve measured so far, with 0.533 under Reynaert’s 0.553.
Additionally, from the 50-word window on, Principe and Reynaert have exactly the same values.
About your experiments with Il Principe: Dante and Machiavelli are two quite different texts. One was written two centuries later than the other; one is prose and the other poetry. We can provisionally take the 0.06 difference as the maximum to be expected for texts in the same language.
The small-window results are more difficult to interpret: I have no idea of why Machiavelli converges towards the same values as Middle Dutch…
Hello, when the latest “solution” to the mystery of the Voynich Manuscript made the headlines, early this week, I started searching for more information about the issue and finally I found your blog. I think that your analysis of the iconography of the VM is the most interesting thing I read about the Voynich so far.
That said, I am replying to this post because I found your discussion here with Marco Ponzi a bit lacking. In fact, Dante’s language is quite complex and, especially in the Comedy, it is known to adapt to the given context. Besides that, he wrote much more than the Comedy, so I thought he could be used for some further analysis.
I did that analysis and here it is (I took the first ten canti of the Inferno as a benchmark to see whether my texts were prepared in the same way as yours; I got 0.594 against your 0.593, so I think my methodology is consistent with yours):
MATTR (500 window size)
[Inferno (first ten canti)* 0.594]
Inferno (poetry, Italian) 0.594
Purgatorio (poetry, Italian) 0.604
Paradiso (poetry, Italian) 0.587
Commedia (all the above) 0.595
9 random canti from the Commedia 0.604
Vita Nuova (poetry and prose, Italian) 0.497
Convivio (mostly prose, Italian) 0.472
De Vulgari Eloquentia (mostly prose, Latin and vernacular) 0.646
Quaestio de Aqua et Terra (prose, Latin) 0.518
De Monarchia (prose, Latin) 0.599
Egloghe (poetry, Latin) 0.823
So there is wild variation among them, even if we take out the short (a little more than 2000 words) Egloghe (however, please note that the MATTR100 of the Egloghe is still an exceptional 0.921!). The language of the Commedia changes from one part to the other, so if we mix random canti from the three different parts, we get a higher complexity; hence MATTR does not detect complexity if that complexity is spread over a large text. The De Vulgari Eloquentia is written in Latin, but with a lot of citations in Italian, Occitan and French vernaculars, hence its relatively high complexity (similar to that of the VM, by the way).
My take is that the issue is much more complex than it seems: MATTR should be evaluated on a very large set of texts to get an idea of what we are looking at (i.e. subject, writing style, language, use of abbreviations, etc.).
Hi Stefano, thank you for your very interesting comment.
Note that I saw this post only as a first step, a way to familiarize myself and readers with a tool that *could* tell us something about which texts behave like the VM in a specific way.
Since this post was published, I have reached the same conclusion as you did: the impact of text type should not be underestimated. You have shown clearly that even within the work of a single author, MATTR values vary significantly.
The variations you spotted within the same work (DC) are worth keeping in mind as well, but of course they are smaller than the differences between separate works.
But language and text type really work in tandem to determine MATTR values.
We have been discussing methodology (I needed to learn a few things) in a thread on the Voynich Ninja forum. It’s kind of trial and error, but I’ll link straight to a post by Marco which contains a very telling graph:
https://voynich.ninja//thread-2770-post-27618.html#pid27618
If you plot the M500 and M5 values against each other, you’ll see that the VM occupies a space outside of the trend. Its M5 is low compared to its M500 (or the other way around, depending on your angle).
If we can find a text which does exactly *that*, plus is around the same values as Voynichese, then we might learn something.
What we should do now is test a whole range of texts and see what comes out. Like taking handfuls of darts, tossing them at the target and hoping one hits.
We are gathering texts in this thread, which is also worth checking out if you are interested:
https://www.voynich.ninja/thread-2764.html
Koen – if you feel like testing 14thC English, the Canterbury Tales are here
https://www.sacred-texts.com/neu/eng/mect/index.htm
Excellent, thanks 🙂
Koen, the links you give for that discussion only work if your reader joins the forum.
Ah sorry, I always forget which subforums are public.
Here are the graphs I was talking about; Marco compiled these from the data as they were added.
For the first one, I wanted to test how languages would behave in a parallel corpus. These are all from Genesis 1-30 in a variety of languages. IMPORTANT note: Genesis is a very repetitive text type, so the values are all low. What I was interested in, however, is the relative distances between the languages for the same text. As you see, the languages fall within a band around the trend line. The Asian scripts fall outside of this band, as does the Voynich (though they do so on opposite sides).
And the following graph adds a few different texts to the mix.
So the exercise would be to find a text which deviates from the trend line in a similar way the VM does, which is to have a relatively low value (=more repetitive) for very small windows.
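A sketch of how that screening could be automated, building on the mattr() and load_tokens() sketches earlier in the post; the corpus list and file names are placeholders:

```python
# Sketch: fit a simple least-squares line through the (MATTR500, MATTR5) points
# of a set of texts and report each text's deviation from it. A clearly negative
# deviation means the text is unusually repetitive at very short spans relative
# to its overall lexical diversity. Requires numpy; file names are placeholders.
import numpy as np

corpus = {"Pliny": "pliny_la.txt", "Maerlant": "maerlant_dutch.txt",
          "Genesis (EN)": "genesis_en.txt", "Voynich": "voynich_eva.txt"}

points = {}
for name, path in corpus.items():
    tokens = load_tokens(path)[:10_000]
    points[name] = (mattr(tokens, 500), mattr(tokens, 5))

x = np.array([m500 for m500, _ in points.values()])
y = np.array([m5 for _, m5 in points.values()])
slope, intercept = np.polyfit(x, y, 1)             # trend line through all texts

for name, (m500, m5) in points.items():
    deviation = m5 - (slope * m500 + intercept)
    print(f"{name:12s}  M500={m500:.3f}  M5={m5:.3f}  deviation={deviation:+.4f}")
```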
Ah, interesting, the Voynich Manuscript looks quite suspicious. Among the texts I surveyed, only the De Vulgari Eloquentia (MATTR500 0.646 – MATTR5 0.989) falls in the same area as those sections of the VM, but that is expected. The De Vulgari Eloquentia comprises different languages, so at the global level (MATTR500) it is very diverse, because there you find words from multiple languages, but at the local level (MATTR5) it is quite average, since most of those small sections comprise only one language. However, I doubt that the VM is a multi-language text like the DVE.
The VM could be a kind of encyclopedia, like the Naturalis Historia, but with small sections about different subjects. However, I think that MATTR alone is not enough for a good statistical analysis of the text; it should be supported by a variance or standard deviation test. That way one can also have a measurement of the uniformity of the text.
Note though that the suspicious behavior only really comes to the fore in M5 windows, as compared to larger ones. Our focus should be on the relative value of M5.
But M5, that’s less than a sentence. This is not caused by subject, or variety of subject. Changing the subject each page, or even each paragraph, will not impact M5 much, if at all.
So what can cause this deviation in other texts? I’m eager to sample more texts and try to find a few that match, but unfortunately I don’t have much spare time this week.
Could it be the language, one I have not sampled yet? Or certain text types?
There are some languages I haven’t been able to include yet. Hebrew gave very weird values; I think something was wrong with my txt file or the way the program handles it. I simply don’t know enough about Hebrew script to know whether or not I’m processing it correctly. The same goes for Arabic. But to get a complete picture, such scripts and languages should be included as well.
So there’s still much work 🙂
I think that we need to take a step back. The fact that the moving average shows an anomaly when using a window size of 5 words does not imply that the source data (text, in this case) is anomalous at that level.
For example, let’s take a 40-word text, split into two sections of 20 words each. Those two sections employ two different lexicons. Now let’s analyze that text with MATTR. Using a window of 40 words, we get just one sample with a quite diverse lexicon, because it comprises both of those different sections. Using a window of 20 words, we get one homogeneous sample comprising the entire first section, one homogeneous sample comprising the entire second section, and 19 samples made of a mix of the two different sections: over 90% of the samples (19 out of 21) are still quite diverse, and the MATTR should still reflect that. When using a window of 5 words, we get 4 samples (11%) made of a mix of the two different sections and 32 homogeneous samples (89%) from either section: now we have found an anomaly, because the MATTR showed diversity with larger windows, but with a small enough window it shows a remarkable homogeneity of lexicon. However, the anomaly is not at the level where we spotted it (5 words), but at a higher level (20 words).
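This counting argument is easy to verify with a short sketch (assuming the two halves share no vocabulary at all):

```python
# Sketch reproducing the counting argument: a 40-word text made of two 20-word
# halves with disjoint vocabularies. For each window size, count how many
# window positions mix the two halves.
def mixed_windows(text_len: int = 40, boundary: int = 20, window: int = 5):
    n_windows = text_len - window + 1
    # a window starting at 0-based position s mixes the halves if it begins
    # before the boundary and its last word (s + window - 1) lies at or past it
    mixed = sum(1 for s in range(n_windows) if s < boundary <= s + window - 1)
    return mixed, n_windows

for w in (40, 20, 5):
    mixed, total = mixed_windows(window=w)
    print(f"window {w:2d}: {mixed}/{total} mixed positions ({mixed / total:.0%})")
# window 40: 1/1 (100%), window 20: 19/21 (90%), window 5: 4/36 (11%)
```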
I made a test with real text. I took a text made of the first 240 words of De Bello Gallico by Caesar, and another 240-word text made of the first 120 words of De Bello Gallico followed by the first 120 words of Boccaccio’s Filocolo. Here is a plot of the MATTR trends:
https://drive.google.com/open?id=1KaE1BqlohTAzlfh3rHIXX39V3Ut2Bnvo
The two lines intersect between 5 and 10, but we know that there is no anomaly at that level: the “anomaly” is at a much higher level. This test is very reproducible, and I am sure that different combinations of different texts will give very different results.
A valid remark.
On the one hand, I try to counter this by including various VM stats: Herbal A, Herbal B, Quire 20. I should include smaller sections as well.
Secondly, my approach to this is to just gather data and see which ones are like the VM. For example, if the only similarly behaving texts we can find are highly composite, then this may be a clue. It’s not sniping, more like spray and pray at this point 🙂
A bit O.T. perhaps but still ringing in my ears, even now, is a statement once made by Julian Bunn,
“I am convinced that it is not as simple as it appears (i.e. that the words are not words at all)”
Since I can’t see that on his site today, I guess Julian changed his mind about it, but the idea of what it might be, if not words, still fascinates me.
Back then, thinking about what alternatives there could be, I noticed many of the images Julian had posted in his ‘Page Positional Gallows, Mk. II’ post looked really reminiscent of drafting patterns for weaving, so, on spec, I sent two or three of the pictures from his post to an expert in textile analysis, asking if she would let me know whether or not they were weave-able. All was smiles and the conversation was going very well till I foolishly let slip that scary ‘V’ word … and that was that. Another conversation with a specialist scared into flight by the mere word ‘Voynich’.
But what if Voynichese isn’t made up of words? Fascinating question, don’t you think?
Found that quote. In Julian’s post
‘How was the Voynich Manuscript text written?’ (August 23, 2012)
images were from ‘Page Positional Gallows, Mk. II’ post (June 29, 2012)
It’s certainly a possibility that they aren’t words. In fact I think this is often in the back of researchers’ minds. “If” it is language, “if” it contains any meaning at all.
My position on this matter is that observations can only be made by putting these possibilities aside, valid as they may be. As far as we know, it could still be entirely meaningless. But since we will never be able to prove that, we could either do nothing or try to prove the opposite.
The same could be true for the “it’s not text” option. If there’s a good idea, it’s worth checking out, of course.
Fact of the matter is though that it does present itself as a text, in looks and feel. So while we can remain open to other possibilities, I think it’s worth the effort to investigate the “words” as words.
Oh yes, it wasn’t an argument for abandoning text analyses. Just a fascinating idea to idle away the morning train ride. 🙂 Thanks, btw, for including Chaucer’s English in your latest data.
Am I right that this essentially assumes that we have a known or unknown language in an unknown script? If we have a cipher, as I strongly believe, then I would imagine this measure is much less useful, but I could be wrong. The use of nulls and homophones in some way in a cipher would, I think, skew your TTR numbers so that they don’t correlate with the underlying language. You could calculate TTR figures for a medieval enciphered text, but I can’t see that helping, as the figures would be influenced by whatever cipher was used and however it was implemented.