This post assumes that you are familiar with character entropy and its relation to Voynichese and EVA. If not, please start here: Entropy Hunting.
In my first post about conditional character entropy, I tested how successive changes to an EVA transcription of Voynichese impact its entropy values. It was easy to get up to a certain level, but after that, additional changes made little difference. I assumed there were two causes for this:
- Introducing new characters for common bigrams raises not only h2, but h1 as well. Since the formula I’m using is h2/h1, we can expect to reach a virtual ceiling after a while.
- Even if we replace common clusters like [iin] with single characters, these are often still positionally rigid. In the case of [iin], its substitute would also always be followed by a space, which is a low-entropy situation.
In the first post, I added each tweak on top of the previous one; at the fourth step, I could not further increase h2/h1. Was this because the limit for introducing new characters (and increasing h1) had been reached, or just because my later replacements were ineffective?
To answer this question, I will redo the test, but now always starting from an unmodified EVA transcription, so we can monitor the independent impact of each change. Additionally, I will test some suggestions made by Nick Pelling in a comment.
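For readers who want to reproduce these numbers: the quantities involved can be computed in a few lines of Python. This is a minimal sketch of the standard definitions (the implementation behind these posts may differ in details such as how line breaks and transcription markup are handled):

```python
from collections import Counter
from math import log2

def h1(text):
    """First-order entropy: -sum over characters c of p(c) * log2 p(c)."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * log2(c / n) for c in counts.values())

def h2(text):
    """Conditional entropy of a character given the preceding one."""
    bigrams = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])  # distribution of the conditioning character
    n = len(text) - 1
    return -sum(c / n * log2(c / firsts[a]) for (a, _), c in bigrams.items())

def ratio(text):
    """The h2/h1 measure used throughout this series."""
    return h2(text) / h1(text)
```

A fully repetitive text like "abababab" gives h2 = 0, since every character is completely determined by the previous one.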
Let’s start by individually testing the changes I made successively in the previous post.
- [ch] –> C
[sh] –> S
- [ckh] –> Ck
[cth] –> Ct
[cfh] –> Cf
[cph] –> Cp
- [iin] –> M
[in] –> N
[iir] –> W
[ir] –> V
- [qo] –> q
- [ee] –> U
- [y] –> a
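In practice, these substitutions are just a chain of ordered string replacements. Order matters wherever one pattern contains another: [iin] has to be consumed before [in], and [iir] before [ir]. A sketch, with the same arbitrary placeholder characters as in the list above:

```python
# (pattern, replacement) pairs; patterns that contain shorter patterns
# (e.g. "iin" contains "in") must come first.
RULES = [
    ("ckh", "Ck"), ("cth", "Ct"), ("cfh", "Cf"), ("cph", "Cp"),
    ("ch", "C"), ("sh", "S"),
    ("iin", "M"), ("in", "N"), ("iir", "W"), ("ir", "V"),
    ("qo", "q"), ("ee", "U"), ("y", "a"),
]

def apply_rules(text, rules=RULES):
    for pattern, replacement in rules:
        text = text.replace(pattern, replacement)
    return text
```

For example, `apply_rules("daiin")` yields `"daM"`, and `apply_rules("shedy")` yields `"Seda"`.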
The following graph shows the difference in h2/h1 each change in isolation made to the base transcription.
A number of things stand out. First of all, the effect of these changes combined is much greater than the sum of individual changes. Pulling apart benched gallows (step 2) even has a negative impact if it’s done without first collapsing regular benches as well. The best strategy seems to be to treat benches and benched gallows in one step (dark bar below).
Converting i-clusters to their own characters also makes a decent impact. However, converting [qo] to [q] and [ee] to [U] seems useless, even in isolation. Equalizing [y] and [a] does have an effect, which increases when combined with other changes.
This also means that, combined, the properties of EVA drag down h2 more than the sum of their individual effects.
Nick Pelling’s comments
In response to the first entropy post, Nick Pelling made some interesting remarks. First, he suggested more likely candidates for tokenization:
I’ve spent a long time suggesting that ar/or/al/ol/am (I expect om will turn out to be a copying slip for am) should also be parsed as tokens, and that there’s a good chance dy will too (I’m kind of 50-50 on dy). I also wondered in Curse whether o+gallows and y+gallows should be parsed as (verbose) tokens. Moreover, I think an/ain/aiin/aiiin and air/aiir/aiiir should also be read as tokens, rather than breaking them down into strokes below that.
He also added that we don’t know whether Voynichese contains numerals, punctuation (like truncation marks), or some kind of metadata. If it does, it would become much harder to find proper texts for comparison. I will test punctuation in a later post, and focus here on replacing common glyph clusters.
Let’s start with the easy bits; I’m most curious to find out what the impact would be of not only transcribing [iin] as one character but also including the preceding [a].
In EVA, we transcribe the above as [daiin], a very common “word”. In my first test, I would have altered this to [daM], eliminating the internal predictability of [iin]. What Nick says is that [a] should be included as well, so I could transcribe this, for example, as [dM]. What would the impact of this change be on entropy? Starting from the original EVA transcription, I altered as follows:
[an] –> 1
[ain] –> 2
[aiin] –> 3
[aiiin] –> 4
[air] –> 5
[aiir] –> 6
[aiiir] –> 7
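Within this particular set, no pattern is a substring of another (the [i]'s keep [a] and [n]/[r] apart, so [an] never occurs inside [ain] or [aiin]), which means the replacement order is safe. The one thing to watch is combining these with the bare [i]-cluster rules from the earlier test: [aiin] does contain [iin], so the [ai]-rules must run first. A sketch:

```python
AI_RULES = [
    ("an", "1"), ("ain", "2"), ("aiin", "3"), ("aiiin", "4"),
    ("air", "5"), ("aiir", "6"), ("aiiir", "7"),
]

def tokenize_ai(text):
    # None of these patterns contains another, so this order is safe;
    # but run them BEFORE any bare [in]/[iin] rules, which [aiin] etc. contain.
    for pattern, token in AI_RULES:
        text = text.replace(pattern, token)
    return text
```

So `tokenize_ai("daiin")` gives `"d3"`, absorbing the preceding [a] into the token as Nick suggests.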
The original VM fragment achieved 55% of its maximum h2, which we’d like to increase to over 70%. So for (h2/h1), the gap we have to bridge is 0.15 at the very least. Replacing just [i]-clusters adds 0.03, while replacing [ai]-clusters adds 0.05. So Nick’s suggestion is much more effective.
Nick also suggests turning ar/or/al/ol/am into one token each, which is something I hadn’t considered yet. Let’s see what it does (just following my keyboard for the replacement characters – all characters are equal for entropy, and my goal is not to produce a readable text).
[ar] –> 8
[or] –> 9
[al] –> 0
[ol] –> &
[am] –> é
Starting from an unmodified EVA file, these replacements add 0.017 towards the required 0.15 increase in h2/h1. It’s not as much as some other modifications, but it’s something. Replacing [om] with the same token as [am] does not change this value at all, so we can set this additional suggestion aside.
Replace [dy]
Nick gave [dy] a 50% chance. Whatever may be going on with [dy], I’d predict that replacing it with a single character would not have a huge effect, because that character would still be positionally restricted.
As it turns out, this operation decreases h2 by a tiny 0.001. Keep in mind that it might be more effective in conjunction with other measures.
Replace o+gallows and y+gallows
This is a tricky one. On the one hand, I think it may be an essential operation, if only to get some more variation at the beginning of “words”. On the other hand, there are many factors involved. What about benched gallows? What about [qo] + gallows?
For this test, I’ll stick with just [o] or [y] + gallows. There are four different gallows, so this requires adding eight new symbols.
[ok] –> ‘
[ot] –> (
[op] –> §
[of] –> è
[yk] –> !
[yt] –> ç
[yp] –> à
[yf] –> )
In isolation, this operation performs even worse than the previous one, decreasing our entropy value by 0.015.
Below, I summarize what each transformation we have tried so far would individually add towards our goal of a 0.15 increase of (h2/h1). Again, keep in mind that combined replacements may give different results.
Changing [y] to [a] is a different type of operation than the others, so I will ignore this for now. Parsing [i]-clusters as tokens will be dropped in favor of [ai]-clusters. We have already learned that handling benches and benched gallows simultaneously has a positive effect. This leaves us with three operations with demonstrated efficiency:
- benches + benched gallows: +0.054
- [ai]-clusters: +0.047
- [ar, or,…]: +0.017
The sum of these individual operations is +0.118 out of a required +0.150. What is their effect when applied cumulatively?
Combining just these replacements adds +0.168, raising our h2/h1 to 0.716, which is at the very bottom tier for normal text but much better than before, all with a few not-too-controversial choices. However, h2 by itself is still at 2.678 (up from 2.085), while we really need a minimum of 3.
What if we ignore spaces?
Another thing to test is what would happen if spaces are removed from Voynichese. Imagine for example that Voynichese’s h2 has been lowered artificially because a space was introduced before or after specific characters. This would lower entropy, because the position of the character “space” is predictable. It follows [n], precedes [q] and so on.
With spaces removed, h2/h1 = 0.604, an increase of 0.056. Removing spaces from the altered file raises h2/h1 to 0.807 (!) and h2 to a satisfactory 3.145.
However, removing spaces is of a different order than changing EVA [ch] to [C]. We don’t know whether “the bench” is one glyph or two combined glyphs. So representing it as one or two letters is a choice one must make. There are no conclusive arguments in favor of one or the other, and the choices made for EVA are not better or worse than other options. But removing spaces is a much more invasive procedure, since spaces are clearly present in the text. Hence, removing spaces is destructive since we are unable to reconstruct the original transcription.
Still, this is something to keep in mind. Not only can we bump Voynichese entropy by removing spaces, we can also easily lower entropy in other texts by introducing spaces. Statistically, this is nothing special, because a space is a character like any other. If you shove spaces in at predictable locations, you decrease entropy. The question is whether it is possible that something like that happened in Voynichese.
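The effect is easy to demonstrate on a toy example. In the string below, every space is fully predictable from its neighbours (it always follows [b]), so stripping the spaces raises h2/h1. The helper here is a compact, self-contained version of the h2/h1 measure used in this series:

```python
from collections import Counter
from math import log2

def cond_ratio(text):
    """h2/h1: conditional bigram entropy over first-order character entropy."""
    n = len(text)
    h1 = -sum(c / n * log2(c / n) for c in Counter(text).values())
    firsts = Counter(text[:-1])
    h2 = -sum(c / (n - 1) * log2(c / firsts[a])
              for (a, _), c in Counter(zip(text, text[1:])).items())
    return h2 / h1

spaced = "ab cb ab cb ab"  # every space follows b; every word ends in b
assert cond_ratio(spaced.replace(" ", "")) > cond_ratio(spaced)
```

The converse also holds: take any text and insert spaces at predictable spots, and h2/h1 drops.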
Other common bigrams?
Are there other replacements that improve our modified transcription?
[e, ee, eee, eeee]
I tried experimenting with [e]-groups in various ways, but the effects have been limited so far. Even on the altered transcription, it does little to increase entropy.
The [o]-character in Voynichese is predictable, and many of Nick’s suggestions involve it altering the next glyph. But here’s the problem: [o] + gallows often appears word-initially. If you were to remove these [o]’s, entropy would decrease, because the [o]’s add an element of unpredictability to gallows that would otherwise always be word-initial. You can artificially counter this by replacing each combination with a new character, but the effect on h2 is limited, while you do increase h0.
Simply put, replacing [o,y]+gallows with one character only adds more characters that uniquely appear at the beginning of words.
Now if you combine this mess with the removal of spaces, you get good results:
h0 = 5.357552004618084
h1 = 4.018222361153936
h2 = 3.233913175216506
h2/h1 = 0.805
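As an aside, h0 is simply the binary log of the alphabet size, so the value above tells us exactly how many distinct symbols the final transcription uses: 2^5.35755… = 41.

```python
from math import log2

def h0(text):
    """Maximum entropy: log2 of the number of distinct symbols in the text."""
    return log2(len(set(text)))

# log2(41) = 5.357552004618084, matching the h0 above: the modified,
# space-free transcription uses 41 distinct characters.
```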
Spaces are in the way of further progress.
Conclusion for Part II
With just three types of non-destructive alterations, we see h2/h1 of Voynichese climb to the level of a low-entropy medieval text. Its conditional character entropy itself remained too low, however.
Of these changes, two should be uncontroversial, since they rely only on a different interpretation of stroke groups: whether to write them down as one glyph or several. For researchers testing character entropy on EVA, I would advise at the very least considering replacing [ch] and [sh] with a single character each, and doing the same for [i]- or [ai]-clusters.
Finally, I hit a brick wall again because of spaces. Voynichese has very limited options word-initially, and combining common word-initial bigrams does not really increase conditional entropy.