This post assumes that you are familiar with character entropy and its relation to Voynichese and EVA. If not, please start here: Entropy Hunting.
In my first post about conditional character entropy, I tested how successive changes to an EVA transcription of Voynichese impact its entropy values. It was easy to get up to a certain level, but after that, additional changes made little impact. I assumed there are two causes for this:
- Introducing new characters for common bigrams raises not only h2, but h1 as well. Since the formula I’m using is h2/h1, we can expect to reach a virtual ceiling after a while.
- Even if we replace common clusters like [iin] with single characters, these are often still positionally rigid. In the case of [iin], its substitute would also always be followed by a space, which is a low-entropy situation.
In the first post, I added each tweak on top of the previous one; at the fourth step, I could not further increase h2/h1. Was this because the limit for introducing new characters (and increasing h1) had been reached, or just because my later replacements were ineffective?
To answer this question, I will redo the test, but now always starting from an unmodified EVA transcription, so we can monitor the independent impact of each change. Additionally, I will test some suggestions made by Nick Pelling in a comment.
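As a reminder, all values below come from the same kind of calculation. A minimal Python sketch of it (an illustration, not necessarily my exact script) looks like this:

```python
import math
from collections import Counter

def entropies(text):
    """Return (h0, h1, h2, h2/h1), treating spaces as ordinary characters."""
    unigrams = Counter(text)
    bigrams = Counter(zip(text, text[1:]))
    n1, n2 = sum(unigrams.values()), sum(bigrams.values())

    h0 = math.log2(len(unigrams))  # maximum entropy for this alphabet size
    h1 = -sum(c / n1 * math.log2(c / n1) for c in unigrams.values())
    # conditional entropy of a character given its predecessor:
    # H(next | prev) = H(prev, next) - H(prev)
    h_pair = -sum(c / n2 * math.log2(c / n2) for c in bigrams.values())
    h2 = h_pair - h1
    return h0, h1, h2, h2 / h1
```

Every change below is measured by running a transformed copy of the transcription through this function and comparing its h2/h1 against the unmodified baseline.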
Let’s start by individually testing the changes I made successively in the previous post.
- [ch] –> C, [sh] –> S
- [ckh] –> Ck, [cth] –> Ct, [cfh] –> Cf, [cph] –> Cp
- [iin] –> M, [in] –> N, [iir] –> W, [ir] –> V
- [qo] –> q
- [ee] –> U
- [y] –> a
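To make “in isolation” concrete: each group of replacements is applied to a fresh copy of the unmodified transcription before measuring. A rough sketch, reusing the entropies() helper above (the file name is only a placeholder):

```python
# Each group of changes is applied to a fresh copy of the unmodified EVA
# transcription, so every measured delta is the effect of that group alone.
steps = {
    "benches":         [("ch", "C"), ("sh", "S")],
    "benched gallows": [("ckh", "Ck"), ("cth", "Ct"), ("cfh", "Cf"), ("cph", "Cp")],
    "i-clusters":      [("iin", "M"), ("iir", "W"), ("in", "N"), ("ir", "V")],
    "qo":              [("qo", "q")],
    "ee":              [("ee", "U")],
    "y -> a":          [("y", "a")],
}

base = open("herbal_A_eva.txt").read()     # placeholder file name
base_ratio = entropies(base)[3]

for name, pairs in steps.items():
    text = base
    for old, new in pairs:
        text = text.replace(old, new)      # longer patterns listed first on purpose
    print(f"{name}: {entropies(text)[3] - base_ratio:+.3f}")
```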
The following graph shows the difference in h2/h1 each change in isolation made to the base transcription.
A number of things stand out. First of all, the effect of these changes combined is much greater than the sum of individual changes. Pulling apart benched gallows (step 2) even has a negative impact if it’s done without first collapsing regular benches as well. The best strategy seems to be to treat benches and benched gallows in one step (dark bar below).
Converting i-clusters to their own characters also makes a decent impact. However, converting [qo] to [q] and [ee] to [u] seems useless, even in isolation. Equalizing [y] and [a] does have an effect, which is increased when combined with other changes.
This also means that the properties of EVA combined drag down h2 even more than they would individually.
Nick Pelling’s comments
In response to the first entropy post, Nick Pelling made some interesting remarks. First, he suggested more likely candidates for tokenization:
I’ve spent a long time suggesting that ar/or/al/ol/am (I expect om will turn out to be a copying slip for am) should also be parsed as tokens, and that there’s a good chance dy will too (I’m kind of 50-50 on dy). I also wondered in Curse whether o+gallows and y+gallows should be parsed as (verbose) tokens. Moreover, I think an/ain/aiin/aiiin and air/aiir/aiiir should also be read as tokens, rather than breaking them down into strokes below that.
He also added that we don’t know whether Voynichese contains numerals, punctuation (like truncation marks) or some kind of metadata. If it does, it becomes much harder to find suitable texts for comparison. I will test punctuation in a future post, and focus now on replacing common glyph clusters.
Replace [aiin]
Let’s start with the easy bits; I’m most curious to find out what the impact would be of not only transcribing [iin] as one character but also including the preceding [a].
In EVA, we transcribe the above as [daiin], a very common “word”. In my first test, I would have altered this to [daM], eliminating the internal predictability of [iin]. What Nick says is that [a] should be included as well, so I could transcribe this, for example, as [dM]. What would the impact of this change be on entropy? Starting from the original EVA transcription, I altered as follows:
[an] –> 1
[ain] –> 2
[aiin] –> 3
[aiiin] –> 4
[air] –> 5
[aiir] –> 6
[aiiir] –> 7
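A sketch of this test, again starting from the unmodified transcription and reusing the helpers above (the digit substitutes are arbitrary, as before):

```python
# [ai]-cluster replacement on a fresh copy of the unmodified EVA text.
# Longest clusters first as a precaution, although these a-prefixed patterns
# do not actually contain one another the way [iin]/[in] do.
ai_clusters = [
    ("aiiin", "4"), ("aiin", "3"), ("ain", "2"), ("an", "1"),
    ("aiiir", "7"), ("aiir", "6"), ("air", "5"),
]

text = base
for old, new in ai_clusters:
    text = text.replace(old, new)
print(f"delta h2/h1: {entropies(text)[3] - base_ratio:+.3f}")
```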
The original VM fragment achieved 55% of its maximum h2, which we’d like to increase to over 70%. So for (h2/h1), the gap we have to bridge is 0.15 at the very least. Replacing just [i]-clusters adds 0.03, while replacing [ai]-clusters adds 0.05. So Nick’s suggestion is much more effective.
Replace [ar/or/al/ol/am]
Nick also suggests turning ar/or/al/ol/am into one token each, which is something I hadn’t considered yet. Let’s see what it does (just following my keyboard for the replacement characters – all characters are equal for entropy, and my goal is not to produce a readable text).
[ar] –> 8
[or] –> 9
[al] –> 0
[ol] –> &
[am] –> é
Starting from an unmodified EVA file, these replacements add 0.017 to h2/h1, towards the required increase of 0.15. It’s not as much as some other modifications, but it’s something. Replacing [om] by the same token as [am] does not change this value at all, so we can set aside this additional suggestion.
Replace [dy]
Nick gave this one a 50% chance. Whatever may be going on with [dy], I’d predict that replacing it with one character would not have a huge effect, because this character is still positionally restricted.
As it turns out, this operation decreases h2 by a tiny 0.001. Keep in mind that it might be more effective in conjunction with other measures.
Replace o+gallows and y+gallows
This is a tricky one. On the one hand, I think it may be an essential operation, if only to get some more variation at the beginning of “words”. On the other hand, there are so many factors involved. What about benched gallows? What about [qo] + gallows?
For this test, I’ll just stick with [o] or [y] + gallows. There are four different gallows, so this requires the addition of eight symbols.
[ok] –> ‘
[ot] –> (
[op] –> §
[of] –> è
[yk] –> !
[yt] –> ç
[yp] –> à
[yf] –> )
In isolation, this operation performs even worse than the previous one, lowering our entropy value by 0.015.
Overview
Below, I summarize what each transformation we have tried so far would individually add towards our goal of a 0.15 increase of (h2/h1). Again, keep in mind that combined replacements may give different results.
Changing [y] to [a] is a different type of operation than the others, so I will ignore this for now. Parsing [i]-clusters as tokens will be dropped in favor of [ai]-clusters. We have already learned that handling benches and benched gallows simultaneously has a positive effect. This leaves us with three operations with demonstrated efficiency:
- benches + benched gallows: +0.054
- [ai]-clusters: +0.047
- [ar, or,…]: +0.017
The sum of these individual operations is +0.118 out of a required +0.150. What is their effect when applied cumulatively?
Combining just these replacements adds +0.168, raising our h2/h1 to 0.716. That is still at the very bottom of the range for normal text, but it is much better than before, all with a few not-too-controversial choices. However, h2 by itself is still at 2.678 (up from 2.085), while we really need a minimum of 3.
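For completeness, a sketch of this cumulative run, chaining the three kept operations on top of each other (again reusing base and entropies() from the sketches above; the exact substitute characters don’t matter for entropy):

```python
# Cumulative run: the three kept operations applied on top of each other.
combined = (
    [("ckh", "Ck"), ("cth", "Ct"), ("cfh", "Cf"), ("cph", "Cp"),
     ("ch", "C"), ("sh", "S")]                              # benches + benched gallows
    + [("aiiin", "4"), ("aiin", "3"), ("ain", "2"), ("an", "1"),
       ("aiiir", "7"), ("aiir", "6"), ("air", "5")]         # [ai]-clusters
    + [("ar", "8"), ("or", "9"), ("al", "0"), ("ol", "&"), ("am", "é")]
)

text = base
for old, new in combined:
    text = text.replace(old, new)

h0, h1, h2, ratio = entropies(text)
print(h2, ratio)   # my run gave h2 ≈ 2.678 and h2/h1 ≈ 0.716
```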
What if we ignore spaces?
Another thing to test is what would happen if spaces are removed from Voynichese. Imagine for example that Voynichese’s h2 has been lowered artificially because a space was introduced before or after specific characters. This would lower entropy, because the position of the character “space” is predictable. It follows [n], precedes [q] and so on.
With spaces removed from the unmodified transcription, h2/h1 = 0.604, which is a 0.056 increase. Removing spaces from the altered file raises h2/h1 to 0.807 (!) and h2 to a satisfactory 3.145.
However, removing spaces is of a different order than changing EVA [ch] to [C]. We don’t know whether “the bench” is one glyph or two combined glyphs, so representing it as one or two letters is a choice one must make. There are no conclusive arguments in favor of either option, and the choices made for EVA are not better or worse than other options. But removing spaces is a much more invasive procedure, since spaces are clearly present in the text: it is destructive, because the original transcription can no longer be reconstructed from the result.
Still, this is something to keep in mind. Not only can we bump Voynichese entropy by removing spaces, we can also easily lower entropy in other texts by introducing spaces. Statistically, this is nothing special, because a space is a character like any other. If you shove spaces in at predictable locations, you decrease entropy. The question is whether it is possible that something like that happened in Voynichese.
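A toy illustration of that last point, not part of the original experiment: take any comparison text and insert a space after every vowel. The new spaces sit in perfectly predictable positions, so h2/h1 drops.

```python
# Toy example: spaces inserted at predictable positions lower conditional
# entropy. entropies() is the helper above; the file name is a placeholder.
sample = open("latin_sample.txt").read()
spaced = "".join(c + " " if c in "aeiou" else c for c in sample)

print("original h2/h1:", round(entropies(sample)[3], 3))
print("spaced   h2/h1:", round(entropies(spaced)[3], 3))
```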
Other common bigrams?
Are there other replacements that improve our modified transcription?
[e, ee, eee, eeee]
I tried experimenting with [e]-groups in various ways, but this has had limited effect so far. Even on the altered transcription, it does little to increase entropy.
[o, y]+gallow
The [o]-character in Voynichese is predictable, and many of Nick’s suggestions involve it altering the next glyph. But here’s the problem: [o]+gallows often appears word-initially. If you were to remove these [o]’s, entropy would decrease, because the [o]’s add an element of unpredictability to gallows that otherwise always stand word-initially. You can artificially counter this by replacing each combination with a new character, but the effect on h2 is limited, while you do increase h0.
Simply put, replacing [o,y]+gallows with one character only adds more characters that uniquely appear at the beginning of words.
Now if you combine this mess with the removal of spaces, you get good results:
h0 = 5.358
h1 = 4.018
h2 = 3.234
h2/h1 = 0.805
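(For reference, h0 is simply the binary log of the alphabet size: the modified, space-less text uses 41 distinct characters, and log2(41) ≈ 5.36.)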
Spaces are in the way of further progress.
Conclusion for Part II
With just three types of non-destructive alterations, we see h2/h1 of Voynichese climb to the level of a low-entropy medieval text. Its conditional character entropy itself remained too low, however.
Of these changes, two should be uncontroversial, since they rely only on a different interpretation of stroke groups: whether to write them down as one glyph or many. For researchers testing character entropy on EVA, I would advise at the very least considering the replacement of [ch] and [sh] by a single character each, and doing the same for [i]- or [ai]-clusters.
Finally, I hit a brick wall again because of spaces. Voynichese has very limited options word-initially, and combining common word-initial bigrams does not really increase conditional entropy.
Koen, while you work upwards I have been working “downwards”…
Japanese transliterated into the Latin alphabet gives an h2/h1 of 0.708, which is very low indeed for a natural language.
So actually you passed the border of modern languages with substitutions alone, at 0.716!
Anyway, Voynich does not need such rigorous substitutions to get into that neighborhood, as the substitutions up to [y->a] already showed. Here is an overview:
h2/h1=0.805 h2=3.234 h1=4.018 Herbal-A [spaces]
h2/h1=0.798 h2=3.223 h1=4.040 Modern Dutch (Het Diner)
h2/h1=0.716 __________________ Herbal-A up to [ar, or, …]
h2/h1=0.708 h2=2.746 h1=3.877 Modern Japanese in romaji
h2/h1=0.698 h2=2.790 h1=3.994 Modern Dutch (Het Diner) in hiragana
h2/h1=0.666 h2=2.406 h1=3.611 Herbal-A up to [y->a]
h2/h1=0.544 h2=2.073 h1=3.811 Herbal-A
You will note a strange item in this list: I have transformed Dutch into Japanese romanized hiragana – i.e. what it would be if the complete Dutch text were treated as Japanese loan words. It was done by a simple algorithm, so it is very crude. But the result is similar to actual Japanese hiragana!
Ger Hungerink.
This is what my (crude) algorithm made of the Dutch, converted into romanized hiragana. Some transformations first: ch>x, ng>n, ij>y, double letters>single.
we gingen eten in hetu resutaurantu iku ga nietu zegen weluku resutaurantu wantu dan zitu hetu eru de volugende keru warusuxynlyku volu metu mensen die komen kyken ofu wy eru oku weru ziten seruge hadu gereseruverudu datu doetu hy alutydu reseruveren hetu resutaurantu isu eru zo en van hetu sorutu waru je durie manden van tevoren moetu belen ofu zesu ofu axutu iku ben inmidelusu de telu kuwytu zelufu wilu iku noitu durie manden van tevoren weten waru iku opu en bepalude avondu ga eten maru kenelyku zyn eru mensen die daru totalu gen moeite me heben alusu de hisutorici overu en paru euwen wilen weten hoe axuterulyku de mensuheidu wasu an hetu begin van de enentuwintigusute euwu hoeven ze alen maru en kykuje in de comuputerusu van de zogenoemude topuresutaurantusu te nemen wantu alu die gegevensu woruden bewarudu datu wetu iku toevaligu waneru meneru lu de vorige keru bereidu wasu omu durie manden opu en ramutafelutuje te waxuten dan waxutu hy nu oku welu vyfu manden opu en tafelutuje nasutu de deuru van hetu toiletu
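A rough sketch of what such a crude conversion could look like; the substitutions are the ones listed above, but the “u”-insertion rule is guessed from the sample output, since the actual algorithm wasn’t published:

```python
import re

def dutch_to_romaji(text):
    # Substitutions listed in the comment above.
    text = text.lower()
    text = text.replace("ch", "x").replace("ng", "n").replace("ij", "y")
    text = re.sub(r"(.)\1", r"\1", text)          # double letters -> single
    # Guessed epenthesis rule: pad consonants not followed by a vowel with "u",
    # except "n", which Japanese allows to close a syllable.
    vowels = "aeiouy"
    out = []
    for i, c in enumerate(text):
        out.append(c)
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if c.isalpha() and c not in vowels and c != "n" and nxt not in vowels:
            out.append("u")
    return "".join(out)

print(dutch_to_romaji("het restaurant"))          # -> "hetu resutaurantu"
```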
Actually, when written in Japanese they would use katakana, as for all foreign non-Chinese words 🙂
Haha, that’s certainly a unique experiment. But the situation of Japanese is so special; basically you *have* to lower the entropy of the Dutch, because so many combinations are not allowed, so you must shove “u” in between.
A similar thing is going on in Voynichese, but on top of that there is positional rigidity. You can go only so far in altering the interior of words, before you hit the problem that what comes before and after spaces remains pretty predictable.
So I don’t see how we can solve the Voynichese entropy problem if we don’t equalize certain glyphs that are in complementary distribution, like y -> a. This is hard to sell though, because you’d be guessing which initial/centre/end forms are positionally dependent expressions of the same phoneme.
In general, I see this as a promising approach. It fits into the category of ‘verbose cipher’, where a plaintext character is represented by a group of symbols in the MS. It remains a challenge to get all three statistics to match those of a normal language. I once posted a result in the Ninja forum where I also achieved an h2 over 3.0. In my case, the h1 was too low because the most frequent PT character was far too frequent. In your case I see the problem with the PT alphabet size, which is 41.
But like I said, I consider this promising.