In information theory, entropy is the opposite of predictability or certainty. For example, before a die roll (6 options) there is more uncertainty about the outcome than before a coin flip (2 options). This post deals with character entropy: how predictable are characters within a text? One kind of character entropy appears to set Voynichese apart from “normal” texts more than any other: h2, the conditional entropy. What this means is best explained through some examples.
Take the letter “q” in English as our current character. When q is the first letter of an English word, which letter will follow? This is easy to predict: u. So the entropy here is low. In fact, when q is not followed by u in English, the word is likely foreign, as in Qom (the seventh largest city in Iran), Qatar, qanat, Qin etc. The tendency of q to be followed by u decreases the character h2 of English.
On the other hand, the letter r can be followed by almost anything, in alphabetical order: rat, curb, starch, lard, red, wharf, surge, Rhode, rise… you get the picture. This increases conditional entropy because it’s hard to tell what will follow r. Note that even when a letter can be immediately followed by many others, it might still prefer one or two. This again influences entropy statistics.
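The q and r examples can be turned into numbers with a few lines of Python. This is only a minimal sketch, not the code used for this post: the function names and the short English sample are made up for illustration, and it assumes the usual definition of conditional entropy as the average “follower entropy” of each character, weighted by how often that character occurs.

```python
from collections import Counter, defaultdict
from math import log2

def followers(text):
    """Map each character to a Counter of the characters that follow it."""
    table = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        table[cur][nxt] += 1
    return table

def entropy(counts):
    """Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def h2(text):
    """Conditional entropy: follower entropies weighted by character frequency."""
    table = followers(text)
    n = len(text) - 1  # number of (current, next) pairs
    return sum(sum(nxt.values()) / n * entropy(nxt) for nxt in table.values())

# Hypothetical toy sample, far too short for real statistics.
sample = "the queen quit quietly; a rat ran, red rose, rise, surge, wharf, lard"
table = followers(sample)
q_entropy = entropy(table["q"])  # zero here: q is always followed by u
r_entropy = entropy(table["r"])  # well above zero: r has many followers
```

In a real text, letters like q pull h2 down and letters like r push it up, each in proportion to how often they occur.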
Character h2 of Voynichese is notoriously low, which means the next character is often predictable. In general, [i] is followed by [i] or [n]. And [n] is mostly followed by a space. [q] is followed by [o] and so on.
What I’d like to do today is run some tests to see how the commonly used transcription alphabet EVA influences Voynichese h2 values. The thing to understand about EVA is that it is, in part, stroke-based. Consider the following word: how many letters would you count?
Minimally, one would count four, all separated by a tiny space: a big loopy thing, a horizontal bar with two legs, something that looks like “u”, and something like 9. But you could also count more: the first one could be a ligature, the second one as well, the “u” could be two c’s. So one could reasonably transcribe the above using four to seven characters. In EVA, this would be [kcheey]. We transcribe some groups of strokes as one glyph (k), but others are pulled apart (ee).
I think EVA is well-designed and serves its purpose. We simply don’t know the Voynich glyph inventory, so a balance has to be found between transcribing as single glyphs or strokes. But I often wonder what the effect would be if the strokes were divided differently. This post is not meant as a proposed “correct” way of transcribing Voynichese. I’m just exploring the impact of some options.
(Check the Wiki page on minims for examples of how dividing strokes into glyphs was also problematic in Gothic scripts.)
The conditional character entropy of Voynichese-as-EVA is unusually low, so it is often easy to predict what the next glyph will be. How low exactly?
The first experiment I did to get a feel for these values involves four texts:
- the first 40,000 characters of VM Quire 13 (including spaces)
- the first 40,000 characters of Pliny’s Natural History (including spaces, all lowercase, punctuation removed)
- the exact same Q13 text but with all characters scrambled randomly
- the exact same Pliny text but with all characters scrambled randomly
The Q13 scramble starts like this. Note that there are also consecutive spaces but WordPress removes those.
a herko akndcclyq lohlkeseal ld y k qhsy qyhedo dyekooylosspi eea okolyleh actqaeqyk dqoqetyholeotyoordhdnkdhdnl ded e yyo h yd hed h eqecal paqkdet le y te cyqiscyedfyryoleeh qeqayaeeekqyy ooieyooldkq yl skos r drks kn nd y dy e n qhooqynkonlkhcsylyee ptn yoa q lleqydk oootnshl a yec ddkhdnnooi yede dlcded y q tqyyo dket a yehdeckio o ho ee hhaih r lyhrks ciceqosedcehk klyqadehrhyhhtatal soi qoylao eyak hk cikhs oqpenyaed e dcyehy aqceeeocioyyyhaqdnyo ielleeghey lelrl loshro kykay r laod scilie
And here’s an example from the Pliny scramble:
buondseiaelsmtbr me remlatpaciut toofuantu iisnaiai enilin e uriuvprh ttciougiielosgr ci o suoob tlauee csvuescscctus aap s npiaeeu uiniieaoasnpua cif muma r irimiaeirmmtxs enctpeurtaieaecseadbdf npaet rsuuroe amei le bux di s iasneniue boseirnur anus tnalmmmri tssdneu enrnmtemqnocisneciseepeu un l dtn irfeheirsburrtdui rquim setem l mfninesqbe c nnviiuqmeedoedmi traaolrnt asrsvqnuc uno uitesor
(It’s funny to see how even after scrambling, you can still spot which one is EVA).
What’s the point of this? Well, scrambling adds entropy; it breaks down any patterns present in the text, and predicting the next glyph is now purely a matter of knowing glyph frequencies and always guessing the most common one. Think of it as shuffling a sorted deck of cards: your chances of correctly predicting the next card will plummet after shuffling.
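The effect of scrambling can be sketched as follows. This is an illustration under assumptions, not the code behind the graphs: the sample is a toy stand-in built from a handful of EVA words rather than the Q13 text, the helper names are hypothetical, and h2 is computed as the entropy of character pairs minus the entropy of the first character of each pair, which is one standard way to obtain conditional entropy.

```python
import random
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def h1(text):
    """Entropy of single characters: depends on frequencies only."""
    return entropy(Counter(text))

def h2(text):
    """Conditional entropy: H(pairs) - H(first character of each pair)."""
    return entropy(Counter(zip(text, text[1:]))) - entropy(Counter(text[:-1]))

def scramble(text, seed=0):
    """Shuffle all characters (spaces included); the seed keeps runs repeatable."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

# Toy stand-in text, NOT the actual Q13 transcription.
sample = "qokeedy qokedy chedy shedy daiin okaiin " * 50
shuffled = scramble(sample)

print(round(h1(sample), 3), round(h1(shuffled), 3))  # same value twice
print(round(h2(sample), 3), round(h2(shuffled), 3))  # second value is higher
```

Shuffling leaves the character counts, and therefore h1, untouched. Only the order changes, which is why h2 climbs towards h1.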
Here’s how the conditional character entropy of these four texts compares, expressed as a fraction of their theoretical maximum entropy.
For the original texts (blue columns), Voynichese is much more predictable than Latin, resulting in a low h2. But both scrambled texts (red) achieve high h2 values close to the maximal entropy for their respective texts.
But what about other scripts, abjads…? Latin script might be our best bet. I tested Greek and its h2 is much higher, while we need it to be lower. Lindemann explored Arabic, Syriac and Amharic, again leading to a higher h2. He also tested texts transcribed with abbreviations maintained, which again did not lower h2.
Let’s make some entropy!
The low end for conditional entropy in European texts is around 3.0, while Herbal A does best for Voynichese, with h2 = 2.1. That’s an enormous difference. But our transcription is in EVA, which lowers entropy; Lindemann calculated that Currier’s transcription would increase VM entropy to about 2.4, which is still low. Many of the tweaks I can make on basic EVA will bring it closer to Currier.
So this is our goal. Let’s start from EVA and see which changes we can make, and what their impact is on h2. Again, this is no proposed solution or correction, just an experiment with statistics.
1. EVA [ch] and [sh]
There are a few obvious starting points, and one of them is what we call “bench characters”, transcribed in EVA as [ch] and [sh]. Transcribing them as one character (which they may well be) instead of a bigram should increase entropy a bit. But by how much? Since character entropy is case sensitive, it is convenient for me to use uppercase for any replacements I make. So I’ll use C and S.
[ch] –> C
[sh] –> S
This modification has a noticeable impact on h2, but we still have a long way to go.
Herbal A will give us the best shot, so I will drop Q13 and Q20 from further graphs. I’ll also use (h2/h1) from now on, since I’ve been told (by Anton) that h1 must be taken into account when comparing various texts. This can be done in various ways, but I like h2/h1 because this gives h2 as a percentage of the text’s maximal h2.
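For reference, this is why h2/h1 can be read as a percentage. The block below assumes the standard Shannon definitions, with p(x) the frequency of character x and p(x, y) the frequency of the pair “x followed by y”. The tools used for this post may differ in small details, such as how the first and last character of the text are handled.

```latex
\begin{align*}
h_1 &= -\sum_{x} p(x)\,\log_2 p(x) \\
h_2 &= -\sum_{x,y} p(x,y)\,\log_2 p(y \mid x) \\
0 &\le h_2 \le h_1
\quad\Longrightarrow\quad
0 \le \frac{h_2}{h_1} \le 1
\end{align*}
```

The upper bound h2 = h1 is reached when knowing the current character tells you nothing about the next one, which is the situation a scrambled text approaches.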
In the above graph, VM_0 is the original 40,000 character text from Herbal A. VM_1 is the first alteration. Subsequent alterations will be cumulative, so ideally they will appear between VM_1 and our goal.
This first transformation brings us from 55% to 58%.
2. EVA [ckh, cth, cfh, cph]
These are what we call benched gallows. They are complex characters, and we don’t quite know how to handle them. Recently, JK Petersen explained on the forum that stacking letters like this was not uncommon in Greek, and that there is no fixed order to read them. This leaves me with three options: transcribe as bench-gallow, gallow-bench or a unique character.
Bench-gallow beats the other options by a tiny margin, so I’m going with that:
[ckh] –> Ck
[cth] –> Ct
[cfh] –> Cf
[cph] –> Cp
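The three options can be sketched as a single function. Two things here are my assumptions rather than something stated above: the function and mode names are made up, and the “unique character” option simply uses the uppercase gallows letter (K, T, F, P), which may not be what was used in the actual test.

```python
GALLOWS = "ktfp"  # the four EVA gallows letters

def rewrite_benched_gallows(text, mode):
    """Rewrite EVA [ckh], [cth], [cfh], [cph] in one of three ways.

    mode: "bench-gallow" -> Ck, "gallow-bench" -> kC, "single" -> K
    (the uppercase letters for "single" are an assumption for illustration).
    """
    for g in GALLOWS:
        replacement = {
            "bench-gallow": "C" + g,
            "gallow-bench": g + "C",
            "single": g.upper(),
        }[mode]
        text = text.replace("c" + g + "h", replacement)
    return text

for mode in ("bench-gallow", "gallow-bench", "single"):
    print(mode, rewrite_benched_gallows("ckhy cthy", mode))
# bench-gallow Cky Cty
# gallow-bench kCy tCy
# single Ky Ty
```

Because [ckh] does not contain the substring “ch”, this step does not interfere with the earlier [ch] –> C replacement, whichever of the two is applied first.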
And so we cross the barrier of 60%… barely.
3. EVA [i]
Ignoring rare cases, [i], which is basically a minim, can only be followed by [n], [r] or another [i]. The number of [i] in such clusters varies.
For now, I’ll take a page from Currier’s book, but only for the most common clusters.
[iin] –> M
[in] –> N
[iir] –> W
[ir] –> V
My reasoning is that [n] might just be [i] with a decorative swoop (flourish). In that way, it makes sense to count it as a minim as well. Next, I converted the few remaining ii-clusters to their corresponding uppercase letter, again treating them like minims in Gothic script.
[iii] –> M
[ii] –> N
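One practical detail when applying these replacements with plain find-and-replace is their order: longer clusters have to be handled before the shorter ones they contain. The sketch below assumes simple sequential string replacement, with hypothetical names. It is not necessarily how the texts for this post were converted.

```python
# Longest clusters first, then the leftover ii-clusters.
I_CLUSTER_RULES = [
    ("iin", "M"), ("in", "N"),
    ("iir", "W"), ("ir", "V"),
    ("iii", "M"), ("ii", "N"),
]

def apply_rules(text, rules):
    """Apply (old, new) replacements one after another, in list order."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

print(apply_rules("daiin dain daiir dair", I_CLUSTER_RULES))
# daM daN daW daV

# Wrong order: [in] is consumed before [iin] gets a chance to match.
wrong_order = [("in", "N"), ("iin", "M")]
print(apply_rules("daiin", wrong_order))
# daiN
```

With this approach a rare cluster such as [iiin] ends up as [iM], so a stray [i] can survive the conversion.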
Much to my surprise, this iteration (VM_3) surpassed Currier’s transcription, which I’ve added in red:
4. EVA [q]
Next up is EVA [q]. This is almost always word-initial and followed by [o]. Both characteristics have a negative impact on entropy. Currier transcribed [qo] as [q], which seems like a good idea.
[qo] –> q
Somewhat surprisingly, this hardly raises h2. I suspect that collapsing [o] into [q] is not enough to offset the effects of [q]’s word-initial position. No graph this time: the difference is only 0.0025.
5. EVA [e]
EVA [e] is like [i] because it often appears in improbable sequences. For example, Herbal A contains the word [deeeese].
As you’ll notice above, the central [eeee] cluster comprises two parts, and indeed [ee] is often connected. I will therefore consider [ee] a single glyph.
[ee] –> U
Again, this change has very little impact. I think two things are happening:
- With the new glyphs I’ve been introducing, h0 goes up. And with that, h1 rises as well: a larger glyph set allows for higher entropy. Hence, h2/h1 remains constant, while we need it to go up.
- Despite collapsing frequent glyph pairs, there is still a rigid structure. I’m just kicking the can down the road.
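The first point can be illustrated with a toy example. The sketch assumes the common convention that h0 is the base-2 logarithm of the number of distinct characters in the text. The sample string is made up and only shows the mechanism, not real Herbal A figures.

```python
from math import log2

def h0(text):
    """Zero-order entropy: log2 of the number of distinct characters."""
    return log2(len(set(text)))

# Toy sample; [c], [h] and [s] still occur outside the bench characters.
before = "chol shol sol cthol"
after = before.replace("ch", "C").replace("sh", "S")

print(len(set(before)), round(h0(before), 3))  # 7 2.807
print(len(set(after)), round(h0(after), 3))    # 9 3.17
```

The new symbols C and S are added while c, h and s remain in use elsewhere, so the glyph set grows and the ceiling for h1 rises with it.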
Time to bring out the big guns.
6. EVA [y]
One of the first things I noticed when going over a page of Voynich text years ago is that [y] is [a] with a tail. Additionally, they are in complementary distribution; [y] favors the beginnings and ends of words, places where we don’t find [a].
What would happen if I change [y] to [a]? When transcribing a regular medieval text, we would also ignore place-dependent variations of the same glyph. We don’t know whether that is what’s going on here, but all things considered, the concept is not unusual.
[y] –> [a]
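Put together, the six steps form a cumulative pipeline. The sketch below applies the replacement lists from this post in order and keeps every intermediate version. The function name and the sample words are hypothetical, and plain sequential string replacement is an assumption about the mechanics.

```python
# Replacement lists as given in steps 1 to 6, applied cumulatively.
STEPS = [
    ("VM_1", [("ch", "C"), ("sh", "S")]),
    ("VM_2", [("ckh", "Ck"), ("cth", "Ct"), ("cfh", "Cf"), ("cph", "Cp")]),
    ("VM_3", [("iin", "M"), ("in", "N"), ("iir", "W"), ("ir", "V"),
              ("iii", "M"), ("ii", "N")]),
    ("VM_4", [("qo", "q")]),
    ("VM_5", [("ee", "U")]),
    ("VM_6", [("y", "a")]),
]

def cumulative_versions(text):
    """Return a dict with VM_0 (the input) and the text after each step."""
    versions = {"VM_0": text}
    for name, rules in STEPS:
        for old, new in rules:
            text = text.replace(old, new)
        versions[name] = text
    return versions

# A few EVA words as a toy input, not a real page of text.
versions = cumulative_versions("qokeey daiin chol shedy kcheey ckhy")
for name, text in versions.items():
    print(name, text)
# VM_0 qokeey daiin chol shedy kcheey ckhy
# VM_1 qokeey daiin Col Sedy kCeey ckhy
# VM_2 qokeey daiin Col Sedy kCeey Cky
# VM_3 qokeey daM Col Sedy kCeey Cky
# VM_4 qkeey daM Col Sedy kCeey Cky
# VM_5 qkUy daM Col Sedy kCUy Cky
# VM_6 qkUa daM Col Seda kCUa Cka
```

Keeping each version around makes it easy to compute h2/h1 per step and see which alteration actually moves the needle.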
The impact is considerable:
You can see that after VM_3, I had gotten the most out of collapsing possible digraphs into single characters. At VM_6, the pace picks up again because I eliminated [y], which might be a positional variation of [a].
I changed several common bigrams and trigrams into single characters and turned [y] into [a]. These tweaks increased the conditional character entropy of Herbal A from 55% of its maximum to 67%. This is an improvement, but still very low. The h2 value itself went from 2.09 to 2.43, which is still far from the target of 3.00.
The changes to bigrams that had most effect involved benches and [i]-clusters. For both, arguments can be made that they are better represented as single glyphs for entropy calculations. In the next alterations to bigrams, however, I noticed I had reached the ceiling.
If more progress is to be made, it might be necessary to detect other situations like [a] and [y], where two glyphs might be positional variations of the same “phoneme”. If you have any suggestions, I’ll gladly test them.
As usual, I owe thanks to the helpful people at the Voynich.ninja forum, among others Nablator for the code, Anton for his entropy tutorial and Rene for various pointers on statistics.
 This post was in part prompted by the slides for a 2018 talk by Luke Lindemann (pdf) exploring the relation between transcription and character entropy.
 Alternatively, one could use Glenn Claston’s transcription, but this has its own problems. Claston attempted to include smaller differences between glyphs which might be meaningful. Still, with all these variables added, h2 for the first 40,000 characters is “only” 2.6. What’s really problematic is that this transcription’s h0-h1 is two to four times as large as that of other texts. Even for a modern Flemish news article including capitalization, numerals and punctuation, h0-h1 is 2.0, compared to 3.1 for Claston.
 The complementary distribution of [a] and [y] was pointed out by Emma May Smith, although she does not believe the difference is merely ornamental.