EDIT 19 October 2020: Marco Ponzi noticed that something seemed off with the values. Upon double checking, I realized that I had made several mistakes. The overall point remains valid, but the values in this post are not reliable. Please refer to this new post for corrections and expansion:


This research benefited greatly from a number of recommendations by Marco Ponzi, and I ended up rewriting this post with his advice in mind. He also helped a lot in streamlining the corpus.

Much of this post is the description of a number of experiments I did, looking for ways to increase entropy. If you don’t like to take the scenic route, I marked the conclusion with (TL/DR).


A well-known problem with Voynichese is that its conditional character entropy is low compared to other texts: for any Voynichese glyph, the next glyph is too predictable. Too often, though, this low entropy is accepted as a given, while it may at least in part be caused or amplified by the transliteration system. For example, if I were to represent English [m] as [iii], conditional character entropy would drop, because the chance of [i] following another [i] increases.
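To make the two quantities concrete, here is a minimal sketch of how h1 (character entropy) and h2 (conditional character entropy) can be computed. It assumes the text is treated as one plain string and that h2 is taken as the bigram entropy minus the entropy of the leading characters; the function names are my own, and the toy string is not real text. Writing [m] as [iii] in the toy string makes h2 fall.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy, in bits, of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def h1(text):
    """Entropy of single characters."""
    return entropy(Counter(text))

def h2(text):
    """Entropy of a character given the character before it:
    H(bigrams) minus H(first characters of those bigrams)."""
    return entropy(Counter(zip(text, text[1:]))) - entropy(Counter(text[:-1]))

# Toy string over [m, a, b] in which every possible bigram occurs exactly once.
sample = "mmaabbmbam"
inflated = sample.replace("m", "iii")  # write [m] as three minims

print(f"original: h1={h1(sample):.2f} h2={h2(sample):.2f}")
print(f"inflated: h1={h1(inflated):.2f} h2={h2(inflated):.2f}")
```

Whether spaces count as characters is a choice this sketch leaves to the caller: pass the text with or without them.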

That said, a lot of Voynichese’s entropy problems are caused by frequent glyph clusters that cannot be blamed on our transliteration system. In previous experiments like these, I attempted to focus on those problems that we may have introduced, for example by representing the “bench”-character as the bigram [ch]. In this post, by contrast, I will focus on the most frequent bigrams.

Side note: Before starting, I needed to decide what to do with the characters we call benched gallows. These look like two glyphs stacked on top of each other. EVA transcribes them as three characters: half bench + gallows + half bench. This is a disaster for entropy. Preliminary results showed that treating benches as single characters is important for raising entropy. This left me with two options. On the one hand, I could represent each benched gallows with one unique character, so [ckh] becomes [1], [cth] becomes [2] and so on. Alternatively, I could “unstack” the glyphs, as one would do with similar monograms in Greek, like the chi-rho symbol. I tested both methods, and unstacking had a more desirable effect on entropy values. However, as Marco pointed out to me, this leads to some information loss. I limited this by opting for the rare order bench+gallows instead of the much more common gallows+bench. This way, benched gallows can still be restored with 85-90% accuracy. Since benched gallows are also relatively rare, this was the only concession to information integrity I would allow.
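The unstacking step could look roughly like the sketch below. The mapping covers the four EVA benched gallows ([ckh], [cth], and by assumption also [cph] and [cfh]), and the example words are only illustrations. The naive inverse shows where the information loss comes from: a genuine bench+gallows sequence is indistinguishable from an unstacked benched gallows.

```python
# EVA benched gallows and their unstacked form, in the order bench+gallows.
UNSTACK = {"ckh": "chk", "cth": "cht", "cph": "chp", "cfh": "chf"}

def unstack(word):
    """Rewrite every benched gallows as bench followed by gallows."""
    for stacked, flat in UNSTACK.items():
        word = word.replace(stacked, flat)
    return word

def restack(word):
    """Naive inverse: read every bench+gallows as a benched gallows.
    A bench+gallows that was already in the original is restored wrongly."""
    for stacked, flat in UNSTACK.items():
        word = word.replace(flat, stacked)
    return word

print(unstack("ckhey"))           # benched gallows becomes bench + gallows
print(restack(unstack("cthol")))  # round trip recovers the original word
print(restack("chkal"))           # a genuine bench+gallows is misread
```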

Experiment 1: frequent bigrams

I wondered what would happen if I replaced the most frequent bigrams with single characters. For this first test, I used the Takahashi transcription of Q20. I removed words with unclear characters (marked with *). 

Starting with the most frequent bigram in my file, I replaced the bigrams in the following list one at a time, each with its own new character, and calculated the entropy values after each step.

  1. ch
  2. sh
  3. ai
  4. ee
  5. dy
  6. in
  7. ok
  8. ot
  9. ol
  10. ar
  11. al
  12. ey
  13. or
  14. eo
  15. lk
  16. ed
  17. od
  18. qo
  19. am
  20. ky
  21. op
  22. yk
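The procedure for a list like the one above can be sketched as follows: each bigram is replaced in turn by a fresh single character, cumulatively, and h1 and h2 are recomputed after every step. This is only an illustration under my own assumptions: the sample string is a handful of EVA-like words rather than the Q20 transcription, the new characters are taken from the Unicode private-use area, and spaces are counted as characters.

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def h1(text):
    return entropy(Counter(text))

def h2(text):
    return entropy(Counter(zip(text, text[1:]))) - entropy(Counter(text[:-1]))

def replace_in_steps(text, ngrams):
    """Replace each n-gram in turn by a new single character (cumulatively)
    and yield (ngram, h1, h2) after every step."""
    for step, ngram in enumerate(ngrams):
        text = text.replace(ngram, chr(0xE000 + step))  # private-use code point
        yield ngram, h1(text), h2(text)

# Short illustrative sample, not the actual corpus.
sample = "qokeedy shedy chol daiin okaiin chedy qokain otedy shol"
for ngram, a, b in replace_in_steps(sample, ["ch", "sh", "ai", "ee", "dy", "in"]):
    print(f"{ngram}: h1={a:.3f} h2={b:.3f}")
```

Each printed pair corresponds to one dot in a scatter plot of h2 against h1.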

I found a scatter plot the best way to represent this process:

(Figure: scatter plot of h1 and h2 after each bigram replacement.)

As you see, each time we replace a common bigram by one new character, conditional entropy (h2) increases. However, we also want to keep h1 in check. After the first few changes, h2 and h1 increase at a similar pace. At the end, little progress is made and our dots start bunching up.

This is how our Voynichese snake compares to the data for other medieval texts in various scripts and languages (note: so far, only Latin, Greek and Slavic scripts have been included; this should be expanded).


While we can already achieve significant gains to h2 this way, Voynichese remains well below the corpus. It is clear that replacing bigrams will become ridiculous well before we reach acceptable h2 values.

Experiment 2: recount after each step

So, what would happen if, after each replacement, I counted the most frequent bigrams anew? With this method, the new characters I introduced started emerging at the 6th step, which means that I was effectively replacing trigrams as well.

  1. ch
  2. ai
  3. ee
  4. dy
  5. ok
  6. (originally che)
  7. (originally aii)
  8. ot
  9. (originally aiin)
  10. ol
  11. sh
  12. (originally qok)
  13. ar
  14. al
  15. (originally ain)
  16. (originally eey)
  17. (originally she)
  18. (originally chedy)
  19. (originally eedy)
  20. or
  21. (originally qot)
  22. lk
  23. od
  24. (originally edy)
  25. (originally chey)
  26. (originally ody)
  27. ey
  28. am
  29. (originally air)
  30. (originally chk)
  31. op
  32. dc
  33. (originally shedy)
  34. (originally olk)
  35. (originally chdy)
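Recounting after every step amounts to a greedy merge loop, sketched below. The assumptions are mine: bigrams are counted inside words only, overlapping occurrences are counted separately, and new characters come from the Unicode private-use area. The function reports each merged sequence in its original EVA spelling, which is how entries like "(originally che)" arise.

```python
from collections import Counter

def merge_most_frequent(words, steps):
    """Repeatedly replace the currently most frequent bigram with a new
    character, recounting bigrams after every replacement."""
    expand = {}   # new character -> EVA sequence it stands for
    merged = []   # merged sequences, in order, spelled out in EVA
    for step in range(steps):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        new = chr(0xE000 + step)
        expand[new] = expand.get(a, a) + expand.get(b, b)
        merged.append(expand[new])
        words = [w.replace(a + b, new) for w in words]
    return merged, words

merged, words = merge_most_frequent(["chol", "chor", "chy", "daiin"], 2)
print(merged)  # the second merge builds on the character created by the first
```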

Now our graph looks like this:

(Figure: h1 and h2 after each replacement with recounting, shown against the reference corpus.)

This is still not exactly what we’d want, but the difference with the previous graph is clear. Now, the top Voynichese dots are at the bottom of the cloud of Slavic and Greek texts. For Latin script (the cloud on the left), h1 is too high. Also, this is the first time I have been able to raise Voynichese conditional character entropy comfortably above 3.0.

Comparing these graphs has been informative: it appears that we have to include trigrams and more. Note that I am talking about trigrams in EVA – we don’t quite know what single glyphs in Voynichese are. Also, not all modifications are equally beneficial. Some even raise h1 while doing nothing for h2. We may need to be more selective.

Experiment 3: most frequent n-grams

While I was pondering what to do next, Marco Ponzi sent me a list of the most frequent n-grams in the entire Voynich MS. This seemed like a good starting point, but first I needed to adjust the frequencies, adding more weight to longer sequences. A bigram will almost always be more frequent than any trigram that contains it. For example, if you replace the bigram [in] first, you miss the opportunity to replace the very frequent [aiin]. So I first multiplied the frequencies by (n-1), where n is the number of glyphs in the sequence. Then I sorted the weighted frequencies from high to low and started replacing in that order.
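The weighting can be sketched like this. How the original n-gram list was counted is not something I can reproduce here, so the counting below (all n-grams of length 2 to 4 inside each word, overlaps included) is an assumption, as are the function name and the sample words.

```python
from collections import Counter

def weighted_ngrams(words, max_n=4):
    """Count n-grams of length 2..max_n inside words and rank them by
    frequency multiplied by (n - 1)."""
    counts = Counter()
    for w in words:
        for n in range(2, max_n + 1):
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    weighted = {g: c * (len(g) - 1) for g, c in counts.items()}
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)

ranked = weighted_ngrams(["daiin", "aiin", "okaiin", "in"])
print(ranked[:3])
```

In this small sample [in] occurs four times and [aiin] three times, but the weights are 4 x 1 = 4 and 3 x 3 = 9, so [aiin] is replaced first.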

The plot below summarizes the results so far. The purple cloud is the non-VM corpus. The black dot is the VM Q20 in its original state (but with benched gallows unstacked). The other colors are different sequences of modifications to the original file. Green replaces only frequent bigrams, orange does trigrams. Yellow follows the weighted list of n-grams (with max n=4).

(Figure: h1 and h2 for the reference corpus and for the bigram, trigram and weighted n-gram sequences.)

Using only bigrams or only trigrams gives similar results, although the trigram line is more consistent. However, using the weighted list of n-grams appears more effective, with higher h2 for a nearly identical h1.

Marco then pointed out that some replacements appear more effective than others in increasing h2 without increasing h1. By replacing the vertical axis with h2/h1, we can amplify this effect. In the sequence charted below, ch, aiin, sh, ain and air were effective replacements, while replacements like dy, ol, ok and edy should probably be blacklisted.

(Figure: replacement sequence with h2/h1 on the vertical axis.)
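One way to put a number on how effective a single replacement is: compare h1, h2 and the ratio h2/h1 before and after it. The sketch below is my own illustration of that idea, not the exact procedure behind the chart, and the toy string is not Voynichese. A replacement would count as effective when h2 rises by more than h1, or equivalently for screening purposes when h2/h1 goes up.

```python
from collections import Counter
from math import log2

def entropy(counts):
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def h1(text):
    return entropy(Counter(text))

def h2(text):
    return entropy(Counter(zip(text, text[1:]))) - entropy(Counter(text[:-1]))

def rate_replacement(text, ngram, symbol="\ue000"):
    """Return (change in h1, change in h2, change in h2/h1) caused by
    replacing ngram with a single new character."""
    new_text = text.replace(ngram, symbol)
    d_h1 = h1(new_text) - h1(text)
    d_h2 = h2(new_text) - h2(text)
    d_ratio = h2(new_text) / h1(new_text) - h2(text) / h1(text)
    return d_h1, d_h2, d_ratio

# Toy string in which [iii] behaves like a single character.
d_h1, d_h2, d_ratio = rate_replacement("iiiiiiaabbiiibaiii", "iii")
verdict = "good" if d_h2 > d_h1 else "blacklist"
print(f"dh1={d_h1:.3f} dh2={d_h2:.3f} d(h2/h1)={d_ratio:.3f} -> {verdict}")
```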

Most effective measures

So what happens when we only apply the most successful transformations and blacklist those that increase h1 more than h2?

In the series where I replaced only bigrams:

  • good: ch, in, sh, al, qo, am, ar, ai, or
  • blacklist: ee, eo, ky, yk, ey, od, dy

This is an interesting pattern: replacing benches and certain clusters with [i] and [a] is very effective, while [e]- and [y]-clusters are better left alone.

In the series where I replaced a weighted list of common n-grams up to n=4:

  • good: ch, aiin, ain, sh, qok, air, ar, or, al
  • blacklist: che, ey, ol, ok, ot, edy, dy, she (I already skipped [ee] in this experiment because in all previous tests it had undesired effects).

Here we get a result similar to that of the bigrams: if it is a bench or involves [a] or [i], it is probably effective to replace. [y] and [e] are best avoided. [o] can have a moderately positive or negative effect.

In summary:

  • Good: ch, sh, aiin, ain, am, ar, al
  • Uncertain: qo, qok, or
  • Bad: ee, eo, che, she, ey, edy, dy, yk, ky, ol, ok, ot, od

The graph below shows that isolation of desirable and undesirable replacements works. The yellow dots are the “good” replacements, performed cumulatively in the order above. Red are the “bad” replacements which result mostly in an increase in h1. If our goal is, ideally, to reach the bottom of the purple cloud (reference corpus of medieval texts) then expanding on the “good” line is the best strategy.

(Figure: “good” replacements in yellow and “bad” replacements in red, against the purple reference corpus.)

One of the red dots does produce a decent increase in h2. This is when I replaced [che], which includes [ch]. Substituting [ch] by one glyph is such a powerful boost to entropy that its effect can also be noticed here. However, adding the [che] substitution on top of the “good” sequence, where [ch] has already been replaced, results in a significant shift to the right, which is what we want to avoid. Therefore, [che] must remain blacklisted. See the blue dot below, where I’ve hidden the reference corpus for increased visibility.

Blue dot: replacing (what originally was) [che] after all “good” transformations has undesirable effects.

With this in mind, I will assume that our blacklisted transformations should remain blacklisted. Progress?

Back to frequencies

I like where the “good” series brought us, but can we get any higher? I started by testing the next most frequent n-grams as the next step in the series:

cho, dy, ed, ke, or, qo, te, aiiin, chol, daiin, qok

Of these, the following have a desirable effect on entropy:

or, aiiin, qok, qo

Let’s see what happens when we apply those transformations all at once. The resulting values are represented by the top right yellow dot:

(Figure: the “good” series extended with [or, aiiin, qok, qo]; result shown as the top right yellow dot.)

That’s pretty decent: we are almost at the bottom of the “normal” Latin script cloud, with some h1 to spare. Now the challenge is to find a few more things to replace. Going by what worked before, I should probably focus on remaining i-clusters and (q)o+gallows. The most prominent candidates are [air] and [qot]. Let’s see what those do… Note that I changed the graph below a bit, back to regular h2 – h1.

Conclusion (TL/DR)

The graph below summarizes what we found so far. The black dot is our starting point, EVA with benched gallows unstacked. Blue is what happens if you keep changing the most frequent bigrams to one new character. Yellow does the same with trigrams. Green uses frequent n-grams as a guideline, without discriminating between effective and non-effective measures. The red dot at the bottom of the “normal” cloud is where we end up if we selectively replace the most effective clusters (at least those I have been able to find so far!). These are [ch, sh, ain, aiin, aiiin, air, am, ar, al, or, qok, qot, qo].

(Figure: summary plot of all replacement series against the reference corpus.)

Representing certain frequent letter clusters in EVA with a single new character brings the entropy values of Voynichese closer to what we expect of regular texts. This worked best with those clusters that can be read as single characters, like benches (which EVA splits in two) and [i]-minims. Additionally, whenever [a] was involved, reducing the cluster to a new character also had positive effects. On the other hand, involving [e] or [y] did not move entropy in the desired direction. Finally, [o] had mixed results, being effective in combination with [q] but disappointing in most other contexts.

Some of these should hardly be controversial: benches are uninterrupted glyphs; [qo] can similarly be seen as one glyph; minims are by definition the building blocks of larger glyphs. I would argue that even involving [r] and [n] should not be controversial, since these can also be read as minims with different flourishes. Personally, I think [a] can also act as a minim, this time with an added element on the left. Contracting something like [qok] or [or] is more difficult to explain, since these are made of clearly distinct shapes.

It’s a shame to stop so close to the bottom of the “normal” Latin cloud. So far I have avoided transformations which drastically increase h1 over h2. But might including one of those provide the last push we need? Just for the challenge of it, let’s see what we can still do. The top ten remaining high-frequency bigrams are, from high to low: dy, ol, ed, ee, ey, ok, ot, ke, eo, od. If we count new bigrams which would originally have been trigrams, then [che, cho, she] are also still candidates. Notice how all of the remaining most frequent bigrams include one or two of [y, o, e].

Testing the ten most frequent bigrams that remain, I found [ol] to produce the best results: a decent increase in h2 without blowing up h1. I then decided to be consistent and just replace all four bigrams that start with [o], namely [ol, ot, ok, od]. This took h2 to 3.01, crossing the symbolic barrier of 3 for the first time. To my surprise, h1 only rose to 4.12. This is where it ends up on the graph (red dot):

(Figure: final result after adding [ol, ot, ok, od], marked with a red dot.)

All of this to say: by rewriting bigrams to single characters, it is possible to lift Voynichese’s conditional entropy to more normal levels. It seems best to focus on [a, i, o] and avoid [e, y]. The total list of replacements I applied to obtain these values is: [ch, sh, ain, aiin, aiiin, air, am, ar, al, or, qok, qot, qo, ol, ot, ok, od]. Of course, since this is all rather complex and sometimes counter-intuitive, it is always possible I made mistakes or missed obvious opportunities for better replacements. The number of possibilities is simply too large to keep track of.