Marco Ponzi kindly pointed out that there was a mistake in my previous entropy posts. It did not invalidate the results, but it was enough to noticeably change the numbers. Instead of rewriting the previous posts with corrected numbers, I decided to redo the experiment and improve it at the same time.

What was this about again?

One of the main problems we have with the text of the Voynich manuscript is that its characters are relatively easy to predict – more so than in any language we know of. We say that its conditional character entropy (h2) is too low. One part of this problem (though certainly not all!) is caused by the fact that the commonly used EVA transliteration system splits certain glyphs into separate strokes. For example, the “bench” glyph becomes EVA [ch]. As a result, the c+h pairing becomes extremely predictable, dragging down h2.

The example above would be [chain] in EVA, while it could just represent two glyphs, the “bench” and something that looks like a loopy “m”. Since such stroke groups are common, it is possible that EVA decreases h2 by using multiple letters to represent them. This is not a criticism of EVA: I like it, and I like that everybody can use it to communicate about the MS. It is rather a criticism of the fact that not all researchers are aware of EVA’s impact on statistics.
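For readers who want to see what h1 and h2 measure, here is a minimal Python sketch. It is not Nablator's Java code. The function names are my own, and counting spaces as ordinary characters is an assumption that may differ from the tool used for the numbers in this post.

```python
import math
from collections import Counter

def h1(text):
    """First-order character entropy in bits per character."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def h2(text):
    """Conditional entropy of a character given the preceding one, in bits."""
    pairs = Counter(zip(text, text[1:]))   # counts of (previous, next)
    firsts = Counter(text[:-1])            # how often each character has a successor
    total = len(text) - 1
    # -sum over pairs of p(a, b) * log2 p(b | a)
    return -sum(c / total * math.log2(c / firsts[a]) for (a, _), c in pairs.items())

# Toy string, not real transliteration data.
sample = "chol chor shol daiin chol"
merged = sample.replace("ch", "C").replace("sh", "S")  # "C" and "S" are placeholder glyphs
print("split benches :", h1(sample), h2(sample))
print("merged benches:", h1(merged), h2(merged))
```

A fully predictable string such as "abab" has h2 = 0, which is the extreme case of the [ch] effect: once you have seen [c], the next character carries no information.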

While I started by mapping the impact of EVA, my interest then shifted to common n-grams (groups of glyphs) in general. Because even if we factor out potentially “introduced” problems like [ch], there are still many common n-grams like [ol], [dy] etc, which are harder to interpret as single glyphs because they consist of clearly separate parts.

Imagine that [o] modifies the next glyph. So for example, [t] is “t”, but [ot] is “d”. This hypothesis assumes that Voynichese is a verbose cipher, whereby one glyph in the source text is represented by a group of glyphs in the ciphertext. While I do not wish to argue that Voynichese is a verbose cipher, it is still interesting to see how entropy changes if we reverse this hypothetical process.

Improved experiment – setup

My goal is to increase h2, preferably up to the magical barrier of 3.00, while at the same time keeping h1 in check. I do this by replacing common n-grams like [aiin] with a single new glyph, for example [A]. If you keep doing this, you can of course increase character entropy up to a point where it matches word entropy since each word (type) will be a unique glyph. This is why it is important to keep an eye on h1 as well, which makes the experiment much more tricky; some replacements increase h1 by much more than they increase h2, so those are to be avoided.
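The replacement step itself is simple. The sketch below assumes the text is a plain string of lowercase EVA with no uppercase letters, so that an uppercase letter can serve as the new glyph. The function names are hypothetical.

```python
def merge_ngram(text, ngram, glyph):
    """Replace every occurrence of `ngram` with the single character `glyph`."""
    if len(glyph) != 1:
        raise ValueError("the new glyph must be a single character")
    if glyph in text:
        raise ValueError("glyph already occurs in the text and would be ambiguous")
    return text.replace(ngram, glyph)

def alphabet_size(text):
    """Number of distinct glyphs, ignoring spaces."""
    return len(set(text) - {" "})

before = "daiin okaiin qokaiin"          # toy example
after = merge_ngram(before, "aiin", "A")
print(after)                              # dA okA qokA
print(alphabet_size(before), alphabet_size(after))
```

Each merge adds a symbol to the alphabet, and sometimes removes others (here [a], [i] and [n] vanish only because the toy text is tiny). On a real text the alphabet usually grows, which is why h1 has to be watched alongside h2.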

While in previous posts I did this with trial and error based on an n-gram frequency list Marco sent me, I now opted for a more systematic approach, and I wanted to take more variables into account:

  • What is the impact of the transliteration file used? Takahashi (TT) vs. Zandbergen-Landini.
  • Do different sections behave differently? I will test Herbal A, Herbal B, Q13 and Q20.
  • Does the way we treat benched gallows make a difference?

That last point deserves some explanation. Since benches (EVA-ch and -sh) are almost certainly glyphs that are “split” by EVA, I want to merge them already in the original files to be modified. Merged benches should be part of the “zero” state. But this leaves us with a problem for benched gallows, which consist of a bench glyph and a gallow glyph on top of each other.

EVA splits these into three parts: half bench + gallow + half bench, e.g. [cth]. But since benches have been merged, [c] and [h] are now gone. There are generally two ways to tackle this problem. One is “unstacking”, preferred by for example Nick Pelling and myself. I had somewhat arbitrarily chosen to rewrite them as bench + gallow, but in response Nick argued that gallow + bench is the better order. I will follow Nick’s advice. The other way is to represent benched gallows as their own, unique characters, as preferred by for example Marco Ponzi and Emma May Smith.
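The two treatments can be sketched as follows. The four benched gallows are EVA [cth], [ckh], [cph] and [cfh]. The placeholder symbols ("C" and "S" for the merged benches, uppercase gallows for the unique characters) are my own choice for illustration, not the ones used in the actual files.

```python
BENCHED_GALLOWS = ("cth", "ckh", "cph", "cfh")

def unstack(text, bench="C"):
    """Rewrite each benched gallow as gallow + bench, e.g. cth -> tC."""
    for bg in BENCHED_GALLOWS:
        text = text.replace(bg, bg[1] + bench)
    return text

def uniquify(text):
    """Rewrite each benched gallow as its own character, e.g. cth -> T."""
    for bg in BENCHED_GALLOWS:
        text = text.replace(bg, bg[1].upper())
    return text

def merge_benches(text):
    """Merge the plain benches: ch -> C, sh -> S."""
    return text.replace("ch", "C").replace("sh", "S")

sample = "cthol chckhy shey"              # toy words
print(merge_benches(unstack(sample)))     # tCol CkCy Sey
print(merge_benches(uniquify(sample)))    # Tol CKy Sey
```

The benched gallows are handled before the plain benches. [cth] does not contain the string "ch", so merging benches first would leave stray [c] and [h] characters behind.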

Because all these questions required me to test and edit a lot of different files, I faced the inevitable conclusion that I needed to learn how to automate these tasks in Python. This worked for the most part, and Marco was kind enough to help me fix the last issues with the code. Given an input text file, the script generates a series of new files, each with a different n-gram merged. The n-grams to be merged are simply based on a list of the most frequent n-grams, up to n=4. Benches are merged in the input already, so they count as one.

Running this script on a base file results in a series of new files, each with one of the most frequent n-grams merged to a single glyph. By running Nablator’s Java code on these files, I was able to calculate the impact of each different transformation.
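As a rough illustration of that pipeline (not the actual script), the sketch below counts n-grams of length 2 to 4 inside words and builds one variant of the text per frequent n-gram. It keeps the variants in a dictionary instead of writing files. The function names and the placeholder glyph "#" are assumptions.

```python
from collections import Counter

def ngram_counts(text, max_n=4):
    """Count all n-grams (2 <= n <= max_n) that occur inside words."""
    counts = Counter()
    for word in text.split():
        for n in range(2, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts

def make_variants(text, top=10, max_n=4, glyph="#"):
    """Return {ngram: text with that n-gram merged} for the `top` most frequent n-grams."""
    if glyph in text:
        raise ValueError("placeholder glyph already occurs in the text")
    variants = {}
    for ngram, _ in ngram_counts(text, max_n).most_common(top):
        variants[ngram] = text.replace(ngram, glyph)
    return variants

toy = "qokol qokar qol"                   # toy words, benches assumed merged already
for ngram, variant in make_variants(toy, top=3).items():
    print(ngram, "->", variant)
```

Because benches are merged in the input, a bench is a single character here and counts as one position in an n-gram. Each variant can then be fed to whatever entropy tool is used to measure the effect of that one merge.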

What matters?

I prepared four different starting files: Takahashi and ZL, each in two versions, one with benched gallows unstacked and one with benched gallows rewritten as unique glyphs. It appears that the impact of the transliteration used is very small. H2 for ZL files differs on average by a mere 0.005 from the corresponding TT files. The impact of the way benched gallows were treated was more than five times as large, with an average difference of 0.028 between both methods.

This is a relief, because it means that at least for these very broad statistics, the two most commonly used transliterations are similar enough. Therefore, this post will henceforth focus on the TT files. Interpreting benched gallows in different ways does make a difference, but this was to be expected: rewriting benched gallows introduces new glyphs, increasing h2.

Do the differences between various sections matter? Apparently, they do. In the graph below, black dots are the original, unmodified EVA files. Red dots have had benches merged and benched gallows replaced by unique characters. Finally, green dots have benches merged and benched gallows unstacked.

There’s a lot to unpack in this graph. Q13, Q20 and HB follow a similar pattern, where both modified versions gain h2 and lose h1 (which is exactly what we want). Herbal A breaks this pattern: its version where benched gallows are replaced by unique characters (red dot) gains a lot of h2, but also takes on an increase in h1.

I wonder whether this is a consequence of the “language”; Q20, Q13 and HB are all in Currier’s B-dialect, while HA is, as its name suggests, in the A-dialect. Still, Currier language does not explain everything, since Q13 distances itself from the others with its notoriously low entropy values.

So in summary, what matters?

  • Section? Yes: A and B languages might show different behavior and Q13 has much lower entropy values than the rest.
  • Treatment of benched gallows? Yes, especially the impact on h1 is significant.
  • Transliteration file used? No, not for these broad statistics.

Since I find the differences between sections the most interesting, I will focus on this aspect to limit the number of variables. We will start the experiment with the version where benched gallows have been unstacked in Nick Pelling’s favored manner. At the end I will check if there is a big difference with the version where benched gallows were replaced with unique characters.

What am [i]?

When I ran the script for a first test, I noticed a few things. The following graph is probably hard to read, but it is enough as an illustration for now.

Each section is a color: blue for HA, red for HB, yellow for Q13, green for Q20. The height of the bar represents the increase in h2 after the transformation, minus the increase in h1. In other words, how much does the dot move to the top left in the scatter plot, which is what we want. Tall bars represent glyphs that appear to have been extremely redundant.
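In code, the bar height is a one-line calculation. The sketch below takes (h1, h2) pairs measured before and after a replacement and ranks candidates by that score. The numbers and the labels "x", "y", "z" are invented purely to show the arithmetic and are not measurements.

```python
def net_gain(before, after):
    """Increase in h2 minus increase in h1; both arguments are (h1, h2) pairs."""
    h1_before, h2_before = before
    h1_after, h2_after = after
    return (h2_after - h2_before) - (h1_after - h1_before)

def rank(before, results):
    """Sort candidate labels by net gain, best first.

    `results` maps a label to the (h1, h2) pair measured after that merge.
    """
    return sorted(results, key=lambda label: net_gain(before, results[label]), reverse=True)

# Invented values for illustration only.
base = (4.0, 2.0)
results = {"x": (4.1, 2.3), "y": (4.3, 2.2), "z": (4.0, 2.1)}
for label in rank(base, results):
    print(label, round(net_gain(base, results[label]), 3))
```

A candidate like "y" raises h2 but raises h1 even more, so its bar would be negative, which is exactly the kind of replacement to avoid.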

The graph is sorted by increasing values for Herbal A, which also allows us to see where the B-sections differ. Q13 presents itself as kind of an extreme version of the B-language, with tall peaks at [edy, qo, qok]. These differences can be explained to a large extent by frequency: if an n-gram is much more numerous in one section, it can have a larger impact on that section. Still, frequency does not explain everything, since some frequent n-grams also have low values. But this graph certainly shows why it is important to treat sections separately.

By far the tallest bars for most sections are those replacements involving [ai], [in], or some combination of these. Therefore, I will first sort out the [i]-clusters separately. Like the benches, I feel like their effect on entropy is in part explained by the way we represent Voynichese. Of course, the fact that [i]-clusters are always at the end of words is an inherent property of Voynichese, but the way we section those clusters into minims might lower entropy in an artificial way.

But what is an [i]-cluster?

7195 words in the ZL transcription contain one or more i-characters. Of those, 6043 end with [in]. In 6788 cases, the [i]’s are immediately preceded by [a]. This is why [ain] and [aiin] are so common. Only two other bigrams involve [i] and occur more than 100 times: [oi] and [ir].

Put differently, 78% of VM words containing [i] end either in [ain] or [aiin]. It might be a surprise to some readers that [a] is even more frequently connected to the [i]-cluster than [n] is. The only other combination that seems to matter is [air] with 8%. Next are [oiin] and [aiiin], which don’t have a great impact with 2% each.
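Counts like these can be reproduced from a word list with a few lines of Python. This is a sketch under my own reading of the categories (for instance, "preceded by [a]" is taken to mean that the word contains the string "ai"), and the word list is a toy example, not the ZL transliteration.

```python
import re
from collections import Counter

def i_cluster_stats(words):
    """Tally how [i]-containing words are distributed over a few patterns."""
    with_i = [w for w in words if "i" in w]
    stats = Counter(total=len(with_i))
    for w in with_i:
        if w.endswith("in"):
            stats["ends_in"] += 1
        if "ai" in w:                       # an [i]-run directly preceded by [a]
            stats["a_before_i"] += 1
        if re.search(r"ai{1,2}n$", w):      # ends in exactly [ain] or [aiin]
            stats["ain_or_aiin"] += 1
    return stats

toy_words = ["daiin", "okain", "dair", "oiin", "chol", "aiiin"]
stats = i_cluster_stats(toy_words)
for key in ("total", "ends_in", "a_before_i", "ain_or_aiin"):
    print(key, stats[key])
```

Dividing each tally by `stats["total"]` gives percentages of the kind quoted above.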

Now don’t get me wrong, the fact that [oiin] appears 176 times must be relevant somehow. But not so much for this particular study, which is mostly concerned with frequent phenomena. Therefore, I will reduce the [i]-problem to [ain, aiin] and, like benches, replace these separately beforehand.


My intention is to increase h2 (conditional character entropy) while keeping h1 in check. Looking at the corpus of medieval texts I collected, I established a threshold for h1. For the level of h2 I am expecting to reach, I should really stay below:

max h1 = 4.20

For h2, I aim for the magical limit of 3.00, but this will probably be impossible. In my previous entropy post I did reach a value above 3.00, but as Marco pointed out, it contained two mistakes:

  • I used the entire MS instead of just one section. Given the different nature of the sections, it is easier to reach high h2 on the full MS.
  • I made a mistake in formatting the text which also somewhat increased h2.

So this is what we are aiming for: get h2 as close as possible to 3.00, while keeping h1 below 4.2.

Each of the following sections has had the following n-grams rewritten before running the tests:

ch, sh, ain, aiin

The following plot should clarify these choices. The many black dots are various medieval texts. The colored dots in the bottom left are the four selected Voynich quires. I have circled the original, unmodified EVA versions. With benches, ain and aiin “fixed”, you can see that they get a fairer starting position already, since all dots shift to the top left.

I know from experience that reaching the green line will be hard. Therefore, the red line at 4.2 seems like a reasonable limit for h1. After this point, h1 tends to increase without raising h2, which offers no prospect of reaching any acceptable level.

Herbal A

For Herbal A, I reached the following values:

h1 = 4.20
h2 = 2.94

Sequence: qok, chol, chor, che, chy, ol, cho, or, qot, ar, eey, al, qo

This is closer than I thought to the “dream destination” of 3.00, but still not quite normal. I am surprised by the amount of [ch], and that several versions of [cho] were eliminated. It may be worth pointing out that the VM scribes often write [cho] as a ligature, and may indeed have thought of it as one unit. But this is something that should be studied separately.

If we apply this exact series of changes to the B-sections, we see how it affects them differently. They do move in the right direction, but Herbal A (purple dot) benefits more. It remains to be seen whether a different series will be better for the other sections. Things are not looking good for Q13.
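Applying one section's sequence to another text is just a matter of running the same replacements in the same order. In the sketch below, the function name and the use of "A", "B", "C", ... as new glyphs are assumptions, and the input is a toy string.

```python
def apply_sequence(text, sequence):
    """Merge each n-gram in `sequence`, in order, into a fresh uppercase glyph.

    Assumes the input contains no uppercase letters and len(sequence) <= 26.
    """
    for k, ngram in enumerate(sequence):
        text = text.replace(ngram, chr(ord("A") + k))
    return text

toy = "qokol qol"
print(apply_sequence(toy, ["qok", "ol", "qo"]))   # AB qB
print(apply_sequence(toy, ["qo", "qok", "ol"]))   # AkC Al
```

The two calls use the same n-grams but give different results. Once [qo] has been merged, there is no [qok] left to merge, which is presumably why [qok] comes early and [qo] last in the sequences where both appear.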

Herbal B

For Herbal B, I reached the following values:

h1 = 4.15
h2 = 2.82

Sequence: ar, qok, ol, al, qot, or, edy, she, che, ok, qo

Q20

For Q20, I reached the following values:

h1 = 4.19
h2 = 2.78

Sequence: qok, edy, qot, ar, eey, al, she, che, chol

Q13

Now comes the VM’s worst crime against entropy: Q13. This section’s vocabulary is the most limited by far, and its characters are the most predictable. These are the values I attained:

h1 = 4.18
h2 = 2.73

Sequence: qok, edy, qot, ol, ar, al, ot, or, ok, eey, she, che

The final graph looks like this:

In the original EVA, Q13 is the outlier while the other sections are relatively close together. The modified versions, however, group by Currier language; it was much easier to get close to h2 = 3.00 with Herbal A than with any of the B-sections, even though I spent a lot of time trying to squeeze the most out of each section individually.

Comparison of selected n-grams

I obtained these values by repeatedly running a Python script on the files, and each time checking which operations produced the best offspring. Ideally I would have liked to create a version of each possible combination, but this is impossible given the exponential growth. (Once I accidentally ran the script four times in a row and I ended up with over 60,000 text files, or 14 GB, which took several minutes to delete.)
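One way to automate this "pick the best offspring, then repeat" loop is a greedy search. This is a sketch of the general idea, not a reconstruction of the procedure actually used, which involved manual checking. The selection rule (largest h2 gain minus h1 gain, among candidates that raise h2 and keep h1 under the cap), the private-use code points used as new glyphs, and all function names are assumptions.

```python
import math
from collections import Counter

def entropies(text):
    """Return (h1, h2) in bits; spaces count as characters. Needs len(text) >= 2."""
    n = len(text)
    h1 = -sum(c / n * math.log2(c / n) for c in Counter(text).values())
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    h2 = -sum(c / (n - 1) * math.log2(c / firsts[a]) for (a, _), c in pairs.items())
    return h1, h2

def greedy_merge(text, candidates, max_h1=4.2):
    """Repeatedly apply the single merge with the best net gain."""
    sequence, remaining = [], list(candidates)
    while remaining:
        h1_now, h2_now = entropies(text)
        glyph = chr(0xE000 + len(sequence))      # fresh private-use character
        best = None
        for ngram in remaining:
            if ngram not in text:
                continue
            trial = text.replace(ngram, glyph)
            h1_new, h2_new = entropies(trial)
            if h1_new > max_h1 or h2_new <= h2_now:
                continue                          # breaks the cap or does not help h2
            gain = (h2_new - h2_now) - (h1_new - h1_now)
            if best is None or gain > best[0]:
                best = (gain, ngram, trial)
        if best is None:
            break                                 # no acceptable merge left
        _, ngram, text = best
        sequence.append(ngram)
        remaining.remove(ngram)
    return sequence, text

toy = "qokol qokar qokol daiin qol " * 5          # toy text, not a real section
sequence, result = greedy_merge(toy, ["qok", "qo", "ol", "ar"])
print(sequence, entropies(toy), entropies(result))
```

A greedy search only explores one path, so it can miss combinations that an exhaustive search would find, but it keeps the number of generated variants linear in the number of candidates per round instead of exponential.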

The n-grams included in the script were the following:

ol, ok, ot, or, od, qo, qot, che, ee, cho, ed, ey, al, ar, eo, she, chol, chor, qok, chy, sho, da, dy, edy, eedy, eey

While selecting these, I needed to draw the line somewhere in order not to get overwhelmed. Therefore, I excluded common n-grams with a gallow in the middle, like [oke]. It is possible that including these may improve results, but this seemed too fanciful even for my taste.

The following graph shows which n-grams were changed where. Note that the order of operations may be relevant; this graph does not take that into account, but the sequences listed under each section earlier in this post do. Before anything else, benched gallows were unstacked in the fashion preferred by Nick Pelling, that is gallow-bench.

Marked in green are those replacements which were widely required: ch, che, sh, ain, aiin, qot, qok, al, ar, ol, or, eey.

Herbal A seemed to benefit most from additional replacements involving [ch]: chol, chor, cho, chy. This came as a surprise to me, but I think something can be said for at least [cho] being a unit.

The replacements she and edy (blue) were required in all B-sections, but not in Herbal A.

A whole group of replacements (marked in red) were not selected by the system: od, ee, ed, ey, eo, sho, da, dy, eedy. One way to read this is that [edy] was preferred as a unit, which excluded dy, ed, and eedy. In previous experiments, I already learned that e-based replacements like [ee, eo] don’t tend to give great results.

I was planning to check what the difference would be if benched gallows became unique characters, but this post already took me way too long. Since this post expands and corrects the previous entropy posts, those are no longer valid. I will edit them and refer to this new post.

A final note: this is just one way to do this, and I do not want to propose this as any kind of solution. The point I want to make is that the information value of Voynichese words might be higher than the “unmodified” EVA h2 suggests. Researchers should take this into account when, for example, dismissing the possibility that Voynichese can contain real language based on these grounds.