Commit 58dd029

Commented Stolfi's work in note 006 and checked strange characters count for Slot decomposition
1 parent 5d9eba7 commit 58dd029

18 files changed: +194 -96 lines changed

docs/005/images/Rare.PNG

125 Bytes

docs/005/images/Slots Table.PNG

-519 Bytes

docs/005/index.md

Lines changed: 46 additions & 43 deletions
Large diffs are not rendered by default.

docs/006/images/CCM.PNG

-32.7 KB
Binary file not shown.

docs/006/images/CrustMantleCore.PNG

18 KB

docs/006/index.md

Lines changed: 58 additions & 14 deletions
@@ -1,8 +1,8 @@
 # Note 006 - Works on Word Structure
 
-_Last updated Oct. 23rd, 2021._
+_Last updated Jan. 9th, 2022._
 
-_This note refers to [release v.5.0.0](https://github.com/mzattera/v4j/tree/v.5.0.0) of v4j;
+_This note refers to [release v.6.0.0](https://github.com/mzattera/v4j/tree/v.6.0.0) of v4j;
 **links to classes and files refer to this release**; files might have been changed, deleted or moved in the current master branch.
 In addition, some of this note content might have become obsolete in more recent versions of the library._
 
@@ -16,10 +16,11 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
 ---
 
 
-In this page I will list, review, and comment works from different authors about the structure of Voynich words.
-When appropriate I will compare their findings with my [slots concept](../005).
+In this page I will list, review, and comment on works from different authors about the inner structure of Voynich words.
+When appropriate, I will compare their findings with my [slots concept](../005).
 
 I expect these notes to grow and refine over time (as for the others, to be honest).
+Numbers in square brackets indicate the date when the corresponding works were published (as far as I can determine it).
 
 
 # John H. Tiltman [1967]
@@ -31,6 +32,7 @@ place in an "order of precedence" within words; some symbols such as
 'o' and 'y' seem to be able to occupy two functionally different places._"
 
 
+
 # Mike Roe [1997]
 
 I found the below "generic word" grammar by Roe quoted by [Zandbergen](http://www.voynich.nu/a3_para.html) as published to the Voynich MS mailing list. Roe suggested that this could perhaps present evidence of grammar of the Voynich language:
@@ -40,19 +42,63 @@ Image from Zandbergen's website.
 ![Mike Roe's generic word.](images/pd_roe.gif)
 
 
+
 # Jorge Stolfi [2000]
 
-[Describes](https://www.ic.unicamp.br/~stolfi/voynich/97-11-12-pms/) a decomposition of Voynichese words into three parts; prefix, midfix, and suffix.
+Stolfi initially describes a [decomposition of Voynichese words](https://www.ic.unicamp.br/~stolfi/voynich/97-11-12-pms/) into three parts: prefix, midfix, and suffix.
 Based on a classification of EVA characters into soft and hard letters, he then shows how Voynichese words can be decomposed into
 a prefix and suffix made entirely of soft letters, and a midfix made entirely of hard letters.
 
-This is in line with the slots model, the picture below shows glyphs in their corresponding slots and how they map
-into Stolfi definitions (red glyphs are "soft" letteres).
+This is well in line with the slots model. The picture below shows glyphs in their corresponding slots and how they map
+onto Stolfi's definitions (red glyphs are "hard" letters, while blue ones are "soft").
+
+![Stolfi's "soft" and "hard" letters in corresponding slots.](images/HardNSoft.PNG)
+
+He continues his analysis with the "[OKOKOKO](https://www.ic.unicamp.br/~stolfi/voynich/Notes/017/Note-017.html)"
+paradigm, to describe the fine structure of Voynichese words; finally,
+Stolfi develops these concepts into his well-known "[crust-mantle-core](https://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/00-06-07-word-grammar/)"
+decomposition that he describes by using a [formal grammar](https://www.ic.unicamp.br/~stolfi/EXPORT/projects/voynich/00-06-07-word-grammar/txt.n.html).
+
+According to this model, each Voynich word can be divided into three layers, nested in an onion-skin pattern: the core is at the center of the word,
+surrounded by the mantle, which in turn is surrounded by the crust. Each layer can be optionally empty and is, in general, defined by the letters it contains.
+
+Leaving aside letters 'a', 'o', 'e', and 'y', for which Stolfi has a separate treatment, the layers can be defined as follows:
+
+* Crust: letters 'd', 'l', 'r', 's', 'n', 'x', 'i', 'm', 'g'.
+
+* Mantle: pedestals and 'ee'.
+
+* Core: all gallows, pedestalled or not.
 
-![Slots accordingly to Stolfi classification.](images/HardNSoft.PNG)
+The image below shows how glyphs in slots map onto the crust-mantle-core definitions:
+
+![Stolfi's crust-mantle-core glyphs in corresponding slots.](images/CrustMantleCore.PNG)
+
+Stolfi comments: "_The distribution of the "circles", the EVA letters { a o y }, is rather complex. They may occur anywhere within the three main layers_ ... _We have arbitrarily chosen to parse each circle as if it were a modifier of the next non-circle letter; except that a circle at the end of the word (usually a y) is parsed as a letter by itself. ... the rules about which circles may appear in each position seem to be fairly complex_". I think, in light of the slots model, this is an unnecessary complication
+as 'a', 'o' and 'y' can be unambiguously assigned to the crust layer in most cases; furthermore,
+it is clear in which position they can appear (slots 1, 8, and 11). Similarly, I do not understand the complicated parsing of isolated 'e' ("_we have chosen to parse isolated e letters as part of the preceding mantle or core letter_ ... _Very rarely ... e occurs alone, surrounded by crust letters; in which case we parse it as the only letter in the mantle layer_"), when the slots model
+indicates 'e', 'ee', 'eee' play the same role in word structure.
+
+Undoubtedly, the interesting aspect of this model is that it proposes an "onion-like" structure
+for Voynich words. In Stolfi's own words: "_The grammar not only specifies the valid words, but also defines a parse tree for each word, which in turn implies a nested division of the same into smaller parts ... we believe that our parsing of each word into three nested layers must correspond to a major feature of the VMS encoding or of its underlying plaintext_"; however, I would argue this is not what the grammar indicates.
+
+For example, again by comparison with the slots model, and as Stolfi admits, "_the crust is not homogeneous_": it is composed of a "left" part, which constitutes word prefixes, and a "right" part, which constitutes word suffixes, and these parts are quite different; e.g. 'q' appears only in prefixes, while the 'ai*' or 'oi*' sequences (like '-aiin', '-am', etc., that Stolfi calls IN clusters) appear only in suffixes.
+
+Similarly, it can be seen that gallows in slots 3 and 7, which belong to the core layer, could well enclose pedestals or 'ee' in slots 4 and 6 that are classified as mantle. Again, Stolfi comments: "_The implied structure of the mantle is probably the weakest part of our paradigm. Actually, we still do not know whether the isolated e after the core is indeed a modifier for the gallows letter (as the grammar implies); or whether the pedestal of a platform gallows is to be counted as part of the mantle_".
+
+Stolfi notes: "_When designing the grammar, we tried to strike a useful balance between a simple and informative model and one that would cover as much of the corpus as possible. ... Conversely, the grammar is probably too permissive in many points, so that many words that it classifies as normal are in fact errors or non-word constructs_". It should be noted that the grammar is very good at parsing Voynichese
+(according to Stolfi it covers "_over 96.5% of all the tokens (word instances) in the text_") but,
+on the other hand, it is also very bad at recognizing what is not Voynichese; the grammar accepts something on the order of 1.4e20 (about a hundred billion billion) different terms, only about 4'500 of which are terms in the manuscript ([concordance version](https://github.com/mzattera/v4j#ivtff)). Just for comparison, all the words that can be generated by the slot model amount to a total of 16'753'291 (13 orders of magnitude fewer), of which around 2'800 are Voynich terms; the model covers slightly more than 88% of tokens (98% considering separable terms) but it is much easier to describe and understand.
+
+In summary, I do agree with Stolfi (and other authors) that the order in which characters appear in Voynich
+words is not arbitrary, but I think his model is misleading in suggesting a "layered" structure; for example,
+word prefixes and suffixes, which in Stolfi's model both belong to the same layer (the crust), are indeed very different and assigning them to the same word structure looks completely arbitrary; ultimately, it seems
+the grammar suggests a "sequence" of possible characters, rather than an "onion-like" structure for words.
+If this is the case, it must be said that the other, much simpler, models in this page show the same overall
+structure of Voynich words, even if in less detail or with less coverage of Voynich terms.
+Regarding the fine details, these might not be as relevant, as Stolfi admits that "_one should not give too much weight to the finer divisions and associations implied by our parse trees_". It should also be mentioned that the grammar
+looks unnecessarily complex, mostly because of the way it handles "circles"; this makes it very difficult to grasp the structure of Voynichese below the most superficial levels by looking at the grammar. This is further complicated by the fact that the vast majority of the words the grammar describes are clearly very different from those found in the text.
 
-Stolfi develops some "paradigms" of Voynich words, like the [OKOKOKO](https://www.ic.unicamp.br/~stolfi/voynich/Notes/017/Note-017.html) paradigm and the crust-core-mantle decomposition
-which, in his words, are incorporated and refined into a [grammar for Voynichese words](https://www.ic.unicamp.br/~stolfi/voynich/00-06-07-word-grammar/).
 
 
 # Philip Neal [?]
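
The crust, mantle and core letter lists added to the Stolfi section in the hunk above can be tried out with a small sketch. This is not Stolfi's grammar and not part of v4j: the class `LayerSketch`, its `classify` method and the greedy glyph splitting are hypothetical, and it assumes that gallows are EVA 't', 'k', 'p', 'f', that pedestals are 'ch' and 'sh', and that pedestalled gallows are written 'c' + gallows + 'h'. Letters outside the three lists (the circles, an isolated 'e', 'q') are simply reported as `OTHER`.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: labels EVA glyphs with the layers listed in the note. */
final class LayerSketch {

    enum Layer { CRUST, MANTLE, CORE, OTHER }

    // Assumption: plain gallows are the EVA letters t, k, p, f.
    private static final String GALLOWS = "tkpf";
    // Crust letters exactly as listed in the note.
    private static final String CRUST = "dlrsnximg";

    /** Splits a word greedily into glyphs and returns one layer per glyph. */
    static List<Layer> classify(String word) {
        List<Layer> layers = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            char c = word.charAt(i);
            boolean pedestalledGallows = c == 'c'
                    && i + 2 < word.length()
                    && GALLOWS.indexOf(word.charAt(i + 1)) >= 0
                    && word.charAt(i + 2) == 'h';
            if (pedestalledGallows) {
                layers.add(Layer.CORE);      // e.g. 'ckh'
                i += 3;
            } else if (word.startsWith("ch", i) || word.startsWith("sh", i)
                    || word.startsWith("ee", i)) {
                layers.add(Layer.MANTLE);    // pedestals and 'ee'
                i += 2;
            } else if (GALLOWS.indexOf(c) >= 0) {
                layers.add(Layer.CORE);      // plain gallows
                i++;
            } else if (CRUST.indexOf(c) >= 0) {
                layers.add(Layer.CRUST);
                i++;
            } else {
                layers.add(Layer.OTHER);     // circles, isolated 'e', 'q', ...
                i++;
            }
        }
        return layers;
    }

    public static void main(String[] args) {
        for (String w : new String[] { "chedy", "ckhol", "daiin" }) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```

Checking 'sh' before the single letter 's' matters here, because 's' on its own is a crust letter while 'sh' is a pedestal.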
@@ -103,7 +149,7 @@ So NEVA and the Slot alphabet have different objectives, as my proposal aims at
 
 # Sean B. Palmer [2004?]
 
-I found the below grammar attributed to Palmer by Pelling:
+I found the below grammar attributed to Palmer by [Pelling](http://ciphermysteries.com/2010/11/22/sean-palmers-voynichese-word-generator) (see also below):
 
 ```
 ^
@@ -121,7 +167,7 @@ A = ai*n*
 O = o
 ```
 
-Accordingly to Pelling, Palmer claims this grammar can generate 97% of Voynichese words, but this is clearly (as Pelling says) this generates a lot of words (potentially infinite strictly looking at the grammar).
+According to Pelling, Palmer claims this grammar can generate 97% of Voynichese words, but this is clearly (as Pelling says) because it generates a lot of words (potentially infinitely many, looking strictly at the grammar).
 
 
 # Elmar Vogt [2009?]
@@ -135,8 +181,6 @@ stars section of the Voynich, which is written in Currier's B language.
 Proposes a [Markov state machine](http://www.ciphermysteries.com/2010/11/22/sean-palmers-voynichese-word-generator)
 to generate Voynichese words.
 
-In his page he mentions grammars attributed to Sean Palmer, which I should investigate and describe here in more detail.
-
 
 # Brian Cham [2014]
 
docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ Below some links to browse the Voynich online.
 
 * Zandbergen's [Voynich MS - Browser](http://www.voynich.nu/folios.html)
 
-* [Voynich Manuscript Project](https://ambertide.github.io/VoynichExplorer/index.html).
+* [Voynich Manuscript Project](https://ambertide.github.io/VoynichExplorer/index.html) by Ege Özkan.
 
 
 
@@ -38,7 +38,7 @@ Each symbol in the alphabet is referred as a **transliteration character** or si
 
 - Unless stated otherwise, pieces of transliterated Voynich script I quote use the "Basic Eva" as transliteration alphabet and are enclosed in single quotes (e.g. 'qockhey').
 
-- A **token** in a text is a single sequence of characters, separated by spaces. The list of **terms** is the list of tokens, without repetitions.
+- A **token** in a text is a single sequence of characters, separated by spaces. The list of **terms** is the list of tokens without repetitions.
 In other words, a token is an instance of a term. For example; the below line in the Voynich
 
 ```

eclipse/io.github.mzattera.v4j-apps/pom.xml

Lines changed: 0 additions & 11 deletions
@@ -12,15 +12,4 @@
         <maven.compiler.source>15</maven.compiler.source>
         <maven.compiler.target>15</maven.compiler.target>
     </properties>
-    <profiles>
-        <profile>
-            <id>java-8-api</id>
-            <activation>
-                <jdk>[9,)</jdk>
-            </activation>
-            <properties>
-                <maven.compiler.release>15</maven.compiler.release>
-            </properties>
-        </profile>
-    </profiles>
 </project>

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/CountNWords.java

Lines changed: 1 addition & 2 deletions
@@ -31,9 +31,8 @@ private CountNWords() {
     public static void main(String[] args) {
         try {
             IvtffText doc = VoynichFactory.getDocument(TranscriptionType.MAJORITY);
-            doc.filterPages(new PageFilter.Builder().cluster("B").build());
 
-            Counter<String> c = process(doc, 3, true);
+            Counter<String> c = process(doc, 2, true);
 
             System.out.println("Most repeated: " + c.getHighestCounted() + " = " + c.getHighestCount());
 
eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/CountRegEx.java

Lines changed: 7 additions & 8 deletions
@@ -41,12 +41,13 @@ public final class CountRegEx {
     public static final ElementFilter<IvtffPage> FILTER = null;
 
     // The RegEx to look for.
-    // private final static String REGEX = "\\?[tpfk]h";
-    // private final static String REGEX = "c([^tpfk]h|[^tpfkh]|[tpfk][^h])";
-    // private final static String REGEX = "[^tpfkcs\\?]h|.\\?h";
-    // private final static String REGEX = "(^|\\.)([^\\.]*[gxvujbz]+[^\\.]*)+(\\.|$)";
-    private final static String REGEX = "[gxvujbz]";
-    // private final static String REGEX = "(^|\\.)[gxvujbz](\\.|$)";
+
+    // Words with rare characters
+    private final static String REGEX = "[^\\.]*[gxvujbz]+[^\\.]*";
+
+    // Total rare characters
+    // private final static String REGEX = "[gxvujbz]";
+
 
     private CountRegEx() {
     }
@@ -68,7 +69,6 @@ public static void main(String[] args) {
             if (FILTER != null)
                 doc = doc.filterPages(FILTER);
 
-
             Counter<String> c = process("." + doc.getPlainText() + ".", REGEX);
 
             for (Entry<String, Integer> e : c.reversed()) {
@@ -79,7 +79,6 @@ public static void main(String[] args) {
         } finally {
             System.out.println("\nCompleted.");
         }
-
     }
 
     public static Counter<String> process(String s, String regex) {
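
To see what the two regular expressions in this file select, here is a minimal, self-contained sketch. It does not use v4j: `RareCharsSketch` and `countMatches` are hypothetical stand-ins for `CountRegEx.process` (whose body is not shown in this diff), a plain `TreeMap` replaces the library's `Counter`, and the sample text is made up, with '.' as the word separator as in the `"." + doc.getPlainText() + "."` call above.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: counts regex matches in a dot-separated text. */
final class RareCharsSketch {

    /** Returns each distinct match of regex in text with its number of occurrences. */
    static Map<String, Integer> countMatches(String text, String regex) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample, not manuscript text.
        String text = ".daiin.chog.xar.daiin.chog.";

        // Whole words containing at least one rare character.
        System.out.println(countMatches(text, "[^\\.]*[gxvujbz]+[^\\.]*"));

        // Individual rare characters.
        System.out.println(countMatches(text, "[gxvujbz]"));
    }
}
```

The first pattern extends each match to the surrounding separators, so it counts words; the second counts single characters, which is why the two constants in the diff are labelled "Words with rare characters" and "Total rare characters".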

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/slot/CountCharsBySlot.java

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@
 import io.github.mzattera.v4j.text.alphabet.SlotAlphabet.TermDecomposition;
 import io.github.mzattera.v4j.text.ivtff.IvtffPage;
 import io.github.mzattera.v4j.text.ivtff.IvtffText;
+import io.github.mzattera.v4j.text.ivtff.PageFilter;
 import io.github.mzattera.v4j.text.ivtff.VoynichFactory;
 import io.github.mzattera.v4j.text.ivtff.VoynichFactory.Transcription;
 import io.github.mzattera.v4j.text.ivtff.VoynichFactory.TranscriptionType;
@@ -0,0 +1,76 @@
+/**
+ *
+ */
+package io.github.mzattera.v4j.applications.slot;
+
+import java.util.Map.Entry;
+
+import io.github.mzattera.v4j.applications.CountRegEx;
+import io.github.mzattera.v4j.text.ElementFilter;
+import io.github.mzattera.v4j.text.alphabet.Alphabet;
+import io.github.mzattera.v4j.text.ivtff.IvtffPage;
+import io.github.mzattera.v4j.text.ivtff.IvtffText;
+import io.github.mzattera.v4j.text.ivtff.VoynichFactory;
+import io.github.mzattera.v4j.text.ivtff.VoynichFactory.Transcription;
+import io.github.mzattera.v4j.text.ivtff.VoynichFactory.TranscriptionType;
+import io.github.mzattera.v4j.util.Counter;
+
+/**
+ * This class prints occurrences of 'c' and 'h' appearing alone (not in 'ch',
+ * 'sh', and gallows).
+ *
+ * @author Massimiliano "Maxi" Zattera
+ *
+ */
+public class FindStrangeCH {
+
+    /**
+     * Which transcription to use.
+     */
+    public static final Transcription TRANSCRIPTION = Transcription.AUGMENTED;
+
+    /**
+     * Which transcription type to use.
+     */
+    public static final TranscriptionType TRANSCRIPTION_TYPE = TranscriptionType.CONCORDANCE;
+
+    /** Filter to use on pages before analysis */
+    public static final ElementFilter<IvtffPage> FILTER = null;
+
+    /**
+     * @param args
+     */
+    public static void main(String[] args) {
+        try {
+            // Prints configuration parameters
+            System.out.println("Transcription     : " + TRANSCRIPTION);
+            System.out.println("Transcription Type: " + TRANSCRIPTION_TYPE);
+            System.out.println("Filter            : " + (FILTER == null ? "<no-filter>" : FILTER));
+            System.out.println();
+
+            IvtffText doc = VoynichFactory.getDocument(TRANSCRIPTION, TRANSCRIPTION_TYPE, Alphabet.EVA);
+            if (FILTER != null)
+                doc = doc.filterPages(FILTER);
+
+            // Replaces "valid" occurrences of c and h
+            String txt = "." + doc.getPlainText() + ".";
+            txt = txt.replaceAll("c([tkpf\\?])h", "C$1H");
+            txt = txt.replaceAll("\\?([tkpf\\?])h", "?$1H");
+            txt = txt.replaceAll("c([tkpf\\?])\\?", "C$1?");
+            txt = txt.replaceAll("ch", "CH");
+            txt = txt.replaceAll("sh", "SH");
+            txt = txt.replaceAll("c\\?", "C?");
+            txt = txt.replaceAll("\\?h", "?H");
+            Counter<String> c = CountRegEx.process(txt, "[^\\.]*[ch]+[^\\.]*");
+
+            for (Entry<String, Integer> e : c.reversed()) {
+                System.out.println(e.getKey() + ";" + e.getValue());
+            }
+        } catch (Exception e) {
+            e.printStackTrace();
+        } finally {
+            System.out.println("\nCompleted.");
+        }
+    }
+
+}
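
The masking idea used by this class (upper-casing every 'c' and 'h' that sits in an expected combination, then searching for whatever lower-case 'c' or 'h' is left) can be tried without the library. The sketch below copies the seven replacement patterns from the diff; `StrayCHSketch`, `maskValid` and `wordsWithStrayCH` are hypothetical names, a plain list replaces the v4j `Counter`, and the sample string is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: finds words with a 'c' or 'h' outside 'ch', 'sh' and pedestalled gallows. */
final class StrayCHSketch {

    /** Upper-cases every 'c'/'h' that is part of an expected glyph ('?' marks an unreadable character). */
    static String maskValid(String txt) {
        txt = txt.replaceAll("c([tkpf\\?])h", "C$1H");   // pedestalled gallows, e.g. 'cth'
        txt = txt.replaceAll("\\?([tkpf\\?])h", "?$1H"); // same, with unreadable first stroke
        txt = txt.replaceAll("c([tkpf\\?])\\?", "C$1?"); // same, with unreadable last stroke
        txt = txt.replaceAll("ch", "CH");
        txt = txt.replaceAll("sh", "SH");
        txt = txt.replaceAll("c\\?", "C?");
        txt = txt.replaceAll("\\?h", "?H");
        return txt;
    }

    /** Returns the dot-separated words that still contain a lower-case 'c' or 'h' after masking. */
    static List<String> wordsWithStrayCH(String txt) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("[^\\.]*[ch]+[^\\.]*").matcher(maskValid(txt));
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Made-up sample, not manuscript text.
        String sample = ".chol.cthy.shey.coh.dah.";
        System.out.println(maskValid(sample));
        System.out.println(wordsWithStrayCH(sample));
    }
}
```

The order of the replacements matters: the three-character gallows patterns run before the plain 'ch' and 'sh' ones, so a sequence like 'cth' is masked as a whole rather than piece by piece.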

eclipse/io.github.mzattera.v4j/.classpath

Lines changed: 2 additions & 2 deletions
@@ -18,12 +18,12 @@
             <attribute name="maven.pomderived" value="true"/>
         </attributes>
     </classpathentry>
-    <classpathentry exported="true" kind="con" path="org.eclipse.m2e.MAVEN2_CLASSPATH_CONTAINER">
+    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-15">
         <attributes>
             <attribute name="maven.pomderived" value="true"/>
         </attributes>
     </classpathentry>
-    <classpathentry kind="con" path="org.eclipse.jdt.launching.JRE_CONTAINER/org.eclipse.jdt.internal.debug.ui.launcher.StandardVMType/JavaSE-15">
+    <classpathentry exported="true" kind="con" path="org.eclipse.m2e.MAVEN2_CLASSPATH_CONTAINER">
         <attributes>
             <attribute name="maven.pomderived" value="true"/>
         </attributes>

eclipse/io.github.mzattera.v4j/pom.xml

Lines changed: 0 additions & 11 deletions
@@ -12,17 +12,6 @@
         <maven.compiler.source>15</maven.compiler.source>
         <maven.compiler.target>15</maven.compiler.target>
     </properties>
-    <profiles>
-        <profile>
-            <id>java-8-api</id>
-            <activation>
-                <jdk>[9,)</jdk>
-            </activation>
-            <properties>
-                <maven.compiler.release>15</maven.compiler.release>
-            </properties>
-        </profile>
-    </profiles>
     <repositories>
         <repository>
             <id>jitpack.io</id>

eclipse/io.github.mzattera.v4j/src/main/java/io/github/mzattera/v4j/text/alphabet/SlotAlphabet.java

Lines changed: 1 addition & 3 deletions
@@ -276,10 +276,8 @@ public static String fromEva(String txt) throws ParseException {
 
         // TODO write test
 
-        // plant intrusion is replaced by a space
+        // Mark plant intrusion and end of paragraph for later
         txt = txt.replace("<->", "-");
-
-        // Mark end of paragraph for later
         txt = txt.replace("<$>", "$");
 
         // Remove comments as they might interfere with replacement

resources/analysis/slots/Slots.xlsx

-4.79 KB
Binary file not shown.
