
Commit 9496b51

Note 10
1 parent 6989efe commit 9496b51

15 files changed

+207
-253
lines changed

docs/010/images/SummaryTable.PNG

23 KB

docs/010/index.md

Lines changed: 58 additions & 101 deletions
@@ -1,6 +1,6 @@
11
# Note 010 - Character distribution through the page
22

3-
_Last updated Dec. 19th, 2024._
3+
_Last updated Dec. 28th, 2024._
44

55
_This note refers to [release v.13.0.0](https://github.com/mzattera/v4j/tree/v.13.0.0) of v4j;
66
**links to classes and files refer to this release**; files might have been changed, deleted or moved in the current master branch.
@@ -18,7 +18,7 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
1818

1919
# Abstract
2020

21-
It is known since the very beginning of Voynich studies, that the distribution of character within the page presents some statistical anomalies.
21+
It has been known since the very beginning of Voynich studies that the distribution of characters within the pages presents some statistical anomalies.
2222
This note looks into it, using for the first time the
2323
[Slot transcription](https://github.com/mzattera/v4j/blob/master/eclipse/io.github.mzattera.v4j/src/main/resources/Transcriptions/Interlinear_slot_ivtff_1.5.txt).
2424

@@ -45,21 +45,23 @@ The set of experiments is as follows:
4545
* First line in paragraph - first lines of paragraphs are compared with the rest of the text.
4646
* Last line in paragraph - last lines of paragraphs are compared with the rest of the text.
4747
* First letter in a line - initial character of first token in a line is compared with initial characters of all other tokens.
48+
For reasons that will be clearer later, the first line of each paragraph (thus the first token) is ignored.
4849
* Last letter in a line - final character of last token in a line is compared with last characters of all other tokens.
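The "first letter in a line" experiment above can be sketched as follows. This is a minimal illustration in plain Java, not the v4j implementation: the class `LineInitialCounter`, its method, the nested-list representation of paragraphs, and the sample tokens are all assumptions made for the example. It skips the whole first line of each paragraph (one possible reading of the rule above), tallies raw counts only, and leaves out any significance test.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of v4j.
final class LineInitialCounter {

    /**
     * Tallies initial characters of tokens, skipping the first line of each paragraph.
     * A paragraph is a list of lines; a line is a list of tokens.
     *
     * @return two maps: index 0 for tokens that open a line, index 1 for all other tokens.
     */
    static List<Map<Character, Integer>> countInitials(List<List<List<String>>> paragraphs) {
        Map<Character, Integer> lineInitial = new TreeMap<>();
        Map<Character, Integer> others = new TreeMap<>();
        for (List<List<String>> paragraph : paragraphs) {
            // Start from index 1: the first line of the paragraph is ignored (assumption).
            for (int l = 1; l < paragraph.size(); ++l) {
                List<String> line = paragraph.get(l);
                for (int t = 0; t < line.size(); ++t) {
                    String token = line.get(t);
                    if (token.isEmpty()) {
                        continue;
                    }
                    Map<Character, Integer> target = (t == 0) ? lineInitial : others;
                    target.merge(token.charAt(0), 1, Integer::sum);
                }
            }
        }
        return List.of(lineInitial, others);
    }

    public static void main(String[] args) {
        // Made-up sample tokens, for illustration only.
        List<List<List<String>>> text = List.of(List.of(
                List.of("pchedy", "okal"),
                List.of("saiin", "chol", "dar"),
                List.of("ytedy", "sol")));
        List<Map<Character, Integer>> counts = countInitials(text);
        System.out.println("Line-initial: " + counts.get(0));
        System.out.println("Elsewhere:    " + counts.get(1));
    }
}
```

The two maps can then be turned into relative frequencies and compared; the "last letter in a line" experiment would be the mirror image, using the final character of the last token.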
4950

5051
The results are shown in the table below[{1}](#Note1)[{2}](#Note2):
5152

5253
![Summary table of anomalies in char distribution](images/SummaryTable.PNG)
5354

54-
As a test, experiments have been repeated with a shuffled version of the Voynich where the layout (number of tokens in each line) has been preserved but tokens were shuffled around randomly, and the anomalies in distribution disappeared.
55+
As a test, experiments have been repeated with a shuffled version of the Voynich where the layout (number of tokens in each line) has been preserved but tokens were shuffled around randomly,
56+
and the anomalies in distribution disappeared.
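A layout-preserving shuffle of this kind can be sketched as below. This is only an illustration under stated assumptions: the class `LayoutPreservingShuffle`, the list-of-lines representation, and the sample tokens are hypothetical and are not taken from v4j. All tokens are pooled, shuffled, and dealt back so that every line keeps its original token count.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of v4j.
final class LayoutPreservingShuffle {

    /** Returns new lines with the same token count per line, but tokens randomly reassigned. */
    static List<List<String>> shuffle(List<List<String>> lines, Random rnd) {
        // Pool every token of the text, then shuffle the pool.
        List<String> pool = new ArrayList<>();
        for (List<String> line : lines) {
            pool.addAll(line);
        }
        Collections.shuffle(pool, rnd);

        // Deal the shuffled tokens back, line by line, preserving each line's length.
        List<List<String>> result = new ArrayList<>();
        int next = 0;
        for (List<String> line : lines) {
            result.add(new ArrayList<>(pool.subList(next, next + line.size())));
            next += line.size();
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample tokens, for illustration only.
        List<List<String>> page = List.of(
                List.of("fachys", "ykal", "ar"),
                List.of("sory", "ckhar"),
                List.of("daiin"));
        System.out.println(shuffle(page, new Random(42)));
    }
}
```

Passing the `Random` in explicitly keeps a run reproducible, which helps when the shuffled text is used as a control for the same experiments.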
5557

5658
# Considerations and Previous Works
5759

5860
## (Pedestalled) Gallows
5961

60-
Before analyzing the above result, I want to discuss the distribution of "gallows" characters ('p', 't', 'k', 'f', 'P', 'T', 'K', 'F').
62+
Before analyzing the above results, I want to discuss the distribution of "gallows" characters ('p', 't', 'k', 'f', 'P', 'T', 'K', 'F').
6163
For this purpose, I have prepared the following set of tables[{3}](#Note3)[{4}](#Note4);
62-
all tables have been prepared using the majority transliteration of the Voynich, using the Slot alphabet [see v4j README](https://github.com/mzattera/v4j#ivtff).
64+
all tables have been prepared using the majority transliteration of the Voynich, using the Slot alphabet ([see v4j README](https://github.com/mzattera/v4j#ivtff)).
6365
The analysis has been done by splitting the text into [clusters](../003) and then considering different parts of the text:
6466

6567
* First Word of a Paragraph
@@ -98,7 +100,8 @@ Comparing these three tables and considering the one with characters distributio
98100

99101
6. The "pedestalled gallows" seem to follow the same behavior of their "non-pedestalled" counterparts; thus 'T' behaves like 't' (its
100102
distribution seems more uniform across lines though), 'P' like 'p', 'F' like 'f', and 'K' like 'k',
101-
with the exception that they tend to avoid being token initials (especially 'K').
103+
with the exception that they tend to avoid being token initials (especially 'K');
104+
notice how 'K', 'P', and 'T' appear significantly less as first character in a line.
102105
However, this last part is difficult to confirm, given the small number of these glyphs.
103106
104107
7. From 75% (for Pharmaceutical) to 95% (for Herbal B) of paragraphs begin with "gallows".
@@ -112,8 +115,6 @@ see, among others,[TILTMAN (1967)](../biblio.md)[{5}](#Note5), [CURRIER (1976)](
112115
It is interesting that, even with variations and very few exceptions, the above rules apply to all of the different clusters;
113116
this is somewhat surprising, given that we know that "languages" for each cluster are structurally different (see [Note 009](../009)).
114117

115-
116-
117118

118119
## First Line in a Page
119120

@@ -122,124 +123,76 @@ For the time being, I will assume that the differences between first line of a p
122123
the sample being much smaller for beginnings of pages, the trends are just less marked.
123124

124125

125-
126126
## First Line in a Paragraph
127127

128-
** See separate class doen for gallows
129-
** S appears more frequently; nobody noticed so far, probably because of using EVA; do the test using EVA and see if c s h have some anomalies....
130-
131-
tokens starting with non-pedestalled gallows are almost always found as first token in a paragraph.
132-
tokens starting with pedestalled gallows are more rare and distributed more or less evenly, but tend not to appear at the beginning of a line other than first line of paragraphs.
133-
134-
75-95% of first tokens in paragraphs start with a (pedestalled) gallows.
135-
136-
p, f, P, F appear almost exclusively in first line of a paragraph; p and f, when appearing in the first token, are almost always initials.
137-
138-
Other (pedestalled) gallows tend to not appear as token initials.
139-
140-
141-
128+
1. As already discussed above, 'f', 'F', 'p', 'P' tend to appear more frequently in the first line of paragraphs; the same holds true for 't', except for
129+
the Herbal pages. 'k' and 'K' have the opposite behavior, tending to appear more frequently outside the first line. Finally, not much can be said about 'T'.
130+
2. There is also a preference for 'S' to appear in first line of paragraphs.
131+
3. 'e' seems to appear more frequently without repetitions in first line (see low frequencies of 'E' and 'B').
132+
4. 'n' avoids the first line of paragraphs.
133+
5. With the exception of the Pharmaceutical section, 'J' avoids the first line of paragraphs.
134+
6. For the Biological and Stars sections only, 'r' and 'o' seem over-represented in the first line; the opposite is true for 's'.
142135

136+
To my knowledge, with the exception of point 1., these are new findings which are due to:
143137

138+
a. Using the Slot alphabet for the analysis rather than EVA.
139+
140+
For example, Slot 'S' is a single character represented in EVA as two characters ('sh'); this makes it difficult, if not impossible,
141+
for analysis based on EVA to spot the abundance of 'sh' in first line, as the statistics will be skewed by single occurrences of EVA 's'
142+
or EVA sequences like 'ch' 'cth', etc.
143+
144+
b. Performing a separate analysis for each cluster (for point 6).
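Point a. can be illustrated with a toy count. The sketch below is an assumption-laden example, not v4j code: the class `EvaDigraphDemo` and the sample string are invented. It shows that a per-character count of EVA 's' mixes standalone 's' with the 's' inside 'sh', whereas counting 'sh' as one unit (as the Slot alphabet does with 'S') keeps them apart.

```java
// Hypothetical helper, not part of v4j.
final class EvaDigraphDemo {

    /** Counts non-overlapping occurrences of a sequence in a text. */
    static int countSequence(String text, String sequence) {
        if (sequence.isEmpty()) {
            throw new IllegalArgumentException("sequence must not be empty");
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(sequence, from)) >= 0) {
            ++count;
            from += sequence.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up EVA-like line, for illustration only.
        String line = "shol sar chol shey";

        int allS = countSequence(line, "s");   // every EVA 's', including those inside 'sh'
        int sh = countSequence(line, "sh");    // the digraph, i.e. one Slot 'S' each
        int standaloneS = allS - sh;           // 's' not followed by 'h'

        System.out.println("EVA 's' (all)      : " + allS);        // 3
        System.out.println("EVA 'sh' (Slot 'S'): " + sh);          // 2
        System.out.println("EVA 's' standalone : " + standaloneS); // 1
    }
}
```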
144145

145146

146147
## Last Line in a Paragraph
147148

148-
149+
1. 'f', 't' and especially 'q' and 'p' tend to avoid last line of paragraphs.
150+
151+
I found no mention of this before.
149152

150153

151154
## First Letter in a Line
152155

153-
[TILTMAN (1967)](../biblio.md) 'y' occurs quite frequently as the initial symbol of a line followed immediately by a combination of symbols which seem
154-
to be happy without it in any part of a line away from the beginning (d).
155-
156-
[CURRIER (1976)](../biblio.md) "functional entity":
157-
1. "The frequency counts of the beginnings and endings of lines are markedly different from the counts of the same characters internally".
158-
159-
[CURRIER (1976)](../biblio.md)
160-
* The 'ligatures' [ cKh cTh cFh cPh ] can never occur as paragraph initial, and almost never line initial.
161-
162-
[CURRIER (1976)](../biblio.md)
163-
* Skewed frequencies at beginnings of lines may be illustrated by the two letters ch and Sh.
164-
If its occurrence as an initial were random, we would expect it to occur one seventh of the time in each token position of a line.
165-
Actually, it is a very infrequent token initial at the beginning of a line, except when there is an intercalated o. This applies only to 'Language' A.
166-
Other ‘tokens’ occur in this position far more frequently than expected, particularly ‘tokens’ with initial ‘dC,’ ‘qC’ etc.,
167-
which have the appearance of ‘C’-initial ‘tokens’ suitably modified for line-initial use
168-
-> Nobody noticed, maybe because in EVA this is treated as two characters ('sh'), which skews the statistics.
169-
except for Currier who transcripes this as S Z.
170-
-> Anyway, also look at the differences in the percentages
156+
To perform this analysis, the first token of each paragraph has been ignored, as we already know from the analysis above that
157+
that token will most likely start with gallows (thus skewing our analysis).
158+
159+
1. 't' and, less markedly, 'p' are over-represented at line start; the opposite is true for 'k', confirming our analysis of gallows above.
160+
2. 's', 'y', and 'd' (with the exception of Herbal A) are also over-represented at line start.
161+
3. 'C' and, less markedly, 'S' are under-represented at line start.
162+
4. 'a', 'o' (with the exception of Herbal A, where it shows the opposite behavior), and, less markedly, 'r' are under-represented at the beginning of a line.
163+
164+
Again, much of this is not new: [CURRIER (1976)](../biblio.md) states that
165+
"The frequency counts of the beginnings and endings of lines are markedly different from the counts of the same characters internally" and he noticed how
166+
'C' and 'S' are under-represented (unless followed by 'o').
167+
168+
[TILTMAN (1967)](../biblio.md) noticed that "'y' occurs quite frequently as the initial symbol of a line followed immediately by a combination of symbols which seem
169+
to be happy without it in any part of a line away from the beginning".
171170

172-
[BOWERN (2020)](../biblio.md)
173-
There is a similar but less robust pattern associated with the beginning of each line. The
171+
[BOWERN (2020)](../biblio.md) mentions that "The
174172
first token is somewhat more likely to begin with s- s. This may be another orthographic
175173
variant, but it appears to only occur with tokens that otherwise begin with o- o or a- a. Thus
176-
aiin aiin, ol ol, and or or are replaced with saiin saiin, sol sol, and sor sor.
177-
178-
174+
aiin aiin, ol ol, and or or are replaced with saiin saiin, sol sol, and sor sor." This is consistent with
175+
points 2. and 4. above.
179176

180177

181178
## Last Letter in a Line
182179

183-
[TILTMAN (1967)](../biblio.md) 'm' appears most commonly at the end of a line, rarely elsewhere (b).
184-
185-
[CURRIER (1976)](../biblio.md) "functional entity",
186-
187-
1. "The frequency counts of the beginnings and endings of lines are markedly different from the counts of the same characters internally".
188-
189-
2. There is, for instance, one symbol that, while it does occur elsewhere, occurs at the
190-
end of the last ‘tokens’ of lines 85% of the time".
191-
192-
[BOWERN (2020)](../biblio.md)
193-
There are also characters which usually appear at the end of the last token of the line,
194-
particularly m. It is plausible that m m and g g are variant forms of the token-final glyphs -iin iin and -y y
195-
However, if this is an orthographic convention, it is not applied in a consistent manner: the forms -iin iin and -y
196-
y are also found line-finally, albeit somewhat less frequently.
197-
198-
[ZANDBERGEN (2021)](../biblio.md)
199-
The third feature is similar to the second, but it is less pronounced, and could be easier to explain. This is
200-
the character m that is a token-final character that predominantly (but again not always) appears at the
201-
ends of lines. In this case, the letter could conceivably be a line final variant form of either r or l , but
202-
there are some issues with that hypothesis.
203-
204-
205-
206-
## Other Patterns
207-
208-
[KNIGHT]
209-
Confirms uneven char distribution but does it for the entire text
210-
It is particularly interesting that lower frequency characters occur more at line-ends,
211-
and higher-frequency ones at the beginnings of lines.
212-
-> REALLY!?!?!? INTERESTING TO TEST, see io.github.mzattera.v4j.applications.chars.CharByPositionTest
213-
214-
Patrick Feaster CONFERENCE
215-
Rightward and Downward Grapheme Distributions in the Voynich Manuscript.
180+
1. 'm' is over-represented at the end of lines.
181+
2. Conversely, 'l' and 'r' are under-represented.
182+
3. For some clusters, 'd', 'o', 'n', and 'y' show a significant deviation in their distribution.
183+
184+
Point 1. is a well-known fact, reported in [TILTMAN (1967)](../biblio.md), [CURRIER (1976)](../biblio.md), [BOWERN (2020)](../biblio.md),
185+
and [ZANDBERGEN (2021)](../biblio.md).
216186

217187
# Conclusions
218188

219189
The distribution of characters across the page presents some anomalies which are statistically significant and are summarized in the table above.
220-
May of these anomalies have been detected by several authors in the past.
221-
222-
However, this is possibly the first time when it is shown that the list of characters presenting anomalies in their distribution, the extent and the direction of these anomalies
223-
differ across different sections of the Voynich. By looking at each cluster separately, I also identified some anomalies which, as far as I know, are new.
224-
225-
We summarize below the main trends, but we invite to refer to the above table for a detailed analysis, case by case.
226-
-> Most peculiar cluster: HA
227-
228-
**Little progress has been made since Tillman and currier on char distribution until now**
229-
230-
** Most evident cases: q d l o n, which behave in markedly opposite ways in different clusters**
190+
Many of these anomalies have been detected by several authors in the past, but some are possibly new:
231191

232-
**Highlight char anomalies which nobody discovered before (e.g., 'a' or 'y' as first char in a line)**
233-
234-
If we look to behaviors that appear consistently across clusters, we can see that:
235-
236-
* 'k' does not appear in first line of pages and in first line of paragraphs (with a slightly less significance for BB cluster).
237-
* 'S' and 'p' appear with high frequency in first line of paragraphs.
238-
* 'y', 't', and 'd' tend to appear as first letter in a line; with the exception of cluster HA where 'd' has the opposite behavior.
239-
'C', 'S', 'o', and 'a' hardly do; with the exception of cluster HA again where 'o' appears with high frequency.
240-
* 'l' and 'r' tend not to appear as terminal letter of last token in a line.
192+
1. 'k' and 'K' behaving differently than other gallows.
193+
2. 'f', 't' and especially 'q' and 'p' tend to avoid last line of paragraphs.
241194

242-
Is Currier's lien as a functional entity valid?
195+
In addition, it is worth mentioning that some characters behave differently in different clusters.
243196

244197

245198
---
@@ -274,6 +227,10 @@ On this point, please see [Note 005](../005) where I show, given the slot struct
274227
Still, I think there is good evidence that the initial gallows in paragraphs might be an addition to the actual token. If this is done for aesthetic reasons or is part of the encoding scheme
275228
(as Grove suggests) I cannot tell.
276229

230+
<a id="Note8">**{9}**</a> John Grove seems to be the first person to notice that "First Gallows on a page can normally be detached from the first word to form a relatively normal VMS word",
231+
suggesting these characters might be additions to the token (see also [this message](http://voynich.net/Arch/2004/09/msg00442.html) from Stolfi, which picks up on this).
232+
233+
277234
---
278235

279236
[**<< Home**](..)

docs/index.md

Lines changed: 5 additions & 1 deletion
@@ -72,7 +72,7 @@ In other words, a token is an instance of a word type. For example; the below li
7272

7373
This should be considered when applying statistical analysis methods to the manuscript.
7474

75-
- [Note 004 - On Word types](./004)
75+
- [Note 004 - On Word Types](./004)
7676

7777
List of most common Voynichese word types and how they are split across different clusters.
7878

@@ -97,6 +97,10 @@ In other words, a token is an instance of a word type. For example; the below li
9797
I used insights provided by the above grammar to show structural differences in words appearing in different sections of the Voynich.
9898
This suggests: 1) that Currier's languages can be more than 2 and 2) clustering might not be showing a difference in topics.
9999

100+
- [Note 010 - Character Distribution Through Clusters](./010)
101+
102+
I used the Slot alphabet to explore character distribution across clusters in different parts of pages.
103+
100104

101105
# Bibliography and Reviews
102106

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/FindVowels.java

Lines changed: 4 additions & 36 deletions
@@ -28,44 +28,12 @@
2828
*/
2929
public class FindVowels {
3030

31-
/**
32-
* Which transcription to use.
33-
*/
34-
public static final Transcription TRANSCRIPTION = Transcription.AUGMENTED;
35-
36-
/**
37-
* Which transcription type to use.
38-
*/
39-
public static final TranscriptionType TRANSCRIPTION_TYPE = TranscriptionType.MAJORITY;
40-
41-
/**
42-
* Which Alphabet type to use.
43-
*/
44-
public static final Alphabet ALPHABET = Alphabet.SLOT;
45-
46-
/** Filter to use on pages before analysis */
47-
public static final ElementFilter<IvtffPage> FILTER = new PageFilter.Builder().cluster("BB").build();
48-
4931
/**
5032
* @param args the command line arguments
5133
*/
5234
public static void main(String[] args) {
5335
try {
54-
55-
// Get the document to process
56-
// Prints configuration parameters
57-
System.out.println("Transcription : " + TRANSCRIPTION);
58-
System.out.println("Transcription Type: " + TRANSCRIPTION_TYPE);
59-
System.out.println("Alphabet : " + ALPHABET);
60-
System.out.println("Filter : " + (FILTER == null ? "<no-filter>" : FILTER));
61-
System.out.println();
62-
63-
IvtffText doc = VoynichFactory.getDocument(TRANSCRIPTION, TRANSCRIPTION_TYPE, ALPHABET);
64-
if (FILTER != null)
65-
doc = doc.filterPages(FILTER);
66-
67-
process(BibleFactory.getDocument("latin"));
68-
36+
process(BibleFactory.getDocument("latin"), true);
6937
} catch (Exception e) {
7038
e.printStackTrace(System.err);
7139
System.exit(-1);
@@ -75,10 +43,10 @@ public static void main(String[] args) {
7543
/**
7644
* Searches given regular expression
7745
*/
78-
public static void process(Text doc) {
46+
public static void process(Text doc, boolean toUpperCase) {
7947

8048
Alphabet a = doc.getAlphabet();
81-
String txt = a.toUpperCase(doc.getPlainText());
49+
String txt = toUpperCase ? a.toUpperCase(doc.getPlainText()) : doc.getPlainText();
8250
Counter<Character> charCount = StringUtil.countChars(txt);
8351

8452
// Maps a char to an index 1..N and vice versa
@@ -99,7 +67,7 @@ public static void process(Text doc) {
9967
char prev = txt.charAt(0);
10068
for (int i = 1; i < txt.length(); ++i) {
10169
char curr = txt.charAt(i);
102-
if (a.isRegular(prev) && a.isRegular(curr)) {
70+
if (a.isRegular(prev) && a.isRegular(curr) && !a.isWordSeparator(prev) && !a.isWordSeparator(curr)) {
10371
preceeds[index.get(prev)][index.get(curr)]++;
10472
prev = curr;
10573
}
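The idea behind the changed condition in this hunk, counting adjacent character pairs only when neither character is a word separator, can be sketched in isolation. The example below is a simplified stand-in, not the v4j code: the class `PairCounter` is hypothetical, it treats a caller-supplied character as the only word separator (v4j delegates that decision to `Alphabet.isWordSeparator`), and it stores counts in a map rather than in the `preceeds` matrix.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of v4j.
final class PairCounter {

    /** Counts adjacent character pairs, ignoring any pair that touches the separator. */
    static Map<String, Integer> countPairs(String text, char separator) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i < text.length(); ++i) {
            char prev = text.charAt(i - 1);
            char curr = text.charAt(i);
            // Pairs spanning a word boundary are not counted.
            if (prev == separator || curr == separator) {
                continue;
            }
            counts.merge("" + prev + curr, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up input, for illustration only.
        System.out.println(countPairs("ab ba ab", ' ')); // {ab=2, ba=1}
    }
}
```

Excluding separators keeps the pair statistics restricted to characters that sit next to each other inside a word, so word boundaries do not show up as spurious pairs.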

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/chars/CharDistributionAnalysis.java

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ public static void main(String[] args) {
9393
System.out.print("\n\n[ Last line in paragraph ];\n");
9494
process(voynich, new Experiment.LastLineInParagraph());
9595
System.out.print("\n\n[ First letter in a line ];\n");
96-
process(voynich, new Experiment.Initials(new Experiment.FirstWordInLine(false, false), true));
96+
process(voynich, new Experiment.Initials(new Experiment.FirstWordInLine(true, false), true));
9797
System.out.print("\n\n[ Last letter in a line ];\n");
9898
process(voynich, new Experiment.Finals(new Experiment.LastWordInLine(false, false), true));
9999
