
Commit dfef6d6 (1 parent: 90893ba)

"Term" replaced by "word type"

File tree

9 files changed: +95 / -95 lines

docs/003/index.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -17,7 +17,7 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
 
 # Abstract
 
-This note discuss the application of k-means clustering algorithms to Voynich pages, showing how the terms in the page
+This note discuss the application of k-means clustering algorithms to Voynich pages, showing how the word types in the page
 strongly correlate with the page illustration type (Herbal, Biological, Pharmaceutical, etc.) and Currier's language (A or B).
 
 # Previous Works
@@ -38,9 +38,9 @@ I use the EVA alphabet, but it is not relevant for this discussion, as I look at
 ## Embedding and Distance Measure
 
 The text is split into units for analysis, that could be single pages or bigger portions of text (e.g. parchments / bi-folios).
-Each unit is embedded as a bag of words where the dimensions are the "readable" terms in the Voynich (that is, Voynich "words" with no
+Each unit is embedded as a bag of words where the dimensions are the "readable" word types in the Voynich (that is, Voynich "words" with no
 "unreadable" characters [{1}](#Note1))
-and the value for the dimension is the number of times corresponding term appears in the text unit.
+and the value for the dimension is the number of times corresponding word type appears in the text unit.
 
 Similarity between textual units is computed as positive angular distance of corresponding embedding; this returns angular distance
 between two vectors assumed to have only positive components.
@@ -233,7 +233,7 @@ Currier's languages reflect language differences in the underlying plain text.
 However, it can be that these similarities reflect a different technique (or variations of the same technique) used to create
 the parchments. This technique could be either a proper cypher or a way to produce "random" text.
 
-- As the above grouping reflects a similar distribution of terms in the text, no matter what was the cause,
+- As the above grouping reflects a similar distribution of word types in the text, no matter what was the cause,
 these differences should be kept in mind when performing statistical analysis of the text or when trying to decipher it.
 
 For this reason, v4j library provides means to classify pages accordingly to above considerations, the resulting clusters are shown below[{5}](#Note5)
````
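The embedding and distance measure described in the docs/003 hunk above can be sketched as follows. This is an illustrative Python sketch, not the v4j Java implementation, and the sample text units are invented; only the bag-of-words scheme and the positive angular distance come from the note.

```python
import math
from collections import Counter

def embed(tokens, vocabulary):
    """Bag-of-words vector: one dimension per readable word type,
    valued with the number of times that word type appears in the unit."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

def angular_distance(u, v):
    """Angle between two vectors with non-negative components (0 .. pi/2)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(min(1.0, dot / norm))

# Hypothetical text units; the vocabulary is the union of their word types.
unit_a = ["daiin", "chey", "daiin", "or"]
unit_b = ["chey", "daiin", "dar"]
vocabulary = sorted(set(unit_a) | set(unit_b))
d = angular_distance(embed(unit_a, vocabulary), embed(unit_b, vocabulary))
```

Because all components are non-negative, the distance always falls between 0 (identical word-type distributions) and pi/2 (no word types in common).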

docs/004/index.md

Lines changed: 11 additions & 11 deletions

````diff
@@ -1,4 +1,4 @@
-## Note 004 - On Terms
+## Note 004 - On Word types
 
 _Last updated Sep. 18th, 2021._
 
@@ -17,29 +17,29 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
 
 The class
 ['MostUsedTerms'](https://github.com/mzattera/v4j/blob/v.4.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mattera/v4j/applications/MostUsedTerms.java)
-finds top 20 most used terms for each cluster defined in [Note 003](../003) and prints out the result in .CSV format.
+finds top 20 most used word types for each cluster defined in [Note 003](../003) and prints out the result in .CSV format.
 
 An Excel file ("`MostUsedTerms.xlsx`") containing this data can be found under the
 [analysis folder](https://github.com/mzattera/v4j/tree/master/resources/analysis).
 
-The below table summarizes the results, showing, the relative frequency of terms in each cluster.
+The below table summarizes the results, showing, the relative frequency of word types in each cluster.
 
-![Most used terms](images/Terms.PNG)
+![Most used word types](images/Word types.PNG)
 
-As expected from cluster analysis, beside terms that appear frequently in all clusters (such as 'chey', 'daiin', 'dar', 'dy', and 'or'),
-there are terms characteristic of a single cluster; the table below shows them.
+As expected from cluster analysis, beside word types that appear frequently in all clusters (such as 'chey', 'daiin', 'dar', 'dy', and 'or'),
+there are word types characteristic of a single cluster; the table below shows them.
 
-![Most used terms](images/Unique.PNG)
+![Most used word types](images/Unique.PNG)
 
 It might be interesting to note that:
 
-- Most common terms in Herbal A pages (HA cluster) start with 'ch-' or 'sh-'; the latter prefix appearing only here,
+- Most common word types in Herbal A pages (HA cluster) start with 'ch-' or 'sh-'; the latter prefix appearing only here,
 
-- Pharmaceutical (PA cluster) common terms end in '-ol', which is rare for other clusters. In addition, they seem to prefer the 'ok-' or 'qok-' prefix.
+- Pharmaceutical (PA cluster) common word types end in '-ol', which is rare for other clusters. In addition, they seem to prefer the 'ok-' or 'qok-' prefix.
 
-- Herbal B pages (HB cluster) prefer terms starting with 'qo-' and 'qok-'.
+- Herbal B pages (HB cluster) prefer word types starting with 'qo-' and 'qok-'.
 
-- Zodiac (ZZ) common terms mostly start with 'ot-', this is uncommon for clusters above. Moreover, these pages feature single characters as common terms.
+- Zodiac (ZZ) common word types mostly start with 'ot-', this is uncommon for clusters above. Moreover, these pages feature single characters as common word types.
 
 
 ---
````
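The counting step that `MostUsedTerms` performs can be sketched like this. The real class is Java and reads the transcription through the v4j library; this is an illustrative Python stand-in, and the cluster names and token lists below are invented, not Voynich data.

```python
import csv
import io
from collections import Counter

# Hypothetical clusters mapped to their token lists (stand-in data).
clusters = {
    "HA": ["daiin", "chey", "chey", "shol", "daiin"],
    "HB": ["qokeey", "chey", "qokaiin", "qokeey"],
}

# Emit the top 20 most used word types per cluster in .CSV format.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["cluster", "word_type", "count"])
for cluster, tokens in clusters.items():
    for word_type, count in Counter(tokens).most_common(20):
        writer.writerow([cluster, word_type, count])
csv_text = out.getvalue()
```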

docs/005/images/Summary Pie.PNG

302 Bytes (binary file changed)

docs/005/images/Summary.PNG

-12.5 KB (binary file changed)

docs/005/index.md

Lines changed: 33 additions & 33 deletions
````diff
@@ -18,7 +18,7 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
 
 ## Abstract
 
-I show how the structure of Voynich words can be easily described by assuming each term is composed by "slots" that can be filled
+I show how the structure of Voynich words can be easily described by assuming each word type is composed by "slots" that can be filled
 accordingly to simple rules, which are described below.
 
 This in turn sheds some lights on the definition of what might constitute a Voynich character (the Voynich alphabet).
````
````diff
@@ -32,27 +32,27 @@ exists in any modern text as well. However, I will try to focus on claims that a
 
 I start my analysis from a concordance version of the Voynich text (see [Note 001](../001)); this is obtained from the
 Landini-Stolfi Interlinear file by merging available interlinear transcriptions for each transcriber. In the merging, characters that are not
-read by all authors in the same way are marked as unreadable. This to ensure the terms I will extract from the text are the most accurate.
+read by all authors in the same way are marked as unreadable. This to ensure the word types I will extract from the text are the most accurate.
 
 For reasons explained below, any occurrence of the following characters is also marked with an unreadable character:
 
 - 'g', 'x', 'v', 'u', 'j', 'b', 'z' (47 occurrences in total, 13 of them are single-letter words).
 
-- 'c' and 'h', when they do not appear in combinations such as 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh'; this sums up to 11 terms.
+- 'c' and 'h', when they do not appear in combinations such as 'ch', 'sh', 'cth', 'ckh', 'cph', 'cfh'; this sums up to 11 word types.
 
 As a second step, **tokens** are created by splitting the text where a space was detected by at least one of the transcribers; there are 31'317 tokens in the text,
 ignoring those that contain an unreadable character.
 
-The list of **terms** is the list of tokens without repetitions (this would be the "vocabulary" of the Voynich).
-These 5'105 total terms have then been analyzed as explained below.
+The list of **word types** is the list of tokens without repetitions (this would be the "vocabulary" of the Voynich).
+These 5'105 total word types have then been analyzed as explained below.
 
 
 ## Considerations
 
-By looking at the terms in the Voynich, we can see their structure (that is, the sequence of Voynich glyphs used to write them) can be easily described
+By looking at the word types in the Voynich, we can see their structure (that is, the sequence of Voynich glyphs used to write them) can be easily described
 as follows:
 
-- each term can be considered as composed by 12 "slots"; for convenience I will number them from 0 to 11.
+- each word type can be considered as composed by 12 "slots"; for convenience I will number them from 0 to 11.
 
 - each slot can be empty or contain a single glyph.
 
````

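The token / word-type distinction used throughout the note can be sketched as follows. This is an illustrative Python sketch; the toy line stands in for the transcription, with '?' marking an unreadable character, as in the preprocessing described above.

```python
from collections import Counter

line = "daiin chey daiin ?okar chey dar"  # invented sample, '?' = unreadable

# Tokens: space-separated units, ignoring any with an unreadable character.
tokens = [t for t in line.split() if "?" not in t]

# Word types: the tokens without repetitions (the "vocabulary").
word_types = sorted(set(tokens))

# How often each word type appears as a token.
counts = Counter(tokens)
```

On the real text this distinction is what separates the 31'317 tokens from the 5'105 word types.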
````diff
@@ -65,12 +65,12 @@ The below table summarizes all of these rules, showing the 12 slots and the glyp
 ![Slots](images/Slots Table.PNG)
 
 In some cases, the word structure can be ambiguous, since a glyph can occupy any of 2 available slots
-(e.g. the term 'y' can be seen as a 'y' either in slot 1 or slot 11); following some further
+(e.g. the word type 'y' can be seen as a 'y' either in slot 1 or slot 11); following some further
 [analysis on word structure](../007), when decomposing a word, I always put each glyph in the rightmost possible position.
 Notice this is a "weak" rule that is quite arbitrary and has no impact on which
-terms can or cannot be described by this model.
+word types can or cannot be described by this model.
 
-To exemplify this concept, I show how some common terms can be decomposed in slots;
+To exemplify this concept, I show how some common word types can be decomposed in slots;
 
 ```
 'daiin'
````
````diff
@@ -97,34 +97,34 @@ To exemplify this concept, I show how some common terms can be decomposed in slo
 
 We can then see [{2}](#Note2) that tokens can be classified as follows:
 
-- 27'114 tokens (86.6% of total), corresponding to 2'617 different terms (51.3% of total), can be decomposed in slots accordingly to the above rules. I will call these tokens "**regular**".
+- 27'114 tokens (86.6% of total), corresponding to 2'617 different word types (51.3% of total), can be decomposed in slots accordingly to the above rules. I will call these tokens "**regular**".
 
-- 3'249 tokens (10.4% of total), corresponding to 1'892 different terms (37.1% of total), can be divided in two parts, each composed by at least two Voynich glyphs,
-where each of these parts is a regular term. I will call these tokens "**separable**".
+- 3'249 tokens (10.4% of total), corresponding to 1'892 different word types (37.1% of total), can be divided in two parts, each composed by at least two Voynich glyphs,
+where each of these parts is a regular word type. I will call these tokens "**separable**".
 
-Moreover, we can see that for 2'219 separable terms (75.2% of total separable terms) their constituent parts appear as tokens in the text at least as often as the whole
-separable term. For example, the term 'chockhy' appears 18 times in the text; it is a separable term that can be divided in two parts, each one being a regular term, as
-'cho' - 'ckhy' which appears in the text 79 and 39 times respectively. I think this is an indication that many separable terms are possibly just two regular words that were written together
+Moreover, we can see that for 2'219 separable word types (75.2% of total separable word types) their constituent parts appear as tokens in the text at least as often as the whole
+separable word type. For example, the word type 'chockhy' appears 18 times in the text; it is a separable word type that can be divided in two parts, each one being a regular word type, as
+'cho' - 'ckhy' which appears in the text 79 and 39 times respectively. I think this is an indication that many separable word types are possibly just two regular words that were written together
 (or the space between them was not transcribed correctly).
-When I need to distinguish these terms from other separable terms, I will call them "**verified separable**" or simply "**verified**".
+When I need to distinguish these word types from other separable word types, I will call them "**verified separable**" or simply "**verified**".
 
-- Remaining 954 tokens (3.0% of total), corresponding to 596 different terms (11.7% of total), are marked as "**unstructured**".
+- Remaining 954 tokens (3.0% of total), corresponding to 596 different word types (11.7% of total), are marked as "**unstructured**".
 
-Notice that 489 out of these 596 terms, or 82%, appear only once in the text; this percentage is 59.8% for regular and separable terms considered together.
+Notice that 489 out of these 596 word types, or 82%, appear only once in the text; this percentage is 59.8% for regular and separable word types considered together.
 This might suggest that unstructured words are either typos or special words that are encoded differently than other words.
 
-- Sometime I contrast regular and separable terms to unstructured ones by calling the former "**structured**".
+- Sometime I contrast regular and separable word types to unstructured ones by calling the former "**structured**".
 
 The below tables summarize these findings.
 
 ![Table with distribution of words accordingly to their classification.](images/Summary.PNG)
 
 ![Pie chart with distribution of words accordingly to their classification.](images/Summary Pie.PNG)
 
-In short, almost 9 out of 10 tokens in the Voynich text exhibit a "slot" structure. Of the remaining, a fair amount can be decomposed in two parts each corresponding to regular terms
+In short, almost 9 out of 10 tokens in the Voynich text exhibit a "slot" structure. Of the remaining, a fair amount can be decomposed in two parts each corresponding to regular word types
 appearing elsewhere in the text. The remaining cases (3 out of 100) are mostly words appearing only once in the text.
 
-The below table shows percentage occurrence of glyphs in slots for regular terms [{3}](#Note3).
+The below table shows percentage occurrence of glyphs in slots for regular word types [{3}](#Note3).
 
 <a id="GliphCountImg" />
 ![Table with glyph count by slot.](images/Char Count by Slot.PNG)
````
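The "verified separable" criterion from the hunk above can be sketched as follows. Only the 'chockhy' / 'cho' / 'ckhy' counts come from the note; the `regular` set stands in for the full slot-decomposition check, and splitting by character length is a simplification (the note counts glyphs, which need not be single EVA characters).

```python
from collections import Counter

# Token counts from the note's example; everything else is illustrative.
token_counts = Counter({"chockhy": 18, "cho": 79, "ckhy": 39})
regular = {"cho", "ckhy"}  # stand-in for "decomposable in slots"

def verified_separable(word, token_counts, regular):
    """A word is verified separable if it splits into two regular parts
    (each at least two symbols long) that each occur as tokens at least
    as often as the whole word."""
    for i in range(2, len(word) - 1):
        left, right = word[:i], word[i:]
        if (left in regular and right in regular
                and token_counts[left] >= token_counts[word]
                and token_counts[right] >= token_counts[word]):
            return True
    return False
```

With these counts, 'chockhy' (18 occurrences) verifies via 'cho' (79) + 'ckhy' (39), matching the example in the text.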
````diff
@@ -139,9 +139,9 @@ The definition of the Voynich alphabet, that is of which glyphs should be consid
 Each transcriber must continuously decide what symbols in the manuscript constitute instances of the same glyph and how each glyph needs to be mapped into
 one or more transliteration characters.
 
-However, if we consider the above defined slots as relevant for the structure of terms, we can reasonably assume that each glyph appearing in a slot constitutes
+However, if we consider the above defined slots as relevant for the structure of word types, we can reasonably assume that each glyph appearing in a slot constitutes
 a basic unit of information, that is a character in the Voynich alphabet.
-As far as I know, this is the first time that a possible Voynich alphabet is supported by empirical evidence of an inner structure of Voynich terms.
+As far as I know, this is the first time that a possible Voynich alphabet is supported by empirical evidence of an inner structure of Voynich word types.
 
 Below, I analyze more in detail some relationships between glyphs, as they appear in slots, and EVA characters.
 
````
````diff
@@ -172,7 +172,7 @@ that is a more compact from of writing a combination of the pedestal and a gallo
 If we look at slots 3 through 5, we might think that pedestalled gallows can be indeed a combination of a gallows character followed by the pedestal, in this specific order.
 However,
 
-- The combination of gallows in slot 3 followed by a pedestal in slot 4 is quite common in the text. 2'185 tokens, or 419 regular terms, that is 16% of regular terms,
+- The combination of gallows in slot 3 followed by a pedestal in slot 4 is quite common in the text. 2'185 tokens, or 419 regular word types, that is 16% of regular word types,
 and written explicitly as two glyphs.
 
 - In 332 tokens, we have a pedestal followed by a pedestalled gallows. This would correspond to a double pedestal is in a word (or a separable word), which contrasts with the
````
````diff
@@ -206,7 +206,7 @@ Based on the above, I assume each sequence of 'e' and 'i' is probably a characte
 
 Finally, drawing from the above considerations, I propose a new transliteration alphabet, which I will call the **Slot alphabet** for obvious reasons.
 
-I think that, being based on the inner structure of Voynich terms, this alphabet is more suitable than others when performing statistical analysis that relies on characters in words or when attempting
+I think that, being based on the inner structure of Voynich word types, this alphabet is more suitable than others when performing statistical analysis that relies on characters in words or when attempting
 to decipher the Voynich, where a one-to-one correspondence between the transliteration characters and the Voynich characters is paramount.
 
 In addition, the alphabet can be easily converted into EVA, and vice-versa, therefore being used interchangeably.
````
````diff
@@ -234,20 +234,20 @@ I created a [separate page](../006).
 
 ## Conclusions
 
-- Majority of words in the Voynich exhibits an inner structure described here, where terms can be represented as composed by 12 "slots" that can be left empty or
+- Majority of words in the Voynich exhibits an inner structure described here, where word types can be represented as composed by 12 "slots" that can be left empty or
 populated by a single glyph chosen among a very limited group of glyphs (usually 2-3).
 
-- 86.6% of tokens (51.3% of terms) exhibit this structure (**regular** terms).
+- 86.6% of tokens (51.3% of word types) exhibit this structure (**regular** word types).
 
-- 10.4% of tokens (37.1% of terms) can be divided in two parts, each presenting the inner structure described above (**separable** terms).
+- 10.4% of tokens (37.1% of word types) can be divided in two parts, each presenting the inner structure described above (**separable** word types).
 
-For 68.3% of separable terms, their two constituents appear in the text more often than the separable term itself (**verified separable** terms).
+For 68.3% of separable word types, their two constituents appear in the text more often than the separable word type itself (**verified separable** word types).
 
-This seems a strong indication that separable terms are made by two regular terms written or transcribed together.
+This seems a strong indication that separable word types are made by two regular word types written or transcribed together.
 
-- only 3.0% of tokens (11.7% of terms) do not exhibit this structure (**unstructured** terms).
+- only 3.0% of tokens (11.7% of word types) do not exhibit this structure (**unstructured** word types).
 
-82% of unstructured terms appears only once in the text. In other words, **only 1.5% of tokens (2.1% of terms) are unstructured terms appearing at least twice in the text**.
+82% of unstructured word types appears only once in the text. In other words, **only 1.5% of tokens (2.1% of word types) are unstructured word types appearing at least twice in the text**.
 
 I argue that these can be typos or plain text words encoded in a different way than the majority of the text (e.g. because they represent proper names or uncommon words).
 
````
