Commit 9c26195

Now IvtffText reads file header correctly.
1 parent 2584cc0 commit 9c26195

File tree

11 files changed

+76
-48
lines changed

README.md

Lines changed: 2 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -8,7 +8,7 @@ The folder `eclipse` contains an eclipse workspace. The (Maven) project `io.gith
88
The library content is described below. The (Maven) project `io.github.mattera.v4j-apps` contains classes I created to experiment with the
99
Voynich manuscript; here you can find examples about how to use the library.
1010

11-
**_Note:_** _Plase check the [project pages](https://mzattera.github.io/v4j/) for some terminology that is relevant here.
11+
**_Note:_** Please check the [project pages](https://mzattera.github.io/v4j/) for some terminology that is relevant here.
1212

1313
## Packages and Library Overview - Project `io.github.mattera.v4j`
1414

@@ -226,7 +226,7 @@ Please take a look at what is in here before implementing anything from scratch.
226226

227227
### Testing
228228

229-
Project `io.github.mattera.v4j-apps` contain JUnit tests for the v4j library and (some) of the "applications" in `v4j-apps`.
229+
Project `io.github.mattera.v4j-apps` contains JUnit tests for the v4j library and (some) of the "applications" in `v4j-apps`.
230230

231231

232232

docs/003/index.md

Lines changed: 1 addition & 1 deletion
@@ -234,7 +234,7 @@ Currier's languages reflect language differences in the underlying "clear" text.
234234
- As the above grouping reflects a similar distribution of terms in the text, no matter what was the cause,
235235
these differences should be kept in mind when performing statistical analysis of the text or when attempting its decipherment.
236236

237-
For this reason v4j library provides means to classify pages accordingly to above considerations, the resulting clusters are shown below
237+
For this reason, the v4j library provides means to classify pages according to the above considerations; the resulting clusters are shown below
238238
(also see [`PageHeader`](https://github.com/mzattera/v4j/blob/v.3.0.0/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/ivtff/PageHeader.java) class).
239239

240240
![Cluster size in words](images/Clusters.PNG)

docs/004/index.md

Lines changed: 7 additions & 6 deletions
@@ -1,6 +1,6 @@
11
## Note 004 - On Terms
22

3-
_Last updated Sep. 9th, 2021._
3+
_Last updated Sep. 18th, 2021._
44

55
_This note refers to [release v.4.0.0](https://github.com/mzattera/v4j/tree/v.4.0.0) of v4j;
66
**links to classes and files refer to this release** and files might have been changed, deleted or moved in the current master branch.
@@ -26,22 +26,23 @@ The below table summarizes the results, showing, the relative frequency of terms
2626

2727
![Most used terms](images/Terms.PNG)
2828

29-
As expected from cluster analysis, beside terms that appear frequently in all clusters (such as **chey**, **daiin**, **dar**, and **dy**),
29+
As expected from cluster analysis, besides terms that appear frequently in all clusters (such as 'chey', 'daiin', 'dar', 'dy', and 'or'),
3030
there are terms characteristic of a single cluster; the table below shows them.
3131

3232
![Most used terms](images/Unique.PNG)
3333

3434
It might be interesting to note that:
3535

36-
- Most common terms in Herbal A pages (HA cluster) start with ch- or sh-; the latter prefix appearing only here,
36+
- Most common terms in Herbal A pages (HA cluster) start with 'ch-' or 'sh-'; the latter prefix appearing only here,
3737

38-
- Pharmaceutical (PA cluster) common terms end in -ol, which is rare for other clusters. In addition, they seem to prefer the ok- or qok- prefix.
38+
- Pharmaceutical (PA cluster) common terms end in '-ol', which is rare for other clusters. In addition, they seem to prefer the 'ok-' or 'qok-' prefix.
3939

40-
- Herbal B pages (HB cluster) prefer terms starting with qo(k)-.
40+
- Herbal B pages (HB cluster) prefer terms starting with 'qo-' and 'qok-'.
4141

42-
- Zodiac (ZZ cluster) common terms mostly start with ot-, this is uncommon for other clusters. Moreover, this cluster
42+
- Zodiac (ZZ cluster) common terms mostly start with 'ot-'; this is uncommon for other clusters. Moreover, this cluster
4343
features single characters as common terms.
4444

45+
4546
---
4647

4748
[**<< Home**](..)

docs/005/index.md

Lines changed: 7 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Note 005 - Slots and a New Alphabet
22

3-
_Last updated Sep. 12th, 2021._
3+
_Last updated Sep. 18th, 2021._
44

55
_This note refers to [release v.5.0.0](https://github.com/mzattera/v4j/tree/v.5.0.0) of v4j;
66
**links to classes and files refer to this release**; files might have been changed, deleted or moved in the current master branch.
@@ -21,15 +21,17 @@ _Please refer to the [home page](..) for a set of definitions that might be rele
2121
I show how the structure of Voynich words can be easily described by assuming each term is composed of "slots" that can be filled
2222
according to simple rules, which are described below.
2323

24-
This in turn sheds some lights on the definition of what constitute a Voynich character (the Voynich alphabet).
24+
This in turn sheds some light on the definition of what might constitute a Voynich character (the Voynich alphabet).
2525

2626
Given the nature of this topic, it is impossible to define rules that apply to 100% of cases; after all, syntactical and grammatical exceptions
2727
exist in any modern text as well. However, I will try to focus on claims that apply to the vast majority of cases.
2828

2929

3030
## Previous Works
3131

32-
**TODO** _add the core/mantel/crust and the state machine works_.
32+
Either here or at the end as "Comparison with other works".
33+
34+
**TODO** https://briancham1994.com/2014/12/17/curve-line-system/.
3335

3436
- This approach is easier to explain and has more implications.
3537

@@ -183,8 +185,8 @@ Again, this seems a strong indication that EVA 'h' does not correspond to a Voyn
183185

184186
#### 'e' and 'i'
185187

186-
The characters 'e' and 'i' only appear in slots 6 and 9 respectively, in a sequence of 1, 2 or 3. Currier has assumed these sequences of same characters are
187-
single Voynich characters. Based upon how they appear in the slots, I think this is a reasonable assumption.
188+
The characters 'e' and 'i' only appear in slots 6 and 9 respectively, in a sequence of 1, 2 or 3. Currier has assumed each sequence of the same character is
189+
a single Voynich character. Based upon how they appear in the slots, I think this is a reasonable assumption.
188190

189191

190192
## The Slot Alphabet

docs/index.md

Lines changed: 3 additions & 3 deletions
@@ -13,13 +13,13 @@ in order to provide some (hopefully) more solid basis for discussions and deciph
1313
On this site, I will try to be consistent with following terminology.
1414

1515
- A **transliteration** is a symbol-by-symbol conversion of one script into another. Transliteration is needed to represent the content of the Voynich in a
16-
format that can be printed or stored in computer files. I might sometime use the less correct term **transliteration** as a synonym and refer
16+
format that can be printed or stored in computer files. I might sometimes use the less correct term **transcription** as a synonym and refer
1717
to an author of a transliteration as a **transcriber**.
1818

19-
- I refer to the list of symbols used in the target script as the **transliteration alphabet** or simply as the **alphabet**.
19+
- I refer to the list of symbols used in the target script as the **transliteration alphabet** or simply as the **alphabet**.
2020
Each symbol in the alphabet is referred to as a **transliteration character** or simply **character**.
2121

22-
- The term **glyph** refers to a symbol in the Voynich that appears to constitute a single unit of text. In principle, a glyph could represent one or more
22+
- The term **glyph** refers to a symbol in the Voynich that appears to constitute a basic unit of text. In principle, a glyph could represent one or more
2323
**Voynich characters** that constitute the **Voynich alphabet**.
2424

2525
The question of which glyphs are actual single Voynich characters is still very open and it is at the basis of the different transliteration alphabets being created.

eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/alphabet/Alphabet.java

Lines changed: 21 additions & 1 deletion
@@ -7,7 +7,9 @@
77
package io.github.mattera.v4j.text.alphabet;
88

99
import java.util.ArrayList;
10+
import java.util.HashMap;
1011
import java.util.List;
12+
import java.util.Map;
1113

1214
/**
1315
* Defines an alphabet of symbols used in a text.
@@ -25,8 +27,26 @@ public abstract class Alphabet {
2527
/** The Slot alphabet (see https://mzattera.github.io/v4j/005/) */
2628
public final static SlotAlphabet SLOT = new SlotAlphabet();
2729

30+
// All available alphabets, by code
31+
private static final Map<String, Alphabet> ALPHABETS = new HashMap<>();
32+
static {
33+
ALPHABETS.put(EVA.getCodeString(), EVA);
34+
ALPHABETS.put(UTF_16.getCodeString(), UTF_16);
35+
ALPHABETS.put(SLOT.getCodeString(), SLOT);
36+
}
37+
38+
/**
39+
* @param codeString the string code of the alphabet, as used in the IVTFF file
40+
* header.
41+
* @return The Alphabet with given code, or null if it cannot be found.
42+
*/
43+
public static Alphabet getAlphabet(String codeString) {
44+
return ALPHABETS.get(codeString);
45+
}
46+
2847
/**
29-
* @return a string code for this alphabet, same as that used in the IVTFF file.
48+
* @return the string code for this alphabet, as used in the IVTFF file
49+
* header.
3050
*/
3151
public abstract String getCodeString();
3252
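
The lookup added to `Alphabet` in this hunk can be sketched in isolation as follows. This is only a minimal illustration, not the library code: the class `AlphabetRegistrySketch` is hypothetical, alphabets are represented by plain strings instead of `Alphabet` instances, and the `"Slot"` code string is an assumption (only `"Eva-"` appears in the file headers shown in this commit).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the registry added to Alphabet; only the lookup logic is sketched. */
class AlphabetRegistrySketch {

    // All available alphabets, by code. Values are plain names here, not Alphabet instances.
    private static final Map<String, String> ALPHABETS = new HashMap<>();
    static {
        ALPHABETS.put("Eva-", "EVA");  // code string seen in the IVTFF headers of this commit
        ALPHABETS.put("Slot", "SLOT"); // assumed code string, for illustration only
    }

    /** Returns the alphabet registered under the given code, or null if there is none. */
    static String getAlphabet(String codeString) {
        return ALPHABETS.get(codeString);
    }

    public static void main(String[] args) {
        System.out.println(getAlphabet("Eva-"));
        System.out.println(getAlphabet("Xyz-"));
    }
}
```

Because the lookup returns `null` for an unknown code, every caller has to check the result and raise its own error, which is what the header-parsing code changed in this commit is meant to do.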

eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/ivtff/IvtffText.java

Lines changed: 12 additions & 16 deletions
@@ -42,23 +42,18 @@ public String getId() {
4242
return id;
4343
}
4444

45-
// Transcript version: A.B.x
46-
// TODO this is not stored
4745
private String version;
4846

4947
/**
50-
*
5148
* @return Transcript version: A.B.x
5249
*/
5350
public String getVersion() {
5451
return version;
5552
}
5653

57-
// Major version: A.B
5854
private String majorVersion;
5955

6056
/**
61-
*
6257
* @return Transcript major version: A.B
6358
*/
6459
public String getMajorVersion() {
@@ -167,7 +162,7 @@ public static IvtffText fromPages(IvtffText doc, Collection<IvtffPage> pages) {
167162
* file. AND IS UNSUPPORTED
168163
*/
169164
public final static Pattern FILE_HEADER_PATTERN = Pattern.compile("#=IVTFF (.{4}) ([0-9]+\\.[0-9]+)(\\.[0-9]+)?");
170-
165+
171166
/**
172167
* The notation used to identify a page in the Voynich MS is the character f
173168
* (for folio) followed by the folio number, followed by r (for recto - the
@@ -178,10 +173,9 @@ public static IvtffText fromPages(IvtffText doc, Collection<IvtffPage> pages) {
178173
public final static Pattern PAGE_HEADER_PATTERN = Pattern.compile("<f[0-9]{1,3}[rv][0-9]?>|<fRos>");
179174

180175
/**
181-
* Constructor from Reader.
182-
* Assumes EVA alphabet.
176+
* Constructor from Reader. Assumes EVA alphabet.
183177
*
184-
* @param in A Reader from which the document content is read.
178+
* @param in A Reader from which the document content is read.
185179
*/
186180
private IvtffText(BufferedReader in) throws IOException, ParseException {
187181

@@ -193,16 +187,17 @@ private IvtffText(BufferedReader in) throws IOException, ParseException {
193187
if (row == null)
194188
throw new ParseException("The input file is empty.");
195189

196-
// TODO add proper descriptor that includes header information.
197-
// Also check and store IVTFF version of the file.
198190
Matcher m = FILE_HEADER_PATTERN.matcher(row);
199191
if (!m.matches())
200192
throw new ParseException("Invalid file header: ", row);
201-
if (m.group(1).equals("Eva-")) {
202-
this.alphabet = Alphabet.EVA;
203-
} else {
204-
throw new ParseException("Unsupported alphabeth: " + m.group(1));
205-
}
193+
194+
this.alphabet = Alphabet.getAlphabet(m.group(1));
195+
if (this.alphabet == null)
196+
throw new ParseException("Unsupported alphabet: " + m.group(1));
197+
this.majorVersion = m.group(2);
198+
if (!this.majorVersion.equals("1.5"))
199+
throw new ParseException("Unsupported IVTFF format version: " + this.majorVersion);
200+
this.version = m.group(2) + (m.group(3) == null ? "" : m.group(3)); // group 3 is null when the header has no ".x" suffix
206201

207202
IvtffPage currentPage = null;
208203

@@ -304,6 +299,7 @@ public void write(File fOut) throws IOException {
304299
out.newLine();
305300
out.write("# Latest modified on: " + f.format(new Date()));
306301
out.newLine();
302+
out.write("#");
307303
out.newLine();
308304

309305
for (IvtffPage page : elements) {
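
To make the header handling above easier to follow, here is a small self-contained sketch of how `FILE_HEADER_PATTERN` splits a header row. The regular expression is copied from this diff; the class `FileHeaderSketch`, the method `parseHeader` and the use of `IllegalArgumentException` (instead of the library's `ParseException`) are assumptions made only to keep the example standalone. It also shows why the optional third group needs a null check: for a header such as `#=IVTFF Eva- 1.5` that group does not take part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of v4j. */
class FileHeaderSketch {

    // Same expression as IvtffText.FILE_HEADER_PATTERN in this commit.
    static final Pattern FILE_HEADER_PATTERN =
            Pattern.compile("#=IVTFF (.{4}) ([0-9]+\\.[0-9]+)(\\.[0-9]+)?");

    /** Returns { alphabet code, major version A.B, full version A.B[.x] }. */
    static String[] parseHeader(String row) {
        Matcher m = FILE_HEADER_PATTERN.matcher(row);
        if (!m.matches())
            throw new IllegalArgumentException("Invalid file header: " + row);

        String major = m.group(2);
        String minor = m.group(3); // null when the optional ".x" part is absent
        String full = (minor == null) ? major : major + minor;
        return new String[] { m.group(1), major, full };
    }

    public static void main(String[] args) {
        for (String row : new String[] { "#=IVTFF Eva- 1.5", "#=IVTFF Eva- 1.5.2018092200" }) {
            String[] h = parseHeader(row);
            System.out.println(h[0] + " | " + h[1] + " | " + h[2]);
        }
    }
}
```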

eclipse/io.github.mattera.v4j/src/main/resources/Transcriptions/Interlinear_ivtff_1.5.txt

Lines changed: 6 additions & 3 deletions
@@ -1,7 +1,10 @@
1-
#=IVTFF Eva- 1.5.2018092200
1+
#=IVTFF Eva- 1.5
2+
# Created automatically by io.github.mattera.v4j.applications.text.BuildConcordanceVersion on: 20210919.1526
3+
#
24
#
3-
# Minor changes made by Maxi to align trascription of a few lines, so they can be merged.
4-
# Last edited on 2018-09-22 17:13 by Maxi
5+
# This is based on Landini-Stolfi Interlinear file from http://www.voynich.nu/data/beta/LSI_ivtff_0d.txt
6+
# Minor changes made by Massimiliano Zattera to align transcription of a few lines, so they can be merged automatically.
7+
# Last edited on 2021-09-21 14:50 by Massimiliano Zattera.
58
#
69
# # <f0.A> {}
710
# Last edited on 1998-12-05 11:22:03 by stolfi

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mattera/v4j/applications/text/BuildConcordanceVersion.java

Lines changed: 13 additions & 8 deletions
@@ -8,7 +8,10 @@
88
import java.io.IOException;
99
import java.io.InputStreamReader;
1010
import java.io.OutputStreamWriter;
11+
import java.text.DateFormat;
12+
import java.text.SimpleDateFormat;
1113
import java.util.ArrayList;
14+
import java.util.Date;
1215
import java.util.List;
1316
import java.util.regex.Matcher;
1417

@@ -90,29 +93,31 @@ static void doWork(BufferedReader in, BufferedWriter out) throws IOException, Pa
9093
String fLine = null;
9194
IvtffLine line = null;
9295
int lnum = 0;
93-
Alphabet a = null;
9496

9597
// Parse file header
9698
fLine = in.readLine();
9799
if (fLine == null)
98100
throw new ParseException("Empty input.");
99101
++lnum;
100102

101-
// TODO add proper descriptor that includes header information.
102-
// TODO add support for other alphabets
103103
Matcher m = IvtffText.FILE_HEADER_PATTERN.matcher(fLine);
104104
if (!m.matches())
105105
throw new ParseException("Invalid file header: ", fLine);
106-
if (m.group(1).equals("Eva-")) {
107-
a = Alphabet.EVA;
108-
} else {
109-
throw new ParseException("Unsupported alphabeth: " + m.group(1));
110-
}
106+
107+
Alphabet a = Alphabet.getAlphabet(m.group(1));
108+
if (a == null)
109+
throw new ParseException("Unsupported alphabet: " + m.group(1));
111110

112111
// Write header
113112
out.write(fLine);
114113
out.newLine();
115114

115+
DateFormat f = new SimpleDateFormat("yyyyMMdd.HHmm");
116+
out.write("# Created automatically by " + BuildConcordanceVersion.class.getName() + " on: " + f.format(new Date()));
117+
out.newLine();
118+
out.write("#");
119+
out.newLine();
120+
116121
while ((fLine = in.readLine()) != null) {
117122
++lnum;
118123
line = null;
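
The "Created automatically by" comment written above uses the `yyyyMMdd.HHmm` pattern of `SimpleDateFormat`. The sketch below reproduces that line in a standalone form; the class `CreationStampSketch`, the method `stampLine` and the explicit time zone parameter are assumptions added here so that the output is reproducible (the code in the diff formats `new Date()` in the default time zone).

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper; not part of v4j-apps. */
class CreationStampSketch {

    /** Builds the comment line recording which tool created the file and when. */
    static String stampLine(String toolName, Date when, TimeZone zone) {
        DateFormat f = new SimpleDateFormat("yyyyMMdd.HHmm"); // same pattern as in the diff
        f.setTimeZone(zone); // assumption: fixed zone, only to make the result deterministic
        return "# Created automatically by " + toolName + " on: " + f.format(when);
    }

    public static void main(String[] args) {
        System.out.println(stampLine(CreationStampSketch.class.getName(), new Date(), TimeZone.getDefault()));
    }
}
```

With this pattern a timestamp has the same shape as the `20210919.1526` value visible in the regenerated `Interlinear_ivtff_1.5.txt` header.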

eclipse/io.github.mzattera.v4j-apps/src/main/resources/Transcriptions/LSI_ivtff_0d_fixed.txt

Lines changed: 4 additions & 3 deletions
@@ -1,7 +1,8 @@
1-
#=IVTFF Eva- 1.5.2018092200
1+
#=IVTFF Eva- 1.5
22
#
3-
# Minor changes made by Maxi to align trascription of a few lines, so they can be merged.
4-
# Last edited on 2018-09-22 17:13 by Maxi
3+
# This is based on Landini-Stolfi Interlinear file from http://www.voynich.nu/data/beta/LSI_ivtff_0d.txt
4+
# Minor changes made by Massimiliano Zattera to align transcription of a few lines, so they can be merged automatically.
5+
# Last edited on 2021-09-21 14:50 by Massimiliano Zattera.
56
#
67
# # <f0.A> {}
78
# Last edited on 1998-12-05 11:22:03 by stolfi

resources/analysis/slots/Slots.xlsx

-37 Bytes
Binary file not shown.
