mzattera
diff --git a/README.md b/README.md
Lines changed: 3 additions & 1 deletion
diff --git a/docs/001/index.md b/docs/001/index.md
Lines changed: 12 additions & 8 deletions
diff --git a/docs/005/images/Char Count by Slot.PNG b/docs/005/images/Char Count by Slot.PNG
64.2 KB
diff --git a/docs/005/images/Rare.PNG b/docs/005/images/Rare.PNG
1.53 KB
diff --git a/docs/005/images/Slot Alphabet.PNG b/docs/005/images/Slot Alphabet.PNG
-442 Bytes
diff --git a/docs/005/index.md b/docs/005/index.md
Lines changed: 23 additions & 27 deletions
diff --git a/docs/006/index.md b/docs/006/index.md
Lines changed: 33 additions & 0 deletions
diff --git a/docs/index.md b/docs/index.md
Lines changed: 4 additions & 0 deletions
diff --git a/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/alphabet/SlotAlphabet.java b/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/alphabet/SlotAlphabet.java
Lines changed: 14 additions & 13 deletions
diff --git a/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/ivtff/IvtffLine.java b/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/ivtff/IvtffLine.java
Lines changed: 9 additions & 32 deletions
@@ -42,6 +42,8 @@ The `Alphabet` class provides some static fields to access already defined alpha
 - `Alphabet.EVA` is the Basic EVA alphabet.
 
 - `Alphabet.UTF_16` is the UTF-16 char-set used in Java. This is the alphabet to be used to process "normal" (that is, non-Voynich) text files and strings.
+
+- `Alphabet.SLOT` is the Slot alphabet as defined in [this working note](https://mzattera.github.io/v4j/005/).
 
 
 ### `io.github.mattera.v4j.text`
@@ -78,7 +80,7 @@ where multiple versions of each line in the manuscript are provided, one per aut
 
 - **`AUGMENTED`**: This is an "augmented" version of the LSI transliteration where two "artificial" transcribers were created, 
 each corresponding to one of `IvtffText.TranscriptionType` values; `IvtffText.TranscriptionType` can be used in factory methods described below to 
-get one of these transcriptions.
+get one of these transcriptions. This transliteration is available in both the EVA and Slot alphabets.
 
   - **`CONCORDANCE`**: each line of this transliteration is created by merging readings from all available transcribers. Only characters that appear to be read 
   in the same way by all authors are considered; other characters (read differently by one or more transcribers) are marked as unreadable.
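
The concordance rule described above can be illustrated with a minimal, self-contained sketch. It is not the library's `BuildConcordanceVersion` code: the class `ConcordanceSketch` and its method are hypothetical names, the readings are assumed to be already aligned to the same length, and `?` is assumed to be the unreadable marker (it is the character used for that purpose in `SlotAlphabet.fromEva`).

```java
import java.util.List;

// Hypothetical illustration; not part of v4j.
class ConcordanceSketch {

    /**
     * Merges aligned readings of the same line: a character is kept only if
     * every transcriber read it the same way, otherwise it becomes '?'.
     */
    static String concordance(List<String> readings) {
        if (readings.isEmpty())
            throw new IllegalArgumentException("No readings provided.");

        int len = readings.get(0).length();
        for (String r : readings)
            if (r.length() != len)
                throw new IllegalArgumentException("Readings are not aligned.");

        StringBuilder merged = new StringBuilder(len);
        for (int i = 0; i < len; ++i) {
            char c = readings.get(0).charAt(i);
            boolean same = true;
            for (String r : readings) {
                if (r.charAt(i) != c) {
                    same = false;
                    break;
                }
            }
            // Disagreement between transcribers: mark as unreadable.
            merged.append(same ? c : '?');
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        System.out.println(concordance(List.of("daiin", "dairn", "daiin")));
    }
}
```

In this sketch a single dissenting transcriber is enough to mark a position as unreadable, which is the strict reading of "read in the same way by all authors".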
 
@@ -1,8 +1,8 @@
 ## Note 001 - The Text
 
-_Last updated Sep. 6th, 2021._
+_Last updated Sep. 19th, 2021._
 
-_This note refers to [release v.1.0.0](https://github.com/mzattera/v4j/tree/v.1.0.0) of v4j;
+_This note refers to [release v.5.0.0](https://github.com/mzattera/v4j/tree/v.5.0.0) of v4j;
 **links to classes and files refer to this release** and files might have been changed, deleted or moved in the current master branch.
 In addition, some of this note content might have become obsolete in more recent versions of the library._
 
@@ -22,25 +22,29 @@ obtain an `IvtffText` instance with the Voynich text. At present the library pro
 Landini-Stolfi Interlinear file (**LSI**) and an augmented version of it, containing concordance and majority versions of the text.
 
 The corresponding IVTFF files (which are read by the factory) can be found in the
-[resource folder](https://github.com/mzattera/v4j/tree/v.1.0.0/eclipse/io.github.mattera.v4j/src/main/resources/Transcriptions)
+[resource folder]()
 of the library.
 
-The "augmented" version is created using class
-[`BuildConcordanceVersion`](https://github.com/mzattera/v4j/blob/d7b349c08c780214bebe3b515623f54951bb3886/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mattera/v4j/applications/BuildConcordanceVersion.java);
+The "augmented" EVA version is created using class
+[`BuildConcordanceVersion`]();
 the input for the class is a slightly modified version of LSI that can be found in the
-[v4j-apps resource folder](https://github.com/mzattera/v4j/tree/v.1.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/resources/Transcriptions).
+[v4j-apps resource folder]().
 In this version, minor changes that do not alter the text content are made, in order to make sure
 all the different versions of the lines align properly, as required by `BuildConcordanceVersion` code.
 
+Class
+[`BuildSlotVersion`]();
+is then used to transcribe the "augmented" version from EVA into the Slot alphabet.
+
 ### The Bible Text
 
 Similarly, class
-[`BuildBibleTranscription`](https://github.com/mzattera/v4j/blob/d7b349c08c780214bebe3b515623f54951bb3886/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mattera/v4j/applications/BuildBibleTranscription.java)
+[`BuildBibleTranscription`]()
 is used to produce a .txt version of the Bible from XML files that can be found in the
 [v4j-apps resource folder](https://github.com/mzattera/v4j/tree/v.1.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/resources/Transcriptions).
 
 The corresponding IVTFF files (which are read by the factory) can be found in the 
-[resource folder](https://github.com/mzattera/v4j/tree/v.1.0.0/eclipse/io.github.mattera.v4j/src/main/resources/Transcriptions)
+[resource folder]()
 of the library.
 
 ---
 
@@ -1,6 +1,6 @@
 # Note 005 - Slots and a New Alphabet
 
-_Last updated Sep. 18th, 2021._
+_Last updated Sep. 19th, 2021._
 
 _This note refers to [release v.5.0.0](https://github.com/mzattera/v4j/tree/v.5.0.0) of v4j;
 **links to classes and files refer to this release**; files might have been changed, deleted or moved in the current master branch.
@@ -29,11 +29,9 @@ exists in any modern text as well. However, I will try to focus on claims that a
 
 ## Previous Works
 
-Either here or at the end as "Comparison with other works".
+I am not the first to analyze the internal structure of Voynich words.
 
-**TODO** https://briancham1994.com/2014/12/17/curve-line-system/.
-
-- This approach is easier to explain and has more implications.
+One day I will create a [working note](../006) to compare this analysis with others.
 
 
 ## Methodology
@@ -112,11 +110,9 @@ where each of these parts is a regular term. I will call these tokens "**separab
   (or the space between them was not read correctly by the transcriber of the text).
   When I need to distinguish these terms from other separable terms, I will call them **verified separable** or simply **verified**.
 
-  **TODO** check the length of the parts and see if only short terms are joined. Check if separable tends to appear in tight spaces. 
-
 - Remaining 618 tokens (2.0% of total), corresponding to 429 different terms (8.4% of total), are marked as "**unstructured**".
 
-  **TODO** Show that vast majority of unstructured words appear only once in the text. This is probably true for separable too.
+  Notice that 366 out of these 429 terms appear only once in the text.
 
 - Sometimes I contrast regular and separable terms with unstructured ones by calling the former ***structured***.  
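
The "appear only once" figure quoted above (366 out of 429 unstructured terms) is a count of terms with a single token. A minimal sketch of how such a count can be obtained is shown below; `HapaxSketch` and its methods are hypothetical names used only for illustration, and the input is assumed to be a plain list of tokens rather than an `IvtffText`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of v4j.
class HapaxSketch {

    /** Counts how many times each term occurs among the given tokens. */
    static Map<String, Integer> countTerms(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens)
            counts.merge(t, 1, Integer::sum);
        return counts;
    }

    /** Number of terms that occur exactly once. */
    static long countSingleOccurrences(List<String> tokens) {
        return countTerms(tokens).values().stream().filter(n -> n == 1).count();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("daiin", "chedy", "daiin", "qokal");
        System.out.println(countTerms(tokens).size() + " terms, "
                + countSingleOccurrences(tokens) + " appearing only once");
    }
}
```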
 
@@ -129,12 +125,12 @@ The below table summarizes these findings.
 In short, almost 9 out of 10 tokens in the Voynich text exhibit a "slot" structure. Of the remainder, a fair number can be decomposed into two parts, each corresponding to a regular term
 appearing elsewhere in the text. The remaining cases (2 out of 100) are mostly words appearing only once in the text.
 
-**TODO** Char count by slot
+The table below shows the occurrences of glyphs in each slot for the regular terms [{2}](#Note2).
 
-**TODO** Decomposition by cluster.
+![Table with glyph count by slot.](images/Char Count by Slot.PNG)
 
 
-### The Voynich Alphabet
+## The Voynich Alphabet
 
 The definition of the Voynich alphabet, that is of which glyphs should be considered a single Voynich character in the text, is still open.
 Each transcriber must continuously decide what symbols in the manuscript constitute instances of the same glyph and how each glyph needs to be mapped into 
@@ -148,10 +144,11 @@ Below I analyze more in detail some relationships between glyphs, as they appear
 
 #### Rare Characters
 
-The EVA characters 'g', 'x', 'v', and 'u' appear in the text only very few times, mostly as single characters, as shown in the table below.
+Some EVA characters appear very seldom in the original interlinear transliteration, and even less frequently in the concordance version used, 
+where they appear mostly as single characters, as shown in the table below.
 For this reason, I decided to ignore these characters and mark them as "unreadable character" for this analysis.
 
-![Statistics about 'g', 'x', 'v', and 'u'](images/Rare.PNG)
+![Statistics about rare characters](images/Rare.PNG)
 
 Notice that throughout the Voynich there are several glyphs which cannot be directly transliterated into EVA characters (so-called "weirdoes"); 
 they are mostly ignored in any analysis of the text.
@@ -202,27 +199,23 @@ The below table defines the Slots alphabet and compares it with other transliter
 
 ![The Slot alphabet and a comparison with other transliteration alphabets](images/Slot Alphabet.PNG)
 
-**TODO** i ii iii in alcuni alfabeti cambiano a seconda di come m r n sono trattate....evidenziarlo nella tabella.
-
-**TODO** Create transliteration.
+  * These alphabets treat sequences of EVA 'i' differently, depending on the letter that follows the sequence. Therefore, there is no unique way to transliterate
+sequences of 'i' into these alphabets.
 
-**TODO** Create HTML version.
+A transliteration of the Landini-Stolfi interlinear file is available within the [v4j library](https://github.com/mzattera/v4j) and is accessible through the `VoynichFactory` factory methods.
 
 
 ## Conclusions 
 
-- Inner structure of words, easier fro me to explain than Core or automata.
-
-- Excludes any (simple) substitution cypher.
-
-- This is the only alphabet that uses data-backed evidence in defining the char-set
+- I think the slots easily describe the inner structure of Voynich words.
 
-- Voynich/EVA chars in slot cells constitute a morphological unit (character).
+- Given that they demonstrate a structure in Voynich words that is not found in other languages, any attempt to propose a substitution cypher for the Voynich should not be accepted.
 
-- It is important both for attacking the cypher and performing statistical analysis to have 1:1 mapping between Voynich and transliteration characters.
-
- 
- 
+- I think it is important, both for attacking the Voynich cypher and for performing statistical analysis of the manuscript, 
+to have a one-to-one mapping between the Voynich characters and those in the transliteration alphabet.
+As far as I know, the Slot alphabet is the first one derived from empirical data about the structure of Voynich words, in an attempt to capture the intent of the 
+Voynich author.
+  
 
 ---
 
@@ -231,6 +224,9 @@ The below table defines the Slots alphabet and compares it with other transliter
 <a id="Note1">**{1}**</a> Class [`Slots`]() has been used to perform this analysis. An Excel with its output can be found in the
 [analysis folder]().
 
+<a id="Note2">**{2}**</a> Class [`CountCharsBySlot`]() has been used to produce this table.
+
+
 ---
 
 [**<< Home**](..)
 
@@ -0,0 +1,33 @@
+# Note 006 - Other Works on Word Structure
+
+_Last updated Sep. 19th, 2021._
+
+_This note refers to [release v.5.0.0](https://github.com/mzattera/v4j/tree/v.5.0.0) of v4j;
+**links to classes and files refer to this release**; files might have been changed, deleted or moved in the current master branch.
+In addition, some of this note content might have become obsolete in more recent versions of the library._
+
+_Working notes do not provide a detailed description of the algorithms and classes used; for this, please refer to the 
+library code and JavaDoc._
+
+_Please refer to the [home page](..) for a set of definitions that might be relevant for this working note._
+
+[**<< Home**](..)
+
+---
+
+
+## Abstract
+
+This is Work in Progress.
+
+The idea is to compare the [slot concept](https://briancham1994.com/2014/12/17/curve-line-system/) with other works in this area.
+
+ 
+	
+---
+
+[**<< Home**](..)
+
+Copyright Massimiliano Zattera.
+
+<a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.
@@ -61,6 +61,10 @@ In other words, a token is an instance of a term. For example; the below line in
 
   List of most common Voynichese terms and how they are split across different clusters.
 
+- [Note 005 - Slots and a New Alphabet](./005)
+
+  I show how the structure of Voynich words can be explained by some simple rules, and how these can be used to derive the original Voynich alphabet.
+
 ---
 
 Copyright Massimiliano Zattera.
 
@@ -6,6 +6,9 @@
 import java.util.List;
 import java.util.Map;
 
+import io.github.mattera.v4j.text.ivtff.IvtffLine;
+import io.github.mattera.v4j.text.ivtff.ParseException;
+
 /**
  * "Slot" alphabet based on "slot" theory.
  * 
@@ -86,7 +89,7 @@ public String toString() {
 
 	@Override
 	public String getCodeString() {
-		return "Slt-";
+		return "Slot";
 	}
 
 	private final static char[] regularChars = { 'o', 'e', 'E', 'B', 'C', 'S', 'y', 'a', 'd', 'i', 'J', 'U', 'k', 'K',
@@ -220,15 +223,15 @@ protected SlotAlphabet() {
 	}
 
 	/**
-	 * Converts a text from Basic EVA alphabet. It only works for plain texts (see
-	 * Text.getPlainText()).
+	 * Converts a text from Basic EVA alphabet.
 	 * 
-	 * @param txt Plain text to be converted.
+	 * @param txt text to be converted.
+	 * @throws ParseException if text is not proper IVTFF text.
 	 */
-	public static String fromEva(String txt) {
-		for (char c : txt.toCharArray())
-			if (!Alphabet.EVA.isRegularOrSeparator(c) && !Alphabet.EVA.isUreadableChar(c))
-				throw new IllegalArgumentException("Text is not a plain EVA text.");
+	public static String fromEva(String txt) throws ParseException {
+		
+		// Remove comments as they might interfere with replacement
+		txt = IvtffLine.removeComments(txt);
 
 		// TODO add support for illegible words
 
@@ -257,13 +260,11 @@ public static String fromEva(String txt) {
 		txt = txt.replace("v", "?");
 		txt = txt.replace("x", "?");
 		txt = txt.replace("u", "?");
+		txt = txt.replace("j", "?");
+		txt = txt.replace("b", "?");
+		txt = txt.replace("z", "?");
 		txt = txt.replace("'", "?");
 
-		// TODO test - REMOVEME
-		for (char c : txt.toCharArray())
-			if (!Alphabet.SLOT.isRegularOrSeparator(c) && !Alphabet.SLOT.isUreadableChar(c))
-				throw new UnsupportedOperationException("Something went wrong in conversion");
-
 		return txt;
 	}
 
 
@@ -70,30 +70,6 @@ public IvtffLine(IvtffLine other) {
 		setParent(other.getParent());
 	}
 
-	/**
-	 * Locus identifiers have the following format:
-	 * 
-	 * < page . num , code >
-	 * 
-	 * Or : < page . num , code ; T >
-	 * 
-	 * Whitespace is not allowed inside locus identifiers, but it is used in the
-	 * patterns above for clarity. The fields have the following meaning:
-	 * 
-	 * page The page name, which has to match the most recent page header.
-	 * 
-	 * num A sequence number, incrementing from 1 for each page. The highest number
-	 * that presently occurs is 160.
-	 * 
-	 * code A 3-character code, which is a 1-character locator followed by a
-	 * 2-character locus type
-	 * 
-	 * T An optional single-character transcriber ID. Only used in interlinear files
-	 * that include several parallel transcriptions.
-	 */
-	private final static Pattern locusIdentifier = Pattern
-			.compile("<(f[0-9]{1,3}[rv][0-9]?|fRos)\\.([0-9]{1,3}[a-z]?),([\\+\\*\\-=&~@/][PLCR].)(;.)?>");
-
 	/**
 	 * Creates a new instance parsing given input string.
 	 * 
@@ -124,13 +100,8 @@ public IvtffLine(String txt) throws ParseException {
 	public IvtffLine(String row, int rowNum, Alphabet a) throws ParseException {
 		super(a);
 
-		if (!row.startsWith("<"))
-			throw new ParseException("Missing locus indentifier", row, rowNum);
-
-		row = row.trim();
-
 		// TODO check right combination of generic and complete type for the locus type
-		Matcher m = locusIdentifier.matcher(row);
+		Matcher m = IvtffText.LOCUS_IDENTIFIER_PATTERN.matcher(row);
 		if (!m.find() || (m.start() != 0)) {
 			throw new ParseException("Missing or malformed locus identifier", row, rowNum);
 		}
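
For reference, the check `!m.find() || (m.start() != 0)` can be tried standalone with the regular expression from the removed `locusIdentifier` field. The sketch below assumes that `IvtffText.LOCUS_IDENTIFIER_PATTERN` holds the same expression, which this diff does not show; `LocusIdSketch` and its methods are hypothetical names, and the sample rows are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of v4j.
class LocusIdSketch {

    // Copied from the removed locusIdentifier field; assumed to match the new constant.
    static final Pattern LOCUS = Pattern
            .compile("<(f[0-9]{1,3}[rv][0-9]?|fRos)\\.([0-9]{1,3}[a-z]?),([\\+\\*\\-=&~@/][PLCR].)(;.)?>");

    /** True only if a locus identifier is found at the very start of the row. */
    static boolean startsWithLocus(String row) {
        Matcher m = LOCUS.matcher(row);
        return m.find() && m.start() == 0;
    }

    /** Returns the page name (first capture group) of a well-formed row. */
    static String page(String row) {
        Matcher m = LOCUS.matcher(row);
        if (!m.find() || m.start() != 0)
            throw new IllegalArgumentException("Missing or malformed locus identifier");
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(startsWithLocus("<f1r.1,@P0;H> some.text"));
        System.out.println(page("<f1r.1,@P0;H> some.text"));
    }
}
```

Under this check a row with leading whitespace is rejected, since the match no longer starts at index 0.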
@@ -169,7 +140,7 @@ private String normalizeText(String text) throws ParseException {
 		for (int i = 0; i < txt.length(); ++i)
 			if (!alphabet.isRegular(txt.charAt(i)) && !alphabet.isWordSeparator(txt.charAt(i))
 					&& !alphabet.isUreadableChar(txt.charAt(i)))
-				throw new ParseException("Line contains invalid characters", text);
+				throw new ParseException("Line contains invalid characters", text + " ['" + txt.charAt(i) + "']");
 
 		return getAlphabet().toPlainText(txt);
 	}
@@ -485,8 +456,14 @@ public static IvtffLine merge(List<IvtffLine> lines, TranscriptionType type) thr
 		for (IvtffLine l : lines)
 			copy.add(new IvtffLine(l));
 
-		if (!align(copy))
+		if (!align(copy)) {
+
+			// TODO remove debug code
+			for (IvtffLine l : lines)
+				System.out.println(l);
+			
 			throw new ParseException("Cannot align the transcriptions.");
+		}
 
 		IvtffLine merged = null;
 		switch (type) {