Commit 149e9de

Finished note 008
1 parent 3aba835 commit 149e9de

5 files changed (+35, −101 lines)

docs/006/index.md

Lines changed: 0 additions & 1 deletion
@@ -220,7 +220,6 @@ Brian Cham proposes a new pattern in the text of the Voynich Manuscript named th
 This pattern is fundamentally based on shapes of individual glyphs but also informs the structure of words.
 
 
-
 ---
 
 **Notes**

docs/007/index.md

Lines changed: 11 additions & 0 deletions
@@ -177,6 +177,17 @@ Noticeable difference is that, while 'l' and 'r' can be followed by the word fin
 
 This slot contains the word ending 'y' alone.
 
+
+# Conclusions
+
+This analysis shows that there is a dependency between a character and those preceding it. In other words,
+Voynich words are not generated by randomly putting suitable characters into slots.
+
+It also shows that, given a character, we have a limited choice of options for the characters following it;
+this means that the information encoded by each single character is small. Compared to modern languages,
+where a position in a text can encode 4-5 bits (it can be occupied by any of about 25 letters),
+a position in a Voynich word can be filled only by a smaller set of symbols, thus encoding less information.
+
 
 ---
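The bits-per-position argument in these conclusions can be sanity-checked with a few lines of plain Java. This is only an illustration: the 4-glyph slot size below is an assumed example for the sketch, not a measured property of the manuscript.

```java
// Back-of-the-envelope check of the information argument above:
// a position that can hold any of N equally likely symbols encodes log2(N) bits.
public class BitsPerPosition {

	static double bits(int choices) {
		return Math.log(choices) / Math.log(2);
	}

	public static void main(String[] args) {
		// about 25 letters in a modern alphabet -> roughly 4.6 bits per position
		System.out.printf("25 choices: %.2f bits%n", bits(25));
		// a hypothetical Voynich slot admitting only 4 glyphs (illustrative
		// assumption, not a measured figure) -> 2 bits per position
		System.out.printf(" 4 choices: %.2f bits%n", bits(4));
	}
}
```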

docs/008/index.md

Lines changed: 19 additions & 16 deletions
@@ -1,4 +1,4 @@
-# Note 008 - The Best Grammar for Voynichese (as far as I know)
+# Note 008 - Simply the Best Grammar for Voynichese (as far as I know)
 
 _Last updated Jan. 31st, 2021._
 
@@ -169,17 +169,16 @@ The table below compares our grammar with other models described in [Note 006](.
 | Model | Generated strings | True Positives | Positive Tokens | Precision | Recall | F1 |
 | :--- | ---: | ---: | ---: | ---: | ---: | ---: |
 | ROE | 120 | 112 | 15.954% | <span style="color:red">0.933</span> | 0.022 | 0.043 |
-| STOLFI | 143,124,560,075,240,080,000 | 4,527 | 97.813% | 0.000 | <span style="color:red">0.881</span> | 0.000 |
-| NEAL_1a | 87,480 | 535 | 20.083% | 0.006 | 0.104 | 0.012 |
-| NEAL_1b | 174,818 | 1,782 | 66.013% | 0.010 | 0.347 | 0.020 |
-| NEAL_2 | 1,311,345 | 1,049 | 45.248% | 0.001 | 0.204 | 0.002 |
-| PALMER || 4,547 | 97.280% | 0.000 | <span style="color:red">0.884</span> | 0.000 |
-| VOGT (Recipes) | 32,575 | 424 | 58.697% | 0.013 | 0.190 | 0.024 |
-| VOGT | 32,575 | 565 | 55.734% | 0.017 | 0.110 | 0.030 |
+| STOLFI | 143'124'560'075'240'080'000 | 4'527 | 97.813% | 0.000 | <span style="color:red">0.881</span> | 0.000 |
+| NEAL_1a | 87'480 | 535 | 20.083% | 0.006 | 0.104 | 0.012 |
+| NEAL_1b | 174'818 | 1'782 | 66.013% | 0.010 | 0.347 | 0.020 |
+| NEAL_2 | 1'311'345 | 1'049 | 45.248% | 0.001 | 0.204 | 0.002 |
+| PALMER || 4'547 | 97.280% | 0.000 | <span style="color:red">0.884</span> | 0.000 |
+| VOGT (Recipes) | 32'575 | 424 | 58.697% | 0.013 | 0.190 | 0.024 |
+| VOGT | 32'575 | 565 | 55.734% | 0.017 | 0.110 | 0.030 |
 | PELLING || 259 | 32.099% | 0.000 | 0.050 | 0.000 |
-| SLOT | 4,643,467 | 2,617 | 86.447% | 0.001 | <span style="color:orange">0.509</span> | 0.001 |
-| SM | 3,110 | 1,113 | 62.040% | <span style="color:orange">0.358</span> | 0.216 | <span style="color:red">0.270</span> |
-
+| SLOT | 4'643'467 | 2'617 | 86.447% | 0.001 | <span style="color:orange">0.509</span> | 0.001 |
+| <span style="color:green">**SLOT MACHINE**</span> | 3'110 | 1'113 | 62.040% | <span style="color:orange">0.358</span> | 0.216 | <span style="color:red">0.270</span> |
 
 - **STOLFI**: Jorge Stolfi's "crust-mantle-core" model. As it is impossible to generate and test all words for this model, I assume any term in the Voynich that is not listed in Stolfi's `AbnormalWord` is a true positive.
 - There are three versions of grammars described by Philip Neal:
@@ -189,13 +188,17 @@ The table below compares our grammar with other models described in [Note 006](.
 - Vogt's model was created only for the "recipes" section (Stars B); here a comparison is provided both limited to that section and for the entire text.
 - When implementing Pelling's state machine, I assumed all arrows have the same meaning (even if some are dashed) and the red boxes are non-emitting states.
 - **SLOT** considers all terms generated by the [Slot model](../005).
-- **SM** Is the state machine I describe above.
-
-# Considerations
-
+- **SLOT MACHINE** is the state machine I describe above in this note.
 
+
 # Conclusions
 
+- The proposed grammar has the best F1, an order of magnitude above any other model I know. It is able to
+model 62% of the tokens that appear in the Voynich (that is, 1'113 terms, or 21.6% of them).
+- The proposed grammar has the second-best precision, topped only by Roe's model, which generates only 120 words (112 of them being terms of the Voynich).
+- Models with a recall higher than the proposed grammar's (Stolfi's, Palmer's, and my Slot model) generate an almost infinite number of words.
+If we ignore these, Neal's model has a slightly higher recall than the proposed one, but generates more than 1.3 million words (compared to the 3 thousand generated by my model).
+
 
 ---

@@ -212,7 +215,7 @@ In this discussion, I am ignoring it, as it is also slightly more complex than the
 
 <a id="Note3">**{3}**</a> A version of this graph that can be visualized using [Gephi](https://gephi.org/) (`StateMachine.gephi`) is stored [here](https://github.com/mzattera/v4j/blob/v.9.0.0/resources/analysis/slots/).
 
-<a id="Note1">**{4}**</a> Class [`WordModelEvaluator`](https://github.com/mzattera/v4j/blob/v.9.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/slot/WordModelEvaluator.java) was used for
+<a id="Note4">**{4}**</a> Class [`WordModelEvaluator`](https://github.com/mzattera/v4j/blob/v.9.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/slot/WordModelEvaluator.java) was used for
 this purpose.
 
 ---
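As a reading aid for the Precision/Recall/F1 columns in the comparison table above, here is a minimal sketch of how such scores are computed when a word model is treated as a classifier over the terms of a corpus. This is plain Java with toy word sets, not the actual `WordModelEvaluator` logic.

```java
import java.util.Set;

// A word model "classifies" a string as Voynichese by generating it.
// Precision = attested fraction of generated words; recall = generated
// fraction of attested terms; F1 = their harmonic mean.
public class F1Sketch {

	/** Returns { precision, recall, F1 } for generated words vs. corpus terms. */
	static double[] score(Set<String> generated, Set<String> corpusTerms) {
		long truePositives = generated.stream().filter(corpusTerms::contains).count();
		double precision = generated.isEmpty() ? 0.0 : (double) truePositives / generated.size();
		double recall = corpusTerms.isEmpty() ? 0.0 : (double) truePositives / corpusTerms.size();
		double f1 = (precision + recall) == 0.0 ? 0.0 : 2 * precision * recall / (precision + recall);
		return new double[] { precision, recall, f1 };
	}

	public static void main(String[] args) {
		// Toy numbers, not the real corpus: the model "generates" three words,
		// two of which are attested terms.
		double[] s = score(Set.of("daiin", "chedy", "qqqqq"), Set.of("daiin", "chedy", "ol", "shedy"));
		System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]); // P=0.667 R=0.500 F1=0.571
	}
}
```

This is why a model that generates trillions of strings can reach a high recall while its precision, and hence its F1, collapses toward zero.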

docs/index.md

Lines changed: 4 additions & 0 deletions
@@ -86,6 +86,10 @@ In other words, a token is an instance of a term. For example, the below line in
 
 I create a graph showing how characters in words are connected, based on the "Slots" concept.
 
+- [Note 008 - Simply the Best Grammar for Voynichese (as far as I know)](./008)
+
+I create a grammar to explain the structure of Voynich words, showing it has the best F1 among all proposed models.
+
 
 # Bibliography and Reviews
 

eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/slot/WordModelEvaluator.java

Lines changed: 1 addition & 84 deletions
@@ -21,7 +21,6 @@
 import io.github.mzattera.v4j.util.Counter;
 import io.github.mzattera.v4j.util.statemachine.SlotBasedModel;
 import io.github.mzattera.v4j.util.statemachine.StateMachine;
-import io.github.mzattera.v4j.util.statemachine.StateMachine.TrainMode;
 
 /**
  * Evaluates F1 score for models of Voynich words, considered as classifiers
@@ -351,7 +350,7 @@ private static void evaluatePalmer(Counter<String> voynichTokens) throws ParseEx
 
 			if (p.matcher(t).matches()) {
 				++tp;
-			ttp += voynichTokens.getCount(t);
+				ttp += voynichTokens.getCount(t);
 			}
 		}
 
@@ -634,88 +633,6 @@ private static void evaluateSlotMachine(Counter<String> voynichTokens) throws Pa
 		evaluate("SM", voynichTokens, SlotAlphabet.toEva(m.emit().itemSet()));
 	}
 
-	/**
-	 * Evaluate Slots state machine model. This is old model written manually.
-	 *
-	 * @param voynichTokens List of Voynich terms (EVA).
-	 */
-	private static void evaluateSlotMachineOld(Counter<String> voynichTokens) throws ParseException {
-
-		StateMachine m = new StateMachine();
-		m.setInitialState(m.addState("Start"));
-		m.addState("Slot_0");
-		m.addState("0_q", "q");
-		m.addState("0_s", "s");
-		m.addState("0_d", "d");
-		m.addState("Slot_1");
-		m.addState("1_y", "y");
-		m.addState("1_o", "o");
-		m.addState("Slot_2");
-		m.addState("2_r", "r");
-		m.addState("2_l", "l");
-		m.addState("3_Gallows", new String[] { "t", "p", "k", "f" });
-		m.addState("4_Pedestals", new String[] { "ch", "sh" });
-		m.addState("5_PedGallows", new String[] { "cth", "cph", "ckh" }); // MISSING cfh
-		m.addState("6_eSeq", new String[] { "e", "ee" }); // MISSING eee
-		m.addState("Slot_7");
-		m.addState("7_d", "d");
-		m.addState("7_s", "s");
-		// m.addState("7_Gallows", new String[] {"t","p","k","f"});
-		m.addState("7_Gallows", new String[] {}); // REMOVED
-		m.addState("Slot_8");
-		m.addState("8_a", "a");
-		m.addState("8_o", "o");
-		m.addState("9_iSeq", new String[] { "i", "ii" }); // MISSING iii
-		m.addState("Slot_10");
-		m.addState("10_d", "d");
-		m.addState("10_lr", new String[] { "l", "r" });
-		m.addState("10_mn", new String[] { "m", "n" });
-		m.addState("11_y", "y");
-		m.addState("End", true);
-
-		// ***** TODO test optional states, optional characters and splitting C and S
-
-		m.addNext("Start", new String[] { "Slot_0", "Slot_1", "Slot_2", "3_Gallows", "4_Pedestals", "5_PedGallows",
-				"7_d", "7_s", "8_a", "6_eSeq" }); // (Possibly slot 6) IT WORKS
-		m.addNext("Slot_0", new String[] { "0_q", "0_d", "0_s" });
-		m.addNext("0_q", new String[] { "1_o" });
-		m.addNext("0_s", new String[] { "1_o", "4_Pedestals" });
-		m.addNext("0_d", new String[] { "1_o", "1_y", "4_Pedestals" });
-		m.addNext("Slot_1", new String[] { "1_y", "1_o" });
-		m.addNext("1_y", new String[] { "3_Gallows", "4_Pedestals" });
-		m.addNext("1_o", new String[] { "Slot_2", "3_Gallows", "4_Pedestals", "5_PedGallows", "6_eSeq", "7_d", "8_a" });
-		m.addNext("Slot_2", new String[] { "2_l", "2_r" });
-		m.addNext("2_r", new String[] { "4_Pedestals", "Slot_8" });
-		m.addNext("2_l", new String[] { "3_Gallows", "4_Pedestals", "7_d", "Slot_8" });
-		m.addNext("3_Gallows", new String[] { "4_Pedestals", "6_eSeq", "Slot_8", "11_y" }); // (7_d, 11_y ??) 11_y WORKS
-		m.addNext("4_Pedestals", new String[] { "6_eSeq", "Slot_7", "Slot_8", "11_y" }); // consider keeping them
-				// separate? S won't connect
-				// to 7_s
-		m.addNext("5_PedGallows", new String[] { "6_eSeq", "Slot_8", "7_d", "11_y" }); // (possibly 7_d, 11_y) THEY BOTH
-				// WORK
-		m.addNext("6_eSeq", new String[] { "Slot_7", "Slot_8", "11_y", "End" }); // possibly End // WORKS
-		m.addNext("Slot_7", new String[] { "7_d", "7_s", "7_Gallows" });
-		m.addNext("7_d", new String[] { "8_o", "8_a", "11_y", "End" });
-		m.addNext("7_s", new String[] { "8_a", "11_y", "End" }); // possibly 8_o? it does in slot 1 - NOT WORKING ->
-				// looks better if 8_a is removed
-		m.addNext("7_Gallows", new String[] { "8_a", "11_y" }); // possibly 8_o? - NOT WORKING -> Looks better if
-				// removed completely
-		m.addNext("Slot_8", new String[] { "8_a", "8_o" });
-		m.addNext("8_a", new String[] { "9_iSeq", "10_lr", "10_mn" }); // Possibly END? - NOT WORKING
-		m.addNext("8_o", new String[] { "10_lr", "10_mn", "End" }); // Possibly END? - IT WORKS VERY WELL
-		m.addNext("9_iSeq", new String[] { "10_lr", "10_mn" });
-		m.addNext("Slot_10", new String[] { "10_d", "10_lr", "10_mn" });
-		m.addNext("10_d", new String[] { "11_y", "End" });
-		m.addNext("10_lr", new String[] { "11_y", "End" });
-		m.addNext("10_mn", new String[] { "End" });
-		m.addNext("11_y", new String[] { "End" });
-
-		evaluate("SMOLD", voynichTokens, m.emit().itemSet());
-
-		m.train(voynichTokens.itemSet(), TrainMode.F1);
-		evaluate("SMOLDTRN", voynichTokens, m.emit().itemSet());
-	}
-
 	/**
 	 * Evaluates and prints stats for a word generation model.
 	 *
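For readers unfamiliar with how a state machine like the deleted one turns into a word list, the following self-contained sketch (toy classes of my own, not the v4j `StateMachine` API) enumerates every path from `Start` to `End`, concatenating the glyphs emitted along the way.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TinyStateMachine {

	// state -> glyphs it can emit (empty list = non-emitting state)
	final Map<String, List<String>> emits = new LinkedHashMap<>();
	// state -> successor states
	final Map<String, List<String>> next = new LinkedHashMap<>();

	void state(String name, String... glyphs) {
		emits.put(name, List.of(glyphs));
		next.put(name, new ArrayList<>());
	}

	void edge(String from, String... to) {
		next.get(from).addAll(List.of(to));
	}

	/** Enumerates every word spelled along a path from "Start" to "End". */
	List<String> words() {
		List<String> out = new ArrayList<>();
		walk("Start", "", out);
		return out;
	}

	private void walk(String state, String prefix, List<String> out) {
		if (state.equals("End")) {
			out.add(prefix);
			return;
		}
		List<String> glyphs = emits.get(state);
		// non-emitting states contribute no glyph, only structure
		for (String g : glyphs.isEmpty() ? List.of("") : glyphs)
			for (String n : next.get(state))
				walk(n, prefix + g, out);
	}

	public static void main(String[] args) {
		// A toy machine: an optional "q", then "o", then "y".
		TinyStateMachine m = new TinyStateMachine();
		m.state("Start");
		m.state("Q", "q");
		m.state("O", "o");
		m.state("Y", "y");
		m.state("End");
		m.edge("Start", "Q", "O");
		m.edge("Q", "O");
		m.edge("O", "Y");
		m.edge("Y", "End");
		System.out.println(m.words()); // prints [qoy, oy]
	}
}
```

The "Generated strings" column in the table in `docs/008/index.md` counts exactly this kind of enumeration, which is why densely connected machines explode into millions of words. This sketch assumes an acyclic graph; states with repeatable glyph sequences would need a depth bound.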
