Commit 6edc4fb

Fixing Release 003
1 parent 7cc9271 commit 6edc4fb

3 files changed: +14 -16 lines changed

README.md

Lines changed: 2 additions & 6 deletions
@@ -145,9 +145,9 @@ The class can build a BoW where dimensions can be (see `BagOfWordsMode`):
 Notice this class is `Clusterable`, thus can be used with the Apache clustering API where subclasses of `Clusterer<T extends Clusterable>`
 are used to cluster set of `Clusterable` instances.

-#### K-Means Clustering
+#### K-Means Clustering - `io.github.mattera.v4j.util.clustering`

-Below an example of how BoW insances can be clustered:
+Below an example of how BoW instances can be clustered:

 ```Java
 // Distance measure for clustering
@@ -191,10 +191,6 @@ clusters.get(i).getPoints();
 ...
 ```

-#### K-Means Clustering
-
-TODO
-

 ### Useful Stuff - `io.github.mattera.v4j.util`

docs/003/index.md

Lines changed: 7 additions & 5 deletions
@@ -232,7 +232,7 @@ Courier's languages reflect language differences in the underlying "clear" text.
 these differences should be kept in mind when performing statistical analysis of the text or when trying it decipherment.

 For this reason v4j library provides means to classify pages accordingly to above considerations, the resulting clusters are shown below
-(also refer to[`PageHeader`]() class.
+(also refer to[`PageHeader`](https://github.com/mzattera/v4j/blob/v.3.0.0/eclipse/io.github.mattera.v4j/src/main/java/io/github/mattera/v4j/text/ivtff/PageHeader.java) class.

 ![Cluster size in words](images/Clusters.PNG)

@@ -243,17 +243,19 @@ these differences should be kept in mind when performing statistical analysis of

 <a id="Note1">**{1}**</a> See [v4j README](https://github.com/mzattera/v4j#alphabet).

-<a id="Note2">**{2}**</a> The class [`OutlierDetection`]() is used to calculate average distance of each page from other
-pages in the text. The output of the class (`PageEmbeddingDistance.xlsx`) can be found in the [analysis folder]().
+<a id="Note2">**{2}**</a> The class
+[`OutlierDetection`](https://github.com/mzattera/v4j/blob/v.3.0.0/eclipse/io.github.mzattera.v4j-apps/src/main/java/io/github/mzattera/v4j/applications/clustering/OutlierDetection.java)
+is used to calculate average distance of each page from other pages in the text. The output of the class (`PageEmbeddingDistance.xlsx`) can be found in the
+[analysis folder](https://github.com/mzattera/v4j/tree/master/resources/analysis).

 <a id="Note3">**{3}**</a> The class [`BuildBoW`]() can be used to generate data suitable for visualization that can
 be uploaded to the TensorFlow projector. The output of this class, in the form of a "vector" and "metadata" .TSV files,
-can be found in [this folder]() both for single pages or entire parchments.
+can be found in [this folder](https://github.com/mzattera/v4j/tree/master/docs/003/data) both for single pages or entire parchments.

 <a id="Note4">**{4}**</a> Class `KMeansClusterByWords`
 does the K-Means clustering and prints out a report that can be easily converted in an Excel file.
 The class can be parameterized to run different types of experiments; its outputs, with some additional data,
-can be found as Excel files in the [analysis folder]().
+can be found as Excel files in the [analysis folder](https://github.com/mzattera/v4j/tree/master/resources/analysis).
 Keep in mind K-Means algorithm include some randomness, therefore slightly different clustering might result at each experiment.

 ---

docs/index.md

Lines changed: 5 additions & 5 deletions
@@ -1,4 +1,4 @@
-## Welcome to GitHub Pages
+## Welcome

 Hi, in these pages I store thoughts, working notes, rants and frustrations about the [Voynich manuscript](https://en.wikipedia.org/wiki/Voynich_manuscript)
 as resulting from my work with the [v4j library](https://github.com/mzattera/v4j).
@@ -7,20 +7,20 @@ as resulting from my work with the [v4j library](https://github.com/mzattera/v4j

 In the below notes, we will try to be consistent with following terminology.

-- A **token** in the Voynich is a single sequence of characters, separated by spaces. A **term** represent the set of identical tokens.
+- A **token** in a text is a single sequence of characters, separated by spaces. A **term** represents a set of identical tokens.
 In other terms, a token is an instance of a term. For example the below line in the Voynich text:

 ```
 <f1r.15,+P0;m> daiin shckhey ckhor chor shey kol chol chol kor chol
 ```

 Contains 10 tokens ("daiin", "shckhey", "ckhor", "chor", "shey", "kol", "chol", "chol", "kor", "chol") which are instances of
-8 terms ("daiin", "shckhey", "ckhor", "chor", "shey", "kol", "chol" "kor").
+8 terms ("daiin", "shckhey", "ckhor", "chor", "shey", "kol", "chol", "kor").

 When the distinction is not relevant, I might loosely use "word" (often in quotes) to refer to either tokens or terms.

-- Terms "transcription" and "transliteration" are used more or less interchangeably, though the latter is more correct.
-In both case we refer either to the process of capturing a text (typically the Voynich) in a file or to the outcome of such process.
+- The terms "**transcription**" and "**transliteration**" are used more or less interchangeably, though the latter is more correct.
+In both cases, we refer either to the process of capturing a text (typically the Voynich) in a file or to the outcome of such process.

 ### Working Notes

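The token/term terminology touched in `docs/index.md` can be checked with a few lines of plain Java. This is only a minimal sketch and does not use the v4j API: the class `TokenTermExample` and its methods `tokens` and `terms` are hypothetical names introduced here for illustration, and the input is assumed to be the text of line `f1r.15` with the `<f1r.15,+P0;m>` locator already stripped.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of v4j: counts tokens and terms in a line of text.
public class TokenTermExample {

    // Tokens: every space-separated character sequence, repeats included.
    static List<String> tokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Terms: the distinct tokens, kept in order of first appearance.
    static Set<String> terms(List<String> tokens) {
        return new LinkedHashSet<>(tokens);
    }

    public static void main(String[] args) {
        // Text of line f1r.15 (assumed: locator already removed).
        String text = "daiin shckhey ckhor chor shey kol chol chol kor chol";

        List<String> tokens = tokens(text);
        Set<String> terms = terms(tokens);

        // Prints "10 tokens, 8 terms": "chol" occurs three times but is a single term.
        System.out.println(tokens.size() + " tokens, " + terms.size() + " terms");
    }
}
```

A `LinkedHashSet` is used so the terms come out in the same order as in the text, matching the list given in the notes.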