Txt and ann tokens sometimes do not match #3

jcklie · 2019-04-11T13:05:40Z

Thank you for making this data available. I started writing a Python script to parse it and ran into some issues. For example, in data/aph/devset/iob/75-00060.txt, the second sentence is:

El	ART
testimonio	NC
de	PREP
Aristóteles	NC
(	LP
EN	PREP
1111a	NP
,	CM
6-11	CARD
)	RP
,	CM
en cambio	ADV
,	CM
sí	PPX
preserva	VLfin
fiablemente	ADV
el	ART
recuerdo	NC
del	PDEL
juicio	NC
.	FS

The corresponding text file, data/aph/devset/ann/75-00060.txt-doc-1.txt, contains this text:

Las noticias de Heraclides Póntico (fr. 170 Wehrli), Ateneo (1, 39, 1-6) y Tzetzes (In Hes. Op. 414) relativas al supuesto juicio al que fue sometido Esquilo por revelar doctrinas de los misterios de Eleusis discrepan notablemente en los puntos fundamentales y no proceden de una fuente histórica.
El testimonio de Aristóteles (EN 1111a, 6-11), en, sí preserva fiablemente el recuerdo del juicio.
Esquilo divulgó la relación materno-filial de Deméter y Ártemis en varios dramas siendo inconsciente - él nunca fue iniciado en estos misterios - de que revelaba parte de la doctrina eleusina.
En su defensa, alegó que había adquirido el conocimiento de esta relación de una fuente no mistérica, a : ciertos tratados de inspiración órfico-pitagórica acerca los misterios de Deméter

If you compare the two, "en cambio" appears only in the iob file, not in the ann text (which has just "en"). That happens a lot. The orig text is correct; the text in ann is missing parts. As I understand it, the ann offsets for tokens are based on the ann text, so this makes mapping character offsets to token offsets really difficult.
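To make the mismatch concrete, here is a minimal sketch of a check that walks the iob tokens and tries to locate each one in the ann text. It assumes the iob lines are tab-separated with the token in the first column; the helper names `read_tokens` and `align_tokens` are made up for illustration and are not part of this repository. The greedy substring search is only a rough heuristic: it can match a short token inside a longer word, so treat its output as a hint rather than a verified alignment.

```python
def read_tokens(iob_lines):
    """Return the token column from tab-separated IOB lines (assumed layout)."""
    return [line.split("\t")[0] for line in iob_lines if line.strip()]


def align_tokens(tokens, text):
    """Greedily locate each token in `text`, scanning left to right.

    Returns (spans, missing): `spans` holds (token, start, end) for every
    token found at or after the current cursor, `missing` holds the tokens
    that could not be found from the cursor onwards.
    """
    spans, missing = [], []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            # Token is in the IOB file but not in the remaining ann text.
            missing.append(token)
            continue
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans, missing


# The second sentence of 75-00060, as quoted above.
IOB_LINES = [
    "El\tART", "testimonio\tNC", "de\tPREP", "Aristóteles\tNC", "(\tLP",
    "EN\tPREP", "1111a\tNP", ",\tCM", "6-11\tCARD", ")\tRP", ",\tCM",
    "en cambio\tADV", ",\tCM", "sí\tPPX", "preserva\tVLfin",
    "fiablemente\tADV", "el\tART", "recuerdo\tNC", "del\tPDEL",
    "juicio\tNC", ".\tFS",
]
ANN_TEXT = (
    "El testimonio de Aristóteles (EN 1111a, 6-11), en, "
    "sí preserva fiablemente el recuerdo del juicio."
)

tokens = read_tokens(IOB_LINES)
spans, missing = align_tokens(tokens, ANN_TEXT)
print("tokens not found in ann text:", missing)
```

Running something like this over each iob/ann pair would list the affected documents, though a stricter aligner (for example one that only skips whitespace between tokens) would be needed for reliable offsets.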

mromanello · 2019-04-12T13:49:37Z

Thanks @jcklie for reporting this issue! I believe it concerns only the devset documents, as I haven't encountered it with goldset/testset. Since the annotations (entity mentions) on the devset were fully automatic -- and were never manually corrected -- I can fix this tokenization/PoS tagging problem and regenerate the annotations. Anyway, I'm converting the brat files into uima/xmi and I've started with the goldset -- more on this soon.
