Txt and ann tokens sometimes do not match #3

Open
jcklie opened this issue Apr 11, 2019 · 1 comment

Comments

jcklie commented Apr 11, 2019

Thank you for making this data available. I started writing a Python script to parse it and ran into some issues. For example, in data/aph/devset/iob/75-00060.txt, the second sentence is:

El	ART
testimonio	NC
de	PREP
Aristóteles	NC
(	LP
EN	PREP
1111a	NP
,	CM
6-11	CARD
)	RP
,	CM
en cambio	ADV
,	CM
sí	PPX
preserva	VLfin
fiablemente	ADV
el	ART
recuerdo	NC
del	PDEL
juicio	NC
.	FS

The corresponding text file, data/aph/devset/ann/75-00060.txt-doc-1.txt, contains the following text:

Las noticias de Heraclides Póntico (fr. 170 Wehrli), Ateneo (1, 39, 1-6) y Tzetzes (In Hes. Op. 414) relativas al supuesto juicio al que fue sometido Esquilo por revelar doctrinas de los misterios de Eleusis discrepan notablemente en los puntos fundamentales y no proceden de una fuente histórica.
El testimonio de Aristóteles (EN 1111a, 6-11), en, sí preserva fiablemente el recuerdo del juicio.
Esquilo divulgó la relación materno-filial de Deméter y Ártemis en varios dramas siendo inconsciente - él nunca fue iniciado en estos misterios - de que revelaba parte de la doctrina eleusina.
En su defensa, alegó que había adquirido el conocimiento de esta relación de una fuente no mistérica, a : ciertos tratados de inspiración órfico-pitagórica acerca los misterios de Deméter

Note that the token en cambio appears intact only in the iob file; in the ann text it is truncated to en. This happens a lot. The orig text is correct, while the text in ann is missing content. As I understand it, the ann offsets for tokens are based on the ann text, which makes mapping character offsets to token offsets really difficult.
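
To illustrate the mismatch, here is a minimal sketch of how it could be detected. It is not the script I am actually writing, and it rests on assumptions: that each iob line is tab-separated with the token in the first column, and that tokens occur in the ann text in the same order. The helper names `read_tokens` and `align_tokens` and the `max_gap` parameter are hypothetical, chosen for this example only.

```python
def read_tokens(lines):
    """Return the first tab-separated column of each non-blank line.

    Assumption: the token is the first column of an iob line.
    """
    return [line.split("\t")[0] for line in lines if line.strip()]


def align_tokens(tokens, text, max_gap=20):
    """Greedily align tokens against text, left to right.

    Returns (offsets, missing). offsets[i] is a (start, end) pair of
    character offsets, or None when token i could not be located.
    missing lists the (index, token) pairs that were not found.
    max_gap limits how far ahead a match may lie, so that a missing
    token is not matched against a much later occurrence.
    """
    offsets = []
    missing = []
    cursor = 0
    for index, token in enumerate(tokens):
        position = text.find(token, cursor)
        if position == -1 or position - cursor > max_gap:
            # Token is absent (or too far away): record it, keep the cursor.
            offsets.append(None)
            missing.append((index, token))
            continue
        end = position + len(token)
        offsets.append((position, end))
        cursor = end
    return offsets, missing


if __name__ == "__main__":
    iob_lines = [
        "(\tLP", "EN\tPREP", "1111a\tNP", ",\tCM", "6-11\tCARD",
        ")\tRP", ",\tCM", "en cambio\tADV", ",\tCM", "sí\tPPX",
    ]
    ann_text = "(EN 1111a, 6-11), en, sí"
    _, missing = align_tokens(read_tokens(iob_lines), ann_text)
    print(missing)
```

The greedy search is a deliberate simplification. It reports tokens that cannot be found in the ann text, but it does not repair the offsets, and a very short token may still match in the wrong place within the `max_gap` window.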

@mromanello
Owner

Thanks @jcklie for reporting this issue! I believe it affects only the devset documents, as I haven't encountered it in the goldset or testset. Since the annotations (entity mentions) on the devset were produced fully automatically and were never manually corrected, I can fix this tokenization/PoS tagging problem and regenerate the annotations. Anyway, I'm converting the brat files into UIMA/XMI, and I've started with the goldset. More on this soon.
