Thank you for making this data available. I started writing a Python script to parse it and ran into some issues. For example, in data/aph/devset/iob/75-00060.txt, the second sentence is:
El ART
testimonio NC
de PREP
Aristóteles NC
( LP
EN PREP
1111a NP
, CM
6-11 CARD
) RP
, CM
en cambio ADV
, CM
sí PPX
preserva VLfin
fiablemente ADV
el ART
recuerdo NC
del PDEL
juicio NC
. FS
The corresponding text file, data/aph/devset/ann/75-00060.txt-doc-1.txt, contains the following text:
Las noticias de Heraclides Póntico (fr. 170 Wehrli), Ateneo (1, 39, 1-6) y Tzetzes (In Hes. Op. 414) relativas al supuesto juicio al que fue sometido Esquilo por revelar doctrinas de los misterios de Eleusis discrepan notablemente en los puntos fundamentales y no proceden de una fuente histórica.
El testimonio de Aristóteles (EN 1111a, 6-11), en, sí preserva fiablemente el recuerdo del juicio.
Esquilo divulgó la relación materno-filial de Deméter y Ártemis en varios dramas siendo inconsciente - él nunca fue iniciado en estos misterios - de que revelaba parte de la doctrina eleusina.
En su defensa, alegó que había adquirido el conocimiento de esta relación de una fuente no mistérica, a : ciertos tratados de inspiración órfico-pitagórica acerca los misterios de Deméter
Note that "en cambio" appears only in the iob file, not in the ann text. This happens a lot. The orig text is correct, but the text in ann is missing material. As I understand it, the ann offsets for tokens are based on the ann text, which makes mapping character offsets to token offsets really difficult.
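To illustrate the mismatch, here is a minimal sketch of how tokens that are in the iob file but missing from the ann text could be detected. It assumes the two-column "token TAG" layout shown above, where a token may itself contain a space (as in "en cambio"). The helper names `read_iob_tokens` and `align_tokens` and the `max_skip` tolerance are hypothetical, not part of the dataset tooling. The alignment is a greedy heuristic and can misalign on repeated short tokens.

```python
def read_iob_tokens(lines):
    """Return the token of each 'token TAG' line (assumed layout)."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace run so multi-word tokens stay whole.
        token, _tag = line.rsplit(None, 1)
        tokens.append(token)
    return tokens


def align_tokens(tokens, text, max_skip=10):
    """Greedily align tokens to text.

    Returns (token, start, end) triples; start and end are None when the
    token does not begin within max_skip characters of the current position.
    """
    spans = []
    cursor = 0
    for token in tokens:
        # Skip whitespace between tokens.
        while cursor < len(text) and text[cursor].isspace():
            cursor += 1
        # Bound the search so a missing token is not matched far ahead.
        start = text.find(token, cursor, cursor + max_skip + len(token))
        if start == -1:
            spans.append((token, None, None))
            continue
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


# The sentence from the issue, as it appears in the iob and ann files.
iob_lines = [
    "El ART", "testimonio NC", "de PREP", "Aristóteles NC", "( LP",
    "EN PREP", "1111a NP", ", CM", "6-11 CARD", ") RP", ", CM",
    "en cambio ADV", ", CM", "sí PPX", "preserva VLfin",
    "fiablemente ADV", "el ART", "recuerdo NC", "del PDEL",
    "juicio NC", ". FS",
]
ann_sentence = (
    "El testimonio de Aristóteles (EN 1111a, 6-11), en, sí preserva "
    "fiablemente el recuerdo del juicio."
)

spans = align_tokens(read_iob_tokens(iob_lines), ann_sentence)
missing = [token for token, start, _end in spans if start is None]
print(missing)
```

Bounding the search with `max_skip` is a deliberate trade-off: without it, a missing token such as a comma would match a later occurrence and throw off every token after it.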
Thanks @jcklie for reporting this issue! I believe it concerns only the devset documents, as I haven't encountered it with goldset/testset. Since the annotations (entity mentions) on the devset were fully automatic and were never manually corrected, I can fix this tokenization/PoS tagging problem and regenerate the annotations. Anyway, I'm converting the brat files into UIMA/XMI and I've started with the goldset; more on this soon.