IL Feedback #883

Open
29 of 30 tasks
matyaskopp opened this issue Nov 22, 2024 · 45 comments

@matyaskopp
Collaborator

matyaskopp commented Nov 22, 2024

Thanks for the great work on the corpora!
Please do not be put off by the long task list (everyone received one). I hope it will help you improve your corpus. I am happy to help and to discuss any ambiguities or doubts, so do not hesitate to ask.

Are component filenames really unique

  • unique component files

The filenames (file IDs /TEI/@id) must be unique. I am not sure if multiple plenary/committee meetings can be held on the same day.
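
One way to check this could be to collect the root `xml:id` of every component file and report any value that is used more than once. The sketch below is only an illustration: `duplicate_root_ids` is a hypothetical helper, not part of the ParlaMint tooling, and it takes file contents as strings so that it runs without the repository.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the XML namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_root_ids(docs):
    """Map each root xml:id shared by several files to the sorted file names.

    docs: dict of file name -> XML text (hypothetical input shape).
    """
    seen = defaultdict(list)
    for name, text in docs.items():
        root = ET.fromstring(text)
        seen[root.get(XML_ID)].append(name)
    return {rid: sorted(names) for rid, names in seen.items() if len(names) > 1}


# Made-up file names; the IDs follow the sample file naming
docs = {
    "a.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-IL_2009-03-12"/>',
    "b.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-IL_2009-03-12"/>',
    "c.xml": '<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-IL_2021-12-21"/>',
}
print(duplicate_root_ids(docs))
```

If a plenary and a committee meeting can fall on the same day, a date-only ID would show up in this report.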

maintitle unique and also in Hebrew

  • unique
  • Hebrew translation

The text value of the main title in component files has to be unique within the corpus, and there should also be a Hebrew translation:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L9
So instead of "reference corpus", you can use the date and some more information that makes the title unique (because you encode committees too, I believe the date alone is not enough):

<title type="main" xml:lang="he"><!-- ... --> ParlaMint-IL, <!--date + some more info--> [ParlaMint]</title>
<title type="main" xml:lang="en">Israeli parliamentary corpus ParlaMint-IL, <!--date + some more info--> [ParlaMint]</title>

<meeting> element in plenaries

  • parla.term
  • parla.session ?
  • parla.meeting
  • parla.sitting

Values from the <meeting> elements are used in concordancers for filtering transcriptions, so the correct encoding is really important. See the documentation: https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp
and also the taxonomy:

I believe this plenary hearing file: https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L12

<meeting ana="#parla.uni"/>

should be encoded this way

<meeting ana="#parla.term #parla.uni #period_18" n="18" corresp="#ParlaMint-IL-KNESS">הכנסת ה-18</meeting>
<meeting ana="#parla.meeting #parla.uni" n="5" corresp="#ParlaMint-IL-KNESS">ישיבה מס' 5</meeting>
<meeting ana="#parla.sitting #parla.uni" n="2009-03-12" corresp="#ParlaMint-IL-KNESS">2009-03-12</meeting>
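
A small check along these lines could flag component files whose `<meeting>` elements do not cover all three levels. This is a sketch under assumptions: `missing_meeting_levels` is a hypothetical helper, and `#parla.session` is deliberately left out because it is marked with a question mark in the task list above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
# Levels taken from the task list above; #parla.session is not checked
REQUIRED = ("#parla.term", "#parla.meeting", "#parla.sitting")


def missing_meeting_levels(xml_text):
    """Return the required @ana values that no <meeting> element carries."""
    root = ET.fromstring(xml_text)
    present = set()
    for meeting in root.iter(TEI_NS + "meeting"):
        present.update(meeting.get("ana", "").split())
    return [level for level in REQUIRED if level not in present]


before = (
    '<titleStmt xmlns="http://www.tei-c.org/ns/1.0">'
    '<meeting ana="#parla.uni"/>'
    "</titleStmt>"
)
after = (
    '<titleStmt xmlns="http://www.tei-c.org/ns/1.0">'
    '<meeting ana="#parla.term #parla.uni #period_18" n="18"/>'
    '<meeting ana="#parla.meeting #parla.uni" n="5"/>'
    '<meeting ana="#parla.sitting #parla.uni" n="2009-03-12"/>'
    "</titleStmt>"
)
print(missing_meeting_levels(before))
print(missing_meeting_levels(after))
```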

<meeting> element in committees

  • parla.term
  • parla.meeting
  • parla.sitting

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2021/ParlaMint-IL_2021-12-21.xml#L12
can be encoded this way:

<meeting ana="#parla.term #parla.committee #period_24" n="24" >הכנסת ה-24</meeting><!-- no corresp attribute, we don't have org -->
<meeting ana="#parla.meeting #parla.committee" n="????">ישיבה מס' ????</meeting><!-- no corresp attribute, we don't have org -->
<meeting ana="#parla.sitting #parla.committee" n="2021-12-21">2021-12-21</meeting><!-- no corresp attribute, we don't have org -->

It is a pity you do not have both committee organizations and texts, so that they could be linked. ParlaMint-BE has committee meetings too (but no <org>anization). On the other hand, CZ and HU have organizations but no corresponding texts. It would be great to have one corpus that has both :-)

<meeting> element in teiCorpus

  • parla.term

There should be a list of terms in the <meeting> elements in corpus root files, like this:

<meeting n="27" corresp="#NR" ana="#parla.lower #parla.term #NR.XXVII"/>
<meeting n="26" corresp="#NR" ana="#parla.lower #parla.term #NR.XXVI"/>
<meeting n="25" corresp="#NR" ana="#parla.lower #parla.term #NR.XXV"/>
<meeting n="24" corresp="#NR" ana="#parla.lower #parla.term #NR.XXIV"/>
<meeting n="23" corresp="#NR" ana="#parla.lower #parla.term #NR.XXIII"/>
<meeting n="22" corresp="#NR" ana="#parla.lower #parla.term #NR.XXII"/>
<meeting n="21" corresp="#NR" ana="#parla.lower #parla.term #NR.XXI"/>
<meeting n="20" corresp="#NR" ana="#parla.lower #parla.term #NR.XX"/>

annotation of the file TEI/@ana

  • add #parla.sitting into TEI/@ana

Add #parla.sitting to TEI/@ana if one file corresponds to one sitting. Alternatively, the #parla.meeting value can be used if sittings map one-to-one to meetings.
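
If the files are generated by a script, the value could be added with something like the following. It is only a sketch: `add_ana_value` is a hypothetical helper, and `#parla.uni` is used purely as an example of an existing @ana value.

```python
def add_ana_value(ana, value="#parla.sitting"):
    """Return the @ana string with `value` appended unless it is already present."""
    tokens = ana.split()
    if value not in tokens:
        tokens.append(value)
    return " ".join(tokens)


print(add_ana_value("#parla.uni"))
print(add_ana_value("#parla.uni #parla.sitting"))
```

Treating @ana as a whitespace-separated list keeps the function idempotent, so running it twice does not duplicate the value.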

bibliography

  • date
  • idno URL - texts available online, but the source is a different corpus that does not preserve this information

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-03-12.xml#L52-L53

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il</idno>
               <date from="1993-07-12" to="2024-04-03">1993-07-12 - 2024-04-03</date>
            </bibl>

The <date> should contain a single correct day, when="2009-03-12": either the day the text was made public or the meeting date.
The URL should point to the actual source of the transcription (if available), so that everyone can see the source you transformed into the corpus.

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il<!-- more concrete URL to the source --></idno>
               <date when="2009-03-12">2009-03-12</date>
            </bibl>

settingDesc

  • <setting>

Take a look at examples from other corpora:

<settingDesc>
<setting>
<name type="city" xml:lang="de">Wien</name>
<name type="city" xml:lang="en">Vienna</name>
<name type="country" key="AT" xml:lang="en">Austria</name>
<name type="country" key="AT" xml:lang="de">Österreich</name>
<date ana="#parla.sitting" when="2022-10-12">2022-10-12</date>
</setting>
</settingDesc>

<settingDesc>
<setting>
<name type="org">Parlament České republiky - Poslanecká sněmovna</name>
<name type="address">Sněmovní 176/4</name>
<name type="city">Praha</name>
<name key="CZ" type="country">Česká republika</name>
<date when="2023-07-26" ana="#parla.sitting">2023-07-26</date>
</setting>
</settingDesc>

<settingDesc>
<setting>
<name type="city">Ljubljana</name>
<name type="country" key="SI">Slovenija</name>
<date when="2022-04-06" ana="#parla.sitting">6. 4. 2022</date>
</setting>
</settingDesc>

ID format

  • u/@id
  • seg/@id
  • s/@id
  • w/@id and pc/@id

I know that the ID values are only for technical purposes, but consider changing them to the pattern most corpora use, something like
{file_id}.u{utteranceN}.p{paragraphN}.s{sentenceN}.w{tokenN} (the Czech style of creating IDs).
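
As an illustration of that pattern, IDs could be assembled like this. `make_id` is a hypothetical helper and the counters are made up; it only shows how each deeper level extends the ID of its parent.

```python
def make_id(file_id, u, p=None, s=None, w=None):
    """Build an ID such as {file_id}.u1.p2.s3.w4; levels after the first None are omitted."""
    parts = [file_id, f"u{u}"]
    for prefix, number in (("p", p), ("s", s), ("w", w)):
        if number is None:
            break
        parts.append(f"{prefix}{number}")
    return ".".join(parts)


file_id = "ParlaMint-IL_2009-10-21"
print(make_id(file_id, 1))                 # utterance ID
print(make_id(file_id, 1, p=1))            # segment (paragraph) ID
print(make_id(file_id, 1, p=1, s=1, w=2))  # token ID
```

Because the utterance and segment IDs do not depend on the annotation, the same values can be reused unchanged in the .ana files.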

changed ids in annotated version

  • no id changes

For technical reasons, we want to preserve utterance and segment IDs in the annotated versions (they should be identical in both versions). When you annotate the corpus, you are only enriching it, not changing existing content.
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.xml#L97-L100

            <u xml:id="u.session.18_ptv_139208_doc.0"
               who="#person.526"
               ana="#chair">
               <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9">שלום לכולם, אני פותח את הדיון. על סדר היום – העברות תקציביות. פניה מספר 238 מדובר על מנהלת סל"ע.</seg>

vs annotated version:
https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L103-L106

            <u xml:id="u.session.18_ptv_139208_doc-1.0"
               who="#person.526"
               ana="#chair">
               <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-2">
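
A rough way to detect such changes is to compare the `u` and `seg` IDs of the plain and annotated file position by position. The helper names below are hypothetical, and the sketch assumes that both files contain the same elements in the same order.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def element_ids(xml_text, tags=("u", "seg")):
    """Return the xml:id values of the given TEI elements in document order."""
    wanted = {TEI_NS + tag for tag in tags}
    return [el.get(XML_ID) for el in ET.fromstring(xml_text).iter() if el.tag in wanted]


def changed_ids(plain_xml, ana_xml):
    """Pair up IDs by position and return the pairs that differ."""
    pairs = zip(element_ids(plain_xml), element_ids(ana_xml))
    return [(a, b) for a, b in pairs if a != b]


plain = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="u.session.18_ptv_139208_doc.0">
    <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9"/>
  </u>
</div>"""
annotated = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="u.session.18_ptv_139208_doc-1.0">
    <seg xml:id="seg.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-2"/>
  </u>
</div>"""
for pair in changed_ids(plain, annotated):
    print(pair)
```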

syntactic vs orthographic words

  • syntactic words

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21.ana.xml#L107-L130

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0"
                        join="right">שלום</w>
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

Annotation with UDPipe for easier illustration:
[image]

It should be encoded this way:

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0">שלום</w> <!-- removed join="right" -->
<!-- REMOVED:
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
-->
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

Or the ways documented here: https://clarin-eric.github.io/ParlaMint/#sec-ana-norm

Not sure... "לכולם" does not have a lemma....
@TomazErjavec please help me here. We need to be able to convert it into conllu and vert. It would also be great if it would be possible to search it as one word for users...

named entities

  • Are there any multi-word named entities?

I have found only single-word named entities that were adjacent, like this:

                     <name type="MISC">
                        <w lemma="כנסת"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.18"
                           join="right">כנסת</w>
                     </name>
                     <name type="PER">
                        <w lemma="שוש"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.19"
                           join="right">שוש</w>
                     </name>
                     <name type="PER">
                        <w lemma="כרם"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.20"
                           join="right">כרם</w>
                     </name>

taxonomies

  • common taxonomies
  • ParlaMint-IL specific taxonomies

There are two types of taxonomies:

  1. common ones, with the ParlaMint-taxonomy- prefix, where no changes are allowed and only translation is required (except UD-SYN)
  2. country-specific ones, in your case with this prefix: ParlaMint-IL-taxonomy-

You have changed the content of the common taxonomies and also the filenames, so the taxonomies do not match the ParlaMint ones.

You can initialize common taxonomies with this command. Run it in the repository root folder:

make initTaxonomies4translation-IL

It creates the taxonomies in Samples/ParlaMint-IL and puts placeholders where the translations should appear (it overwrites existing files that have the same filename).

If you have the correct filename and IDs, you can use this sequence to prefill your translations:

# save your translations in the common translations
make translateTaxonomies-IL
# initialize taxonomies from common ones (if the translation exists, then it is used; otherwise, uses placeholder)
make initTaxonomies4translation-IL
# revert changes in common taxonomies (you are not allowed to change this folder - it is my job)
git checkout Build/Taxonomies/

languages

  • <langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>

The names of both languages (@ident) should be given in both languages (@xml:lang), like this:

<langUsage>
<language ident="cs" xml:lang="cs">čeština</language>
<language ident="en" xml:lang="cs">angličtina</language>
<language ident="cs" xml:lang="en">Czech</language>
<language ident="en" xml:lang="en">English</language>
</langUsage>
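
To make the rule concrete: every combination of @ident and @xml:lang should occur once. The following sketch lists the combinations that are missing; `missing_language_pairs` is a hypothetical helper, and the snippets are parsed without the TEI namespace to keep the example short.

```python
import xml.etree.ElementTree as ET
from itertools import product

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def missing_language_pairs(lang_usage_xml):
    """Return the (ident, xml:lang) combinations absent from <langUsage>."""
    languages = list(ET.fromstring(lang_usage_xml).iter("language"))
    idents = sorted({el.get("ident") for el in languages})
    present = {(el.get("ident"), el.get(XML_LANG)) for el in languages}
    return [pair for pair in product(idents, idents) if pair not in present]


incomplete = (
    "<langUsage>"
    '<language ident="he">עברית</language>'
    '<language ident="en">English</language>'
    "</langUsage>"
)
complete = (
    "<langUsage>"
    '<language ident="he" xml:lang="he">עברית</language>'
    '<language ident="en" xml:lang="he">אנגלית</language>'
    '<language ident="he" xml:lang="en">Hebrew</language>'
    '<language ident="en" xml:lang="en">English</language>'
    "</langUsage>"
)
print(missing_language_pairs(incomplete))
print(missing_language_pairs(complete))
```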

invalid label content

  • org/listEvent/event/label content

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L420-L427

    <listEvent>
      <event xml:id="org.13_period_1" from="1949-02-14" to="1969-01-28">
        <label xml:lang="en">&lt;Element {http://www.tei-c.org/ns/1.0}orgName at 0x2ccab29f6c0&gt;_period_1</label>
      </event>
      <event xml:id="org.13_period_2" from="1984-10-22" to="1992-03-09">
        <label xml:lang="en">&lt;Element {http://www.tei-c.org/ns/1.0}orgName at 0x2ccab29f6c0&gt;_period_2</label>
      </event>
    </listEvent>

This appears in multiple organizations; the above is just a sample.
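
My guess (an assumption, I have not seen the generating script) is that the label is built from the element object instead of from its text; the label content resembles the repr of an lxml element. The sketch below reproduces the same kind of mistake with the standard library parser, whose repr looks slightly different. The organization name `XYZ` is made up.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

org = ET.fromstring(
    '<org xmlns="http://www.tei-c.org/ns/1.0">'
    '<orgName full="abb">XYZ</orgName>'  # hypothetical organization name
    "</org>"
)
org_name = org.find(TEI_NS + "orgName")

# Formatting the element object puts its repr ("<Element ... at 0x...>") into the label
broken = f"{org_name}_period_1"
# Formatting the text content gives the intended label
fixed = f"{org_name.text}_period_1"

print(broken)
print(fixed)
```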

abbreviated form is longer than full

  • org/orgName

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L647-L648

  <org xml:id="org.31" role="parliamentaryGroup">
    <orgName full="yes">הליכוד</orgName>
    <orgName full="abb">גוש חירות ליברלים (גח"ל)</orgName>

This appears in multiple organizations; the above is just a sample.
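
The affected organizations could be listed with a check like this one. It is a sketch under assumptions: `long_abbreviations` is a hypothetical helper, the snippet is parsed without the TEI namespace, and each organization is assumed to have one full and one abbreviated name.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def long_abbreviations(list_org_xml):
    """Return the xml:id of each <org> whose abbreviated name is longer than its full name."""
    flagged = []
    for org in ET.fromstring(list_org_xml).iter("org"):
        names = {name.get("full"): (name.text or "") for name in org.findall("orgName")}
        if "yes" in names and "abb" in names and len(names["abb"]) > len(names["yes"]):
            flagged.append(org.get(XML_ID))
    return flagged


sample = '''<listOrg>
  <org xml:id="org.31" role="parliamentaryGroup">
    <orgName full="yes">הליכוד</orgName>
    <orgName full="abb">גוש חירות ליברלים (גח"ל)</orgName>
  </org>
  <org xml:id="org.153" role="parliamentaryGroup">
    <orgName full="yes">ח"כ עצמאי - הלל קוק</orgName>
    <orgName full="abb">ח"כ עצמאי - הלל קוק</orgName>
  </org>
</listOrg>'''
print(long_abbreviations(sample))
```

Organizations where both forms are identical are not reported, only those where the abbreviation is strictly longer.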

independent MP forms parliamentary group

  • independent MP

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L2045-L2051

  <org xml:id="org.153" role="parliamentaryGroup">
    <orgName full="yes">ח"כ עצמאי - הלל קוק</orgName>
    <orgName full="abb">ח"כ עצמאי - הלל קוק</orgName>
    <event from="1951-02-20">
      <label xml:lang="en">existence</label>
    </event>
  </org>

approx 30 occurrences.

This solution makes it possible to assign a political orientation to an independent MP, but it is really strange. We probably have to find a better solution. (@TomazErjavec ??)

only member affiliations

  • various affiliation roles

The corpus contains only member roles. Is it possible to add other roles? See https://clarin-eric.github.io/ParlaMint/#sec-affiliation

unknown person name

  • unknown person
  <person xml:id="person.3ae50273-a4de-47e6-99c8-71db4881d15d">
    <persName>
      <forename>Unknown</forename>
      <surname>קריאה</surname>
    </persName>
    <sex value="U"/>
  </person>
  <person xml:id="person.6311c681-5a63-4e29-b7f0-b76f0b3a0a6a">
    <persName>
      <forename>Unknown</forename>
      <surname>קריאות</surname>
    </persName>
    <sex value="U"/>
  </person>
  <person xml:id="person.31227de9-062c-4ff1-a01c-78f9dd30418c">
    <persName>
      <forename>Unknown</forename>
      <surname>Unknown</surname>
    </persName>
    <sex value="U"/>
  </person>

It is not necessary to fill in both forename and surname if they are unknown. If the person is completely unknown, there should be no person record in listPerson (you can also omit the @who attribute on the utterance).

@TomazErjavec
Collaborator

TomazErjavec commented Nov 23, 2024

@TomazErjavec please help me here. We need to be able to convert it into conllu and vert. It would also be great if it would be possible to search it as one word for users...

If I understand correctly, "לכולם" is a surface word that corresponds to two syntactic words "ל" and "כולם".
If this is the case, then this corresponds to the "abyste" example from https://clarin-eric.github.io/ParlaMint/#sec-ana-norm, so

<w join="right">לכולם
  <w norm="ל" lemma="ל" pos="ADP" msd="UPosTag=ADP"/>
  <w norm="כולם" lemma="כולם"  pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur"/>
</w>

(the join="right" is there because a comma follows).

A similar case from the IT sample would be

<w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.21-22" join="right">nell'<w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.21" norm="in" lemma="in" pos="E" msd="UPosTag=ADP"/><w xml:id="ParlaMint-IT_2018-03-23-LEG18-Senato-sed-1.ana.seg4.2.22" norm="l'" lemma="il" pos="RD" msd="UPosTag=DET|Definite=Def|Number=Sing|PronType=Art"/></w>

This gets converted to CoNLL-U like:

21-22 nell' _ _ _ _ _ _ _ NER=O|SpaceAfter=No
21 in in ADP E _ 23 case _ _
22 l' il DET RD Definite=Def|Number=Sing|PronType=Art 23 det _ _

and to vert like
nell' in|l' in|il ADP|DET -|Definite=Def Number=Sing PronType=Art 21|22 case|det assetto NOUN Gender=Masc Number=Sing 23

Note that vert files (exactly for cases like this) have multivalued attributes on norm, lemma, etc. Not ideal, but the best we can do with vertical files.
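
To illustrate how the nested `<w>` encoding maps to a CoNLL-U multiword token, here is a minimal sketch. It is not the actual ParlaMint conversion script: the function names are hypothetical, xml:id attributes are dropped from the IT example, and the HEAD, DEPREL, DEPS and MISC columns are simply left as `_`.

```python
import xml.etree.ElementTree as ET


def split_msd(msd):
    """Split an msd value into (UPOS, FEATS); FEATS is '_' when nothing else is present."""
    feats = [f for f in msd.split("|") if f]
    upos = next((f.split("=", 1)[1] for f in feats if f.startswith("UPosTag=")), "_")
    rest = [f for f in feats if not f.startswith("UPosTag=")]
    return upos, "|".join(rest) or "_"


def word_row(index, form, w):
    """One 10-column CoNLL-U row for a syntactic word."""
    upos, feats = split_msd(w.get("msd", ""))
    return [str(index), form, w.get("lemma", "_"), upos, w.get("pos", "_"), feats,
            "_", "_", "_", "_"]


def conllu_rows(w, start=1):
    """Rows for one top-level <w>: a range line plus its parts if it has nested <w>."""
    parts = list(w)
    if not parts:
        return [word_row(start, w.text, w)]
    end = start + len(parts) - 1
    rows = [[f"{start}-{end}", (w.text or "").strip()] + ["_"] * 8]
    rows += [word_row(i, part.get("norm"), part) for i, part in enumerate(parts, start)]
    return rows


token = ET.fromstring(
    """<w join="right">nell'<w norm="in" lemma="in" pos="E" msd="UPosTag=ADP"/>"""
    """<w norm="l'" lemma="il" pos="RD" msd="UPosTag=DET|Definite=Def|Number=Sing|PronType=Art"/></w>"""
)
for row in conllu_rows(token, start=21):
    print("\t".join(row))
```

The surface form lives only on the range line, which is what allows users to search for the orthographic word while the syntactic words carry the annotation.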

@matyaskopp
Collaborator Author

matyaskopp commented Nov 24, 2024

random person check דורון אביטל

I have checked a random person: https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL-listPerson.xml#L241-L254

  <person xml:id="person.18990">
    <persName>
      <forename>דורון</forename>
      <surname>אביטל</surname>
    </persName>
    <sex value="M"/>
    <birth when="1959-01-22">
      <placeName>ישראל</placeName>
    </birth>
    <affiliation ref="#org.122" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="2011-03-18" to="2013-02-05"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="minister" from="2011-03-18" to="2013-02-05"/>
  </person>

His parliamentary group status at the time of membership:

    <relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2009-03-31" to="2012-05-08"/>
    <relation name="coalition" mutual="#org.122" from="2012-05-08" to="2012-07-17"/>
    <relation name="opposition" active="#org.122" passive="#ParlaMint-IL-GOV" from="2012-07-17" to="2013-02-05"/>

There are some oddities:

  • he is a member of the government and in the opposition at the same time
  • the government membership has the same timespan as the parliament membership (in Czechia, it takes some time, weeks to months, to become a minister after becoming a parliament member)
  • Wikipedia does not say he was a minister

I am not sure whether the concept of government membership in ParlaMint is clear. It seems that all parliament members affiliated with a parliamentary group in the coalition are encoded as members of the government.
https://clarin-eric.github.io/ParlaMint/#sec-affiliation
A member of government is someone who holds a position in the government (not everyone from the coalition).
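
The first oddity can be found mechanically by intersecting the government affiliation with the periods in which the person's group is in opposition. A minimal sketch, with a hypothetical `overlap` helper and the dates copied from the record above:

```python
from datetime import date


def overlap(a_from, a_to, b_from, b_to):
    """Return the shared (start, end) of two closed ISO-date intervals, or None."""
    start = max(date.fromisoformat(a_from), date.fromisoformat(b_from))
    end = min(date.fromisoformat(a_to), date.fromisoformat(b_to))
    return (start.isoformat(), end.isoformat()) if start <= end else None


government = ("2011-03-18", "2013-02-05")  # affiliation to #ParlaMint-IL-GOV
opposition = [
    ("2009-03-31", "2012-05-08"),  # org.122 in opposition
    ("2012-07-17", "2013-02-05"),  # org.122 in opposition again
]
for period in opposition:
    print(overlap(*government, *period))
```

Any result other than None marks a period in which the same person is encoded both as a government member and as a member of an opposition group.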

@matyaskopp
Collaborator Author

matyaskopp commented Nov 25, 2024

INVALID @matyaskopp fault:

<meeting> element in teiCorpus

  • non-unique meeting element in teiCorpus

The <meeting> element should be unique within the file, but there are repetitions in the corpus root file: https://github.com/GiliGoldin/ParlaMint/blob/8040ae5cd6579b7e4f414517a766ce5ce8b93f74/Samples/ParlaMint-IL/ParlaMint-IL.xml#L13-L36

            <meeting n="14"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_14"
                     xml:lang="he">הכנסת ה-14</meeting>
            <meeting n="14"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_14"
                     xml:lang="en">14th Knesset</meeting>
            <meeting n="18"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_18"
                     xml:lang="he">הכנסת ה-18</meeting>
            <meeting n="18"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_18"
                     xml:lang="en">18th Knesset</meeting>
            <meeting n="24"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_24"
                     xml:lang="he">הכנסת ה-24</meeting>
            <meeting n="24"
                     corresp="#ParlaMint-IL-KNESS"
                     ana="#parla.uni #parla.term #period_24"
                     xml:lang="en">24th Knesset</meeting>

@GiliGoldin
Collaborator

GiliGoldin commented Nov 25, 2024 via email

@matyaskopp
Collaborator Author

Sorry I don't understand. What is the repetition? Shouldn't there be a meeting for each term? It's written in Hebrew and in English. How should it be then?

Of course, you are right.
Your encoding is correct!

@matyaskopp
Collaborator Author

matyaskopp commented Nov 25, 2024

extra files

remove from repository:

  • ParlaMint-IL.teiCorpus.xml
  • ParlaMint-IL.ana.teiCorpus.xml
  • Scripts/bin/saxon-ee-11.6.jar
  • Scripts/bin/saxon-ee-11.6.jar:Zone.Identifier
  • Scripts/bin/trang-20220510/copying.txt
  • Scripts/bin/trang-20220510/trang-manual.html
  • Scripts/bin/trang-20220510/trang.jar
  • Scripts/bin/trang.jar (revert change)

@GiliGoldin
Collaborator

GiliGoldin commented Nov 25, 2024 via email

@matyaskopp
Collaborator Author

languages

https://github.com/GiliGoldin/ParlaMint/blob/199b869dd0c2734124a98663193da1a6ad972007/Samples/ParlaMint-IL/ParlaMint-IL.xml#L144-L149

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">אנגלית</language>
            <language ident="he">Hebrew</language>
            <language ident="en">English</language>
         </langUsage>

should be:

         <langUsage>
            <language ident="he" xml:lang="he">עברית</language>
            <language ident="en" xml:lang="he">אנגלית</language>
            <language ident="he" xml:lang="en">Hebrew</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>

@matyaskopp
Collaborator Author

taxonomies

There are still some taxonomies which are IL-specific or not linked:

Samples/ParlaMint-IL/ParlaMint-taxonomy-roles.xml
Samples/ParlaMint-IL/ParlaMint-taxonomy-sessionTypes.xml

I guess they can be removed

@matyaskopp
Collaborator Author

Thanks for the great progress; I have ticked what has been resolved so far. If anything is unclear, please ask.

You are right, it seems that I inserted the time of his faction membership as the time in the government instead of the time in the coalition. I will fix this.
I did assign each coalition member as a government member and as a minister. I see now that this is a mistake, I will remove the minister role since I don't have the information regarding the roles of the ministers and government positions. In our corpus we do consider all the people in the coalition to be government members.

Well, you made more changes than just removing ministers and fixing the beginnings of timespans in 199b869; see Netanyahu:
[image]

Some removals seem to be correct (e.g. Netanyahu was not in government with Bennett). I hope you are aware of these changes, i.e. that they were a bugfix and not accidental removals.

The government beginnings seem to be okay (if the start of the coalition is the start of the government), but you now most probably have time spans without a government, because you have shifted only the beginnings (the old government still works after the new MPs take the parliamentary oath).
I believe you want to make a ParlaMint-comparable corpus (not just one using the ParlaMint encoding), so I suggest not sticking to your source corpus but rather extending it with more metadata. All Israeli governments are easily accessible on Wikipedia. Would using this data be a solution?
https://en.wikipedia.org/wiki/Cabinet_of_Israel#List_of_cabinets
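
Whether such time spans without a government exist can be checked once the government periods are listed. A sketch with a hypothetical helper and made-up dates (not real Israeli governments); open-ended periods without an end date are not handled:

```python
from datetime import date, timedelta


def government_gaps(periods):
    """Return uncovered (start, end) ISO-date ranges between consecutive closed periods."""
    spans = sorted((date.fromisoformat(a), date.fromisoformat(b)) for a, b in periods)
    gaps = []
    if not spans:
        return gaps
    one_day = timedelta(days=1)
    covered_to = spans[0][1]
    for start, end in spans[1:]:
        if start > covered_to + one_day:
            gaps.append(((covered_to + one_day).isoformat(), (start - one_day).isoformat()))
        covered_to = max(covered_to, end)
    return gaps


# Illustrative dates only
periods = [("2000-01-01", "2000-06-30"), ("2000-08-01", "2000-12-31")]
print(government_gaps(periods))
```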

We have a script for enriching tei with tsv data:

@GiliGoldin
Collaborator

GiliGoldin commented Nov 28, 2024

bibliography

  • date
  • idno URL

The URL should point to the actual source of the transcription (if available), so that everyone can see the source you transformed into the corpus.

           <bibl>
               <title type="main" xml:lang="he">פרוטוקולים של הכנסת</title>
               <title type="main" xml:lang="en">Knesset Protocols</title>
               <idno type="URI" subtype="parliament">https://www.knesset.gov.il<!-- more concrete URL to the source --></idno>
               <date when="2009-03-12">2009-03-12</date>
            </bibl>

The sources can be found online, but I don't have the specific URLs because we didn't process the files directly from the website; we received them by email directly from the Knesset archivists.

@matyaskopp
Collaborator Author

The sources can be found online, but I don't have the specific URLs because we didn't process the files directly from the website; we received them by email directly from the Knesset archivists.

Okay, that's a shame. You can add it to your checklist for improving your source corpus.
I highly recommend that everyone include the source link: it is beneficial not only for development purposes but also for validating the correctness and completeness of your data (with non-trivial effort, but it is doable).

@GiliGoldin
Collaborator

Well, you made more changes than just removing ministers and fixing the beginnings of timespans in 199b869; see Netanyahu: [image]

Some removals seem to be correct (e.g. Netanyahu was not in government with Bennett). I hope you are aware of these changes, i.e. that they were a bugfix and not accidental removals.

The government beginnings seem to be okay (if the start of the coalition is the start of the government), but you now most probably have time spans without a government, because you have shifted only the beginnings (the old government still works after the new MPs take the parliamentary oath). I believe you want to make a ParlaMint-comparable corpus (not just one using the ParlaMint encoding), so I suggest not sticking to your source corpus but rather extending it with more metadata. All Israeli governments are easily accessible on Wikipedia. Would using this data be a solution? https://en.wikipedia.org/wiki/Cabinet_of_Israel#List_of_cabinets

We have a script for enriching tei with tsv data:

I made sure to use the coalition dates rather than the faction membership dates. This caused all the mentioned changes, which are correct now. The start of the coalition membership is the start of the government membership, not the parliamentary oath, but yes, the end will be the end of the coalition membership.
I did see and fix a mistake in a coalition start_date, 22.11.88 instead of 22.12.88, but the content of the corpus and its accuracy should be our responsibility. I am of course making every effort for the corpus to fit the ParlaMint formatting and instructions.
I manually added the prime ministers as roles in the government, with the dates in the role. I hope this will suffice both for the request for roles other than member and to ensure that there are no gaps without any government.

@GiliGoldin
Collaborator

GiliGoldin commented Dec 1, 2024

Annotation with UDPipe for easier illustration: [image]

it should be encoded this way:

                  <s xml:id="s.id-7092c662-d9c6-4ca1-b68d-9c0ffc7be7d9-3">
                     <w lemma="שלום"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.0">שלום</w> <!-- removed join="right" -->
<!-- REMOVED:
                     <w lemma="_"
                        pos="X"
                        msd="UPosTag=X"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.1"
                        join="right">לכולם</w>
-->
                     <w lemma="ל"
                        pos="ADP"
                        msd="UPosTag=ADP"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.2"
                        join="right">ל</w>
                     <w lemma="כולם"
                        pos="NOUN"
                        msd="UPosTag=NOUN|Gender=Masc|Number=Plur"
                        xml:id="w.session.18_ptv_139208_doc-1.0.1.3"
                        join="right">כולם</w>
                     <pc xml:id="w.session.18_ptv_139208_doc-1.0.1.4"
                         msd="UPosTag=PUNCT"
                         join="right">,</pc>

Or the ways documented here: https://clarin-eric.github.io/ParlaMint/#sec-ana-norm

This was fixed according to what TomazErjavec suggested.

named entities

  • Are there any multi-word named entities?

I have found only single-word named entities that were adjacent, like this:

                     <name type="MISC">
                        <w lemma="כנסת"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.18"
                           join="right">כנסת</w>
                     </name>
                     <name type="PER">
                        <w lemma="שוש"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.19"
                           join="right">שוש</w>
                     </name>
                     <name type="PER">
                        <w lemma="כרם"
                           pos="PROPN"
                           msd="UPosTag=PROPN"
                           xml:id="w.session.14_ptm_532674_docx-1.1758.3.20"
                           join="right">כרם</w>
                     </name>

This was fixed

taxonomies

  • common taxonomies
  • ParlaMint-IL specific taxonomies

there are two types of taxonomies

  1. common one with ParlaMint-taxonomy- prefix where no changes are allowed, only translation is required (except UD-SYN)
  2. country-specific, in your case, this prefix: ParlaMint-IL-taxonomy-

You have changed the content of the common taxonomies and also the filenames, so the taxonomies do not match the ParlaMint ones.

You can initialize common taxonomies with this command. Run it in the repository root folder:

make initTaxonomies4translation-IL

It creates the taxonomies in Samples/ParlaMint-IL and puts placeholders where the translations should appear (it overwrites existing files that have the same filename).

If you have the correct filename and IDs, you can use this sequence to prefill your translations:

# save your translations in the common translations
make translateTaxonomies-IL
# initialize taxonomies from common ones (if a translation exists, it is used; otherwise a placeholder is used)
make initTaxonomies4translation-IL
# revert changes in common taxonomies (you are not allowed to change this folder - it is my job)
git checkout Build/Taxonomies/

This was fixed. There are no IL-specific taxonomies anymore.

languages

  • <langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>

The name of each language (@ident) should be given in both languages (@xml:lang), like this:

<langUsage>
<language ident="cs" xml:lang="cs">čeština</language>
<language ident="en" xml:lang="cs">angličtina</language>
<language ident="cs" xml:lang="en">Czech</language>
<language ident="en" xml:lang="en">English</language>
</langUsage>

This was fixed

independent MP forms parliamentary group

  • independent MP

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL-listOrg.xml#L2045-L2051

  <org xml:id="org.153" role="parliamentaryGroup">
    <orgName full="yes">ח"כ עצמאי - הלל קוק</orgName>
    <orgName full="abb">ח"כ עצמאי - הלל קוק</orgName>
    <event from="1951-02-20">
      <label xml:lang="en">existence</label>
    </event>
  </org>

approx 30 occurrences.

This solution makes it possible to assign a political orientation to an independent MP, but it is really strange. We probably have to find a better solution. (@TomazErjavec ??)

Those are factions made up of only one independent MP, but they are treated as regular factions/political parties in the parliament. I don't see why they should be stored differently.

@matyaskopp
Collaborator Author

languages

  • <langUsage>

https://github.com/GiliGoldin/ParlaMint/blob/27a4fa70319f58c2dfeaf5e8bae00eff0f10fc8a/Samples/ParlaMint-IL/ParlaMint-IL.ana.xml#L143-L146

         <langUsage>
            <language ident="he">עברית</language>
            <language ident="en">English</language>
         </langUsage>

The name of each language (@ident) should be given in both languages (@xml:lang), like this:

<langUsage>
<language ident="cs" xml:lang="cs">čeština</language>
<language ident="en" xml:lang="cs">angličtina</language>
<language ident="cs" xml:lang="en">Czech</language>
<language ident="en" xml:lang="en">English</language>
</langUsage>

This was fixed

@GiliGoldin, it was fixed only partially, see my previous comment: #883 (comment)

should be:

         <langUsage>
            <language ident="he" xml:lang="he">עברית</language>
            <language ident="en" xml:lang="he">אנגלית</language>
            <language ident="he" xml:lang="en">Hebrew</language>
            <language ident="en" xml:lang="en">English</language>
         </langUsage>
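A small sketch of how the completeness of `<langUsage>` could be checked automatically. It assumes that every language listed in `@ident` must also appear as an `@xml:lang`; the function name is hypothetical and the TEI namespace is omitted for brevity:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def missing_language_names(lang_usage):
    """Return the (ident, xml:lang) pairs that have no <language> element."""
    languages = list(lang_usage.iter("language"))
    idents = {lang.get("ident") for lang in languages}
    present = {(lang.get("ident"), lang.get(XML_LANG)) for lang in languages}
    return sorted(
        (ident, name_lang)
        for ident in idents
        for name_lang in idents
        if (ident, name_lang) not in present
    )


partial = ET.fromstring(
    "<langUsage>"
    '<language ident="he">עברית</language>'
    '<language ident="en">English</language>'
    "</langUsage>"
)
complete = ET.fromstring(
    "<langUsage>"
    '<language ident="he" xml:lang="he">עברית</language>'
    '<language ident="en" xml:lang="he">אנגלית</language>'
    '<language ident="he" xml:lang="en">Hebrew</language>'
    '<language ident="en" xml:lang="en">English</language>'
    "</langUsage>"
)
missing_in_partial = missing_language_names(partial)
missing_in_complete = missing_language_names(complete)
```

For the first variant all four combinations are reported as missing, because none of the elements carries `@xml:lang`; for the second the list is empty.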

@matyaskopp
Collaborator Author

Those are factions made up of only one independent MP, but they are treated as regular factions/political parties in the parliament. I don't see why they should be stored differently.

If this reflects the reality in Knesset, then do it this way - I am ok with it.
In most European parliaments, there are some restrictions on forming parliamentary groups. A minimum number of members is required, for example, in CZ, the minimum is 3.

@matyaskopp
Collaborator Author

matyaskopp commented Dec 2, 2024

join attribute

  • */@join="right"

There are too many joins, so the text of the raw TEI and annotated (TEI.ana) versions differs.
I have polished the script to make debugging easier. The idea is to convert all segments in TEI and TEI.ana into text; the results should be identical:

$ make text.seg.ana-IL text.seg-IL
INFO: converting ParlaMint-IL_2009-10-21-18ptv139208.ana to text file
INFO: converting ParlaMint-IL_2021-12-21-24ptv616837.ana to text file
INFO: converting ParlaMint-IL_2009-03-12-18ptm186016.ana to text file
INFO: converting ParlaMint-IL_1998-07-08-14ptm532674.ana to text file
INFO: annotated segments converted to text are stored in Samples/ParlaMint-IL/text.seg.ana
INFO: converting ParlaMint-IL_2009-03-12-18ptm186016 to text file
INFO: converting ParlaMint-IL_2009-10-21-18ptv139208 to text file
INFO: converting ParlaMint-IL_2021-12-21-24ptv616837 to text file
INFO: converting ParlaMint-IL_1998-07-08-14ptm532674 to text file
INFO: segments converted to text are stored in Samples/ParlaMint-IL/text.seg

and then you can compare folders (I use meld):

$ meld Samples/ParlaMint-IL/text.seg Samples/ParlaMint-IL/text.seg.ana

image
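To illustrate why surplus joins make the two versions differ, here is a rough sketch of turning tokens back into text. It is not the project's conversion script; it only assumes that `join="right"` means "no space after this token", and the function name is hypothetical:

```python
def tokens_to_text(tokens):
    """Rebuild a segment from (form, joined_right) pairs.

    joined_right=True stands for join="right": no space follows the token.
    """
    parts = []
    for form, joined_right in tokens:
        parts.append(form)
        if not joined_right:
            parts.append(" ")
    return "".join(parts).rstrip()


# Correct joins: only the verb is glued to the question mark.
correct = tokens_to_text(
    [("איפה", False), ("המכינה", False), ("תוקם", True), ("?", False)]
)
# Misplaced joins: the second word is glued to the verb instead.
misplaced = tokens_to_text(
    [("איפה", False), ("המכינה", True), ("תוקם", False), ("?", True)]
)
```

With the misplaced joins two words run together and a space appears before the question mark, so the text no longer matches what the unannotated segment contains.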

@matyaskopp
Collaborator Author

matyaskopp commented Dec 2, 2024

@GiliGoldin, you removed your comment before I could react, so there are probably still some doubts.
In the meantime, I "fixed" one commit(9cb4c22) with orthographical words that hides one of the bugs, so the result is slightly different.

I can give you an example https://github.com/GiliGoldin/ParlaMint/blob/4571733fe48a9d200c92fd1ba7b02807bfc7ccfb/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L484-L519 of how this sentence should be encoded.
The current state is:

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
                     <w lemma="איפה"
                        pos="ADV"
                        msd="UPosTag=ADV|PronType=Int"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
                     <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3" join="right">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
                           norm="ה"
                           lemma="ה"
                           pos="DET"
                           msd="UPosTag=DET|Definite=Def|PronType=Art"/>
                        <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
                           norm="מכינה"
                           lemma="מכינה"
                           pos="NOUN"
                           msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
                     </w>
                     <w lemma="הוקם"
                        pos="VERB"
                        msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
                         msd="UPosTag=PUNCT"
                         join="right">?</pc>
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

It should be:

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0">
                     <w lemma="איפה"
                        pos="ADV"
                        msd="UPosTag=ADV|PronType=Int"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t1">איפה</w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3) removing join="right" 
because the token(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4) 
on the right(=following) in this file is not joined
-->
                     <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2-3">המכינה<w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t2"
                           norm="ה"
                           lemma="ה"
                           pos="DET"
                           msd="UPosTag=DET|Definite=Def|PronType=Art"/>
                        <w xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t3"
                           norm="מכינה"
                           lemma="מכינה"
                           pos="NOUN"
                           msd="UPosTag=NOUN|Gender=Fem|Number=Sing"/>
                     </w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4) added join="right" 
because the punctuation(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) is joined with this token
-->
                     <w lemma="הוקם"
                        join="right"
                        pos="VERB"
                        msd="UPosTag=VERB|Gender=Fem|HebBinyan=HUFAL|Number=Sing|Person=3|Tense=Fut|Voice=Pass"
                        xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t4">תוקם</w>
<!-- 
(ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5) removing join="right" 
because the token is at the end of the sentence
-->
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u2.p0.s0.t5"
                         msd="UPosTag=PUNCT">?</pc>
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

I am sorry if I wrote it ambiguously in #882. I hope this example helps.

@GiliGoldin
Collaborator

Yes, I removed the comment since I noticed more problems that needed fixing.
I think I solved this problem in the meantime. Now the problems are mostly with punctuation marks like ", where I haven't figured out how to distinguish a closing one from an opening one.

@matyaskopp
Collaborator Author

matyaskopp commented Dec 2, 2024

Yes, I removed the comment since I noticed more problems that needed fixing.
I think I solved this problem in the meantime. Now the problems are mostly with punctuation marks like ", where I haven't figured out how to distinguish a closing one from an opening one.

The idea is to store source spacing, not to create a typographically correct one. But the current state is much better - it does not break the text, so we can leave it as it is.
image

@matyaskopp
Collaborator Author

I have spotted one easy-fix join issue:
https://github.com/clarin-eric/ParlaMint/actions/runs/12123458750/job/33799059974#step:4:11009

Make sure that the last token in a sentence does not contain the join attribute:
https://github.com/GiliGoldin/ParlaMint/blob/abc39b0c344fffa970992b82e2260ace3c5377ac/Samples/ParlaMint-IL/2009/ParlaMint-IL_2009-10-21-18ptv139208.ana.xml#L1975-L1978

                  <s xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1">
                     <!-- SKIPPING -->
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t12"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc>
                     <pc xml:id="ParlaMint-IL_2009-10-21-18ptv139208.u10.p0.s1.t13"
                         msd="UPosTag=PUNCT"
                         join="right">-</pc> <!-- REMOVE THIS JOIN -->
                     <linkGrp targFunc="head argument" type="UD-SYN"><!-- SKIPPING --></linkGrp>
                  </s>

I believe your pipeline will be ready to run on all data when you fix this.
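A minimal sketch of such a fix as a post-processing step, assuming it is acceptable to simply drop `@join` from the last `<w>`/`<pc>` of every `<s>`. The function name is hypothetical and the TEI namespace is omitted for brevity:

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
TOKEN_TAGS = {"w", "pc"}


def drop_trailing_joins(root):
    """Remove @join from the last token of each <s>; return the ids that were fixed."""
    fixed = []
    for sentence in root.iter("s"):
        tokens = [child for child in sentence if child.tag in TOKEN_TAGS]
        if tokens and tokens[-1].attrib.pop("join", None) is not None:
            fixed.append(tokens[-1].get(XML_ID))
    return fixed


sentence = ET.fromstring(
    '<s xml:id="u10.p0.s1">'
    '<pc xml:id="u10.p0.s1.t12" msd="UPosTag=PUNCT" join="right">-</pc>'
    '<pc xml:id="u10.p0.s1.t13" msd="UPosTag=PUNCT" join="right">-</pc>'
    '<linkGrp targFunc="head argument" type="UD-SYN"/>'
    "</s>"
)
fixed_ids = drop_trailing_joins(sentence)
```

Only the last token loses its join; the join on the preceding token stays, because the two dashes really are adjacent in the source.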

@GiliGoldin, thanks for the exceptional work!

@TomazErjavec, just for your update, ParlaMint-IL sample is close to being ready.

@GiliGoldin
Collaborator

That's great, thank you so much!
I fixed this issue.

@GiliGoldin
Collaborator

Hi, the full data is located here:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL

Happy holidays!
Gili

@TomazErjavec
Collaborator

@GiliGoldin, thanks for letting us know. Will have a look and try to process it soon.
Happy holidays back!

@TomazErjavec
Collaborator

I transferred your corpus, had a look, and tried to process it. Many of the errors found are surprising, as things are
OK in the sample files but not in the complete corpus:

Many elements in the TEI headers (esp. in the corpus root file) do not mark their content as being English,
e.g. <title type="sub">Knesset Protocols</title>. All such elements should have xml:lang="en", otherwise
processing will produce wrong results. For example, we can generate TSV files with metadata points in local language
or English, which won't work now.
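A rough sketch of how such unmarked elements could be listed. It relies on two assumptions that are not from the thread: that the header inherits a non-English `xml:lang` (here `he`) from an ancestor, and that "text made only of ASCII letters" is a good enough stand-in for "English". The function names are hypothetical and the TEI namespace is omitted:

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def looks_english(text):
    """Heuristic: the text contains letters and all of them are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and all(ch.isascii() for ch in letters)


def english_text_not_marked(element, inherited=None, found=None):
    """Collect (tag, text) of elements whose English-looking text is not in an 'en' scope."""
    found = [] if found is None else found
    lang = element.get(XML_LANG, inherited)  # xml:lang is inherited from ancestors
    text = (element.text or "").strip()
    if lang != "en" and looks_english(text):
        found.append((element.tag, text))
    for child in element:
        english_text_not_marked(child, lang, found)
    return found


header = ET.fromstring(
    '<teiHeader xml:lang="he"><titleStmt>'
    '<title type="main" xml:lang="en">Israeli parliamentary corpus ParlaMint-IL</title>'
    '<title type="sub">Knesset Protocols</title>'
    "</titleStmt></teiHeader>"
)
unmarked = english_text_not_marked(header)
```

The heuristic also flags identifiers and other ASCII-only strings, and it ignores text that follows a child element, so the output is a list of candidates to review rather than a list of errors.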

I tried processing the complete corpus with our scripts (cf. Build/ directory), but it turns out your corpus is too
large for that (in ParlaMint we typically only had transcripts from 2015 onwards); we have to think about how to modify
the scripts to consume less memory. So, for now, I commented out the transcript files except for the first and last
two years and then did the build + validation. You can find the log files at
https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D where the .error and .warn files are just "grep -i error" and "grep
-i warn" over the complete log file ParlaMint-IL.log.

The main errors seem to be:

  • overlapping affiliations, e.g. a person is a member of both coalition and opposition on a certain date
  • strange nested divs (look for "error: element "div" not allowed here"), e.g.
<div type="debateSection" xml:id="session.25_ptv_1883249_doc">
  <div type="parentSession">
     <head xml:id="ParlaMint-IL_2023-02-12-25ptv1883249.head1">תוצופתהו הטילקה ,היילעה תדעו</head>
     <div type="sessionName">
        <head xml:id="ParlaMint-IL_2023-02-12-25ptv1883249.head2">םילועל ירוביצ רוידו רוידב עויסל תוצופתהו הטילקה ,היילעה תדעו לש הנשמה תדעו</head>
     </div>
  </div>

I think the intention was to preserve the name of the parent session but this is not the right way to do it. Maybe simply:

<div type="debateSection" xml:id="session.25_ptv_1883249_doc">
  <head xml:id="ParlaMint-IL_2023-02-12-25ptv1883249.head1" type="parentSessionName">תוצופתהו הטילקה ,היילעה תדעו</head>
  <head xml:id="ParlaMint-IL_2023-02-12-25ptv1883249.head2">םילועל ירוביצ רוידו רוידב עויסל תוצופתהו הטילקה ,היילעה תדעו לש הנשמה תדעו</head>

As for the .ana files:

  • quite a few of the files have not been linguistically annotated, e.g. ParlaMint-IL_2023-01-04-25ptv1558268.ana.xml
    (look for "error: text not allowed here")
  • NER annotations (i.e. <name> elements) are missing completely
  • the tokeniser you are using seems to be quite bad, there are many punctuation elements that contain space (look
    for "character content of element "pc" invalid") and, I noticed, a lot of words that contain punctuation e.g.
    <w lemma="פשתה&quot;ד–" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_2024-01-01-25ptm3852892.u1.p0.s0.t94">ד"פשתה–</w> which
    surely can't be right
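A small sketch of how the two tokenisation symptoms could be located before running the full validation. It assumes that whitespace inside `<pc>` and a leading or trailing dash on a multi-character `<w>` are the patterns of interest; quotation marks and apostrophes inside words are deliberately not flagged. The function name is hypothetical and the TEI namespace is omitted:

```python
import unicodedata
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def is_dash(ch):
    """True for characters in the Unicode 'dash punctuation' category (Pd)."""
    return unicodedata.category(ch) == "Pd"


def suspicious_tokens(root):
    """Return (xml:id, reason) pairs for tokens that look mis-tokenised."""
    problems = []
    for token in root.iter():
        text = token.text or ""
        if token.tag == "pc" and any(ch.isspace() for ch in text):
            problems.append((token.get(XML_ID), "pc contains whitespace"))
        elif token.tag == "w" and len(text) > 1 and (is_dash(text[0]) or is_dash(text[-1])):
            problems.append((token.get(XML_ID), "w starts or ends with a dash"))
    return problems


sample = ET.fromstring(
    "<s>"
    '<w xml:id="t94">ד"פשתה–</w>'
    '<pc xml:id="t95">. . .</pc>'
    '<pc xml:id="t96">.</pc>'
    "</s>"
)
report = suspicious_tokens(sample)
```

Both symptoms from the list above are reported for the sample, while the plain full stop passes.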

If you could fix (as many as you can of) these mistakes and post a new version, we could then take it from there.

One other thing, it would be nice to localise the common taxonomies, i.e. translate their category descriptions into
Hebrew. This is not required, but I think it is nice that you can have these names in your own language. If you do
decide to do so, pls. get in touch first as there are some nuances on how to do it.

matyaskopp added a commit that referenced this issue Jan 4, 2025
@matyaskopp
Collaborator Author

matyaskopp commented Jan 4, 2025

One other thing, it would be nice to localise the common taxonomies, i.e. translate their category descriptions into
Hebrew. This is not required, but I think it is nice that you can have these names in your own language. If you do
decide to do so, pls. get in touch first as there are some nuances on how to do it.

@TomazErjavec, @GiliGoldin added translations to samples, but I forgot to update the common taxonomies. I am doing it manually with:

make translateTaxonomies-IL

It is now included in e4a0f26.

@GiliGoldin
Collaborator

Okay, I will look into these problems and try to solve as many as possible. I will upload a new version soon.

@GiliGoldin
Collaborator

Okay, I tried to fix most of the problems:

  1. Regarding the overlapping affiliations: as far as I understand, all or at least most of the warnings are caused by a one-day overlap. In our data, the last day of one affiliation and the first day of the next are the same day.
  2. I think I fixed the div problem.
  3. Regarding the missing NER and linguistic annotations: apparently I used the wrong files as input. I fixed this.
  4. Regarding the quality of the tokenizer: it's not perfect, but not that bad. The spaces in the PC elements occur in sequences like ". . . ", where there are spaces between the dots but the tokenizer treats the sequence as one punctuation mark (I think that's correct). However, I removed these kinds of punctuation marks.
  5. Regarding lemmas like "פשתה"ד–": the only problem here is the "–", and I don't know why it ends up inside the lemma; the rest is fine. The word is 'התשפ"ד', which is a Hebrew calendar year, and it is one basic word that includes the ". I did remove punctuation marks like "–", "-", ".", "," from the lemmas.

I hope these changes will solve most of the problems.
The new data is here:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL
Let me know if anything else is needed.
Thank you,
Gili

@matyaskopp
Collaborator Author

@GiliGoldin, I have tried to process your files: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/?P=*20250111*:

  1. It failed on the ParlaMint-IL.TEI version because you are including taxonomies that are used only in the annotated version:
      <classDecl>
        <xi:include href="ParlaMint-taxonomy-parla.legislature.xml"/>
        <xi:include href="ParlaMint-taxonomy-politicalOrientation.xml"/>
        <xi:include href="ParlaMint-taxonomy-speaker_types.xml"/>
        <xi:include href="ParlaMint-taxonomy-subcorpus.xml"/>
        <xi:include href="ParlaMint-taxonomy-NER.ana.xml"/> <!-- should not be present in TEI version -->
        <xi:include href="ParlaMint-taxonomy-UD-SYN.ana.xml"/> <!-- should not be present in TEI version -->
      </classDecl>

I will fix it manually, but @GiliGoldin please fix your pipeline too.
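A minimal sketch of what the pipeline fix could look like, assuming the annotation-only taxonomies can be recognised by the `.ana.xml` suffix of their `@href` (true for the two flagged above; the function name is hypothetical):

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def drop_ana_taxonomies(class_decl):
    """Remove xi:include elements that point to *.ana.xml taxonomies; return their hrefs."""
    removed = []
    for include in list(class_decl.findall(XI_INCLUDE)):
        href = include.get("href", "")
        if href.endswith(".ana.xml"):
            class_decl.remove(include)
            removed.append(href)
    return removed


class_decl = ET.fromstring(
    '<classDecl xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ParlaMint-taxonomy-subcorpus.xml"/>'
    '<xi:include href="ParlaMint-taxonomy-NER.ana.xml"/>'
    '<xi:include href="ParlaMint-taxonomy-UD-SYN.ana.xml"/>'
    "</classDecl>"
)
removed = drop_ana_taxonomies(class_decl)
```

This would be applied only when writing the root file of the plain TEI version; the annotated root file keeps all includes.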

  2. I have spotted one repeated error in the morphological annotation: the dash (–) is very often marked with UPosTag=SYM, but it should be UPosTag=PUNCT. Some other punctuation marks are also wrongly marked as SYM; they are easy to identify by the empty lemma (lemma="").
            <s xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3">
              <w lemma="הצבעה" pos="NOUN" msd="UPosTag=NOUN|Gender=Fem|Number=Sing" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t1">הצבעה</w>
              <w lemma="מס'" pos="NOUN" msd="UPosTag=NOUN|Abbr=Yes|Gender=Masc|Number=Sing" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t2">מס'</w>
              <w lemma="6" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t3">6</w>
              <w lemma="בעד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t4">בעד</w>
              <w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t5">סעיפים</w>
              <w lemma="7-1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t6">7-1</w>
<!-- should be PUNCT -->
              <w lemma="" pos="SYM" msd="UPosTag=SYM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t7">–</w>
              <w lemma="4" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t8">4</w>
              <w lemma="נגד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t9">נגד</w>
              <pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t10" msd="UPosTag=PUNCT">–</pc>
              <w lemma="אין" pos="VERB" msd="UPosTag=VERB|Polarity=Neg" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t11">אין</w>
              <w lemma="נמנע" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t12">נמנעים</w>
              <pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t13" msd="UPosTag=PUNCT">–</pc>
              <w lemma="אין" pos="VERB" msd="UPosTag=VERB|Polarity=Neg" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t14">אין</w>
              <w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t15">סעיפים</w>
              <w lemma="7-1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t16">7-1</w>
              <w lemma="נתקבל" pos="VERB" msd="UPosTag=VERB|Gender=Masc|HebBinyan=NITPAEL|Number=Plur|Person=3|Tense=Past|Voice=Pass" xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t17" join="right">נתקבלו</w>
              <pc xml:id="ParlaMint-IL_1994-03-14-13ptm532079.u584.p0.s3.t18" msd="UPosTag=PUNCT">.</pc>

Can I fix these situations in

<!-- Processing tools also make various formal mistakes on words, here we try to fix them -->
<xsl:template mode="comp" match="tei:w">
  <xsl:choose>
    <!-- Bug where punctuation is encoded as a word: change <w> to <pc> -->
    <xsl:when test="contains(@msd, 'UPosTag=PUNCT') and matches(., '^\p{P}+$')">
      <!-- Do not output warning, as there are typically too many of them -->
      <!--xsl:message select="concat('WARN: changing word ', ., ' to punctuation for ', @xml:id)"/-->
      <pc>
        <xsl:apply-templates mode="comp" select="@*[name() != 'lemma']"/>
        <xsl:apply-templates mode="comp"/>
      </pc>
    </xsl:when>
    <!-- Bug where syntactic word contains just one word: remove outer word and preserve annotations -->
    <xsl:when test="tei:w[tei:w] and not(tei:w[tei:*[2]])">
      <xsl:message select="concat('WARN ', /tei:TEI/@xml:id,
                           ': removing useless syntactic word ', @xml:id)"/>
      <xsl:copy>
        <xsl:apply-templates mode="comp" select="tei:w/@*[name() != 'norm']"/>
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:copy>
    </xsl:when>
    <xsl:otherwise>
      <xsl:copy>
        <xsl:apply-templates mode="comp" select="@*"/>
        <xsl:apply-templates mode="comp"/>
      </xsl:copy>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
with this rule:

      <xsl:when test="@lemma='' and @msd = 'UPosTag=SYM'">
        <pc>
          <xsl:attribute name="msd">UPosTag=PUNCT</xsl:attribute>
          <xsl:apply-templates mode="comp" select="@*[name() != 'lemma' and name() != 'pos' and name() != 'msd']"/>
          <xsl:apply-templates mode="comp"/>
        </pc>
      </xsl:when>

@GiliGoldin, @TomazErjavec, Can I do it?
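For anyone who wants to try the effect of the proposed rule outside the XSLT build, here is a rough Python equivalent. It is a sketch, not part of the release scripts: it applies the same condition (empty lemma and msd equal to UPosTag=SYM), turns the element into `<pc>`, and drops `lemma` and `pos`. The function name is hypothetical and the TEI namespace is omitted:

```python
import xml.etree.ElementTree as ET


def sym_words_to_punctuation(root):
    """Turn <w lemma="" msd="UPosTag=SYM"> into <pc msd="UPosTag=PUNCT">; return the count."""
    changed = 0
    for token in list(root.iter("w")):
        if token.get("lemma") == "" and token.get("msd") == "UPosTag=SYM":
            token.tag = "pc"
            token.attrib.pop("lemma")
            token.attrib.pop("pos", None)
            token.set("msd", "UPosTag=PUNCT")
            changed += 1
    return changed


sentence = ET.fromstring(
    "<s>"
    '<w lemma="7-1" pos="NUM" msd="UPosTag=NUM" xml:id="t6">7-1</w>'
    '<w lemma="" pos="SYM" msd="UPosTag=SYM" xml:id="t7">–</w>'
    '<w lemma="4" pos="NUM" msd="UPosTag=NUM" xml:id="t8">4</w>'
    "</s>"
)
changed = sym_words_to_punctuation(sentence)
```

Like the XSLT rule, this converts every token that matches the condition, so a genuine symbol that the tagger left with an empty lemma would also become punctuation.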

@TomazErjavec
Collaborator

@TomazErjavec, Can I do it?

@matyaskopp, sorry, I also ran the corpus (with just the first and last year of transcripts) through my build process soon after @GiliGoldin released it but then didn't find the time to comment on it. In short, there are some other problems apart from the ones you reported (although the situation is much better than for the first release), so I'd suggest another revision. If @GiliGoldin doesn't fix the errors you report there, then, sure, pls. feel free to upgrade the release scripts to implement the fixes you suggest. For manually editing the root files, I'd say not, as there are other problems there as well, and all should be fixed.

First, now that we have names, I could mount the sample corpus on our dev concordancer and you can test it at https://www.clarin.si/ske-beta/#dashboard?corpname=parlamint50_il (you need username "dev" and password "alfabetagama").
Going through the text types and doing some analyses can often reveal various otherwise hidden errors.

The log files of the build are, as before, on https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D
The errors on the Sentiment taxonomy can be ignored for now (this is currently an ongoing extension). The lemma ones are those @matyaskopp already reported. The multiple party statuses we have already seen, i.e. a person belongs to more than one party on a certain date; this is similar to the opposition/coalition conflicts already discussed, and is taken up below.
There are lots of warnings, but I think they are for the most part ok, @GiliGoldin, still maybe worth taking a look at.

For the point-by-point response:

Regarding the overlapping affiliations. From my understanding all or at least most of the warnings are because of a one day overlap. In our data the last day of an affiliation and the first day of another are the same day.

Indeed, and we all have this situation. However, we decided that it is better to cheat and move the first day of the next affiliation so that there is no overlap, as this introduces only a very small error. On the other hand, if we wanted to support two affiliations on the same date, we would need to cater for multi-valued attributes, which introduces a large overhead in processing, makes computing various statistics more difficult, etc. So, I'd suggest you do the same - it might even be done automatically. Maybe @matyaskopp also has some thoughts here.
For sure, we should add a discussion of this to https://clarin-eric.github.io/ParlaMint/#sec-temporal.
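A minimal sketch of how the shift could be automated, assuming the affiliations of one person (and of one kind, e.g. coalition/opposition) are available as records with ISO `from`/`to` dates. The record layout, the refs, and the dates are illustrative only and the function name is hypothetical:

```python
from datetime import date, timedelta


def remove_one_day_overlaps(affiliations):
    """Move the start of an affiliation one day later if it equals the previous end date."""
    ordered = sorted(affiliations, key=lambda item: item["from"])
    result = []
    for current in ordered:
        current = dict(current)  # do not modify the caller's records
        if result:
            previous = result[-1]
            if previous["to"] is not None and current["from"] == previous["to"]:
                shifted = date.fromisoformat(current["from"]) + timedelta(days=1)
                current["from"] = shifted.isoformat()
        result.append(current)
    return result


# Illustrative dates, not taken from the corpus.
adjusted = remove_one_day_overlaps(
    [
        {"ref": "#coalition", "from": "2020-01-01", "to": "2020-05-17"},
        {"ref": "#opposition", "from": "2020-05-17", "to": None},
    ]
)
```

The sketch only handles the exact one-day case described above; longer overlaps, or an affiliation that itself lasts a single day, would still need a manual decision.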

I think I fixed the div problem.
Regarding the missing NER and linguistic annotations - apparently I used the wrong files as input. I fixed this.

Yes, both fixed, thanks.

Regarding the quality of the tokenizer- it's not perfect but not that bad. the spaces in the PC elements are in phrases like ". . . " where there are spaces between the dots but it considers this as one punctuation (I think it's correct). However, I removed these kinds of punctuations.

Indeed, it could be considered correct, but the schema does not allow for it, so, thanks for changing it.

Regarding the lemmas like "פשתה"ד–" : the only problem here is the "–", which I don't know why it considers it inside the lemma, but the rest is fine. The word is: 'התשפ"ד', which is a Hebrew calendar year and it's one basic word including the " . I did remove punctuations like "–","-",".","," from the lemmas.

OK, great.

There is one other point that I made but you haven't addressed:

Many elements in the TEI headers (esp. in the corpus root file) do not mark their content as being English, e.g. <title type="sub">Knesset Protocols</title>. All such elements should have xml:lang="en", otherwise processing will produce wrong results. For example, we can generate TSV files with metadata points in local language or English, which won't work now.

This is still the case, many elements with English content both in the root and component headers are not marked as English.

And a minor plea: on HuggingFace your tarred files decompress into the current directory, it would help us a bit if they could decompress into the directory with the same name as the tar file, i.e. into ParlaMint-IL.TEI/ and ParlaMint-IL.TEI.ana/ (everything else stays the same).

@GiliGoldin
Collaborator

@TomazErjavec, Can I do it?

@matyaskopp, sorry, I also ran the corpus (with just the first and last year of transcripts) through my build process soon after @GiliGoldin released it but then didn't find the time to comment on it. In short, there are some other problems apart from the ones you reported (although the situation is much better than for the first release), so I'd suggest another revision. If @GiliGoldin doesn't fix the errors you report there, then, sure, pls. feel free to upgrade the release scripts to implement the fixes you suggest. For manually editing the root files, I'd say not, as there are other problems there as well, and all should be fixed.

@TomazErjavec @matyaskopp
OK, I fixed the other errors mentioned, which are in the listPerson file and in the corpus TEI files, but fixing this one will require reprocessing all the protocols in all the years, which will take a long time. So if @matyaskopp can easily do it with the rule mentioned, that would be great. Thanks!

Indeed, and we all have this situation. However, we decided that it is better to cheat and move the first day of the next affiliation so that there is no overlap, as with this we introduce a very small error. On the other hand, if we wanted to support two affiliations on the same date, this would mean that we need to cater for multi-valued attributes, which introduces a large overhead in the processing, makes computing various statistics more difficult etc. So, I'd suggest you do the same - it might even be done automatically. Maybe @matyaskopp also has some thought here. For sure, we should add a discussion of this to https://clarin-eric.github.io/ParlaMint/#sec-temporal.

I fixed this automatically by adding one day to the start of the new affiliation when it is the same as the last day of the previous one. I hope this solves the problem.
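
To make the rule concrete, here is a minimal Python sketch of the one-day shift. It is not the actual conversion script: the function name, the dict keys and the `#org.1`/`#org.2` references are made up for illustration, and it assumes the list holds one person's affiliations of a single kind (e.g. only party memberships) with ISO dates. It sorts by start date before shifting, since the rule compares each record with its predecessor.

```python
from datetime import date, timedelta


def shift_same_day_starts(affiliations):
    """Move a start date that equals the previous end date one day later.

    `affiliations` is a list of dicts with hypothetical keys "ref", "from"
    and optionally "to" (ISO date strings), mirroring <affiliation> attributes.
    """
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(affiliations, key=lambda a: a["from"])
    fixed = []
    previous_end = None
    for aff in ordered:
        aff = dict(aff)  # work on a copy, leave the caller's records untouched
        start = date.fromisoformat(aff["from"])
        if previous_end is not None and start == previous_end:
            aff["from"] = (start + timedelta(days=1)).isoformat()
        # An open-ended affiliation has no "to" and cannot clash with a successor here.
        previous_end = date.fromisoformat(aff["to"]) if aff.get("to") else None
        fixed.append(aff)
    return fixed


# Deliberately unsorted input to show why the sort matters.
party = [
    {"ref": "#org.2", "from": "1992-07-13", "to": "1995-11-04"},
    {"ref": "#org.1", "from": "1991-10-08", "to": "1992-07-13"},
]
for aff in shift_same_day_starts(party):
    print(aff["ref"], aff["from"], aff.get("to"))
```

The sketch does not handle the corner case where the shifted start would fall after the affiliation's own end date, and it does not check the result against affiliations of another kind (e.g. government membership starting on the same day).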

There is one other point that I made but you haven't addressed:

Many elements in the TEI headers (esp. in the corpus root file) do not mark their content as being English, e.g. <title type="sub">Knesset Protocols</title>. All such elements should have xml:lang="en", otherwise processing will produce wrong results. For example, we can generate TSV files with metadata points in local language or English, which won't work now.

This is still the case, many elements with English content both in the root and component headers are not marked as English.

Sorry, I missed this one. Now I fixed it in these elements.

And a minor plea: on HuggingFace your tarred files decompress into the current directory, it would help us a bit if they could decompress into the directory with the same name as the tar file, i.e. into ParlaMint-IL.TEI/ and ParlaMint-IL.TEI.ana/ (everything else stays the same).

Ok, I think you can also do it with a parameter in the decompressing command, but I now compressed it so it will automatically keep the container folder.

The new files are in the same link:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL

@TomazErjavec
Collaborator

I again processed your latest corpus @GiliGoldin (first and last year), and the log files are, as before, at https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D.

Apart from the linguistic annotation errors that @matyaskopp has/will take care of, there are still date clashes in coalition/opposition and party membership. It would be great if you could take another look at these and fix them so they don't overlap.

You write that you have fixed the missing @xml:lang="en" in the TEI headers but I don't see that this is the case. E.g. in the root teiHeader you have either missing or wrong language:

<persName>CLG lab, University of Haifa</persName>
<resp>Corpus preparation and TEI conversion</resp>
<orgName>ParlaMint Project</orgName>
<measure unit="speeches" quantity="184531" xml:lang="he">184,531 speeches</measure>
<measure unit="words" quantity="6296846" xml:lang="he">6,296,846 words</measure>
<orgName>CLG lab, University of Haifa</orgName>
<ref target="https://github.com/HaifaCLG">CLG Lab University of Haifa</ref>
<p>This work is licensed under the Creative Commons Attribution 4.0 International License.</p>
<p>This corpus is part of the ParlaMint project, aiming to provide a multilingual set of comparable corpora of parliamentary proceedings.</p>
<p>Obvious OCR errors and encoding issues have been corrected.</p>

etc.
while in a component header I looked at you have:

<persName>Gili Goldin</persName>
<resp>Responsible for compiling the corpus and encoding it in ParlaMint TEI format.</resp>
<persName ref="https://clg.haifa.ac.il/">Shuly Wintner</persName>
<resp>Project supervisor.</resp>
<orgName xml:lang="en">Ministry of Science &amp; Technology ,Israel</orgName> <-- note misplaced comma!
<measure unit="speeches" quantity="0">0 speeches</measure>
<measure unit="words" quantity="0">0 words</measure>
<change when="2025-01-07">Initial conversion to TEI format.</change>

Please do fix this both in .TEI and .TEI.ana.

But all the rest seems ok!

@GiliGoldin
Collaborator

I again processed your latest corpus @GiliGoldin (first and last year), and the log files are, as before, at https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D.

Apart from the linguistic annotation errors that @matyaskopp has/will take care of, there are still date clashes in coalition/opposition and party membership. It would be great if you could take another look at these and fix them so they don't overlap.

Yes, the problem was that when a person's affiliations weren't sorted by start_date, my changes weren't applied correctly. I think it should be OK now.

You write that you have fixed the missing @xml:lang="en" in the TEI headers but I don't see that this is the case. E.g. in the root teiHeader you have either missing or wrong language:

<persName>CLG lab, University of Haifa</persName>
<resp>Corpus preparation and TEI conversion</resp>
<orgName>ParlaMint Project</orgName>
<measure unit="speeches" quantity="184531" xml:lang="he">184,531 speeches</measure>
<measure unit="words" quantity="6296846" xml:lang="he">6,296,846 words</measure>
<orgName>CLG lab, University of Haifa</orgName>
<ref target="https://github.com/HaifaCLG">CLG Lab University of Haifa</ref>
<p>This work is licensed under the Creative Commons Attribution 4.0 International License.</p>
<p>This corpus is part of the ParlaMint project, aiming to provide a multilingual set of comparable corpora of parliamentary proceedings.</p>
<p>Obvious OCR errors and encoding issues have been corrected.</p>

For the missing ones: I previously thought that if an element's content is given only in English, I don't need to add @xml:lang="en".
I have now added @xml:lang="en" to all the English texts, added a Hebrew version as well, and fixed the wrong ones.

etc. while in a component header I look at you have:

<persName>Gili Goldin</persName>
<resp>Responsible for compiling the corpus and encoding it in ParlaMint TEI format.</resp>
<persName ref="https://clg.haifa.ac.il/">Shuly Wintner</persName>
<resp>Project supervisor.</resp>
<orgName xml:lang="en">Ministry of Science &amp; Technology ,Israel</orgName> <-- note misplaced comma!
<measure unit="speeches" quantity="0">0 speeches</measure>
<measure unit="words" quantity="0">0 words</measure>
<change when="2025-01-07">Initial conversion to TEI format.</change>

OK, I tried to make the same fixes here too. This will affect all the files, so for now I applied it to only one protocol so you can check that it's okay before I run it on all the files.
The person_list file, the TEI headers and the one protocol file are here:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL/for_check
Please let me know if they are okay now. Once they are approved I will process all the files and upload a new compressed file for the corpus.

Thanks!

@matyaskopp
Collaborator Author

I am sorry I missed this language bug while checking the sample.
I can see the following bugs:
https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/blob/79ad19bba05cb4f72876b60fa60f7986fc453b14/ParlaMint-IL/for_check/ParlaMint-IL_2000-11-14-15ptv2765211.xml#L15

<persName xml:lang="he"/>

https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/blob/79ad19bba05cb4f72876b60fa60f7986fc453b14/ParlaMint-IL/for_check/ParlaMint-IL_2000-11-14-15ptv2765211.xml#L87

<change when="2025-01-23">Initial conversion to TEI format.</change>

and also in root file:

<change when="2025-01-23">Initial release.</change>

Apart from the linguistic annotation errors that @matyaskopp has/will take care of

Well, I have fixed SYMbols with an empty lemma, but there are still many bugs:

1. character content of element "pc"

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1999/ParlaMint-IL_1999-07-27-15ptv490830.ana.xml:1668:111: error: character content of element "pc" invalid; must be a string matching the regular expression "\S+"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.CONTENT-PC-ERROR.txt

2. element "w" incomplete; missing required element "w"

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1999/ParlaMint-IL_1999-09-06-15ptv494233.ana.xml:74492:74: error: element "w" incomplete; missing required element "w"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MISSING-W-ERROR.txt

3. value of attribute "lemma" is invalid

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1992/ParlaMint-IL_1992-08-04-13ptm532077.ana.xml:14147:105: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.LEMMA-ERROR.txt

4. value of attribute "msd" is invalid

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2001/ParlaMint-IL_2001-11-26-15ptv496918.ana.xml:7678:107: error: value of attribute "msd" is invalid; must be a string matching the regular expression "\S+"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MSD-ERROR.txt

These errors are not good. The question is, what else might be affected but not reported?

<w lemma="(" pos="X" msd="UPosTag=1 –|PUNCT=" xml:id="ParlaMint-IL_2001-11-26-15ptv496918.u0.p0.s152.t25">1 –</w>
<w lemma="" pos="X" msd="UPosTag=1 -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u588.p0.s24.t20" join="right">1 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=5 -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u589.p0.s2.t6">5 -</w>
<w lemma="" pos="X" msd="UPosTag=5א -|NUM=" xml:id="ParlaMint-IL_2010-10-14-18ptv161765.u589.p0.s5.t6" join="right">5א -</w>
<w lemma="" pos="X" msd="UPosTag=20 –|NUM=" xml:id="ParlaMint-IL_2010-11-02-18ptv163061.u87.p0.s2.t3" join="right">20 –</w>
<w lemma="" pos="X" msd="UPosTag=1א -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u302.p0.s0.t6" join="right">1א -</w>
<w lemma="(א)" pos="X" msd="UPosTag=3 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u302.p0.s10.t6">3 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=41 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u857.p0.s1.t6">41 -</w>
<w lemma="" pos="X" msd="UPosTag=51 -|NUM=" xml:id="ParlaMint-IL_2010-11-14-18ptv163815.u859.p0.s0.t6" join="right">51 -</w>
<w lemma="" pos="X" msd="UPosTag=2 –|NUM=" xml:id="ParlaMint-IL_2010-11-15-18ptv163615.u1.p0.s7.t23" join="right">2 –</w>
<w lemma="" pos="X" msd="UPosTag=88 –|NUM=" xml:id="ParlaMint-IL_2010-11-16-18ptv164162.u1.p0.s2.t6" join="right">88 –</w>
<w lemma="" pos="X" msd="UPosTag=19א -|NUM=" xml:id="ParlaMint-IL_2010-11-18-18ptv164514.u37.p0.s0.t6" join="right">19א -</w>
<w lemma="" pos="X" msd="UPosTag=42 -|NUM=" xml:id="ParlaMint-IL_2010-11-18-18ptv164514.u133.p0.s0.t6" join="right">42 -</w>
<w lemma="" pos="X" msd="UPosTag=10 -|NUM=" xml:id="ParlaMint-IL_2010-11-23-18ptv164401.u45.p0.s0.t44" join="right">10 -</w>
<w lemma="" pos="X" msd="UPosTag=9 -|NUM=" xml:id="ParlaMint-IL_2010-11-23-18ptv164564.u457.p0.s1.t39" join="right">9 -</w>
<w lemma="(א)" pos="X" msd="UPosTag=75 -|NUM=" xml:id="ParlaMint-IL_2010-11-25-18ptv165046.u578.p0.s3.t6">75 -</w>
<w lemma="" pos="X" msd="UPosTag=94 -|NUM=" xml:id="ParlaMint-IL_2010-11-25-18ptv165046.u985.p0.s0.t6" join="right">94 -</w>
<w lemma="(" pos="X" msd="UPosTag=2 -|PUNCT=" xml:id="ParlaMint-IL_2010-11-30-18ptv165043.u43.p0.s0.t28">2 -</w>
<w lemma="" pos="X" msd="UPosTag=28 -|NUM=" xml:id="ParlaMint-IL_2010-11-30-18ptv165043.u51.p0.s10.t6" join="right">28 -</w>
<w lemma="" pos="X" msd="UPosTag=2 -|NUM=" xml:id="ParlaMint-IL_2010-12-06-18ptv164426.u334.p0.s1.t8" join="right">2 -</w>
<w lemma="" pos="X" msd="UPosTag=1 -|NUM=" xml:id="ParlaMint-IL_2010-12-09-18ptv164890.u242.p0.s0.t36" join="right">1 -</w>
<w lemma="" pos="X" msd="UPosTag=80 –|NUM=" xml:id="ParlaMint-IL_2010-12-14-18ptv165546.u167.p0.s0.t7" join="right">80 –</w>
<w lemma="3ה(" pos="X" msd="UPosTag=3ה(1) -|NUM=" xml:id="ParlaMint-IL_2023-07-20-25ptv3418350.u1763.p0.s0.t5">3ה(1) -</w>
<w lemma="(1" pos="X" msd="UPosTag=6 -|NUM=" xml:id="ParlaMint-IL_2024-01-30-25ptv4142816.u146.p0.s1.t15" join="right">6 -</w>
<w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>
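
For spot-checking a file before a full validation run, the three value patterns quoted in the error messages above can be tested with a few lines of Python. This is only a sketch, not part of the ParlaMint tooling: the function name is made up, it checks nothing but `msd`, `lemma` and the content of `pc`, and it assumes that Python's `\S` with `re.fullmatch` is close enough to the schema's implicitly anchored patterns for this purpose.

```python
import re
import xml.etree.ElementTree as ET

# Patterns copied from the validator messages; fullmatch() supplies the anchoring.
NON_SPACE = re.compile(r"\S+")
LEMMA = re.compile(r"(\S)|(\S[\S ]*\S)")
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_invalid_tokens(xml_text):
    """Return (xml:id, problem) pairs for <w>/<pc> elements that break a pattern."""
    problems = []
    for elem in ET.fromstring(xml_text).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore a namespace, if present
        if tag not in ("w", "pc"):
            continue
        ident = elem.get(XML_ID, "?")
        msd = elem.get("msd")
        if msd is not None and not NON_SPACE.fullmatch(msd):
            problems.append((ident, "msd"))
        lemma = elem.get("lemma")
        if lemma is not None and not LEMMA.fullmatch(lemma):
            problems.append((ident, "lemma"))
        if tag == "pc" and not NON_SPACE.fullmatch(elem.text or ""):
            problems.append((ident, "pc content"))
    return problems


# Tiny made-up sample with shortened ids.
sample = (
    '<s>'
    '<w lemma="" pos="X" msd="UPosTag=1 -|NUM=" xml:id="t1">1 -</w>'
    '<pc msd="UPosTag=PUNCT" xml:id="t2">. .</pc>'
    '<w lemma="1" pos="NUM" msd="UPosTag=NUM" xml:id="t3">1</w>'
    '</s>'
)
for ident, problem in find_invalid_tokens(sample):
    print(ident, problem)
```

A check like this only finds values that break the patterns; tokens whose fields are shifted but still well-formed would pass unnoticed.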

@GiliGoldin
Collaborator

GiliGoldin commented Jan 27, 2025

I am sorry I missed this language bug while checking the sample. I can see the following bugs: https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/blob/79ad19bba05cb4f72876b60fa60f7986fc453b14/ParlaMint-IL/for_check/ParlaMint-IL_2000-11-14-15ptv2765211.xml#L15

https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/blob/79ad19bba05cb4f72876b60fa60f7986fc453b14/ParlaMint-IL/for_check/ParlaMint-IL_2000-11-14-15ptv2765211.xml#L87

Initial conversion to TEI format.
and also in root file:

Initial release.

OK, I think I fixed these bugs. I uploaded a new version of these files. Let me know if there is anything else.

Well, I have fixed SYMbols with an empty lemma, but there are still many bugs:

I can try to do my best to fix the problems but the corpus contains over 45K protocols and 35M sentences, I don't think we can expect the parser to be perfect on all of them. Especially in Hebrew, where the automatic parsers are in general not as good as models for Latin characters, and it won't be possible to manually fix all of the problems.

1. character content of element "pc"

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1999/ParlaMint-IL_1999-07-27-15ptv490830.ana.xml:1668:111: error: character content of element "pc" invalid; must be a string matching the regular expression "\S+"

grepped:: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.CONTENT-PC-ERROR.txt

I thought I fixed this problem by removing pcs that contained spaces but I see that it didn't entirely work. I made a few changes in the code, hopefully it will catch most of these problems now.

2. element "w" incomplete; missing required element "w"

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1999/ParlaMint-IL_1999-09-06-15ptv494233.ana.xml:74492:74: error: element "w" incomplete; missing required element "w"

grepped:: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MISSING-W-ERROR.txt

I'm not sure what to do with this one. The protocols contain all kinds of weird "words" like these. Most of the time it's probably because the speaker didn't complete the sentence or the writer of the protocol missed it. Words like "ב..." or "ל..." are like saying "in ..." or "to ..." in Hebrew, but these are prefixes that are part of the same word. The ones with only dots were probably supposed to be classified as pc, but the parser didn't recognize them correctly. The only solutions I see here are either to erase all words that contain more than one dot, which also loses information, or to keep it like this.

3. value of attribute "lemma" is invalid

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/1992/ParlaMint-IL_1992-08-04-13ptm532077.ana.xml:14147:105: error: value of attribute "lemma" is invalid; must be a string matching the regular expression "(\S)|(\S[\S ]*\S)"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.LEMMA-ERROR.txt

From what I understand, the problem is mostly spaces in the lemmas? I'll try to automatically remove spaces here too. I can also check whether the lemma matches the regular expression "(\S)|(\S[\S ]*\S)" and, if not, replace it with an "UNKNOWN" value. Will that be OK?

4. value of attribute "msd" is invalid

/lnet/work/people/kopp/ParlaMint/Build/Distro/ParlaMint-IL.TEI.ana/2001/ParlaMint-IL_2001-11-26-15ptv496918.ana.xml:7678:107: error: value of attribute "msd" is invalid; must be a string matching the regular expression "\S+"

grepped: https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.20250123.MSD-ERROR.txt

Here too I can remove spaces and replace the value with a fallback if it doesn't match the regex.

@matyaskopp
Collaborator Author

@GiliGoldin, I can give you an example of an error (type 4 on the list). I don't understand Hebrew, so I can only comment on anomalies in the corpus. My belief is that your scripts produce wrong results when the paragraph <seg> contains an en dash (character U+2013) in the text.

This is the whole utterance from the corpus:

<u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904" who="#person.30772" ana="#regular">
  <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0">בעד סעיפים 1–11 – 24</seg>
</u>

and your annotation:

<u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904" who="#person.30772" ana="#regular">
  <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0">
    <s xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0">
      <w lemma="בעד" pos="ADP" msd="UPosTag=ADP" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t1">בעד</w>
      <w lemma="סעיף" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2">סעיפים</w>
      <w lemma="1" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t3">1–</w>
      <w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>
      <linkGrp targFunc="head argument" type="UD-SYN">
        <link ana="ud-syn:case" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t1"/>
        <link ana="ud-syn:root" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2"/>
        <link ana="ud-syn:nmod_npmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t3"/>
        <link ana="ud-syn:dep" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4"/>
      </linkGrp>
    </s>
  </seg>
</u>

where multiple bugs are present:

  • msd="UPosTag=11 –|NUM="
  • the word 24 is not present in the result
  • lemma="24" does not match the word form 11 –

I tried to annotate this "sentence" in the online trankit demo tool http://nlp.uoregon.edu/trankit

Image

My guess is that the bug is not in the trankit annotation tool but in the pipeline, which processes the result.

It would be nice to determine what is causing this problem and then decide how/if it can be fixed.

@GiliGoldin
Collaborator

@GiliGoldin, I can give you an example of an error (type 4 on the list). I don't understand Hebrew, so I can only comment on anomalies in the corpus. My belief is that your scripts are getting the wrong results when the paragraph <seg> contains en-dash in the text (0x2013 character).

this is whole utterance from the corpus:

בעד סעיפים 1–11 – 24 and your annotation: בעד סעיפים 1– 11 – where are multiple bugs present:
  • msd="UPosTag=11 –|NUM="
  • the word 24 is not present in the result
  • lemma="24" does not match 1– word form

I tried to annotate this "sentence" in the online trankit demo tool http://nlp.uoregon.edu/trankit

Image

My guess is that the bug is not in the trankit annotation tool but in the pipeline, which processes the result.

It would be nice to determine what is causing this problem and then decide how/if it can be fixed.

Yes I see. I'll try to figure out what's causing this.

@TomazErjavec
Collaborator

Let me just comment on the latest TEI headers:

  • thanks for adding the @xml:lang="en", indeed, we do need this so that it can be determined which strings are in which language. @xml:lang="he" is in general not needed, as the info about the language is inherited from the superordinate element, but it doesn't hurt
  • In the component TEI headers you have an error where your name in Hebrew is missing, i.e.
        <respStmt>
          <persName xml:lang="en">Gili Goldin</persName>
          <persName xml:lang="he"/>
  • In the corpus TEI header, I'd expect the same info about responsible persons as you have in the component headers. Right now you have
        <respStmt>
          <persName xml:lang="en">CLG lab University of Haifa</persName>
          <persName xml:lang="he">תדבעמ CLG הפיח תטיסרבינוא</persName>

This is conceptually wrong, as CLG lab is not a person and hence should not be annotated as a person name. If you really want CLG to have the responsibility here (though I think it is better to list people), you would have to change this to orgName, and we would have to change the schema to allow this, which it currently doesn't.
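
To illustrate the point about inheritance, here is a small Python sketch that resolves the effective language of every text-bearing element by walking down from the root. The function name and the sample header are hypothetical, and the sample assumes the enclosing element is marked xml:lang="he"; it is meant only to show why an English string without its own @xml:lang ends up being treated as Hebrew.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(xml_text):
    """Yield (tag, text, language) for every element with non-empty text.

    The language is the element's own xml:lang if present, otherwise the
    value inherited from the nearest ancestor that carries one.
    """
    root = ET.fromstring(xml_text)

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        text = (elem.text or "").strip()
        if text:
            yield elem.tag, text, lang
        for child in elem:
            yield from walk(child, lang)

    yield from walk(root, None)


# Made-up header fragment: the first title inherits "he", the second overrides it.
header = (
    '<teiHeader xml:lang="he">'
    '<title type="sub">Knesset Protocols</title>'
    '<title type="sub" xml:lang="en">Knesset Protocols</title>'
    '</teiHeader>'
)
for tag, text, lang in effective_languages(header):
    print(tag, lang, text)
```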

@GiliGoldin
Collaborator

@GiliGoldin, I can give you an example of an error (type 4 on the list). I don't understand Hebrew, so I can only comment on anomalies in the corpus. My belief is that your scripts are getting the wrong results when the paragraph <seg> contains en-dash in the text (0x2013 character).
this is whole utterance from the corpus:

בעד סעיפים 1–11 – 24

and your annotation:

  בעד
  סעיפים
  1–
  11 –

where are multiple bugs present:

  • msd="UPosTag=11 –|NUM="
  • the word 24 is not present in the result
  • lemma="24" does not match 1– word form

I tried to annotate this "sentence" in the online trankit demo tool http://nlp.uoregon.edu/trankit
Image
My guess is that the bug is not in the trankit annotation tool but in the pipeline, which processes the result.
It would be nice to determine what is causing this problem and then decide how/if it can be fixed.

Yes I see. I'll try to figure out what's causing this.

I found out what happened here and it's a combination of the parser output and my post processing. Specifically in this example there was a shift in the field values. The original parser's output considers the "form" (which is what I write as the word text) to be "11 - 24" and the lemma is also "11 - 24". Because of a bug in my post processing the "24" landed in the lemma field rather than part of the "form" field , the number "11" moved to be part of the UposTag, the UposTag (NUM) was moved to the features and so on, resulting in the three weird outputs above (and all the wrong msd errors). I fixed this bug.

However, the parser itself does seem to have a general problem when there is a "-" in a sentence. In many cases it attaches the hyphen to the preceding word instead of splitting it off as punctuation. We didn't run the same version that is now on their website, so there may be differences, but the annotations we got already have this problem; it does not come from my post-processing. This seems to be what is causing most of the lemma errors, but I think that in almost all cases my solution of removing punctuation and spaces from the lemma solves the problem. For instance, the word "ועדה – " will result in the lemma "ועדה", the word "1 –" will result in the lemma "1", etc. The "11 - 24" case is not ideal: instead of "11 - 24", the lemma becomes "1124" after removing punctuation and spaces. But I think this will only happen in rare cases, because the parser usually doesn't treat both numbers of a range as one form/lemma, only the first one.
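
The lemma cleanup described above can be sketched in a few lines of Python. This is an illustration, not the actual post-processing code: the function name is made up, and the set of removed characters is taken from the marks listed earlier in this thread (en dash, hyphen, full stop, comma), which is an assumption about what the real script strips.

```python
# Assumed set of punctuation marks to strip; other marks, such as the
# double quote inside Hebrew year abbreviations, are kept.
STRIP_CHARS = {"–", "-", ".", ","}


def clean_lemma(lemma):
    """Drop the listed punctuation marks and all whitespace from a lemma."""
    return "".join(
        ch for ch in lemma if ch not in STRIP_CHARS and not ch.isspace()
    )


print(clean_lemma("ועדה – "))   # the dash and spaces are dropped
print(clean_lemma("1 –"))       # leaves "1"
print(clean_lemma("11 - 24"))   # leaves "1124", the non-ideal range case
print(repr(clean_lemma("- -"))) # nothing is left, so a fallback is still needed
```

The last call shows the limit of the approach: a token consisting only of dashes yields an empty lemma, which the schema pattern rejects.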

I uploaded some of the new outputs to the same "for check" link: https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus/tree/main/ParlaMint-IL/for_check

@TomazErjavec I also fixed the TEI headers.

If everything looks okay now, I will rerun the scripts on the entire corpus and upload the new version once it's ready.
I should mention that my Ph.D. supervisor wants me to focus on tasks with a higher priority at this time, so I hope this version will be approved. If not I will only be able to make further adjustments in a few months.

@matyaskopp
Collaborator Author

I believe the languages and responsibility statements are now OK.


I have spotted one easy-to-fix bug.

  • lemma="UNKNOWN"

raw TEI:

        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69" who="#person.30713" ana="#regular">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0">בעיקר אנשים חפים מפשע בעיקר, בעיקר, כמעט 70% - - -</seg>
        </u>

annotated TEI:

        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69" who="#person.30713" ana="#regular">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0">
            <s xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0">
              <w lemma="בעיקר" pos="ADV" msd="UPosTag=ADV" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t1">בעיקר</w>
              <w lemma="איש" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t2">אנשים</w>
              <w lemma="חף" pos="ADJ" msd="UPosTag=ADJ|Gender=Masc|Number=Plur" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t3">חפים</w>
              <w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t4-5">מפשע<w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t4" norm="מ" lemma="מ" pos="ADP" msd="UPosTag=ADP"/><w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t5" norm="פשע" lemma="פשע" pos="NOUN" msd="UPosTag=NOUN|Gender=Masc|Number=Sing"/></w>
              <linkGrp targFunc="head argument" type="UD-SYN">
                <link ana="ud-syn:advmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t1"/>
                <link ana="ud-syn:root" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t2"/>
                <link ana="ud-syn:amod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t2 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t3"/>
                <link ana="ud-syn:case" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t5 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t4"/>
                <link ana="ud-syn:obl" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t3 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s0.t5"/>
              </linkGrp>
            </s>
            <s xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1">
              <w lemma="בעיקר" pos="ADV" msd="UPosTag=ADV" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t1" join="right">בעיקר</w>
              <pc xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t2" msd="UPosTag=PUNCT">,</pc>
              <w lemma="בעיקר" pos="ADV" msd="UPosTag=ADV" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t3" join="right">בעיקר</w>
              <pc xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t4" msd="UPosTag=PUNCT">,</pc>
              <w lemma="כמעט" pos="ADV" msd="UPosTag=ADV" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t5">כמעט</w>
              <w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t6-7">70%<w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t6" norm="70" lemma="70" pos="NUM" msd="UPosTag=NUM"/><w xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7" norm="%" lemma="%" pos="SYM" msd="UPosTag=SYM|Gender=Masc|Number=Plur"/></w>
<!-- UNKNOWN LEMMA -->
              <w lemma="UNKNOWN" pos="SYM" msd="UPosTag=SYM" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t8" join="right">- -</w>
              <pc xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t9" msd="UPosTag=PUNCT">-</pc>
              <linkGrp targFunc="head argument" type="UD-SYN">
                <link ana="ud-syn:advmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t1"/>
                <link ana="ud-syn:punct" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t3 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t2"/>
                <link ana="ud-syn:advmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t3"/>
                <link ana="ud-syn:punct" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t3 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t4"/>
                <link ana="ud-syn:advmod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t5"/>
                <link ana="ud-syn:nummod" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t6"/>
                <link ana="ud-syn:root" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7"/>
                <link ana="ud-syn:punct" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t8"/>
                <link ana="ud-syn:punct" target="#ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t7 #ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t9"/>
              </linkGrp>
            </s>
          </seg>
        </u>

  • <gap>

The second is more complicated to fix, as it is present in the TEI version:

        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68" ana="#guest">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68.p0">- - -</seg>
        </u>
        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69" who="#person.30713" ana="#regular">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0">בעיקר אנשים חפים מפשע בעיקר, בעיקר, כמעט 70% - - -</seg>
        </u>
        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u70" ana="#guest">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u70.p0">- - -</seg>
        </u>
        <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u71" who="#person.30713" ana="#regular">
          <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u71.p0">אני נותן לך תשובה – או שסתם אתה רוצה לשאול? אתה תחזור לנתונים - - -</seg>
        </u>

It seems to me that triple dashes/hyphens are used when part of the transcription is missing:

Image

This should be marked as <gap> if some part of the transcription is really missing.


  • coalition/opposition collisions still present
ERROR: multiple party statuses for person.2595 on 1992-07-13: Coalition Opposition
  <person xml:id="person.2595">
    <persName>
      <forename>יצחק</forename>
      <surname>רבין</surname>
    </persName>
    <sex value="M"/>
    <birth when="1922-03-01">
      <placeName>ארץ ישראל</placeName>
    </birth>
    <affiliation ref="#org.35" role="member" from="1974-01-21" to="1977-06-13"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1974-01-21" to="1977-06-13"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="head" from="1974-06-03" to="1977-06-20"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="1974-06-03" to="1977-06-20"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="head" from="1992-07-13" to="1995-11-22"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="1992-07-13" to="1995-11-22"/>
    <affiliation ref="#org.35" role="member" from="1977-06-14" to="1981-07-20"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1977-06-14" to="1981-07-20"/>
    <affiliation ref="#org.35" role="member" from="1981-07-21" to="1984-08-13"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1981-07-21" to="1984-08-13"/>
    <affiliation ref="#org.35" role="member" from="1984-08-14" to="1988-11-21"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1984-08-14" to="1988-11-21"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="1984-09-13" to="1988-11-21"/>
    <affiliation ref="#org.35" role="member" from="1988-11-22" to="1991-10-07"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1988-11-22" to="1991-10-07"/>
    <affiliation ref="#ParlaMint-IL-GOV" role="member" from="1988-12-22" to="1990-06-11"/>
<!-- change to 1992-07-12 -->
    <affiliation ref="#org.35" role="member" from="1991-10-08" to="1992-07-13"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1991-10-08" to="1992-07-13"/>
<!-- change to 1992-07-13 -->
    <affiliation ref="#org.35" role="member" from="1992-07-14" to="1995-11-04"/>
    <affiliation ref="#ParlaMint-IL-KNESS" role="member" from="1992-07-14" to="1995-11-04"/>
  </person>

He first became a member of the government (1992-07-13) and only then a member of parliament (1992-07-14), which is a bit weird.

The meeting from 1992-07-13 says it is הכנסת ה-13:

        <meeting ana="#parla.term #parla.uni #period_13" n="13" corresp="#ParlaMint-IL-KNESS">הכנסת ה-13</meeting>
      <event xml:id="period_13" from="1992-07-13" to="1996-06-17">
        <label xml:lang="he">הכנסת ה-13 (1992-07-13 - 1996-06-17)</label>
        <label xml:lang="en">13th Knesset (1992-07-13 - 1996-06-17)</label>
      </event>
<!-- change to 1992-07-12 -->
    <relation name="opposition" active="#org.35" passive="#ParlaMint-IL-GOV" from="1990-06-11" to="1992-07-13"/>
    <relation name="coalition" mutual="#org.89 #org.35" from="1992-07-13" to="1996-06-17"/>
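The clash reported above can be reproduced with a small interval check. This is only a sketch, not the validator's actual logic: it assumes that @from and @to are both inclusive (which is what the error message suggests), and the function name `status_clashes` is made up for illustration.

```python
from datetime import date
from itertools import combinations

def status_clashes(relations):
    """Return (name1, name2, first_shared_day) for every pair of differently
    named relations whose closed [from, to] ranges share at least one day."""
    parsed = [(name, date.fromisoformat(start), date.fromisoformat(end))
              for name, start, end in relations]
    clashes = []
    for (n1, s1, e1), (n2, s2, e2) in combinations(parsed, 2):
        # Two closed ranges overlap when each starts no later than the other ends.
        if n1 != n2 and s1 <= e2 and s2 <= e1:
            clashes.append((n1, n2, max(s1, s2).isoformat()))
    return clashes

# Dates copied from the two <relation> elements for #org.35 above.
current = [("opposition", "1990-06-11", "1992-07-13"),
           ("coalition", "1992-07-13", "1996-06-17")]
# Same data with the opposition end date moved back by one day.
proposed = [("opposition", "1990-06-11", "1992-07-12"),
            ("coalition", "1992-07-13", "1996-06-17")]

print(status_clashes(current))   # [('opposition', 'coalition', '1992-07-13')]
print(status_clashes(proposed))  # []
```

With the end date changed to 1992-07-12, the two ranges no longer share a day.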

If everything looks okay now, I will rerun the scripts on the entire corpus and upload the new version once it's ready.
I should mention that my Ph.D. supervisor wants me to focus on tasks with a higher priority at this time, so I hope this version will be approved. If not I will only be able to make further adjustments in a few months.

Well, I have to discuss and agree on this with @TomazErjavec.
There is a plan for the new release of ParlaMint corpora, but ParlaMint-IL still needs some effort to be compatible with other corpora.

@TomazErjavec
Collaborator

I also now processed your latest offering @GiliGoldin, and, as always, the log files are at https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D
Formal errors are now only in the linguistic annotation: two for lemmas and one for msd. The lemma errors are both of the same kind, e.g. lemma="1 ". While a lemma can contain spaces, it definitely shouldn't start or end with a space. If I understand correctly, your fix was to remove punctuation from lemmas. I don't think this is the right approach. A better option, in cases where the lemma is "strange" (like the UNKNOWN that @matyaskopp mentioned), is to simply copy the word form into the lemma. As neither the word form nor the lemma should have leading or trailing spaces, this should be OK.

The msd error is

<w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>

I guess this is the error you already mentioned but didn't really manage to fix.
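A quick pre-check for this kind of msd value could look like the sketch below. It assumes the msd format seen in the examples in this thread (`UPosTag=<tag>` first, then optional `Feature=Value` items separated by `|`); the function name `msd_problems` and the exact feature pattern are illustrative assumptions, not part of the ParlaMint validation scripts. The tag list is the standard set of 17 Universal Dependencies UPOS tags.

```python
import re

# The 17 universal POS tags defined by Universal Dependencies.
UPOS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
        "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}

# Assumed shape of one morphological feature, e.g. "Number=Sing" or "Gender=Fem,Masc".
FEATURE = re.compile(r"[A-Za-z0-9\[\]]+=[A-Za-z0-9,]+")

def msd_problems(msd):
    """Return a list of human-readable problems found in an msd string."""
    problems = []
    head, *features = msd.split("|")
    if not head.startswith("UPosTag="):
        problems.append("first item is not UPosTag=...")
    elif head[len("UPosTag="):] not in UPOS:
        problems.append("unknown UPOS tag: %r" % head[len("UPosTag="):])
    for feature in features:
        if not FEATURE.fullmatch(feature):
            problems.append("malformed feature: %r" % feature)
    return problems

for problem in msd_problems("UPosTag=11 –|NUM="):
    print(problem)
print(msd_problems("UPosTag=NUM"))  # []
```

The value from the log fails on both counts: the token text has ended up where the tag should be, and the tag has ended up as a feature name with no value.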

I should mention that my Ph.D. supervisor wants me to focus on tasks with a higher priority at this time, so I hope this version will be approved. If not I will only be able to make further adjustments in a few months.
Well, I have to discuss and agree on this with @TomazErjavec. There is a plan for the new release of ParlaMint corpora, but ParlaMint-IL still needs some effort to be compatible with other corpora.

I agree with @matyaskopp: I'd also prefer to have the linguistic annotation errors and the date clashes that @matyaskopp mentions fixed; otherwise we will have a mess with the complete corpus. It would be a shame to give up at 99%.
Especially as the new version of the corpora will be released (if things go according to plan) in the second half of this year, I think it is better that you concentrate on your PhD and come back to ParlaMint-IL when things are under control there. We can wait and then have a beautiful new corpus!

@GiliGoldin
Collaborator

GiliGoldin commented Feb 11, 2025

I also now processed your latest offering @GiliGoldin, and, as always, the log files are at https://nl.ijs.si/et/tmp/ParlaMint/Logs/?C=M;O=D Formal errors are now only in the linguistic annotation: two for lemmas and one for msd. The lemma errors are both of the same kind, e.g. lemma="1 ". While a lemma can contain spaces, it definitely shouldn't start or end with a space. If I understand correctly, your fix was to remove punctuation from lemmas. I don't think this is the right approach. A better option, in cases where the lemma is "strange" (like the UNKNOWN that @matyaskopp mentioned), is to simply copy the word form into the lemma. As neither the word form nor the lemma should have leading or trailing spaces, this should be OK.

Yes, but the problem occurs when there is a "-" and a space in the `form` we received from the parser. Usually in these cases the lemma was already the same as the form, and it raised errors, so this can't be the solution. This is why I removed punctuation and spaces from the lemma instead. I think it works fine in almost all cases. I put in a default lemma of "UNKNOWN" for cases where removing punctuation and spaces left me with an empty lemma. Is there a different default value I can use? Can I keep it empty instead?

The msd error is

<w lemma="24" pos="X" msd="UPosTag=11 –|NUM=" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –</w>

I guess this is the error you already mentioned but didn't really manage to fix.

Are you sure you processed the new files? This is the error I mentioned before, and I did fix it, although I noted that it is a rare case where the lemma is still not as expected after the punctuation removal.
Now it looks like this:

<w lemma="1124" pos="NUM" msd="UPosTag=NUM" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u904.p0.s0.t4">11 –24</w>

I should mention that my Ph.D. supervisor wants me to focus on tasks with a higher priority at this time, so I hope this version will be approved. If not I will only be able to make further adjustments in a few months.
Well, I have to discuss and agree on this with @TomazErjavec. There is a plan for the new release of ParlaMint corpora, but ParlaMint-IL still needs some effort to be compatible with other corpora.

I agree with @matyaskopp: I'd also prefer to have the linguistic annotation errors and the date clashes that @matyaskopp mentions fixed; otherwise we will have a mess with the complete corpus. It would be a shame to give up at 99%. Especially as the new version of the corpora will be released (if things go according to plan) in the second half of this year, I think it is better that you concentrate on your PhD and come back to ParlaMint-IL when things are under control there. We can wait and then have a beautiful new corpus!

Okay, I will keep working on it when I find time and update you when I make changes.

@GiliGoldin
Collaborator

I believe the languages and responsibility statements are now ok

I have spotted one easy to fix bug.

  • lemma="UNKNOWN"
          <w lemma="UNKNOWN" pos="SYM" msd="UPosTag=SYM" xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0.s1.t8" join="right">- -</w>

I repeat my answer to Tomaz: I put in a default lemma of "UNKNOWN" for cases where removing punctuation and spaces left me with an empty lemma. Is there a different default value I can use? Can I keep it empty instead?

  • <gap>

The second is more complicated to fix, as it is present in TEI version:

    <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68" ana="#guest">
      <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68.p0">- - -</seg>
    </u>
    <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69" who="#person.30713" ana="#regular">
      <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u69.p0">בעיקר אנשים חפים מפשע בעיקר, בעיקר, כמעט 70% - - -</seg>
    </u>
    <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u70" ana="#guest">
      <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u70.p0">- - -</seg>
    </u>
    <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u71" who="#person.30713" ana="#regular">
      <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u71.p0">אני נותן לך תשובה – או שסתם אתה רוצה לשאול? אתה תחזור לנתונים - - -</seg>
    </u>

It seems to me that triple dashes/hyphens are used when part of the transcription is missing:

Image

If part of the transcription really is missing, this should be marked as <gap>.

Yes, in many cases the dashes are used when a part is missing, but that's not always the case, and there are many variations. Sometimes it's many dashes or hyphens, sometimes only 2 or 3; sometimes there are spaces between them, other times there aren't. There are also cases where it's not a gap but some sort of emphasis. So it might be hard to catch all the cases, and only the right ones.

  • coalition/opposition collisions still present
ERROR: multiple party statuses for person.2595 on 1992-07-13: Coalition Opposition
person.2595 (יצחק רבין, born in ארץ ישראל) first became a member of the government (1992-07-13) and only then a member of parliament (1992-07-14), which is a bit weird.

The meeting from 1992-07-13 says it is הכנסת ה-13:

    <meeting ana="#parla.term #parla.uni #period_13" n="13" corresp="#ParlaMint-IL-KNESS">הכנסת ה-13</meeting>
  <event xml:id="period_13" from="1992-07-13" to="1996-06-17">
    <label xml:lang="he">הכנסת ה-13 (1992-07-13 - 1996-06-17)</label>
    <label xml:lang="en">13th Knesset (1992-07-13 - 1996-06-17)</label>
  </event>
<relation name="opposition" active="#org.35" passive="#ParlaMint-IL-GOV" from="1990-06-11" to="1992-07-13"/>
<relation name="coalition" mutual="#org.89 #org.35" from="1992-07-13" to="1996-06-17"/>

Oh, I thought I had fixed all the collisions. I'll take another look and fix it.

@TomazErjavec
Collaborator

Are you sure you processed the new files?

Um, it seems not. Sorry about this.

Yes, but the problem occurs when there is a "-" and a space in the `form` we received from the parser. Usually in these cases the lemma was already the same as the form, and it raised errors, so this can't be the solution.

Ah, I see now, I am getting a bit confused it seems...

This is why I removed punctuation and spaces from the lemma instead. I think it works fine in almost all cases.

OK, this is then fine, as long as you also remove the leading and trailing spaces and get left with something (so, not empty) which is not composed only of punctuation.

I put in a default lemma of "UNKNOWN" for cases where removing punctuation and spaces left me with an empty lemma. Is there a different default value I can use? Can I keep it empty instead?

If at all possible, I'd suggest for these cases to rather change the tag from <w> to <pc>, as this is obviously punctuation rather than a word. And <pc> doesn't have @lemma, so this would solve the problem.

Yes, in many cases the dashes are used when a part is missing, but that's not always the case, and there are many variations. Sometimes it's many dashes or hyphens, sometimes only 2 or 3; sometimes there are spaces between them, other times there aren't. There are also cases where it's not a gap but some sort of emphasis. So it might be hard to catch all the cases, and only the right ones.

You are not alone in having many different ways for representing essentially the same info in your source. For our corpus we wrote heuristics that catch most of the stuff correctly but invariably don't have 100% precision or recall...

@matyaskopp
Collaborator Author

Are you sure you processed the new files?

Um, it seems not. Sorry about this.

Yes, but the problem occurs when there is a "-" and a space in the `form` we received from the parser. Usually in these cases the lemma was already the same as the form, and it raised errors, so this can't be the solution.

Ah, I see now, I am getting a bit confused it seems...

This is why I removed punctuation and spaces from the lemma instead. I think it works fine in almost all cases.

OK, this is then fine, as long as you also remove the leading and trailing spaces and get left with something (so, not empty) which is not composed only of punctuation.

I put in a default lemma of "UNKNOWN" for cases where removing punctuation and spaces left me with an empty lemma. Is there a different default value I can use? Can I keep it empty instead?

If at all possible, I'd suggest for these cases to rather change the tag from <w> to <pc>, as this is obviously punctuation rather than a word. And <pc> doesn't have @lemma, so this would solve the problem.

I agree with @TomazErjavec that it should be changed to <pc>, but UPosTag=SYM should also be changed to UPosTag=PUNCT, because it is punctuation, not a symbol. See https://universaldependencies.org/u/pos/SYM.html
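To make the two suggestions concrete, here is a minimal sketch of how the decision could be made per token. It is not taken from any ParlaMint script: the helper names and the returned dictionary are hypothetical, and the choice to treat only dash-like tokens (Unicode category Pd) as <pc> is an assumption chosen to match the "- -" example.

```python
import unicodedata

def only_category(text, prefix):
    """True if text has non-space characters and all of them belong to a
    Unicode general category starting with prefix ("P" is punctuation,
    "Pd" is dash punctuation)."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(
        unicodedata.category(c).startswith(prefix) for c in chars)

def encode_token(form, lemma):
    """Decide how a parser token should be encoded (hypothetical helper)."""
    form = form.strip()
    if only_category(form, "Pd"):
        # Dash-only token: punctuation, so no lemma is needed at all.
        return {"tag": "pc", "text": form, "msd": "UPosTag=PUNCT"}
    lemma = lemma.strip()  # a lemma must not start or end with a space
    if not lemma or only_category(lemma, "P"):
        lemma = None       # nothing usable left: flag for manual review
    return {"tag": "w", "text": form, "lemma": lemma}

print(encode_token("- -", "UNKNOWN"))
# {'tag': 'pc', 'text': '- -', 'msd': 'UPosTag=PUNCT'}
print(encode_token("1", "1 "))
# {'tag': 'w', 'text': '1', 'lemma': '1'}
```

Tokens that are neither dash-only nor left with a usable lemma come back with `lemma` set to `None`, so they can be listed and inspected instead of silently receiving a placeholder.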

Yes, in many cases the dashes are used when a part is missing, but that's not always the case, and there are many variations. Sometimes it's many dashes or hyphens, sometimes only 2 or 3; sometimes there are spaces between them, other times there aren't. There are also cases where it's not a gap but some sort of emphasis. So it might be hard to catch all the cases, and only the right ones.

You are not alone in having many different ways for representing essentially the same info in your source. For our corpus we wrote heuristics that catch most of the stuff correctly but invariably don't have 100% precision or recall...

You can start with the most striking errors, like this one (almost 115k occurrences, 1%):

    <u xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68" ana="#guest">
      <seg xml:id="ParlaMint-IL_2024-04-01-25ptm4274256.u68.p0">- - -</seg>
    </u>

See https://ufallab.ms.mff.cuni.cz/~kopp/ParlaMint/Logs/ParlaMint-IL.UTTERANCE-EMPTY.txt
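As a starting point, a heuristic along these lines could find the segments that consist only of dashes, plus those that end in a run of dashes. It is a rough sketch under assumptions: the set of dash characters and the "two or more dashes at the end" rule are guesses based on the examples in this thread, and the function names are made up. It only produces candidates; whether a given run is a real gap or just emphasis still has to be decided separately.

```python
import re

# Hyphen-minus plus the Unicode dashes U+2010..U+2015 (assumed to cover the source).
DASHES = "-\u2010\u2011\u2012\u2013\u2014\u2015"

# Two or more dashes, optionally separated by whitespace, at the end of the text.
TRAILING_RUN = re.compile(r"(?:[-\u2010-\u2015]\s*){2,}$")

def is_dash_only(text):
    """True if the segment contains nothing but dashes and whitespace."""
    stripped = "".join(text.split())
    return bool(stripped) and all(c in DASHES for c in stripped)

def ends_with_dash_run(text):
    """True if the segment ends in a run of at least two dashes."""
    return bool(TRAILING_RUN.search(text.rstrip()))

segments = [
    "- - -",
    "בעיקר אנשים חפים מפשע בעיקר, בעיקר, כמעט 70% - - -",
    "אני נותן לך תשובה – או שסתם אתה רוצה לשאול?",
]
for seg in segments:
    print(is_dash_only(seg), ends_with_dash_run(seg))
```

For the three sample segments this flags the first as dash-only, the first two as ending in a dash run, and leaves the third (a single dash in the middle of a sentence) alone.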
