
ParlaMint-SI: additional metadata files for sentiment? #897


Open
katjameden opened this issue Mar 24, 2025 · 4 comments

Comments

@katjameden
Collaborator

The planned new version of the ParlaMint-SI corpus will, in addition to sentence-level sentiment, also include sentiment annotations for whole utterances (i.e. speech-level sentiment).

This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

Would it be possible to add new metadata files focussing on the sentiment (e.g. ID, annotated element (u or s), sentiment class and numeric value for the sentiment)? This would in turn allow easier (pre-)processing of the corpus for further analyses/research, as the sentiment would be included as metadata and would eliminate the need to extract it from TEI.ana.
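A minimal sketch of how such a file might be consumed, assuming a tab-separated layout with a header row. The column names (`id`, `element`, `sentiment_class`, `sentiment_value`), the shortened IDs, and the sample values are hypothetical and not taken from any released ParlaMint file.

```python
import csv
import io

# Hypothetical sample: column names, IDs, and values are illustrative only.
SAMPLE = (
    "id\telement\tsentiment_class\tsentiment_value\n"
    "doc.u1\tu\tPositive\t3.9\n"
    "doc.seg1.1\ts\tPositive\t4.26\n"
)


def read_sentiment(handle, element=None):
    """Return rows as dicts, optionally keeping only 'u' or 's' rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        if element is not None and row["element"] != element:
            continue
        # An empty cell means "no annotation"; keep it as None.
        value = row["sentiment_value"]
        row["sentiment_value"] = float(value) if value else None
        rows.append(row)
    return rows


utterances = read_sentiment(io.StringIO(SAMPLE), element="u")
sentences = read_sentiment(io.StringIO(SAMPLE), element="s")
```

With a file like this, utterance-level and sentence-level sentiment can be selected by a single column filter, without parsing TEI.ana.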

@katjameden katjameden added the enhancement New feature or request label Mar 24, 2025
@katjameden katjameden added this to the ParlaCAP milestone Mar 24, 2025
@katjameden katjameden changed the title from "ParlaMint-SI: additional metadata files for speech-level sentiment?" to "ParlaMint-SI: additional metadata files for sentiment?" Mar 24, 2025
@matyaskopp
Collaborator

> This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

I think it is possible to add more columns to *-meta.tsv and leave the values empty for all the corpora where the annotation will only be added in the future. E.g., we already have agenda, which is not present in other corpora.

But the question is whether it belongs in the metadata at all, since it is an annotation of the data: it describes the content of the speech rather than its setting.

So, do we want to add another format?

  • We have *.txt, which is formatted as TSV (without column names), but I do not think the values belong there either.
  • Does it make sense to introduce another (two) tsv formats, one for u-level and one for s-level?
    sketch:
    • utterance id
    • element id (s or u)
    • orig id (reference to source sentence for english translation)
    • language
    • text
    • sentiment class
    • sentiment value
    • ?? some other possible numbers/stats
      • number of tokens
      • number of named entities
      • ...
    • ?? and in future when we will have audio alignment in TEI files
      • start time
      • end time
      • audio file ref
  • But if we introduce another format, will the rest of the corpora be left without it?

I have no strong opinion on that (yet).

@TomazErjavec
Collaborator

Yes, this was exactly my thinking, i.e. we introduce one more set of TSV files, called e.g. component-name.ana-meta.tsv, and we add them to the ParlaMint-XX.conllu/ directories. Note that sentiment annotation is only added to TEI.ana files, and that all ParlaMint corpora will get s-level sentiment annotation in ParlaCAP.

The files should have a header row, and I'd suggest these are the columns:

  1. id
  2. element (s or u)
  3. language (can be several with u!)
  4. sentiment value (for u empty except for SI)
  5. sentiment 6 class (ditto)
  6. sentiment 3 class (ditto)
  7. number of sentences (always 1 for s element of course)
  8. number of words
  9. number of tokens
  10. number of named entities
  11. (maybe other numbers if it makes sense, e.g. number of UD PoSes or syntactic relations, all in one column like "NOUN:5 ADJ:3 ...")
  12. audio file ref (in future when we will have audio alignment in TEI files)
  13. audio start time (ditto)
  14. audio end time (ditto)
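The proposed layout could be sketched as follows. This is only an illustration: the column identifiers in `COLUMNS` are hypothetical short names for the fourteen fields listed above, and `write_ana_meta` is an invented helper, not part of any existing ParlaMint script. The s-level row reuses values from the SI example in this thread; counts that are not given in the thread are left empty.

```python
import csv
import io

# Hypothetical identifiers for the fourteen proposed columns.
COLUMNS = [
    "id", "element", "language", "senti_n", "senti_6", "senti_3",
    "sentences", "words", "tokens", "named_entities", "upos_counts",
    "audio_ref", "audio_start", "audio_end",
]


def write_ana_meta(handle, rows):
    """Write a header row plus one TSV line per u or s element."""
    writer = csv.DictWriter(
        handle,
        fieldnames=COLUMNS,
        delimiter="\t",
        restval="",          # fields missing from a row stay empty
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


rows = [
    # u-level row: sentiment left empty (the case for corpora other than SI).
    {"id": "ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1",
     "element": "u", "language": "sl"},
    # s-level row: an s element always covers exactly one sentence.
    {"id": "ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1",
     "element": "s", "language": "sl", "senti_n": "4.26",
     "senti_6": "mixed positive", "senti_3": "Positive", "sentences": "1"},
]

out = io.StringIO()
write_ana_meta(out, rows)
tsv_lines = out.getvalue().splitlines()
```

Leaving unavailable fields empty keeps the column count identical across corpora, so the same reader works whether or not u-level sentiment or audio alignment is present.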

@TomazErjavec
Collaborator

Although most of the scripts for adding sentiment and topic to the corpora have already been made, this issue has not been addressed yet. I guess either me or @matyaskopp should make the script if we have the definitive list of columns for the files.

What I did, for now, is add the s-level sentiment score directly to the CoNLL-U files. It doesn't hurt and, in fact, with this we don't, strictly speaking, even need the envisaged extra TSVs, as the info is in CoNLL-U. Right now only s-level sentiment is encoded, but I guess (for SI) u-level could be added in the same way. The format is like this:

# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
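Reading these comment lines back is straightforward. The sketch below assumes only what the example above shows: sentence-level comments of the form `# key = value`, with `sent_id`, `senti_3`, `senti_6` and `senti_n` as keys, and blank lines separating sentences as usual in CoNLL-U. The function name `sentence_sentiment` is hypothetical.

```python
SAMPLE = """\
# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
"""


def sentence_sentiment(lines):
    """Yield one dict per sentence with its ID and sentiment comments."""
    meta = {}

    def flush():
        # Emit the collected comments only if a sentence was actually seen.
        if "sent_id" in meta:
            score = meta.get("senti_n")
            yield {
                "id": meta["sent_id"],
                "senti_3": meta.get("senti_3"),
                "senti_6": meta.get("senti_6"),
                "senti_n": float(score) if score else None,
            }

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            # Split on the first "=" only, so "=" inside the text is kept.
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        elif not line.strip():
            # A blank line closes the current sentence.
            yield from flush()
            meta = {}
        # Token lines are ignored; only the comments matter here.
    yield from flush()


parsed = list(sentence_sentiment(SAMPLE.splitlines()))
```

Each resulting dict corresponds to one s-level row of the envisaged TSV, which is why the extra files are arguably a convenience rather than a necessity for sentence-level sentiment.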

@TomazErjavec
Collaborator

> I guess either me or @matyaskopp should make the script

I had a look, and I guess I should do it, given that I made the parlamint2meta.xsl script, and this will be similar.
It would still be nice if somebody commented on the list of fields that the file should have - except if it is perfect as it is!

3 participants