ParlaMint-SI: additional metadata files for sentiment? #897

katjameden · 2025-03-24T14:39:20Z

The planned new version of the ParlaMint-SI corpus will, in addition to sentence-level sentiment, also include sentiment annotations for whole utterances (i.e. speech-level sentiment).

This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

Would it be possible to add new metadata files focussing on the sentiment (e.g. ID, annotated element (u or s), sentiment class and numeric value for the sentiment)? This would in turn allow easier (pre-)processing of the corpus for further analyses/research, as the sentiment would be included as metadata and would eliminate the need to extract it from TEI.ana.

matyaskopp · 2025-03-27T07:11:40Z

> This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

I think it is possible to add more columns to *-meta.tsv and leave the value empty for all the corpora that lack it; it can be filled in in the future. E.g. we have agenda, which is not present in other corpora.

But the question is whether it belongs in meta, as it is an annotation of the data: it does not describe the setting of the speech, but rather its content.

So, do we want to add another format?

We have *.txt, which is formatted as TSV (without column names), but I do not think the values belong there either.
Does it make sense to introduce another TSV format (or two: one for u-level and one for s-level)?
sketch:
- utterance id
- element id (s or u)
- orig id (reference to source sentence for english translation)
- language
- text
- sentiment class
- sentiment value
- ?? some other possible numbers/stats
  - number of tokens
  - number of named entities
  - ...
- ?? and in future when we will have audio alignment in TEI files
  - start time
  - end time
  - audio file ref
But if we introduce another format, will the rest of the corpora be left without it?

I have no strong opinion on that (yet).

TomazErjavec · 2025-03-30T14:07:43Z

Yes, this was exactly my thinking, i.e. we introduce one more set of TSV files, called e.g. component-name.ana-meta.tsv, and we add them to the ParlaMint-XX.conllu/ directories. Note that sentiment annotation is only added to TEI.ana files, and that all ParlaMint corpora will get s-level sentiment annotation in ParlaCAP.

The files should have a header row, and I'd suggest these columns:

- id
- element (s or u)
- language (can be several with u!)
- sentiment value (for u empty except for SI)
- sentiment 6 class (ditto)
- sentiment 3 class (ditto)
- number of sentences (always 1 for s element of course)
- number of words
- number of tokens
- number of named entities
- (maybe other numbers if it makes sense, e.g. number of UD PoSes or syntactic relations, all in one column like "NOUN:5 ADJ:3 ...")
- audio file ref (in future when we will have audio alignment in TEI files)
- audio start time (ditto)
- audio end time (ditto)
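
To make the proposed layout concrete, here is a minimal Python sketch of writing such a file with a header row. The column names (`senti_n`, `senti_6`, `named_entities`, ...) and the helper `write_ana_meta` are assumptions for illustration only; the thread does not settle the header names, and the actual script may look quite different. The audio columns are omitted, since they are only planned for the future.

```python
import csv
import io

# Hypothetical header names for the columns proposed above; the thread does
# not fix the actual names, so treat these as placeholders.
COLUMNS = [
    "id", "element", "language",
    "senti_n", "senti_6", "senti_3",
    "sentences", "words", "tokens", "named_entities",
]


def write_ana_meta(rows, out):
    """Write rows (dicts keyed by COLUMNS) as TSV with a header row.

    Keys missing from a row become empty cells, which is how u-level
    sentiment would look for a corpus that does not have it.
    """
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Illustrative rows only. The IDs and sentiment values are taken from the
# CoNLL-U example in this thread; counts not stated there are left empty.
rows = [
    {"id": "ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1",
     "element": "s", "language": "sl",
     "senti_n": "4.26", "senti_6": "mixed positive", "senti_3": "Positive",
     "sentences": 1},
    # A u row whose sentiment cells are simply left empty.
    {"id": "ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1",
     "element": "u", "language": "sl"},
]

buf = io.StringIO()
write_ana_meta(rows, buf)
print(buf.getvalue(), end="")
```

Using `restval=""` keeps every row at the same width, so a consumer can rely on the header even when a corpus fills only some of the columns.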

TomazErjavec · 2025-05-16T07:54:31Z

Although most of the scripts for adding sentiment and topic to the corpora have already been made, this issue has not been addressed yet. I guess either I or @matyaskopp should make the script once we have the definitive list of columns for the files.

What I did, for now, is to add the s-level sentiment score directly to the CoNLL-U files. It doesn't hurt, and in fact, with this we don't, strictly speaking, even need the envisaged extra TSVs, as the info is in CoNLL-U. Right now only s-level sentiment is encoded, but I guess (for SI) u-level could be added in the same way. The format is like this:

# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
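
For anyone who wants the sentiment as a table without going through TEI.ana, the comment lines above can be read back with a few lines of Python. This is only a sketch under assumptions: the function name `sentence_sentiment` and the output keys are made up here, and it relies only on the `# key = value` comment layout shown in the example, with sentences separated by blank lines.

```python
def sentence_sentiment(conllu_text):
    """Yield one dict per sentence, built from its `# key = value` comments."""
    meta = {}
    # The extra "" makes sure the last sentence is flushed even if the
    # text does not end with a blank line.
    for line in conllu_text.splitlines() + [""]:
        if line.startswith("#"):
            key, sep, value = line[1:].partition("=")
            if sep:
                meta[key.strip()] = value.strip()
        elif not line.strip():
            # A blank line closes the sentence.
            if "sent_id" in meta:
                score = meta.get("senti_n")
                yield {
                    "id": meta["sent_id"],
                    "senti_3": meta.get("senti_3", ""),
                    "senti_6": meta.get("senti_6", ""),
                    "senti_n": float(score) if score else None,
                }
            meta = {}
        # Token lines are neither comments nor blank, so they are skipped.


# Comment lines copied from the example above; token lines are omitted.
SAMPLE = """\
# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
"""

for row in sentence_sentiment(SAMPLE):
    print(row)
```

Splitting on the first `=` only means a `# text = ...` line that itself contains an equals sign is still read correctly.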

TomazErjavec · 2025-05-16T10:13:44Z

> I guess either me or @matyaskopp should make the script

I had a look, and I guess I should do it, given that I made the parlamint2meta.xsl script, and this will be similar.
It would still be nice if somebody commented on the list of fields that the file should have, unless it is perfect as it is!

katjameden added the enhancement New feature or request label Mar 24, 2025

katjameden added this to the ParlaCAP milestone Mar 24, 2025

katjameden assigned TomazErjavec Mar 24, 2025

katjameden changed the title ~~ParlaMint-SI: additional metadata files for speech-level sentiment?~~ ParlaMint-SI: additional metadata files for sentiment? Mar 24, 2025

TomazErjavec mentioned this issue Mar 26, 2025

Adding sentiment to corpora #891

Closed

TomazErjavec added a commit that referenced this issue May 17, 2025

Add script to compute .ana metadata, e.g. sentiment (#897).

37f91ed

TomazErjavec added a commit that referenced this issue May 18, 2025

Integrate .ana meta TSVs into processing. (#897)

36db1da
