mela (#1970)

Geralt-Targaryen · lintangsutawika · web-flow · commit a4987bba6e9e · 2024-08-20T15:06:24.000-04:00
* mela

* Update mela_en.yaml

* Create _mela.yaml

---------

Co-authored-by: Lintang Sutawika <lintang@eleuther.ai>
diff --git a/lm_eval/tasks/mela/README.md b/lm_eval/tasks/mela/README.md
@@ -0,0 +1,60 @@
+# MELA
+
+### Paper
+
+Title: [MELA: Multilingual Evaluation of Linguistic Acceptability](https://arxiv.org/abs/2311.09033)
+
+**Abstract**: In this work, we present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages from a diverse set of language families. We establish LLM baselines on this benchmark, and investigate cross-lingual transfer in acceptability judgements with XLM-R. In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R to explore the process of syntax capability acquisition. Our results show that GPT-4o exhibits a strong multilingual ability, outperforming fine-tuned XLM-R, while open-source multilingual models lag behind by a noticeable gap. Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial: 500 Icelandic fine-tuning examples lead to 23 MCC performance in a completely unrelated language -- Chinese. Results of our probing experiments indicate that training on MELA improves the performance of XLM-R on syntax-related tasks.
+
+Homepage: https://github.com/sjtu-compling/MELA
+
+### Citation
+
+```
+@inproceedings{zhang2023mela,
+  author       = {Ziyin Zhang and
+                  Yikang Liu and
+                  Weifang Huang and
+                  Junyu Mao and
+                  Rui Wang and
+                  Hai Hu},
+  title        = {{MELA:} Multilingual Evaluation of Linguistic Acceptability},
+  booktitle    = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), {ACL} 2024, Bangkok, Thailand},
+  publisher    = {Association for Computational Linguistics},
+  year         = {2024},
+  url          = {https://doi.org/10.48550/arXiv.2311.09033}
+}
+```
+
+### Groups and Tasks
+
+#### Groups
+
+- `mela`: multilingual evaluation of linguistic acceptability
+
+#### Tasks
+
+- `mela_en`: English
+- `mela_zh`: Chinese
+- `mela_it`: Italian
+- `mela_ru`: Russian
+- `mela_de`: German
+- `mela_fr`: French
+- `mela_es`: Spanish
+- `mela_ja`: Japanese
+- `mela_ar`: Arabic
+- `mela_is`: Icelandic
+
+### Checklist
+
+For adding novel benchmarks/datasets to the library:
+
+- [x] Is the task an existing benchmark in the literature?
+  - [x] Have you referenced the original paper that introduced the task?
+  - [x] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
+
+If other tasks on this dataset are already supported:
+
+- [ ] Is the "Main" variant of this task clearly denoted?
+- [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
+- [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
diff --git a/lm_eval/tasks/mela/_mela.yaml b/lm_eval/tasks/mela/_mela.yaml
@@ -0,0 +1,17 @@
+group: mela
+task:
+  - mela_en
+  - mela_zh
+  - mela_it
+  - mela_ru
+  - mela_de
+  - mela_fr
+  - mela_es
+  - mela_ja
+  - mela_ar
+  - mela_is
+aggregate_metric_list:
+  - metric: mcc
+    weight_by_size: False
+metadata:
+  version: 1
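Note on `weight_by_size: False`: as far as I understand the setting, the group score is the plain mean of the per-language MCC values, so a small test set counts as much as a large one. A minimal sketch of the difference, assuming that reading; the scores and sizes below are made up for illustration and the function names are hypothetical, not harness APIs:

```python
from statistics import mean


def aggregate_unweighted(scores):
    """Plain mean of per-task scores; every task counts equally."""
    return mean(scores.values())


def aggregate_weighted(scores, sizes):
    """Size-weighted mean, shown only for contrast with the config above."""
    total = sum(sizes[task] for task in scores)
    return sum(scores[task] * sizes[task] for task in scores) / total


# Hypothetical per-language MCC values and test-set sizes (not real results).
scores = {"mela_en": 0.50, "mela_is": 0.10}
sizes = {"mela_en": 300, "mela_is": 100}

unweighted = aggregate_unweighted(scores)   # what weight_by_size: False implies
weighted = aggregate_weighted(scores, sizes)
print(unweighted, weighted)
```

With unequal test-set sizes the two numbers differ, which is worth keeping in mind when comparing the `mela` group score against averages reported elsewhere.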
diff --git a/lm_eval/tasks/mela/mela_ar.yaml b/lm_eval/tasks/mela/mela_ar.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_ar
+dataset_name: ar
+training_split: null
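Note on `include`: the per-language files only list the keys that differ from `mela_en.yaml`. A rough sketch of how such a file could resolve, assuming that keys in the including file override keys from the included one (the harness's actual loader may do more than this); `resolve` is a hypothetical helper:

```python
# Values copied from mela_en.yaml (prompt-related fields omitted for brevity).
base = {
    "task": "mela_en",
    "dataset_path": "Geralt-Targaryen/MELA",
    "dataset_name": "en",
    "training_split": "train",
    "validation_split": "dev",
    "test_split": "test",
}

# Keys set in mela_ar.yaml; YAML `null` loads as Python None.
override = {"task": "mela_ar", "dataset_name": "ar", "training_split": None}


def resolve(base, override):
    """Assumed merge rule: the including file's keys win over the included file's."""
    return {**base, **override}


resolved = resolve(base, override)
print(resolved)
```

Under this reading, `training_split: null` switches off the inherited `train` split for this language, while `mela_it`, `mela_ru`, and `mela_zh` omit the key and therefore keep it.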
diff --git a/lm_eval/tasks/mela/mela_de.yaml b/lm_eval/tasks/mela/mela_de.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_de
+dataset_name: de
+training_split: null
diff --git a/lm_eval/tasks/mela/mela_en.yaml b/lm_eval/tasks/mela/mela_en.yaml
@@ -0,0 +1,17 @@
+task: mela_en
+dataset_path: Geralt-Targaryen/MELA
+dataset_name: en
+training_split: train
+validation_split: dev
+test_split: test
+output_type: multiple_choice
+doc_to_text: "Sentence: {{sentence}}\nDetermine whether this sentence is acceptable or unacceptable?\nA. Acceptable\nB. Unacceptable\nAnswer:"
+doc_to_choice: ["A", "B"]
+doc_to_target: "{{['B', 'A'][label]}}"
+description: "Determine whether the following sentence(s) violate certain linguistic constraints. If yes, then it is \"unacceptable\"; otherwise, \"acceptable\".\n\n"
+fewshot_split: dev
+fewshot_config:
+  sampler: first_n
+metric_list:
+  - metric: mcc
+    higher_is_better: true
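Note on the prompt and target: `doc_to_target` indexes `['B', 'A']` with `label`, so label 1 maps to `A` (Acceptable) and label 0 to `B` (Unacceptable). The sketch below mirrors the two templates with plain Python instead of Jinja, purely to show what gets rendered; the example sentences are invented and the function names only echo the config keys:

```python
CHOICES = ["A", "B"]  # mirrors doc_to_choice


def doc_to_text(doc):
    """Mirror of the doc_to_text template, using an f-string instead of Jinja."""
    return (
        f"Sentence: {doc['sentence']}\n"
        "Determine whether this sentence is acceptable or unacceptable?\n"
        "A. Acceptable\nB. Unacceptable\nAnswer:"
    )


def doc_to_target(doc):
    """Mirror of {{['B', 'A'][label]}}: label 0 -> 'B', label 1 -> 'A'."""
    return ["B", "A"][doc["label"]]


# Hypothetical documents, not taken from the dataset.
good = {"sentence": "The cat sat on the mat.", "label": 1}
bad = {"sentence": "The cat sat mat the on.", "label": 0}

print(doc_to_text(good))
print(doc_to_target(good), doc_to_target(bad))
```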
diff --git a/lm_eval/tasks/mela/mela_es.yaml b/lm_eval/tasks/mela/mela_es.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_es
+dataset_name: es
+training_split: null
diff --git a/lm_eval/tasks/mela/mela_fr.yaml b/lm_eval/tasks/mela/mela_fr.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_fr
+dataset_name: fr
+training_split: null
diff --git a/lm_eval/tasks/mela/mela_is.yaml b/lm_eval/tasks/mela/mela_is.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_is
+dataset_name: is
+training_split: null
diff --git a/lm_eval/tasks/mela/mela_it.yaml b/lm_eval/tasks/mela/mela_it.yaml
@@ -0,0 +1,3 @@
+include: mela_en.yaml
+task: mela_it
+dataset_name: it
diff --git a/lm_eval/tasks/mela/mela_ja.yaml b/lm_eval/tasks/mela/mela_ja.yaml
@@ -0,0 +1,4 @@
+include: mela_en.yaml
+task: mela_ja
+dataset_name: ja
+training_split: null
diff --git a/lm_eval/tasks/mela/mela_ru.yaml b/lm_eval/tasks/mela/mela_ru.yaml
@@ -0,0 +1,3 @@
+include: mela_en.yaml
+task: mela_ru
+dataset_name: ru
diff --git a/lm_eval/tasks/mela/mela_zh.yaml b/lm_eval/tasks/mela/mela_zh.yaml
@@ -0,0 +1,3 @@
+include: mela_en.yaml
+task: mela_zh
+dataset_name: zh
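Note on the metric: all ten tasks are scored with `mcc`, the Matthews correlation coefficient. For reference, a self-contained sketch of the binary case; this is the textbook formula, not the harness's own implementation, and returning 0.0 when the denominator is zero is a common convention that is assumed here:

```python
from math import sqrt


def mcc(gold, pred):
    """Matthews correlation coefficient for binary labels (1 = acceptable)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    tn = sum(g == 0 and p == 0 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined score (zero denominator) is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical gold labels and predictions, for illustration only.
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(mcc(gold, pred))
```

Unlike accuracy, MCC stays near zero for a model that always answers `A` or always answers `B`, which matters when acceptable and unacceptable sentences are not evenly balanced.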
