New duplicate algorithm to check for similar entries (#52)
Add a similarity-based deduplication algorithm
1 parent c5109a5, commit 3d9b906
Showing 9 changed files with 492 additions and 39 deletions.
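The new algorithm compares cleaned titles (or title plus abstract) with difflib's SequenceMatcher and only accepts a pair as duplicates once the cheap upper bounds real_quick_ratio() and quick_ratio() exceed the threshold, with the exact ratio() added in strict mode. A minimal standalone sketch of that staged check; the helper names, threshold value, and sample titles below are illustrative, not part of the commit:

import re
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    # Same spirit as the cleaning step in the diff: strip symbols,
    # collapse whitespace, lowercase.
    text = re.sub(r'[^ \w\d\-_]', '', text)
    return re.sub(r'\s+', ' ', text).lower().strip()


def is_near_duplicate(a: str, b: str, threshold: float = 0.98) -> bool:
    # Staged check as in the diff: the cheap upper bounds
    # real_quick_ratio() and quick_ratio() run first, so the expensive
    # ratio() is only computed for plausible matches.
    matcher = SequenceMatcher(a=_normalize(a), b=_normalize(b))
    return (matcher.real_quick_ratio() > threshold
            and matcher.quick_ratio() > threshold
            and matcher.ratio() > threshold)


# Made-up titles for illustration only.
print(is_near_duplicate(
    'A systematic review of machine learning in screening',
    'A Systematic Review of Machine-Learning in Screening.',
))  # True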
@@ -0,0 +1,211 @@
import re
from argparse import Namespace
from difflib import SequenceMatcher

import ftfy
import pandas as pd
from asreview import ASReviewData
from pandas.api.types import is_object_dtype
from pandas.api.types import is_string_dtype
from rich.console import Console
from rich.text import Text
from tqdm import tqdm


def _print_similar_list(
        similar_list: list[tuple[int, int]],
        data: pd.Series,
        pid: str,
        pids: pd.Series = None
        ) -> None:

    print_seq_matcher = SequenceMatcher()
    console = Console()

    if pids is not None:
        print(f'Found similar titles or same {pid} at lines:')
    else:
        print('Found similar titles at lines:')

    for i, j in similar_list:
        print_seq_matcher.set_seq1(data.iloc[i])
        print_seq_matcher.set_seq2(data.iloc[j])
        text = Text()

        if pids is not None:
            text.append(f'\nLines {i+1} and {j+1} ', style='bold')
            if pids.iloc[i] == pids.iloc[j]:
                text.append(f'(same {pid} "{pids.iloc[i]}"):\n', style='dim')
            else:
                text.append(f'({pid} "{pids.iloc[i]}" and "{pids.iloc[j]}"):\n',
                            style='dim')
        else:
            text.append(f'\nLines {i+1} and {j+1}:\n', style='bold')

        for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
            if tag == 'replace':
                # add rich strikethrough
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'delete':
                text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
            if tag == 'insert':
                text.append(f'{data.iloc[j][j1:j2]}', style='green')
            if tag == 'equal':
                text.append(f'{data.iloc[i][i1:i2]}', style='dim')

        console.print(text)

    print('')


def _drop_duplicates_by_similarity(
        asdata: ASReviewData,
        pid: str,
        similarity: float = 0.98,
        skip_abstract: bool = False,
        discard_stopwords: bool = False,
        stopwords_language: str = 'english',
        strict_similarity: bool = False,
        verbose: bool = False,
        ) -> None:

    if skip_abstract:
        data = asdata.df['title']
    else:
        data = pd.Series(asdata.texts)

    symbols_regex = re.compile(r'[^ \w\d\-_]')
    spaces_regex = re.compile(r'\s+')

    # clean the data
    s = (
        data
        .apply(ftfy.fix_text)
        .str.replace(symbols_regex, '', regex=True)
        .str.replace(spaces_regex, ' ', regex=True)
        .str.lower()
        .str.strip()
        .replace('', None)
    )

    if discard_stopwords:
        try:
            from nltk.corpus import stopwords
            stopwords_set = set(stopwords.words(stopwords_language))
        except LookupError:
            import nltk
            nltk.download('stopwords')
            stopwords_set = set(stopwords.words(stopwords_language))

        stopwords_regex = re.compile(rf'\b{"\\b|\\b".join(stopwords_set)}\b')
        s = s.str.replace(stopwords_regex, '', regex=True)

    seq_matcher = SequenceMatcher()
    duplicated = [False] * len(s)

    if verbose:
        similar_list = []
    else:
        similar_list = None

    if pid in asdata.df.columns:
        if is_string_dtype(asdata.df[pid]) or is_object_dtype(asdata.df[pid]):
            pids = asdata.df[pid].str.strip().replace("", None)
            if pid == "doi":
                pids = pids.str.lower().str.replace(
                    r"^https?://(www\.)?doi\.org/", "", regex=True
                )
        else:
            pids = asdata.df[pid]

        for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
            seq_matcher.set_seq2(text)

            # loop through the rest of the data if it has the same pid or similar length
            for j, t in s.iloc[i+1:][(asdata.df[pid] == asdata.df.iloc[i][pid]) |
                                     (abs(s.str.len() - len(text)) < 5)].items():
                seq_matcher.set_seq1(t)

                # if the texts have the same pid or are similar enough,
                # mark the second one as duplicate
                if pids.iloc[i] == pids.iloc[j] or \
                        (seq_matcher.real_quick_ratio() > similarity and
                         seq_matcher.quick_ratio() > similarity and
                         (not strict_similarity or seq_matcher.ratio() > similarity)):

                    if verbose and not duplicated[j]:
                        similar_list.append((i, j))

                    duplicated[j] = True

        if verbose:
            _print_similar_list(similar_list, data, pid, pids)

    else:
        print(f'Not using {pid} for deduplication because there is no such data.')

        for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
            seq_matcher.set_seq2(text)

            # loop through the rest of the data if it has similar length
            for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
                seq_matcher.set_seq1(t)

                # if the texts are similar enough, mark the second one as duplicate
                if seq_matcher.real_quick_ratio() > similarity and \
                        seq_matcher.quick_ratio() > similarity and \
                        (not strict_similarity or seq_matcher.ratio() > similarity):

                    if verbose and not duplicated[j]:
                        similar_list.append((i, j))

                    duplicated[j] = True

        if verbose:
            _print_similar_list(similar_list, data, pid)

    asdata.df = asdata.df[~pd.Series(duplicated)].reset_index(drop=True)


def deduplicate_data(asdata: ASReviewData, args: Namespace) -> None:
    initial_length = len(asdata.df)

    if not args.similar:
        if args.pid not in asdata.df.columns:
            print(
                f'Not using {args.pid} for deduplication '
                'because there is no such data.'
            )

        # retrieve deduplicated ASReview data object
        asdata.drop_duplicates(pid=args.pid, inplace=True)

    else:
        _drop_duplicates_by_similarity(
            asdata,
            args.pid,
            args.threshold,
            args.title_only,
            args.stopwords,
            args.stopwords_language,
            args.strict,
            args.verbose,
        )

    # count duplicates
    n_dup = initial_length - len(asdata.df)

    if args.output_path:
        asdata.to_file(args.output_path)
        print(
            f'Removed {n_dup} duplicates from dataset with'
            f' {initial_length} records.'
        )
    else:
        print(
            f'Found {n_dup} duplicates in dataset with'
            f' {initial_length} records.'
        )