New duplicate algorithm to check for similar entries (#52)
Add similarity based deduplication algorithm
george-gca authored Feb 6, 2025
1 parent c5109a5 commit 3d9b906
Showing 9 changed files with 492 additions and 39 deletions.
47 changes: 44 additions & 3 deletions README.md
@@ -178,6 +178,47 @@ asreview data dedup synergy:van_de_schoot_2018 -o van_de_schoot_2018_dedup.csv
Removed 104 records from dataset with 6189 records.
```

We can also choose to deduplicate based on the similarity of the title and abstract instead of requiring an exact match. This way we can find duplicates that have small differences but are actually the same record (for example, an extra comma or a fixed typo). To do so, use the `--drop_similar` flag. This process takes about 4 seconds on a dataset of roughly 2068 entries.

```bash
asreview data dedup neurips_2020.tsv --drop_similar
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:03<00:00, 531.93it/s]
Found 2 duplicates in dataset with 2068 records.
```

If we want to check which entries were flagged as duplicates, we can use the `--verbose` flag. This prints the lines of the dataset that were found to be duplicates, along with the difference between them. Any text that has to be removed from the first entry to turn it into the second is shown in red with a strikethrough, any text that has to be added to the first entry is shown in green, and all text that is the same in both entries is dimmed.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --verbose
```

![Verbose drop similar](./dedup_similar.png)

The similarity threshold can be set with the `--similarity` flag (the default is `0.98`). We can also choose to use only the title for deduplication with the `--skip_abstract` flag.

```bash
asreview data dedup neurips_2020.tsv --drop_similar --similarity 0.98 --skip_abstract
```
```
Not using doi for deduplication because there is no such data.
Deduplicating: 100%|████████████████████████████████████| 2068/2068 [00:02<00:00, 770.74it/s]
Found 4 duplicates in dataset with 2068 records.
```

Note that you might have to adjust the similarity threshold if you choose to use only the title for deduplication. The similarity score is calculated with the [SequenceMatcher](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher) class from the `difflib` module: it is twice the number of matching characters divided by the total number of characters in the two strings. For example, the similarity score between the strings "hello" and "hello world" is 2 * 5 / 16 = 0.625. By default, we use the [real_quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.real_quick_ratio) and [quick_ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.quick_ratio) methods, which are faster and usually good enough but less accurate, since they only compute upper bounds on the true ratio. If you also want the exact [ratio](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio) method to be checked, use the `--strict_similarity` flag.
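
For reference, here is a minimal `difflib` snippet (not part of this package) that reproduces the numbers above:

```python
from difflib import SequenceMatcher

matcher = SequenceMatcher(None, 'hello', 'hello world')

# cheap upper bounds, checked by default during deduplication
print(matcher.real_quick_ratio())  # 0.625
print(matcher.quick_ratio())       # 0.625

# exact ratio, additionally checked when --strict_similarity is passed
print(matcher.ratio())             # 2 * 5 / 16 = 0.625
```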

Now, if we want to discard stopwords during deduplication (for a stricter check on the important words), we can use the `--discard_stopwords` flag. The default language for the stopwords is `english`, but it can be set with the `--stopwords_language` flag. The supported stopword languages are the same as those supported by the [nltk](https://www.nltk.org/index.html) package. To check the list of available languages, you can run the following commands in your Python environment:

```python
from nltk.corpus import stopwords
print(stopwords.fileids())
```
```
['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
```
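
For example, to combine similarity-based deduplication with stopword removal (English is the default language, shown here explicitly; output omitted):

```bash
asreview data dedup neurips_2020.tsv --drop_similar --discard_stopwords --stopwords_language english
```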

### Data Vstack (Experimental)

@@ -186,7 +227,7 @@ Vertical stacking: combine as many datasets in the same file format as you want
❗ Vstack is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Stack several datasets on top of each other:
```
asreview data vstack output.csv MY_DATASET_1.csv MY_DATASET_2.csv MY_DATASET_3.csv
```
@@ -206,7 +247,7 @@ Compose is where datasets containing records with different labels (or no
labels) can be assembled into a single dataset.

❗ Compose is an experimental feature. We would love to hear your feedback.
Please keep in mind that this feature can change in the future.

Overview of possible input files and corresponding properties; use at least
one of the following arguments:
@@ -231,7 +272,7 @@ case of conflicts, use the `--conflict_resolve`/`-c` flag. This is set to
| Resolve method | Action in case of conflict |
|----------------|-----------------------------------------------------------------------------------------|
| `keep_one` | Keep one label, using `--hierarchy` to determine which label to keep |
| `keep_all`     | Keep conflicting records as duplicates in the composed dataset (ignoring `--hierarchy`)  |
| `abort` | Abort |
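
For instance, to keep all conflicting records as duplicates when composing (a sketch; the file names are placeholders):

```bash
asreview data compose composed_output.csv -l LABELED.csv -u UNLABELED.csv -c keep_all
```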


28 changes: 17 additions & 11 deletions Tutorials.md
@@ -1,6 +1,6 @@
# Tutorials

---
Below are several examples to illustrate how to use `ASReview-datatools`. Make
sure to have installed
[asreview-datatools](https://github.com/asreview/asreview-datatools) and
@@ -18,17 +18,17 @@ ASReview converts the labeling decisions in [RIS files](https://asreview.readthe
irrelevant as `0` and relevant as `1`. Records marked as unseen or with
missing labeling decisions are converted to `-1`.

---

## Update Systematic Review

Assume you are working on a systematic review and you want to update the
review with newly available records. The original data is stored in
`MY_LABELED_DATASET.csv` and the file contains a
[column](https://asreview.readthedocs.io/en/latest/data_labeled.html#label-format)
containing the labeling decisions. In order to update the systematic review,
you run the original search query again but with a new date. You save the
newly found records in `SEARCH_UPDATE.ris`.


In the command line interface (CLI), navigate to the directory where the
@@ -52,12 +52,18 @@ asreview data convert SEARCH_UPDATE.ris SEARCH_UPDATE.csv

Duplicate records can be removed with the `dedup` script. The algorithm
removes duplicates using the Digital Object Identifier
([DOI](https://www.doi.org/)) and the title plus abstract.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv
```

This can also be done using a similarity threshold on the titles and abstracts instead of requiring an exact match.

```bash
asreview data dedup SEARCH_UPDATE.csv -o SEARCH_UPDATE_DEDUP.csv --drop_similar
```

### Describe input

If you want to see descriptive info on your input datasets, run these commands:
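
A minimal sketch of such commands, assuming the `describe` subcommand and using the file names from this tutorial:

```bash
asreview data describe MY_LABELED_DATASET.csv
asreview data describe SEARCH_UPDATE_DEDUP.csv
```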
@@ -78,12 +84,12 @@ asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPD
The flag `-l` means the labels in `MY_LABELED_DATASET.csv` will be kept.

The flag `-u` means all records from `SEARCH_UPDATE_DEDUP.csv` will be
added as unlabeled to the composed dataset.

If a record exists in both datasets, the record containing a label is kept by
default; see the default [conflict resolving
strategy](https://github.com/asreview/asreview-datatools#resolving-conflicting-labels).
To keep both records (with and without label), use

```bash
asreview data compose updated_search.csv -l MY_LABELED_DATASET.csv -u SEARCH_UPDATE_DEDUP.csv -c keep
@@ -154,14 +160,14 @@ added as unlabeled.

If any duplicate records exist across the datasets, by default the order of
keeping labels is:
1. relevant
2. irrelevant
3. unlabeled

You can configure how conflicting labels are resolved by setting the
hierarchy differently. To do so, pass the letters r (relevant), i
(irrelevant), and u (unlabeled) in the desired order, for example `--hierarchy
uir`.


The composed dataset will be exported to `search_with_priors.ris`.
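
As a rough sketch (the input file names below are placeholders, not the tutorial's actual datasets), such a call could look like:

```bash
asreview data compose search_with_priors.ris -l PRIOR_KNOWLEDGE.csv -u NEW_SEARCH_DEDUP.ris --hierarchy uir
```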
@@ -193,12 +199,12 @@ new search.
Assume you want to use the [simulation
mode](https://asreview.readthedocs.io/en/latest/simulation_overview.html) of
ASReview but the data is not stored in one single file containing the meta-data
and labeling decisions as required by ASReview.

Suppose the following files are available:

- `SCREENED.ris`: all records that were screened
- `RELEVANT.ris`: the subset of relevant records after manually screening all the records.

You need to compose the files into a single file where all records from
`RELEVANT.csv` are relevant and all other records are irrelevant.
211 changes: 211 additions & 0 deletions asreviewcontrib/datatools/dedup.py
@@ -0,0 +1,211 @@
import re
from argparse import Namespace
from difflib import SequenceMatcher

import ftfy
import pandas as pd
from asreview import ASReviewData
from pandas.api.types import is_object_dtype
from pandas.api.types import is_string_dtype
from rich.console import Console
from rich.text import Text
from tqdm import tqdm


def _print_similar_list(
similar_list: list[tuple[int, int]],
data: pd.Series,
pid: str,
pids: pd.Series = None
) -> None:

print_seq_matcher = SequenceMatcher()
console = Console()

if pids is not None:
print(f'Found similar titles or same {pid} at lines:')
else:
print('Found similar titles at lines:')

for i, j in similar_list:
print_seq_matcher.set_seq1(data.iloc[i])
print_seq_matcher.set_seq2(data.iloc[j])
text = Text()

if pids is not None:
text.append(f'\nLines {i+1} and {j+1} ', style='bold')
if pids.iloc[i] == pids.iloc[j]:
text.append(f'(same {pid} "{pids.iloc[i]}"):\n', style='dim')
else:
text.append(f'({pid} "{pids.iloc[i]}" and "{pids.iloc[j]}"):\n',
style='dim')

else:
text.append(f'\nLines {i+1} and {j+1}:\n', style='bold')

for tag, i1, i2, j1, j2 in print_seq_matcher.get_opcodes():
if tag == 'replace':
# add rich strikethrough
text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
text.append(f'{data.iloc[j][j1:j2]}', style='green')
if tag == 'delete':
text.append(f'{data.iloc[i][i1:i2]}', style='red strike')
if tag == 'insert':
text.append(f'{data.iloc[j][j1:j2]}', style='green')
if tag == 'equal':
text.append(f'{data.iloc[i][i1:i2]}', style='dim')

console.print(text)

print('')


def _drop_duplicates_by_similarity(
asdata: ASReviewData,
pid: str,
similarity: float = 0.98,
skip_abstract: bool = False,
discard_stopwords: bool = False,
stopwords_language: str = 'english',
strict_similarity: bool = False,
verbose: bool = False,
) -> None:

if skip_abstract:
data = asdata.df['title']
else:
data = pd.Series(asdata.texts)

symbols_regex = re.compile(r'[^ \w\d\-_]')
spaces_regex = re.compile(r'\s+')

# clean the data
s = (
data
.apply(ftfy.fix_text)
.str.replace(symbols_regex, '', regex=True)
.str.replace(spaces_regex, ' ', regex=True)
.str.lower()
.str.strip()
.replace('', None)
)

if discard_stopwords:
try:
from nltk.corpus import stopwords
stopwords_set = set(stopwords.words(stopwords_language))
except LookupError:
import nltk
nltk.download('stopwords')
stopwords_set = set(stopwords.words(stopwords_language))

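        # build a single regex that matches any of the stopwords as a whole word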
stopwords_regex = re.compile(rf'\b{"\\b|\\b".join(stopwords_set)}\b')
s = s.str.replace(stopwords_regex, '', regex=True)

seq_matcher = SequenceMatcher()
duplicated = [False] * len(s)

if verbose:
similar_list = []
else:
similar_list = None

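    # if a persistent identifier column (e.g. doi) exists, normalize it so that
    # records sharing the same identifier are also flagged as duplicates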
if pid in asdata.df.columns:
if is_string_dtype(asdata.df[pid]) or is_object_dtype(asdata.df[pid]):
pids = asdata.df[pid].str.strip().replace("", None)
if pid == "doi":
pids = pids.str.lower().str.replace(
r"^https?://(www\.)?doi\.org/", "", regex=True
)

else:
pids = asdata.df[pid]

for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
seq_matcher.set_seq2(text)

# loop through the rest of the data if it has the same pid or similar length
for j, t in s.iloc[i+1:][(asdata.df[pid] == asdata.df.iloc[i][pid]) |
(abs(s.str.len() - len(text)) < 5)].items():
seq_matcher.set_seq1(t)

# if the texts have the same pid or are similar enough,
# mark the second one as duplicate
if pids.iloc[i] == pids.iloc[j] or \
(seq_matcher.real_quick_ratio() > similarity and \
seq_matcher.quick_ratio() > similarity and \
(not strict_similarity or seq_matcher.ratio() > similarity)):

if verbose and not duplicated[j]:
similar_list.append((i, j))

duplicated[j] = True

if verbose:
_print_similar_list(similar_list, data, pid, pids)

else:
print(f'Not using {pid} for deduplication because there is no such data.')

for i, text in tqdm(s.items(), total=len(s), desc='Deduplicating'):
seq_matcher.set_seq2(text)

# loop through the rest of the data if it has similar length
for j, t in s.iloc[i+1:][abs(s.str.len() - len(text)) < 5].items():
seq_matcher.set_seq1(t)

# if the texts are similar enough, mark the second one as duplicate
if seq_matcher.real_quick_ratio() > similarity and \
seq_matcher.quick_ratio() > similarity and \
(not strict_similarity or seq_matcher.ratio() > similarity):

if verbose and not duplicated[j]:
similar_list.append((i, j))

duplicated[j] = True

if verbose:
_print_similar_list(similar_list, data, pid)

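    # keep only the rows that were not flagged as duplicates and reset the index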
asdata.df = asdata.df[~pd.Series(duplicated)].reset_index(drop=True)


def deduplicate_data(asdata: ASReviewData, args: Namespace) -> None:
initial_length = len(asdata.df)

if not args.similar:
if args.pid not in asdata.df.columns:
print(
f'Not using {args.pid} for deduplication '
'because there is no such data.'
)

# retrieve deduplicated ASReview data object
asdata.drop_duplicates(pid=args.pid, inplace=True)

else:
_drop_duplicates_by_similarity(
asdata,
args.pid,
args.threshold,
args.title_only,
args.stopwords,
args.stopwords_language,
args.strict,
args.verbose,
)

# count duplicates
n_dup = initial_length - len(asdata.df)

if args.output_path:
asdata.to_file(args.output_path)
print(
f'Removed {n_dup} duplicates from dataset with'
f' {initial_length} records.'
)
else:
print(
f'Found {n_dup} duplicates in dataset with'
f' {initial_length} records.'
)