Suppose we have a very large dataset and want to retrieve target records using queries that require exact or fuzzy matching.
Many excellent packages already exist to support this work. In this repository, I benchmark the time complexity (matching efficiency) and the matching accuracy of several chosen packages.
The program structure is as follows:

```
├─data: will not be uploaded
├─paper: some research & empirical papers
├─faiss: a package developed by Facebook AI Research
│  └─tutorial: some code examples
├─string_grouper: a super-fast string matching package in Python
│  └─tutorial: some code examples
├─splink: links datasets without a unique identifier (probabilistic model)
│  └─tutorial: some code examples
├─further_check: identify the true matching pairs from fuzzy matching results
│  └─tutorial: some code examples
...
```
- We first run the fuzzy matching (to get potential matching pairs).
  - For fuzzy matching with multiple comparison levels (requirements), we prefer Splink (see README-splink); a minimal sketch follows this list.
  - For fuzzy matching with only one comparison level, i.e., finding the best potential matches for a single field (company pair, fullname pair, etc.), we prefer string_grouper (see README-string grouper); a sketch follows the string_grouper timings below.
- Then we calculate some relative scores to further check whether the potential pairs are true matches or not.
  - We first calculate scores such as the Levenshtein distance, n-gram distance, phonetic distance, etc. (see Scores-code example). We also devised an algorithm that improves prediction quality in the RF model by scoring name features (first/middle/last name, name initials, name composition, name "etymology"; see Custom_scores-code example). A sketch of this scoring-plus-classification step also follows this list.
- We construct a random forest model to predict the matching likelihood.
- By the way, here is an interesting tool for judging whether the potential matching pairs are true or not: Claude (see Claude-code example), which assists the manual checking/labeling process.
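A minimal sketch of the multi-comparison-level setup, assuming splink v3's DuckDB API (splink v4 reorganized these imports); the column names, thresholds, and blocking rule below are illustrative assumptions, not the repository's actual configuration:

```python
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

# Illustrative settings: a multi-level fuzzy comparison on `fullname`
# plus an exact comparison on `state` (hypothetical column names).
settings = {
    "link_type": "link_only",
    "comparisons": [
        cl.jaro_winkler_at_thresholds("fullname", [0.9, 0.7]),
        cl.exact_match("state"),
    ],
    # Only score candidate pairs that agree on state.
    "blocking_rules_to_generate_predictions": ["l.state = r.state"],
}

linker = DuckDBLinker([df_left, df_right], settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Candidate pairs with estimated match probabilities.
pairs = linker.predict(threshold_match_probability=0.5).as_pandas_dataframe()
```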
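And a minimal sketch of the scoring and random-forest steps. The `jellyfish` library stands in for whichever distance implementations the Scores notebook actually uses; `candidates` (the fuzzy-matched pairs) and `labels` (manually checked 0/1 outcomes) are assumed inputs:

```python
import jellyfish
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def ngram_distance(a: str, b: str, n: int = 2) -> float:
    """Jaccard distance between character n-gram sets (one common definition)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a or not grams_b:
        return 1.0
    return 1 - len(grams_a & grams_b) / len(grams_a | grams_b)

def score_pair(a: str, b: str) -> dict:
    return {
        "levenshtein": jellyfish.levenshtein_distance(a, b),
        "ngram": ngram_distance(a, b),
        # Phonetic distance as (dis)agreement of Metaphone encodings.
        "phonetic": 0 if jellyfish.metaphone(a) == jellyfish.metaphone(b) else 1,
    }

# Score every candidate pair, then fit the classifier on labeled pairs.
features = pd.DataFrame([score_pair(a, b) for a, b in candidates])
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(features, labels)

# Matching likelihood per pair (in practice, predicted on held-out pairs).
match_likelihood = rf.predict_proba(features)[:, 1]
```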
| Input dataset scale \ time | SentenceTransformer searching time | Another vectorization (TFIDF): vectorization + training + searching | GPU version |
|---|---|---|---|
| (10000*10000) | 0.4s | 1.4s + 0.9s + 1.2s = 3.5s | |
| (50000*50000) | 9.9s | 8.2s + 4.5s + 48.7s = 61.4s | |
| (100000*100000) | 40.6s | 17s + 6s + 241.7s = 264s | |
| (500000*500000) | 1043.5s | | |
- It behaves more like an $O(N^2)$ method. For more details, see README-explore faiss.
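For reference, a minimal sketch of the kind of embedding-plus-exhaustive-search pipeline the table above times; the model name, the inputs `corpus_names` / `query_names`, and `k = 5` are illustrative assumptions:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# corpus_names / query_names: lists of strings (assumed inputs).
# Assumed model choice; the benchmark's actual encoder may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")
xb = model.encode(corpus_names, convert_to_numpy=True).astype(np.float32)
xq = model.encode(query_names, convert_to_numpy=True).astype(np.float32)

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(xb.shape[1])  # exact brute-force index
index.add(xb)
scores, neighbors = index.search(xq, 5)  # top-5 nearest names per query
```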
Compared with FAISS
| Input data scale \ time | Num_process = int(ncpu*3/4) | Num_process = 1 | Speed of faiss |
|---|---|---|---|
| (10000*10000) | 2.4s | 2s | 3.5s |
| (50000*50000) | 10.6s | 21s | 61.4s |
| (100000*100000) | 21.1s | 65s | 264s |
| (500000*500000) | 127.5s | 1272s | |
- Also an $O(N^2)$ technique. For more details, see README-explore string_grouper.
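A minimal sketch of the one-comparison-level matching with string_grouper, assuming two pandas DataFrames `df_left` / `df_right` with a hypothetical `company_name` column and an illustrative 0.8 threshold:

```python
import pandas as pd
from string_grouper import match_strings

# TF-IDF character-n-gram cosine similarity between the two name lists;
# keeps only pairs at or above the similarity threshold.
matches = match_strings(
    df_left["company_name"],    # master strings
    df_right["company_name"],   # strings to match against the master list
    min_similarity=0.8,         # assumed threshold; tune per use case
)
# `matches` pairs each matched string with its counterpart and a similarity score.
```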
- I am using Polars DataFrames to speed up the I/O process. When a Polars DataFrame contains certain data types (check `data.dtypes`), converting it to a pandas DataFrame can raise errors.
  The way I solve this is to simply cast the columns with a list data type to a string data type:

```python
import polars as pl

# Render each list-typed column as a bracketed, comma-separated string
# so the DataFrame converts to pandas without dtype errors.
for col in ['company_raw', 'location', 'naics_code', 'country', 'state', 'msa', 'company_ticker']:
    data = data.with_columns(
        pl.format("[{}]", pl.col(col).cast(pl.List(pl.Utf8)).list.join(", ")).alias(col)
    )

data_pd = data.to_pandas()  # conversion now succeeds
```
- Announcing ScaNN: Efficient Vector Similarity Search [Blog]