CopyShield is a simple Plagiarism Detection tool that reads a collection of documents and checks the similarity between them. It can be used to detect plagiarism in documents or source code.
The report generation feature creates three separate files with detailed information:

- **Detected Plagiarism and Similarities**: contains the pairs of files that are flagged as likely duplicates, along with their similarity percentage.
- **Pairwise Similarities**: contains the list of similarity percentages between each pair of participants.
- **Participant Plagiarism Scores**: contains the plagiarism score of each participant.
The program also generates an HTML report containing the code snippets of all pairs of files that are flagged as likely duplicates.
Our application includes a Code Comparison Visualization feature that makes it easy to identify differences between two sets of code.
The left side displays the first participant's code (the earlier submission) and the right side displays the second participant's code (the later submission). The differences are highlighted as follows:

- Green: code that the second participant added.
- Red: code that the second participant removed.
- Blue: code that is common between the two participants.

Note: the submission order is only available for Codeforces submissions (not for Vjudge, since there is no way to tell who submitted first ¯\\_(ツ)_/¯).

You can see the example below to understand it better; a minimal sketch of how such a classification can be computed follows as well.
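For illustration only, here is one way the common/removed/added classification could be derived with Python's `difflib`; this is a sketch under the assumption of a line-level diff, not the tool's actual report generator:

```python
# Illustrative sketch: classify lines as common / removed / added,
# mirroring the Blue / Red / Green highlighting described above.
import difflib

def classify(lines_a: list[str], lines_b: list[str]):
    """Yield (tag, line) pairs comparing the first and second submission."""
    for token in difflib.ndiff(lines_a, lines_b):
        tag = {"  ": "common", "- ": "removed", "+ ": "added"}.get(token[:2])
        if tag is not None:  # skip difflib's "? " hint lines
            yield tag, token[2:]

for tag, line in classify(["int x = 1;", "return x;"],
                          ["int x = 2;", "return x;"]):
    print(f"{tag:>7}: {line}")
```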
- **Text Preprocessing**: The code from each file is preprocessed to remove comments and whitespace, and all characters are converted to lowercase.
- **n-grams Generation**: Each processed code snippet is divided into n-grams.
- **Hashing**: The n-grams are hashed to reduce the dimensionality of the feature space.
- **Fingerprinting**: A sliding-window approach is used to create fingerprints from the hashed n-grams, allowing efficient comparison.
- **Similarity Calculation**: The program computes the Jaccard similarity between the fingerprints of each pair of files. If the similarity exceeds a threshold, the files are flagged as likely duplicates. (A sketch of the full pipeline is shown after this list.)
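The following minimal Python sketch walks through the five steps above end to end; it is an illustration only (the actual implementation is the C++ code in `./src`, and the helper names and the 70% threshold here are assumptions):

```python
# Minimal sketch of the detection pipeline; the real implementation is in C++.
import re

def preprocess(code: str) -> str:
    """Strip // and /* */ comments plus all whitespace, then lowercase."""
    code = re.sub(r"//.*?$|/\*.*?\*/", "", code, flags=re.M | re.S)
    return re.sub(r"\s+", "", code).lower()

def hashed_ngrams(text: str, n: int) -> list[int]:
    """Hash every character n-gram of the processed text."""
    return [hash(text[i:i + n]) for i in range(len(text) - n + 1)]

def fingerprints(hashes: list[int], w: int) -> set[int]:
    """Keep the minimum hash of each sliding window of size w (winnowing)."""
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def jaccard(a: set[int], b: set[int]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

code_a = "int main() { return 0; }  // solution A"
code_b = "int main() {return 0;}  /* copied */"

fa = fingerprints(hashed_ngrams(preprocess(code_a), 3), 5)
fb = fingerprints(hashed_ngrams(preprocess(code_b), 3), 5)
similarity = jaccard(fa, fb) * 100
print(f"{similarity:.1f}% similar -> flagged: {similarity >= 70}")
```

The sliding-window selection keeps only a sparse subset of the hashes, so each file is reduced to a small fingerprint set and the pairwise Jaccard comparison stays cheap.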
- Clone the repository:

  ```bash
  git clone https://github.com/saifadin1/CopyShield.git
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Create the `.env` file: copy the contents of the `.env.example` file to create a new `.env` file in the project root directory and set the required environment variables if needed.
First, the submissions should be fetched from the online judge (Vjudge or Codeforces in particular).

Simply download the submissions from the contest page as a zip file; the file names will already be formatted correctly as `<submission Id>_<Verdict>_<username>_<problem name>`.

The image below shows the export submissions button on the Vjudge contest page.
Similarly, download the submissions as a zip file from the contest page. However, there's a slight issue: the filenames are not formatted as needed. To fix this, we need to reformat them to match the required format: `<submission Id>_<Verdict>_<username>_<problem name>`.

The `CodeForcesSubmissionsReformatting` directory contains two scripts to help you with that (a sketch of the metadata-fetching step follows this list):

- `codeforces_api_client.py`: fetches the metadata of the submissions and saves it in a JSON file.
- `rename_submissions.py`: renames the files in `./src/CodeForcesSubmissionsReformatting/submissions` to the required format, so the fetched submissions should be placed in this path.
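For illustration, fetching the submission metadata via the public Codeforces API could look like the sketch below. This is an assumption about the approach, not the actual contents of `codeforces_api_client.py`; the contest id and output filename are hypothetical, and group contests may additionally require API-key authentication:

```python
# Hypothetical sketch: pull submission metadata from the Codeforces API
# and keep only the fields needed for the required filename format:
# <submission Id>_<Verdict>_<username>_<problem name>
import json
import requests

CONTEST_ID = 123456  # hypothetical contest id

resp = requests.get(
    "https://codeforces.com/api/contest.status",
    params={"contestId": CONTEST_ID},
)
resp.raise_for_status()
submissions = resp.json()["result"]

metadata = [
    {
        "id": s["id"],
        "verdict": s.get("verdict", "UNKNOWN"),
        "handle": s["author"]["members"][0]["handle"],
        "problem": s["problem"]["name"],
    }
    for s in submissions
]

with open("submissions_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```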
You can find the contest admin page at the following path: https://codeforces.com/group/<group_id>/contest/<contest_id>/admin, and the image below shows the export submissions button on the Codeforces contest admin page.
- Navigate to the `src` directory using the following command:

  ```bash
  cd ./src
  ```

- Compile the code using the following command:

  ```bash
  g++ *.cpp -o main
  ```

- Run the compiled code using the following command (from inside `src`, the binary is `./main`):

  ```bash
  ./main ./<path to the directory containing the files to be checked>
  ```
The reports will be generated in the `./src/reports` directory with the following structure:

```
reports
|---| result.csv
|---| pairs.csv
|---| participants.csv
|---| index.html
|---| problems_data
|---|---| A
|---|---|---| HTMLreports
|---|---|---| index.html
|---|---| B
|---|---|---| HTMLreports
|---|---|---| index.html
|---|---| ...
```

To view the HTML report, open the `index.html` file in the browser.
You should flag participants who have been verified as cheaters in order to send them emails. In `reports/participants.csv`, all participants are marked with `False` by default in the `Flag` column; if you have confirmed that a participant cheated, change the value to `True` (a hypothetical helper for doing this programmatically is sketched below).
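If you prefer to flip the flags from a script rather than by hand, a sketch might look like this; note that only the `Flag` column is documented, so the `Handle` column name and the file's other contents are assumptions:

```python
# Hypothetical helper: mark confirmed cheaters in reports/participants.csv.
# Only the "Flag" column is documented; "Handle" is an assumed column name.
import csv

confirmed = {"some_handle_1", "some_handle_2"}  # handles you verified

path = "./src/reports/participants.csv"
with open(path, newline="") as f:
    reader = csv.DictReader(f)
    rows, fields = list(reader), reader.fieldnames

for row in rows:
    if row.get("Handle") in confirmed:
        row["Flag"] = "True"

with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```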
You can send emails to the flagged participants with the following steps:
- Add a CSV file named `group_data.csv` in the path `./src/sending_mails`, containing the following columns:

  | Handle | Email | Name |
  | ------ | ----- | ---- |

- Set up the Mailjet API credentials: ensure the following environment variables are set in the `.env` file:

  ```
  MAILJET_API_KEY="<your-api-key>"
  MAILJET_API_SECRET="<your-api-secret>"
  MAILJET_SENDER_EMAIL="<your-sender-email>"
  ```

- Run the mail-sending script:

  ```bash
  python .\src\sending_mails\send_mails.py
  ```
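For reference, a minimal sketch of what such a script could do with the official `mailjet_rest` wrapper is shown below. This is an assumption, not the actual `send_mails.py`: it emails everyone in `group_data.csv`, omits the step of filtering recipients by the `Flag` column in `participants.csv`, and uses placeholder subject and body text:

```python
# Hypothetical sketch of sending notification mails via the Mailjet v3.1 API.
import csv
import os

from dotenv import load_dotenv
from mailjet_rest import Client

load_dotenv()  # reads the MAILJET_* variables from the .env file

mailjet = Client(
    auth=(os.environ["MAILJET_API_KEY"], os.environ["MAILJET_API_SECRET"]),
    version="v3.1",
)

with open("./src/sending_mails/group_data.csv", newline="") as f:
    recipients = list(csv.DictReader(f))  # columns: Handle, Email, Name

messages = [
    {
        "From": {"Email": os.environ["MAILJET_SENDER_EMAIL"]},
        "To": [{"Email": r["Email"], "Name": r["Name"]}],
        "Subject": "Plagiarism notice",  # placeholder subject
        "TextPart": f"Hi {r['Name']}, your submission was flagged.",
    }
    for r in recipients
]

result = mailjet.send.create(data={"Messages": messages})
print(result.status_code)
```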
- `--threshold, -t <value>`: set the threshold value for similarity.
- `--window-size, -w <value>`: set the window size for fingerprinting.
- `--grams, -g <value>`: set the n-gram size.
- `--exclude-problems, -e <problem1,problem2,...>`: exclude specific problems.
- `--include-problems, -i <problem1,problem2,...>`: include only specific problems.
- `--include-users, -u <user1,user2,...>`: include only specific users.
- `--help, -h`: display the help message showing the available options and their descriptions.
Example:

```bash
.\src\main .\problems -t 70 -w 5 -g 3 -e problem1,problem2
```
- Add support for highlighting the similar blocks in the HTML report
- Add a better hashing function
- Add a more efficient similarity calculation algorithm