Simple Plagiarism detection tool for competitive programming competitions

saifadin1/CopyShield

CopyShield 🛡️

What is CopyShield? 🤔

CopyShield is a simple plagiarism detection tool that reads a collection of documents and checks for similarity between them. It can be used to detect plagiarism in documents or source code.

Report Generation

CSV Reports

The report generation feature creates three separate files:

  1. Detected Plagiarism and Similarities: This file contains the pairs of files that are flagged as likely duplicates, along with their similarity percentage.

  2. Pairwise Similarities: This file contains the similarity percentage for each pair of participants.

  3. Participant Plagiarism Scores: This file contains the plagiarism score of each participant.

HTML Report

The program generates an HTML report containing the code snippets of all pairs of files that are flagged as likely duplicates.

HTML report

Code Comparison Visualization 📊

Our application includes a Code Comparison Visualization feature that makes it easy to identify differences between two submissions.

How it works?

The left side displays the code of the participant who submitted first, and the right side displays the code of the participant who submitted second. The differences are highlighted as follows:

  • Green: The code that the second participant added.
  • Red: The code that the second participant removed.
  • Blue: The code that is common to both participants.

Note: the submission order is only available for Codeforces submissions. For Vjudge there is no way to know who submitted first ¯\_(ツ)_/¯.
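The color scheme above can be sketched with Python's standard difflib module. This is only an illustration of the idea, not the tool's actual implementation: the function name classify_lines and the line-by-line granularity are assumptions made for the example.

```python
import difflib


def classify_lines(first_code, second_code):
    """Label each line as 'common', 'removed', or 'added'.

    'removed' lines exist only in the first submission (red),
    'added' lines exist only in the second submission (green),
    'common' lines appear in both (blue).
    """
    first_lines = first_code.splitlines()
    second_lines = second_code.splitlines()
    matcher = difflib.SequenceMatcher(a=first_lines, b=second_lines, autojunk=False)

    labeled = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            labeled += [("common", line) for line in first_lines[i1:i2]]
        else:
            # 'replace', 'delete' and 'insert' all reduce to
            # some removed lines followed by some added lines.
            labeled += [("removed", line) for line in first_lines[i1:i2]]
            labeled += [("added", line) for line in second_lines[j1:j2]]
    return labeled
```

Each (label, line) pair could then be wrapped in an HTML element styled with the matching color.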

See the example below to understand it better 👇👇.

HTML report

How it works? 🛠️

  1. Text Preprocessing: The code from each file is preprocessed to remove comments and whitespace, and all characters are converted to lowercase.

  2. n-grams Generation: Each processed code snippet is divided into n-grams (overlapping substrings of n characters).

  3. Hashing: The n-grams are hashed to reduce the dimensionality of the feature space.

  4. Fingerprinting: A sliding window approach is used to create fingerprints from the hashed n-grams, allowing efficient comparison.

  5. Similarity Calculation: The program computes the Jaccard similarity between the fingerprints of each pair of files. If the similarity exceeds a threshold, it flags the files as likely duplicates.
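The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a port of the C++ code: the regex-based comment stripping is naive (it ignores string literals), CRC32 stands in for whatever hash the tool uses, and keeping the minimum hash of each window (winnowing) is one common way to implement sliding-window fingerprinting. The function names and the default sizes are hypothetical.

```python
import re
import zlib


def preprocess(code):
    """Step 1: drop C-style comments and whitespace, then lowercase."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)  # block comments
    code = re.sub(r"//[^\n]*", "", code)                    # line comments
    return re.sub(r"\s+", "", code).lower()


def fingerprints(code, n=3, window=5):
    """Steps 2-4: hash every n-gram, keep the minimum hash of each window."""
    text = preprocess(code)
    hashes = [zlib.crc32(text[i:i + n].encode()) for i in range(len(text) - n + 1)]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def jaccard(a, b):
    """Step 5: |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_percent(code_a, code_b, n=3, window=5):
    """Similarity of two sources as a percentage."""
    return 100.0 * jaccard(fingerprints(code_a, n, window), fingerprints(code_b, n, window))
```

A pair would then be flagged when similarity_percent exceeds the chosen threshold. Because preprocessing removes comments, whitespace, and case, two files that differ only in those respects produce identical fingerprints.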

Getting Started 🚀

Setting up the environment

  1. Clone the repository
git clone https://github.com/saifadin1/CopyShield.git
  2. Install the required packages
pip install -r requirements.txt
  3. Create the .env file: Copy the contents of the .env.example file into a new .env file in the project root directory and set the required environment variables if needed.

Fetching Submissions ⬇️

First, fetch the submissions from the online judge (Vjudge or Codeforces).

Vjudge

Simply download the submissions from the contest page as a zip file. The file names will already be formatted correctly as <submission Id>_<Verdict>_<username>_<problem name>. The image below shows the export submissions button on the Vjudge contest page.

Vjudge export submissions

CodeForces

Similarly, download the submissions as a zip file from the contest page. However, the file names are not in the required format, so they need to be renamed to <submission Id>_<Verdict>_<username>_<problem name>. The CodeForcesSubmissionsReformatting directory contains two scripts to help with that:

  1. codeforces_api_client.py: fetches the metadata of the submissions and saves it in a JSON file.
  2. rename_submissions.py: renames the files in ./src/CodeForcesSubmissionsReformatting/submissions to the required format, so the downloaded submissions should be placed in this path.
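To illustrate the target naming scheme, here is a small sketch of how a rename plan could be built from submission metadata. It is not the code of rename_submissions.py: the metadata keys (id, verdict, handle, problem) and the assumption that each downloaded file is named after its submission id are hypothetical and chosen only for the example.

```python
from pathlib import Path


def target_name(meta, extension):
    """Build <submission Id>_<Verdict>_<username>_<problem name> plus the extension."""
    parts = [str(meta["id"]), meta["verdict"], meta["handle"], meta["problem"]]
    return "_".join(parts) + extension


def plan_renames(directory, metadata_by_id):
    """Return (old_path, new_path) pairs without touching the disk.

    metadata_by_id maps a submission id (as a string) to its metadata dict.
    Files with no matching metadata are left out of the plan.
    """
    plan = []
    for path in sorted(Path(directory).iterdir()):
        meta = metadata_by_id.get(path.stem)
        if path.is_file() and meta is not None:
            plan.append((path, path.with_name(target_name(meta, path.suffix))))
    return plan
```

Building the plan first and renaming afterwards makes it easy to review the new names before any file is changed.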

You can find the contest admin page at https://codeforces.com/group/<group_id>/contest/<contest_id>/admin. The image below shows the export submissions button on the Codeforces contest admin page.

Codeforces export submissions

Compile cpp code 🔨

  1. Navigate to the src directory using the following command:
cd ./src
  2. Compile the code using the following command:
g++ *.cpp -o main
  3. Run the compiled code using the following command (the path assumes you have returned to the project root; from inside src the binary is ./main):
./src/main ./<path to the directory containing the files to be checked>

Getting the reports 🗂️

The reports will be generated in the ./src/reports directory with the following structure:

| reports
|---| result.csv
|---| pairs.csv
|---| participants.csv
|---| index.html
|---| problems_data
|---|---| A
|---|---|---|HTMLreports
|---|---|---|index.html
|---|---| B
|---|---|---|HTMLreports
|---|---|---|index.html
|---|---|..
|---|---|..

To view the HTML report, open the index.html file in a browser.

Sending emails 📩

To email participants who have been verified as cheaters, flag them in reports/participants.csv. All participants are marked False by default in the Flag column; once you have confirmed that a participant cheated, change the value to True. You can then send emails to the flagged participants with the following steps:

Prepare a CSV file

Add a CSV file named group_data.csv in ./src/sending_mails with the following columns:

| Handle | Email | Name |
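As a rough illustration of how this file could be used, the sketch below loads the three columns into a lookup table and resolves a list of flagged handles to contact details. It assumes a comma-separated file with a header row; the function names are hypothetical and this is not the code of send_mails.py.

```python
import csv
import io


def load_contacts(csv_file):
    """Map each Handle to its (Email, Name) pair."""
    reader = csv.DictReader(csv_file)
    return {
        row["Handle"].strip(): (row["Email"].strip(), row["Name"].strip())
        for row in reader
    }


def recipients(flagged_handles, contacts):
    """Split flagged handles into resolved contacts and handles with no entry."""
    found = [contacts[h] for h in flagged_handles if h in contacts]
    missing = [h for h in flagged_handles if h not in contacts]
    return found, missing


# Example with an in-memory file instead of group_data.csv
sample = io.StringIO("Handle,Email,Name\nalice,alice@example.com,Alice\n")
contacts = load_contacts(sample)
found, missing = recipients(["alice", "bob"], contacts)
```

Reporting the missing handles separately helps catch flagged participants who were never added to group_data.csv.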

Set up Mailjet API credentials

Ensure the following environment variables are set in the .env file:

MAILJET_API_KEY="<your-api-key>"
MAILJET_API_SECRET="<your-api-secret>"
MAILJET_SENDER_EMAIL="<your-sender-mail>"
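Before sending, it can help to confirm that all three variables are actually set. The following is a minimal sketch that assumes the .env values have already been loaded into the process environment (how send_mails.py loads them is not shown here); the helper name missing_settings is hypothetical.

```python
import os

# The three variables listed above
REQUIRED = ("MAILJET_API_KEY", "MAILJET_API_SECRET", "MAILJET_SENDER_EMAIL")


def missing_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]
```

Calling missing_settings() with no arguments checks the real environment, and an empty list means every variable is present.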

Run the following command to send the emails

python .\src\sending_mails\send_mails.py

Command-Line options ☰

  • Set the threshold value for similarity

    --threshold, -t <value>
  • Set the window size for fingerprinting

    --window-size, -w <value>
  • Set the n-gram size

    --grams, -g <value>
  • Exclude specific problems

    --exclude-problems, -e <problem1,problem2,...>
  • Include only specific problems

    --include-problems, -i <problem1,problem2,...>
  • Include only specific users

    --include-users, -u <user1,user2,...>
  • Display the help message showing the available options and their descriptions

    --help, -h

Example

.\src\main .\problems -t 70 -w 5 -g 3 -e problem1,problem2

TODO 📝

  • Add support for highlighting the similar blocks in the HTML report
  • Add a better hashing function
  • Add a more efficient similarity calculation algorithm