Greedy String Tiling out of bounds #2179

Open
wants to merge 1 commit into
base: develop

TwoOfTwelve
Contributor

Moved the token value mapping outside greedy string tiling to prevent race conditions

The out-of-bounds exception that occasionally occurs is presumably caused by calculating the token values in multiple threads using non-thread-safe methods. This PR solves that problem by calculating the lists before starting greedy string tiling.
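
A minimal sketch of the idea, not the actual JPlag code: all names here (`DemoSubmission`, `PrecomputedTokenValues`, `PrecomputeDemo`) are hypothetical stand-ins, and token types are simplified to strings. The value lists are built once on a single thread, and the parallel part only reads from an immutable snapshot.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a submission: a name plus its token type names. */
record DemoSubmission(String name, List<String> tokenTypes) {}

/** Hypothetical sketch: every token value list is computed once, on one thread, before comparisons start. */
final class PrecomputedTokenValues {
    private final Map<String, List<Integer>> valuesBySubmission;

    PrecomputedTokenValues(List<DemoSubmission> submissions) {
        // Plain HashMaps are fine here because only the constructing thread touches them.
        Map<String, Integer> valueByTokenType = new HashMap<>();
        Map<String, List<Integer>> result = new HashMap<>();
        for (DemoSubmission submission : submissions) {
            List<Integer> values = new ArrayList<>();
            for (String type : submission.tokenTypes()) {
                Integer value = valueByTokenType.get(type);
                if (value == null) {
                    value = valueByTokenType.size(); // next unused value
                    valueByTokenType.put(type, value);
                }
                values.add(value);
            }
            result.put(submission.name(), List.copyOf(values));
        }
        // Immutable snapshot: safe to read from many threads afterwards.
        this.valuesBySubmission = Map.copyOf(result);
    }

    List<Integer> valuesOf(DemoSubmission submission) {
        return valuesBySubmission.get(submission.name());
    }
}

final class PrecomputeDemo {
    /** Reads the precomputed lists from a parallel stream; no thread writes to shared state. */
    static int totalTokensInParallel(List<DemoSubmission> submissions) {
        PrecomputedTokenValues values = new PrecomputedTokenValues(submissions);
        return submissions.parallelStream().mapToInt(s -> values.valuesOf(s).size()).sum();
    }
}
```

The trade-off in this sketch is that the mapping work is done eagerly for every submission, even for ones that might never be compared.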

I took some measurements, and the total runtime of JPlag seems to be unaffected.

On the progpedia data I got the following measurements:
old: 2.053 s ± 0.014 s
new: 2.080 s ± 0.016 s

Using 100 copies of JPlag core code as input:
old: 13.006 s ± 0.201 s
new: 12.997 s ± 0.150 s

@TwoOfTwelve added the bug label Feb 5, 2025
@TwoOfTwelve requested review from tsaglam and Kr0nox February 5, 2025 16:24

sonarqubecloud bot commented Feb 5, 2025

* @param options determines the parameterization.
* @deprecated in favor of static {@link #run(JPlagOptions)}.
Member

did this move down due to spotless?

Contributor Author

Yeah. I'm pretty sure I didn't touch the Javadoc. But I think this is the canonical order.

* @return the results of the comparison, specifically the submissions whose similarity exceeds a set threshold.
* @throws ExitException if JPlag exits preemptively.
* @deprecated in favor of static {@link #run(JPlagOptions)}.
Member

same here

return this.tokenValueMap.get(submission);
}

public static TokenValueMapper generateTokenValueMapper(SubmissionSet submissionSet) {
Member

Is there a special reason to have this generator method and not use the constructor?

Contributor Author

Kind of. Usually constructors should not contain application logic, at least the way I learned it. So having a constructor create the entire map automatically feels like it's bad code.

Member

I would say that in this case the creation logic is not complex enough to warrant a factory method.
It makes usage and readability more complex than necessary, so I would keep it simple here.
Using the constructor also avoids side effects, as submissions can no longer be added after construction, which improves encapsulation.
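
To illustrate the constructor-based variant, here is a hedged sketch. `ImmutableValueMapper` and `getTokenValuesFor` are hypothetical names, and submissions are simplified to a map from name to token type names; the real class works on JPlag's own types. Everything is built in the constructor, and nothing can be added afterwards.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified mapper: fully built in the constructor and never changed afterwards. */
final class ImmutableValueMapper {
    private final Map<String, List<Integer>> valuesBySubmission;

    /** submissions maps a submission name to its token type names (both are stand-ins). */
    ImmutableValueMapper(Map<String, List<String>> submissions) {
        Map<String, Integer> valueByType = new HashMap<>();
        Map<String, List<Integer>> result = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : submissions.entrySet()) {
            List<Integer> values = new ArrayList<>();
            for (String type : entry.getValue()) {
                Integer value = valueByType.get(type);
                if (value == null) {
                    value = valueByType.size(); // next unused value
                    valueByType.put(type, value);
                }
                values.add(value);
            }
            result.put(entry.getKey(), List.copyOf(values)); // unmodifiable list
        }
        this.valuesBySubmission = Map.copyOf(result); // unmodifiable map
    }

    /** There is deliberately no method for adding submissions later. */
    List<Integer> getTokenValuesFor(String submissionName) {
        return valuesBySubmission.get(submissionName);
    }
}
```

With this shape, callers write `new ImmutableValueMapper(submissions)` and get a fully initialized object; any attempt to modify a returned list throws `UnsupportedOperationException`.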

@tsaglam added the minor label Feb 10, 2025
Member

@tsaglam tsaglam left a comment

As the core algorithm is the most long-lived and complex part of our code base, it is especially important to make PRs and the Javadoc for these parts descriptive, as they may be referenced frequently in the future.

There are a few issues I see here, mostly regarding high-level design.

Could you also state explicitly in the PR what changes regarding the point in time of the token mapping? If I understand it correctly, we now map all tokens at the beginning instead of on demand, right?

import de.jplag.logging.ProgressBarLogger;
import de.jplag.logging.ProgressBarType;

public class TokenValueMapper {
Member

This class definitely needs exhaustive documentation. First, the complexity of the GST algorithm is already high, which makes this part of the code base harder to understand. Second, this class is currently very generically named, making it hard to understand what it does, why it is needed, and what it actually models conceptually.

Please add helpful Javadoc comments here, and also include your design decisions in the PR description.

Comment on lines +74 to +76
GreedyStringTiling coreAlgorithm = new GreedyStringTiling(options, TokenValueMapper.generateTokenValueMapper(submissionSet));
ComparisonStrategy comparisonStrategy = new ParallelComparisonStrategy(options, coreAlgorithm);

Member

From an architecture standpoint, I think the token value mapper should be part of the comparison strategy (specifically the abstract super class). I have two reasons for this:

  • Process order: With your order, the submissions are partially processed before the input is checked against the basic constraints further down.
  • Cohesion and granularity: The token value mapping is an internal step of the core algorithm, so having it in here is too much detail for this high-level routine. I would say the responsibility should be located deeper down in the core.

(Thus, the value mapping should be done as part of the call in line 86; the submission set is already passed there anyway.)
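
A rough sketch of that suggestion, under stated assumptions: `SketchComparisonStrategy`, `SketchSequentialStrategy`, and all method names are hypothetical, submissions are simplified to a map from name to token type names, and the constraint check is reduced to "at least two submissions". The abstract super class checks the input first, then maps the token values, then delegates the comparison to the concrete strategy.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical abstract strategy that owns the token value mapping step. */
abstract class SketchComparisonStrategy {
    /** Template method: validate, then map token values, then compare. */
    final List<String> compareSubmissions(Map<String, List<String>> submissions) {
        if (submissions.size() < 2) { // stand-in for the basic input constraints
            throw new IllegalArgumentException("At least two submissions are required");
        }
        return compareAll(mapTokenValues(submissions));
    }

    /** Internal step, invisible to the high-level caller. */
    private static Map<String, List<Integer>> mapTokenValues(Map<String, List<String>> submissions) {
        Map<String, Integer> valueByType = new HashMap<>();
        Map<String, List<Integer>> result = new TreeMap<>(); // sorted by name for stable output
        for (Map.Entry<String, List<String>> entry : submissions.entrySet()) {
            List<Integer> values = new ArrayList<>();
            for (String type : entry.getValue()) {
                Integer value = valueByType.get(type);
                if (value == null) {
                    value = valueByType.size();
                    valueByType.put(type, value);
                }
                values.add(value);
            }
            result.put(entry.getKey(), values);
        }
        return result;
    }

    protected abstract List<String> compareAll(Map<String, List<Integer>> tokenValues);
}

/** Hypothetical concrete strategy: only lists the pairs it would compare. */
final class SketchSequentialStrategy extends SketchComparisonStrategy {
    @Override
    protected List<String> compareAll(Map<String, List<Integer>> tokenValues) {
        List<String> names = new ArrayList<>(tokenValues.keySet());
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            for (int j = i + 1; j < names.size(); j++) {
                pairs.add(names.get(i) + "-" + names.get(j));
            }
        }
        return pairs;
    }
}
```

In this arrangement the high-level routine only constructs a strategy and passes it the submissions; invalid input is rejected before any token values are computed.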

@@ -6,6 +6,7 @@
public enum ProgressBarType {
LOADING("Loading Submissions ", false),
PARSING("Parsing Submissions ", false),
HASH_CREATION("Preparing Submissions", false),
Member

The enum name is not consistent with the CLI message text. Do we really create hashes here?
