Greedy String Tiling out of bounds #2179
base: develop
… racing conditions
* @param options determines the parameterization.
* @deprecated in favor of static {@link #run(JPlagOptions)}.
did this move down due to spotless?
Yeah. I'm pretty sure I didn't touch the Javadoc. But I think this is the canonical order.
* @return the results of the comparison, specifically the submissions whose similarity exceeds a set threshold.
* @throws ExitException if JPlag exits preemptively.
* @deprecated in favor of static {@link #run(JPlagOptions)}.
same here
return this.tokenValueMap.get(submission);
}

public static TokenValueMapper generateTokenValueMapper(SubmissionSet submissionSet) {
Is there a special reason to have this generator method instead of using the constructor?
Kind of. Usually constructors should not contain application logic, at least the way I learned it. So having a constructor create the entire map automatically feels like it's bad code.
I would say that in this case the creation logic is not complex enough to warrant a factory method.
It makes usability and readability a bit more complex than they should be, so I would say we keep it simple here.
Using the constructor also avoids side effects, as submissions can no longer be added after construction, which improves encapsulation.
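A minimal sketch of the constructor-based variant, under stated assumptions: the class name, the `String` submission keys, and the `computeValues` helper are hypothetical stand-ins, not the code in this PR. The constructor only delegates to a private static helper, which may address the concern about logic in constructors, and the stored collections are unmodifiable, so nothing can be added after construction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the class under review; names and types are illustrative only. */
final class ConstructorMapperSketch {
    private final Map<String, List<Integer>> valuesBySubmission;

    /** The constructor only delegates, so it stays free of mapping logic. */
    ConstructorMapperSketch(Map<String, List<String>> tokenTypesBySubmission) {
        this.valuesBySubmission = computeValues(tokenTypesBySubmission);
    }

    /** Returns the values of one submission, or null if it was not part of the input. */
    List<Integer> valuesFor(String submission) {
        return valuesBySubmission.get(submission);
    }

    /** Assigns each distinct token type the next free integer, in order of first appearance. */
    private static Map<String, List<Integer>> computeValues(Map<String, List<String>> input) {
        Map<String, Integer> ids = new HashMap<>();
        Map<String, List<Integer>> result = new HashMap<>();
        input.forEach((submission, types) -> {
            List<Integer> values = new ArrayList<>();
            for (String type : types) {
                ids.putIfAbsent(type, ids.size());
                values.add(ids.get(type));
            }
            result.put(submission, List.copyOf(values)); // unmodifiable list
        });
        return Map.copyOf(result); // unmodifiable map
    }
}
```

Callers would write `new ConstructorMapperSketch(input)` and could then only read values, since the class exposes no method that adds submissions.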
As the core algorithm is the longest-lived and most complex part of our code base, it is especially important to make PRs and the Javadoc for these parts descriptive, as they may be referenced frequently in the future.
There are a few issues I see here, mostly regarding high-level design.
Could you also state explicitly in the PR what changes regarding the point in time of the token mapping? If I understand it correctly, we now map all tokens at the beginning instead of on demand, right?
import de.jplag.logging.ProgressBarLogger;
import de.jplag.logging.ProgressBarType;

public class TokenValueMapper {
This class definitely needs exhaustive documentation. First, the complexity of the GST algorithm is already high, which makes this part of the code base harder to understand. Second, this class is currently very generically named, making it hard to understand what it does, why it is needed, and what it actually models conceptually.
Please add helpful Javadoc comments here, and also include your design decisions in the PR description.
GreedyStringTiling coreAlgorithm = new GreedyStringTiling(options, TokenValueMapper.generateTokenValueMapper(submissionSet));
ComparisonStrategy comparisonStrategy = new ParallelComparisonStrategy(options, coreAlgorithm);
From an architecture standpoint, I think the token value mapper should be part of the comparison strategy (specifically the abstract superclass). I have two reasons for this:
- Process order: With your order, the submissions are partially processed before the input is checked against the basic constraints further down.
- Cohesion and granularity: The token value mapping is an internal step of the core algorithm, so having it here is too much detail for this high-level routine. I would say the responsibility should be located deeper down in the core.
(Thus, the value mapping should be done as part of the call in line 86; the submission set is already passed there anyway.)
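The ordering argued for above could look roughly like the following sketch. Everything in it is hypothetical: `StrategySketch`, its generic parameters, and the "at least two submissions" check are assumptions made for illustration and are not JPlag's actual classes or constraints. The point is only that the abstract superclass validates first and builds the mapper second, so the high-level routine never sees the mapping step.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical abstract strategy; S is a submission type, M a mapper type. Not JPlag's API. */
abstract class StrategySketch<S, M> {
    private final Function<List<S>, M> mapperFactory;

    protected StrategySketch(Function<List<S>, M> mapperFactory) {
        this.mapperFactory = mapperFactory;
    }

    /** Template method: check constraints first, build the mapper second, compare last. */
    final int run(List<S> submissions) {
        // Assumed constraint, for illustration only.
        if (submissions.size() < 2) {
            throw new IllegalArgumentException("at least two submissions are required");
        }
        M mapper = mapperFactory.apply(submissions); // internal step, hidden from the caller
        return compare(submissions, mapper);
    }

    /** Concrete strategies (sequential, parallel, ...) decide how pairs are compared. */
    protected abstract int compare(List<S> submissions, M mapper);
}
```

With this shape, invalid input is rejected before any token values are computed, and callers only pass the submissions.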
@@ -6,6 +6,7 @@
public enum ProgressBarType {
LOADING("Loading Submissions ", false),
PARSING("Parsing Submissions ", false),
HASH_CREATION("Preparing Submissions", false),
The enum name is not consistent with the CLI message text. Do we really create hashes here?
Moved the token value mapping outside greedy string tiling to prevent race conditions
The out-of-bounds exception that occasionally occurs is presumably caused by calculating the token values in multiple threads using non-thread-safe methods. This PR solves that problem by calculating the lists before starting greedy string tiling.
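The two-phase idea can be sketched as follows. This is an illustration under assumptions, not the code in this PR: `TwoPhaseSketch`, the `String` submission keys, and the position-wise comparison are hypothetical, and the comparison is a simple placeholder rather than greedy string tiling. Phase 1 fills plain, non-thread-safe maps on a single thread; phase 2 runs in parallel but only reads the prepared arrays.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.IntStream;

/** Hypothetical two-phase sketch; none of these names are JPlag's. */
final class TwoPhaseSketch {
    /** Phase 1, single-threaded: map every token type to an int, one array per submission. */
    static Map<String, int[]> prepare(Map<String, List<String>> tokenTypesBySubmission) {
        Map<String, Integer> ids = new HashMap<>(); // plain HashMap: only this thread touches it
        Map<String, int[]> values = new HashMap<>();
        tokenTypesBySubmission.forEach((submission, types) -> {
            int[] mapped = new int[types.size()];
            for (int i = 0; i < mapped.length; i++) {
                Integer id = ids.get(types.get(i));
                if (id == null) {
                    id = ids.size();
                    ids.put(types.get(i), id);
                }
                mapped[i] = id;
            }
            values.put(submission, mapped);
        });
        return Map.copyOf(values); // the map is read-only from here on
    }

    /** Phase 2, parallel: worker threads only read the prepared arrays. */
    static long totalEqualPositions(Map<String, int[]> values, List<String> names) {
        return IntStream.range(0, names.size()).parallel()
                .mapToLong(i -> {
                    long equal = 0;
                    for (int j = i + 1; j < names.size(); j++) {
                        equal += equalPositions(values.get(names.get(i)), values.get(names.get(j)));
                    }
                    return equal;
                })
                .sum();
    }

    /** Placeholder comparison: counts positions holding the same value in both arrays. */
    private static int equalPositions(int[] left, int[] right) {
        int count = 0;
        for (int k = 0; k < Math.min(left.length, right.length); k++) {
            if (left[k] == right[k]) {
                count++;
            }
        }
        return count;
    }
}
```

Because no thread writes to shared state during phase 2, the non-thread-safe collections used in phase 1 no longer need to be synchronized.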
I took some measurements, and the total time for JPlag seems to be unaffected.
On the progpedia data I got the following measurements:
old: 2.053 s ± 0.014 s
new: 2.080 s ± 0.016 s
Using 100 copies of JPlag core code as input:
old: 13.006 s ± 0.201 s
new: 12.997 s ± 0.150 s