Draft: Optimize report viewer dto size #507
Conversation
The max metric is now included in the web report DTO alongside the average metric.
@nestabentum So, I might have good news: I tested the lookup-table version with 51 real submissions. Also, you might have already solved the bug with the lag in the UI anonymization; at least I did not encounter it this time. However, the problem in the generation is still there: while comparing took about 3 seconds, JPlag froze for about 15 minutes while saving the files, so I was unable to test it with larger sets. However, I will test it again with a bigger programming task (50 submissions again, but larger ones) and will let you know about the result.
I repeated the test with larger submissions; this time the look-up table is around 3.5MB. So, regarding file sizes, this could work. The only thing I'm currently worried about is that this table already has around 50,000 entries. I don't know if this can become a problem with larger submission sizes. Once the saving problem is solved (this time JPlag froze for around half an hour), I can test it with around 500 submissions. Regarding the viewer, I did not encounter any problems. The initial loading of the 12MB zip file took around 3 seconds; afterwards, everything was smooth. However, this might change when you add the resolving of indices to real code lines in the comparison view, so that's something to keep an eye on. Are there any other opinions? @tsaglam @dfuchss
I'll have a look at it next week. 30 minutes is a very long time; we may need to discuss what takes this huge amount of time.
I just had a look at the Java code :)
We can discuss possibilities for compression / serialization :)
comparisons.stream().map(JPlagComparison::similarity).map(percent -> percent / 10).map(Float::intValue).map(index -> index == 10 ? 9 : index)
        .forEach(index -> similarityDistribution[index]++);
return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
Suggested change:
- return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
+ return calculateDistributionFor(comparisons, JPlagComparison::similarity);
private int[] calculateDistributionFor(List<JPlagComparison> comparisons, Function<JPlagComparison, Float> similarityExtractor) { |
Please turn this lambda into a 'normal' loop. From my point of view, it's not that easy to read.
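For reference, a loop-based version could look roughly like this (a sketch based on the stream snippet above; the ten-bucket array size is an assumption taken from the `index == 10 ? 9 : index` clamping):

```java
// Sketch of a loop-based calculateDistributionFor; the bucket layout follows the existing stream version.
private int[] calculateDistributionFor(List<JPlagComparison> comparisons, Function<JPlagComparison, Float> similarityExtractor) {
    int[] similarityDistribution = new int[10]; // assumed size, matching the ten buckets above
    for (JPlagComparison comparison : comparisons) {
        int index = (int) (similarityExtractor.apply(comparison) / 10); // 0..100% -> bucket 0..10
        if (index == 10) { // 100% falls into the last bucket
            index = 9;
        }
        similarityDistribution[index]++;
    }
    return similarityDistribution;
}
```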
private List<Long> readFileLines(File file) {
    List<Long> lineIndices = new ArrayList<>();
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
Maybe Scanner is faster, but I'm not sure.
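A Scanner-based variant would look roughly like this (a sketch assuming the existing getIndexOfLine helper is used to map each line to its index; whether it is actually faster would need measuring, since BufferedReader is usually at least as fast for plain line reading):

```java
// Sketch of the Scanner variant (java.util.Scanner); exception handling simplified.
private List<Long> readFileLines(File file) throws FileNotFoundException {
    List<Long> lineIndices = new ArrayList<>();
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            lineIndices.add(getIndexOfLine(scanner.nextLine()));
        }
    }
    return lineIndices;
}
```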
    List<Long> lineIndices = new ArrayList<>();
    try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
Why should we not simply store the start/end of a match? Why the string?
Afterwards, we can simply compress the submissions of the students, and maybe more if needed
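If plain zip compression is meant, a minimal sketch with java.util.zip could look like this (the method name and the "submissions/" path are illustrative only, not part of the PR):

```java
// Writes one submission file into the report archive; requires java.util.zip.ZipOutputStream,
// java.util.zip.ZipEntry and java.nio.file.Files.
private void writeSubmissionToZip(ZipOutputStream zipOut, File submissionFile) throws IOException {
    zipOut.putNextEntry(new ZipEntry("submissions/" + submissionFile.getName()));
    Files.copy(submissionFile.toPath(), zipOut);
    zipOut.closeEntry();
}
```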
I'm not really sure if I know what you mean. Are you asking why we are reading the code from the submission files and writing it to the report at all? If so, I do think this is the way to go to enable the UI to have as little complexity as possible.
Also, we only store the beginning and end of matches (see ComparisonReportMapper, line 101).
I want to include the submission code directly in the JSONs.
I'm not sure why the lookup table is needed.
I think:
a) storing the source code in the JSONs
b) saving the start/end lines of a match
should contain all the information we need. Why do we need the indirection (map)?
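Roughly what such a format could look like (a sketch; the record names are made up for illustration and not taken from the JPlag codebase):

```java
// Hypothetical DTOs for the proposed format: a submission file carries its plain
// source lines, while a match only references line numbers within those files.
public record SubmissionFileDto(String fileName, List<String> lines) {
}

public record MatchDto(String firstFile, String secondFile, int startLineFirst, int endLineFirst, int startLineSecond, int endLineSecond) {
}
```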
That's exactly the format I've switched to and I'm currently implementing: #507 (comment)
}

private Long getIndexOfLine(String line) {
    return lineLookUpTable.entrySet().stream().filter(entry -> Objects.equals(entry.getValue(), line)).map(Map.Entry::getKey).findFirst()
A reversed map may be a lot faster here.
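A sketch of that idea, assuming a second (reversed) map is kept as a field alongside the existing lookup table:

```java
// Reversed map from code line to index, so the lookup is O(1) instead of a linear scan
// over all entries; reversedLookUpTable and nextIndex are assumed fields, not existing code.
private final Map<String, Long> reversedLookUpTable = new HashMap<>();
private long nextIndex = 0;

private Long getIndexOfLine(String line) {
    // the original index -> line table could still be filled here if it is needed for serialization
    return reversedLookUpTable.computeIfAbsent(line, newLine -> nextIndex++);
}
```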
int endSecond = usesIndex ? endTokenSecond.getIndex() : endTokenSecond.getLine();
int tokens = match.getLength();

return new Match(startTokenFirst.getFile(), startTokenSecond.getFile(), startFirst, endFirst, startSecond, endSecond, tokens);
Simply storing this information should be enough if I store the student submissions as well, or am I wrong?
// their lines are unresolved, so they are only numbers until we look them up with the lineLookUpTable

for (FilesOfSubmission fileOfSubmissionActual : unresolvedLinesFilesActual) { // per mapped submissionFile
    var resolvedLinesOfFile = fileOfSubmissionActual.lines().stream().map(actualMapperResult.lineLookupTable()::get).toList(); // resolve
Formatter :)
jplag/src/test/java/de/jplag/reporting/reportobject/mapper/MetricMapperTest.java (resolved)
jplag/src/test/resources/mockito-extensions/org.mockito.plugins.MockMaker (resolved)
No no no, the 30 minutes are only there due to a bug. The real question discussed here is whether there should be one global lookup table per result or one lookup table per submission.
Ah ok :)
I would save the start/end info and not try to save lines of code in a map, or did I misunderstand anything? How you save the code depends on the task type. Still, I think I would also save the source code per submission; then you don't have to load everything for a comparison.
@dfuchss Thanks for the extensive feedback :) I will look into it in detail ASAP. As you saw, most of this is very bare-bones and mainly a proof of concept.
@sebinside I pushed a (very) quick fix for the serialisation issues. Can you have a look at it and run it?
I am currently trying another file format that does not use a lookup table at all: I realised that while it is convenient for the UI that the submissions are in the comparison model, it is redundant (even with a lookup approach). So the format I'm currently trying does not have submission files in the comparison model at all, but stores the submissions in
@nestabentum That looks very interesting, a good approach to the problem. I would argue that it's not that bad to have some level of complexity in the UI as long as the peak workload does not become higher, which is probably the case here. Unfortunately, I have some other deadlines, so I'm not sure whether I will be able to test out the new generation before our next meeting.
Kudos, SonarCloud Quality Gate passed!
Closed as there is now a cleaner version of this PR (#535).
Previous discussion about the new file format can be found here.
This PR builds upon my changes introduced in #487, which have yet to be merged. Therefore, in this PR only the changes introduced by commit efe795e are relevant.
I implemented an approach to reduce the report's file size: code lines in a comparison file are no longer saved as plain text, but as numbers. These numbers are indices into a lookup table.
This lookup table is global, i.e., it contains an index -> code line mapping for every code line across all report files. It is persisted in an additional file, lookupTable.json, and will have to be used by the report viewer to resolve code lines before displaying them.
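The resolving step on the viewer side would then look roughly like this (shown in Java for consistency with the other snippets in this thread; method and parameter names are illustrative, not the ones used in this PR):

```java
// Turns the stored indices back into plain code lines using the deserialized lookupTable.json.
private List<String> resolveLines(List<Long> lineIndices, Map<Long, String> lookupTable) {
    return lineIndices.stream().map(lookupTable::get).toList();
}
```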
@sebinside noted that a global lookup table could prove unmanageably large for large submission sets.
We will look into this concern and might change the format to one lookup table per submission.