Draft: Optimize report viewer dto size #507

Closed

Conversation

nestabentum
Contributor

Previous discussion about the new file format can be found here.

This PR builds upon my changes introduced in #487, which have yet to be merged. Therefore, only the changes introduced by commit efe795e are relevant in this PR.

I implemented an approach to reduce the report's file size: code lines in a comparison file are no longer saved as plain text, but as numbers. These numbers are indices into a lookup table.
This lookup table is global, i.e., it contains an index -> code line mapping for every code line across all report files. It is persisted in an additional file, lookupTable.json, and will have to be used by the report viewer to resolve code lines before displaying them.

@sebinside noted that a global lookup table could grow unwieldy for large submission sets.
We will look into this concern and might change the format to one lookup table per submission.
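The lookup-table encoding described above can be sketched roughly as follows. This is an illustrative sketch only; the class and method names are not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the lookup-table idea: every distinct code line gets a
// global index, and files store indices instead of the plain-text lines.
// Names are illustrative only, not the PR's actual classes.
class LineLookupEncoder {
    private final Map<String, Integer> indexByLine = new LinkedHashMap<>();

    /** Replaces each code line with its index, registering unseen lines on the fly. */
    List<Integer> encode(List<String> codeLines) {
        List<Integer> indices = new ArrayList<>();
        for (String line : codeLines) {
            indices.add(indexByLine.computeIfAbsent(line, key -> indexByLine.size()));
        }
        return indices;
    }

    /** The index -> code line mapping that would be persisted as lookupTable.json. */
    Map<Integer, String> lookupTable() {
        Map<Integer, String> table = new LinkedHashMap<>();
        indexByLine.forEach((line, index) -> table.put(index, line));
        return table;
    }
}
```

Duplicated lines (imports, braces, boilerplate) then cost one integer per occurrence instead of the full string, which is where the size reduction comes from.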

The max metric is now included in the Webreport DTO alongside the average metric.
@sebinside
Member

@nestabentum So, I might have good news: I tested the lookup-table version with 51 real submissions and the -n -1 flag, so that it generates all comparisons (about 1300 files in total). The old version generates about 23 MB of data, your version only about 7.5 MB. Also, the lookup table for the 51 submissions was only about 2 MB, so that should be fine.

Also, you might have already solved the bug with the lag in the UI anonymization; at least I did not encounter it this time. However, the problem in the generation is still there: while comparing took about 3 seconds, JPlag froze for about 15 minutes while saving the files, so I was unable to test it with larger sets. I will test it again with a bigger programming task (50 submissions again, but larger ones) and will let you know about the result.

@sebinside
Member

I repeated the test with larger submissions; this time the lookup table is around 3.5 MB. So, regarding file sizes, this could work. The only thing I'm currently worried about is that this table already has around 50,000 entries. I don't know if this can become a problem with larger submission sets. Once the saving problem is solved (this took around half an hour of frozen JPlag), I can test it with around 500 submissions.

Regarding the viewer, I did not encounter any problems. The initial loading of the 12 MB zip file took around 3 seconds; afterwards, everything was smooth. However, this might change when you add the resolving of indices to real code lines in the comparison view, so that's something to keep an eye on. Are there any other opinions? @tsaglam @dfuchss

@dfuchss
Member

dfuchss commented Jul 15, 2022

I'll have a look at it next week. 30 minutes is a very long time; we may need to discuss what takes this huge amount of time.

Member

@dfuchss dfuchss left a comment

I just had a look at the Java code :)

We can discuss possibilities for compression / serialization :)


comparisons.stream().map(JPlagComparison::similarity).map(percent -> percent / 10).map(Float::intValue).map(index -> index == 10 ? 9 : index)
.forEach(index -> similarityDistribution[index]++);
return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
Member

Suggested change
return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
return calculateDistributionFor(comparisons, JPlagComparison::similarity);


private int[] calculateDistributionFor(List<JPlagComparison> comparisons, Function<JPlagComparison, Float> similarityExtractor) {
Member

Please turn this lambda into a 'normal' loop. From my point of view, it's not that easy to read.
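A plain-loop version of the distribution calculation might look like this. This is a sketch: it assumes similarities are percentages in [0, 100], and it uses a generic extractor so the snippet is self-contained rather than depending on JPlagComparison.

```java
import java.util.List;
import java.util.function.Function;

class SimilarityDistribution {
    /**
     * Buckets similarity percentages (0-100) into ten ranges, written as a
     * plain loop instead of a stream pipeline. Sketch only; the real method
     * operates on JPlagComparison.
     */
    static <T> int[] calculateDistributionFor(List<T> comparisons, Function<T, Float> similarityExtractor) {
        int[] distribution = new int[10];
        for (T comparison : comparisons) {
            int bucket = (int) (similarityExtractor.apply(comparison) / 10);
            if (bucket == 10) {
                bucket = 9; // a similarity of exactly 100% falls into the last bucket
            }
            distribution[bucket]++;
        }
        return distribution;
    }
}
```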


private List<Long> readFileLines(File file) {
List<Long> lineIndices = new ArrayList<>();
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
Member

Maybe a Scanner is faster, but I'm not sure.

List<Long> lineIndices = new ArrayList<>();
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
Member

Why should we not simply store the start/end of a match? Why the string?

Member

Afterwards, we can simply compress the submissions of the students, and maybe more if needed

Contributor Author

I'm not really sure I know what you mean. Are you asking why we are reading the code from the submission files and writing it to the report at all? If so, I do think this is the way to go to keep the UI's complexity as low as possible.

Also, we only store the beginning and end of matches (see ComparisonReportMapper line 101).

Member

I want to include the submission code directly in the JSONs.
I'm not sure why the lookup table is needed.

I think:
a) storing the source code in the JSONs
b) saving the start/end lines of a match

should contain the information we need. Why do we need the indirection (map)?

Contributor Author

That's exactly the format I've switched to and am currently implementing: #507 (comment)

}

private Long getIndexOfLine(String line) {
return lineLookUpTable.entrySet().stream().filter(entry -> Objects.equals(entry.getValue(), line)).map(Map.Entry::getKey).findFirst()
Member

A reversed map may be a lot faster here.
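The suggestion amounts to maintaining a second map in the opposite direction, so the index lookup becomes a constant-time map access instead of a linear scan over all entries. An illustrative sketch, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reversed-map suggestion: alongside index -> line, keep
// line -> index, so getIndexOfLine is a constant-time map access instead
// of streaming over every entry. Names are illustrative only.
class LineLookUpTable {
    private final Map<Long, String> lineByIndex = new HashMap<>();
    private final Map<String, Long> indexByLine = new HashMap<>(); // reversed map

    void put(long index, String line) {
        lineByIndex.put(index, line);
        indexByLine.put(line, index);
    }

    Long getIndexOfLine(String line) {
        return indexByLine.get(line); // replaces the entrySet().stream().filter(...) scan
    }
}
```

The trade-off is roughly double the in-memory footprint of the table while the report is being written, in exchange for O(1) lookups per code line.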

int endSecond = usesIndex ? endTokenSecond.getIndex() : endTokenSecond.getLine();
int tokens = match.getLength();

return new Match(startTokenFirst.getFile(), startTokenSecond.getFile(), startFirst, endFirst, startSecond, endSecond, tokens);
Member

Simply storing this information should be enough if we store the student submissions as well, or am I wrong?

// their lines are unresolved, so they are only numbers until we look them up in the lineLookUpTable

for (FilesOfSubmission fileOfSubmissionActual : unresolvedLinesFilesActual) { // per mapped submissionFile
var resolvedLinesOfFile = fileOfSubmissionActual.lines().stream().map(actualMapperResult.lineLookupTable()::get).toList(); // resolve
Member

Formatter :)

@sebinside
Member

> I'll have a look at it next week. 30 min are a very long time. We may need to discuss what takes this huge amount of time.

No no no, the 30 minutes are only there due to a bug. The real question discussed here is whether there should be one global lookup table per result or one lookup table per submission.

@dfuchss
Member

dfuchss commented Jul 16, 2022

> I'll have a look at it next week. 30 min are a very long time. We may need to discuss what takes this huge amount of time.
>
> No no no, the 30 minutes are only there due to a bug - the real question discussed here is whether there should be one global lookup table per result or one lookup table per submission

Ah ok :)

@dfuchss
Member

dfuchss commented Jul 16, 2022

I would save the start/end info and not try to save lines of code in a map, or did I misunderstand something?

How you save the code depends on the task type. Still, I think I would also save the source code per submission; then you don't have to load everything for a comparison.

@nestabentum
Contributor Author

@dfuchss Thanks for the extensive feedback :) I will look into it in detail ASAP. As you saw, most of this is very barebones and mainly a proof of concept.

@nestabentum
Contributor Author

@sebinside I pushed a (very) quick fix for the serialisation issues. Can you have a look at it and run it?

@nestabentum
Contributor Author

I am currently trying another file format that does not use a lookup table at all:
ImmediateSerialization.zip

I realised that while it is convenient for the UI to have the submissions in the comparison model, it is redundant (even with a lookup approach). So the format I'm currently trying does not have submission files in the comparison model at all, but stores the submissions in Submission/<SubmissionName>/.
This again shifts complexity into the UI but reduces the memory footprint even more.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change report-viewer PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies labels Jul 25, 2022
@tsaglam tsaglam added this to the v4.0.0 milestone Jul 25, 2022
@sebinside
Member

@nestabentum That looks very interesting, a good approach to the problem. I would argue that it's not that bad to have some level of complexity in the UI as long as the peak workload does not become higher, which is probably the case here.

Unfortunately, I have some other deadlines, so I'm not sure whether I will be able to test the new generation before our next meeting.

@nestabentum nestabentum force-pushed the optimize_reportViewer_dto_size_new branch 3 times, most recently from 8e43f9f to 718b9c7 Compare July 28, 2022 08:37
@nestabentum nestabentum force-pushed the optimize_reportViewer_dto_size_new branch from 718b9c7 to e2ef33b Compare July 28, 2022 08:38
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

- 0 Bugs
- 0 Vulnerabilities
- 0 Security Hotspots
- 13 Code Smells
- No coverage information
- 0.0% duplication

@nestabentum nestabentum changed the title Optimize report viewer dto size Draft: Optimize report viewer dto size Jul 28, 2022
@nestabentum
Contributor Author

Closed as there is now a cleaner version of this PR (#535)
