Draft: Optimize report viewer dto size #507

Closed

Conversation

nestabentum
Contributor

Previous discussion about the new file format can be found here.

This PR builds upon my changes introduced in #487, which have yet to be merged. Therefore, only the changes introduced by commit efe795e are relevant in this PR.

I implemented an approach to reduce the report's file size: code lines in a comparison file are no longer saved as plain text, but as numbers. These numbers are indices into a lookup table.
This lookup table is global, i.e., it contains an index -> code line mapping for every code line across all report files. It is persisted in an additional file, lookupTable.json, and will have to be used by the report viewer to resolve code lines before displaying them.

@sebinside noted that a global lookup table could grow unwieldy for large submission sets.
We will look into this concern and might change the format to one lookup table per submission.
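The lookup-table encoding described above can be sketched roughly as follows. This is an illustrative sketch only; the class and method names are not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the lookup-table idea: every distinct code line gets a
// global index, and files store indices instead of the plain-text lines.
// Names are illustrative only, not the PR's actual classes.
class LineLookupEncoder {
    private final Map<String, Integer> indexByLine = new LinkedHashMap<>();

    /** Replaces each code line with its index, registering unseen lines on the fly. */
    List<Integer> encode(List<String> codeLines) {
        List<Integer> indices = new ArrayList<>();
        for (String line : codeLines) {
            indices.add(indexByLine.computeIfAbsent(line, key -> indexByLine.size()));
        }
        return indices;
    }

    /** The index -> code line mapping that would be persisted as lookupTable.json. */
    Map<Integer, String> lookupTable() {
        Map<Integer, String> table = new LinkedHashMap<>();
        indexByLine.forEach((line, index) -> table.put(index, line));
        return table;
    }
}
```

Duplicated lines (imports, braces, boilerplate) then cost one integer per occurrence instead of the full string, which is where the size reduction comes from.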

The max metric is now included in the Webreport DTO alongside the average metric.
@sebinside
Member

@nestabentum So, I might have good news: I tested the lookup-table version with 51 real submissions and the -n -1 flag, so that it generates all comparisons (about 1300 files in total). The old version generates about 23 MB of data, your version only about 7.5 MB. Also, the lookup table for the 51 submissions was only about 2 MB, so that should be fine.

Also, you might have already solved the bug with the lag in the UI anonymization; at least I did not encounter it this time. However, the problem in the generation is still there: while comparing took about 3 seconds, JPlag froze for about 15 minutes while saving the files, so I was unable to test it with larger sets. I will test it again with a bigger programming task (50 submissions again, but larger ones) and will let you know about the result.

@sebinside
Member

I repeated the test with larger submissions; this time the lookup table is around 3.5 MB. So, regarding file sizes, this could work. The only thing I'm currently worried about is that this table already has around 50,000 entries. I don't know if this can become a problem with larger submission sets. Once the saving problem is solved (this took around half an hour of frozen JPlag), I can test it with around 500 submissions.

Regarding the viewer, I did not encounter any problems. The initial loading of the 12 MB zip file took around 3 seconds; afterwards, everything was smooth. However, this might change when you add the resolving of indices to real code lines in the comparison view, so that's something to keep an eye on. Are there any other opinions? @tsaglam @dfuchss

@dfuchss
Member

dfuchss commented Jul 15, 2022

I'll have a look at it next week. 30 minutes is a very long time; we may need to discuss what takes this huge amount of time.

Member

@dfuchss dfuchss left a comment

I just had a look at the Java code :)

We can discuss possibilities for compression / serialization :)


comparisons.stream().map(JPlagComparison::similarity).map(percent -> percent / 10).map(Float::intValue).map(index -> index == 10 ? 9 : index)
.forEach(index -> similarityDistribution[index]++);
return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
Member

Suggested change
return calculateDistributionFor(comparisons, (JPlagComparison::similarity));
return calculateDistributionFor(comparisons, JPlagComparison::similarity);


private int[] calculateDistributionFor(List<JPlagComparison> comparisons, Function<JPlagComparison, Float> similarityExtractor) {
Member

Please turn this lambda into a 'normal' loop. From my point of view, it's not that easy to read.
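A plain-loop version of the distribution calculation might look like this. This is a sketch: it assumes similarities are percentages in [0, 100], and it uses a generic extractor so the snippet is self-contained rather than depending on JPlagComparison.

```java
import java.util.List;
import java.util.function.Function;

class SimilarityDistribution {
    /**
     * Buckets similarity percentages (0-100) into ten ranges, written as a
     * plain loop instead of a stream pipeline. Sketch only; the real method
     * operates on JPlagComparison.
     */
    static <T> int[] calculateDistributionFor(List<T> comparisons, Function<T, Float> similarityExtractor) {
        int[] distribution = new int[10];
        for (T comparison : comparisons) {
            int bucket = (int) (similarityExtractor.apply(comparison) / 10);
            if (bucket == 10) {
                bucket = 9; // a similarity of exactly 100% falls into the last bucket
            }
            distribution[bucket]++;
        }
        return distribution;
    }
}
```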


private List<Long> readFileLines(File file) {
List<Long> lineIndices = new ArrayList<>();
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
Member

Maybe a Scanner is faster, but I'm not sure.

List<Long> lineIndices = new ArrayList<>();
try (BufferedReader bufferedReader = new BufferedReader(new FileReader(file))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
Member

Why should we not simply store the start/end of a match? Why the string?

Member

Afterwards, we can simply compress the submissions of the students, and maybe more if needed

Contributor Author

I'm not really sure I know what you mean. Are you asking why we are reading the code from the submission files and writing it to the report at all? If so, I do think this is the way to go to keep the UI's complexity as low as possible.

Also, we only store the beginning and end of matches (see ComparisonReportMapper line 101).

Member

I want to include the submission code directly in the JSONs.
I'm not sure why the lookup table is needed.

I think:
a) storing the source code in the JSONs
b) saving the start/end lines of a match

should contain the information we need. Why do we need the indirection (map)?

Contributor Author

That's exactly the format I've switched to and am currently implementing: #507 (comment)

}

private Long getIndexOfLine(String line) {
return lineLookUpTable.entrySet().stream().filter(entry -> Objects.equals(entry.getValue(), line)).map(Map.Entry::getKey).findFirst()
Member

A reversed map may be a lot faster here.
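The suggestion amounts to maintaining a second map in the opposite direction, so the index lookup becomes a constant-time map access instead of a linear scan over all entries. An illustrative sketch, not the PR's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reversed-map suggestion: alongside index -> line, keep
// line -> index, so getIndexOfLine is a constant-time map access instead
// of streaming over every entry. Names are illustrative only.
class LineLookUpTable {
    private final Map<Long, String> lineByIndex = new HashMap<>();
    private final Map<String, Long> indexByLine = new HashMap<>(); // reversed map

    void put(long index, String line) {
        lineByIndex.put(index, line);
        indexByLine.put(line, index);
    }

    Long getIndexOfLine(String line) {
        return indexByLine.get(line); // replaces the entrySet().stream().filter(...) scan
    }
}
```

The trade-off is roughly double the in-memory footprint of the table while the report is being written, in exchange for O(1) lookups per code line.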

int endSecond = usesIndex ? endTokenSecond.getIndex() : endTokenSecond.getLine();
int tokens = match.getLength();

return new Match(startTokenFirst.getFile(), startTokenSecond.getFile(), startFirst, endFirst, startSecond, endSecond, tokens);
Member

Simply storing this information should be enough if we store the student submissions as well, or am I wrong?

// their lines are unresolved, so they are only numbers until we look them up in the lineLookUpTable

for (FilesOfSubmission fileOfSubmissionActual : unresolvedLinesFilesActual) { // per mapped submissionFile
var resolvedLinesOfFile = fileOfSubmissionActual.lines().stream().map(actualMapperResult.lineLookupTable()::get).toList(); // resolve
Member

Formatter :)

@sebinside
Member

> I'll have a look at it next week. 30 min are a very long time. We may need to discuss what takes this huge amount of time.

No no no, the 30 minutes are only there due to a bug. The real question discussed here is whether there should be one global lookup table per result or one lookup table per submission.

@dfuchss
Member

dfuchss commented Jul 16, 2022

> I'll have a look at it next week. 30 min are a very long time. We may need to discuss what takes this huge amount of time.
>
> No no no, the 30 minutes are only there due to a bug - the real question discussed here is whether there should be one global lookup table per result or one lookup table per submission

Ah ok :)

@dfuchss
Member

dfuchss commented Jul 16, 2022

I would save the start/end info and not try to save lines of code in a map, or did I misunderstand something?

How you save the code depends on the task type. Still, I think I would also save the source code per submission; then you don't have to load everything for a comparison.

@nestabentum
Contributor Author

@dfuchss Thanks for the extensive feedback :) I will look into it in detail ASAP. As you saw, most of this is very barebones and mainly a proof of concept.

@nestabentum
Contributor Author

@sebinside I pushed a (very) quick fix for the serialisation issues. Can you have a look at it and run it?

@nestabentum
Contributor Author

I am currently trying another file format that does not use a lookup table at all:
ImmediateSerialization.zip

I realised that while it is convenient for the UI to have the submissions in the comparison model, it is redundant (even with a lookup approach). So the format I'm currently trying does not have submission files in the comparison model at all, but stores the submissions in Submission/<SubmissionName>/.
This again shifts complexity into the UI but reduces the memory footprint even more.

@tsaglam tsaglam added enhancement Issue/PR that involves features, improvements and other changes minor Minor issue/feature/contribution/change report-viewer PR / Issue deals (partly) with the report viewer and thus involves web-dev technologies labels Jul 25, 2022
@tsaglam tsaglam added this to the v4.0.0 milestone Jul 25, 2022
@sebinside
Member

@nestabentum That looks very interesting, a good approach to the problem. I would argue that it's not that bad to have some level of complexity in the UI as long as the peak workload does not become higher, which is probably the case here.

Unfortunately, I have some other deadlines, so I'm not sure whether I will be able to test the new generation before our next meeting.

@nestabentum nestabentum force-pushed the optimize_reportViewer_dto_size_new branch 3 times, most recently from 8e43f9f to 718b9c7 Compare July 28, 2022 08:37
@nestabentum nestabentum force-pushed the optimize_reportViewer_dto_size_new branch from 718b9c7 to e2ef33b Compare July 28, 2022 08:38
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

- 0 Bugs
- 0 Vulnerabilities
- 0 Security Hotspots
- 13 Code Smells
- No coverage information
- 0.0% duplication

@nestabentum nestabentum changed the title Optimize report viewer dto size Draft: Optimize report viewer dto size Jul 28, 2022
@nestabentum
Contributor Author

Closed as there is now a cleaner version of this PR (#535)
