Description
I think it would be a good idea to produce a reference result for each of the benchmarks, so that implementations can verify their correctness. I have already found bugs in several implementations that a reference result would probably have caught.
One problem that I see is that it is not clear how to compare results. First, the benchmark does not specify what "plot" means, i.e., how to configure the histograms. As far as I have seen, the different implementations largely use the same configuration, but there are exceptions: for example, the Go and the Coffea implementations use different configurations in Task 8. I think it would be a good idea to specify the histogram configuration of each task in the benchmark itself.
Second, different tools serialize their histograms differently. While Groot and Coffea give the lower bound of each bin, ROOT gives the bin centers. Also, the former two store the underflow and overflow separately, while ROOT has two extra bins for them. This has an easy solution: pick one convention, say bin centers plus two extra bins, and have the implementations convert to it as part of the comparison.
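To make the conversion concrete, here is a minimal sketch in Python of what I have in mind. The names `CanonicalHistogram`, `from_lower_edges`, and `histograms_match` are hypothetical and not part of any of the tools. I also assume that a lower-bound serialization gives us the upper edge of the last bin, since it is needed to compute that bin's center, and that a small floating-point tolerance is acceptable when comparing.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CanonicalHistogram:
    """Hypothetical common format: bin centers plus two extra bins."""
    centers: List[float]   # centers of the regular bins only
    contents: List[float]  # underflow, regular bins..., overflow


def from_lower_edges(lower_edges: Sequence[float], upper_edge: float,
                     counts: Sequence[float], underflow: float,
                     overflow: float) -> CanonicalHistogram:
    """Convert a 'lower bounds + separate under/overflow' histogram.

    Assumes `upper_edge` (upper bound of the last bin) is available.
    """
    if len(lower_edges) != len(counts):
        raise ValueError("need exactly one count per lower edge")
    edges = list(lower_edges) + [upper_edge]
    # The center of each bin is the midpoint of its two edges.
    centers = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    return CanonicalHistogram(centers, [underflow, *counts, overflow])


def histograms_match(a: CanonicalHistogram, b: CanonicalHistogram,
                     rel_tol: float = 1e-9, abs_tol: float = 1e-12) -> bool:
    """Compare two canonical histograms bin by bin, within a tolerance."""
    if len(a.centers) != len(b.centers) or len(a.contents) != len(b.contents):
        return False
    pairs = list(zip(a.centers, b.centers)) + list(zip(a.contents, b.contents))
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
               for x, y in pairs)


# Made-up numbers, purely for illustration.
converted = from_lower_edges([0.0, 1.0, 2.0], 3.0, [5, 7, 2],
                             underflow=1, overflow=4)
# A histogram that is already in the canonical (centers + two extra bins) form.
reference = CanonicalHistogram([0.5, 1.5, 2.5], [1, 5, 7, 2, 4])
print(histograms_match(converted, reference))  # True
```

With something like this, a histogram that already comes as bin centers plus two extra bins needs no conversion, and only the lower-bound style output goes through `from_lower_edges` before the comparison.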