Gzip battle logs before writing them #2733
Comments
I'd rather batch them all at once.
The problem is, unless you're caching them in memory, batch-compressing doesn't save you disk I/O. You might want to do both: gzip on generation, then have a nightly job that reads into memory and then compresses en masse.
I know, but it saves disk space, which is way more important. 90% compression is a lot better than 60% compression.
You should be able to do both, and my guess is that if you do it right it'd be more performant than just doing the nightly compression: I started gzipping my intermediate files for the Smogon Usage Stats scripts, not to save space, but because compressing and writing shorter files turned out to be faster than just writing the uncompressed files. Obviously you don't want to be compressing already-compressed files (although I wouldn't be surprised if one of the compression libraries handles that intelligently when it's used for both file-level and tar-level compression), but you shouldn't have to uncompress them to disk before recompressing. Not sure what libraries are out there for Node, but in Python you can do it easily enough. Let me know if you want me to put together a POC demonstration.
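As a concrete sketch of that point (Python standard library only, with a hypothetical directory layout; not the project's actual scripts), a nightly batch job could decompress each per-battle `.json.gz` in memory and re-bundle it into a single `.tar.gz`, never writing uncompressed data to disk:

```python
# Sketch of "batch later without touching disk": decompress in memory,
# re-compress into one archive in a single pass. Paths are hypothetical.
import gzip
import io
import tarfile
from pathlib import Path

def batch_compress(day_dir: Path, archive_path: Path) -> None:
    """Bundle a day's individually gzipped logs into one .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for log_path in sorted(day_dir.glob("*.log.json.gz")):
            # Decompress in memory only; no uncompressed file hits the disk.
            raw = gzip.decompress(log_path.read_bytes())
            info = tarfile.TarInfo(name=log_path.stem)  # drops the ".gz" suffix
            info.size = len(raw)
            archive.addfile(info, io.BytesIO(raw))

# Example: batch_compress(Path("logs/2017-02-01"), Path("logs/2017-02-01.tar.gz"))
```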
That sounds like it'd be significantly slower than the command-line utility...
You can try it if you'd like.
On second thought, this sounds like a good idea.
Update on this. I have long had the idea of using preset dictionaries to get the best of both worlds: a high compression ratio for ~16 KB files together with immediate compression. Now I've gotten around to doing some testing. I took my February battles dataset and used it for two different sets of tests: one dealing with the whole month's data, and the other with the OU tier only. For preset dictionaries, all the training was done on the corresponding January data. For the baseline and top compression-ratio targets, I found the following size reductions:
Now, enter preset dictionaries. The most viable implementation is rolling out a different dictionary on a month-by-month basis, trained on the previous month's data, to keep up with tier shifts, which change the metagame state and the common strings. There are two ergonomic tools for this as far as I can tell. One is brought to us by CloudFlare in Dictator, which generates Deflate dictionaries. The other is Facebook's built-in training mode for the Zstandard compression format, for which I have verified there is an appropriate Python package (compatible with Antar's side).
So, I propose using per-tier monthly dictionaries. This proposal lands exactly in between the 60% and 90% compression rates mentioned above, sitting at a ~75% rate, while still compressing every file on write. Accounting for the log-saving changes that would be required for speed and resilience, I have found Zstandard far easier to work with than Dictator+gzip. However, the project might still want to go with Dictator if Go is still on the roadmap. PS. Fuck markdown, why won't the tables render properly?
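As a rough sketch of what the per-tier monthly dictionaries could look like with the `zstandard` Python package referred to above (the directory layout, dictionary size, and compression level below are illustrative assumptions, not settled values):

```python
# Train a dictionary per tier on last month's logs, then compress each new log
# with it as soon as it is written. Paths and sizes are hypothetical.
import zstandard as zstd
from pathlib import Path

def train_tier_dictionary(prev_month_dir: Path,
                          dict_size: int = 110 * 1024) -> zstd.ZstdCompressionDict:
    """Train a dictionary for one tier from the previous month's raw logs."""
    samples = [p.read_bytes() for p in prev_month_dir.glob("*.log.json")]
    return zstd.train_dictionary(dict_size, samples)

def compress_log(raw_log: bytes, dict_data: zstd.ZstdCompressionDict) -> bytes:
    """Compress a single ~16 KB battle log immediately on write."""
    cctx = zstd.ZstdCompressor(level=3, dict_data=dict_data)
    return cctx.compress(raw_log)

# Example (hypothetical paths):
# ou_dict = train_tier_dictionary(Path("logs/2017-01/gen7ou"))
# blob = compress_log(Path("logs/2017-02/gen7ou/battle-gen7ou-1.log.json").read_bytes(), ou_dict)
```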
Worth noting that ext4 is often configured with a block size of 4 KB, which means:
Also, zstd compression is rather cheap as far as the CPU cost goes. This is likely worth it if we don't have too many files smaller than 4 KB. By the way, @Slayer95, you are missing
I have found that about 3.5% of the files are smaller than that.
The plan is to keep a
PS1. Also, I have calculated file size reductions using
PS2. I don't think this is kosher... It could at most be saved in a single 4 KB block, so it would be a saving of at most 1 KB. Really, on a per-file basis, the ~75%-plus file size reduction would only be accurate for 16 KB+ files (which, as it turns out, is also their average size; it seems the log size distribution has a long tail or something).
PS3. Thanks, @xfix!
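To make the 4 KB block-size point measurable, here is a small sketch (Python, hypothetical log directory) that compares apparent file sizes with the blocks actually allocated on disk; on ext4 a file smaller than one block still occupies a whole block, so `st_blocks` (counted in 512-byte units) is what matters, not `st_size`:

```python
from pathlib import Path

def block_size_report(log_dir: Path, block_size: int = 4096) -> None:
    stats = [p.stat() for p in log_dir.rglob("*.log.json")]
    if not stats:
        return
    sizes = [s.st_size for s in stats]
    on_disk = sum(s.st_blocks * 512 for s in stats)  # bytes actually allocated
    small = sum(1 for s in sizes if s < block_size)
    print(f"files: {len(sizes)}")
    print(f"under {block_size} B: {small} ({100 * small / len(sizes):.1f}%)")
    print(f"average apparent size: {sum(sizes) / len(sizes):.0f} B")
    print(f"apparent bytes: {sum(sizes)}, allocated on disk: {on_disk}")

# Example: block_size_report(Path("logs/2017-02"))
```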
I'm not sure the compression improvement over Gzip is necessarily worth the complication of using dictionaries?
Can we revisit this? From:
Originally posted by @Zarel in #2733 (comment)

It seems like you were open to gzipping on write (per file), which is fairly trivial to implement, and the new stats processing framework handles this already. I can do this as soon as the new stats processing framework has been verified, if that's the route you're OK with us taking. If we were going to compress at a higher level (directories per day or per month), I'd like us to consider using a compression format that is amenable to reading individual files without decompressing the entire thing (i.e. I'm pretty sure you need to decompress the entire archive to get at a single file).

As an aside, I still maintain that we should be paying for a cheap 'archival' file storage server (< $20/month on S3, but there are almost certainly better deals) just to serve as file storage, so space doesn't have to be a concern. In the past you mentioned "the problem is more, like, getting the logs from the server onto the data storage", but this should be fairly simply solved with just a file watcher?
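To make the random-access concern concrete, a minimal standard-library comparison (archive and member names are hypothetical): a `.zip` member is compressed independently and can be read on its own, while a `.tar.gz` is a single gzip stream that has to be decompressed up to the requested member.

```python
import tarfile
import zipfile

def read_from_zip(archive: str, member: str) -> bytes:
    with zipfile.ZipFile(archive) as zf:
        # Seeks to the member's entry; only that member is decompressed.
        return zf.read(member)

def read_from_targz(archive: str, member: str) -> bytes:
    with tarfile.open(archive, "r:gz") as tf:
        # Streams through the compressed archive until the member is found.
        return tf.extractfile(member).read()

# Example: read_from_zip("logs/2017-02-01.zip", "battle-gen7ou-12345.log.json")
```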
Yes, I'm open to gzipping on write; it was blocked on me not wanting to rewrite the stats processor, and also on "honestly, we should be using TokuDB or MyRocks or some other compressed write-optimized database instead". The only other issue is that we have a battle log search feature that depends on the data being uncompressed. I think the ideal solution is still the write-optimized database.
If we can merge the battle log and replay databases, that might be ideal. I've been considering that for a long time. Then uploading a replay could just be a matter of setting a flag to "visible".
It would not require a rewrite (though I did one anyway); it literally only requires a 1-3 line change in one of the files of the Python scripts (though I can definitely understand not wanting to dive into that codebase).
Having the source of truth for stats be a unified log + replay database is very straightforward (or at least, I will refactor my design a little bit to abstract out log listing/reading behind an interface).

Less clear is whether we'd want to use the same database for processed logs and analysis, but it definitely shouldn't be an issue to dedupe storage such that we no longer store the raw JSON logs both in text files and in the replay database. As for switching the source database to a 'compressed write-optimized database (that also has support for search?)', I'll punt on that discussion for now as well.

Anyway, once my stats processing is done we can decide whether we should continue writing flat files (in which case I'll add compression on write per this bug), or whether we want to just write directly to the replay database (in which case I'll write an adapter for that instead).
If we hook up to a replay database, side servers will have problems logging. So the logging code would have to support both.
@Zarel, I know you're concerned about both disk space and I/O bottlenecks. So why not gzip (or bzip2 or xz) each log before writing? It's not going to give you the best compression (compared to batch-compressing a bunch of logs at once), but I just tested gzipping a random log, and it reduced the file size from 5211 B to 1649 B (so ~3x?)
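For illustration only, a minimal sketch of the compress-on-write idea (in Python purely to show the shape of it; the server itself is Node, and the path below is hypothetical):

```python
import gzip

def write_log(path: str, log_text: str) -> None:
    # Level 6 matches the gzip command-line default speed/ratio trade-off.
    with gzip.open(path + ".gz", "wt", encoding="utf-8", compresslevel=6) as f:
        f.write(log_text)

# Example: write_log("logs/2017-02-01/battle-gen7ou-12345.log.json", log_text)
```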