Gzip battle logs before writing them #2733
Comments
I'd rather batch them all at once.
The problem is, unless you're caching them in memory, batch-compressing doesn't save you disk I/O. You might want to do both: gzip on generation, then have a nightly job that reads into memory and then compresses en masse.
I know, but it saves disk space, which is way more important. 90% compression is a lot better than 60% compression.
You should be able to do both, and my guess is that if you do it right it'd be more performant than just doing the nightly compression: I started gzipping my intermediate files for the Smogon Usage Stats scripts, not to save space, but because compressing and writing shorter files turned out to be faster than just writing the uncompressed files. Obviously you don't want to be compressing already-compressed files (although I wouldn't be surprised if one of the compression libraries handles that intelligently when it's used for both file-level and tar-level compression), but you shouldn't have to uncompress them to disk before recompressing. Not sure what libraries are out there for Node, but in Python you can do it easily enough. Let me know if you want me to put together a POC demonstration.
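As a concrete sketch of that point (Python standard library only, with a hypothetical directory layout; not the project's actual scripts), a nightly batch job could decompress each per-battle `.json.gz` in memory and re-bundle it into a single `.tar.gz`, never writing uncompressed data to disk:

```python
# Sketch of "batch later without touching disk": decompress in memory,
# re-compress into one archive in a single pass. Paths are hypothetical.
import gzip
import io
import tarfile
from pathlib import Path

def batch_compress(day_dir: Path, archive_path: Path) -> None:
    """Bundle a day's individually gzipped logs into one .tar.gz archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for log_path in sorted(day_dir.glob("*.log.json.gz")):
            # Decompress in memory only; no uncompressed file hits the disk.
            raw = gzip.decompress(log_path.read_bytes())
            info = tarfile.TarInfo(name=log_path.stem)  # drops the ".gz" suffix
            info.size = len(raw)
            archive.addfile(info, io.BytesIO(raw))

# Example: batch_compress(Path("logs/2017-02-01"), Path("logs/2017-02-01.tar.gz"))
```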
That sounds like it'd be significantly slower than the command-line utility...
You can try it if you'd like.
On second thought, this sounds like a good idea.
Update on this. I have long had the idea of using preset dictionaries to get the best of both worlds: a high compression ratio for ~16 KB files together with immediate compression. Now I've gotten around to doing some testing. I took my February battles dataset and used it for two different sets of tests: one dealing with the whole month's data, and the other with the OU tier only. For preset dictionaries, all the training was done on the corresponding January data. For the baseline and top compression-ratio targets, I found the following size reductions:
Now, enter preset dictionaries. The most viable implementation is rolling out a different dictionary on a month-by-month basis, trained on the previous month's data, to keep up with tier shifts, which change the metagame state and the common strings. There are two ergonomic tools for this as far as I can tell. One is brought to us by CloudFlare in Dictator, which generates Deflate dictionaries. The other is Facebook's built-in training mode for the Zstandard compression format, for which I have verified there is an appropriate Python package (compatible with Antar's side).
So, I propose using per-tier monthly dictionaries. This proposal lands exactly in between the 60% and 90% compression rates mentioned above, sitting at a ~75% rate, while still compressing every file on write. Accounting for the log-saving changes that would be required for speed and resilience, I have found Zstandard far easier to work with than Dictator+gzip. However, the project might still want to go with Dictator if Go is still on the roadmap. PS. Fuck markdown, why won't the tables render properly?
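As a rough sketch of what the per-tier monthly dictionaries could look like with the `zstandard` Python package referred to above (the directory layout, dictionary size, and compression level below are illustrative assumptions, not settled values):

```python
# Train a dictionary per tier on last month's logs, then compress each new log
# with it as soon as it is written. Paths and sizes are hypothetical.
import zstandard as zstd
from pathlib import Path

def train_tier_dictionary(prev_month_dir: Path,
                          dict_size: int = 110 * 1024) -> zstd.ZstdCompressionDict:
    """Train a dictionary for one tier from the previous month's raw logs."""
    samples = [p.read_bytes() for p in prev_month_dir.glob("*.log.json")]
    return zstd.train_dictionary(dict_size, samples)

def compress_log(raw_log: bytes, dict_data: zstd.ZstdCompressionDict) -> bytes:
    """Compress a single ~16 KB battle log immediately on write."""
    cctx = zstd.ZstdCompressor(level=3, dict_data=dict_data)
    return cctx.compress(raw_log)

# Example (hypothetical paths):
# ou_dict = train_tier_dictionary(Path("logs/2017-01/gen7ou"))
# blob = compress_log(Path("logs/2017-02/gen7ou/battle-gen7ou-1.log.json").read_bytes(), ou_dict)
```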
Worth noting that ext4 is often configured with a block size of 4 KB, which means:
Also, zstd compression is rather cheap as far as the CPU cost goes. This is likely worth it if we don't have too many files smaller than 4 KB. By the way, @Slayer95, you are missing
I have found that about 3.5% of the files are smaller than that.
The plan is to keep a
PS1. Also, I have calculated file size reductions using
PS2. I don't think this is kosher... It could at most be saved in a single 4 KB block, so it would be a saving of at most 1 KB. Really, on a per-file basis, the ~75%-plus file size reduction would only be accurate for 16 KB+ files (which, as it turns out, is also their average size; it seems the log size distribution has a long tail or something).
PS3. Thanks, @xfix!
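To make the 4 KB block-size point measurable, here is a small sketch (Python, hypothetical log directory) that compares apparent file sizes with the blocks actually allocated on disk; on ext4 a file smaller than one block still occupies a whole block, so `st_blocks` (counted in 512-byte units) is what matters, not `st_size`:

```python
from pathlib import Path

def block_size_report(log_dir: Path, block_size: int = 4096) -> None:
    stats = [p.stat() for p in log_dir.rglob("*.log.json")]
    if not stats:
        return
    sizes = [s.st_size for s in stats]
    on_disk = sum(s.st_blocks * 512 for s in stats)  # bytes actually allocated
    small = sum(1 for s in sizes if s < block_size)
    print(f"files: {len(sizes)}")
    print(f"under {block_size} B: {small} ({100 * small / len(sizes):.1f}%)")
    print(f"average apparent size: {sum(sizes) / len(sizes):.0f} B")
    print(f"apparent bytes: {sum(sizes)}, allocated on disk: {on_disk}")

# Example: block_size_report(Path("logs/2017-02"))
```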
I'm not sure the compression improvement over Gzip is necessarily worth the complication of using dictionaries?
Can we revisit this? From:
Originally posted by @Zarel in #2733 (comment)

It seems like you were open to gzipping on write (per file), which is fairly trivial to implement, and the new stats processing framework handles this already. I can do this as soon as the new stats processing framework has been verified, if that's the route you're OK with us taking. If we were going to compress at a higher level (directories per day or per month), I'd like us to consider using a compression format that is amenable to reading individual files without decompressing the entire thing (i.e. I'm pretty sure you need to decompress the entire archive to get at a single file).

As an aside, I still maintain that we should be paying for a cheap 'archival' file storage server (< $20/month on S3, but there are almost certainly better deals) just to serve as file storage, so space doesn't have to be a concern. In the past you mentioned "the problem is more, like, getting the logs from the server onto the data storage", but this should be fairly simply solved with just a file watcher?
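To make the random-access concern concrete, a minimal standard-library comparison (archive and member names are hypothetical): a `.zip` member is compressed independently and can be read on its own, while a `.tar.gz` is a single gzip stream that has to be decompressed up to the requested member.

```python
import tarfile
import zipfile

def read_from_zip(archive: str, member: str) -> bytes:
    with zipfile.ZipFile(archive) as zf:
        # Seeks to the member's entry; only that member is decompressed.
        return zf.read(member)

def read_from_targz(archive: str, member: str) -> bytes:
    with tarfile.open(archive, "r:gz") as tf:
        # Streams through the compressed archive until the member is found.
        return tf.extractfile(member).read()

# Example: read_from_zip("logs/2017-02-01.zip", "battle-gen7ou-12345.log.json")
```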
Yes, I'm open to gzipping on write; it was blocked on me not wanting to rewrite the stats processor, and also on "honestly, we should be using TokuDB or MyRocks or some other compressed write-optimized database instead". The only other issue is that we have a battle log search feature that depends on the data being uncompressed. I think the ideal solution is still the write-optimized database.
If we can merge the battle log and replay databases, that might be ideal. I've been considering that for a long time. Then uploading a replay could just be a matter of setting a flag to "visible".
It would not require a rewrite (though I did one anyway); it literally only requires a 1-3 line change in one of the files of the Python scripts (though I can definitely understand not wanting to dive into that codebase).
Having the source of truth for stats be a unified log + replay database is very straightforward (or at least, I will refactor my design a little bit to abstract out log listing/reading behind an interface).

Less clear is whether we'd want to use the same database for processed logs and analysis, but it definitely shouldn't be an issue to dedupe storage such that we no longer store the raw JSON logs both in text files and in the replay database. As for switching the source database to a 'compressed write-optimized database (that also has support for search?)', I'll punt on that discussion for now as well.

Anyway, once my stats processing is done we can decide whether we should continue writing flat files (in which case I'll add compression on write per this bug), or whether we want to just write directly to the replay database (in which case I'll write an adapter for that instead).
If we hook up to a replay database, side servers will have problems logging. So the logging code would have to support both.
@Zarel, I know you're concerned about both disk space and I/O bottlenecks. So why not gzip (or bzip2 or xz) each log before writing? It's not going to give you the best compression (compared to batch-compressing a bunch of logs at once), but I just tested gzipping a random log, and it reduced the file size from 5211 B to 1649 B (so ~3x?)
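For illustration only, a minimal sketch of the compress-on-write idea (in Python purely to show the shape of it; the server itself is Node, and the path below is hypothetical):

```python
import gzip

def write_log(path: str, log_text: str) -> None:
    # Level 6 matches the gzip command-line default speed/ratio trade-off.
    with gzip.open(path + ".gz", "wt", encoding="utf-8", compresslevel=6) as f:
        f.write(log_text)

# Example: write_log("logs/2017-02-01/battle-gen7ou-12345.log.json", log_text)
```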