-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrating Modlog to Postgres #6469
Comments
@scheibo So I finally had time to work on database migrations, and talked them over with JetOU. JetOU thought that migrating to SQLite was unnecessary and he wanted to migrate to Postgres instead; I said I was tentatively okay with it, but it had to be done on Modlog and not Punishments because I didn't want Punishments on Postgres. |
I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao). What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...
Yeah... this effectively is getting us pretty much zero value over flat files and would in fact likely be slower. If you actually want to improve modlog performance you want to add a better index than just room id and actually use it, though this is going to probably involve changing how the modlog search command is structured
Why? |
I told him that, but he said it was just nicer to have all the login and sim stuff in a centralized database, which I guess I don't disagree with. Would be nice to have modlog in userpages.
I didn't see that. Yeah, that query isn't going to be fast. Maybe start off with some username queries?
I preferred a migration that would make more of a difference. I think Punishments on Postgres might actually make it slower (because it's going over the network) rather than being local. |
My intuition is that SQLite will be faster than Postgres for a database of this size. The reason is that a Postgres query always involves inter-process communication (incl. serialization overheads), whereas SQLite operates in-process. Other advantages of SQLite over Postgres include:
Some of this would be mitigated by having pluggable persistence but then you run into the issue of potentially running different code in testing than in production. As for the schema/query
|
Note that the Postgres query would be over-network (it would be hosted on a different server). It's more than just IPC overhead. |
Also note that jetou said he didn't care about "better cross-platform story" because he planned to continue maintaining flat files for side servers. |
Yes, I do plan to make it configurable, but you still need to write code to support the option. Also, RE: SQLite vs Postgres The main reason I want Postgres, is as Zarel said, to share the same database with the client. Though over network doesn't sound too nice, so I might have to concede.. OK, schema and query, take two! Schema: CREATE TABLE modlog (
-- UNIX timestamp
timestamp timestampz NOT NULL,
room_id TEXT NOT NULL,
-- ROOMBAN, NOTE etc.
action TEXT NOT NULL,
user_id TEXT,
autoconfirmed TEXT,
alts TEXT[],
ip TEXT,
-- Naming might be a bit poor here
logged_by TEXT,
note TEXT
);
CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note)); Query: SELECT *
FROM modlog
WHERE
room_id = $1 AND $2 = ANY(alts)
AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4 NOTE: This will make the modlog command's syntax be: |
Do Postgres/SQLite support fulltext search? If so, maybe we could have a text column containing all the parameters, and then fulltext search that? |
Why is Postgres still recommending 32-bit ints in literal 2020? |
It does, I'm using it for
Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for |
Yeah, that's what I was thinking, but I'm not a database person. I'd want @scheibo or @monsanto to weigh in on whether or not that's a good idea. |
Bump. |
I'm going to work on this. I agree with chaos that we should use SQLite rather than Postgres for this; less overhead is good here. @thejetou, I know you worked on this. Did you have some WIP design that you would be okay with sharing, besides the schema below?
|
tbh the schema looks ok to me - AFAIK There's also the question of whether the Also, I know JetOU wanted to continue maintaining flat files, which is okay I guess, but I feel like sqlite isn't that much more work for side servers (since, unlike Postgres, it doesn't require maintaining a separate service). |
@AnnikaCodes I don't think this PR reflects the current state of jetou's work. From our discussions:
I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too. |
What is
I was thinking of a separate boolean (or rather tinyint since SQLite doesn't support real booleans) to represent whether a modlog entry should be in the global modlog (since something can be in a room modlog and the global modlog). The Other than that jetou's last setup looks good.
Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally? |
I believe so
I'd add a
Whatever you would feel is most helpful to you |
@AnnikaCodes - https://github.com/TheJetOU/Pokemon-Showdown/tree/modlog-sqlite IDK how well up to date this with PS code since I wrote it four months ago.
Yes. Thanks for carrying my will forward and good luck! :). |
Thanks! That looks like a good start. Like I said, I'd prefer to stop supporting FS modlogs because of the extra maintenance burden. (Also, I'd rather make modlog writes take an object with optional properties to avoid the @monsanto - you meant |
Yes
This is just how you store arrays in SQL.
Use an |
So far this is what I have for a query (alternately with SELECT *, (SELECT group_concat(userid, ',') FROM alts WHERE entryid = modlog.entryid) as alts
FROM modlog
WHERE
roomid IN ($1, $2) AND
action = $3 AND
(
userid REGEXP $4 OR
autoconfirmed_userid REGEXP $4 OR
EXISTS(SELECT * FROM alts WHERE entryid = modlog.entryid AND userid REGEXP $4)
) AND
ip REGEXP $5 AND
action_taker REGEXP $6 AND
note REGEXP $7
ORDER BY timestamp DESC
LIMIT $8 And for a schema: CREATE TABLE modlog (
entryid INTEGER NOT NULL PRIMARY KEY,
timestamp INTEGER NOT NULL,
roomid TEXT NOT NULL,
action TEXT NOT NULL,
visual_roomid TEXT,
action_taker TEXT,
userid TEXT,
autoconfirmed_userid TEXT,
ip TEXT,
note TEXT
);
CREATE TABLE alts (
entryid INTEGER NOT NULL,
userid TEXT NOT NULL
);
CREATE INDEX ml_index_1 ON modlog(timestamp);
CREATE INDEX ml_index_2 ON modlog(roomid, timestamp);
CREATE INDEX ml_index_3 ON modlog(action, timestamp);
CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp);
CREATE INDEX ml_index_5 ON modlog(userid, timestamp);
CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp);
CREATE INDEX ml_index_7 ON modlog(ip, timestamp); |
Hm actually, I think it might be better to sort by |
I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment. |
Good point.. Does the rest of the schema look OK? |
I'd just explicitly use the function you want to use instead of relying on this sugar.
These will never be used unless you special case exact matches when generating SQL; the query-planner won't be able to look inside your user-defined JS function.
What is this?
I generally prefer to call these like
If you intend to have both of these use an index you need a multicolumn index like |
I don't think actions are very common search terms. Probably The other suggestions seem fine, I'll implement them. |
Context
The current Modlog setup of using text files and ripgrep to meet the search criteria is sluggish and has been the result of numerous complaints.
Solution
Databases! It has shown to work in PM logs, from taking ~5 minutes to being instantaneous. It is obviously not a 'silver bullet' though, like everything it requires optimizations and monitoring to ensure it's meeting requirements.
Why Postgres?
While SQLite is easier to set up (it's just a file), I think Postgres is a better choice for a couple reasons
Proposed Schema
Proposed Query
Of course, concatenation could in theory be expensive and thus make this query slow, but I don't think making premature optimization is worth it. If it does turn out to be too slow, we could rewrite the modlog command such that it takes a filter for each query and we don't have to concatenate all of them.
Conversion
Create a
tools/database-converters
where we support converting modlog from the old format (with the script @sparkychildcharlie wrote) to the new format, converting the new one to Postgres, and vice versa only for the latter.\cc @Zarel @scheibo
The text was updated successfully, but these errors were encountered: