Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrating Modlog to Postgres #6469

Open
thejetou opened this issue Mar 18, 2020 · 26 comments
Open

Migrating Modlog to Postgres #6469

thejetou opened this issue Mar 18, 2020 · 26 comments
Assignees

Comments

@thejetou
Copy link
Contributor

thejetou commented Mar 18, 2020

Context

The current Modlog setup of using text files and ripgrep to meet the search criteria is sluggish and has been the result of numerous complaints.

Solution

Databases! It has shown to work in PM logs, from taking ~5 minutes to being instantaneous. It is obviously not a 'silver bullet' though, like everything it requires optimizations and monitoring to ensure it's meeting requirements.

Why Postgres?

While SQLite is easier to set up (it's just a file), I think Postgres is a better choice for a couple reasons

  • The server will use the same database engine as the client (see: Postgres pokemon-showdown-client#1423)
  • SQLite is meant to be a 'lightweight' database, intended for use in embedded devices (or mobile phones), so scaling Postgres higher should in theory be easier.

Proposed Schema

CREATE TABLE modlog (
    -- UNIX timestamp
    -- Integer technically works here, but would break on 2038
    timestamp bigint NOT NULL
    room_id TEXT NOT NULL
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL
    user_id TEXT
    autoconfirmed TEXT
    alts TEXT[],
    ip TEXT,
    logged_by TEXT,
    note TEXT
);

-- We use `WHERE roomid = $1` _every_ single time, so an index seems worth it
CREATE INDEX ml_index ON modlog(room_id)

Proposed Query

SELECT *
FROM modlog
WHERE
    room_id = $1
    AND
    concat_ws(" ", time_stamp, action, user_id, autoconfirmed, array_to_string(autoconfirmed, " "), alts, ip, logged_by, note) LIKE $2
ORDER BY timestamp DESC
LIMIT $3

Of course, concatenation could in theory be expensive and thus make this query slow, but I don't think making premature optimization is worth it. If it does turn out to be too slow, we could rewrite the modlog command such that it takes a filter for each query and we don't have to concatenate all of them.

Conversion

Create a tools/database-converters where we support converting modlog from the old format (with the script @sparkychildcharlie wrote) to the new format, converting the new one to Postgres, and vice versa only for the latter.


\cc @Zarel @scheibo

@thejetou thejetou changed the title Modlog Postgres Migration Migrating Modlog to Postgres Mar 18, 2020
@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

@scheibo So I finally had time to work on database migrations, and talked them over with JetOU. JetOU thought that migrating to SQLite was unnecessary and he wanted to migrate to Postgres instead; I said I was tentatively okay with it, but it had to be done on Modlog and not Punishments because I didn't want Punishments on Postgres.

@scheibo
Copy link
Contributor

scheibo commented Mar 18, 2020

SQLite vs. Postrgres

I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).

What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...

Proposed Query

Yeah... this effectively is getting us pretty much zero value over flat files and would in fact likely be slower. If you actually want to improve modlog performance you want to add a better index than just room id and actually use it, though this is going to probably involve changing how the modlog search command is structured

I didn't want Punishments on Postgres.

Why?

@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

I have nothing against going right to Postgres, but SQLite can easily handle modlogs 'scale'. PM logs are considerably larger and are plenty fast on SQLite (though theyre not doing stupid table scans lmao).

I told him that, but he said it was just nicer to have all the login and sim stuff in a centralized database, which I guess I don't disagree with. Would be nice to have modlog in userpages.

Proposed Query

I didn't see that. Yeah, that query isn't going to be fast.

Maybe start off with some username queries?

Why [not Punishments on Postgres]?

I preferred a migration that would make more of a difference. I think Punishments on Postgres might actually make it slower (because it's going over the network) rather than being local.

@monsanto
Copy link
Member

monsanto commented Mar 18, 2020

My intuition is that SQLite will be faster than Postgres for a database of this size. The reason is that a Postgres query always involves inter-process communication (incl. serialization overheads), whereas SQLite operates in-process.

Other advantages of SQLite over Postgres include:

  • better cross-platform story
  • easier to write tests for (although it's also easy enough to write tests for Postgres using testcontainers)
  • easier to administer, no separate Postgres service

Some of this would be mitigated by having pluggable persistence but then you run into the issue of potentially running different code in testing than in production.

As for the schema/query

  • Use an appropriate date/time type instead of bigint
  • You need a multi-column index on room_id + timestamp, since you are sorting on timestamp
  • I agree with @scheibo, your proposed query will probably be slower than a flat file for IPC overhead alone.

@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

Note that the Postgres query would be over-network (it would be hosted on a different server). It's more than just IPC overhead.

@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

Also note that jetou said he didn't care about "better cross-platform story" because he planned to continue maintaining flat files for side servers.

@thejetou
Copy link
Contributor Author

thejetou commented Mar 18, 2020

What happened to making the persistence pluggable? I'm still fairly confident theres a level of abstraction where Postgres vs. Sqlite vs. Files is just configurable...

Yes, I do plan to make it configurable, but you still need to write code to support the option.

Also, RE: SQLite vs Postgres

The main reason I want Postgres, is as Zarel said, to share the same database with the client. Though over network doesn't sound too nice, so I might have to concede..


OK, schema and query, take two!

Schema:

CREATE TABLE modlog (
    -- UNIX timestamp
    timestamp timestampz NOT NULL,
    room_id TEXT NOT NULL,
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL,
    user_id TEXT,
    autoconfirmed TEXT,
    alts TEXT[],
    ip TEXT,
   -- Naming might be a bit poor here
    logged_by TEXT,
    note TEXT
);

CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));

Query:

SELECT *
FROM modlog
WHERE
    room_id = $1 AND $2 = ANY(alts)
    AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4

NOTE: This will make the modlog command's syntax be: /modlog [roomid], user=[user], ip=[ip], note=[note], instead of a full text search of the entire log.

@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

Do Postgres/SQLite support fulltext search? If so, maybe we could have a text column containing all the parameters, and then fulltext search that?

@Zarel
Copy link
Member

Zarel commented Mar 18, 2020

The type integer is the common choice, as it offers the best balance between range, storage size, and performance. The smallint type is generally only used if disk space is at a premium. The bigint type is designed to be used when the range of the integer type is insufficient.

Why is Postgres still recommending 32-bit ints in literal 2020?

@thejetou
Copy link
Contributor Author

Do Postgres/SQLite support fulltext search?

It does, I'm using it for note.

If so, maybe we could have a text column containing all the parameters, and then fulltext search that?

Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for note, but I suppose that could work.

@Zarel
Copy link
Member

Zarel commented Mar 19, 2020

Like, keep all the columns, but have one for the full log for searching? I'm currently using full text search only for note, but I suppose that could work.

Yeah, that's what I was thinking, but I'm not a database person. I'd want @scheibo or @monsanto to weigh in on whether or not that's a good idea.

@thejetou
Copy link
Contributor Author

Bump.

@AnnikaCodes
Copy link
Collaborator

I'm going to work on this. I agree with chaos that we should use SQLite rather than Postgres for this; less overhead is good here. @thejetou, I know you worked on this. Did you have some WIP design that you would be okay with sharing, besides the schema below?

Schema:

CREATE TABLE modlog (
    -- UNIX timestamp
    timestamp timestampz NOT NULL,
    room_id TEXT NOT NULL,
    -- ROOMBAN, NOTE etc.
    action TEXT NOT NULL,
    user_id TEXT,
    autoconfirmed TEXT,
    alts TEXT[],
    ip TEXT,
   -- Naming might be a bit poor here
    logged_by TEXT,
    note TEXT
);

CREATE INDEX ml_index_1 ON modlog(room_id, timestamp);
-- If we're searching by user, we're going to look up all of these cols
CREATE INDEX ml_index_2 ON modlog(user_id, alts, autoconfirmed);
-- Optimizing the full text search
CREATE INDEX ml_index_3 USING GIN ON modlog(tsvector(note));

Query:

SELECT *
FROM modlog
WHERE
    room_id = $1 AND $2 = ANY(alts)
    AND to_tsvector(note) @@ to_tsquery($3);
ORDER BY timestamp DESC
LIMIT $4

@AnnikaCodes
Copy link
Collaborator

AnnikaCodes commented Aug 17, 2020

tbh the schema looks ok to me - AFAIK tsvector is Postgres only, so we'd have to use something else for searching notes. @monsanto, your thoughts? Would a WHERE ... LIKE query be too slow here?

There's also the question of whether the user=, ip=, note=, action= etc syntax would be too confusing to staff (I think it's fine, although /modlog (text) with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.

Also, I know JetOU wanted to continue maintaining flat files, which is okay I guess, but I feel like sqlite isn't that much more work for side servers (since, unlike Postgres, it doesn't require maintaining a separate service).

@monsanto
Copy link
Member

@AnnikaCodes I don't think this PR reflects the current state of jetou's work. From our discussions:

  • We agreed no flat files, no point in maintaining two code paths
  • I suggested using LIKE for substring search or streaming the rows to JS and doing post-processing.
  • better-sqlite3 instead of node-sqlite3 which is used for the PM log
  • His last schema had indexes
CREATE INDEX ml_index_1 ON modlog(timestamp, room_id);
CREATE INDEX ml_index_2 ON modlog(action);
CREATE INDEX ml_index_3 ON modlog(action_taker);
CREATE INDEX ml_index_4 ON modlog(user_id, autoconfirmed);
CREATE INDEX ml_index_5 ON modlog(ip);
  • His last query was
    -- /modlog [roomid], action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
    -- Alts and notes will be filtered via streaming to JavaScript
    SELECT *
    FROM modlog
    WHERE
        (room_id = $1 OR roomid = 'global-' || $1) AND
        action = $2 AND
        action_taker = $3 AND
        (user_id = $4 OR autoconfirmed = $4) AND
        ip = $5
    ORDER BY timestamp DESC

    -- /modlog global, action=[action], actiontaker=[userid], user=[userid], ip=[ip], note=[note]
    SELECT *
    FROM modlog
    WHERE
        (roomid = 'global-%' OR roomid = 'global') AND
        action = $2 AND
        action_taker = $3 AND
        (user_id = $4 OR autoconfirmed = $4) AND
        ip = $5
    ORDER BY timestamp DESC
  • I didn't necessarily give the go-ahead to these, there was a lot of discussion about indexing and query planning which I'm not sure will carry over to your work. You may want to look it over and see what you agree/disagree with before giving me a new proposal.
  • I also gave him some modlogs that I can give you too.

There's also the question of whether the user=, ip=, note=, action= etc syntax would be too confusing to staff (I think it's fine, although /modlog (text) with no specific parameter could perhaps use some intelligence to guess what field is being searched (e.g. queries matching the IP regex are probably IPs, queries in ALL CAPS are probably actions, queries <19 characters might be usernames, and so on), which would also make the transition smoother for staff.

I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.

@AnnikaCodes
Copy link
Collaborator

CREATE INDEX ml_index_3 ON modlog(action_taker);

What is action_taker? Is that the same as logged_by?

(room_id = $1 OR roomid = 'global-' || $1)

I was thinking of a separate boolean (or rather tinyint since SQLite doesn't support real booleans) to represent whether a modlog entry should be in the global modlog (since something can be in a room modlog and the global modlog).

The alts field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts to alt1,alt2,alt3 and then do WHERE alts LIKE '%' || $1 || '%'.

Other than that jetou's last setup looks good.

I don't use modlog very much so you may want to see what the rest of the staff regularly searches for. This will help design the indexes too.

Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?

@monsanto
Copy link
Member

monsanto commented Aug 18, 2020

What is action_taker? Is that the same as logged_by?

I believe so

The alts field is also something to be considered - a user can have multiple alts marked in the modlog when they're punished and SQLite as far as I know doesn't support list datatypes, so we'd have to do something like set alts to alt1,alt2,alt3 and then do WHERE alts LIKE '%' || $1 || '%'.

I'd add a modlog_id field to modlog, and have another table alts that maps modlog_id -> alt

Sure, is this an appropriate thing to post in the staff forum on Smogon or should I do it more informally?

Whatever you would feel is most helpful to you

@thejetou
Copy link
Contributor Author

thejetou commented Aug 18, 2020

@AnnikaCodes - https://github.com/TheJetOU/Pokemon-Showdown/tree/modlog-sqlite

IDK how well up to date this with PS code since I wrote it four months ago.

What is action_taker? Is that the same as logged_by?

Yes.

Thanks for carrying my will forward and good luck! :).

@AnnikaCodes
Copy link
Collaborator

AnnikaCodes commented Aug 18, 2020

Thanks! That looks like a good start. Like I said, I'd prefer to stop supporting FS modlogs because of the extra maintenance burden. (Also, I'd rather make modlog writes take an object with optional properties to avoid the modlog(undefined, undefined, message) thing.)

@monsanto - you meant modlog_id be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid)
Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated), but if you think it's the best I can work on it. (We'd have to either LEFT JOIN and then combine duplicate records or run a separate query to get alts.)

@monsanto
Copy link
Member

you meant modlog_id be a primary key type thing, incrementing on each entry, correct? (i.e. not a userid)

Yes

Also it seems a bit overkill to me to have an entire separate table for alts (which would make searching more complicated)

This is just how you store arrays in SQL.

We'd have to either LEFT JOIN and then combine duplicate records or run a separate query to get alts.

Use an EXISTS subquery to test for an alt.

@AnnikaCodes
Copy link
Collaborator

AnnikaCodes commented Aug 20, 2020

So far this is what I have for a query (alternately with (roomid = 'global' OR roomid LIKE 'global-%') for searching global modlogs, just like jetou):

SELECT *, (SELECT group_concat(userid, ',') FROM alts WHERE entryid = modlog.entryid) as alts
FROM modlog
WHERE
    roomid IN ($1, $2) AND
    action = $3 AND
    (
        userid REGEXP $4 OR
        autoconfirmed_userid REGEXP $4 OR
        EXISTS(SELECT * FROM alts WHERE entryid = modlog.entryid AND userid REGEXP $4)
    ) AND
    ip REGEXP $5 AND
    action_taker REGEXP $6 AND
    note REGEXP $7
ORDER BY timestamp DESC
LIMIT $8

And for a schema:

CREATE TABLE modlog (
    entryid INTEGER NOT NULL PRIMARY KEY,
    timestamp INTEGER NOT NULL,
    roomid TEXT NOT NULL,
    action TEXT NOT NULL,
    visual_roomid TEXT,
    action_taker TEXT,
    userid TEXT,
    autoconfirmed_userid TEXT,
    ip TEXT,
    note TEXT
);
CREATE TABLE alts (
    entryid INTEGER NOT NULL,
    userid TEXT NOT NULL
);
CREATE INDEX ml_index_1 ON modlog(timestamp);
CREATE INDEX ml_index_2 ON modlog(roomid, timestamp);
CREATE INDEX ml_index_3 ON modlog(action, timestamp);
CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp);
CREATE INDEX ml_index_5 ON modlog(userid, timestamp);
CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp);
CREATE INDEX ml_index_7 ON modlog(ip, timestamp);

@AnnikaCodes
Copy link
Collaborator

AnnikaCodes commented Aug 21, 2020

Hm actually, I think it might be better to sort by entryid because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp.

@monsanto
Copy link
Member

Hm actually, I think it might be better to sort by entryid because of SQLite's ROWID and the corresponding performance increase. That'd also eliminate the need for all the indices on timestamp.

I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.

@AnnikaCodes
Copy link
Collaborator

I wouldn't try to microoptimize the schema. There isn't a guarantee that a larger entryid implies a later timestamp. Even if you restricted your SQLite usage enough to preserve this invariant you are ruling out restoring old modlog data after deployment.

Good point.. Does the rest of the schema look OK?

@monsanto
Copy link
Member

REGEXP

The REGEXP operator is a special syntax for the regexp() user function. No regexp() user function is defined by default and so use of the REGEXP operator will normally result in an error message. If an application-defined SQL function named "regexp" is added at run-time, then the "X REGEXP Y" operator will be implemented as a call to "regexp(Y,X)".

I'd just explicitly use the function you want to use instead of relying on this sugar.

CREATE INDEX ml_index_4 ON modlog(action_taker, timestamp);
CREATE INDEX ml_index_5 ON modlog(userid, timestamp);
CREATE INDEX ml_index_6 ON modlog(autoconfirmed_userid, timestamp);
CREATE INDEX ml_index_7 ON modlog(ip, timestamp);

These will never be used unless you special case exact matches when generating SQL; the query-planner won't be able to look inside your user-defined JS function.

visual_roomid TEXT,

What is this?

entryid INTEGER NOT NULL PRIMARY KEY,

I generally prefer to call these like <table>_id so you can use USING in joins without worrying about column ambiguity. If you're going to name it something generic i'd just call it id.

roomid IN ($1, $2) AND
action = $3 AND

If you intend to have both of these use an index you need a multicolumn index like (roomid, action, timestamp). I don't know how worthwhile this is. I also don't know how worthwhile it is to have an (action, timestamp) index. You should think carefully about which parts of the query are expensive and focus on indexing that. If you want a flexible search interface you won't be able to index everything.

@AnnikaCodes
Copy link
Collaborator

What is this?

visual_roomid is used for cases where the ID to search for is different from the name to be displayed. Currently, the only use case is room tournaments, where the roomid will be, for example, battle-gen8randombattle-1 but the visual_roomid will be battle-gen8randombattle-1 tournament: lobby.

I don't think actions are very common search terms. Probably (roomid, userid, timestamp) and (roomid, ip, timestamp) -- those seem like the most common search combos.

The other suggestions seem fine, I'll implement them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants