Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pg_repack breaks logical decoding #135

Open
jfinzel opened this issue Jun 21, 2017 · 11 comments
Open

pg_repack breaks logical decoding #135

jfinzel opened this issue Jun 21, 2017 · 11 comments

Comments

@jfinzel
Copy link

jfinzel commented Jun 21, 2017

We are running pglogical replication from a 9.4 to 9.6 cluster, and we had an issue last couple days running pg_repack on several provider tables that are in replication. We have done pg_repack before in this way but not had an issue, however this time some larger tables were involved.

I posted this issue on pglogical's git repo here: 2ndQuadrant/pglogical#113

According to PJMODOS:

the way pg_repack does table rewrite is almost guaranteed to break logical decoding unless there is zero unconsumed changes for a given table as it does not build the necessary mappings info for logical decoding that standard heap rewrite in postgres does.

Thoughts?

@dvarrazzo
Copy link
Member

@MasahikoSawada I think this problem should be addressed in next release. I had too a logical replication breakage using pg_repack and we had to drop replications and resync tables sort-of-manually.

I can try to take a look into it, at least to try to reproduce. Help from @jfinzel or other people would be welcome.

@MasahikoSawada
Copy link
Contributor

I think this problem should be addressed in next release.
I agreed.

@jfinzel
Copy link
Author

jfinzel commented Sep 25, 2017

@dvarrazzo what precise issue did you have? I have actually used pg_repack successfully on lots of tables in replication with pglogical. But I found that the issue was specifically when the walsender hits a timeout and I get walreceiver terminating due to replication timeout.

I am wondering if I removed the wal sender/receiver timeout for our replication role, I would never have the issue again. I can tell you for a fact that pg_repack seems to usually work with logical decoding in our experience. So I don't understand what the precise problem really is, but please let me know if this explanation sheds any light on the issue.

@dvarrazzo
Copy link
Member

I had two logical replication slot streaming, and they both stopped roughly at the time someone ran pg_repack on one of the replicated tables. The slots just stopped sending data, pg_logical_slot_peek_changes() on them didn't terminate; newly created replication slots worked ok so I had to drop and recreate the previous ones and recover manually the missing data.

Unfortunately I didn't check the server log details. My client was just hung as the peek_changes function (maybe that caused a timeout for you?): if we manage to reproduce the issue maybe we can take a look there.

The sending server is a postgres 9.4.14; the receiving is a 9.6 but it shouldn't matter.

@jfinzel
Copy link
Author

jfinzel commented Sep 25, 2017

Interesting. Again, I have repacked busy tables that are in logical replication without issue. I can't tell you why that normally works. I only had breakage when repacking a table that had several 100 millions rows, and after 3+ hours, I got the walreceiver terminating due to replication timeout in the server logs, which was the precise time the logical decoding broke.

@bryanconlon
Copy link

@dvarrazzo Have you had time to address the issue with pg_repack breaking logical replication?

@dvarrazzo
Copy link
Member

@bryanconlon not much, no

@bryanconlon
Copy link

I mostly use pg_repack to rebuild indexes. Is this issue related to repacking tables only or does it also affect index rebuilds?

@jfinzel
Copy link
Author

jfinzel commented Oct 7, 2019

@dvarrazzo @bryanconlon see my new comments on 2ndQuadrant/pglogical#113 - I'm curious your opinion about this.

@dvarrazzo
Copy link
Member

My opinion is now about 2 years old: I had then cases of replication interrupting, and the only way to restore it, was to drop and recreate the replication slot. The project using replication hasn't gone further (part because of this problems, but also part because of different business requirements and constraints).

That might have been related to postgres bugs that have been solved since (might have been a nowadays-old 9.4 thing?). I don't have a more recent set of experiences to share, and I'm happy to not be too superstitious about it. However the function pg_repack use to swap relfilenode has not had all the updates that the current postgres server support and I'm not comfortable at the moment that all the necessary bookkeeping is done (see #152).

@jfinzel
Copy link
Author

jfinzel commented Oct 7, 2019

Thanks for the reply. Do you have any idea if your issues you had before were correlated with a single high WAL-generating transaction?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants