pg_repack breaks logical decoding #135

jfinzel · 2017-06-21T10:21:02Z

We are running pglogical replication from a 9.4 to 9.6 cluster, and we had an issue last couple days running pg_repack on several provider tables that are in replication. We have done pg_repack before in this way but not had an issue, however this time some larger tables were involved.

I posted this issue on pglogical's git repo here: 2ndQuadrant/pglogical#113

According to PJMODOS:

the way pg_repack does table rewrite is almost guaranteed to break logical decoding unless there is zero unconsumed changes for a given table as it does not build the necessary mappings info for logical decoding that standard heap rewrite in postgres does.

Thoughts?

dvarrazzo · 2017-09-25T10:13:19Z

@MasahikoSawada I think this problem should be addressed in next release. I had too a logical replication breakage using pg_repack and we had to drop replications and resync tables sort-of-manually.

I can try to take a look into it, at least to try to reproduce. Help from @jfinzel or other people would be welcome.

MasahikoSawada · 2017-09-25T11:51:23Z

I think this problem should be addressed in next release.
I agreed.

jfinzel · 2017-09-25T13:25:49Z

@dvarrazzo what precise issue did you have? I have actually used pg_repack successfully on lots of tables in replication with pglogical. But I found that the issue was specifically when the walsender hits a timeout and I get walreceiver terminating due to replication timeout.

I am wondering if I removed the wal sender/receiver timeout for our replication role, I would never have the issue again. I can tell you for a fact that pg_repack seems to usually work with logical decoding in our experience. So I don't understand what the precise problem really is, but please let me know if this explanation sheds any light on the issue.

dvarrazzo · 2017-09-25T14:09:13Z

I had two logical replication slot streaming, and they both stopped roughly at the time someone ran pg_repack on one of the replicated tables. The slots just stopped sending data, pg_logical_slot_peek_changes() on them didn't terminate; newly created replication slots worked ok so I had to drop and recreate the previous ones and recover manually the missing data.

Unfortunately I didn't check the server log details. My client was just hung as the peek_changes function (maybe that caused a timeout for you?): if we manage to reproduce the issue maybe we can take a look there.

The sending server is a postgres 9.4.14; the receiving is a 9.6 but it shouldn't matter.

jfinzel · 2017-09-25T21:30:33Z

Interesting. Again, I have repacked busy tables that are in logical replication without issue. I can't tell you why that normally works. I only had breakage when repacking a table that had several 100 millions rows, and after 3+ hours, I got the walreceiver terminating due to replication timeout in the server logs, which was the precise time the logical decoding broke.

bryanconlon · 2018-03-05T22:41:57Z

@dvarrazzo Have you had time to address the issue with pg_repack breaking logical replication?

dvarrazzo · 2018-03-07T10:52:01Z

@bryanconlon not much, no

bryanconlon · 2018-06-27T19:51:31Z

I mostly use pg_repack to rebuild indexes. Is this issue related to repacking tables only or does it also affect index rebuilds?

jfinzel · 2019-10-07T15:55:58Z

@dvarrazzo @bryanconlon see my new comments on 2ndQuadrant/pglogical#113 - I'm curious your opinion about this.

dvarrazzo · 2019-10-07T16:20:41Z

My opinion is now about 2 years old: I had then cases of replication interrupting, and the only way to restore it, was to drop and recreate the replication slot. The project using replication hasn't gone further (part because of this problems, but also part because of different business requirements and constraints).

That might have been related to postgres bugs that have been solved since (might have been a nowadays-old 9.4 thing?). I don't have a more recent set of experiences to share, and I'm happy to not be too superstitious about it. However the function pg_repack use to swap relfilenode has not had all the updates that the current postgres server support and I'm not comfortable at the moment that all the necessary bookkeeping is done (see #152).

jfinzel · 2019-10-07T18:32:10Z

Thanks for the reply. Do you have any idea if your issues you had before were correlated with a single high WAL-generating transaction?

dvarrazzo mentioned this issue Oct 7, 2017

Doubts about repacked tables relfrozenxid #152

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pg_repack breaks logical decoding #135

pg_repack breaks logical decoding #135

jfinzel commented Jun 21, 2017

dvarrazzo commented Sep 25, 2017

MasahikoSawada commented Sep 25, 2017

jfinzel commented Sep 25, 2017

dvarrazzo commented Sep 25, 2017

jfinzel commented Sep 25, 2017

bryanconlon commented Mar 5, 2018

dvarrazzo commented Mar 7, 2018

bryanconlon commented Jun 27, 2018

jfinzel commented Oct 7, 2019 •

edited

Loading

dvarrazzo commented Oct 7, 2019

jfinzel commented Oct 7, 2019

pg_repack breaks logical decoding #135

pg_repack breaks logical decoding #135

Comments

jfinzel commented Jun 21, 2017

dvarrazzo commented Sep 25, 2017

MasahikoSawada commented Sep 25, 2017

jfinzel commented Sep 25, 2017

dvarrazzo commented Sep 25, 2017

jfinzel commented Sep 25, 2017

bryanconlon commented Mar 5, 2018

dvarrazzo commented Mar 7, 2018

bryanconlon commented Jun 27, 2018

jfinzel commented Oct 7, 2019 • edited Loading

dvarrazzo commented Oct 7, 2019

jfinzel commented Oct 7, 2019

jfinzel commented Oct 7, 2019 •

edited

Loading