
Proposal for a new profile to repair files using 3 files. #977


Open

SkyKwas opened this issue Apr 8, 2025 · 4 comments

SkyKwas commented Apr 8, 2025

Proposal

I have an idea for handling auto-repair on read and other features that build on Btrfs's error detection.

At the time of writing, Btrfs supports the profiles DUP, RAID1-like, and (experimentally) RAID5/6. These profiles can be used for the data, metadata, and system block groups. From what I understand, they all function similarly:

  • Data is stored with 2 copies.
  • During a read operation, if one copy fails the checksum, the other is used if it passes the checksum.
  • During a read operation, if both copies fail the checksum, the data is unrecoverable.

The problem is that if both copies are corrupt (even if they are corrupt in different blocks), there is no way to recover the file. I would like to propose a new profile that uses 3 copies to provide error detection AND error correction.

Execution

The execution is similar to the DUP profile, but with extra steps:

  • Data is stored with 3 copies.
  • During a read operation, if one copy fails the checksum, any of the others can be used if it passes the checksum.
  • During a read operation, if all 3 copies fail the checksum, Btrfs attempts to construct a new copy that passes the checksum out of the 3 corrupted ones.

If all 3 copies are corrupt, Btrfs attempts to build a new copy using majority rule on a block-by-block basis: it compares the first block of each copy and takes the version that appears at least twice, then compares the second block of each copy, and so on until every block has been compared. The resulting copy is then checksummed just like the original 3; if it passes the checksum, it is used and the 3 corrupted copies are replaced with it. If it fails, the data is unrecoverable. For each block position:

  • If all three blocks are the same, use that block for the new copy.
  • If two blocks are the same but one is different, use a block from the matching two for the new copy.
  • If all three blocks are different, Btrfs can stop here because the file is deemed unrecoverable.
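
Below is a minimal sketch of what this block-by-block vote could look like, assuming three equally sized in-memory copies and a fixed block size. All names here are illustrative, not actual Btrfs internals, and the caller would still have to verify the checksum of the result:

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Build a repaired copy in 'out' by block-wise majority vote over the
 * three corrupted copies 'a', 'b', and 'c'.
 * Returns 0 on success, -1 if some block has three distinct versions. */
static int majority_repair(const unsigned char *a, const unsigned char *b,
                           const unsigned char *c, unsigned char *out,
                           size_t len)
{
        for (size_t off = 0; off < len; off += BLOCK_SIZE) {
                size_t n = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;

                if (!memcmp(a + off, b + off, n))
                        memcpy(out + off, a + off, n);  /* a == b: majority */
                else if (!memcmp(a + off, c + off, n))
                        memcpy(out + off, a + off, n);  /* a == c: majority */
                else if (!memcmp(b + off, c + off, n))
                        memcpy(out + off, b + off, n);  /* b == c: majority */
                else
                        return -1;  /* all three differ: unrecoverable */
        }
        /* The caller must still checksum 'out' before trusting it. */
        return 0;
}
```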

Pros

  • Superior resistance to data corruption. Unlike the DUP profile, this is not just adding another copy (which only adds more corruption detection); it adds logic to repair data. All 3 copies can be corrupt and it is still possible to recover the data. The only time data becomes unrecoverable is when two or more copies are corrupt at the same block position -- which is far less likely than the data having any corruption at all (see the rough estimate after this list).
  • Performance should be the same as DUP when no repair is needed.
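
As a rough illustration of the "far less likely" claim, assuming each block of each copy is corrupted independently with some small probability p (an assumption, since real-world failures are often correlated): a given block position becomes unrecoverable only when at least two of the 3 copies are corrupt there, which happens with probability

3p^2(1-p) + p^3 ≈ 3p^2

For p = 10^-6 that is about 3×10^-12 per block, compared with roughly 3p ≈ 3×10^-6 for the block being corrupt in at least one copy.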

Cons

  • We can only use 1/3 of the raw space.

Additional comments

I figured I'd bring this idea to the Btrfs project because I think that sacrificing 2/3 of storage for error correction is better than sacrificing 1/2 of storage for error detection. I don't see many people using this for the data block group, but I can 100% see it becoming the default for the metadata and system block groups, as they're already set to DUP by default on single drives.

kdave (Owner) commented Apr 8, 2025

A triple-copy block group has been asked for in the past, though not with the additional majority-rule repair strategy. I am not against it in principle. The usual answer for that amount of redundancy is to use more reliable hardware from the beginning, but there are situations where this can help. An example is an SD card in something like a Raspberry Pi, where power spikes can damage the card even though it is otherwise a relatively stable hardware environment.

The RAID1C3 profile provides that redundancy level but requires 3 devices; the triple copy would also work on one device, which does not need to be a single physical device but can be some compound device.

The majority-rule strategy could possibly be applied to the RAID1C3 and C4 profiles too, though it would be good to hide it behind some configuration option, as it would be able to ignore the failing checksums.

SkyKwas (Author) commented Apr 8, 2025

> The majority-rule strategy could possibly be applied to the RAID1C3 and C4 profiles too...

I didn't even think of that.

> ...though it would be good to hide it behind some configuration option, as it would be able to ignore the failing checksums.

I would imagine that performing 3 checksum verifications (in the worst-case scenario) is faster than going straight to the majority-rule repair strategy (MRRS). Going straight to the MRRS would mean a minimum of 3 read operations plus the MRRS comparison logic. ...or did I misunderstand what you said?

Zygo commented Apr 16, 2025

MRR is a poor fit for production use when firmware may drop writes. If a device loses a write after acknowledging it, all copies may be wrong in the same way. Voting doesn't help if they're identically wrong.

This failure mode already affects DUP and RAID1. A third copy doesn't fix it -- it just adds I/O and wear without improving reliability. MRR assumes independent failures, but silent write drops are systemic.

That said, MRR could be useful in a supervised recovery tool (e.g., btrfs restore). If a user can manually verify correctness -- say, for partially corrupted volumes -- it might help salvage data. But it must be opt-in, with clear warnings about its limits.

Zygo commented Apr 16, 2025

> I can 100% see it becoming the default for the metadata and system block groups, as they're already set to DUP by default on single drives.

I see the opposite. Metadata is already well protected against dropped writes and corruption—so well, in fact, that applying MRR here would be more likely to introduce new failure modes than prevent them. It would be a disaster for the same reason running btrfs check on a device-level-corrupted filesystem is a disaster: both risk accepting demonstrably bad metadata in cases where it should be rejected—or reconstructed by inference (btrfs check) from uncorrupted structures.

Data, by contrast, is more exposed. There are valid cases for both MRR and bypassing csum verification:

  • After O_DIRECT is used incorrectly, resulting in out-of-sync csums and data due to post-flush page modification. In recovery scenarios, bypassing the csum to salvage usable data can be reasonable. MRR could help here—but only in a supervised recovery context, not normal reads.

  • When corruption consists of a few flipped bits, MRR voting at the byte level might reconstruct a block that matches the original csum, allowing confident recovery. This failure mode is common enough to consider supporting carefully (a sketch of this kind of voting follows this list).

  • When the file is tolerant of minor corruptions (e.g. plain text), a user may want to bypass csum checks entirely to manually recover usable content. MRR isn’t needed here: if any copy is intact, its csum would pass and we can easily fetch correct data, but if all copies are broken, MRR is only a choice between corrupt alternatives. A better mechanism like an ioctl, fcntl flag, or inode property to bypass csum verification or choose which mirror to read would be helpful, without any need for MRR.

  • When data has no csums at all (e.g. nodatacow), MRR could help with RAID1/RAID1C3/RAID1C4 recovery by resolving inconsistent copies. It's not ideal, but better than today's behavior of arbitrarily choosing a version. Even here, explicitly tracking which devices lose writes (e.g. with a "this drive lost writes, don't trust its copy" flag, or a tree that records lost writes at the extent or block level) is safer and more robust than majority vote.
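
Here is a minimal sketch of what that kind of voting could look like for a single block, assuming all three copies are already in memory. It votes bitwise rather than byte-by-byte, which handles independent bit flips directly; the checksum step is only indicated by a comment, since real code would go through Btrfs's csum machinery (crc32c by default). All names are hypothetical:

```c
#include <stddef.h>

/* Bitwise 3-way majority: each output bit takes the value that appears
 * in at least two of the three copies 'a', 'b', and 'c'. With only a
 * few independently flipped bits per copy, the result is likely to
 * match the original data. */
static void bitwise_majority(const unsigned char *a, const unsigned char *b,
                             const unsigned char *c, unsigned char *out,
                             size_t n)
{
        for (size_t i = 0; i < n; i++)
                out[i] = (a[i] & b[i]) | (b[i] & c[i]) | (a[i] & c[i]);

        /* In the datacow case, the caller must verify 'out' against the
         * stored csum and reject the reconstruction on mismatch, so this
         * can never silently accept bad data. In the nodatacow case there
         * is no csum, so the vote is only a best guess. */
}
```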

> I think that sacrificing 2/3 of storage for error correction is better than sacrificing 1/2 of storage for error detection

This may reflect confusion between datacow and nodatacow behavior.

In datacow mode, Btrfs stores checksums for file data, so error detection is possible even with a single copy. If a corruption is detected, a redundant copy (e.g. via raid1) allows correction. So raid1, raid1c3, and dup all provide error correction, not just detection. single mode offers detection only, as there’s nothing to correct against.

In nodatacow mode, there are no data checksums, so detection depends entirely on redundancy: two mismatching copies indicate an error, but give no guidance about which is correct. Three or more copies (e.g. raid1c3) make MRR-style voting possible—but even then, MRR is only meaningful when the failures are independent, which isn’t the case if a device drops writes.

So while MRR could help in some nodatacow scenarios, it’s no substitute for csums. And in datacow mode, checksums already provide stronger guarantees than majority voting can.
