Luke T. Shumaker » blog » btrfs-rec

Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem

I originally sent this email on 2023-07-10, but it was caught by vger's bogofilter. Yes, it was plaintext. No, I didn't use GMail. Yes, I've successfully participated in vger lists in the past. Yes, I've reached out to postmaster; no, I haven't received a reply yet (as of 2023-07-14).

To: linux-btrfs@vger.kernel.org
From: Luke T. Shumaker <lukeshu@lukeshu.com>
Subject: btrfs-rec: Recover (data from) a broken btrfs filesystem
Date: Mon, 10 Jul 2023 21:23:41 -0600
Message-ID: <87jzv7uo5e.wl-lukeshu@lukeshu.com>

Inspired by a mis-typed dd command, for the last year I've been working on a tool for recovering corrupt btrfs filesystems; at first idly here and there, but more actively in the last few months. I hope to get it incorporated into btrfs-progs, though perhaps that is problematic for a few reasons I'll get to. If the code can't be incorporated into btrfs-progs, at least the ideas and algorithms should be.

https://git.lukeshu.com/btrfs-progs-ng/

Highlights:

Hopefully some folks will find it useful, or at least neat!

1. Motivation

Have you ever ended up with a corrupt btrfs filesystem (through no fault of btrfs itself, but perhaps a failing drive, or a mistaken dd invocation)? Surely losing less than 100MB of data from a drive should not render hundreds of GB of perfectly intact data unreadable! And yet, the existing tools are unable to even attempt to read that data:

$ btrfs check --repair --force dump-zero.1.img
enabling repair mode
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system

or

$ btrfs check --init-extent-tree --force dump-zero.1.img
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system

or

$ btrfs check --init-csum-tree --force dump-zero.1.img
Creating a new CRC tree
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system

or

$ btrfs rescue chunk-recover dump-zero.1.img
Scanning: DONE in dev0
corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
Couldn't read tree root
open with broken chunk error

or

$ btrfs rescue zero-log dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
ERROR: cannot read chunk root
ERROR: could not open ctree

or

$ mkdir out
$ btrfs restore dump-zero.1.img out
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super

or

$ btrfs restore --list-roots dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super

or

$ btrfs-find-root dump-zero.1.img
WARNING: cannot read chunk root, continue anyway
Superblock thinks the generation is 6596071
Superblock thinks the level is 1

Well, have I got a tool for you!

(FWIW, I also tried manipulating the filesystem and patching the tools to try to get past those errors, only to get a different set of errors. Some of these patches I am separately submitting to btrfs-progs.)

2. Overview of use

There are two btrfs-rec sub-command groups: btrfs-rec inspect SUBCMD and btrfs-rec repair SUBCMD, and you can find out about various sub-commands with btrfs-rec help. These are both told about devices or images with the --pv flag.

btrfs-rec inspect SUBCMD commands open the filesystem read-only, and (generally speaking) write extracted or rebuilt information to stdout. btrfs-rec repair SUBCMD commands open the filesystem read+write, and consume information from btrfs-rec inspect SUBCMD commands to actually repair the filesystem (except I haven't actually implemented any repair commands yet... despite the lack of repair commands, I believe that btrfs-rec is already useful because of the btrfs-rec inspect mount command, which gets data out of the broken filesystem). This split allows you to try things without being scared by WARNINGs about not using these tools unless you're an expert or have been told to by a developer.

In the broken dump-zero.1.img example above (which has a perfectly intact superblock, but a totally broken CHUNK_TREE), to "repair" it I'd:

  1. Start by using btrfs-rec inspect rebuild-mappings to rebuild the broken chunk/dev/blockgroup trees:

    $ btrfs-rec inspect rebuild-mappings \
        --pv=dump-zero.1.img \
        > mappings-1.json
  2. It may only mostly succeed, telling us on stderr about a few regions of the image that it wasn't able to figure out the chunks for. Using some human-level knowledge, you can write those mappings yourself, insert them into the generated mappings.json, and ask rebuild-mappings to normalize what you wrote:

    $ btrfs-rec inspect rebuild-mappings \
        --pv=dump-zero.1.img \
        --mappings=<(sed <mappings-1.json \
            -e '2a{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1},' \
            -e '2a{"LAddr":13631488,"PAddr":{"Dev":1,"Addr":13631488},"Size":1},') \
        > mappings-2.json
  3. Now that it has functioning chunk/dev/blockgroup trees, we can use btrfs-rec inspect rebuild-trees to rebuild other trees that rely on those:

    $ btrfs-rec inspect rebuild-trees \
        --pv=dump-zero.1.img \
        --mappings=mappings-2.json \
        > trees.json
  4. Now that (hopefully) everything that was damaged has been reconstructed, we can use btrfs-rec inspect mount to mount the filesystem read-only and copy out our data:

    $ mkdir mnt
    $ sudo btrfs-rec inspect mount \
        --pv=dump-zero.1.img \
        --mappings=mappings-2.json \
        --trees=trees.json \
        ./mnt

This example is fleshed out more (and the manual edits to mappings.json explained more) in ./examples/main.sh.

3. Prior art

Comparing btrfs-rec inspect mount with the existing https://github.com/adam900710/btrfs-fuse project:

4. Internals/Design

4.1. Overview of the source tree layout

4.2. Base decisions: CLI structure, Go, JSON

I started with trying to enhance btrfs-progs, but ended up writing a wholly new program in Go, for several reasons:

It turned out that Go was perhaps not the best choice, but we'll come back to that.

I wanted to separate things into a pipeline. For instance: Instead of btrfs rescue chunk-recover trying to do everything to rebuild a broken chunk tree, I wanted to separate I/O from computation from repairs. So I have btrfs-rec inspect rebuild-mappings scan, which reads all the info necessary to rebuild the chunk tree and dumps it as a 2GB glob of JSON. Then I can feed that JSON to btrfs-rec inspect rebuild-mappings process, which actually rebuilds the mappings in the chunk tree and dumps them as JSON. And then other commands can consume that mappings.json and use it instead of trying to read the chunk tree from the actual FS, so that you don't have to make potentially destructive writes to inspect an FS with a broken chunk tree, and can inspect it more forensically. Or, eventually, use btrfs-rec repair SOME_SUBCMD_I_HAVENT_WRITTEN_YET to write that chunk tree in mappings.json back to the filesystem.

(But also, the separate steps thing was useful just so I could iterate on the algorithms of rebuild-mappings process separately from having to scan the entire FS)
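
For a concrete sense of what flows between those pipeline stages, here is a rough Go rendering of the mapping records that rebuild-mappings emits, inferred from the JSON shown in the example edits above; the tool's actual type definitions may differ.

    // Rough sketch of the mapping records that "rebuild-mappings" emits as
    // JSON (e.g. {"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1}).
    // These definitions are inferred from that output and are NOT necessarily
    // the tool's actual types.
    package main

    import (
        "encoding/json"
        "fmt"
    )

    type QualifiedPhysicalAddr struct {
        Dev  int64 `json:"Dev"`  // device ID
        Addr int64 `json:"Addr"` // physical byte offset on that device
    }

    type Mapping struct {
        LAddr int64                 `json:"LAddr"` // logical byte address
        PAddr QualifiedPhysicalAddr `json:"PAddr"` // physical location
        Size  int64                 `json:"Size"`  // length in bytes
    }

    func main() {
        m := Mapping{LAddr: 5242880, PAddr: QualifiedPhysicalAddr{Dev: 1, Addr: 5242880}, Size: 1}
        out, _ := json.Marshal(m)
        fmt.Println(string(out)) // {"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1}
    }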

So, I made the decision that btrfs-rec inspect SUBCMD commands should all only open the FS read-only, and output their work to a separate file; that writing that info back to the FS should be separate in btrfs-rec repair SUBCMD.

For connecting those parts of the pipeline, I chose JSON, for a few reasons:

It turned out that JSON was perhaps not the best choice.

OK, so: Go and/or JSON maybe being mistakes:

4.3. Algorithms

There are 3 algorithms of note in btrfs-rec that I think are worth getting into mainline btrfs-progs even if the code of btrfs-rec doesn't get in:

  1. The btrfs-rec inspect rebuild-mappings algorithm to rebuild information from the CHUNK_TREE, DEV_TREE, and BLOCK_GROUP_TREE.

  2. The btrfs-rec --rebuild algorithm to cope with reading broken B+ trees.

  3. The btrfs-rec inspect rebuild-trees algorithm to re-attach lost branches to broken B+ trees.

4.3.1. The rebuild-mappings algorithm

(This step-zero scan is btrfs-rec inspect rebuild-mappings scan, and principally lives in ./lib/btrfsutil/scan.go and ./cmd/btrfs-rec/inspect/rebuildmappings/scan.go)

  1. Similar to btrfs rescue chunk-recover, scan each device for things that look like nodes; keep track of:
    • Checksums of every block on the device
    • Which physical addresses contain nodes that claim to be at a given logical address.
    • Any found Chunk items, BlockGroup items, DevExtent items, and CSum items. Keep track of the key for each of these, and for CSum items also track the generation.

Create a bucket of the data from Chunks, DevExtents, and BlockGroups; these are mostly redundant (a Chunk and a DevExtent+BlockGroup store pretty much the same information), so we can use one to reconstruct the other. How we "merge" these and handle conflicts is in ./lib/btrfs/btrfsvol/lvm.go:addMapping(). I don't think this part is particularly clever, but given that btrfs rescue chunk-recover crashes if it encounters two overlapping chunks, I suppose I should spell it out:

Now that we know how to "add a mapping", let's do that:

(The following main-steps are btrfs-rec inspect rebuild-mappings process, and principally live in ./cmd/btrfs-rec/inspect/rebuildmappings/process.go)

  1. Add all found Chunks.

  2. Add all found DevExtents.

  3. Add a physical:logical mapping of length nodesize for each node that was found.

  4. For any mappings from steps 2 or 3 that are missing blockgroup flags (that is, they weren't able to be merged with a mapping from step 1), use the found BlockGroups to fill in those flags.

  5. Now we'll merge all found CSum items into a map of the sums of the logical address space. Sort all of the csum items by generation, then by address. Loop over them in that order, inserting their sums into the map. If two csum items overlap, but agree about the sums of the overlapping region, that's fine, just take their union. For overlaps that disagree, items with a newer generation kick out items with an older generation. If disagreeing items have the same generation... I don't think that can happen except by a filesystem bug (i.e. not by a failing drive or other external corruption), so I wasn't too concerned about it; I just log an error on stderr and skip the later-processed item. See ./cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go.

    Look at regions of the logical address space that meet all 3 of these criteria:

    • we have CSum items for them
    • we have a BlockGroup for them
    • we don't have a Chunk/DevExtent mapping them to the physical address space.

    Pair those CSums up with BlockGroups, and for each BlockGroup, search the list of checksums of physical blocks to try to find a physical region that matches the logical csums (and isn't already mapped to a different logical region). I used a Knuth-Morris-Pratt search, modified to handle holes in the logical csum list as wildcards.

    Insert any found mappings into our bucket of mappings.

  6. Do the same again, but with a fuzzy search (we can re-use the csum map of the logical address space). My implementation of this is comparatively time and space intensive; I just walk over the entire unmapped physical address space, noting what % match each BlockGroup has if placed at that location. I keep track of the best 2 matches for each BlockGroup. If the best match is better than a 50% match, and the second best is less than a 50% match, then I add the best match. In my experience, the best match is >90% (or at whatever the maximum percent is for how much of the BlockGroup has logical sums), and the second best is 0% or 1%. The point of tracking both is that if there isn't a clear-cut winner, I don't want it to commit to a potentially wrong choice. (A simplified sketch of this scoring follows this list.)
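
To make step 6 a little more concrete, here is a much-simplified sketch of that fuzzy placement scoring. It is not the actual code (see ./cmd/btrfs-rec/inspect/rebuildmappings/ for that); the types, the handling of holes, and the exact scoring here are illustrative assumptions.

    // Simplified sketch of the fuzzy matching described in step 6: slide a
    // BlockGroup's logical csums over the csums of the unmapped physical
    // address space, score each offset, and only accept a clear-cut winner.
    // (The real code also skips physical regions that are already mapped.)
    package sketch

    type csum [32]byte

    type blockGroup struct {
        LAddr int64   // logical start address
        Size  int64   // size in bytes
        Sums  []*csum // per-block logical csums; nil = hole (no csum known)
    }

    // bestPlacement returns the physical byte offset (within the region that
    // physSums describes) that best matches bg's logical csums, or ok=false
    // if there is no clear-cut winner.
    func bestPlacement(bg blockGroup, physSums []csum, blockSize int64) (physOff int64, ok bool) {
        n := int(bg.Size / blockSize)
        var best, second float64
        bestOff := int64(-1)
        for off := 0; off+n <= len(physSums); off++ {
            matches, known := 0, 0
            for i := 0; i < n; i++ {
                if bg.Sums[i] == nil {
                    continue // hole in the logical csum list: neither match nor mismatch
                }
                known++
                if *bg.Sums[i] == physSums[off+i] {
                    matches++
                }
            }
            if known == 0 {
                continue
            }
            score := float64(matches) / float64(known)
            if score > best {
                second, best, bestOff = best, score, int64(off)*blockSize
            } else if score > second {
                second = score
            }
        }
        // Only accept if the best match is >50% and the runner-up is <50%.
        if best > 0.5 && second < 0.5 {
            return bestOff, true
        }
        return 0, false
    }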

4.3.2. The --rebuild algorithm

The --rebuild flag is implied by the --trees=trees.json flag, and triggers an algorithm that allows "safely" reading from a broken B+ tree, rather than using the usual B+ tree lookup and search functions. I probably should have tried to understand the btrfs restore algorithm first; maybe I reinvented the wheel...

This algorithm requires a list of all nodes on the filesystem; we find these using the same scan as above (./lib/btrfsutil/scan.go), the same procedure as btrfs rescue chunk-recover.

We walk all of those nodes, and build a reasonably lightweight in-memory graph of all nodes (./lib/btrfsutil/graph.go), tracking

4.3.2.1. rebuilt forrest behavior (looking up trees)

(see: ./lib/btrfsutil/rebuilt_forrest.go)

4.3.2.2. rebuilt individual tree behavior

(see: ./lib/btrfsutil/rebuilt_tree.go)

In order to read from a tree, we first have to build a few indexes. We store these indexes in an Adaptive Replacement Cache; they are all re-buildable based on the tree's list of roots and the above graph, so if we have a bunch of trees we don't need to keep all of this in memory at once. Note that this is done 100% with the in-memory graph; we don't need to read anything from the filesystem during these procedures.

From there, it should be trivial to implement the usual B+ tree operations using those indexes; exact-lookup using the item index, and range-lookups and walks using the item index together with the error index. Efficiently searching the CSUM_TREE requires knowing item sizes, so that's why we recorded the item sizes into the graph.
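
As a tiny illustration of why the lookups become trivial once the indexes exist, an exact lookup needs little more than the following; the types are my own stand-ins, not the ones in ./lib/btrfsutil/rebuilt_tree.go.

    // Stand-in types illustrating the "exact-lookup using the item index" case.
    package sketch

    type Key struct {
        ObjectID uint64
        ItemType uint8
        Offset   uint64
    }

    type ItemPtr struct {
        Node uint64 // logical address of the node holding the item
        Slot int    // item slot within that node
    }

    type RebuiltTree struct {
        ItemIndex map[Key]ItemPtr // built from the in-memory graph, cached in the ARC
    }

    // Lookup never touches the (possibly broken) on-disk B+ tree structure;
    // only after it returns do we read the one node that holds the item.
    func (t *RebuiltTree) Lookup(k Key) (ItemPtr, bool) {
        ptr, ok := t.ItemIndex[k]
        return ptr, ok
    }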

4.3.3. The rebuild-trees algorithm

The btrfs-rec inspect rebuild-trees algorithm finds nodes to attach as extra roots to trees. I think that conceptually it's the simplest of the 3 algorithms, but it turned out to be the hardest to get right. So, maybe more than for the others, reference the source code too (./cmd/btrfs-rec/inspect/rebuildtrees/), because I might forget some small but important detail.

The core idea here is that we're just going to walk each tree, inspecting each item in the tree, and checking for any items that are implied by other items (e.g.: a dir entry item implies the existence of an inode item for the inode that it points at). If an implied item is not in the tree, but is in some other node, then we look at which potential roots we could add to the tree that would add that other node. Then, after we've processed all of the items in the filesystem, we go add those various roots to the various trees, keeping track of which items are added or updated. If any of those added/updated items have a version with a newer generation on a different node, see what roots we could add to get that newer version. Then add those roots, keeping track of items that are added/updated. Once we reach a steady state where the newest version of each item has been added, loop back and inspect all added/updated items for implied items, keeping track of roots we could add. Repeat until a steady state is reached.

There are lots of little details in that process, some of which are for correctness, and some of which are for "it should run in hours instead of weeks."

4.3.3.1. initialization

First up, we're going to build an in-memory graph, same as above. But this time, while we're reading the nodes to do that, we're also going to watch for some specific items and record a few things about them.

(see: ./cmd/btrfs-rec/inspect/rebuildtrees/scan.go)

For each {nodeID, slotNumber} pair that matches one of these item types, we're going to record:

4.3.3.2. the main loop

(see: ./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go)

Start with that scan data (graph + info about items), and also a rebuilt forrest from the above algorithm, but with:

(The callbacks are in ./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go)

We have 5 unordered queues ("work lists"?); these are sets such that, when it's time to drain them, we sort the members and process them in that order.

  1. the tree queue: a list of tree IDs that we need to crawl
  2. the retry-item queue: for each tree ID, a set of items that we should re-process if we add a root to that tree
  3. the added-item queue: a set of key/tree pairs identifying items that have been added by adding a root to a tree
  4. the settled-item queue: a set of key/tree pairs that have not just been added by adding a root, but that we've also verified are the newest-generation item with that key that we could add to the tree.
  5. the augment queue: for each item that we want to add to a tree, the list of roots that we could add to get that item.

The queues all start out empty, except for the tree queue, which we seed with the ROOT_TREE, the CHUNK_TREE, and the BLOCK_GROUP_TREE. (It is a "TODO" task that it should probably also be seeded with the TREE_LOG, but as I will say below in the "future work" section, I don't actually understand the TREE_LOG, so I couldn't implement it.)

Now we're going to loop until the tree queue, added-item queue, settled-item queue, and augment queue are all empty (all queues except for the retry-item queue). Each loop "pass" has 3 substeps:

  1. Crawl the trees (drain the tree queue, fill the added-item queue).

  2. Either:

    1. if the added-item queue is non-empty: "settle" those items (drain the added-item queue, fill the augment queue and the settled-item queue).

    2. otherwise: process items (drain the settled-item queue, fill the augment queue and the tree queue)

  3. Apply augments (drain the augment queue and maybe the retry-item queue, fill the added-item queue).
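
In rough pseudo-Go, a pass of that loop looks something like this; the queue type and the step functions are made-up stand-ins, not the actual rebuild.go code.

    // Rough sketch of the pass structure described above; not the actual code.
    package sketch

    type queue[T comparable] map[T]struct{}

    func (q queue[T]) isEmpty() bool { return len(q) == 0 }

    type rebuilder struct {
        treeQueue        queue[int64]  // tree IDs to crawl
        retryItemQueue   queue[string] // items to re-process if a tree gains a root
        addedItemQueue   queue[string] // items added by adding a root to a tree
        settledItemQueue queue[string] // added items verified to be the newest generation
        augmentQueue     queue[string] // wanted items -> candidate roots (simplified here)
    }

    func (o *rebuilder) rebuild() {
        // Loop until every queue except the retry-item queue is empty.
        for !o.treeQueue.isEmpty() || !o.addedItemQueue.isEmpty() ||
            !o.settledItemQueue.isEmpty() || !o.augmentQueue.isEmpty() {

            // 1. Crawl the trees: drain the tree queue, fill the added-item queue.
            o.crawlTrees()

            if !o.addedItemQueue.isEmpty() {
                // 2a. Settle items: drain the added-item queue, fill the
                //     augment queue and the settled-item queue.
                o.settleItems()
            } else {
                // 2b. Process items: drain the settled-item queue, fill the
                //     augment queue and the tree queue.
                o.processItems()
            }

            // 3. Apply augments: drain the augment queue (and maybe the
            //    retry-item queue), fill the added-item queue.
            o.applyAugments()
        }
    }

    // Stubs standing in for the real substeps described in the text below.
    func (o *rebuilder) crawlTrees()    {}
    func (o *rebuilder) settleItems()   {}
    func (o *rebuilder) processItems()  {}
    func (o *rebuilder) applyAugments() {}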

OK, let's look at those 3 substeps in more detail:

  1. Crawl the trees; drain the tree queue, fill the added-item queue.

    We just look up the tree in the rebuilt forrest, which will (per the above --rebuild algorithm) either fail to look up the tree, or succeed and add to that tree the root node from the superblock/root-item. Because we set an item-added callback, when adding that root the forrest will loop over the nodes added by that root, and call our callback for each item in those added nodes. Our callback inserts each item into the added-item queue. The forrest also calls our root-added callback, but because of the way this algorithm works, that turns out to be a no-op at this step.

    I mentioned that we added callbacks to intercept the forrest's looking up of root items and resolving of UUIDs; we override the forrest's "lookup root item" routine and "resolve UUID" routine so that, instead of doing normal lookups on the ROOT_TREE and UUID_TREE, they use the WantXXX routines that we'll define below in the "graph callbacks" section.

    It shouldn't matter what order this queue is processed in, but I sort tree IDs numerically.

    The crawling is fairly fast because it's just in-memory; the only accesses to disk are looking up root items and resolving UUIDs.

  2. Either:

    1. Settle items from the added-item queue to the settled-item queue (and fill the augment queue).

      For each item in the queue, we look in the tree's item index to get the {node, slot} pair for it, then we do the same in the tree's potential-item index. If the potential-item index contains an entry for the item's key, then we check if the potential-item's node should "win" over the queue item's node, deciding the "winner" using the same routine as when building the item index. If the potential-item's node wins, then we add the potential node's set of roots to the augment queue. If the queue-item's node wins, then we add the item to the settled-item queue (except, as an optimization, if the item is of a type that cannot possibly imply the existence of another item, then we just drop it and don't add it to the settled-item queue).

      It shouldn't matter what order this queue is processed in, but I sort it numerically by treeID and then by item key.

      This step is fairly fast because it's entirely in-memory, making no accesses to disk.

    2. Process items from the settled-item queue (drain the settled-item queue, fill the augment queue and the tree queue).

      This step accesses disk, and so the order we process the queue in turns out to be pretty important in order to keep our disk access patterns cache-friendly. For the most part, we just sort each queue item by tree, then by key. But, we have special handling for EXTENT_ITEMs, METADATA_ITEMs, and EXTENT_DATA_REF items: We break EXTENT_ITEMs and METADATA_ITEMs into "sub-items", treating each ref embedded in them as a separate item. For those embedded items that are EXTENT_DATA_REFs, and for stand-alone EXTENT_DATA_REF items, we sort them not with the EXTENT_TREE items, but with the items of the tree that the extent data ref points at. Recall that during the initial scan step, we took note of which tree every extent data ref points at, so we can perform this sort without accessing disk yet. This splitting does mean that we may visit/read an EXTENT_ITEM or METADATA_ITEM multiple times as we process the queue, but to do otherwise is to solve MinLA, which is NP-hard (and I think an optimal MinLA solution would still perform worse than this anyway); there is a reasonably lengthy discussion of this in a comment in ./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue().

      Now we loop over that sorted queue. In the code, this loop is deceptively simple. Read the item, then pass it to a function that tells us what other items are implied by it. That function is large, but simple; it's just a giant table. The trick is how it tells us about implied items: we give it a set of callbacks that it calls to tell us these things, and the real complexity is in the callbacks. These "graph callbacks" will be discussed in detail below, but as an illustrative example: It may call .WantOff() with a tree ID, object ID, item type, and offset to specify a precise item that it believes should exist.

      If we encounter a ROOT_ITEM, add the tree described by that item to the tree queue.

    (Both the "can this item even imply the existence of another item" check and the "what items are implied by this item" routine are in ./lib/btrfscheck/graph.go)

  3. Apply augments; drain the augment queue (and maybe the retry-item queue), fill the added-item queue.

    It is at this point that I call out that the augment queue isn't implemented as a simple map/set like the others; the treeAugmentQueue struct has special handling for sets of different sizes, optimizing the space for empty and len()==1 sized sets, and falling back to the usual implementation for larger sets. This is important because those small sets are the overwhelming majority, and otherwise there's no way the program would be able to run on my 32GB RAM laptop. Now that I think about it, I bet it would even be worth it to add optimized storage for len()==2 sized sets.

    The reason is that each "want" from above is tracked in the queue separately; if we were OK merging them, then this optimized storage wouldn't be nescessary. But we keep them separate, so that:

    • For all "wants", including ones with empty sets, graph callbacks can check if a want has already been processed; avoiding re-doing any work (see the description of the graph callbacks below).

    • For "wants" with non-empty sets, we can see how many different "wants" could be satisfied with a given root, in order to decide which root to choose.

    Anyway, we loop over the trees in the augment queue. For each tree we look at that tree's augment queue, consider all the choices of root nodes to add (below), and decide on a list to add. Then we add each of those roots to the tree; the adding of each root triggers several calls to our item-added callback (filling the added-item queue), and our root-added callback. The root-added callback moves any items from the retry-item queue for this tree to the added-item queue.

    How do we decide between choices of root nodes to add? ./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments() has a good comment explaining the criteria we'd like to optimize for, and then code that does an OK-ish job of actually optimizing for that:

    • It loops over the augment queue for that tree, building a list of possible roots, for each possible root making note of 3 things:

      1. how many "wants" that root satisfies,

      2. how far from the tree the root's owner is (owner=tree is a distance of 0, owner=tree.parent is a distance of 1, owner=tree.parent.parent is a distance of 2, and so on), and

      3. what the generation of that root is.

    • We sort that list first by highest-count-first, then by lowest-distance-first, then by highest-generation-first.

    • We create a "return" set and an "illegal" set. We loop over the sorted list; for each possible root if it is in the illegal set, we skip it, otherwise we insert it into the return set and for each "want" that includes this root we all all roots that satisfy that want to the illegal list.

It is important that the rebuilt forrest have the flag set so that it refuses to look up a tree if it can't look up all of that tree's ancestors; otherwise the potential-items index would be garbage as we wouldn't have a good idea of which nodes are OK to consider; but this does have the downside that it won't even attempt to improve a tree with a missing parent. Perhaps the algorithm should flip the flag once the loop terminates, and then re-seed the tree queue with each ROOT_ITEM from the ROOT_TREE?

4.3.3.3. graph callbacks

(see: ./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go)

The graph callbacks are what tie the above together.

For each of these callbacks, whenever I say that it looks up something in a tree's item index or potential-item index, that implies looking the tree up from the forrest; if the forrest cannot look up that tree, then the callback returns early, after either:

The 6 methods in the btrfscheck.GraphCallbacks interface are:

  1. FSErr(): There's an error with the filesystem; this callback just spits it out on stderr. I mention such a trivial matter because, again, for a recovery tool I think it's worth putting care into how you handle errors and where you expect them: We expect them here, so we have to check for them to avoid reading invalid data or whatever, but we don't actually need to do anything other than watch our step.

  2. Want(): We want an item in a given tree with a given object ID and item type, but we don't care about what the item's offset is.

    The callback works by searching the item index to see if it can find such an item; if so, it has nothing else to do and returns. Otherwise, it searches the potential-item index; for each matching item it finds, it looks in the node index for the node containing that item, and adds the roots that would add that node to a set. Once it has finished searching the potential-item index, it adds that set to the augment queue (even if that set is still empty).

  3. WantOff(): The same, but we want a specific offset.

  4. WantDirIndex(): We want a DIR_INDEX item for a given inode and filename, but we don't know what the offset of that item is.

    First we scan over the item index, looking at all DIR_INDEX items for that inode number. For each item, we can check the scan data to see what the filename in that DIR_INDEX is, so we can see if the item satisfies this want without accessing the disk. If there's a match, then there is nothing else to do, so we return. Otherwise, we do that same search over the potential-item index; if we find any matches, then we build the set of roots to add to the augment queue the same as in Want.

  5. WantFileExt(): We want 1 or more EXTENT_DATA items in the given tree for the given inode, and we want them to cover bytes 0 through a given size of that file.

    First we walk that range in the item index, to build a list of the gaps that we need to fill ("Step 1" in rebuild_wantcb.go:_wantRange()). This walk (rebuild_wantcb.go:_walkRange()) requires knowing the size of each file extent; so doing this quickly without hitting disk is why we recorded the size of each file extent in our initialization step.

    Then ("Step 2" in _wantRange()) we iterate over each of the gaps, and for each gap do a very similar walk (again, by calling _walkRange(), but this time over the potential-item index. For each file extent we find that has is entirely within the gap, we "want" that extent, and move the beginning of of the gap forward to the end of that extent. This algorithm is dumb and greedy, potentially making sub-optimal selections; and so could probably stand to be improved; but in my real-world use, it seems to be "good enough".

  6. WantCSum(): We want 1 or more EXTENT_CSUM items to cover the half-open interval [lo_logical_addr, hi_logical_addr). Well, maybe. It also takes a subvolume ID and an inode number; and looks up in the scan data whether that inode has the INODE_NODATASUM flag set; if it does have the flag set, then it returns early without looking for any EXTENT_CSUM items. If it doesn't return early, then it performs the same want-range routine as WantFileExt, but with the appropriate tree, object ID, and item types for csums as opposed to data extents.
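
Since both WantFileExt() and WantCSum() lean on that want-range routine, here is a simplified sketch of its two steps; it is not the actual _wantRange()/_walkRange() code, and the extent/gap types and sortedness assumptions are mine.

    // Sketch of the two-step want-range walk described in WantFileExt() above.
    package sketch

    type extent struct {
        Beg, End int64 // byte range of the file that this extent covers
    }

    type gap struct {
        Beg, End int64
    }

    // findGaps is "Step 1": walk the extents already present in the item index
    // (assumed sorted by Beg) and report the sub-ranges of [0, size) that they
    // don't cover.
    func findGaps(indexed []extent, size int64) []gap {
        var gaps []gap
        pos := int64(0)
        for _, e := range indexed {
            if e.Beg > pos {
                gaps = append(gaps, gap{Beg: pos, End: e.Beg})
            }
            if e.End > pos {
                pos = e.End
            }
        }
        if pos < size {
            gaps = append(gaps, gap{Beg: pos, End: size})
        }
        return gaps
    }

    // fillGap is "Step 2" for a single gap: greedily "want" any potential
    // extent (assumed sorted by Beg) that lies entirely within the remaining
    // gap, moving the start of the gap forward to the end of each accepted
    // extent.
    func fillGap(g gap, potential []extent, want func(extent)) {
        for _, e := range potential {
            if e.Beg >= g.Beg && e.End <= g.End {
                want(e)
                g.Beg = e.End
            }
        }
    }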

For each of these callbacks, we generate a "wantKey", a tuple representing the function and its arguments; we check the augment-queue to see if we've already enqueued a set of roots for that want, and if so, that callback can return early without checking the potential-item index.
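
For illustration, that de-duplication check might look something like this; the actual wantKey and treeAugmentQueue definitions in the code differ, this is just the shape of the idea.

    // Sketch of the "have we already handled this want?" check described above.
    // Field names and types are illustrative, not the tool's actual definitions.
    package sketch

    type wantKey struct {
        TreeID   int64
        ObjectID int64
        ItemType uint8
        Off      int64  // only meaningful for offset-specific wants like WantOff()
        Func     string // which callback produced the want, e.g. "WantDirIndex"
    }

    type treeAugmentQueue struct {
        // want -> set of candidate roots that would satisfy it (possibly empty)
        enqueued map[wantKey]map[int64]struct{}
    }

    // hasEnqueued reports whether an answer (even an empty one) has already been
    // recorded for this want, letting the callback return before touching the
    // potential-item index.
    func (q *treeAugmentQueue) hasEnqueued(k wantKey) bool {
        _, ok := q.enqueued[k]
        return ok
    }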

5. Future work

It's in a reasonably useful place, I think; and so now I'm going to take a break from it for a while. But there's still lots of work to do:

6. Problems for merging this code into btrfs-progs

--
Happy hacking,
~ Luke Shumaker