Disk corruption, btrfs, and recovery
A few days ago, my monitoring started popping up alerts about a server not being up-to-date.
That’s not supposed to happen, since this machine has unattended-upgrades installed and configured to upgrade everything quite aggressively.
Okay, let’s have a look :
$ apt-get update
[...]
$ apt-get dist-upgrade
[...]
Fetched 139 MB in 2s (63.9 MB/s)
Bus error
Okay, erm. That’s really not supposed to happen. A quick look at dmesg confirmed my worst fears :
$ dmesg
[...]
[617604.704817] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[617604.704827] ata1.00: irq_stat 0x40000001
[617604.704832] ata1.00: failed command: READ DMA
[617604.704840] ata1.00: cmd c8/00:08:88:8c:cf/00:00:00:00:00/eb tag 11 dma 4096 in
res 51/40:00:88:8c:cf/00:00:0b:00:00/0b Emask 0x9 (media error)
[617604.704848] ata1.00: status: { DRDY ERR }
[617604.704852] ata1.00: error: { UNC }
[617604.795922] ata1.00: configured for UDMA/133
[617604.795939] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[617604.795943] sd 0:0:0:0: [sda] tag#11 Sense Key : Medium Error [current]
[617604.795946] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[617604.795949] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 0b cf 8c 88 00 00 08 00
[617604.795952] print_req_error: I/O error, dev sda, sector 198151304
[617604.795962] BTRFS error (device sda4): bdev /dev/sda4 errs: wr 0, rd 50834, flush 0, corrupt 0, gen 0
[617604.795983] ata1: EH complete
This error sequence was repeated again and again.
That’s a bit noisy, but basically : the kernel is complaining that it cannot read a particular spot on the disk.
The disk is dying.
And we’ve lost data.
Step 1 : panic …
… and then realize panicking isn’t gonna do any good, so, calm down, and start thinking.
Step 2 : check your backups or make some
This is the moment where you need to worry about your backups.
Actually, this is the moment where you really hope you cared about backups in the past, because, at this point, you might have irremediably lost part of your data.
If you don’t have backups, you should save everything you can right now, while minimizing reads and writes to the dying disk (remember : the more you use a dying disk, the faster you push it into its grave).
Making a disk image with ddrescue is usually a good way to do this; you can combine it with copying the data you’re most interested in somewhere over the network, etc.
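A minimal sketch of what that could look like (the destination path is an assumption; any healthy disk with enough free space will do) :
$ # First pass : copy everything that reads cleanly, keeping a map file so the copy can be resumed
$ ddrescue --no-scrape /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map
$ # Second pass : go back and retry the bad areas a few times
$ ddrescue --retry-passes=3 /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map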
On a related note : if you have a RAID setup, especially one with parity RAID (RAID5, for example), change the disk as fast as possible, because a dead disk puts even more stress on the remaining disks, which might very well be getting tired too…
Anyway, I do have backups already, and I won’t be able to access the machine to put in new disks for a few weeks. Let’s see what I can fix in the meantime.
A bit of basics about SMART data
So far we have kernel errors about the disk giving read errors, but what does the disk itself think about this ?
That’s where SMART reports come in.
Let’s have a look at mine :
$ smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Laptop HDD
Device Model: ST500LM021-1KJ152
Serial Number: W62DD2AZ
LU WWN Device Id: 5 000c50 09c3f8b1a
Firmware Version: 0003LIM1
User Capacity: 500,107,862,016 bytes [500 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Sun Jun 20 15:36:21 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 81) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1031) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 074 074 034 Pre-fail Always - 6391505
3 Spin_Up_Time 0x0003 099 098 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 106
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 080 060 030 Pre-fail Always - 4406996728
9 Power_On_Hours 0x0032 076 076 000 Old_age Always - 21892 (3 79 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 105
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 59352
188 Command_Timeout 0x0032 100 086 000 Old_age Always - 210460672117
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 058 048 045 Old_age Always - 42 (Min/Max 36/52)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 13
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 42
193 Load_Cycle_Count 0x0032 080 080 000 Old_age Always - 41519
194 Temperature_Celsius 0x0022 042 052 000 Old_age Always - 42 (0 17 0 0 0)
196 Reallocated_Event_Count 0x000f 078 078 030 Pre-fail Always - 19997 (49479 0)
197 Current_Pending_Sector 0x0012 100 099 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0010 100 099 000 Old_age Offline - 16
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 59345 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 59345 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 88 8c cf 0b Error: UNC at LBA = 0x0bcf8c88 = 198151304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 8c cf eb 00 1d+22:19:33.864 READ DMA
ca 00 08 b8 03 8c e0 00 1d+22:19:33.508 WRITE DMA
ca 00 10 78 29 36 ec 00 1d+22:19:23.270 WRITE DMA
ca 00 10 98 28 36 ec 00 1d+22:19:23.269 WRITE DMA
ca 00 08 98 56 6b e0 00 1d+22:19:23.269 WRITE DMA
Error 59344 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 88 8c cf 0b Error: UNC at LBA = 0x0bcf8c88 = 198151304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 8c cf eb 00 1d+22:18:33.853 READ DMA
ca 00 08 50 0a 8c e0 00 1d+22:18:33.760 WRITE DMA
ea 00 00 00 00 00 a0 00 1d+22:18:20.693 FLUSH CACHE EXT
ca 00 08 80 f0 1e e0 00 1d+22:18:20.693 WRITE DMA
ea 00 00 00 00 00 a0 00 1d+22:18:20.653 FLUSH CACHE EXT
Error 59343 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 88 8c cf 0b Error: UNC at LBA = 0x0bcf8c88 = 198151304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 8c cf eb 00 1d+22:17:33.843 READ DMA
ca 00 10 d8 21 36 ec 00 1d+22:17:30.630 WRITE DMA
ca 00 10 a0 21 36 ec 00 1d+22:17:20.390 WRITE DMA
ca 00 08 b0 1e 36 ec 00 1d+22:17:20.390 WRITE DMA
ea 00 00 00 00 00 a0 00 1d+22:17:19.454 FLUSH CACHE EXT
Error 59342 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 88 8c cf 0b Error: UNC at LBA = 0x0bcf8c88 = 198151304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 8c cf eb 00 1d+22:16:33.839 READ DMA
ca 00 40 e0 e5 4a e0 00 1d+22:16:33.839 WRITE DMA
ca 00 20 00 e5 4a e0 00 1d+22:16:33.838 WRITE DMA
ca 00 20 60 e2 4a e0 00 1d+22:16:33.838 WRITE DMA
ca 00 20 20 e2 4a e0 00 1d+22:16:33.838 WRITE DMA
Error 59341 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 88 8c cf 0b Error: UNC at LBA = 0x0bcf8c88 = 198151304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 88 8c cf eb 00 1d+22:15:33.821 READ DMA
ca 00 18 28 1f 36 ec 00 1d+22:15:32.874 WRITE DMA
ca 00 10 b0 1e 36 ec 00 1d+22:15:32.872 WRITE DMA
ca 00 08 30 1e 36 ec 00 1d+22:15:32.872 WRITE DMA
35 00 08 ff ff ff ef 00 1d+22:15:32.679 WRITE DMA EXT
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
… noisy, right ?
Here’s a nice wiki page that explains, better than I could, how to read SMART reports in the general case.
Let’s start at the bottom : we have errors. Always in the same place.
Here is a source explaining how to read those.
Sooo, we have a defective sector, and an error on it that can’t be corrected (disks can have error-correction codes able to fix errors, but that’s not always enough).
We kinda knew that, but now we know where it is. Good news. We’ll see what to do about that later.
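By the way, if the kernel logs hadn’t already pointed at /dev/sda4, we could have located the affected partition by listing the partition table and checking which partition’s start and end sectors bracket LBA 198151304 (generic command, output omitted) :
$ fdisk -l /dev/sda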
Now, about the previous section, there’s interesting stuff as well, specifically :
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count 0x000f 078 078 030 Pre-fail Always - 19997 (49479 0)
197 Current_Pending_Sector 0x0012 100 099 000 Old_age Always - 16
198 Offline_Uncorrectable 0x0010 100 099 000 Old_age Offline - 16
There have been quite a lot of reallocations, which is not a good sign, and a few sectors are waiting to be reallocated.
Not sure how to read the “16” raw value. 16 sectors ?
btrfs to the rescue
Knowing where the errors are on-disk is nice, but we need more than that : we need to find which files are affected.
Since we have a failing disk, we might as well check the integrity of the entire filesystem, just in case.
Enter btrfs, a modern filesystem providing, among other things, out-of-the-box checksumming of all contents.
It has the ability to check all those checksums, without taking the filesystem offline, and report issues to the kernel log, so let’s do just that :
$ btrfs scrub start -r /
scrub started on /, fsid 25c78072-8b16-46af-840d-87fd3b31a22b (pid=29388)
$ # This is asynchronous. Wait for it to complete, then :
$ btrfs scrub status /
scrub status for 25c78072-8b16-46af-840d-87fd3b31a22b
scrub started at Mon Jun 14 10:21:38 2021 and finished after 00:50:15
total bytes scrubbed: 316.01GiB with 4 errors
error details: read=4
corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
4 errors. Let’s have a look at dmesg :
[...]
[575294.624740] BTRFS warning (device sda4): i/o error at logical 99333246976 on dev /dev/sda4, physical 100415377408, root 5, inode 790035, offset 282624, length 4096, links 1 (path: var/lib/postgresql/11/main/base/16388/17654)
[...]
[575296.824851] BTRFS warning (device sda4): i/o error at logical 99315720192 on dev /dev/sda4, physical 100397850624, root 5, inode 6311771, offset 1159168, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2)
[...]
[575524.906876] BTRFS warning (device sda4): i/o error at logical 126268129280 on dev /dev/sda4, physical 128424001536: metadata leaf (level 0) in tree 2
[575524.906880] BTRFS warning (device sda4): i/o error at logical 126268129280 on dev /dev/sda4, physical 128424001536: metadata leaf (level 0) in tree 2
[...]
Now we know exactly where we have errors :
- /var/lib/postgresql/11/main/base/16388/17654 : a PostgreSQL file. This might be quite problematic; we’ll get back to it later
- /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 : an apt library. That’s likely what was breaking my ability to install stuff with apt; it shouldn’t be too hard to fix, just reinstall the corresponding package
- 2 metadata nodes
The metadata nodes are pretty easy to fix : btrfs always keeps two copies of them even on a single disk. Let’s re-scrub without asking for a read-only mode :
$ btrfs scrub start /
[...]
$ btrfs scrub status /
scrub status for 25c78072-8b16-46af-840d-87fd3b31a22b
scrub started at Mon Jun 14 11:52:47 2021 and finished after 00:48:59
total bytes scrubbed: 316.01GiB with 4 errors
error details: read=4
corrected errors: 2, uncorrectable errors: 2, unverified errors: 0
Sure enough, the metadata errors are gone.
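This self-healing works because btrfs keeps metadata in the DUP profile by default on a single HDD, so an intact second copy was available to repair from. You can check which profiles your filesystem uses with the command below (generic, not specific to my setup); look for a line like “Metadata, DUP” in the output :
$ btrfs filesystem df /
Now, let’s have a look at the other two errors !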
Reallocating the logical sectors
Strictly speaking, this part is not necessary.
HDDs and SSDs have some spare sectors, and when they detect that a sector has died, they reallocate it onto a spare one.
The reallocation can only happen when writing to the sector (re-writing a sector means we no longer care about its old contents, so nothing can be lost), which is why we’re currently getting read errors.
So, to fix those read errors, all we have to do is write to the corresponding sectors. That won’t fix the filesystem-level errors, but it should turn the read errors into checksum errors, and stop SMART and the kernel from spamming the logs with read errors.
So, why is it not really necessary ?
Well, I’ll still have to fix the contents of those files. And that should do one of the following, depending on the filesystem type :
- overwrite the sector, thus triggering a reallocation
- write the new version of the file somewhere else, and stop trying to read the dead sector until it eventually gets reused (and reallocated) for some other data
I did try to do this, zeroing out the corresponding sectors with hdparm, but couldn’t get rid of the kernel errors.
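For the record, the attempt looked roughly like this (the sector number is the LBA from the SMART and dmesg output above; --write-sector fills the sector with zeroes, hence the deliberately scary confirmation flag) :
$ # Confirm the sector really is unreadable
$ hdparm --read-sector 198151304 /dev/sda
$ # Overwrite it with zeroes, giving the drive a chance to remap it
$ hdparm --yes-i-know-what-i-am-doing --write-sector 198151304 /dev/sda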
Fixing the postgres data
One of the files I have to fix is /var/lib/postgresql/11/main/base/16388/17654 :
$ cat /var/lib/postgresql/11/main/base/16388/17654
cat: /var/lib/postgresql/11/main/base/16388/17654: Input/output error
That’s bad news, since it’s a file used by a PostgreSQL database. I do have backups, but a complete postgres rollback would mean losing recent data, and a partial restore is difficult, when it’s possible at all.
We first need to identify which database and table use this file.
Here is a nice post explaining how to do this.
$ psql
psql (11.11 (Debian 11.11-0+deb10u1))
Type "help" for help.
postgres=# select datname from pg_database where oid = 16388 ;
datname
-----------
nextcloud
(1 row)
postgres=# \q
$ psql nextcloud
psql (11.11 (Debian 11.11-0+deb10u1))
Type "help" for help.
nextcloud=# SELECT pg_filenode_relation(0, 17654);
pg_filenode_relation
----------------------
fs_mtime
(1 row)
nextcloud=# SELECT * FROM fs_mtime;
ERROR: "fs_mtime" is an index
LINE 1: SELECT * FROM fs_mtime;
That’s … an index ?
Great news ! I haven’t lost data after all !
Regenerating an index is easy :
nextcloud=# REINDEX INDEX fs_mtime;
REINDEX
Now let’s check the corrupted file again :
$ cat /var/lib/postgresql/11/main/base/16388/17654
$
To my surprise, the file still exists but is now empty (I was expecting it to be deleted).
Doesn’t matter, problem solved !
Re-running btrfs scrub confirms this : there’s only one uncorrectable error left.
Fixing the apt lib
The other corrupted file is /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 :
$ cat /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 > /dev/null
cat: /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2: Input/output error
This file is installed by a package, so regenerating it should just be a matter of reinstalling that package. Let’s start by identifying which package it belongs to :
$ dpkg -S /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2
libapt-pkg5.0:amd64: /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2
$ apt-cache policy libapt-pkg5.0
libapt-pkg5.0:
Installed: 1.8.2.3
Candidate: 1.8.2.3
Version table:
*** 1.8.2.3 500
500 http://ftp.fr.debian.org/debian buster/main amd64 Packages
500 http://ftp.fr.debian.org/debian buster-updates/main amd64 Packages
100 /var/lib/dpkg/status
1.8.2.2 500
500 http://security.debian.org/debian-security buster/updates/main amd64 Packages
Of course, that’s one of the core apt packages, and my ability to install packages with apt is currently broken. But I can still use it to download the package :
$ apt-get install --reinstall --download-only libapt-pkg5.0
Reading package lists... Done
Building dependency tree
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 58 not upgraded.
Need to get 967 kB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://ftp.fr.debian.org/debian buster/main amd64 libapt-pkg5.0 amd64 1.8.2.3 [967 kB]
Fetched 967 kB in 0s (4211 kB/s)
Download complete and in download only mode
Then let’s grab it from the apt cache and install it manually with dpkg :
$ dpkg -i /var/cache/apt/archives/libapt-pkg5.0_1.8.2.3_amd64.deb
(Reading database ... 111546 files and directories currently installed.)
Preparing to unpack .../libapt-pkg5.0_1.8.2.3_amd64.deb ...
Unpacking libapt-pkg5.0:amd64 (1.8.2.3) over (1.8.2.3) ...
Setting up libapt-pkg5.0:amd64 (1.8.2.3) ...
Processing triggers for libc-bin (2.28-10) ...
Looks good !
Let’s test this :
$ cat /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 > /dev/null
$ apt-get dist-upgrade
# normally installs packages without issues
The system is back to normal, btrfs scrub doesn’t show any errors anymore, and I’m happy :)
Conclusion
First of all, I was VERY lucky :
- No actual data loss, every corrupted piece of data could be rebuilt
- One of the corruptions broke apt, which caught my attention. Without that, the problem could have gone unnoticed much longer, and could actually have caused data loss in the meantime. I have backups, but still.
This validates several aspects of how I designed this system :
- backups, backups, backups. And backups. You never have too many of them.
- btrfs rules ! Without it, I would have been left guessing what was corrupted and what was not, including data already sent to backups. Instead, I could precisely identify and fix the corrupted data, and rule out any silent corruption.
I also have a few things to improve :
- Monitor SMART reports. There have been failure indications in there since… forever.
- Maybe add a periodic btrfs scrub, and monitor its results (a rough sketch below).
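As a rough sketch of what this could look like (schedules, recipients, and paths are placeholders, not a tested setup) : a single line in /etc/smartd.conf makes smartd watch every disk, run periodic self-tests, and mail warnings to root :
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
and a monthly scrub can be scheduled from an /etc/cron.d entry; with -B the command waits for the scrub to finish, so its output and exit status can be checked :
0 4 1 * * root btrfs scrub start -B /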