A few days ago, my monitoring started popping up alerts about a server not being up-to-date.

Which isn’t supposed to happen, since it has unattended-upgrades installed and configured to upgrade everything quite aggressively.
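
For the record, “aggressively” means roughly the following; this is a minimal sketch of the two standard Debian configuration files involved, not my exact setup :

# /etc/apt/apt.conf.d/20auto-upgrades : run the periodic apt jobs every day
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";

# /etc/apt/apt.conf.d/50unattended-upgrades : accept upgrades from any origin
Unattended-Upgrade::Origins-Pattern {
        "origin=*";
};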

Okay, let’s have a look :

$ apt-get update
[...]
$ apt-get dist-upgrade
[...]
Fetched 139 MB in 2s (63.9 MB/s)
Bus error

Okay, erm. That’s really not supposed to happen. A quick look at dmesg confirmed my worst fears :

$ dmesg
[...]
[617604.704817] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[617604.704827] ata1.00: irq_stat 0x40000001
[617604.704832] ata1.00: failed command: READ DMA
[617604.704840] ata1.00: cmd c8/00:08:88:8c:cf/00:00:00:00:00/eb tag 11 dma 4096 in
                         res 51/40:00:88:8c:cf/00:00:0b:00:00/0b Emask 0x9 (media error)
[617604.704848] ata1.00: status: { DRDY ERR }
[617604.704852] ata1.00: error: { UNC }
[617604.795922] ata1.00: configured for UDMA/133
[617604.795939] sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[617604.795943] sd 0:0:0:0: [sda] tag#11 Sense Key : Medium Error [current]
[617604.795946] sd 0:0:0:0: [sda] tag#11 Add. Sense: Unrecovered read error - auto reallocate failed
[617604.795949] sd 0:0:0:0: [sda] tag#11 CDB: Read(10) 28 00 0b cf 8c 88 00 00 08 00
[617604.795952] print_req_error: I/O error, dev sda, sector 198151304
[617604.795962] BTRFS error (device sda4): bdev /dev/sda4 errs: wr 0, rd 50834, flush 0, corrupt 0, gen 0
[617604.795983] ata1: EH complete

This error sequence was repeated again and again.

That’s a bit noisy but basically : the kernel complains about not being able to read a certain place on the disk.

The disk is dying.

And we’ve lost data.

Step 1 : panic …

… and then realize panicking isn’t gonna do any good, so, calm down, and start thinking.

Step 2 : check your backups or make some

This is the moment where you need to worry about your backups.

Actually, this is the moment where you really hope you cared about backups in the past, because, at this point, you might have irremediably lost part of your data.

If you don’t have backups, you should save everything you can right now, while minimizing reads and writes to the dying disk (remember that the more you use a dying disk, the more you risk pushing it even faster in its grave).

Making a disk image with ddrescue is usually a good way to do this; you can couple it with copying the data you’re most interested in somewhere over the network, etc…
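
A typical invocation looks something like this; /mnt/rescue is a hypothetical destination with enough space to hold an image of the whole disk :

$ # first pass : grab everything that reads easily, record the bad areas in a map file
$ ddrescue -n /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map
$ # second pass : go back to the bad areas and retry them a few times, with direct reads
$ ddrescue -d -r3 /dev/sda /mnt/rescue/sda.img /mnt/rescue/sda.map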

On a related note : if you have a RAID setup, especially one with parity RAID (RAID 5 for example), change the disk as fast as possible, because a failing disk (and the rebuild that follows) puts even more stress on the remaining disks, which might very well be getting tired too…

Anyway, I do have backups already, and I won’t be able to access the machine to put in new disks for a few weeks. Let’s see what I can fix in the meantime.

A few basics about SMART data

So far we have the kernel complaining about read errors on the disk, but what does the disk itself think about this ?

That’s where SMART reports come in.

Let’s have a look at mine :

$ smartctl -a /dev/sda
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-16-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Laptop HDD
Device Model:     ST500LM021-1KJ152
Serial Number:    W62DD2AZ
LU WWN Device Id: 5 000c50 09c3f8b1a
Firmware Version: 0003LIM1
User Capacity:    500,107,862,016 bytes [500 GB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      2.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sun Jun 20 15:36:21 2021 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 (  81) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x1031)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   074   074   034    Pre-fail  Always       -       6391505
  3 Spin_Up_Time            0x0003   099   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       106
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       4406996728
  9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21892 (3 79 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       105
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       59352
188 Command_Timeout         0x0032   100   086   000    Old_age   Always       -       210460672117
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   058   048   045    Old_age   Always       -       42 (Min/Max 36/52)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       13
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   080   080   000    Old_age   Always       -       41519
194 Temperature_Celsius     0x0022   042   052   000    Old_age   Always       -       42 (0 17 0 0 0)
196 Reallocated_Event_Count 0x000f   078   078   030    Pre-fail  Always       -       19997 (49479 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       16
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 59345 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 59345 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 8c cf 0b  Error: UNC at LBA = 0x0bcf8c88 = 198151304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 8c cf eb 00   1d+22:19:33.864  READ DMA
  ca 00 08 b8 03 8c e0 00   1d+22:19:33.508  WRITE DMA
  ca 00 10 78 29 36 ec 00   1d+22:19:23.270  WRITE DMA
  ca 00 10 98 28 36 ec 00   1d+22:19:23.269  WRITE DMA
  ca 00 08 98 56 6b e0 00   1d+22:19:23.269  WRITE DMA

Error 59344 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 8c cf 0b  Error: UNC at LBA = 0x0bcf8c88 = 198151304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 8c cf eb 00   1d+22:18:33.853  READ DMA
  ca 00 08 50 0a 8c e0 00   1d+22:18:33.760  WRITE DMA
  ea 00 00 00 00 00 a0 00   1d+22:18:20.693  FLUSH CACHE EXT
  ca 00 08 80 f0 1e e0 00   1d+22:18:20.693  WRITE DMA
  ea 00 00 00 00 00 a0 00   1d+22:18:20.653  FLUSH CACHE EXT

Error 59343 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 8c cf 0b  Error: UNC at LBA = 0x0bcf8c88 = 198151304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 8c cf eb 00   1d+22:17:33.843  READ DMA
  ca 00 10 d8 21 36 ec 00   1d+22:17:30.630  WRITE DMA
  ca 00 10 a0 21 36 ec 00   1d+22:17:20.390  WRITE DMA
  ca 00 08 b0 1e 36 ec 00   1d+22:17:20.390  WRITE DMA
  ea 00 00 00 00 00 a0 00   1d+22:17:19.454  FLUSH CACHE EXT

Error 59342 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 8c cf 0b  Error: UNC at LBA = 0x0bcf8c88 = 198151304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 8c cf eb 00   1d+22:16:33.839  READ DMA
  ca 00 40 e0 e5 4a e0 00   1d+22:16:33.839  WRITE DMA
  ca 00 20 00 e5 4a e0 00   1d+22:16:33.838  WRITE DMA
  ca 00 20 60 e2 4a e0 00   1d+22:16:33.838  WRITE DMA
  ca 00 20 20 e2 4a e0 00   1d+22:16:33.838  WRITE DMA

Error 59341 occurred at disk power-on lifetime: 21892 hours (912 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 8c cf 0b  Error: UNC at LBA = 0x0bcf8c88 = 198151304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 88 8c cf eb 00   1d+22:15:33.821  READ DMA
  ca 00 18 28 1f 36 ec 00   1d+22:15:32.874  WRITE DMA
  ca 00 10 b0 1e 36 ec 00   1d+22:15:32.872  WRITE DMA
  ca 00 08 30 1e 36 ec 00   1d+22:15:32.872  WRITE DMA
  35 00 08 ff ff ff ef 00   1d+22:15:32.679  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

… noisy, right ?

Here’s a nice wiki page that explains, better than I could, how to read SMART reports in the general case.

Let’s start at the bottom : we have errors. Always in the same place.

Here is a source explaining how to read those.

Sooo, we have a defective sector, and an error on it that can’t be corrected (disks have error-correction codes that can fix some read errors, but that’s not always enough).

We kinda knew that, but now we know where it is. Good news. We’ll see what to do about that later.

Now, about the previous section, there’s interesting stuff as well, specifically :

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
196 Reallocated_Event_Count 0x000f   078   078   030    Pre-fail  Always       -       19997 (49479 0)
197 Current_Pending_Sector  0x0012   100   099   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0010   100   099   000    Old_age   Offline      -       16

There have been quite a lot of reallocations, which is not a good sign, and a few sectors are waiting to be reallocated.

Not sure how to read the “16” raw value. 16 sectors ?
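
One more thing the report shows, at the very bottom : no self-test has ever been logged on this disk. Running one is an easy way to get the drive’s own verdict and the first failing LBA, keeping in mind that a full surface scan is itself a lot of reading for an already tired disk. A sketch :

$ smartctl -t short /dev/sda    # quick test, about 1 minute on this disk
$ # or, for a full surface scan (the ~81 minutes announced above) :
$ smartctl -t long /dev/sda
$ # once the announced time has passed, read the results :
$ smartctl -l selftest /dev/sda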

btrfs to the rescue

Knowing where the errors are on-disk is nice, but we need more than that : we need to find which files are affected.

Since we have a failing disk, we might as well check the integrity of the entire filesystem, just in case.

Enter btrfs, a modern filesystem providing, among other things, out-of-the-box checksumming of all contents.

It has the ability to check all those checksums, without taking the filesystem offline, and report issues to the kernel log, so let’s do just that :

$ btrfs scrub start -r /

scrub started on /, fsid 25c78072-8b16-46af-840d-87fd3b31a22b (pid=29388)
$ # This is asynchronous. Wait for it to complete, then :
$ btrfs scrub status /
scrub status for 25c78072-8b16-46af-840d-87fd3b31a22b
        scrub started at Mon Jun 14 10:21:38 2021 and finished after 00:50:15
        total bytes scrubbed: 316.01GiB with 4 errors
        error details: read=4
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

4 errors. Let’s have a look at dmesg :

[...]
[575294.624740] BTRFS warning (device sda4): i/o error at logical 99333246976 on dev /dev/sda4, physical 100415377408, root 5, inode 790035, offset 282624, length 4096, links 1 (path: var/lib/postgresql/11/main/base/16388/17654)
[...]
[575296.824851] BTRFS warning (device sda4): i/o error at logical 99315720192 on dev /dev/sda4, physical 100397850624, root 5, inode 6311771, offset 1159168, length 4096, links 1 (path: usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2)
[...]
[575524.906876] BTRFS warning (device sda4): i/o error at logical 126268129280 on dev /dev/sda4, physical 128424001536: metadata leaf (level 0) in tree 2
[575524.906880] BTRFS warning (device sda4): i/o error at logical 126268129280 on dev /dev/sda4, physical 128424001536: metadata leaf (level 0) in tree 2
[...]

Now we know exactly where we have errors :

  • /var/lib/postgresql/11/main/base/16388/17654 : a PostgreSQL file. This might be quite problematic; we’ll see about it later
  • /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 : an apt library file. That’s likely what broke my ability to install anything with apt; it shouldn’t be too hard to fix, just reinstall the corresponding package.
  • 2 metadata nodes

The metadata nodes are pretty easy to fix : btrfs always keeps two copies of metadata, even on a single disk (the DUP profile; a quick way to check this is shown after the next scrub). Let’s re-scrub, this time without asking for read-only mode :

$ btrfs scrub start /
[...]
$ btrfs scrub status /
scrub status for 25c78072-8b16-46af-840d-87fd3b31a22b
        scrub started at Mon Jun 14 11:52:47 2021 and finished after 00:48:59
        total bytes scrubbed: 316.01GiB with 4 errors
        error details: read=4
        corrected errors: 2, uncorrectable errors: 2, unverified errors: 0

Sure enough, the metadata errors are gone. Now, let’s have a look at the other two !
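
(As an aside, here’s the quick check mentioned earlier : the block group profiles are visible in the output of btrfs filesystem df, and the Metadata line should report DUP.)

$ btrfs filesystem df /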

Reallocating the logical sectors

Strictly speaking, this part is not necessary.

HDDs and SSDs have some spare sectors, and when they detect that a sector is dead, they reallocate it to one of the spares.

The reallocation can only happen when the sector is written to (overwriting a sector means its old contents no longer matter), which is why reads keep failing in the meantime.

So, to fix those read errors, all we have to do is write to the corresponding sectors. It won’t fix the filesystem-level errors, but it should replace the read errors with checksum errors, and stop the kernel and SMART logs from being spammed with read errors.

So, why is it not really necessary ?

Well, I’ll still have to fix the contents of those files. And fixing them will do one of the following, depending on the filesystem :

  • overwrite the sector, thus triggering a reallocation
  • write the new version of the file somewhere else (copy-on-write), and simply stop reading the dead sector until it eventually gets reused, and thus reallocated, for some other data

I did try to do this, zeroing out the corresponding sectors with hdparm, but couldn’t get rid of the kernel errors.
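
For reference, the attempt looked something like this; the LBA is the one reported in the kernel and SMART logs above, and --write-sector is destructive, hence the confirmation flag :

$ # check that the sector really is unreadable
$ hdparm --read-sector 198151304 /dev/sda
$ # overwrite it with zeroes, giving the drive a chance to reallocate it
$ hdparm --yes-i-know-what-i-am-doing --write-sector 198151304 /dev/sda
$ # physical sectors are 4 KiB here, so the whole 8-sector range starting at that LBA is worth rewriting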

postgres data

One of the files I have to fix is /var/lib/postgresql/11/main/base/16388/17654 :

$ cat /var/lib/postgresql/11/main/base/16388/17654
cat: /var/lib/postgresql/11/main/base/16388/17654: Input/output error

That’s bad news, since it’s a file used by a PostgreSQL database. I do have backups, but a complete postgres rollback would mean losing recent data, and a partial restore is difficult, when it’s possible at all.

We first need to identify which database and table this file belongs to (in the path, 16388 is the database OID and 17654 the relation’s filenode).

Here is a nice post explaining how to do this.

$ psql
psql (11.11 (Debian 11.11-0+deb10u1))
Type "help" for help.
postgres=# select datname from pg_database where oid = 16388 ;
  datname
-----------
 nextcloud
(1 row)
postgres=# \q
$ psql nextcloud
psql (11.11 (Debian 11.11-0+deb10u1))
Type "help" for help.

nextcloud=# SELECT pg_filenode_relation(0, 17654);
 pg_filenode_relation
----------------------
 fs_mtime
(1 row)

nextcloud=# SELECT * FROM fs_mtime;
ERROR:  "fs_mtime" is an index
LINE 1: SELECT * FROM fs_mtime;

That’s … an index ?

Great news ! I haven’t lost data after all !

Regenerating an index is easy :

nextcloud=# REINDEX INDEX fs_mtime;
REINDEX

Now let’s check the corrupted file again :

$ cat /var/lib/postgresql/11/main/base/16388/17654
$

To my surprise, the file still exists but is now empty (I was expecting it to be deleted).

Doesn’t matter, problem solved !

Re-running btrfs scrub confirms this : there’s only one uncorrectable error left.

Fixing the apt lib

The other corrupted file is /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 :

$ cat /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 > /dev/null
cat: /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2: Input/output error

This file was installed by a package, so regenerating it should just be a matter of reinstalling that package. Let’s start by identifying which package it belongs to :

$ dpkg -S /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2
libapt-pkg5.0:amd64: /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2
$ apt-cache policy libapt-pkg5.0
libapt-pkg5.0:
  Installed: 1.8.2.3
  Candidate: 1.8.2.3
  Version table:
 *** 1.8.2.3 500
        500 http://ftp.fr.debian.org/debian buster/main amd64 Packages
        500 http://ftp.fr.debian.org/debian buster-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1.8.2.2 500
        500 http://security.debian.org/debian-security buster/updates/main amd64 Packages

Of course, that’s one of apt’s core packages, and installing packages with apt is precisely what’s broken right now. But apt can still download the package :

$ apt-get install --reinstall --download-only libapt-pkg5.0
Reading package lists... Done
Building dependency tree
Reading state information... Done
0 upgraded, 0 newly installed, 1 reinstalled, 0 to remove and 58 not upgraded.
Need to get 967 kB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://ftp.fr.debian.org/debian buster/main amd64 libapt-pkg5.0 amd64 1.8.2.3 [967 kB]
Fetched 967 kB in 0s (4211 kB/s)
Download complete and in download only mode

Then let’s grab it from the apt cache and install it manually with dpkg :

$ dpkg -i /var/cache/apt/archives/libapt-pkg5.0_1.8.2.3_amd64.deb
(Reading database ... 111546 files and directories currently installed.)
Preparing to unpack .../libapt-pkg5.0_1.8.2.3_amd64.deb ...
Unpacking libapt-pkg5.0:amd64 (1.8.2.3) over (1.8.2.3) ...
Setting up libapt-pkg5.0:amd64 (1.8.2.3) ...
Processing triggers for libc-bin (2.28-10) ...

Looks good !

Let’s test this :

$ cat /usr/lib/x86_64-linux-gnu/libapt-pkg.so.5.0.2 > /dev/null
$ apt-get dist-upgrade
# now installs packages without any issue

The system is back to normal, btrfs scrub doesn’t show any errors anymore, and I’m happy :)

Conclusion

First of all, I was VERY lucky :

  • No actual data loss : every corrupted piece of data could be rebuilt
  • One of the corruptions broke apt, which caught my attention. Without that, the problem could have gone unnoticed for much longer, and could actually have caused data loss in the meantime. I have backups, but still.

This reinforces my confidence in several of my system’s design choices :

  • backups, backups, backups. And backups. You can never have too many of them.
  • btrfs rules ! Without it, I would have been left guessing what was corrupted and what was not, including data already sent to backups. Instead, I could precisely identify and fix the corrupted data, and rule out any silent corruption.

I also have a few things to improve :

  • Monitor SMART reports. There had been failure indications in there since … forever.
  • Maybe add a periodic btrfs scrub, and monitor its results (a rough sketch of both ideas follows below)
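
A sketch of what that could look like : the smartd line goes in /etc/smartd.conf and is close to the example shipped with Debian, and the cron entry is a hypothetical monthly scrub (cron mails its summary to root, which still needs someone to read it) :

# /etc/smartd.conf : monitor all attributes, enable automatic offline testing,
# run a short self-test every day at 2am and a long one every Saturday at 3am,
# and mail root when something looks wrong
/dev/sda -a -o on -S on -s (S/../.././02|L/../../6/03) -m root

# /etc/cron.d/btrfs-scrub : scrub the root filesystem on the 1st of every month
0 4 1 * * root /usr/bin/btrfs scrub start -B /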