Bayle Shanks's website: notes-computer-backups

Commands to backup to another hard drive

sudo rsync -axW / /l/backup/sda/date

while excluding some directories (for example, those that themselves contain backups):

sudo rsync -ax --progress --exclude=backup /media/t1/ /media/t3
sudo rsync -ax --exclude=/l2/backup /media/l2 /media/l3/backup/l2/date
sudo rsync -ax --exclude=/l/backup --exclude=/l/autobackup /media/l /media/l3/backup/l/date
sudo rsync -axW / /l/backup/sda/date --progress --exclude=backup --exclude=big
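
Assuming the trailing "date" in the destination paths above is a placeholder for the day of the backup, here is a minimal sketch of filling it in automatically. The helper name backup_dest and the YYYY-MM-DD layout are my own assumptions, not something these notes prescribe:

```shell
# Hypothetical helper: print a destination directory named after today's date,
# e.g. /l/backup/sda/2024-05-31 (YYYY-MM-DD names sort chronologically).
backup_dest() {
    # $1 = parent directory that holds the dated backups (trailing / is dropped)
    printf '%s/%s\n' "${1%/}" "$(date +%Y-%m-%d)"
}

# Usage (not run here, since it needs root and a mounted backup drive):
#   dest=$(backup_dest /l/backup/sda)
#   sudo mkdir -p "$dest"
#   sudo rsync -axW --progress --exclude=backup --exclude=big / "$dest"
```

Computing the path once and reusing it in mkdir and rsync avoids the two commands disagreeing if the backup runs across midnight.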

sudo du -schx /home/bshanks/* --exclude=/proc --exclude=/sys | sort -h

sudo du -schx /home/bshanks/.[^.]* --exclude=/proc --exclude=/sys | sort -h

while excluding some directories (for example, those that themselves contain backups):

sudo rsync -ax --exclude=/home/bshanks/aba --no-specials --no-devices --no-links /home/bshanks /media/bshanks/backup/bshanks

from EXT to VFAT:

sudo mkdir /media/Eposix
sudo mount.posixovl -S /media/E /media/Eposix/
sudo rsync -vv --modify-window=2 -axui --progress --timeout=240 --outbuf=N --exclude=/home/bshanks/aba --no-devices --no-specials --no-links /home/bshanks /media/Eposix/backup/

if you are backing up this way you lose information about the creator of the old files. You can do:

(find / -xdev -type f -exec ls -l {} \;) > /tmp/dirlist.txt

to at least have a list of these.
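
The find/ls command above starts one ls process per file. As an alternative sketch, GNU find can print the owner, group and mode itself. This assumes GNU findutils (for -printf), and the function name list_owners is hypothetical:

```shell
# Print "owner<TAB>group<TAB>octal-mode<TAB>path" for every regular file under
# $1, staying on one filesystem (-xdev), like the find/ls command above.
# Assumes GNU find; -printf is not available in BSD/macOS find.
list_owners() {
    find "$1" -xdev -type f -printf '%u\t%g\t%m\t%p\n'
}

# Usage (not run here; use a root shell to include files you cannot read):
#   list_owners / > /tmp/ownerlist.txt
```

The tab-separated output is easier to feed back into a script later than the columns of ls -l.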

for f in *; do (sudo tar --create --file=- "$f" > "$f.tar"); done
for f in *.tar; do xz "$f"; done

DVDs

You need to have some OFFLINE backups (e.g. to a DVD) in addition to backups to hard drives which are always connected (like external USB drives including Time Machine).

I'm going to try the 25GB M-Disc Blu-ray. Unlike pressed DVDs that you buy in stores, most consumer-writable DVDs (i assume this includes blu-rays) can decay over short periods of time (a few years). M-Disc is supposedly a consumer-writable DVD that won't decay so easily. It has both DVD and Blu-ray variants.

note: DO NOT WRITE OVER THE DATA STORAGE SPACE OF CDS OR DVDS, EVEN WITH A CD/DVD-SAFE MARKER. Write only in the non-data-storing centre

Checking hard drives for errors

Timing out

When you try to do something on the hard drive and it times out, often it's because there's a problem. Look at dmesg.

Looking thru dmesg

'dmesg' is a log from the Linux kernel. It only lasts until you reboot.

To check dmesg, do:

sudo dmesg | less

When i had errors, they looked like this:

[  150.775878] sd 5:0:0:0: [sdb] Unhandled sense code
[  150.775891] sd 5:0:0:0: [sdb]  
[  150.775896] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[  150.775901] sd 5:0:0:0: [sdb]  
[  150.775906] Sense Key : Medium Error [current] 
[  150.775914] sd 5:0:0:0: [sdb]  
[  150.775921] Add. Sense: Unrecovered read error
[  150.775926] sd 5:0:0:0: [sdb] CDB: 
[  150.775929] Read(10): (omitted)
[  150.775945] end_request: critical medium error, dev sdb, sector 2621445056

I think the crucial part is probably 'critical medium error'; so you may want to do just:

sudo dmesg | grep 'critical medium error'

To save the current dmesg to a file:

sudo dmesg > file_name.txt
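
To pull just the affected device and sector out of error lines like the sample above, something like the following might do. It is only a sketch: the function name medium_errors is hypothetical, and the pattern is based on the one sample line in these notes (the text before 'critical medium error' may differ between kernel versions, so the pattern does not depend on it):

```shell
# Read kernel log text on stdin and print "device sector" for every
# "critical medium error" line, in the format of the sample shown above.
medium_errors() {
    sed -n 's/.*critical medium error, dev \([^,]*\), sector \([0-9][0-9]*\).*/\1 \2/p'
}

# Usage (not run here):
#   sudo dmesg | medium_errors | sort -u
#   medium_errors < file_name.txt
```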

Checking a partition with fsck

See the fsck section below, because fsck can also do some repairs.

Some Seagate enclosures don't work with uas, and hence SMART doesn't work through them, so blacklist them from uas

look in /var/log/syslog (or mb lsusb) to get the "ID", then do:

cat /sys/module/usb_storage/parameters/quirks

make sure the result is empty b/c the next command will overwrite it

echo "0x0bc2:0x2344:u" > /sys/module/usb_storage/parameters/quirks

(from https://www.smartmontools.org/wiki/SAT-with-UAS-Linux )
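
Since writing to the quirks file replaces whatever was there, one way to keep existing entries is to write the old value and the new entry together. A sketch, assuming the parameter takes a comma-separated list of VID:PID:flags entries (that is how I read the kernel's usb-storage documentation); merge_quirks is a hypothetical helper:

```shell
# Print the value to write to the quirks parameter: the existing value ($1,
# possibly empty) plus the new entry ($2), comma-separated, without adding
# the new entry a second time if it is already present.
merge_quirks() {
    case ",$1," in
        *",$2,"*) printf '%s\n' "$1" ;;          # already present
        ",,")     printf '%s\n' "$2" ;;          # nothing set yet
        *)        printf '%s,%s\n' "$1" "$2" ;;  # append to existing entries
    esac
}

# Usage (not run here; needs a root shell):
#   q=/sys/module/usb_storage/parameters/quirks
#   new=$(merge_quirks "$(cat "$q")" 0bc2:2344:u)
#   echo "$new" > "$q"
```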

(note: after unplugging such a drive, before doing the quirks thingee, i still see it in lsusb, i couldn't easily find a way short of rebooting to make it disappear) (via https://forums.linuxmint.com/viewtopic.php?t=322252 )

to make it permanent,

sudo vi /etc/modprobe.d/disable_uas.conf

and add line like:

options usb-storage quirks=0bc2:2344:u

then rebuild your initramdisk. On Pop OS:

sudo update-initramfs -u
sudo kernelstub -k /boot/vmlinuz -i /boot/initrd.img

Checking a drive with SMART

Print all SMART information about a drive:

sudo smartctl -a /dev/sda | less

(for even more, do -xa instead of -a)

NOTE: If you get 'Unknown USB bridge' and 'Please specify device type with the -d option.', try '-d sat' and '-d scsi'.

NOTE: there is a line in this like "SMART overall-health self-assessment test result: PASSED"; i have seen "PASSED" even on a drive that seemed to me to be pretty obviously failing (a ton of bad blocks), so i would take PASSED with a big grain of salt, and look at the most important SMART attributes, too:

Print just the SMART attributes, highlighting the most important ones for spinning disks (see below) (thanks Benjamin Schweizer):

sudo smartctl -A /dev/sda | grep -E --color "^( *5| *10|184|187|188|197|198|232|233).*|"

Doing SMART tests (warning, on some drives sometimes this seems to reset some of the SMART attributes). Only one test can be run at a time. Tests can usually be run simultaneously with using the drive (although sometimes using the drive may abort the test?). The smartctl -a listing says how long each test will take.

sudo smartctl -t short /dev/sdb
sudo smartctl -t conveyance /dev/sdb
sudo smartctl -t offline /dev/sdb
sudo smartctl -t long /dev/sdb

Backblaze's suggestions about which SMART attributes to watch

https://www.backblaze.com/blog/hard-drive-smart-stats/ suggests paying attention to these 5 stats (the above command highlights them for you):

SMART 5 – Reallocated_Sector_Count
SMART 187 – Reported_Uncorrectable_Errors
SMART 188 – Command_Timeout
SMART 197 – Current_Pending_Sector_Count
SMART 198 – Offline_Uncorrectable

https://kb.acronis.com/ and some other sites explain what these mean:

SMART 5 – Reallocated_Sector_Count: "count of reallocated sectors (512 bytes). When the hard drive finds a read/write/verification error, it marks this sector as "reallocated" and transfers data to a special reserved area (spare area). This process is also known as remapping and "reallocated" sectors are called remaps. This is why, on a modern hard disks, you will not see "bad blocks" while testing the surface - all bad blocks are hidden in reallocated sectors. However, the more sectors that are reallocated, the more a sudden decrease (up to 10% and more) can be noticed in the disk read/write speed...This is a critical parameter. Degradation of this parameter may indicate imminent drive failure. Urgent data backup and hardware replacement is recommended."
SMART 187 – Reported_Uncorrectable_Errors: "number of errors that could not be recovered using hardware ECC (error-correcting code)...Although this parameter is not considered critical by the most hardware vendors, degradation of this parameter may indicate electromechanical problems of the disk. Regular backup is recommended. If no other (critical) parameters report a problem, hardware replacement is recommended on mission critical systems only."
SMART 188 – Command_Timeout: "number of aborted operations due to hard disk timeout...This is a critical parameter. Degradation of this parameter may indicate serious problems with power supply or an oxidized data cable. Urgent data backup and hardware replacement is recommended."
SMART 197 – Current_Pending_Sector_Count: "a critical parameter and indicates the current count of unstable sectors (waiting for remapping). The raw value of this attribute indicates the total number of sectors waiting for remapping. Later, when some of these sectors are read successfully, the value is decreased. If errors still occur when reading some sector, the hard drive will try to restore the data, transfer it to the reserved disk area (spare area) and mark this sector as remapped... This is a critical parameter. Degradation of this parameter may indicate imminent drive failure. Urgent data backup and hardware replacement is recommended." [1]
SMART 198 – Offline_Uncorrectable: "the number of sectors that the drive has attempted to correct itself, but failed. Running the offline self-test should cause the drive to test the sectors and attempt to fix them. Not all drives support this though." -- [2]

in summary:

5 – Reallocated_Sector_Count should be at 0 (raw) and between 94 and 130 (normalized). When the raw value gets above zero, replace the drive, and when it gets more than 16 (raw) or less than 94 (normalized) (or more than 130 normalized!), it's real bad (by 'real bad' i mean annual failure rate over 20%).
187 – Reported_Uncorrectable_Errors should be at 0 (raw) and between 91 and 104 (normalized). When the raw value gets above zero, replace the drive, it's bad.
197 – Current_Pending_Sector_Count should be at 0 (raw) (the normalized value is not useful). When the raw value gets above zero, it's real bad.
198 – Offline_Uncorrectable should be at 0 (raw) (the normalized value is not useful). When the raw value gets above zero, it's real bad.

they also mention this one, but it's harder to interpret imo, i just see '1 1 1' for all of my drives which report it at all:

188 – Command_Timeout should be below 13G (13000000000) (raw) (the normalized value is not useful) (if zero, even better). When the raw value gets above 13G, it's real bad.

that being said, i have 4 hard drives, only one of which appears to be failing, but ALL of them have at least one non-zero value in one or more of the four attributes listed above that should be 0 (5, 187, 197, 198).
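
The "should be 0 (raw)" checks from the summary can be sketched as a small filter over smartctl -A output. This assumes the usual ATA attribute table, in which the raw value is the 10th column; the function name nonzero_smart is hypothetical. As noted above, a hit here is a reason to look closer, not proof that the drive is failing:

```shell
# Read `smartctl -A` output on stdin and print the ID, name and raw value of
# attributes 5, 187, 197 and 198 whose raw value is above zero.
# Assumes the raw value is column 10 of the attribute table.
nonzero_smart() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 > 0 { print $1, $2, $10 }'
}

# Usage (not run here):
#   sudo smartctl -A /dev/sda | nonzero_smart
```

No output means all four attributes that are reported have a raw value of zero; attributes the drive does not report at all are silently skipped.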

for charts involving other SMART attributes, see https://www.backblaze.com/blog-smart-stats-2014-8.html , and for a walkthrough of these sorts of charts, see https://www.backblaze.com/blog/hard-drive-smart-stats/

I don't know which attributes are most important for SSDs. Anyone know of an analysis like this for SSDs? My best guess is that raw "5 Reallocated_Sector_Ct" and normalized "233 Media_Wearout_Indicator" are the ones to watch. http://serverfault.com/a/283272/86461 also mentions 184 End-to-End_Error and 232 Available_Reservd_Space.

defns of these:

SMART 184 – End-to-End_Error: "after transferring through the cache RAM data buffer, the parity data between the host and the hard drive did not match...This is a critical parameter. Degradation of this parameter may indicate imminent drive failure. Urgent data backup and hardware replacement is recommended." [3]
SMART 232 – Available_Reservd_Space: "number of reserve blocks remaining" [4] (normalized only, at least on my drive)
SMART 233 – Media_Wearout_Indicator: "number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1, as the average erase cycle count increases from 0 to the maximum rated cycles." [5]

In addition to the ones mentioned by Backblaze, https://kb.acronis.com/content/9264 thinks the following are 'critical': 10 - Spin Retry Count, 184 - End-to-End Error, 196 - Reallocation Event Count (and 201 - Soft Read Error Rate, but my drives don't report that and it's not in Backblaze's dataset). Backblaze did not observe nonzero Spin Retry Counts, did not observe non-zero raw End-to-End errors and found an unclear connection between normalized End-to-End errors and failure, found that Reallocation Event Count has a relatively moderate correlation to failure, and doesn't include 232 or 233 in their dataset (presumably these are SSD only).

more information about the various attributes is at: https://en.wikipedia.org/wiki/S.M.A.R.T.#Known_ATA_S.M.A.R.T._attributes

If errors are found

if a drive has media errors, use ddrescue on it, don't wear it further by trying to repair it:

https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html

If it's just filesystem errors (eg not hardware errors), then you can try to use fsck to repair it:

Repairing with fsck

As noted above, if the drive is failing, don't repair with fsck! The more you use a failing drive, the worse it gets. Use ddrescue to copy the partition to another drive, and then repair the copy.
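
A sketch of what "copy, then repair the copy" could look like. It only prints the commands so you can review them first. The option choices (-n for a quick first pass, then -d -r3 to retry the bad areas) follow the examples in the ddrescue manual as i understand them; the function name ddrescue_plan and the image path are hypothetical, and the e2fsck step assumes the partition holds an ext2/ext3/ext4 filesystem:

```shell
# Print (not run) a two-pass ddrescue plan for copying partition $1 into the
# image file $2, with the mapfile $2.map recording which areas were read.
# Pass 1 (-n) skips the slow scraping phase to grab the easy data first;
# pass 2 (-d -r3) retries the bad areas up to 3 times using direct disc access.
ddrescue_plan() {
    printf 'sudo ddrescue -n %s %s %s.map\n' "$1" "$2" "$2"
    printf 'sudo ddrescue -d -r3 %s %s %s.map\n' "$1" "$2" "$2"
    # Repair the copy, not the failing drive.
    printf 'sudo e2fsck -f -y -v %s\n' "$2"
}

# Usage: ddrescue_plan /dev/sdb1 /media/t3/sdb1.img
```

Keeping the same mapfile for both passes is what lets the second pass resume where the first left off instead of rereading the whole partition.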

On ext2/ext3/ext4:

sudo e2fsck -f -y -v /dev/sda1

(the -f option just means to go ahead and fsck even if the drive is marked 'clean'; the -y option means to try and repair things without asking further; the -v is just for verbose)

(of course, replace /dev/sda1 with the path to the partition you want to check)

You might want to ask fsck to scan for bad blocks, and to tell the filesystem where they are if it finds any:

sudo e2fsck -f -kc -y -v /dev/sda1

(this only does reads while looking for bad blocks; the 'k' means to also keep any bad blocks that were previously in the bad blocks list. If you want to do 'non-destructive' read-write tests, use -cc instead of -c; i think that 'non-destructive' writes may destroy data if bad blocks are found, though [6])

warning, this takes forever! You can quit in the middle with Ctrl-C, though. I'm not sure if this is worth the extra wear it puts on the drive ( http://superuser.com/a/525868/33599 ).

Compression

use 'lzip' for long-term archiving

more notes on smartctl:

if a drive isn't recognized, try -d sat or -d scsi

if -d scsi works but -t short gives 'Short offline self test failed [unsupported field in scsi command]', then try disabling uas. First, to get the vendor and product ID, use -d scsi and write down the serial number. Then, use 'usb-devices' and find the device with that serial number, and write down the listed vendor and product ID. Now, umount all usb storage drives, then do:

sudo modprobe -r uas
sudo modprobe -r usb-storage

sudo modprobe usb-storage quirks=0bc2:3312:u

but where 0bc2 and 3312 are replaced by your vendor and product id.

(thanks [7])

---

sudo diff -qr --no-dereference FILEPATH1 FILEPATH2 | grep -v 'is a socket' | grep -v 'is a fifo' | grep -v 'Only in .*: backup'
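
The diff-and-filter check can be wrapped in a function so it is easy to rerun after each backup. A sketch, assuming GNU diff (for --no-dereference); the name compare_trees is hypothetical. LC_ALL=C is set because the grep patterns only match diff's English messages:

```shell
# Compare two directory trees and print only the differences that matter,
# dropping diff's notes about sockets and fifos, and entries whose name
# starts with "backup" that exist on one side only.
compare_trees() {
    LC_ALL=C diff -qr --no-dereference "$1" "$2" |
        grep -v -e 'is a socket' -e 'is a fifo' -e 'Only in .*: backup'
}

# Usage (not run here; use a root shell if some files are unreadable):
#   compare_trees /home/bshanks /media/bshanks/backup/bshanks/bshanks
```

Note that the function's exit status comes from grep, so it is non-zero when nothing is left to report.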

---

to create a (potentially future-bootable) drive for backup with an encrypted main partition (WORK IN PROGRESS)

GParted

Create:

a 2048 MiB (2 GiB) FAT32 partition (for /boot/efi)
a 65536 MiB (64 GiB) NTFS partition (for sharing files with Windows)
a 5000 MiB (or, >4.4 GiB) FAT32 partition (for a pop_os recovery partition)
an ext4 partition with most of the rest of the free space
leave a bit unallocated at the end (maybe about 5%)

Click the checkbox and apply. Take note of the "/dev" name of the big partition. Exit Gparted.

To encrypt your new partition and choose your initial encryption password, enter the following commands, but if your big partition "/dev" name is other than /dev/sdc4, replace that with your "/dev" name. And replace "cryptdata1" and "vgrp1" with unique names:

sudo -i
cryptsetup luksFormat /dev/sdc4
cryptsetup open /dev/sdc4 cryptdata1
pvcreate /dev/mapper/cryptdata1
vgcreate vgrp1 /dev/mapper/cryptdata1
lvcreate -L 32G vgrp1 -n swap
lvcreate -l 100%FREE vgrp1 -n root
mkfs.ext4 /dev/vgrp1/root
mkswap /dev/vgrp1/swap

to mount on future boots (assuming that on future boots the OS is identifying this drive as /dev/sdb instead of /dev/sdc), you do:

sudo mkdir /media/t2 # only have to do this the first time

sudo cryptsetup open /dev/sdb4 cryptdata1
sudo cryptsetup open /dev/sdc4 cryptdata1 # instead of the previous line, if the drive shows up as /dev/sdc
sudo mount /dev/vgrp1/root /media/t2

to unmount:

sudo umount /media/t2
sudo vgchange -a n vgrp1
sudo cryptsetup close cryptdata1

to restore from backup (todo process these notes)

install pop_os onto replacement drive ("clean install") using an updated official installer on a flash drive

install OS updates using the pop shop app. Restart.

Alt-F2; gnome-terminal; then:

sudo apt install gparted
sudo gparted # figure out the device holding the encrypted volume with your home folder backup, for me this was /dev/sda3
sudo mkdir /media/b0
sudo cryptsetup open /dev/sda3 cryptdata1

(You may need to do sudo vgs; sudo vgrename ; sudo vgscan; sudo lvscan; sudo lvchange -ay /dev/datab0/root in case the volume group on the volume that we're calling b0 has the same name as the root install, that is, data. You can still boot off the other drive that way; you don't have to do anything to get it to recognize the name change.)

sudo mount /dev/datab0/root /media/b0
sudo cp -ra /media/b0/home/bshanks /home/bshanks-real

Now reboot onto the backup drive, then:

sudo mkdir /media/a0
sudo cryptsetup open /dev/nvme0n1p3
sudo mount /dev/data/root /media/a0
sudo mv /media/a0/home/bshanks /media/a0/home/bshanks-empty
sudo mv /media/a0/home/bshanks-real /media/a0/home/bshanks

Now restart into the new drive again.

sudo apt install fasd emacs xterm neomutt wmctrl apt-file python3 ipython3 notmuch python3-notmuch python3-googleapi python3-oauth2client python3-tqdm plocate python3-pip gimp imagemagick urlview postfix qiv python-is-python3 pwgen vlc hugo zathura default-jre screen tmux gparted ncdu

cp -ra ~/.local/lib/python3.9/site-packages/lieer NEW

resync email (cd ~/Mail/bsgmail; gmi sync; cd ~/Mail/bscovgmail; gmi sync;)

cp -ra old /etc/postfix dir

accept the defaults for the configuration options when installing postfix. mount /media/b0 as before, then:

sudo mv /etc/postfix /etc/postfix-empty
sudo cp -ra /media/b0/etc/postfix /etc/

sudo apt install libpango1.0-0 # libpango1.0-0 needed for dropbox
sudo dpkg -i dropbox_2020.03.04_amd64.deb

note: use that, or download a newer version from the dropbox website, rather than installing nautilus-dropbox

mv Dropbox Dropbox-old. Now install Dropbox as above (do the above steps only after doing this mv) and then when the web page pops up asking you if you want to connect dropbox, do.

upon upgrading, maybe something like:

pip install google-api-python-client notmuch oauth2client tqdm
cp -ar ~/.local/lib/python3.9/site-packages/lieer ~/.local/lib/python3.10/site-packages/

Reboot

test if email sending works

---