@tommybutler
Last active June 3, 2024 09:02
Script to quickly scan the S.M.A.R.T. health status of all of your hard drives in Linux (at least all the ones from /dev/sda to /dev/sdzz). You need smartctl (from the smartmontools package) installed on your system for this script to work, and your hard drives need to have S.M.A.R.T. capability (they probably do).
#!/bin/bash

# install the smartmontools package first! (apt-get install smartmontools)

# make sure we can elevate to root; smartctl needs it to read drive health
if ! sudo true; then
   echo 'Root privileges required'
   exit 1
fi

# check every drive from /dev/sda through /dev/sdzz; globs that matched no
# actual device node are skipped
for drive in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
   [[ -e $drive ]] || continue

   echo -n "$drive "

   # grab the overall-health verdict (e.g. PASSED) from smartctl's report
   smart=$(
      sudo smartctl -H "$drive" 2>/dev/null |
         grep '^SMART overall' |
         awk '{ print $6 }'
   )

   # drives without SMART (and SAS drives, which report their health on a
   # different line) won't produce a match
   [[ -z "$smart" ]] && smart='unavailable'

   echo "$smart"
done
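
For reference, a typical run might look like this (assuming the script is saved as smart-scan.sh; the results shown are illustrative, not from a real machine):

chmod +x smart-scan.sh
./smart-scan.sh
/dev/sda PASSED
/dev/sdb PASSED
/dev/sdc unavailable
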
@Jolly-Pirate

Modify the for loop to include nvme:
for drive in /dev/sd[a-z] /dev/sd[a-z][a-z] /dev/nvme[0-9]n[0-9]

@tommybutler (Author)

tommybutler commented Jan 26, 2023 via email

@BloodBlight

If anyone is interested, I ended up taking this a bit further. It lists all disks, supports outputting to JSON (I have a crontab job that sends the results to node-red and does alerting), and also supports showing the remaining life of a lot of SSDs (mostly enterprise):
https://github.com/BloodBlight/CephNotes/blob/main/SmartHealth

There is also a ListDisks script that shows a lot of detail in one quick run:
https://github.com/BloodBlight/CephNotes/blob/main/ListDisks
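
For anyone wiring something similar into alerting, a minimal sketch of the crontab half, assuming the script is installed at /usr/local/bin/SmartHealth, that its JSON output flag is --json (both placeholders, not confirmed from BloodBlight's setup), and that node-red listens on a hypothetical HTTP-in endpoint:

# run hourly; POST the JSON report to node-red for alerting
0 * * * * /usr/local/bin/SmartHealth --json | curl -s -X POST -H 'Content-Type: application/json' --data-binary @- http://node-red.local:1880/smart
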

@rickygm

rickygm commented Jan 30, 2023

Thanks, BloodBlight

@eightseventhreethree

Why not just use lsblk?

for disk in $(lsblk --json | jq -r '.blockdevices[].name'); do smartctl --all "/dev/${disk}"; done
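
One caveat with that approach: lsblk's top level also includes loop devices and other non-disks, and smartctl generally needs root. A sketch that filters to whole disks only, using lsblk's -d (no children) and -n (no header) flags:

for disk in $(lsblk -dn -o NAME,TYPE | awk '$2 == "disk" { print $1 }'); do
   sudo smartctl --all "/dev/${disk}"
done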

@tommybutler (Author)

tommybutler commented Nov 4, 2023 via email

@laurus-lx

FYI, smartctl -H might not catch a failing disk:

Just tried it on an HGST disk that reported an OK status:

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

but running the extended report with smartctl -a shows that it's failing:

Elements in grown defect list: 20
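
For anyone who wants to sweep for that specific symptom across a whole box, a quick variant of the gist's loop that greps smartctl -a for the grown defect list line (SAS/SCSI drives report it; ATA drives won't):

for drive in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
   [[ -e $drive ]] || continue
   defects=$(sudo smartctl -a "$drive" 2>/dev/null | grep 'grown defect list')
   [[ -n "$defects" ]] && echo "$drive: $defects"
done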

@BloodBlight

This script only returns (or should) whatever smartctl reports for the final status. I checked one of my HGST disks in Scrutiny and it shows that as being a "low" indicator of disk failure (3%).

Link:
https://imgur.com/a/rWQbCAb

Also, Scrutiny (https://github.com/AnalogJ/scrutiny) is amazing and I am moving away from this script to it. Though I keep this around for quick checks on boxes that aren't part of my Scrutiny... "system".

Do you want to post the larger output? We can at least confirm that smartctl thinks it is good. If you have any "pending" remaps, writing zeros to a file until the disk fills can help clear that.
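
For the zero-fill trick, a minimal sketch, assuming the drive's filesystem is mounted at /mnt/disk (a placeholder path); dd stops on its own when the filesystem fills, and the file must be removed afterwards:

# fill free space with zeros so the drive can resolve pending remaps
sudo dd if=/dev/zero of=/mnt/disk/zerofill bs=1M status=progress
sudo rm /mnt/disk/zerofill
sync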

@laurus-lx

Thanks for the heads up!

That Scrutiny utility looks awesome. I'll try it out. Learned something new today.

I just scanned 28 drives I had in a JBOD array, and saw 4 that had elements in the grown defect list (41, 11, 11, 20).
One drive with 11 elements in the grown defect list also had 3 uncorrected errors. Another drive, with 20 elements in the grown defect list, had 1 uncorrected error.

Here's the report for the drive with 3 errors. I'll be keeping an eye on it:

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS724030ALS640
Revision:             A1C4
Compliance:           SPC-4
User Capacity:        3,000,592,982,016 bytes [3.00 TB]
Logical block size:   512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      xxxxxxxxxx
Serial number:        xxxxxxxxxx
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Wed May 22 21:33:06 2024 EDT
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     29 C
Drive Trip Temperature:        85 C

Accumulated power on time, hours:minutes 48860:30
Manufactured in week 08 of year 2014
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  56
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2089
Elements in grown defect list: 11

Vendor (Seagate Cache) information
  Blocks sent to initiator = 6241260582993920

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6403595     2448         0   6406043    6203534      64830.642           3
write:         0        0         0         0     155897      65274.632           0
verify:        0        0         0         0      27257          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -   44297                 - [-   -    -]

Long (extended) Self-test duration: 29637 seconds [8.2 hours]

@BloodBlight

Ya, the uncorrected errors would definitely land that disk on my "SUS" list! Even the volume of fast ECC corrections would get it a "warn" from me. Sometimes uncorrected errors just happen though, and that isn't a big number. But I can totally see wanting that out of your array!

If it is a performance-sensitive production environment, I would 100% yank that drive just because. Those retries can cause odd performance issues for customers that are almost impossible to pinpoint as the cause.

I do have some HGSTs that are 8 years old now and still going strong, some with almost as many fast ECC errors, but others that I have evicted just because some of those metrics were increasing at an unhealthy rate. I have the luxury of having 4 parity disks, though. With that plus regular scrubbing, I might keep that one in my cluster unless it got worse. If I didn't have that, I would not trust it. But I am a cheapskate when it comes to my home lab!
