#!/bin/bash
# install the smartmontools package first! (apt-get install smartmontools)

if sudo true
then
    true
else
    echo 'Root privileges required'
    exit 1
fi

for drive in /dev/sd[a-z] /dev/sd[a-z][a-z]
do
    if [[ ! -e $drive ]]; then continue ; fi

    echo -n "$drive "

    smart=$(
        sudo smartctl -H $drive 2>/dev/null |
        grep '^SMART overall' |
        awk '{ print $6 }'
    )

    [[ "$smart" == "" ]] && smart='unavailable'

    echo "$smart"
done
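For reference, on a typical ATA/SATA drive the line being parsed looks like the following, and awk's sixth whitespace-separated field is PASSED or FAILED (an illustrative line, not output from any drive discussed here):

SMART overall-health self-assessment test result: PASSED

Note that SAS drives (like the HGST report further down) print "SMART Health Status: OK" instead, which grep '^SMART overall' will not match, so for those the script falls back to "unavailable".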
Why not just use lsblk?

for disk in $(lsblk --json | jq -r '.blockdevices[].name'); do smartctl --all /dev/${disk}; done
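A slightly safer variant of that one-liner (a sketch, assuming jq and smartmontools are installed) restricts the loop to whole disks so partitions and loop devices are skipped:

# Only iterate over devices whose lsblk type is "disk".
for disk in $(lsblk --json --nodeps -o NAME,TYPE | jq -r '.blockdevices[] | select(.type == "disk") | .name'); do
    sudo smartctl --all "/dev/${disk}"
done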
FYI, smartctl -H might not catch a failing disk. I just tried it on an HGST disk that reported an OK status:
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
but the extended report from smartctl -a shows that it's failing:
Elements in grown defect list: 20
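A rough way to have a script surface that as well (a sketch assuming the SAS/SCSI report format shown above; /dev/sdX is a placeholder):

# Pull the grown defect count out of the full report; non-empty and non-zero is worth a look.
defects=$(sudo smartctl -a /dev/sdX | awk -F': *' '/Elements in grown defect list/ { print $2 }')
[[ -n "$defects" && "$defects" -gt 0 ]] && echo "/dev/sdX: $defects grown defects"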
This script only returns (or should return) whatever smartctl reports as the final status. I checked one of my HGST disks in Scrutiny and it shows that as a "low" indicator of disk failure (3%).
Link:
https://imgur.com/a/rWQbCAb
Also, Scrutiny (https://github.com/AnalogJ/scrutiny) is amazing and I am moving away from this script to it. Though I keep this around for quick checks on boxes that aren't part of my Scrutiny... "system".
Do you want to post the larger output? We can at least confirm that smartctl thinks it is good. If you have any "pending" remaps, writing zeros to a file until the disk fills can help clear that.
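For the zero-fill trick, something like this is the usual approach (a sketch; /mnt/data is a hypothetical mount point on the suspect disk, and dd stops on its own with "No space left on device"):

dd if=/dev/zero of=/mnt/data/zerofill.tmp bs=1M status=progress
sync
rm /mnt/data/zerofill.tmp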
Thanks for the heads up!
That scrutiny utility looks awesome. Will try it out. Learned something new today.
I just scanned 28 drives I had in a JBOD array, and saw 4 that had elements in the grown defect list (41, 11, 11, 20).
One drive with 11 elements in the grown defect list also had 3 uncorrected errors. Another drive with 20 elements in the grown defect list had 1 uncorrected error.
Here's a report for the drive with 3 errors. Will be keeping an eye on it:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS724030ALS640
Revision: A1C4
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: xxxxxxxxxx
Serial number: xxxxxxxxxx
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Wed May 22 21:33:06 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 29 C
Drive Trip Temperature: 85 C
Accumulated power on time, hours:minutes 48860:30
Manufactured in week 08 of year 2014
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 56
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 2089
Elements in grown defect list: 11
Vendor (Seagate Cache) information
Blocks sent to initiator = 6241260582993920
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6403595     2448         0   6406043    6203534      64830.642           3
write:         0        0         0         0     155897      65274.632           0
verify:        0        0         0         0      27257          0.000           0
Non-medium error count: 0
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     44297                 - [-   -    -]
Long (extended) Self-test duration: 29637 seconds [8.2 hours]
Ya, the uncorrected errors would definitely land that disk on my "SUS" list! Even the volume of fast ECC corrections would get it a "warn" from me. Sometimes uncorrected errors just happen though, and that isn't a big number. But I can totally see wanting that out of your array!
If it is a performance sensitive production environment, I would 100% yank that drive just because. Those retries can cause odd performance issues for customers that are almost impossible to pinpoint as the cause.
I do have some HGSTs that are 8 years old now and still going strong, some with almost as many fast ECC errors, and others that I have evicted just because some of those metrics were increasing at an unhealthy rate. I have the luxury of having 4 parity disks, though. With that plus regular scrubbing, I might keep that one in my cluster unless it got worse. If I didn't have that, I would not trust it. But I am a cheapskate when it comes to my home lab!
If anyone is interested, I ended up taking this a bit further. It lists all disks, supports outputting to JSON (I have a crontab job that sends the results to node-red and does alerting), and also supports showing the remaining life of a lot of SSDs (mostly enterprise):
https://github.com/BloodBlight/CephNotes/blob/main/SmartHealth
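For the crontab-to-node-red part, a hypothetical setup could look like the following (the script path, the JSON flag, and the node-red URL are all assumptions, not taken from the repo):

# Run nightly and POST the JSON results to a node-red http-in endpoint for alerting.
0 6 * * * /usr/local/bin/SmartHealth --json | curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- http://node-red.local:1880/smart-health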
There is also a ListDisks script that shows a lot of detail in one quick run:
https://github.com/BloodBlight/CephNotes/blob/main/ListDisks