#!/bin/bash
# install the smartmontools package first! (apt-get install smartmontools)

if sudo true
then
    true
else
    echo 'Root privileges required'
    exit 1
fi

for drive in /dev/sd[a-z] /dev/sd[a-z][a-z]
do
    if [[ ! -e $drive ]]; then continue ; fi

    echo -n "$drive "

    smart=$(
        sudo smartctl -H $drive 2>/dev/null |
        grep '^SMART overall' |
        awk '{ print $6 }'
    )

    [[ "$smart" == "" ]] && smart='unavailable'

    echo "$smart"
done
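For reference, on a typical ATA/SATA drive the line being parsed looks like the following, and awk's sixth whitespace-separated field is PASSED or FAILED (an illustrative line, not output from any drive discussed here):

SMART overall-health self-assessment test result: PASSED

Note that SAS drives (like the HGST report further down) print "SMART Health Status: OK" instead, which grep '^SMART overall' will not match, so for those the script falls back to "unavailable".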
Why not just use lsblk?

for disk in $(lsblk --json | jq -r '.blockdevices[].name'); do smartctl --all /dev/${disk}; done
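A slightly safer variant of that one-liner (a sketch, assuming jq and smartmontools are installed) restricts the loop to whole disks so partitions and loop devices are skipped:

# Only iterate over devices whose lsblk type is "disk".
for disk in $(lsblk --json --nodeps -o NAME,TYPE | jq -r '.blockdevices[] | select(.type == "disk") | .name'); do
    sudo smartctl --all "/dev/${disk}"
done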
FYI, smartctl -H might not catch a failing disk. I just tried it on an HGST disk that reported an OK status:
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
but the extended report from smartctl -a shows that it's failing:
Elements in grown defect list: 20
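A rough way to have a script surface that as well (a sketch assuming the SAS/SCSI report format shown above; /dev/sdX is a placeholder):

# Pull the grown defect count out of the full report; non-empty and non-zero is worth a look.
defects=$(sudo smartctl -a /dev/sdX | awk -F': *' '/Elements in grown defect list/ { print $2 }')
[[ -n "$defects" && "$defects" -gt 0 ]] && echo "/dev/sdX: $defects grown defects"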
This script only returns (or should return) whatever smartctl reports as the final status. I checked one of my HGST disks in Scrutiny and it shows that as a "low" indicator of disk failure (3%).
Link:
https://imgur.com/a/rWQbCAb
Also, Scrutiny (https://github.com/AnalogJ/scrutiny) is amazing and I am moving away from this script to it. Though I keep this around for quick checks on boxes that aren't part of my Scrutiny... "system".
Do you want to post the larger output? We can at least confirm that smartctl thinks it is good. If you have any "pending" remaps, writing zeros to a file until the disk fills can help clear that.
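For the zero-fill trick, something like this is the usual approach (a sketch; /mnt/data is a hypothetical mount point on the suspect disk, and dd stops on its own with "No space left on device"):

dd if=/dev/zero of=/mnt/data/zerofill.tmp bs=1M status=progress
sync
rm /mnt/data/zerofill.tmp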
Thanks for the heads up!
That scrutiny utility looks awesome. Will try it out. Learned something new today.
I just scanned 28 drives I had in a JBOD array, and saw 4 that had elements in the grown defect list (41, 11, 11, 20).
One drive with 11 elements in the grown defect list also had 3 uncorrected errors. Another drive with 20 elements in the grown defect list had 1 uncorrected error.
Here's a report for the drive with 3 errors. Will be keeping an eye on it:
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.5.13-5-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: HGST
Product: HUS724030ALS640
Revision: A1C4
Compliance: SPC-4
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Logical block size: 512 bytes
LU is resource provisioned, LBPRZ=0
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: xxxxxxxxxx
Serial number: xxxxxxxxxx
Device type: disk
Transport protocol: SAS (SPL-4)
Local Time is: Wed May 22 21:33:06 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 29 C
Drive Trip Temperature: 85 C
Accumulated power on time, hours:minutes 48860:30
Manufactured in week 08 of year 2014
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 56
Specified load-unload count over device lifetime: 600000
Accumulated load-unload cycles: 2089
Elements in grown defect list: 11
Vendor (Seagate Cache) information
Blocks sent to initiator = 6241260582993920
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/      errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    6403595     2448         0   6406043    6203534      64830.642           3
write:         0        0         0         0     155897      65274.632           0
verify:        0        0         0         0      27257          0.000           0
Non-medium error count: 0
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                   -     44297                 - [-   -    -]
Long (extended) Self-test duration: 29637 seconds [8.2 hours]
Ya, the uncorrected errors would definitely land that disk on my "SUS" list! Even the volume of fast ECC corrections would get it a "warn" from me. Sometimes uncorrected errors just happen though, and that isn't a big number. But I can totally see wanting that out of your array!
If it is a performance sensitive production environment, I would 100% yank that drive just because. Those retries can cause odd performance issues for customers that are almost impossible to pinpoint as the cause.
I do have some HGSTs that are 8 years old now and still going strong, some with almost as many fast ECC errors, and others that I have evicted just because some of those metrics were increasing at an unhealthy rate. I have the luxury of having 4 parity disks, though. With that plus regular scrubbing, I might keep that one in my cluster unless it got worse. If I didn't have that, I would not trust it. But I am a cheapskate when it comes to my home lab!
If anyone is interested, I ended up taking this a bit further. It lists all disks, supports outputting to JSON (I have a crontab job that sends the results to node-red and does alerting), and also supports showing the remaining life of a lot of SSDs (mostly enterprise):
https://github.com/BloodBlight/CephNotes/blob/main/SmartHealth
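For the crontab-to-node-red part, a hypothetical setup could look like the following (the script path, the JSON flag, and the node-red URL are all assumptions, not taken from the repo):

# Run nightly and POST the JSON results to a node-red http-in endpoint for alerting.
0 6 * * * /usr/local/bin/SmartHealth --json | curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- http://node-red.local:1880/smart-health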
There is also a ListDisks script that shows a lot of detail in one quick run:
https://github.com/BloodBlight/CephNotes/blob/main/ListDisks