Hard Drive Test in Linux

Posted: Sep 29, 2015 by Bryan Tong

Ever wanted to see how well your hard drive actually performs? Do you burn in your hard drives before putting servers / computers into production? Here are some tips and some quick commands to help ensure your disks are operating properly.

Using dd

Read testing

A full read pass is practical if you have a small and quick drive. I do this as part of a burn-in, especially on the OS drive, before releasing a server into production. Initially I just want to read all the way across the disk so that if there are any bad blocks, we blow up now and not later.

$ dd if=/dev/sda of=/dev/null bs=1M

This assumes your drive is /dev/sda. If you are not sure where your disk is, try fdisk -l or ls -l /dev/sd*

This command will safely read the entire contents of the drive (including partition tables) and discard them by sending them to /dev/null. If this command completes without error, we can be reasonably confident that the drive is readable from end to end.
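The "completes without error" check can be scripted by looking at dd's exit status. The following is only a minimal sketch: read_scan is a hypothetical helper name, not part of dd, and it assumes a Linux dd that accepts bs=1M.

```shell
# Hypothetical helper: read every block of $1 (a device such as /dev/sda,
# or any regular file) and report whether dd finished without an error.
# Reading is non-destructive; dd's own statistics are discarded here.
read_scan() {
    if dd if="$1" of=/dev/null bs=1M 2>/dev/null; then
        echo "read OK: $1"
    else
        echo "read FAILED: $1"
        return 1
    fi
}
```

Run as root it would be called as read_scan /dev/sda. A failure only says that dd hit an error, so the kernel log is still the place to look for the reason.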

Here is the result on a 32GB Samsung mSATA drive.

# dd if=/dev/sda of=/dev/null bs=1M
30533+1 records in
30533+1 records out
32017047552 bytes (32 GB) copied, 84.5127 s, 379 MB/s

A nice side effect of this test is that it also gives you a fair benchmark of the read speed. (However, if you are benchmarking, I would recommend adding iflag=direct so that the reads bypass the Linux page cache.)
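The speed dd reports is simply bytes divided by seconds, with MB meaning 10^6 bytes. As a rough sketch, the figure can be recomputed from the summary line. dd_mbps is a hypothetical helper name, and it assumes the GNU dd summary format shown above.

```shell
# Hypothetical helper: recompute throughput in MB/s (10^6 bytes per second)
# from a GNU dd summary line read on stdin, for example:
#   32017047552 bytes (32 GB) copied, 84.5127 s, 379 MB/s
# Field 1 is the byte count; the elapsed seconds are the 4th field from the end.
dd_mbps() {
    awk '/ bytes .* copied, / { printf "%.1f\n", $1 / $(NF-3) / 1000000 }'
}
```

dd writes its summary to stderr, so a live run would look like dd if=/dev/sda of=/dev/null bs=1M 2>&1 | dd_mbps. For the result above it prints 378.8, which dd rounds to 379 MB/s.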

Write Testing

We have to be a lot more careful with write testing as there are a lot of factors going into making sure the disk is actually being written to.

Depending on file size limitations I use a combination of tools to test writing.

$ cd /root
$ dd if=/dev/zero of=testfile bs=1M count=10k oflag=direct

Here I am reading from /dev/zero, which produces only zeros, and writing them to /root/testfile (this path needs to be somewhere on the disk you want to test). I use a block size (bs) of 1 MiB and write 10k blocks; dd treats k as 1024, so that is 10,240 blocks, or 10 GiB in total. Finally, I add oflag=direct to tell dd to bypass the Linux page cache, so the data actually goes to the disk instead of sitting in memory. Make sure that there is enough free space to write the file before continuing.
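The free space check can be done by hand with df, or scripted. This is a small sketch under the assumption that a POSIX df is available; has_free_mib is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1 has
# at least $2 MiB available. With POSIX "df -Pk", line 2 column 4 is the
# available space in 1 KiB blocks.
has_free_mib() {
    [ "$(df -Pk "$1" | awk 'NR == 2 { print $4 }')" -ge "$(( $2 * 1024 ))" ]
}

# Example: only start the 10 GiB write test if it fits.
# has_free_mib /root 10240 && dd if=/dev/zero of=/root/testfile bs=1M count=10k oflag=direct
```

Leaving some headroom above the test size is sensible, since filling the OS drive completely can cause problems of its own.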

Here is the result on the same mSATA disk, this time from a smaller run (count=2k, or 2 GiB).

# dd if=/dev/zero of=testfile bs=1M count=2k oflag=direct
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 24.5838 s, 87.4 MB/s

Note that this also gives you an idea of the write performance of the disk. Remove testfile once you are done.

More Write Testing with Bonnie++

On Debian I install Bonnie++ first.

$ apt-get -y install bonnie++

Next I run a basic test.

$ cd /root
$ mkdir bonnie
$ bonnie++ -d /root/bonnie -u root

If you are running as root, you must add -u root or Bonnie++ will complain. If you are another user, this flag can be omitted.

Bonnie++ will read from and write to the disk with intelligent patterns, and again, if there are any problems, they should appear now and not when the server is in production.

Checking Disk Health with dmesg

Now that we have tested both reading and writing, it is time to see if the disk we have been testing has any errors. The first place I check is dmesg.

$ dmesg | tail

If you see anything like this:

end_request: I/O error, dev sda, sector 63

it tells you that the disk most likely has a bad sector.

Or you might see something like this:

[  681.472852] ata1.00: failed command: READ DMA EXT
[  681.472856] ata1.00: cmd 25/00:00:f8:eb:bd/00:01:1d:00:00/e0 tag 0 dma 131072 in
[  681.472856]          res 51/84:b0:48:ec:bd/84:00:1d:00:00/e0 Emask 0x70 (host bus error)
[  681.472858] ata1.00: status: { DRDY ERR }
[  681.472859] ata1.00: error: { ICRC ABRT }
[  681.472866] ata1.00: hard resetting link
[  681.791147] ata1.01: hard resetting link
[  682.818130] ata1.01: failed to resume link (SControl 0)
[  682.974052] ata1.00: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[  682.974067] ata1.01: SATA link down (SStatus 0 SControl 0)
[  682.998511] ata1.00: configured for UDMA/33
[  682.998861] ata1: EH complete
[  683.215898] ata1.00: exception Emask 0x50 SAct 0x0 SErr 0x280900 action 0x6
[  683.215901] ata1.00: BMDMA stat 0x26
[  683.215904] ata1.00: SError: { UnrecovData HostInt 10B8B BadCRC }
[  683.215906] ata1.00: failed command: READ DMA EXT
[  683.215909] ata1.00: cmd 25/00:88:a0:16:cb/00:00:1d:00:00/e0 tag 0 dma 69632 in
[  683.215909]          res 51/84:38:f0:16:cb/84:00:1d:00:00/e0 Emask 0x70 (host bus error)
[  683.215911] ata1.00: status: { DRDY ERR }
[  683.215912] ata1.00: error: { ICRC ABRT }
[  683.215918] ata1.00: hard resetting link

These messages tell you that the system is having trouble communicating with the drive. In my case, with my mSATA drives, I see these when the adapter card isn't seated properly on the motherboard. However, these problems can also occur with faulty drives. Either way, I don't recommend putting servers into production if any of these errors are present.

If you have a very long dmesg I recommend using grep to filter for errors.

$ dmesg | grep -i ata
$ dmesg | grep -i sector

In my case, the server I was testing on did have a problem, which can be seen in these messages:

[ 1676.371513] ata1.00: exception Emask 0x10 SAct 0x783fffff SErr 0x400000 action 0x6 frozen
[ 1676.371597] ata1.00: irq_stat 0x08000000, interface fatal error
[ 1676.371666] ata1: SError: { Handshk }
[ 1676.371727] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1676.371797] ata1.00: cmd 61/00:00:00:30:35/04:00:01:00:00/40 tag 0 ncq 524288 out
[ 1676.371798]          res 40/00:a8:00:c4:35/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
[ 1676.371946] ata1.00: status: { DRDY }
[ 1676.372007] ata1.00: failed command: WRITE FPDMA QUEUED
[ 1676.372077] ata1.00: cmd 61/00:08:00:34:35/04:00:01:00:00/40 tag 1 ncq 524288 out
[ 1676.372078]          res 40/00:a8:00:c4:35/00:00:01:00:00/40 Emask 0x10 (ATA bus error)
[ 1676.372329] ata1.00: status: { DRDY }
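
The kinds of messages shown in this section can be collected into a single filter. This is only a sketch: disk_errors is a hypothetical helper name, and the pattern list is taken from the excerpts above, so it is an assumption about what matters and not an exhaustive list of kernel disk errors.

```shell
# Hypothetical helper: keep only kernel log lines that look like the disk
# and link errors quoted above. Reads log text on stdin.
disk_errors() {
    grep -iE 'I/O error|failed command|hard resetting link|exception Emask|SError|bus error'
}
```

Typical use would be dmesg | disk_errors. No output means none of these patterns appeared, which is a good sign but not proof that the disk is healthy.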

Checking Disk Health with SMART

Another very useful tool for checking whether your disk is healthy is the SMART tools (smartmontools), which are available on almost all distributions of Linux.

On Debian they need to be installed.

$ apt-get -y install smartmontools

Once we are done with that we can check our disk out.

$ smartctl -a /dev/sda

I will post my output and then go over some of the attributes.

# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Device Model:     SAMSUNG MZMPC032HBCD-000H1
Serial Number:    S0Y6NSAC584XXX
LU WWN Device Id: 5 002538 043584XXX
Firmware Version: CXM12H1Q
User Capacity:    32,017,047,552 bytes [32.0 GB]
Sector Size:      512 bytes logical/physical
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ACS-2 revision 2
Local Time is:    Tue Sep 29 12:54:47 2015 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  180) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (   3) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x002f   200   200   002    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1132
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       798
170 Unknown_Attribute       0x0013   086   086   010    Pre-fail  Always       -       352
171 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
173 Unknown_Attribute       0x0013   096   096   017    Pre-fail  Always       -       134
174 Unknown_Attribute       0x0032   099   099   000    Old_age   Always       -       14
183 Runtime_Bad_Block       0x0032   099   099   001    Old_age   Always       -       2
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   033   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0002   253   253   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short captive       Completed without error       00%       666         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The first thing I check is Raw_Read_Error_Rate: its raw value (the last column) should be 0.

Next I check Reallocated_Sector_Ct and also make sure it is 0.

I also check Runtime_Bad_Block and make sure it is 0 (in this case it wasn't, as this server is having link errors unrelated to the drive).

Offline_Uncorrectable should also be 0.

UDMA_CRC_Error_Count should be 0 as well (here it is 4, which fits the link errors mentioned above).

Most of the other attributes can be ignored when checking whether your drive is truly malfunctioning. In fact, some of these counters can be above 0 on a healthy drive. However, if the drive is new (which can be gauged by looking at Power_On_Hours) and any of the counters I mentioned are increasing, there is reason for concern.
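
Checking those five counters can be automated by filtering the attribute table. This is a rough sketch: smart_watch is a hypothetical helper name, and it assumes the attribute lines have the ten-column layout shown in the output above, with the raw value in the last column.

```shell
# Hypothetical helper: read smartctl attribute lines on stdin and print the
# name and raw value of each watched counter whose raw value is not 0.
# Column 2 is the attribute name, column 10 is the raw value.
smart_watch() {
    awk '$2 ~ /^(Raw_Read_Error_Rate|Reallocated_Sector_Ct|Runtime_Bad_Block|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 != 0 { print $2, $10 }'
}
```

It would be used as smartctl -A /dev/sda | smart_watch. On the output above it prints Runtime_Bad_Block 2 and UDMA_CRC_Error_Count 4. Running it again after more testing shows whether any counter is still increasing.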

Well, that is it! I have gone over several methods to check and confirm drive health on servers and desktops before depending on the storage in the machine.