
Locating disk fault information with storcli64 and smartctl

程序员文章站 2022-04-19 13:18:07
How to locate a disk's physical slot and its system device name

from lin.wang

section one : introduction

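storcli is the successor to megacli. On Dell servers the equivalent tool is perccli, and its usage is identical.

smartctl reads the SMART information kept by the disk's controller chip.

lsscsi shows the system's SCSI information, taken from sources such as /proc/scsi/scsi. It is not covered in this document.

All of these are common tools for inspecting disks, and they help when troubleshooting disk state and RAID card problems.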

section two : install package

Install storcli or perccli, then symlink the binary into /usr/bin/ so the command is easy to call:

ln -s /opt/MegaRAID/storcli/storcli64 /usr/bin/

ln -s /opt/MegaRAID/perccli/perccli64 /usr/bin/

section three : step

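To go from the system device name /dev/sdf to the physical drive slot, proceed as follows:

  1. perccli64 /c0/eall/sall show lists the physical drives:

    img-/c0/eall/sall

    The figure shows four JBOD drives. From experience, JBOD drives are usually enumerated by the OS before RAID virtual drives, so the JBOD drives occupy /dev/sda through /dev/sdd and the RAID virtual drives start at /dev/sde;

    dg stands for drive group, the order in which groups were created when RAID was configured. The figure shows that 32:4 and 32:5 belong to the same drive group.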

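  2. perccli64 /c0/vall show gives the mapping between dg and vd:

    img-/c0/vall

    The figure shows that dg/vd ties each RAID drive group to the order of the volumes seen by the OS. If the server has only RAID volumes, vd0 is /dev/sda in the OS, and so on. If the server also has JBOD drives, the RAID volumes are enumerated after them. In this example vd0=/dev/sde, so /dev/sdf is vd=1, which corresponds to dg=1;

    Back in img-/c0/eall/sall, the drive with dg 1 has did=6. did is the device id, which is needed again later. Its slot no. is slt = 6, which is the seventh bay on the server (slots are counted from 0, so 0 through 6). That is the physical slot of /dev/sdf.

Conversely, starting from a drive whose fault LED is lit on the server, the same mapping gives the system device name.

note:

If the server has no JBOD drives and everything is RAID, the mapping shown by /c0/vall is enough to establish the relationship.

In practice you can also run perccli64 /c0/e32/s6 start locate and perccli64 /c0/e32/s6 stop locate to turn the drive LED on and off and confirm that the slot is correct.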
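The arithmetic behind the steps above can be sketched in a few lines of Python. This is only an illustration under the article's rule of thumb (JBOD drives are enumerated first, then virtual drives in vd order); the function names `disk_index` and `vd_for_device` are hypothetical and not part of perccli or storcli.

```python
import re


def disk_index(device: str) -> int:
    """Return the 0-based enumeration index of an sdX name (sda -> 0, sdf -> 5, sdaa -> 26)."""
    match = re.fullmatch(r"(?:/dev/)?sd([a-z]+)", device)
    if match is None:
        raise ValueError(f"not an sdX device name: {device!r}")
    index = 0
    for ch in match.group(1):
        # Letters behave like base-26 digits where 'a' counts as 1.
        index = index * 26 + (ord(ch) - ord("a") + 1)
    return index - 1


def vd_for_device(device: str, jbod_count: int) -> int:
    """Assumption: JBOD drives come first, so vd = device index - number of JBOD drives."""
    vd = disk_index(device) - jbod_count
    if vd < 0:
        raise ValueError(f"{device} falls in the JBOD range, it is not a virtual drive")
    return vd


# Four JBOD drives occupy sda..sdd, so /dev/sdf is the second virtual drive.
print(vd_for_device("/dev/sdf", jbod_count=4))
```

The resulting vd number is then looked up in the dg/vd table to get dg, and dg is looked up in the drive list to get the slot. The enumeration order is a rule of thumb, so the locate LED check remains the final confirmation.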

section four : storcli/perccli usage

Show controller information

perccli64 show ctrlcount: show how many controllers (RAID cards) are present

perccli64 show: show RAID card information

[root@node-15 ~]# perccli64 show
status code = 0
status = success
description = none

number of controllers = 1
host name = node-15.domain.tld
operating system  = linux3.10.0-327.20.1.es2.el7.x86_64

system overview :
===============

------------------------------------------------------------------------
ctl model        ports pds dgs dnopt vds vnopt bbu spr ds ehs asos hlth 
------------------------------------------------------------------------
  0 perch730mini     8  16  11     0  11     0 opt on  3  n      0 opt  
------------------------------------------------------------------------

ctl=controller index|dgs=drive groups|vds=virtual drives|fld=failed
pds=physical drives|dnopt=dg notoptimal|vnopt=vd notoptimal|opt=optimal
msng=missing|dgd=degraded|ndatn=need attention|unkwn=unknown
spr=scheduled patrol read|ds=dimmerswitch|ehs=emergency hot spare
y=yes|n=no|asos=advanced software options|bbu=battery backup unit
hlth=health|safe=safe-mode boot

There is only one RAID card: ctl 0, which is addressed as /c0.

storcli64 /c0 show

[root@node-15 ~]# perccli64 /c0 show
generating detailed summary of the adapter, it may take a while to complete.

controller = 0
status = success
description = none

product name = perc h730 mini
serial number = 663021z
sas address =  51866da066153000
pci address = 00:03:00:00
system time = 01/10/2019 20:48:38
mfg. date = 06/17/16
controller time = 01/10/2019 12:44:21
fw package build = 25.4.0.0017
bios version = 6.29.00.0_4.16.07.00_0x06120100
fw version = 4.260.00-6259
driver name = megaraid_sas
driver version = 06.807.10.00-rh1
current personality = raid-mode
vendor id = 0x1000
device id = 0x5d
subvendor id = 0x1028
subdevice id = 0x1f49
host interface = pci-e
device interface = sas-12g
bus number = 3
device number = 0
function number = 0
drive groups = 11

topology :
========

---------------------------------------------------------------------------
dg arr row eid:slot did type  state bt     size pdc  pi sed ds3  fspace tr 
---------------------------------------------------------------------------
 0 -   -   -        -   raid1 optl  n  931.0 gb dflt n  n   dflt n      n  
 0 0   -   -        -   raid1 optl  n  931.0 gb dflt n  n   dflt n      n  
 0 0   0   32:4     4   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 0 0   1   32:5     5   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 1 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 1 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 1 0   0   32:6     6   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 2 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 2 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 2 0   0   32:7     7   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 3 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 3 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 3 0   0   32:8     8   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 4 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 4 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 4 0   0   32:9     9   drive onln  n  931.0 gb dflt n  n   dflt -      n  
 5 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 5 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 5 0   0   32:10    10  drive onln  n  931.0 gb dflt n  n   dflt -      n  
 6 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 6 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 6 0   0   32:11    11  drive onln  n  931.0 gb dflt n  n   dflt -      n  
 7 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 7 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 7 0   0   32:12    12  drive onln  n  931.0 gb dflt n  n   dflt -      n  
 8 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 8 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 8 0   0   32:13    13  drive onln  n  931.0 gb dflt n  n   dflt -      n  
 9 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 9 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
 9 0   0   32:14    14  drive onln  n  931.0 gb dflt n  n   dflt -      n  
10 -   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
10 0   -   -        -   raid0 optl  n  931.0 gb dflt n  n   dflt n      n  
10 0   0   32:15    15  drive onln  n  931.0 gb dflt n  n   dflt -      n  
---------------------------------------------------------------------------

dg=disk group index|arr=array index|row=row index|eid=enclosure device id
did=device id|type=drive type|onln=online|rbld=rebuild|dgrd=degraded
pdgd=partially degraded|offln=offline|bt=background task active
pdc=pd cache|pi=protection info|sed=self encrypting drive|frgn=foreign
ds3=dimmer switch 3|dflt=default|msng=missing|fspace=free space present
tr=transport ready

virtual drives = 11

vd list :
=======

-------------------------------------------------------------
dg/vd type  state access consist cache cac scc     size name 
-------------------------------------------------------------
0/0   raid1 optl  rw     yes     rwbd  -   off 931.0 gb      
1/1   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
2/2   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
3/3   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
4/4   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
5/5   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
6/6   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
7/7   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
8/8   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
9/9   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
10/10 raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
-------------------------------------------------------------

cac=cachecade|rec=recovery|ofln=offline|pdgd=partially degraded|dgrd=degraded
optl=optimal|ro=read only|rw=read write|hd=hidden|trans=transportready|b=blocked|
consist=consistent|r=read ahead always|nr=no read ahead|wb=writeback|
fwb=force writeback|wt=writethrough|c=cached io|d=direct io|scc=scheduled
check consistency

physical drives = 16

pd list :
=======

----------------------------------------------------------------------------
eid:slt did state dg      size intf med sed pi sesz model                sp 
----------------------------------------------------------------------------
32:0      0 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:1      1 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:2      2 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:4      4 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:5      5 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:6      6 onln  1   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:7      7 onln  2   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:8      8 onln  3   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:9      9 onln  4   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:10    10 onln  5   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:11    11 onln  6   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:12    12 onln  7   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:13    13 onln  8   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:14    14 onln  9   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:15    15 onln  10  931.0 gb sata hdd n   n  512b st91000640ns         u  
----------------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded


bbu_info :
========

----------------------------------------------
model state   retentiontime temp mode mfgdate 
----------------------------------------------
bbu   optimal 0 hour(s)     38c  -    0/00/00 
----------------------------------------------

View each disk's device id, slot no. and drive group
[root@node-15 ~]# perccli64 /c0/eall/sall show
controller = 0
status = success
description = show drive information succeeded.


drive information :
=================

----------------------------------------------------------------------------
eid:slt did state dg      size intf med sed pi sesz model                sp 
----------------------------------------------------------------------------
32:0      0 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:1      1 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:2      2 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u  
32:4      4 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:5      5 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:6      6 onln  1   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:7      7 onln  2   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:8      8 onln  3   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:9      9 onln  4   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:10    10 onln  5   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:11    11 onln  6   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:12    12 onln  7   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:13    13 onln  8   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:14    14 onln  9   931.0 gb sata hdd n   n  512b st91000640ns         u  
32:15    15 onln  10  931.0 gb sata hdd n   n  512b st91000640ns         u  
----------------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded
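Reading the table by eye works for one server, but the same lookup can be scripted. The sketch below is a hedged example, not a documented perccli interface: it assumes the column order shown above (eid:slt, did, state, dg) and only reads those first columns, because the model column can contain spaces. `PhysicalDrive`, `parse_pd_list` and `SAMPLE` are names made up for this example.

```python
import re
from typing import List, NamedTuple, Optional


class PhysicalDrive(NamedTuple):
    eid: int
    slot: int
    did: int
    state: str
    dg: Optional[int]  # None for drives outside any drive group (dg shown as "-")


# A data row starts with "<eid>:<slt>", which the header and legend lines never do.
ROW = re.compile(r"^\s*(\d+):(\d+)\s+(\d+)\s+(\S+)\s+(\S+)")


def parse_pd_list(text: str) -> List[PhysicalDrive]:
    drives = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            eid, slot, did, state, dg = m.groups()
            drives.append(PhysicalDrive(int(eid), int(slot), int(did), state,
                                        int(dg) if dg.isdigit() else None))
    return drives


SAMPLE = """\
eid:slt did state dg      size intf med sed pi sesz model                sp
32:3      3 jbod  -  185.75 gb sata ssd n   n  512b intel ssdsc2bx200g4r u
32:4      4 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u
32:5      5 onln  0   931.0 gb sata hdd n   n  512b st91000640ns         u
32:6      6 onln  1   931.0 gb sata hdd n   n  512b st91000640ns         u
"""

drives = parse_pd_list(SAMPLE)
jbod_count = sum(1 for d in drives if d.state.lower() == "jbod")
in_dg1 = [d for d in drives if d.dg == 1]
print(jbod_count, in_dg1)
```

The state comparison is case-insensitive because the transcript in this article is lower-cased, while the tool itself may print the state in upper case.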

note:

From experience, JBOD drives are enumerated before RAID virtual drives.

Show information for a specific drive
[root@node-15 ~]# perccli64 /c0/e32/s6 show all
controller = 0
status = success
description = show drive information succeeded.


drive /c0/e32/s6 :
================

-------------------------------------------------------------------
eid:slt did state dg     size intf med sed pi sesz model        sp 
-------------------------------------------------------------------
32:6      6 onln   1 931.0 gb sata hdd n   n  512b st91000640ns u  
-------------------------------------------------------------------

eid-enclosure device id|slt-slot no.|did-device id|dg-drivegroup
dhs-dedicated hot spare|ugood-unconfigured good|ghs-global hotspare
ubad-unconfigured bad|onln-online|offln-offline|intf-interface
med-media type|sed-self encryptive drive|pi-protection info
sesz-sector size|sp-spun|u-up|d-down/powersave|t-transition|f-foreign
ugunsp-unsupported|ugshld-unconfigured shielded|hspshld-hotspare shielded
cfshld-configured shielded|cpybck-copyback|cbshld-copyback shielded


drive /c0/e32/s6 - detailed information :
=======================================

drive /c0/e32/s6 state :
======================
shield counter = 0
media error count = 46431               *** clear sign of trouble: 46431 media errors ***
other error count = 0
drive temperature =  31c (87.80 f)  
predictive failure count = 126          *** 126 predicted failures ***
s.m.a.r.t alert flagged by drive = yes


drive /c0/e32/s6 device attributes :
==================================
sn = 9xga228l
manufacturer id = ata     
model number = st91000640ns
nand vendor = na
wwn = 5000c500918f2f8a
firmware revision =     aa63
raw size = 931.512 gb [0x74706db0 sectors]
coerced size = 931.0 gb [0x74600000 sectors]
non coerced size = 931.012 gb [0x74606db0 sectors]
device speed = 6.0gb/s
link speed = 12.0gb/s
ncq setting = n/a
write cache = enabled
logical sector size = 512b
physical sector size = 512b
connector name = 00 


drive /c0/e32/s6 policies/settings :
==================================
drive position = drivegroup:1, span:0, row:0
enclosure position = 0
connected port number = 0(path0) 
sequence number = 2
commissioned spare = no
emergency spare = no
last predictive failure event sequence number = 95183    *** sequence number of the last predictive failure event ***
successful diagnostics completion on = n/a
sed capable = no
sed enabled = no
secured = no
cryptographic erase capable = no
locked = no
needs ekm attention = no
pi eligible = no
certified = yes
wide port capable = no

port information :
================

-----------------------------------------
port status linkspeed sas address        
-----------------------------------------
   0 active 12.0gb/s  0x500056b33fefe586 
-----------------------------------------


inquiry data = 
5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00 
00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20 
58 39 41 47 32 32 4c 38 00 00 00 00 04 00 20 20 
20 20 41 41 33 36 54 53 31 39 30 30 36 30 30 34 
53 4e 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 10 80 
00 40 00 2f 00 40 00 02 00 02 07 00 ff 3f 10 00 
3f 00 10 fc fb 00 10 00 ff ff ff 0f 00 00 07 00 

note:

Looking at the details of this single drive reveals media errors, which indicates that the disk is faulty.
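The counters that matter in this output can be pulled out automatically. The following Python sketch is a minimal example that assumes the "name = value" layout shown in the transcript above; `parse_error_counters` and `looks_faulty` are hypothetical helper names, and treating any non-zero media error or predictive failure count as suspicious is an assumption of this example, not a vendor rule.

```python
import re

COUNTERS = ("media error count", "other error count", "predictive failure count")


def parse_error_counters(text: str) -> dict:
    """Collect the integer error counters from `show all` output of one drive."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip().lower()
        if sep and key in COUNTERS:
            m = re.match(r"\s*(\d+)", value)
            if m:
                counters[key] = int(m.group(1))
    return counters


def looks_faulty(counters: dict) -> bool:
    # Assumption: any media error or predicted failure deserves attention.
    return (counters.get("media error count", 0) > 0
            or counters.get("predictive failure count", 0) > 0)


SAMPLE = """\
shield counter = 0
media error count = 46431
other error count = 0
drive temperature =  31c (87.80 f)
predictive failure count = 126
last predictive failure event sequence number = 95183
"""

counters = parse_error_counters(SAMPLE)
print(counters, looks_faulty(counters))
```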

Show the mapping between drives and system disk devices
[root@node-15 ~]# perccli64 /c0/vall show
controller = 0
status = success
description = none


virtual drives :
==============

-------------------------------------------------------------
dg/vd type  state access consist cache cac scc     size name 
-------------------------------------------------------------
0/0   raid1 optl  rw     yes     rwbd  -   off 931.0 gb      
1/1   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
2/2   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
3/3   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
4/4   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
5/5   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
6/6   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
7/7   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
8/8   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
9/9   raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
10/10 raid0 optl  rw     yes     rwbd  -   off 931.0 gb      
-------------------------------------------------------------

cac=cachecade|rec=recovery|ofln=offline|pdgd=partially degraded|dgrd=degraded
optl=optimal|ro=read only|rw=read write|hd=hidden|trans=transportready|b=blocked|
consist=consistent|r=read ahead always|nr=no read ahead|wb=writeback|
fwb=force writeback|wt=writethrough|c=cached io|d=direct io|scc=scheduled
check consistency

note:

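vd: generally taken to be the drive's enumeration order in the OS. If there are only RAID virtual drives, vd=0 is /dev/sda, vd=1 is /dev/sdb, and so on. If there are JBOD drives, they are enumerated first: if the JBOD drives run up to /dev/sdc, then vd0 is /dev/sdd, and so on;
dg: the order in which drive groups were configured on the RAID card;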
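The note above describes the direction from vd to device name. As a rough sketch, assuming the same rule of thumb (JBOD drives first, then virtual drives in vd order), it can be computed as follows. `device_name` and `device_for_vd` are hypothetical helper names.

```python
def device_name(index: int) -> str:
    """Return the sdX name for a 0-based enumeration index (0 -> /dev/sda, 26 -> /dev/sdaa)."""
    if index < 0:
        raise ValueError("index must be non-negative")
    letters = ""
    n = index + 1
    while n > 0:
        # Bijective base-26: there is no zero digit, 'a' stands for 1.
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return "/dev/sd" + letters


def device_for_vd(vd: int, jbod_count: int) -> str:
    """Assumption: the OS lists all JBOD drives first, then virtual drives in vd order."""
    return device_name(jbod_count + vd)


# Three JBOD drives (sda..sdc): vd0 becomes /dev/sdd, as in the note.
print(device_for_vd(0, jbod_count=3))
# Four JBOD drives, as on the example server: vd1 becomes /dev/sdf.
print(device_for_vd(1, jbod_count=4))
```

Device naming can differ after hot-plug events or reboots, so the result is a starting point to verify, not a guarantee.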

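Commands for collecting RAID card logs

storcli64 /c0 show time: show the RAID controller's time

storcli64 /c0 show alilog logfile=node-x.alilog: collect the alilog, which includes all of the logs

storcli64 /c0 show all logfile=node-x.all.log: RAID card information

storcli64 /c0 show badblocks: bad block information for the disks

perccli64 /c0 show events filter=fatal: show events at the fatal level. This lists every fatal event and helps spot disk or RAID card failures

perccli64 /c0 show cc: consistency check. Volumes at RAID 1 and above hold data across several disks and need consistency checks, while a single-disk RAID 0 probably does not. Whether the check affects performance is not clear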
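When logs have to be gathered from many nodes, the commands listed above can be generated rather than typed. The sketch below only builds the argument lists exactly as they appear in this section; `log_collection_commands` and `collect` are hypothetical names, and `collect` is a thin wrapper around `subprocess.run` that is defined but not called here.

```python
import subprocess
from typing import List


def log_collection_commands(node: str, cli: str = "storcli64",
                            controller: int = 0) -> List[List[str]]:
    """Build the log collection commands from this section for one node."""
    target = f"/c{controller}"
    return [
        [cli, target, "show", "time"],
        [cli, target, "show", "alilog", f"logfile={node}.alilog"],
        [cli, target, "show", "all", f"logfile={node}.all.log"],
        [cli, target, "show", "badblocks"],
        [cli, target, "show", "events", "filter=fatal"],
        [cli, target, "show", "cc"],
    ]


def collect(node: str, cli: str = "storcli64") -> None:
    """Run every command in turn; needs root and the CLI installed on the host."""
    for argv in log_collection_commands(node, cli):
        # check=False: keep going even if one command is unsupported on this card.
        subprocess.run(argv, check=False)


for argv in log_collection_commands("node-x"):
    print(" ".join(argv))
```

Pass `cli="perccli64"` on Dell servers, since the article states that the two tools are used the same way.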

section five : smartctl get error info of disks

common commands usage description

--scan scan for devices

--scan-open scan for devices and try to open each device

-x, --xall show all information for device

-a, --all show all smart information for device

-i, --info show identity information for device

-d TYPE, --device=TYPE specify device type to one of: ata, scsi, nvme[,nsid], sat[,auto][,n][+type], usbcypress[,x], usbjmicron[,p][,x][,n], usbprolific, usbsunplus, marvell, areca,n/e, 3ware,n, hpt,l/m/n, megaraid,n, aacraid,h,l,id, cciss,n, auto, test

-s VALUE, --smart=VALUE enable/disable smart on device (on/off)

-o VALUE, --offlineauto=VALUE (ata) enable/disable automatic offline testing on device (on/off)

-S VALUE, --saveauto=VALUE (ata) enable/disable attribute autosave on device (on/off)

-H, --health show device smart health status

-c, --capabilities (ata, nvme) show device smart capabilities

-A, --attributes show device smart vendor-specific attributes and values

-l TYPE, --log=TYPE show device log. TYPE: error, selftest, selective, directory[,g|s],
  xerror[,n][,error], xselftest[,n][,selftest],
  background, sasphy[,reset], sataphy[,reset],
  scttemp[sts,hist], scttempint,n[,p],
  scterc[,n,m], devstat[,n], ssd,
  gplog,n[,range], smartlog,n[,range],
  nvmelog,n,size

-t TEST, --test=TEST run test. TEST: offline, short, long, conveyance, force, vendor,n,
  select,m-n, pending,n, afterselect,[on|off]

-X, --abort abort any non-captive test on device

get info for /dev/sdf

List all devices
[root@node-15 ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, scsi device
/dev/sdb -d scsi # /dev/sdb, scsi device
/dev/sdc -d scsi # /dev/sdc, scsi device
/dev/sdd -d scsi # /dev/sdd, scsi device
/dev/sde -d scsi # /dev/sde, scsi device
/dev/sdf -d scsi # /dev/sdf, scsi device
/dev/sdg -d scsi # /dev/sdg, scsi device
/dev/sdh -d scsi # /dev/sdh, scsi device
/dev/sdi -d scsi # /dev/sdi, scsi device
/dev/sdj -d scsi # /dev/sdj, scsi device
/dev/sdk -d scsi # /dev/sdk, scsi device
/dev/sdl -d scsi # /dev/sdl, scsi device
/dev/sdm -d scsi # /dev/sdm, scsi device
/dev/sdn -d scsi # /dev/sdn, scsi device
/dev/sdo -d scsi # /dev/sdo, scsi device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], scsi device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], scsi device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], scsi device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], scsi device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], scsi device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], scsi device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], scsi device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], scsi device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], scsi device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], scsi device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], scsi device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], scsi device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], scsi device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], scsi device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], scsi device
/dev/bus/0 -d megaraid,15 # /dev/bus/0 [megaraid_disk_15], scsi device

note:

In the previous sections we found that disk /dev/sdf has did (device id) 6 in perccli, which corresponds to /dev/bus/0 -d megaraid,6
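The link between the did and the smartctl device type can also be made in code. The following Python sketch assumes the `--scan` line layout shown above ("device -d type # comment"); `megaraid_ids` and `smartctl_argv` are hypothetical helper names, and the sketch only builds the command line without running smartctl.

```python
import re
from typing import List

# One scan line looks like: "/dev/bus/0 -d megaraid,6 # /dev/bus/0 [...], scsi device"
SCAN_LINE = re.compile(r"^(\S+)\s+-d\s+(\S+)")


def megaraid_ids(scan_output: str) -> List[int]:
    """Return the megaraid,N numbers that `smartctl --scan` reported."""
    ids = []
    for line in scan_output.splitlines():
        m = SCAN_LINE.match(line.strip())
        if m and m.group(2).startswith("megaraid,"):
            ids.append(int(m.group(2).split(",", 1)[1]))
    return ids


def smartctl_argv(did: int, device: str, option: str = "-i") -> List[str]:
    """Build a smartctl call that addresses a drive behind the RAID card by its did."""
    return ["smartctl", option, "-d", f"megaraid,{did}", device]


SAMPLE = """\
/dev/sdf -d scsi # /dev/sdf, scsi device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], scsi device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], scsi device
"""

assert 6 in megaraid_ids(SAMPLE)
print(" ".join(smartctl_argv(6, "/dev/sdf")))
```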

Show disk information
[root@node-15 ~]# smartctl -i -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of information section ===
model family:     seagate constellation.2 (sata)
device model:     st91000640ns
serial number:    9xga228l
lu wwn device id: 5 000c50 0918f2f8a
add. product id:  dell(tm)
firmware version: aa63
user capacity:    1,000,204,886,016 bytes [1.00 tb]
sector size:      512 bytes logical/physical
rotation rate:    7200 rpm
form factor:      2.5 inches
device is:        in smartctl database [for details use: -p show]
ata version is:   ata8-acs t13/1699-d revision 4
sata version is:  sata 3.0, 6.0 gb/s (current: 6.0 gb/s)
local time is:    fri jan 11 11:28:46 2019 cst
smart support is: available - device has smart capability.
smart support is: enabled
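
Show the disk's attributes

This is generally where you check the indicators of the disk's overall health.

The fields in the output below are explained as follows:

  • id: the attribute id, usually a decimal or hexadecimal number between 1 and 255.
  • attribute_name: the attribute name defined by the drive manufacturer.
  • flag: the attribute handling flags (can be ignored).
  • value: one of the most important columns in the table. It is the normalized value of the attribute, between 1 and 253, where 253 is the best case and 1 the worst. Depending on the attribute and the manufacturer, the initial value may be set to 100 or 200.
  • worst: the lowest value recorded so far.
  • thresh: the lowest value allowed before the drive reports a failed state. An attribute whose value or worst drops to thresh or below is reported as failed, now or in the past.
  • type: the attribute type (pre-fail or old_age). A pre-fail attribute is a critical one that takes part in the drive's overall SMART health assessment (passed/failed). If any pre-fail attribute fails, the drive should be regarded as about to fail. An old_age attribute is non-critical (normal wear, for example) and does not by itself mean the drive is failing.
  • updated: how often the attribute is updated. always means it is updated during normal operation as well; offline means it is updated only when offline tests run on the drive.
  • when_failed: set to "failing_now" if value is less than or equal to thresh, to "in_the_past" if worst is less than or equal to thresh, and to "-" otherwise. With "failing_now", back up important files as soon as possible, especially when the attribute is of type pre-fail. "in_the_past" means the attribute failed at some point but was fine when the check ran. "-" means the attribute has never failed.
  • raw_value: the raw value defined by the manufacturer. The normalized value is derived from it.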
[root@node-15 ~]# smartctl -A -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart attributes data structure revision number: 10
vendor specific smart attributes with thresholds:
id# attribute_name          flag     value worst thresh type      updated  when_failed raw_value
  1 raw_read_error_rate     0x010f   081   038   044    pre-fail  always   in_the_past 151546765
  3 spin_up_time            0x0103   094   094   000    pre-fail  always       -       0
  4 start_stop_count        0x0032   100   100   020    old_age   always       -       21
  5 reallocated_sector_ct   0x0133   100   100   036    pre-fail  always       -       0
  7 seek_error_rate         0x000f   085   060   030    pre-fail  always       -       338813105
  9 power_on_hours          0x0032   079   079   000    old_age   always       -       18784
 10 spin_retry_count        0x0013   100   100   097    pre-fail  always       -       0
 12 power_cycle_count       0x0032   100   100   020    old_age   always       -       21
184 end-to-end_error        0x0032   100   100   099    old_age   always       -       0
187 reported_uncorrect      0x0032   001   001   000    old_age   always       -       1710
188 command_timeout         0x0032   100   100   000    old_age   always       -       0
189 high_fly_writes         0x003a   100   100   000    old_age   always       -       0
190 airflow_temperature_cel 0x0022   069   053   045    old_age   always       -       31 (min/max 24/40)
191 g-sense_error_rate      0x0032   100   100   000    old_age   always       -       0
192 power-off_retract_count 0x0032   100   100   000    old_age   always       -       19
193 load_cycle_count        0x0032   100   100   000    old_age   always       -       852
194 temperature_celsius     0x0022   031   047   000    old_age   always       -       31 (0 14 0 0 0)
195 hardware_ecc_recovered  0x001a   117   099   000    old_age   always       -       151546765
197 current_pending_sector  0x0012   084   084   000    old_age   always       -       688
198 offline_uncorrectable   0x0010   084   084   000    old_age   offline      -       688
199 udma_crc_error_count    0x003e   200   200   000    old_age   always       -       0
240 head_flying_hours       0x0000   100   253   000    old_age   offline      -       8093 (164 214 0)
241 total_lbas_written      0x0000   100   253   000    old_age   offline      -       1870535293
242 total_lbas_read         0x0000   100   253   000    old_age   offline      -       1530387871
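The when_failed rule can be checked against the table with a short script. This Python sketch assumes the ten-column attribute layout shown above. `SmartAttribute`, `parse_attributes` and `when_failed` are names invented for this example, and treating a threshold of 0 as "cannot fail" is an assumption of the sketch.

```python
import re
from typing import List, NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    kind: str  # pre-fail or old_age
    raw: str


# id, name, flag, value, worst, thresh, type, updated, when_failed, raw_value
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+\S+\s+\S+\s+(.*)$"
)


def parse_attributes(text: str) -> List[SmartAttribute]:
    attrs = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            attr_id, name, value, worst, thresh, kind, raw = m.groups()
            attrs.append(SmartAttribute(int(attr_id), name, int(value), int(worst),
                                        int(thresh), kind, raw.strip()))
    return attrs


def when_failed(value: int, worst: int, thresh: int) -> str:
    if thresh == 0:
        return "-"  # assumption: no threshold defined, so the attribute cannot fail
    if value <= thresh:
        return "failing_now"
    if worst <= thresh:
        return "in_the_past"
    return "-"


SAMPLE = """\
id# attribute_name          flag     value worst thresh type      updated  when_failed raw_value
  1 raw_read_error_rate     0x010f   081   038   044    pre-fail  always   in_the_past 151546765
  5 reallocated_sector_ct   0x0133   100   100   036    pre-fail  always       -       0
187 reported_uncorrect      0x0032   001   001   000    old_age   always       -       1710
"""

flagged = [a.name for a in parse_attributes(SAMPLE)
           if when_failed(a.value, a.worst, a.thresh) != "-"]
print(flagged)
```

On the sample rows only raw_read_error_rate is flagged: its worst value of 38 is below the threshold of 44 while its current value of 81 is not, which is the in_the_past case.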
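
Check the disk's health status

note:

The result below is passed, meaning the disk is still usable. However, it lists one marginal attribute with worst < thresh, type pre-fail and when_failed in_the_past, which suggests that this disk is likely to fail soon.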

[root@node-15 ~]# smartctl -H -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart status not supported: ata return descriptor not supported by controller firmware
smart overall-health self-assessment test result: passed
warning: this result is based on an attribute check.
please note the following marginal attributes:
id# attribute_name          flag     value worst thresh type      updated  when_failed raw_value
  1 raw_read_error_rate     0x010f   081   038   044    pre-fail  always   in_the_past 151546765

View the disk's error log
[root@node-15 ~]# smartctl -l error -d megaraid,6 /dev/sdf
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-327.20.1.es2.el7.x86_64] (local build)
copyright (c) 2002-16, bruce allen, christian franke, www.smartmontools.org

=== start of read smart data section ===
smart error log version: 1
ata error count: 46431 (device log contains only the most recent five errors)
        cr = command register [hex]
        fr = features register [hex]
        sc = sector count register [hex]
        sn = sector number register [hex]
        cl = cylinder low register [hex]
        ch = cylinder high register [hex]
        dh = device/head register [hex]
        dc = device command register [hex]
        er = error register [hex]
        st = status register [hex]
powered_up_time is measured from power on, and printed as
ddd+hh:mm:ss.sss where dd=days, hh=hours, mm=minutes,
ss=sec, and sss=millisec. it "wraps" after 49.710 days.

error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc   powered_up_time  command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:32.968  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:29.901  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext

error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc   powered_up_time  command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:29.901  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext

error 46429 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc   powered_up_time  command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:26.825  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status

error 46428 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc   powered_up_time  command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:23.965  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status
  2f 00 01 e0 00 00 40 00  46d+15:15:17.703  read log ext

error 46427 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  when the command that caused the error occurred, the device was active or idle.

  after command completion occurred, registers were:
  er st sc sn cl ch dh
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455

  commands leading to the command that caused the error were:
  cr fr sc sn cl ch dh dc   powered_up_time  command/feature_name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  42 00 00 ff ff ff 4f 00  46d+15:15:20.905  read verify sector(s) ext
  42 00 00 ff ff ff 4f 00  46d+15:15:18.093  read verify sector(s) ext
  b0 da 00 00 4f c2 00 00  46d+15:15:17.838  smart return status
  2f 00 01 e0 00 00 40 00  46d+15:15:17.703  read log ext
  42 00 00 ff ff ff 4f 00  46d+15:15:15.276  read verify sector(s) ext
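The five logged entries above all report the same error at the same LBA, which is easier to see once the log is summarized. The Python sketch below assumes the line formats shown in this transcript ("ata error count: N" and "error: unc at lba = 0x..."); `summarize_error_log` is a hypothetical helper name, and matching is case-insensitive because the transcript here is lower-cased.

```python
import re
from collections import Counter

COUNT = re.compile(r"ata error count:\s*(\d+)", re.IGNORECASE)
ENTRY = re.compile(r"error:\s*(\w+)\s+at lba = 0x([0-9a-f]+)", re.IGNORECASE)


def summarize_error_log(text: str):
    """Return (total error count or None, Counter keyed by (error kind, lba))."""
    m = COUNT.search(text)
    total = int(m.group(1)) if m else None
    by_lba = Counter((kind.lower(), int(lba, 16)) for kind, lba in ENTRY.findall(text))
    return total, by_lba


SAMPLE = """\
ata error count: 46431 (device log contains only the most recent five errors)
error 46431 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455
error 46430 occurred at disk power-on lifetime: 18640 hours (776 days + 16 hours)
  40 51 00 ff ff ff 0f  error: unc at lba = 0x0fffffff = 268435455
"""

total, by_lba = summarize_error_log(SAMPLE)
print(total, dict(by_lba))
```

The total matches the media error count that perccli reported for this drive (46431), so both tools point at the same disk.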
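
Additional notes
  • If SMART is not enabled on the disk, it can be turned on with -s on device.
  • If smartctl -i returns little information even though SMART support is available and enabled, a test may have to be run first with -t short or -t long. The test does not destroy data on the disk, but offline tests are generally not suitable for storage servers.
  • When collecting data, the -x and -a options give more complete disk information.
  • smartmontools can also run as a service, configured in /etc/smartmontools/smartd.conf. I have not looked into this yet and will update this article when I have.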
