HPE

INTERNAL USE ONLY

Analysis Code: 201

Severity: Warning

 

INEX is checking “showeventlog_-d_-debug_-oneline.out” for the string:

 "scsi_cmnd_retry: pd .* opcode .* rval 0x31"

The rule looks for this event occurring 15 times per minute for each unique combination of pd, port, and <N:S:P>.

Based on the information provided, check the following:

·         Is the issue seen on a single PD or on multiple PDs?

·         Is the issue seen on a single port or on multiple ports?

 

Plan Of Actions:

If the issue is seen on a single PD:

For example:

    480 pd 1 port b0 on 0:0:1

 

- controlport rst -l <port>

- After 15 minutes, check whether the issue persists. If it does, reseat the drive.

- After another 15 minutes, check whether the issue persists. If it does, elevate the issue.

 

If the issue is seen on multiple drives that all point to the same port:

- controlport rst -l <port>

- After 15 minutes, check whether the issue persists. If it does, elevate the issue.

 

If the issue is seen on multiple drives that all point to the same <N:S>:

For example:

   520 pd 240 port a0 on 3:0:1

   300 pd 264 port a0 on 3:0:1

   264 pd 274 port a0 on 3:0:2

   542 pd 276 port a0 on 3:0:2

   622 pd 294 port a0 on 3:0:2

- controlport rst -l <port>

 

If the issue is seen on multiple drives that point to multiple <N:S>:

- This situation may require multiple controlport reset commands, which can affect multiple hosts and their I/O. Weigh the impact of multiple port resets carefully, or consider a cluster shutdown to resolve the issue. Elevate the issue.
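The scenarios above can be told apart mechanically from the summarized counts. The following is a minimal sketch, not part of the INEX rule itself: the function name classify_scope and the scenario labels are hypothetical, and it assumes that "same port" means the same <N:S:P> value and that input lines have the form "<count> pd <id> port <pd port> on <N:S:P>" as in the examples.

```shell
# classify_scope (hypothetical helper): read summary lines of the form
#   <count> pd <id> port <pd_port> on <N:S:P>
# on stdin and print which scenario from the plan of actions applies.
# Assumption: "same port" is interpreted as "same <N:S:P>".
classify_scope() {
  awk '
    NF >= 7 && $2 == "pd" {
      pds[$3] = 1                      # unique PD ids
      nsp[$7] = 1                      # unique <N:S:P> values
      split($7, parts, ":")
      ns[parts[1] ":" parts[2]] = 1    # unique <N:S> values
    }
    END {
      for (k in pds) n_pd++
      for (k in nsp) n_nsp++
      for (k in ns)  n_ns++
      if (n_pd == 0)        print "no-events"
      else if (n_pd == 1)   print "single-pd"
      else if (n_nsp == 1)  print "multiple-pds-same-port"
      else if (n_ns == 1)   print "multiple-pds-same-node-slot"
      else                  print "multiple-node-slots"
    }'
}
```

Feeding it the five-line 3:0:1 / 3:0:2 example gives the same-<N:S> case, because the drives span two ports on one node and slot. The result only indicates which branch of the plan to read; the decision to reset ports or elevate remains with the engineer.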

 

The same check can be run on the array with the following command, which collects and displays the data:

Get the last 15 minutes of the event log and count any events matching the "scsi_cmnd_retry: pd .* opcode .* rval 0x31" pattern:

# showeventlog -oneline -debug -min 15 -msg "scsi_cmnd_retry: pd .* opcode .* rval 0x31" |\
sed -e "s/.*scsi_cmnd_retry: //g" -e "s/ - opcode.*//g" | sort | uniq -c
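To see what the sed, sort and uniq stages do without access to an array, the text-processing part of the pipeline can be run against saved log lines. This is a sketch only: the function name summarize_retries is hypothetical, and the sample lines in the usage note are invented for illustration. Their layout (a prefix, then "scsi_cmnd_retry: pd ... - opcode ... rval 0x31") is an assumption inferred from the sed expressions, not copied from a real event log.

```shell
# summarize_retries (hypothetical helper): the text-processing half of the
# on-array command. Reads event log lines on stdin, strips everything up to
# "scsi_cmnd_retry: " and everything from " - opcode" onward, then counts
# identical "pd <id> port <pd port> on <N:S:P>" strings.
summarize_retries() {
  sed -e "s/.*scsi_cmnd_retry: //g" -e "s/ - opcode.*//g" | sort | uniq -c
}
```

For example, given three invented lines, two for pd 1 and one for pd 3:

```shell
printf '%s\n' \
  'prefix scsi_cmnd_retry: pd 1 port b0 on 0:0:1 - opcode 0x28 rval 0x31' \
  'prefix scsi_cmnd_retry: pd 3 port b0 on 0:0:1 - opcode 0x28 rval 0x31' \
  'prefix scsi_cmnd_retry: pd 1 port b0 on 0:0:1 - opcode 0x2a rval 0x31' \
  | summarize_retries
```

the output is one line per pd/port/<N:S:P> combination, each prefixed by its count.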

 

Scale the per-minute rate by the number of minutes covered:

    - 15 events/minute * 15 min = 225 events per port of the drive

       i.e., if the event count is >= 200, corrective action is needed.

       For example, in the data below, pd 1 and pd 3 reported the issue.

 

    480 pd 1 port b0 on 0:0:1

    486 pd 3 port b0 on 0:0:1
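The arithmetic and the threshold check can be applied to the summarized output directly. The sketch below is an illustration under assumptions: the names rate_per_min, window_min and flag_pds are hypothetical, and the default of 200 is simply the corrective-action threshold stated above.

```shell
# Expected event count at the trigger rate over the sampled window.
rate_per_min=15
window_min=15
expected=$((rate_per_min * window_min))   # 15 * 15 = 225

# flag_pds (hypothetical helper): read "<count> pd <id> port <pd port> on <N:S:P>"
# lines on stdin and print only those whose count meets the threshold.
# The threshold is the first argument and defaults to 200.
flag_pds() {
  awk -v t="${1:-200}" '$1 + 0 >= t + 0'
}
```

With the two sample lines above, both counts (480 and 486) are at or above 200, so both lines are printed and both drives need corrective action. For a window other than 15 minutes, pass a different threshold as the first argument.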