标签归档:asm恢复

ORA-15196: invalid ASM block header [kfc.c:26368]故障恢复

有客户对asm的data磁盘组增加磁盘进行扩容,在做reblance的过程中重启了主机,结果导致data磁盘组mount之后自动dismount

Fri Oct 09 20:48:06 2020
NOTE: PST enabling heartbeating (grp 1)
Fri Oct 09 20:48:06 2020
NOTE: ASM did background COD recovery for group 1/0x739536c (DATA)
NOTE: starting rebalance of group 1/0x739536c (DATA) at power 10
Starting background process ARB0
Fri Oct 09 20:48:07 2020
ARB0 started with pid=28, OS id=39278 
NOTE: assigning ARB0 to group 1/0x739536c (DATA) with 10 parallel I/Os
cellip.ora not found.
WARNING:cache read a corrupt block:group=1(DATA) dsk=8 blk=7 disk=8(DATA_0008)incarn=3916014506 au=0 blk=7 count=1
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_39278.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
NOTE: a corrupted block from group DATA was dumped to /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_39278.trc
WARNING:cache read(retry)a corrupt block:group=1(DATA) dsk=8 blk=7 disk=8(DATA_0008)incarn=3916014506 au=0 blk=7 count=1
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_39278.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
ERROR: cache failed to read group=1(DATA) dsk=8 blk=7 from disk(s): 8(DATA_0008)
Fri Oct 09 20:48:13 2020
NOTE: GroupBlock outside rolling migration privileged region
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
NOTE: requesting all-instance membership refresh for group=1
NOTE: cache initiating offline of disk 8 group DATA
NOTE: process _arb0_+asm1 (39278) initiating offline of disk 8.3916014506 (DATA_0008) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 8/0xe969a3aa, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 7 for pid 28, osid 39278
ERROR: Disk 8 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Fri Oct 09 20:48:13 2020
NOTE: cache dismounting (not clean) group 1/0x0739536C (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 39346, image: oracle@rac1 (B000)
Fri Oct 09 20:48:13 2020
NOTE: halting all I/Os to diskgroup 1 (DATA)
Fri Oct 09 20:48:13 2020
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=32.4749 last written ABA 32.4749
WARNING: Offline for disk DATA_0008 in mode 0x7f failed.
Fri Oct 09 20:48:13 2020
kjbdomdet send to inst 2
detach from dom 1, sending detach message to inst 2
Fri Oct 09 20:48:13 2020
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 2, cluster inc 4)
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_39278.trc  (incident=337185):
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0008" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
ORA-15196: invalid ASM block header [kfc.c:26368] [check_kfbh] [2147483656] [7] [2182009786 != 2190395015]
Incident details in: /u01/grid/diag/asm/+asm/+ASM1/incident/incdir_337185/+ASM1_arb0_39278_i337185.trc
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 1 invalid = TRUE 
 2341 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
freeing rdom 1
Fri Oct 09 20:48:13 2020
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0x0739536C (DATA) 

错误信息比较明显dsk=8 blk=7 au=0 blk=7 的check值不对,本来应该是2190395015现在变为了2182009786,通过kfed分析确实如此

C:\Users\Administrator>kfed read f:/temp/xff/2.dd blkn=7|more
kfbh.endian:                          1 ; 0x000: 0x01
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          2 ; 0x003: 0x02
kfbh.block.blk:                       7 ; 0x004: blk=7
kfbh.block.obj:              2147483656 ; 0x008: disk=8
kfbh.check:                  2182009786 ; 0x00c: 0x820ed3ba
kfbh.fcn.base:                  2711248 ; 0x010: 0x00295ed0
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdatb.aunum:                      2240 ; 0x000: 0x000008c0
kfdatb.shrink:                      448 ; 0x004: 0x01c0
kfdatb.ub2pad:                        0 ; 0x006: 0x0000
kfdatb.auinfo[0].link.next:           8 ; 0x008: 0x0008
kfdatb.auinfo[0].link.prev:           8 ; 0x00a: 0x0008
kfdatb.auinfo[1].link.next:          12 ; 0x00c: 0x000c
kfdatb.auinfo[1].link.prev:          12 ; 0x00e: 0x000c
kfdatb.auinfo[2].link.next:          16 ; 0x010: 0x0010
kfdatb.auinfo[2].link.prev:          16 ; 0x012: 0x0010
kfdatb.auinfo[3].link.next:          20 ; 0x014: 0x0014
kfdatb.auinfo[3].link.prev:          20 ; 0x016: 0x0014
kfdatb.auinfo[4].link.next:          24 ; 0x018: 0x0018
kfdatb.auinfo[4].link.prev:          24 ; 0x01a: 0x0018
kfdatb.auinfo[5].link.next:          28 ; 0x01c: 0x001c
kfdatb.auinfo[5].link.prev:          28 ; 0x01e: 0x001c
kfdatb.auinfo[6].link.next:          32 ; 0x020: 0x0020
kfdatb.auinfo[6].link.prev:          32 ; 0x022: 0x0020
kfdatb.spare:                         0 ; 0x024: 0x00000000

修改该值之后,再次mount data磁盘组,报错如下

Sat Oct 10 13:49:22 2020
ARB0 started with pid=28, OS id=10329 
NOTE: assigning ARB0 to group 1/0x3759521c (DATA) with 10 parallel I/Os
cellip.ora not found.
Sat Oct 10 13:49:26 2020
NOTE: GroupBlock outside rolling migration privileged region
NOTE: requesting all-instance membership refresh for group=1
WARNING: cache read  a corrupt block: group=1(DATA) dsk=8 blk=8 disk=8 (DATA_0008) incarn=3916014011 au=0 blk=8 count=1
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_10329.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
NOTE: a corrupted block from group DATA was dumped to /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_10329.trc
WARNING:cache read(retry)a corrupt block: group=1(DATA)dsk=8 blk=8 disk=8(DATA_0008)incarn=3916014011 au=0 blk=8 count=1
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_10329.trc:
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
ERROR: cache failed to read group=1(DATA) dsk=8 blk=8 from disk(s): 8(DATA_0008)
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
NOTE: cache initiating offline of disk 8 group DATA
NOTE: process _arb0_+asm1 (10329) initiating offline of disk 8.3916014011 (DATA_0008) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 8/0xe969a1bb, mask = 0x6a, op = clear
GMON updating disk modes for group 1 at 64 for pid 28, osid 10329
ERROR: Disk 8 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Sat Oct 10 13:49:28 2020
NOTE: cache dismounting (not clean) group 1/0x3759521C (DATA) 
WARNING: Offline for disk DATA_0008 in mode 0x7f failed.
Sat Oct 10 13:49:28 2020
NOTE: halting all I/Os to diskgroup 1 (DATA)
NOTE: messaging CKPT to quiesce pins Unix process pid: 10346, image: oracle@rac1 (B000)
Errors in file /u01/grid/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_10329.trc  (incident=363107):
ORA-15335: ASM metadata corruption detected in disk group 'DATA'
ORA-15130: diskgroup "DATA" is being dismounted
ORA-15066: offlining disk "DATA_0008" in group "DATA" may result in a data loss
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
ORA-15196: invalid ASM block header [kfc.c:26368] [endian_kfbh] [2147483656] [8] [0 != 1]
Incident details in: /u01/grid/diag/asm/+asm/+ASM1/incident/incdir_363107/+ASM1_arb0_10329_i363107.trc

该报错为:dsk=8 blk=7 au=0 blk=8异常,通过kfed查看发现

C:\Users\Administrator>kfed read f:/temp/xff/2.dd blkn=8
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
006BE8C00 00000000 00000000 00000000 00000000  [................]
        Repeat 31 times
006BE8E00 012C0000 04AFFC07 003BFFCD 03F15BD0  [..,.......;..[..]
006BE8E10 012BFC30 00000000 00000002 00000002  [0.+.............]
006BE8E20 00008000 00008000 00002000 564F22AF  [......... ..."OV]
006BE8E30 5F805293 FFFF0002 0001EF53 00000001  [.R._....S.......]
006BE8E40 545AE384 00000000 00000000 00000001  [..ZT............]
006BE8E50 00000000 0000000B 00000100 0000003C  [............<...]
006BE8E60 00000242 0000007B 52438BA0 C44FFA90  [B...{.....CR..O.]
006BE8E70 33B6F381 919E2DBA 00000000 00000000  [...3.-..........]
006BE8E80 00000000 00000000 6361622F 0070756B  [......../backup.]
006BE8E90 00000000 00000000 00000000 00000000  [................]
        Repeat 2 times
006BE8EC0 00000000 00000000 00000000 03ED0000  [................]
006BE8ED0 00000000 00000000 00000000 00000000  [................]
006BE8EE0 00000008 00000000 00000000 AC3C87D6  [..............<.]
006BE8EF0 F1401174 F4F036BD 274FB92F 00000101  [t.@..6../.O'....]
006BE8F00 0000000C 00000000 545AE384 0002F30A  [..........ZT....]
006BE8F10 00000004 00000000 00000000 00007FFF  [................]
006BE8F20 02508000 00007FFF 00000001 0250FFFF  [..P...........P.]
006BE8F30 00000000 00000000 00000000 00000000  [................]
006BE8F40 00000000 00000000 00000000 08000000  [................]
006BE8F50 00000000 00000000 00000000 001C001C  [................]
006BE8F60 00000001 00000000 00000000 00000000  [................]
006BE8F70 00000000 00000004 A9AF72B9 0000003B  [.........r..;...]
006BE8F80 00000000 00000000 00000000 00000000  [................]
        Repeat 167 times
006BE9A00 00001CC4 00800101 00001CC9 00800101  [................]
006BE9A10 00001CCD 00800101 00001CD2 00800101  [................]
006BE9A20 00001CD7 00800101 00001CDE 00800101  [................]
006BE9A30 00001CE3 00800101 00001CE8 00800101  [................]
006BE9A40 00001CEC 00800101 00000000 00000000  [................]
006BE9A50 00000000 00000000 00000000 00000000  [................]
  Repeat 26 times
KFED-00322:Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

该block完全损坏,基本上无直接修复的可能,通过对data 磁盘组进行patch操作,让其mount之后不再dismount

NOTE: GMON heartbeating for grp 1
GMON querying group 1 at 76 for pid 27, osid 14466
NOTE: cache opening disk 0 of grp 1: DATA_0000 path:/dev/emcpowere
NOTE: F1X0 found on disk 0 au 2 fcn 0.2708382
NOTE: cache opening disk 1 of grp 1: DATA_0001 path:/dev/emcpowerf
NOTE: cache opening disk 2 of grp 1: DATA_0002 path:/dev/emcpowerg
NOTE: cache opening disk 3 of grp 1: DATA_0003 path:/dev/emcpowerh
NOTE: cache opening disk 4 of grp 1: DATA_0004 path:/dev/emcpoweri
NOTE: cache opening disk 5 of grp 1: DATA_0005 path:/dev/emcpowerj
NOTE: cache opening disk 6 of grp 1: DATA_0006 path:/dev/emcpowerk
NOTE: cache opening disk 7 of grp 1: DATA_0007 path:/dev/emcpowerl
NOTE: cache opening disk 8 of grp 1: DATA_0008 path:/dev/emcpowerc
NOTE: cache mounting (first) external redundancy group 1/0x47495222 (DATA)
Sat Oct 10 13:59:38 2020
* allocate domain 1, invalid = TRUE 
Sat Oct 10 13:59:38 2020
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=53.6778 group=1 (DATA)
NOTE: advancing ckpt for group 1 (DATA) thread=1 ckpt=53.6778
NOTE: cache recovered group 1 to fcn 0.2961429
NOTE: redo buffer size is 256 blocks (1053184 bytes)
Sat Oct 10 13:59:38 2020
NOTE: LGWR attempting to mount thread 1 for diskgroup 1 (DATA)
NOTE: LGWR found thread 1 closed at ABA 53.6777
NOTE: LGWR mounted thread 1 for diskgroup 1 (DATA)
NOTE: LGWR opening thread 1 at fcn 0.2961429 ABA 54.6778
NOTE: cache mounting group 1/0x47495222 (DATA) succeeded
NOTE: cache ending mount (success) of group DATA number=1 incarn=0x47495222
NOTE: Instance updated compatible.asm to 11.2.0.0.0 for grp 1
SUCCESS: diskgroup DATA was mounted
SUCCESS: alter diskgroup data mount

然后通过rman备份数据库,删除老磁盘组,创建新磁盘组,恢复数据,实现数据库完美恢复,数据0丢失.

发表在 Oracle备份恢复 | 标签为 , , | 留下评论

ERROR: diskgroup XXXX was not mounted

aix平台10.2.0.5 2节点RAC,由于节点2系统盘故障,通过节点1镜像系统,复制到节点2,结果由于节点2磁盘顺序和节点1不匹配,aix工程师进行了相关操作之后,节点1重启之后datadg磁盘组无法mount

SQL> alter diskgroup datadg mount 
Mon Jun 10 23:23:46 CST 2019
NOTE: cache registered group DATADG number=1 incarn=0x8cf61164
Mon Jun 10 23:23:46 CST 2019
NOTE: Hbeat: instance first (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: start heartbeating (grp 1)
Mon Jun 10 23:23:50 CST 2019
NOTE: cache dismounting group 1/0x8CF61164 (DATADG) 
NOTE: dbwr not being msg'd to dismount
ERROR: diskgroup DATADG was not mounted

检查datadg磁盘组相关信息

Tue Jan 29 19:21:45 CST 2019
NOTE: start heartbeating (grp 2)
NOTE: cache opening disk 0 of grp 2: DATADG_0000 path:/dev/rhdisk6
Tue Jan 29 19:21:45 CST 2019
NOTE: F1X0 found on disk 0 fcn 0.0
NOTE: cache opening disk 1 of grp 2: DATADG_0001 path:/dev/rhdisk7
NOTE: cache opening disk 2 of grp 2: DATADG_0002 path:/dev/rhdisk8
NOTE: cache opening disk 3 of grp 2: DATADG_0003 path:/dev/rhdisk9
NOTE: cache mounting (first) group 2/0x60E59155 (DATADG)
* allocate domain 2, invalid = TRUE 
Tue Jan 29 19:21:45 CST 2019
NOTE: attached to recovery domain 2
Tue Jan 29 19:21:45 CST 2019
NOTE: cache recovered group 2 to fcn 0.849668
Tue Jan 29 19:21:45 CST 2019
NOTE: LGWR attempting to mount thread 1 for disk group 2
NOTE: LGWR mounted thread 1 for disk group 2
NOTE: opening chunk 1 at fcn 0.849668 ABA 
NOTE: seq=21 blk=5394 
Tue Jan 29 19:21:46 CST 2019
NOTE: cache mounting group 2/0x60E59155 (DATADG) succeeded
SUCCESS: diskgroup DATADG was mounted

通过这里可以看出来datadg磁盘组是由rhdisk6-9 四块磁盘组成,查询相关磁盘信息发现
asm-foreign


这里确定rhdisk7磁盘异常,通过kfed分析磁盘情况

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                           34 ; 0x001: 0x22
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                   49407 ; 0x004: blk=49407
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                    58396 ; 0x010: 0x0000e41c
kfbh.fcn.wrap:                   131072 ; 0x014: 0x00020000
kfbh.spare1:                 4294967064 ; 0x018: 0xffffff18
kfbh.spare2:                 2105310074 ; 0x01c: 0x7d7c7b7a
005918A00 00002200 0000C0FF 00000000 00000000  [."..............]
005918A10 0000E41C 00020000 FFFFFF18 7D7C7B7A  [............z{|}]
005918A20 00000000 00000000 00000000 00000000  [................]
  Repeat 253 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=1
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                            0 ; 0x001: 0x00
kfbh.type:                            0 ; 0x002: KFBTYP_INVALID
kfbh.datfmt:                          0 ; 0x003: 0x00
kfbh.block.blk:                       0 ; 0x004: blk=0
kfbh.block.obj:                       0 ; 0x008: file=0
kfbh.check:                           0 ; 0x00c: 0x00000000
kfbh.fcn.base:                        0 ; 0x010: 0x00000000
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
006EF8A00 00000000 00000000 00000000 00000000  [................]
  Repeat 255 times
KFED-00322: Invalid content encountered during block traversal: [kfbtTraverseBlock][Invalid OSM block type][][0]

D:\BaiduNetdiskDownload\xifenfei>kfed read rhdisk7.dd blkn=2|more
kfbh.endian:                          0 ; 0x000: 0x00
kfbh.hard:                          130 ; 0x001: 0x82
kfbh.type:                            3 ; 0x002: KFBTYP_ALLOCTBL
kfbh.datfmt:                          1 ; 0x003: 0x01
kfbh.block.blk:                33554432 ; 0x004: blk=33554432
kfbh.block.obj:                16777344 ; 0x008: file=128
kfbh.check:                  3844041089 ; 0x00c: 0xe51f6981
kfbh.fcn.base:               1297484544 ; 0x010: 0x4d560b00
kfbh.fcn.wrap:                        0 ; 0x014: 0x00000000
kfbh.spare1:                          0 ; 0x018: 0x00000000
kfbh.spare2:                          0 ; 0x01c: 0x00000000
kfdatb10.aunum:                       0 ; 0x000: 0x00000000
kfdatb10.shrink:                  49153 ; 0x004: 0xc001
kfdatb10.ub2pad:                  20555 ; 0x006: 0x504b
kfdatb10.auinfo[0].link.next:      2048 ; 0x008: 0x0800
kfdatb10.auinfo[0].link.prev:      2048 ; 0x00a: 0x0800
kfdatb10.auinfo[0].free:              0 ; 0x00c: 0x0000
kfdatb10.auinfo[0].total:         49153 ; 0x00e: 0xc001
kfdatb10.auinfo[1].link.next:      4096 ; 0x010: 0x1000
kfdatb10.auinfo[1].link.prev:      4096 ; 0x012: 0x1000
kfdatb10.auinfo[1].free:              0 ; 0x014: 0x0000
kfdatb10.auinfo[1].total:             0 ; 0x016: 0x0000
kfdatb10.auinfo[2].link.next:      6144 ; 0x018: 0x1800
kfdatb10.auinfo[2].link.prev:      6144 ; 0x01a: 0x1800
kfdatb10.auinfo[2].free:              0 ; 0x01c: 0x0000
kfdatb10.auinfo[2].total:             0 ; 0x01e: 0x0000
kfdatb10.auinfo[3].link.next:      8192 ; 0x020: 0x2000
kfdatb10.auinfo[3].link.prev:      8192 ; 0x022: 0x2000
kfdatb10.auinfo[3].free:              0 ; 0x024: 0x0000

对比磁盘可能的损坏情况,由于在aix 平台asm disk的block有一个特征一般0082开头,通过工具打开磁盘,检索该标记对比
正常磁盘
0082-good


异常磁盘
0082-had

通过上述分析,大概评估rhdisk7 元数据部分损坏的不光是block 0和1,人工修复继续使用的可能性不太大,而且基于客户的数据库不大,采取方案是直接拷贝数据文件、redo、控制文件到文件系统,然后在本地文件系统open库
open_db

运气不错,实现完美恢复数据0丢失

发表在 Oracle ASM, 非常规恢复 | 标签为 , | 评论关闭

WARNING: Read Failed.导致asm磁盘组异常

有客户对asm dg进行扩容,一段时间之后,asm data 磁盘组直接dismount

Wed May 29 18:37:25 2019
SUCCESS: ALTER DISKGROUP DATA ADD  DISK '/dev/oracleasm/disks/DATA_0028' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0027' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0026' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0025' SIZE 511993M /* ASMCA */
NOTE: starting rebalance of group 1/0x9e18e2f1 (DATA) at power 1
Wed May 29 18:37:26 2019
Starting background process ARB0
Wed May 29 18:37:26 2019
ARB0 started with pid=34, OS id=96638 
NOTE: assigning ARB0 to group 1/0x9e18e2f1 (DATA) with 1 parallel I/O
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
cellip.ora not found.
Wed May 29 19:21:43 2019
WARNING: Read Failed. group:1 disk:27 AU:0 offset:360448 size:4096
WARNING: cache failed reading from group=1(DATA) dsk=27 blk=88 count=1 from disk= 27 
(DATA_0027) kfkist=0x20 status=0x02 osderr=0x0 file=kfc.c line=11596
ERROR: cache failed to read group=1(DATA) dsk=27 blk=88 from disk(s): 27(DATA_0027)
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 704
Additional information: -1
NOTE: cache initiating offline of disk 27 group DATA
NOTE: process _user31879_+asm1 (31879) initiating offline of disk 27.3915911747 (DATA_0027) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 27/0xe9681243, mask = 0x6a, op = clear
Wed May 29 19:21:43 2019
GMON updating disk modes for group 1 at 10 for pid 35, osid 31879
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Wed May 29 19:21:43 2019
NOTE: cache dismounting (not clean) group 1/0x9E18E2F1 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 90256, image: oracle@ftz-db-o1 (B000)
Wed May 29 19:21:43 2019
NOTE: halting all I/Os to diskgroup 1 (DATA)
WARNING: Offline for disk DATA_0027 in mode 0x7f failed.
Wed May 29 19:21:43 2019
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=27.3207 last written ABA 27.3207
Wed May 29 19:21:43 2019
ERROR: ORA-15130 thrown in ARB0 for group number 1
Errors in file /oracle/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_96638.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15130: diskgroup "DATA" is being dismounted
Wed May 29 19:21:43 2019
NOTE: stopping process ARB0

后续继续mount data 磁盘组成功,但是立马又dismount

Wed May 29 18:37:25 2019
SUCCESS: ALTER DISKGROUP DATA ADD  DISK '/dev/oracleasm/disks/DATA_0028' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0027' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0026' SIZE 511993M ,
'/dev/oracleasm/disks/DATA_0025' SIZE 511993M /* ASMCA */
NOTE: starting rebalance of group 1/0x9e18e2f1 (DATA) at power 1
Wed May 29 18:37:26 2019
Starting background process ARB0
Wed May 29 18:37:26 2019
ARB0 started with pid=34, OS id=96638 
NOTE: assigning ARB0 to group 1/0x9e18e2f1 (DATA) with 1 parallel I/O
NOTE: Attempting voting file refresh on diskgroup DATA
NOTE: Refresh completed on diskgroup DATA. No voting file found.
cellip.ora not found.
Wed May 29 19:21:43 2019
WARNING: Read Failed. group:1 disk:27 AU:0 offset:360448 size:4096
WARNING: cache failed reading from group=1(DATA) dsk=27 blk=88 count=1 from disk= 27 
(DATA_0027) kfkist=0x20 status=0x02 osderr=0x0 file=kfc.c line=11596
ERROR: cache failed to read group=1(DATA) dsk=27 blk=88 from disk(s): 27(DATA_0027)
ORA-15080: synchronous I/O operation to a disk failed
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 704
Additional information: -1
NOTE: cache initiating offline of disk 27 group DATA
NOTE: process _user31879_+asm1 (31879) initiating offline of disk 27.3915911747 (DATA_0027) with mask 0x7e in group 1
NOTE: initiating PST update: grp = 1, dsk = 27/0xe9681243, mask = 0x6a, op = clear
Wed May 29 19:21:43 2019
GMON updating disk modes for group 1 at 10 for pid 35, osid 31879
ERROR: Disk 27 cannot be offlined, since diskgroup has external redundancy.
ERROR: too many offline disks in PST (grp 1)
Wed May 29 19:21:43 2019
NOTE: cache dismounting (not clean) group 1/0x9E18E2F1 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 90256, image: oracle@ftz-db-o1 (B000)
Wed May 29 19:21:43 2019
NOTE: halting all I/Os to diskgroup 1 (DATA)
WARNING: Offline for disk DATA_0027 in mode 0x7f failed.
Wed May 29 19:21:43 2019
NOTE: LGWR doing non-clean dismount of group 1 (DATA)
NOTE: LGWR sync ABA=27.3207 last written ABA 27.3207
Wed May 29 19:21:43 2019
ERROR: ORA-15130 thrown in ARB0 for group number 1
Errors in file /oracle/grid_base/diag/asm/+asm/+ASM1/trace/+ASM1_arb0_96638.trc:
ORA-15130: diskgroup "" is being dismounted
ORA-15130: diskgroup "DATA" is being dismounted
Wed May 29 19:21:43 2019
NOTE: stopping process ARB0

对于上述的故障现象,本质原因是由于asm 磁盘组增加新磁盘之后,开始做rebalance,但是由于遭遇到 27号盘上有IO读错误,使得asm磁盘组无法正常完成rebalance,因而data磁盘组无法稳定的mount。解决该问题思路,通过patch asm磁盘组,禁止rebalance,从而使得data磁盘组不再dismount,再进行后续恢复

发表在 Oracle ASM, 非常规恢复 | 标签为 , , | 2 条评论