标签云
asm恢复 bbed bootstrap$ dul kcbzib_kcrsds_1 kccpb_sanity_check_2 kcratr_nab_less_than_odr kgegpa MySQL恢复 ORA-00312 ORA-00704 ORA-00742 ORA-01110 ORA-01200 ORA-01555 ORA-01578 ORA-01595 ORA-600 2662 ORA-600 3020 ORA-600 4000 ORA-600 4137 ORA-600 4193 ORA-600 4194 ORA-600 16703 ORA-600 kcbzib_kcrsds_1 ORA-600 KCLCHKBLK_4 ORA-600 kdsgrp1 ORA-15042 ORA-15196 ORACLE 12C oracle dul ORACLE PATCH Oracle Recovery Tools oracle加密恢复 oracle勒索 oracle勒索恢复 oracle异常恢复 ORACLE恢复 Oracle 恢复 ORACLE数据库恢复 oracle 比特币 OSD-04016 YOUR FILES ARE ENCRYPTED 勒索恢复 比特币加密文章分类
- Others (2)
- 中间件 (2)
- WebLogic (2)
- 操作系统 (106)
- 数据库 (1,805)
- DB2 (22)
- MySQL (80)
- Oracle (1,640)
- Data Guard (53)
- EXADATA (8)
- GoldenGate (24)
- ORA-xxxxx (166)
- ORACLE 12C (72)
- ORACLE 18C (6)
- ORACLE 19C (15)
- ORACLE 21C (3)
- Oracle 23ai (8)
- Oracle ASM (69)
- Oracle Bug (8)
- Oracle RAC (54)
- Oracle 安全 (6)
- Oracle 开发 (28)
- Oracle 监听 (29)
- Oracle备份恢复 (613)
- Oracle安装升级 (102)
- Oracle性能优化 (62)
- 专题索引 (5)
- 勒索恢复 (86)
- PostgreSQL (33)
- pdu工具 (7)
- PostgreSQL恢复 (11)
- SQL Server (32)
- SQL Server恢复 (13)
- TimesTen (7)
- 达梦数据库 (3)
- 达梦恢复 (1)
- 生活娱乐 (2)
- 至理名言 (11)
- 虚拟化 (2)
- VMware (2)
- 软件开发 (43)
- Asp.Net (9)
- JavaScript (12)
- PHP (2)
- 小工具 (26)
-
最近发表
- open数据库报ora-600 kdsgrp1故障处理
- expdp dmp 导出不完整导入ORA-39059 ORA-39246 故障抢救数据
- mysql drop database 恢复思路
- PRVG-11975 : The I/O scheduler parameter of device “/dev/sdm” did not match the expected value on nodes
- obet(Oracle Block Editor Tool)第二版发布
- Oracle数据块编辑工具( Oracle Block Editor Tool)-obet
- Oracle坏块修复工具:Patch_blk
- ORA-01172 ORA-01151故障处理
- C_OBJ#_INTCOL#坏块导致数据库无法open故障处理
- ORA-600 kkkicreatecgmap:!efn3
- Oracle 19c 202510补丁(RUs+OJVM)-19.29
- 记录一次raid恢复之后数据库故障处理(ora-01200,ORA-26101,ORA-600)
- nbu备份文件img格式直接rman恢复
- ORA-600 kokasgi1故障处理(sys被重命名)
- Patch_SCN for Linux 功能完善
- ORA-600 2662错误处理-202510
- system表空间丢失部分文件恢复
- arm环境vg损坏mysql数据库恢复
- redhat系列7/8进入单用户模式
- Failed to open \EFI\redhat\grubx64.efi – Not Found 故障处理
标签归档:asmlib
分享oracleasm createdisk重新创建asm disk后数据0丢失恢复案例
有客户反馈他们重启系统之后,发现asmlib创建的asmdisk丢失了,然后又使用oracleasm deletedisk和createdisk重新创建的asm disk,最后发现asm diskgroup无法mount。让客户通过dd 备份5m数据,然后使用kfed分析
kefd分析结果
E:\OneDrive\ORACLE\recover\no_backup\asm\kfedwin>kfed read H:\temp\asmlib\xx.img kfbh.endian: 0 ; 0x000: 0x00 kfbh.hard: 0 ; 0x001: 0x00 kfbh.type: 0 ; 0x002: KFBTYP_INVALID kfbh.datfmt: 0 ; 0x003: 0x00 kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0 kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0 kfbh.check: 3760689243 ; 0x00c: 0xe027905b kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 E:\OneDrive\ORACLE\recover\no_backup\asm\kfedwin>kfed read H:\temp\asmlib\xx.img blkn=1 kfbh.endian: 0 ; 0x000: 0x00 kfbh.hard: 0 ; 0x001: 0x00 kfbh.type: 0 ; 0x002: KFBTYP_INVALID kfbh.datfmt: 0 ; 0x003: 0x00 kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0 kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0 kfbh.check: 0 ; 0x00c: 0x00000000 kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 E:\OneDrive\ORACLE\recover\no_backup\asm\kfedwin>kfed read H:\temp\asmlib\xx.img blkn=10 kfbh.endian: 0 ; 0x000: 0x00 kfbh.hard: 0 ; 0x001: 0x00 kfbh.type: 0 ; 0x002: KFBTYP_INVALID kfbh.datfmt: 0 ; 0x003: 0x00 kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0 kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0 kfbh.check: 0 ; 0x00c: 0x00000000 kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 E:\OneDrive\ORACLE\recover\no_backup\asm\kfedwin>kfed read H:\temp\asmlib\xx.img blkn=255 kfbh.endian: 0 ; 0x000: 0x00 kfbh.hard: 0 ; 0x001: 0x00 kfbh.type: 0 ; 0x002: KFBTYP_INVALID kfbh.datfmt: 0 ; 0x003: 0x00 kfbh.block.blk: 0 ; 0x004: T=0 NUMB=0x0 kfbh.block.obj: 0 ; 0x008: TYPE=0x0 NUMB=0x0 kfbh.check: 0 ; 0x00c: 0x00000000 kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 E:\OneDrive\ORACLE\recover\no_backup\asm\kfedwin>kfed read H:\temp\asmlib\xx.img blkn=256|more kfbh.endian: 1 ; 0x000: 0x01 kfbh.hard: 130 ; 0x001: 0x82 kfbh.type: 17 ; 0x002: KFBTYP_PST_META kfbh.datfmt: 2 ; 0x003: 0x02 kfbh.block.blk: 256 ; 0x004: T=0 NUMB=0x100 kfbh.block.obj: 2147483648 ; 0x008: TYPE=0x8 NUMB=0x0 kfbh.check: 3925268785 ; 0x00c: 0xe9f6d931 kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 kfdpHdrPairBv1.first.super.time.hi:32994098 ; 0x000: HOUR=0x12 DAYS=0x19 MNTH=0x c YEAR=0x7dd kfdpHdrPairBv1.first.super.time.lo:1614030848 ; 0x004: USEC=0x0 MSEC=0x10a SECS= 0x3 MINS=0x18 kfdpHdrPairBv1.first.super.last: 2 ; 0x008: 0x00000002 kfdpHdrPairBv1.first.super.next: 2 ; 0x00c: 0x00000002 kfdpHdrPairBv1.first.super.copyCnt: 1 ; 0x010: 0x01 kfdpHdrPairBv1.first.super.version: 1 ; 0x011: 0x01 kfdpHdrPairBv1.first.super.ub2spare: 0 ; 0x012: 0x0000 kfdpHdrPairBv1.first.super.incarn: 1 ; 0x014: 0x00000001 kfdpHdrPairBv1.first.super.copy[0]: 0 ; 0x018: 0x0000 kfdpHdrPairBv1.first.super.copy[1]: 0 ; 0x01a: 0x0000 kfdpHdrPairBv1.first.super.copy[2]: 0 ; 0x01c: 0x0000 ……
因为kfed默认每个block为4k,这里提示256是ok的,255是损坏的,从而推测出来,很可能oracleasm createdisk损坏了1M的数据。由于默认au是1m,而且数据库版本是11.2.0.3,而且第256个blkn开始没有损坏,因此初步判断可以考虑使用备份asm disk header来恢复磁盘头
检查还原磁盘头的asm disk
[grid@xifenfei1 disks]$ kfed read DATA1 kfbh.endian: 1 ; 0x000: 0x01 kfbh.hard: 130 ; 0x001: 0x82 kfbh.type: 1 ; 0x002: KFBTYP_DISKHEAD kfbh.datfmt: 1 ; 0x003: 0x01 kfbh.block.blk: 0 ; 0x004: blk=0 kfbh.block.obj: 2147483648 ; 0x008: disk=0 kfbh.check: 2776451033 ; 0x00c: 0xa57d47d9 kfbh.fcn.base: 0 ; 0x010: 0x00000000 kfbh.fcn.wrap: 0 ; 0x014: 0x00000000 kfbh.spare1: 0 ; 0x018: 0x00000000 kfbh.spare2: 0 ; 0x01c: 0x00000000 kfdhdb.driver.provstr: ORCLDISKDATA1 ; 0x000: length=13 kfdhdb.driver.reserved[0]: 1096040772 ; 0x008: 0x41544144 kfdhdb.driver.reserved[1]: 49 ; 0x00c: 0x00000031 kfdhdb.driver.reserved[2]: 0 ; 0x010: 0x00000000 kfdhdb.driver.reserved[3]: 0 ; 0x014: 0x00000000 kfdhdb.driver.reserved[4]: 0 ; 0x018: 0x00000000 kfdhdb.driver.reserved[5]: 0 ; 0x01c: 0x00000000 kfdhdb.compat: 186646528 ; 0x020: 0x0b200000 kfdhdb.dsknum: 0 ; 0x024: 0x0000 kfdhdb.grptyp: 1 ; 0x026: KFDGTP_EXTERNAL kfdhdb.hdrsts: 3 ; 0x027: KFDHDR_MEMBER kfdhdb.dskname: DATA_0000 ; 0x028: length=9 kfdhdb.grpname: DATA ; 0x048: length=4 kfdhdb.fgname: DATA_0000 ; 0x068: length=9 kfdhdb.capname: ; 0x088: length=0 kfdhdb.crestmp.hi: 32994099 ; 0x0a8: HOUR=0x13 DAYS=0x19 MNTH=0xc YEAR=0x7dd kfdhdb.crestmp.lo: 2797442048 ; 0x0ac: USEC=0x0 MSEC=0x365 SECS=0x2b MINS=0x29 kfdhdb.mntstmp.hi: 33022061 ; 0x0b0: HOUR=0xd DAYS=0x3 MNTH=0x8 YEAR=0x7df kfdhdb.mntstmp.lo: 816879616 ; 0x0b4: USEC=0x0 MSEC=0x26 SECS=0xb MINS=0xc kfdhdb.secsize: 512 ; 0x0b8: 0x0200 kfdhdb.blksize: 4096 ; 0x0ba: 0x1000 …………
证明磁盘头确实被比较完美的修复了,现在的任务是尝试mount磁盘组
mount磁盘组
[grid@xifenfei1 ~]$ sqlplus / as sysasm SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 6 20:54:53 2015 Copyright (c) 1982, 2011, Oracle. All rights reserved. Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Real Application Clusters and Automatic Storage Management options SQL> alter diskgroup data mount; Diskgroup altered. SQL> exit Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Real Application Clusters and Automatic Storage Management options
asm diskgroup已经正常mount,使用asmcmd命令检查文件是否正常
分析磁盘组数据是否正常
[grid@xifenfei1 ~]$ asmcmd ASMCMD> lsdg State Type Rebal Sector Block AU Total_MB Free_MB Req_mir_free_MB Usable_file_MB Offline_disks Voting_files Name MOUNTED EXTERN N 512 4096 1048576 1622060 636493 0 636493 0 N DATA/ ASMCMD> cd data ASMCMD> ls ORCL/ ASMCMD> cd orcl ASMCMD> ls CONTROLFILE/ DATAFILE/ ONLINELOG/ PARAMETERFILE/ TEMPFILE/ spfileorcl.ora ASMCMD> cd datafile ASMCMD> ls XIFENFEI20130801.314.835191517 XIFENFEI20140101.321.835191571 XIFENFEI20140201.322.835191573 XIFENFEI20140301.323.835191573 ………… SYSAUX.270.835182535 SYSAUX.838.874669369 SYSTEM.271.835182533 SYSTEM.823.873555791 SYSTEM.945.883146947 …………
这里看到磁盘组里面的数据文件都正常,使用同样的方法,继续mount其他磁盘组。
尝试启动数据库
SQL> startup ORACLE 例程已经启动。 Total System Global Area 5010685952 bytes Fixed Size 2236968 bytes Variable Size 2013269464 bytes Database Buffers 2986344448 bytes Redo Buffers 8835072 bytes 数据库装载完毕。 ORA-16038: 日志 14 sequence# 21145 无法归档 ORA-19504: 无法创建文件"" ORA-00312: 联机日志 14 线程 2: '+DATA/orcl/onlinelog/group_14.284.835184569' ORA-00312: 联机日志 14 线程 2: '+ARCH/orcl/onlinelog/group_14.287.835184569'
查看数据库alert日志
ARC1: Archival started ARC2: Archival started ARC2: Becoming the 'no FAL' ARCH ARC2: Becoming the 'no SRL' ARCH ARC1: Becoming the heartbeat ARCH ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Thu Aug 06 21:04:06 2015 Thread 2 advanced to log sequence 21146 (thread recovery) Picked broadcast on commit scheme to generate SCNs Thread 2 advanced to log sequence 21147 (before internal thread enable) Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_27402.trc: ORA-19816: 警告: 文件可能存在于数据库未知的 db_recovery_file_dest 中。 ORA-17502: ksfdcre: 4 未能创建文件 +ARCH ORA-15196: invalid ASM block header [kfc.c:19572] [check_kfbh] [1] [47962] [1344818371 != 630731762] ORA-15130: diskgroup "ARCH" is being dismounted ORA-15066: offlining disk "ARCH_0000" in group "ARCH" may result in a data loss ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] ARCH: Error 19504 Creating archive log file to '+ARCH' NOTE: Deferred communication with ASM instance Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_27402.trc: ORA-15130: diskgroup "ARCH" is being dismounted NOTE: deferred map free for map id 754 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_27402.trc: ORA-16038: 日志 14 sequence# 21145 无法归档 ORA-19504: 无法创建文件"" ORA-00312: 联机日志 14 线程 2: '+DATA/orcl/onlinelog/group_14.284.835184569' ORA-00312: 联机日志 14 线程 2: '+ARCH/orcl/onlinelog/group_14.287.835184569' ORA-16038 signalled during: ALTER DATABASE OPEN... Thu Aug 06 21:04:10 2015 SUCCESS: diskgroup ARCH was dismounted SUCCESS: diskgroup ARCH was dismounted Thu Aug 06 21:04:10 2015 Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ckpt_27353.trc: ORA-00206: error in writing (block 3, # blocks 1) of control file ORA-00202: control file: '+ARCH/orcl/controlfile/current.256.835182531' ORA-15078: ASM diskgroup was forcibly dismounted ORA-15078: ASM diskgroup was forcibly dismounted Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ckpt_27353.trc: ORA-00221: error on write to control file ORA-00206: error in writing (block 3, # blocks 1) of control file ORA-00202: control file: '+ARCH/orcl/controlfile/current.256.835182531' ORA-15078: ASM diskgroup was forcibly dismounted ORA-15078: ASM diskgroup was forcibly dismounted Thu Aug 06 21:04:10 2015 System state dump requested by (instance=1, osid=27353 (CKPT)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_27318.trc CKPT (ospid: 27353): terminating the instance due to error 221 Instance terminated by CKPT, pid = 27353
查看asm alert日志
Thu Aug 06 21:04:07 2015 WARNING: cache read a corrupt block: group=2(ARCH) dsk=0 blk=1 disk=0 (ARCH_0000) incarn=3942486752 au=0 blk=1 count=1 Errors in file /u01/app/11.2.0/grid/log/diag/asm/+asm/+ASM1/trace/+ASM1_ora_27462.trc: ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] NOTE: a corrupted block from group ARCH was dumped to /u01/app/11.2.0/grid/log/diag/asm/+asm/+ASM1/trace/+ASM1_ora_27462.trc WARNING: cache read (retry) a corrupt block: group=2(ARCH) dsk=0 blk=1 disk=0 (ARCH_0000) incarn=3942486752 au=0 blk=1 count=1 Errors in file /u01/app/11.2.0/grid/log/diag/asm/+asm/+ASM1/trace/+ASM1_ora_27462.trc: ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] ERROR: cache failed to read group=2(ARCH) dsk=0 blk=1 from disk(s): 0(ARCH_0000) ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] ORA-15196: invalid ASM block header [kfc.c:26076] [endian_kfbh] [2147483648] [1] [0 != 1] NOTE: cache initiating offline of disk 0 group ARCH NOTE: process _user27462_+asm1 (27462) initiating offline of disk 0.3942486752 (ARCH_0000) with mask 0x7e in group 2 WARNING: Disk 0 (ARCH_0000) in group 2 in mode 0x7f is now being taken offline on ASM inst 1 NOTE: initiating PST update: grp = 2, dsk = 0/0xeafd92e0, mask = 0x6a, op = clear Thu Aug 06 21:04:07 2015 GMON updating disk modes for group 2 at 17 for pid 35, osid 27462 ERROR: Disk 0 cannot be offlined, since diskgroup has external redundancy. ERROR: too many offline disks in PST (grp 2) Thu Aug 06 21:04:07 2015 NOTE: cache dismounting (not clean) group 2/0x723D6245 (ARCH) NOTE: messaging CKPT to quiesce pins Unix process pid: 27089, image: oracle@xifenfei1 (B000) WARNING: Offline of disk 0 (ARCH_0000) in group 2 and mode 0x7f failed on ASM inst 1 Thu Aug 06 21:04:07 2015 NOTE: halting all I/Os to diskgroup 2 (ARCH) System State dumped to trace file /u01/app/11.2.0/grid/log/diag/asm/+asm/+ASM1/trace/+ASM1_ora_27462.trc NOTE: AMDU dump of disk group ARCH created at /u01/app/11.2.0/grid/log/diag/asm/+asm/+ASM1/trace Thu Aug 06 21:04:09 2015 NOTE: LGWR doing non-clean dismount of group 2 (ARCH) NOTE: LGWR sync ABA=126.806 last written ABA 126.806
这里可以看出来,报错的block为arch磁盘组的第一个磁盘的第一个au的第二个block,而我们在开始的时候,已经分析了asm disk的第一个au完全损坏,并且我们使用了备份磁盘头进行来还原,勉强可以让磁盘组mount起来,但是由于数据库在启动的时候,需要对redo进行归档,而归档的过程需要写到arch磁盘组里面,这个时候需要访问到au=0 blk=1,而这个块本身是坏的,因此这个时候该块盘的disk就被offline掉了,而这个磁盘组是外部冗余的,因此磁盘组dismount了,所以数据库无法启动.
分析第一个au里面到底有哪些东西
SQL> select DISK_NUMBER,path from v$asm_disk;
DISK_NUMBER PATH
----------- --------------------------------------------------
0 /dev/raw/raw1
2 /dev/raw/raw3
1 /dev/raw/raw2
[oracle@xifenfei raw]$ kfed read raw1 blkn=1|grep kfbh.type
kfbh.type: 2 ; 0x002: KFBTYP_FREESPC
[oracle@xifenfei raw]$ kfed read raw1 blkn=2|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw1 blkn=3|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw1 blkn=255|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw2 blkn=1|grep kfbh.type
kfbh.type: 2 ; 0x002: KFBTYP_FREESPC
[oracle@xifenfei raw]$ kfed read raw2 blkn=2|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw2 blkn=255|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw3 blkn=1|grep kfbh.type
kfbh.type: 2 ; 0x002: KFBTYP_FREESPC
[oracle@xifenfei raw]$ kfed read raw3 blkn=2|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
[oracle@xifenfei raw]$ kfed read raw3 blkn=255|grep kfbh.type
kfbh.type: 3 ; 0x002: KFBTYP_ALLOCTBL
通过一个测试机器的一个磁盘组进行分析,我们可以基本上确定asm 第一个au除了asm disk header的KFBTYP_DISKHEAD之外,其他主要是KFBTYP_FREESPC(Free Space Table)和KFBTYP_ALLOCTBL(allocator table),主要就是记录asm中au的分配情况,也就是进一步说明,如果我不对asm里面的数据使用更多的au分配或者回收au,在缺少第一个au的1-255个block的信息情况下,asm的磁盘组也不会dismount。根据这个思路,让数据库归档到本地,然后继续测试
继续open数据库
SQL> startup ORACLE 例程已经启动。 Total System Global Area 5010685952 bytes Fixed Size 2236968 bytes Variable Size 2013269464 bytes Database Buffers 2986344448 bytes Redo Buffers 8835072 bytes 数据库装载完毕。 SQL> alter database open; 数据库已更改。 LGWR: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES Fri Aug 07 02:43:13 2015 ARC1 started with pid=34, OS id=22778 Fri Aug 07 02:43:13 2015 ARC2 started with pid=35, OS id=22780 Fri Aug 07 02:43:13 2015 ARC3 started with pid=36, OS id=22782 ARC1: Archival started ARC2: Archival started ARC2: Becoming the 'no FAL' ARCH ARC2: Becoming the 'no SRL' ARCH ARC1: Becoming the heartbeat ARCH ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Fri Aug 07 02:43:24 2015 Thread 1 opened at log sequence 18604 Current log# 10 seq# 18604 mem# 0: /tmp/xifenfei/otherfile/group_10.273.835182533 Current log# 10 seq# 18604 mem# 1: /tmp/xifenfei/otherfile/group_10.263.835182533 Successful open of redo thread 1 Fri Aug 07 02:43:24 2015 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Fri Aug 07 02:43:25 2015 SMON: enabling cache recovery Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Fri Aug 07 02:43:26 2015 minact-scn: Inst 1 is now the master inc#:2 mmon proc-id:21328 status:0x7 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 Fri Aug 07 02:43:26 2015 Redo thread 2 internally disabled at seq 21147 (CKPT) [21341] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:96999124 end:97000624 diff:1500 (15 seconds) Verifying file header compatibility for 11g tablespace encryption.. Verifying 11g file header compatibility for tablespace encryption completed SMON: enabling tx recovery Database Characterset is ZHS16GBK No Resource Manager plan active Starting background process GTX0 Fri Aug 07 02:43:31 2015 GTX0 started with pid=37, OS id=22803 Starting background process RCBG Fri Aug 07 02:43:31 2015 RCBG started with pid=38, OS id=22805 replication_dependency_tracking turned off (no async multimaster replication found) Fri Aug 07 02:43:34 2015 Archived Log entry 73876 added for thread 2 sequence 21145 ID 0x513c613f dest 1: <---果然有归档操作发生 Starting background process QMNC Fri Aug 07 02:43:34 2015 QMNC started with pid=39, OS id=22812 Fri Aug 07 02:43:35 2015 Archived Log entry 73877 added for thread 2 sequence 21146 ID 0x513c613f dest 1: Fri Aug 07 02:43:35 2015 ARC0: Archiving disabled thread 2 sequence 21147 Archived Log entry 73878 added for thread 2 sequence 21147 ID 0x513c613f dest 1: Fri Aug 07 02:43:37 2015 Completed: alter database open
现在到了这一步,基本上可以确定,数据库是零丢失恢复。由于asm 第一个au丢失数据严重,想要彻底修复比较难,考虑把数据库启动到mount/read only状态然后使用rman备份数据,然后进行重建asm 磁盘组,再迁移回来。至此完美恢复asmlib的磁盘被oracleasm重写的故障恢复,实现数据0丢失.当然在整个恢复过程没有于此的简单,涉及到在votedisk损坏的情况下,如何mount磁盘组,vote diskgroup的损坏修复问题,磁盘组在10g/11.1和11.2还原磁盘头备份的问题等问题.
虽然本次的恢复案例中,由于asmlib的asm disk不可见就轻易使用oracleasm createdisk命令对磁盘进行了重建,犯了一个很大错误,但是在重建之后,发现磁盘组依旧异常,未继续操作(比如重建磁盘组等),为最后的数据完全恢复创造了必要条件,使得客户的没有任何数据损失。如果再对除磁盘组继续复写操作,可能会导致数据永久性丢失。这个教训告诉我们:遇到自己不能把握的事情,及时终止,不要让错误越走越远
asmlib异常报ORA-00600[kfklLibFetchNext00]
一个朋友的历史库出现故障,在linux 4的平台上asm的10.2.0.1的单库,asm使用asmlib来处理。
asm不能正常mount磁盘组,可以看到asmdisk,alert日志报ORA-00600[kfklLibFetchNext00]
操作系统内核是:2.6.9-78
oracleasmlib是:2.0.2-1
asm磁盘组mount失败
--以前故障 SQL> ALTER DISKGROUP ALL MOUNT Thu Sep 6 14:23:16 2012 NOTE: cache registered group DGARC number=1 incarn=0x2bf96274 NOTE: cache registered group DGDATA number=2 incarn=0x2c196275 NOTE: cache registered group DGSYS number=3 incarn=0x2c196276 Thu Sep 6 14:23:16 2012 Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc: ORA-15183: ASMLIB initialization error [driver/agent not installed] Thu Sep 6 14:23:16 2012 Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_10204.trc: ORA-15183: ASMLIB initialization error [/opt/oracle/extapi/64/asm/orcl/1/libasm.so] ORA-15183: ASMLIB initialization error [driver/agent not installed] Thu Sep 6 14:23:16 2012 ERROR: no PST quorum in group 1: required 2, found 0 Thu Sep 6 14:23:16 2012 NOTE: cache dismounting group 1/0x2BF96274 (DGARC) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGARC was not mounted Thu Sep 6 14:23:16 2012 ERROR: no PST quorum in group 2: required 2, found 0 Thu Sep 6 14:23:16 2012 NOTE: cache dismounting group 2/0x2C196275 (DGDATA) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGDATA was not mounted Thu Sep 6 14:23:16 2012 ERROR: no PST quorum in group 3: required 2, found 0 Thu Sep 6 14:23:16 2012 NOTE: cache dismounting group 3/0x2C196276 (DGSYS) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGSYS was not mounted --现在故障 Thu Jan 24 13:49:45 2013 SQL> ALTER DISKGROUP ALL MOUNT Thu Jan 24 13:49:45 2013 NOTE: cache registered group DGARC number=1 incarn=0xf388cee9 NOTE: cache registered group DGDATA number=2 incarn=0xf3a8ceea NOTE: cache registered group DGSYS number=3 incarn=0xf3a8ceeb Thu Jan 24 13:49:45 2013 Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc: ORA-00600: internal error code, arguments: [kfklLibFetchNext00], [18446744073709551614], [0], [], [], [], [], [] Thu Jan 24 13:49:46 2013 Errors in file /opt/app/oracle/admin/+ASM/bdump/+asm_rbal_13449.trc: ORA-00600: internal error code, arguments: [kfklLibFetchNext00], [18446744073709551614], [0], [], [], [], [], [] Thu Jan 24 13:49:46 2013 ERROR: no PST quorum in group 1: required 2, found 0 Thu Jan 24 13:49:46 2013 NOTE: cache dismounting group 1/0xF388CEE9 (DGARC) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGARC was not mounted Thu Jan 24 13:49:46 2013 ERROR: no PST quorum in group 2: required 2, found 0 Thu Jan 24 13:49:46 2013 NOTE: cache dismounting group 2/0xF3A8CEEA (DGDATA) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGDATA was not mounted Thu Jan 24 13:49:46 2013 ERROR: no PST quorum in group 3: required 2, found 0 Thu Jan 24 13:49:46 2013 NOTE: cache dismounting group 3/0xF3A8CEEB (DGSYS) NOTE: dbwr not being msg'd to dismount ERROR: diskgroup DGSYS was not mounted Shutting down instance: further logons disabled
trace文件信息
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+31 call ksedst1() 000000000 ? 000000001 ?
000000000 ? 000000000 ?
000000000 ? 000000001 ?
ksedmp()+610 call ksedst() 000000000 ? 000000001 ?
000000000 ? 000000000 ?
000000000 ? 000000001 ?
ksfdmp()+21 call ksedmp() 000000003 ? 000000001 ?
000000000 ? 000000000 ?
000000000 ? 000000001 ?
kgerinv()+161 call ksfdmp() 000000003 ? 000000001 ?
000000000 ? 000000000 ?
000000000 ? 000000001 ?
kgesinv()+33 call kgerinv() 006469D40 ? 0064E1C58 ?
000000000 ? 000000000 ?
000000001 ? 000000001 ?
kgesinw()+166 call kgesinv() 006469D40 ? 0064E1C58 ?
000000000 ? 000000000 ?
000000001 ? 000000001 ?
kfklLibScanNext()+2 call kgesinw() 006469D40 ? 000000000 ?
39 000000001 ? 000000000 ?
FFFFFFFFFFFFFFFE ?
000000000 ?
kfkLibFetchNext()+3 call kfklLibScanNext() 0064DDD70 ? 7FBFFFDCD0 ?
43 000000001 ? 000000000 ?
FFFFFFFFFFFFFFFE ?
000000000 ?
kfuitrnInit()+524 call kfkLibFetchNext() 006469D40 ? 2A971DFF90 ?
000000001 ? 000000000 ?
FFFFFFFFFFFFFFFE ?
000000000 ?
kfkLibIterInit()+18 call kfuitrnInit() 006469D40 ? 2A971DFCB0 ?
0 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
kfkLoadAllLibs()+36 call kfkLibIterInit() 000000000 ? 00646C7E0 ?
3 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
kfkDiscoverString() call kfkLoadAllLibs() 000000000 ? 00646C7E0 ?
+107 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
Cannot find symbol
Cannot find symbol
Cannot find symbol
kfdDiscoverString() call kfkDiscoverString() 067A53768 ? 00646C7E0 ?
+28 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
kfdDiscoverShallow( call kfdDiscoverString() 067A53768 ? 000000000 ?
)+315 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
kfgbDriver()+1174 call kfdDiscoverShallow( 000000180 ? 000000000 ?
) 2A971DFF90 ? 000000009 ?
000000009 ? 000000000 ?
ksbabs()+564 call kfgbDriver() 7FBFFFE5C0 ? 000000048 ?
000000000 ? 000000009 ?
000000009 ? 000000000 ?
ksbrdp()+727 call ksbabs() 7FBFFFE5C0 ? 000000048 ?
000000000 ? 000000009 ?
000000009 ? 000000000 ?
opirip()+616 call ksbrdp() 7FBFFFE5C0 ? 000000048 ?
000000001 ? 06002C770 ?
000000009 ? 000000000 ?
opidrv()+582 call opirip() 000000032 ? 000000004 ?
7FBFFFF6C8 ? 06002C770 ?
000000009 ? 000000000 ?
sou2o()+114 call opidrv() 000000032 ? 000000004 ?
7FBFFFF6C8 ? 06002C770 ?
000000009 ? 000000000 ?
opimai_real()+317 call sou2o() 7FBFFFF6A0 ? 000000032 ?
000000004 ? 7FBFFFF6C8 ?
000000009 ? 000000000 ?
main()+116 call opimai_real() 000000003 ? 7FBFFFF730 ?
000000004 ? 7FBFFFF6C8 ?
000000009 ? 000000000 ?
<0x3c9fb1c40b> call main() 000000003 ? 7FBFFFF730 ?
000000004 ? 7FBFFFF6C8 ?
000000009 ? 000000000 ?
--------------------- Binary Stack Dump ---------------------
因为客户的库是一个历史库,基本上不怎么使用,在2012年启动asm就出现了ORA-15183错误,然后在2013年重启机器后,再次启动asm就出现了ORA-00600[kfklLibFetchNext00]错误,通过2012年的错误提示,我们大概可以判断出来该问题和ASMLIB有关系,查询mos发现429945.1,发现Call Stack Trace完全一致,可以定位是该问题(如果想深入分析,可以通过strace继续分析)
ORA-600: [kfklLibFetchNext00], [18446744073709551614], [0] when mounting diskgroup in ASM
Applies to:
Linux OS - Version: 2.0.1-1 and later [Release: RHEL4 and later ]
Information in this document applies to any platform.
Linux Kernel - Version: 2.0.1
Symptoms
3 RAC db.
2 nodes are up and functioning except for 1 node - ASM did not come back up after
the reboot eventhough all disks show available from asmlib's perspective:
Changes
All that was done with resources were stopped on Node1 and an extra LUN added.
A reboot was then performed.
Cause
The cause of the issue is libasm.o corruption
Ran the following to confirm that disks are ok:
/dev/oracleasm listdisks
/usr/sbin/asmtool -I -l /dev/oracleasm -n /dev/sdg1 -a label
/usr/sbin/oracleasm-discover 'ORCL:*'
dd if=/dev/sdg1 bs=8192 count=1 | od -c
==> output checked out fine
.
kfod asm_diskstring='ORCL:*'
==> this failed on Node1
KFOD-00600: file not found; argument [610][kfklLibFetchNext00] even though libasm.o exists
You might see the following call stack as well
----- Call Stack Trace -----
kfklLibScanNext
kfkLibFetchNext
kfuitrnInit
kfkLibIterInit
kfkLoadAllLibs
kfkDiscoverString
kfdDiscoverString
kfdDiscoverShallow
kfgbDriver
strace showed
Node1-failing
-------
stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0777, st_size=19344, ...}) = 0
getdents64(4, /* 0 entries */, 4096) = 0 <<<<
close(4) = 0
open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1
ENOENT (No such file or directory)
open("/opt/oracle/product/10.2.0/db_1/rdbms/mesg/kfodus.msb", O_RDONLY) = -1
ENOENT (No such file or directory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2a9750d000
write(1, "KFOD-00600: file not found; argu"..., 69) = 69
Node2-working
-----
stat("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
open("/opt/oracle/extapi/64/asm/orcl/1/libasm.so", O_RDONLY) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\23\0"..., 832) = 832
fstat(4, {st_mode=S_IFREG|0755, st_size=19344, ...}) = 0
mmap(NULL, 1066104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) 0x2a9750d000
通过MOS的描述,可以明确定位到问题是:libasm.o异常导致
解决方案
To implement the solution, reinstall the ASMlib RPM >rpm -Uvh oracleasmlib-2.0.0-1 This replaces the /opt/oracle/extapi/64/asm/orcl/1/libasm.so

加我QQ(107644445)

