网卡异常导致数据库实例启动异常

一套集群,一个节点启动正常,另外一个节点无法正常启动实例,启动异常节点alert日志

Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Receiver ospid 6386 [
Tue Mar 07 19:07:29 2023
Errors in file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_lms0_6386.trc:
IPC Send timeout detected. Receiver ospid 6402 [
Tue Mar 07 19:07:29 2023
Errors in file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_lms4_6402.trc:
Tue Mar 07 19:07:29 2023
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
System state dump requested by (instance=2, osid=6384 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/xff/xff2/trace/xff2_diag_6374_20230307190729.trc
LMD0 (ospid: 6384): terminating the instance due to error 481
Dumping diagnostic data in directory=[cdmp_20230307190729],
      requested by (instance=2, osid=6384 (LMD0)), summary=[abnormal instance termination].
Instance terminated by LMD0, pid = 6384

正常节点alert日志

Tue Mar 07 19:02:07 2023
Reconfiguration started (old inc 20, new inc 22)
List of instances:
 1 2 (myinst: 1)
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Tue Mar 07 19:02:08 2023
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
 LMS 7: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Tue Mar 07 19:02:08 2023
Tue Mar 07 19:02:08 2023
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Tue Mar 07 19:02:27 2023
IPC Send timeout detected. Sender: ospid 6936 [oracle@xffnode1.localdomain (PING)]
Receiver: inst 2 binc 441249706 ospid 59731
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6946 [oracle@xffnode1.localdomain (LMS0)]
Receiver: inst 2 binc 429479852 ospid 6386
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6962 [oracle@xffnode1.localdomain (LMS4)]
Receiver: inst 2 binc 429479854 ospid 6402
Tue Mar 07 19:07:29 2023
IPC Send timeout detected. Sender: ospid 6966 [oracle@xffnode1.localdomain (LMS5)]

通过上述日志,可以确认主要由于两个节点之间无法正常通讯,从而使得新节点无法加入到集群(无法完成集群重组),从而使得实例启动异常.一般出现这类情况最检查的就是私网异常,通过分析oswnetstat记录发现packet reassembles failed特别严重
20230307230341


一般出现该问题,考虑是由于ipfrag_*_thresh默认值不足导致,通过设置

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640

临时请库成功,但是数据库实例重组时间依旧过长
20230307230658


packet reassembles failed依旧在增加,通过分析网卡情况发现网卡异常,采用haip(双万兆网卡)的其中一块网卡异常
20230307230813

为了数据库性能不收太大影响,临时禁用异常网卡,重启库正常
20230308001033

后续等网络层面解决之后再启用该网卡

发表在 Oracle RAC | 评论关闭

最新版oracle dul工具

oracle官方dul工具继续更新,现在已经更新到12.2.0.2.5版本,可以支持oracle 6及其以上的所有版本,是oracle数据库在极端情况下恢复利器

[oracle@xifenfei dul]$ ./dul

Data UnLoader: 12.2.0.2.5 - Internal Only - on Sun Mar  5 15:12:11 2023
with 64-bit io functions and the decompression option

Copyright (c) 1994 2023 Bernard van Duijnen All rights reserved.

 Strictly Oracle Internal Use Only


DUL: Warning: Could not open parameter file <init.dul>
DUL: Warning: Compatible is set to 11 Values can be 6|7|8|9|10|11|12|17|18
DUL: Warning: no parameter file means no logfile
DUL> 

配置init.dul文件之后

[oracle@iZbp1hx0enix3hix1kvyrxZ dul]$ ./dul

Data UnLoader: 12.2.0.2.5 - Internal Only - on Sun Mar  5 15:22:26 2023
with 64-bit io functions and the decompression option

Copyright (c) 1994 2023 Bernard van Duijnen All rights reserved.

 Strictly Oracle Internal Use Only


Found db_id = 1588579327
Found db_name = ORCL
DUL> show datafiles;
ts# rf# start   blocks offs open  err file name
  0   1     0    97281    0    1    0 /u01/app/oracle/oradata/orcl/system01.dbf
  1   2     0   387841    0    1    0 /u01/app/oracle/oradata/orcl/sysaux01.dbf
  2   3     0    37761    0    1    0 /u01/app/oracle/oradata/orcl/undotbs01.dbf
  4   4     0     5761    0    1    0 /u01/app/oracle/oradata/orcl/users01.dbf
  7   5     0    16385    0    1    0 /u01/app/oracle/oradata/orcl/t_xifenfei01.dbf
DUL> bootstrap;
Probing file = 1, block = 520
. unloading table                BOOTSTRAP$
DUL: Warning: block number is non zero but marked deferred trying to process it anyhow
      59 rows unloaded
Reading BOOTSTRAP.dat 59 entries loaded
Parsing Bootstrap$ contents
Generating dict.ddl for version 11
 OBJ$: segobjno 18, file 1 block 240
 TAB$: segobjno 2, tabno 1, file 1  block 144
 COL$: segobjno 2, tabno 5, file 1  block 144
 USER$: segobjno 10, tabno 1, file 1  block 208
Running generated file "@dict.ddl" to unload the dictionary tables
. unloading table                      OBJ$   86411 rows unloaded
. unloading table                      TAB$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
    2904 rows unloaded
. unloading table                      COL$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
   94714 rows unloaded
. unloading table                     USER$      88 rows unloaded
Reading USER.dat 88 entries loaded
Reading OBJ.dat 86411 entries loaded and sorted 86411 entries
Reading TAB.dat 2904 entries loaded
Reading COL.dat 94714 entries loaded and sorted 94714 entries
Reading BOOTSTRAP.dat 59 entries loaded

DUL: Warning: Recreating file "dict.ddl"
Generating dict.ddl for version 11
 OBJ$: segobjno 18, file 1 block 240
 TAB$: segobjno 2, tabno 1, file 1  block 144
 COL$: segobjno 2, tabno 5, file 1  block 144
 USER$: segobjno 10, tabno 1, file 1  block 208
 TABPART$: segobjno 591, file 1 block 4000
 INDPART$: segobjno 596, file 1 block 4040
 TABCOMPART$: segobjno 613, file 1 block 4176
 INDCOMPART$: segobjno 618, file 1 block 4216
 TABSUBPART$: segobjno 603, file 1 block 4096
 INDSUBPART$: segobjno 608, file 1 block 4136
 IND$: segobjno 2, tabno 3, file 1  block 144
 ICOL$: segobjno 2, tabno 4, file 1  block 144
 LOB$: segobjno 2, tabno 6, file 1  block 144
 COLTYPE$: segobjno 2, tabno 7, file 1  block 144
 TYPE$: segobjno 518, tabno 1, file 1  block 3464
 COLLECTION$: segobjno 518, tabno 2, file 1  block 3464
 ATTRIBUTE$: segobjno 518, tabno 3, file 1  block 3464
 LOBFRAG$: segobjno 624, file 1 block 4264
 LOBCOMPPART$: segobjno 627, file 1 block 4288
 UNDO$: segobjno 15, file 1 block 224
 TS$: segobjno 6, tabno 2, file 1  block 176
 PROPS$: segobjno 98, file 1 block 800
Running generated file "@dict.ddl" to unload the dictionary tables
. unloading table                      OBJ$
DUL: Warning: Recreating file "OBJ.ctl"
   86411 rows unloaded
. unloading table                      TAB$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335

DUL: Warning: Recreating file "TAB.ctl"
    2904 rows unloaded
. unloading table                      COL$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335

DUL: Warning: Recreating file "COL.ctl"
   94714 rows unloaded
. unloading table                     USER$
DUL: Warning: Recreating file "USER.ctl"
      88 rows unloaded
. unloading table                  TABPART$     143 rows unloaded
. unloading table                  INDPART$     124 rows unloaded
. unloading table               TABCOMPART$       1 row  unloaded
. unloading table               INDCOMPART$       0 rows unloaded
. unloading table               TABSUBPART$      32 rows unloaded
. unloading table               INDSUBPART$       0 rows unloaded
. unloading table                      IND$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
    4931 rows unloaded
. unloading table                     ICOL$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
    7644 rows unloaded
. unloading table                      LOB$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
    1031 rows unloaded
. unloading table                  COLTYPE$
DUL: Warning: Block has been marked soft corrupt
DUL: Error: While processing ts# 0 file# 1 block# 13335
    2565 rows unloaded
. unloading table                     TYPE$    2909 rows unloaded
. unloading table               COLLECTION$    1002 rows unloaded
. unloading table                ATTRIBUTE$   11328 rows unloaded
. unloading table                  LOBFRAG$       1 row  unloaded
. unloading table              LOBCOMPPART$       0 rows unloaded
. unloading table                     UNDO$      21 rows unloaded
. unloading table                       TS$       8 rows unloaded
. unloading table                    PROPS$      36 rows unloaded
Reading USER.dat 88 entries loaded
Reading OBJ.dat 86411 entries loaded and sorted 86411 entries
Reading TAB.dat 2904 entries loaded
Reading COL.dat 94714 entries loaded and sorted 94714 entries
Reading TABPART.dat 143 entries loaded and sorted 143 entries
Reading TABCOMPART.dat 1 entries loaded and sorted 1 entries
Reading TABSUBPART.dat 32 entries loaded and sorted 32 entries
Reading INDPART.dat 124 entries loaded and sorted 124 entries
Reading INDCOMPART.dat 0 entries loaded and sorted 0 entries
Reading INDSUBPART.dat 0 entries loaded and sorted 0 entries
Reading IND.dat 4931 entries loaded
Reading LOB.dat
DUL: Notice: Increased the size of DC_LOBS from 1024 to 8192 entries
 1031 entries loaded
Reading ICOL.dat 7644 entries loaded
Reading COLTYPE.dat 2565 entries loaded
Reading TYPE.dat 2909 entries loaded
Reading ATTRIBUTE.dat 11328 entries loaded
Reading COLLECTION.dat 1002 entries loaded
Reading BOOTSTRAP.dat 59 entries loaded
Reading LOBFRAG.dat 1 entries loaded and sorted 1 entries
Reading LOBCOMPPART.dat 0 entries loaded and sorted 0 entries
Reading UNDO.dat 21 entries loaded
Reading TS.dat 8 entries loaded
Reading PROPS.dat 36 entries loaded
Database character set is ZHS16GBK
Database national character set is AL16UTF16
DUL> 
发表在 Oracle | 标签为 , , | 评论关闭

误删除asm disk导致磁盘组无法mount数据库恢复

客户误删除asm disk两个lun(由于这个是这个存储的特殊性,删除lun之后,存储层面无法恢复出来对应的lun数据,导致客户彻底放弃了硬件层面恢复的可能性.),导致asm磁盘组无法正常mount

SQL> ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {1:27928:40938} */ 
NOTE: cache registered group DATA number=3 incarn=0x60fa38b1
NOTE: cache began mount (first) of group DATA number=3 incarn=0x60fa38b1
NOTE: Assigning number (3,0) to disk (/dev/rdisk/VD02_DBF)
NOTE: Assigning number (3,1) to disk (/dev/rdisk/VD03_DBF)
NOTE: Assigning number (3,2) to disk (/dev/rdisk/VD04_DBF)
NOTE: Assigning number (3,3) to disk (/dev/rdisk/VD05_DBF)
NOTE: Assigning number (3,4) to disk (/dev/rdisk/VD06_DBF)
Thu Dec 29 10:21:20 2022
NOTE: GMON heartbeating for grp 3
GMON querying group 3 at 29 for pid 29, osid 3770
NOTE: Assigning number (3,5) to disk ()
NOTE: Assigning number (3,6) to disk ()
GMON querying group 3 at 30 for pid 29, osid 3770
NOTE: cache dismounting (clean) group 3/0x60FA38B1 (DATA) 
NOTE: messaging CKPT to quiesce pins Unix process pid: 3770, image: oracle@db1 (TNS V1-V3)
NOTE: dbwr not being msg'd to dismount
NOTE: lgwr not being msg'd to dismount
NOTE: cache dismounted group 3/0x60FA38B1 (DATA) 
NOTE: cache ending mount (fail) of group DATA number=3 incarn=0x60fa38b1
NOTE: cache deleting context for group DATA 3/0x60fa38b1
GMON dismounting group 3 at 31 for pid 29, osid 3770
NOTE: Disk DATA_0000 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0001 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0002 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0003 in mode 0x7f marked for de-assignment
NOTE: Disk DATA_0004 in mode 0x7f marked for de-assignment
NOTE: Disk  in mode 0x7f marked for de-assignment
NOTE: Disk  in mode 0x7f marked for de-assignment
ERROR: diskgroup DATA was not mounted
ORA-15032: not all alterations performed
ORA-15040: diskgroup is incomplete
ORA-15042: ASM disk "6" is missing from group number "3" 
ORA-15042: ASM disk "5" is missing from group number "3" 
ERROR: ALTER DISKGROUP DATA MOUNT  /* asm agent *//* {1:27928:40938} */

这个客户应该有三个磁盘组存放数据文件,其中data磁盘组的7个磁盘被删除了2个lun,导致data磁盘组无法mount,客户希望尽可能恢复其中数据,对于这种情况,由于2个lun完全丢失,直接通过dul之类的工具拷贝asm数据文件恢复不可行(因为很多asm的元数据也会在丢失的lun里面,导致拷贝出来的数据文件异常太多,恢复效果会很差),对于这种情况采用asm disk header 彻底损坏恢复的恢复方法,尽可能的从block层面恢复出来所有可以恢复的数据块中的数据
20230209100350


由于这个其中涉及了system表空间(oracle损坏严重),结合客户几年前的一个system历史备份文件,恢复出来字典,然后尽可能的恢复数据文件,最终最大限度给客户恢复数据,让客户的损失降到最低.

发表在 Oracle ASM | 标签为 , , , | 评论关闭