标签归档:tab$清空恢复

一次非常幸运的ORA-600 16703(tab$被清空)故障恢复

这次的ORA-600 16703的故障比较特殊,客户的一套rac运行了5年多没有重启,这次由于异常导致其中一个节点重启,然后触发了tab$被清空,异常节点启动报ORA-600 16703错误.朋友在故障之后,第一时间没有对在运行的节点进行重启(虽然也无法对外提供业务服务),使得恢复工作相对简单一些,恢复效果也是最完美的.这个是我在对于软件安装介质注入恶意脚本,300天之后重启触发tab$被清空的相关恢复case中,最完美的一次(以前遇到过一次客户是虚拟化环境通过cdp回退然后类似方法处理ORA-600 16703直接把orachk备份表插入到tab$恢复),凸显了这位朋友在故障发生之后对于问题的准确判断和果断的应对能力.
有朋友和我反馈,他们数据库突然报大量ORA-600错误,业务无法正常操作,我分析相关日志确认:节点2重启之后节点1开始报大量ORA-600错误,但是节点一直处于open状态

Fri Jul 25 15:28:53 2025
Decreasing number of real time LMS from 3 to 0
Fri Jul 25 15:29:18 2025
Reconfiguration started (old inc 13, new inc 15)
List of instances:
 1 2 (myinst: 1) 
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Jul 25 15:29:18 2025
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jul 25 15:29:18 2025
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jul 25 15:29:18 2025
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Fri Jul 25 15:29:20 2025
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:15 errcnt:0
Fri Jul 25 15:30:07 2025
Errors in file /u01/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_252634.trc  (incident=77234):
ORA-00600: internal error code, arguments: [ktsircinfo_num1], [0], [0], [0], [],[],[],[],[],[],[],[]
Incident details in: /u01/oracle/diag/rdbms/orcl/orcl1/incident/incdir_77234/orcl1_ora_252634_i77234.trc
Fri Jul 25 15:30:18 2025
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Jul 25 15:30:19 2025
Sweep [inc][77234]: completed
Sweep [inc2][77234]: completed
Fri Jul 25 15:30:27 2025
Errors in file /u01/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_261587.trc  (incident=76487):
ORA-00600: internal error code, arguments: [ktsircinfo_num1],[0],[0],[0], [], [], [], [], [], [], [], []

通过grep筛选报错信息

[root@iZbp11c0qyuuo1gr7j98upZ tmp]# egrep "ORA-00600|ORA-07445" alert_1.txt |sort -u
ORA-00600: internal error code, arguments: [25027], [0], [0], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kkpo_rcinfo_defstg:delseg], [28941391], [], [], [], []
ORA-00600: internal error code, arguments: [ktsircinfo_num1], [0], [0], [0], [], [], [], [], [], []
ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [DEL], [0], [35038924], [], [], [], [], [], [], []
ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [INS], [0], [277736], [], [], [], [], [], [], []
ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [INS], [0], [28829570], [], [], [], [], [], [], []
ORA-07445: exception encountered: core dump [qknSetParent()+9] [SIGSEGV] [ADDR:0x10354] 
   [PC:0x1A48B9B] [Address not mapped to object] []
ORA-07445: exception encountered: core dump [qksxaMoveQbAnnotations()+168] [SIGSEGV]
   [ADDR:0x20304] [PC:0x1594954] [Address not mapped to object] []
ORA-07445: 出现异常错误: 核心转储 [qknExpRegIni_int()+87] [SIGSEGV] [ADDR:0x8C] 
   [PC:0x1A4D729] [Address not mapped to object] []
ORA-07445: 出现异常错误: 核心转储 [qksxaMoveQbAnnotations()+168] [SIGSEGV] [ADDR:0x0] 
   [PC:0x1594954] [SI_KERNEL(general_protection)] []

既然是由于节点2重启导致节点1报错,那分析节点2重启相关情况,第一次重启成功之后,数据库开始报ORA-600错误

Fri Jul 25 15:29:29 2025
QMNC started with pid=46, OS id=363757 
Fri Jul 25 15:29:31 2025
minact-scn: Inst 2 is a slave inc#:15 mmon proc-id:363622 status:0x2
minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000
Fri Jul 25 15:29:33 2025
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc  (incident=248519):
ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248519/orcl2_m003_363779_i248519.trc
Starting background process SMCO
Fri Jul 25 15:29:35 2025
SMCO started with pid=57, OS id=363802 
Fri Jul 25 15:29:35 2025
Completed: ALTER DATABASE OPEN /* db agent *//* {2:23784:2} */
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc  (incident=248520):
ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248520/orcl2_m003_363779_i248520.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Exception [type: SIGSEGV, Address not mapped to object][ADDR:0x10] [PC:0x2FDA4BB,kgmdelsis()+219][flags:0x0,count:1]
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc  (incident=248521):
ORA-07445: exception encountered: core dump [kgmdelsis()+219] [SIGSEGV] 
  [ADDR:0x10] [PC:0x2FDA4BB] [Address not mapped to object] []
ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248521/orcl2_m003_363779_i248521.trc
Use ADRCI or Support Workbench to package the incident.
Fri Jul 25 15:29:47 2025
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_363808.trc  (incident=248559):
ORA-00600: internal error code, arguments: [kkposds2], [18446744073709551615], [18446744073709551615], 
  [18446744073709551615], [], [], [], [], [], [], [], []

然后第二次重启数据库无法open成功,而是报ORA-600 16703错误

ALTER DATABASE OPEN /* db agent *//* {2:21799:2} */
Picked broadcast on commit scheme to generate SCNs
ARCH: STARTING ARCH PROCESSES
Fri Jul 25 15:41:23 2025
ARC0 started with pid=39, OS id=369231 
ARC0: Archival started
ARCH: STARTING ARCH PROCESSES COMPLETE
ARC0: STARTING ARCH PROCESSES
Fri Jul 25 15:41:24 2025
ARC1 started with pid=40, OS id=369242 
Fri Jul 25 15:41:24 2025
ARC2 started with pid=41, OS id=369244 
Fri Jul 25 15:41:24 2025
ARC3 started with pid=42, OS id=369246 
ARC1: Archival started
ARC2: Archival started
ARC1: Becoming the 'no FAL' ARCH
ARC1: Becoming the 'no SRL' ARCH
ARC2: Becoming the heartbeat ARCH
Thread 2 opened at log sequence 33585
  Current log# 7 seq# 33585 mem# 0: +DATA/orcl/onlinelog/group_7.269.1011373611
Successful open of redo thread 2
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
SMON: enabling cache recovery
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc  (incident=260494):
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []
Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_260494/orcl2_ora_369210_i260494.trc
ARC3: Archival started
ARC0: STARTING ARCH PROCESSES COMPLETE
SUCCESS: diskgroup FRA was mounted
Fri Jul 25 15:41:30 2025
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []
Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], []
Error 704 happened during db open, shutting down database
USER (ospid: 369210): terminating the instance due to error 704
Instance terminated by USER, pid = 369210
ORA-1092 signalled during: ALTER DATABASE OPEN /* db agent *//* {2:21799:2} */...
opiodr aborting process unknown ospid (369210) as a result of ORA-1092
Fri Jul 25 15:41:31 2025
ORA-1092 : opitsk aborting process

到这一步基本上就清晰了,大概率是遭遇到以前恢复的类似case,tab$数据被清空导致,类似案例
ORA-600 16703故障解析—tab$表被清空
警告:互联网中有oracle介质被注入恶意程序导致—ORA-600 16703
通过在故障主机上找到安装介质,验证md5确认该程序是被注入恶意代码程序
md5


这个库由于还有一个节点处于open状态,相对处理比较简单,直接把备份的表数据反向插入回去即可

SYS@orcl1> select count(1) from ORACHK3C08C86E063530510ACD937;

  COUNT(1)
----------
     20696

SYS@orcl1> insert into tab$ select * from ORACHK3C08C86E063530510ACD937;

20696 rows created.

SYS@orcl1> commit;

Commit complete.

SYS@orcl1> select object_name,to_char(CREATED,'yyyy-mm-dd hh24:mi:ss') from dba_objects 
          2 where object_name in('DBMS_SUPPORT_DBMONITOR','DBMS_SUPPORT_DBMONITORP');

OBJECT_NAME                                  TO_CHAR(CREATED,'YY
-------------------------------------------- -------------------
DBMS_SUPPORT_DBMONITORP                      2019-06-19 17:06:46
DBMS_SUPPORT_DBMONITOR                       2019-06-19 17:06:46

然后清理掉恶意脚本,分别重启两个节点,完成数据恢复任务
2025-07-26_215903_578


这次故障能够快速顺利的恢复,和客户发现故障之后保留第一现场,没有把一个open的节点也重启有很大关系,open的节点也重启了,那后续恢复工作会麻烦很多,效果可能也没有这样的完美.

发表在 非常规恢复 | 标签为 , , , , , | 留下评论