联系:手机/微信(+86 17813235971) QQ(107644445)
标题:一次非常幸运的ORA-600 16703(tab$被清空)故障恢复
作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]
这次的ORA-600 16703的故障比较特殊,客户的一套rac运行了5年多没有重启,这次由于异常导致其中一个节点重启,然后触发了tab$被清空,异常节点启动报ORA-600 16703错误.朋友在故障之后,第一时间没有对在运行的节点进行重启(虽然也无法对外提供业务服务),使得恢复工作相对简单一些,恢复效果也是最完美的.这个是我在对于软件安装介质注入恶意脚本,300天之后重启触发tab$被清空的相关恢复case中,最完美的一次(以前遇到过一次客户是虚拟化环境通过cdp回退然后类似方法处理ORA-600 16703直接把orachk备份表插入到tab$恢复),凸显了这位朋友在故障发生之后对于问题的准确判断和果断的应对能力.
有朋友和我反馈,他们数据库突然报大量ORA-600错误,业务无法正常操作,我分析相关日志确认:节点2重启之后节点1开始报大量ORA-600错误,但是节点一直处于open状态
Fri Jul 25 15:28:53 2025 Decreasing number of real time LMS from 3 to 0 Fri Jul 25 15:29:18 2025 Reconfiguration started (old inc 13, new inc 15) List of instances: 1 2 (myinst: 1) Global Resource Directory frozen Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Fri Jul 25 15:29:18 2025 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Fri Jul 25 15:29:18 2025 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Fri Jul 25 15:29:18 2025 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Fri Jul 25 15:29:20 2025 minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:15 errcnt:0 Fri Jul 25 15:30:07 2025 Errors in file /u01/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_252634.trc (incident=77234): ORA-00600: internal error code, arguments: [ktsircinfo_num1], [0], [0], [0], [],[],[],[],[],[],[],[] Incident details in: /u01/oracle/diag/rdbms/orcl/orcl1/incident/incdir_77234/orcl1_ora_252634_i77234.trc Fri Jul 25 15:30:18 2025 Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Jul 25 15:30:19 2025 Sweep [inc][77234]: completed Sweep [inc2][77234]: completed Fri Jul 25 15:30:27 2025 Errors in file /u01/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_ora_261587.trc (incident=76487): ORA-00600: internal error code, arguments: [ktsircinfo_num1],[0],[0],[0], [], [], [], [], [], [], [], []
通过grep筛选报错信息
[root@iZbp11c0qyuuo1gr7j98upZ tmp]# egrep "ORA-00600|ORA-07445" alert_1.txt |sort -u ORA-00600: internal error code, arguments: [25027], [0], [0], [], [], [], [], [], [], [], [], [] ORA-00600: internal error code, arguments: [kkpo_rcinfo_defstg:delseg], [28941391], [], [], [], [] ORA-00600: internal error code, arguments: [ktsircinfo_num1], [0], [0], [0], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [DEL], [0], [35038924], [], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [INS], [0], [277736], [], [], [], [], [], [], [] ORA-00600: 内部错误代码, 参数: [16659], [kqldtu], [INS], [0], [28829570], [], [], [], [], [], [], [] ORA-07445: exception encountered: core dump [qknSetParent()+9] [SIGSEGV] [ADDR:0x10354] [PC:0x1A48B9B] [Address not mapped to object] [] ORA-07445: exception encountered: core dump [qksxaMoveQbAnnotations()+168] [SIGSEGV] [ADDR:0x20304] [PC:0x1594954] [Address not mapped to object] [] ORA-07445: 出现异常错误: 核心转储 [qknExpRegIni_int()+87] [SIGSEGV] [ADDR:0x8C] [PC:0x1A4D729] [Address not mapped to object] [] ORA-07445: 出现异常错误: 核心转储 [qksxaMoveQbAnnotations()+168] [SIGSEGV] [ADDR:0x0] [PC:0x1594954] [SI_KERNEL(general_protection)] []
既然是由于节点2重启导致节点1报错,那分析节点2重启相关情况,第一次重启成功之后,数据库开始报ORA-600错误
Fri Jul 25 15:29:29 2025 QMNC started with pid=46, OS id=363757 Fri Jul 25 15:29:31 2025 minact-scn: Inst 2 is a slave inc#:15 mmon proc-id:363622 status:0x2 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 Fri Jul 25 15:29:33 2025 Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc (incident=248519): ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248519/orcl2_m003_363779_i248519.trc Starting background process SMCO Fri Jul 25 15:29:35 2025 SMCO started with pid=57, OS id=363802 Fri Jul 25 15:29:35 2025 Completed: ALTER DATABASE OPEN /* db agent *//* {2:23784:2} */ Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc (incident=248520): ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248520/orcl2_m003_363779_i248520.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Exception [type: SIGSEGV, Address not mapped to object][ADDR:0x10] [PC:0x2FDA4BB,kgmdelsis()+219][flags:0x0,count:1] Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_m003_363779.trc (incident=248521): ORA-07445: exception encountered: core dump [kgmdelsis()+219] [SIGSEGV] [ADDR:0x10] [PC:0x2FDA4BB] [Address not mapped to object] [] ORA-00600: internal error code, arguments: [kgmfvmi#3], [], [], [], [], [], [], [], [], [], [], [] Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_248521/orcl2_m003_363779_i248521.trc Use ADRCI or Support Workbench to package the incident. Fri Jul 25 15:29:47 2025 Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_363808.trc (incident=248559): ORA-00600: internal error code, arguments: [kkposds2], [18446744073709551615], [18446744073709551615], [18446744073709551615], [], [], [], [], [], [], [], []
然后第二次重启数据库无法open成功,而是报ORA-600 16703错误
ALTER DATABASE OPEN /* db agent *//* {2:21799:2} */ Picked broadcast on commit scheme to generate SCNs ARCH: STARTING ARCH PROCESSES Fri Jul 25 15:41:23 2025 ARC0 started with pid=39, OS id=369231 ARC0: Archival started ARCH: STARTING ARCH PROCESSES COMPLETE ARC0: STARTING ARCH PROCESSES Fri Jul 25 15:41:24 2025 ARC1 started with pid=40, OS id=369242 Fri Jul 25 15:41:24 2025 ARC2 started with pid=41, OS id=369244 Fri Jul 25 15:41:24 2025 ARC3 started with pid=42, OS id=369246 ARC1: Archival started ARC2: Archival started ARC1: Becoming the 'no FAL' ARCH ARC1: Becoming the 'no SRL' ARCH ARC2: Becoming the heartbeat ARCH Thread 2 opened at log sequence 33585 Current log# 7 seq# 33585 mem# 0: +DATA/orcl/onlinelog/group_7.269.1011373611 Successful open of redo thread 2 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set SMON: enabling cache recovery Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc (incident=260494): ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], [] Incident details in: /u01/oracle/diag/rdbms/orcl/orcl2/incident/incdir_260494/orcl2_ora_369210_i260494.trc ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE SUCCESS: diskgroup FRA was mounted Fri Jul 25 15:41:30 2025 Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], [] Errors in file /u01/oracle/diag/rdbms/orcl/orcl2/trace/orcl2_ora_369210.trc: ORA-00704: bootstrap process failure ORA-00704: bootstrap process failure ORA-00600: internal error code, arguments: [16703], [1403], [20], [], [], [], [], [], [], [], [], [] Error 704 happened during db open, shutting down database USER (ospid: 369210): terminating the instance due to error 704 Instance terminated by USER, pid = 369210 ORA-1092 signalled during: ALTER DATABASE OPEN /* db agent *//* {2:21799:2} */... opiodr aborting process unknown ospid (369210) as a result of ORA-1092 Fri Jul 25 15:41:31 2025 ORA-1092 : opitsk aborting process
到这一步基本上就清晰了,大概率是遭遇到以前恢复的类似case,tab$数据被清空导致,类似案例
ORA-600 16703故障解析—tab$表被清空
警告:互联网中有oracle介质被注入恶意程序导致—ORA-600 16703
通过在故障主机上找到安装介质,验证md5确认该程序是被注入恶意代码程序
这个库由于还有一个节点处于open状态,相对处理比较简单,直接把备份的表数据反向插入回去即可
SYS@orcl1> select count(1) from ORACHK3C08C86E063530510ACD937; COUNT(1) ---------- 20696 SYS@orcl1> insert into tab$ select * from ORACHK3C08C86E063530510ACD937; 20696 rows created. SYS@orcl1> commit; Commit complete. SYS@orcl1> select object_name,to_char(CREATED,'yyyy-mm-dd hh24:mi:ss') from dba_objects 2 where object_name in('DBMS_SUPPORT_DBMONITOR','DBMS_SUPPORT_DBMONITORP'); OBJECT_NAME TO_CHAR(CREATED,'YY -------------------------------------------- ------------------- DBMS_SUPPORT_DBMONITORP 2019-06-19 17:06:46 DBMS_SUPPORT_DBMONITOR 2019-06-19 17:06:46
这次故障能够快速顺利的恢复,和客户发现故障之后保留第一现场,没有把一个open的节点也重启有很大关系,open的节点也重启了,那后续恢复工作会麻烦很多,效果可能也没有这样的完美.