ORA-600 3417和ORA-600 3005故障处理

运行中的数据库突然报ORA-600 3417错误,lgwr进程异常数据库crash

Thu Nov 17 03:00:14 2022
Archived Log entry 23860 added for thread 2 sequence 1958 ID 0x6200e2f5 dest 1:
Thu Nov 17 03:13:11 2022
Auto-tuning: Shutting down background process GTX1
Thu Nov 17 04:00:02 2022
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lgwr_1740.trc  (incident=672186):
ORA-00600: 内部错误代码, 参数: [3417], [2], [0], [0], [0], [1], [2], [], [], [], [], []
Thu Nov 17 04:00:04 2022
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_lgwr_1740.trc:
ORA-00600: 内部错误代码, 参数: [3417], [2], [0], [0], [0], [1], [2], [], [], [], [], []
LGWR (ospid: 1740): terminating the instance due to error 470

重启之后报ORA-600 3005错误,数据库启动失败

Thu Nov 17 04:03:09 2022
Successful mount of redo thread 2, with mount id 1648753015
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {0:1:38} */
ALTER DATABASE OPEN /* db agent *//* {0:1:38} */
This instance was first to open
Beginning crash recovery of 1 threads
 parallel recovery started with 31 processes
Thu Nov 17 04:03:14 2022
Started redo scan
ORA-00600: ??????, ??: [3005], [1], [706], [10374], [0], [0], [], [], [], [], [], []
Thu Nov 17 04:03:15 2022
Reconfiguration started (old inc 14, new inc 16)
List of instances:
 1 2 (myinst: 2) 
 Global Resource Directory frozen
Thu Nov 17 04:03:15 2022
 Communication channels reestablished
Thu Nov 17 04:03:16 2022
 * domain 0 valid = 0 according to instance 1 
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Thu Nov 17 04:03:16 2022
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Thu Nov 17 04:03:16 2022
Sweep [inc][688298]: completed
Sweep [inc2][688298]: completed
Thu Nov 17 04:03:16 2022
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Thu Nov 17 04:03:16 2022
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
Abort recovery for domain 0
Aborting crash recovery due to error 600
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_15352.trc:
ORA-00600: ??????, ??: [3005], [1], [706], [10374], [0], [0], [], [], [], [], [], []
Abort recovery for domain 0
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_ora_15352.trc:
ORA-00600: ??????, ??: [3005], [1], [706], [10374], [0], [0], [], [], [], [], [], []
ORA-600 signalled during: ALTER DATABASE OPEN /* db agent *//* {0:1:38} */...

尝试人工进行recover恢复库

SQL> recover database;
ORA-00279: 更改 310644203 (在 11/17/2022 01:00:05 生成) 对于线程 2 是必需的
ORA-00289: 建议:
+DATA/orcl/archivelog/2022_11_17/thread_2_seq_1956.22763.1120960801
ORA-00280: 更改 310644203 (用于线程 2) 在序列 #1956 中


指定日志: {<RET>=suggested | filename | AUTO | CANCEL}
auto
ORA-00279: 更改 310663747 (在 11/17/2022 02:00:01 生成) 对于线程 2 是必需的
ORA-00289: 建议:
+DATA/orcl/archivelog/2022_11_17/thread_2_seq_1957.22764.1120962585
ORA-00280: 更改 310663747 (用于线程 2) 在序列 #1957 中


ORA-10877: error signaled in parallel recovery slave


ORA-01112: 未启动介质恢复

通过查看alert日志确认由于ORA-00353错误导致recover database失败

Thu Nov 17 08:07:39 2022
ALTER DATABASE RECOVER  database  
Media Recovery Start
 started logmerger process
Parallel Media Recovery started with 32 slaves
Thu Nov 17 08:07:41 2022
Recovery of Online Redo Log: Thread 1 Group 1 Seq 705 Reading mem 0
  Mem# 0: +DATA/orcl/onlinelog/group_1.261.1116409583
ORA-279 signalled during: ALTER DATABASE RECOVER  database  ...
Thu Nov 17 08:08:07 2022
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Media Recovery Log +DATA/orcl/archivelog/2022_11_17/thread_2_seq_1956.22763.1120960801
ORA-279 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
ALTER DATABASE RECOVER    CONTINUE DEFAULT  
Media Recovery Log +DATA/orcl/archivelog/2022_11_17/thread_2_seq_1957.22764.1120962585
Thu Nov 17 08:08:14 2022
Recovery of Online Redo Log: Thread 2 Group 4 Seq 1958 Reading mem 0
  Mem# 0: +DATA/orcl/onlinelog/group_4.266.1116409589
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_pr00_7116.trc  (incident=704315):
ORA-00353: 日志损坏接近块 20866 更改 310761542 时间 11/17/2022 03:00:04
ORA-00312: 联机日志 1 线程 1: '+DATA/orcl/onlinelog/group_1.261.1116409583'
Thu Nov 17 08:08:26 2022
Sweep [inc][704315]: completed
Thu Nov 17 08:08:27 2022
Media Recovery failed with error 354
Errors in file D:\APP\ADMINISTRATOR\diag\rdbms\orcl\orcl2\trace\orcl2_pr00_7116.trc:
ORA-00283: 恢复会话因错误而取消
ORA-00354: 损坏重做日志块标头
ORA-00353: 日志损坏接近块 20866 更改 310761542 时间 11/17/2022 03:00:04
ORA-00312: 联机日志 1 线程 1: '+DATA/orcl/onlinelog/group_1.261.1116409583'
Thu Nov 17 08:08:27 2022
ORA-10877 signalled during: ALTER DATABASE RECOVER    CONTINUE DEFAULT  ...
ALTER DATABASE RECOVER CANCEL 
ORA-1112 signalled during: ALTER DATABASE RECOVER CANCEL ...

通过对redo进行处理顺利recover成功并完美open库

SQL> recover database;
完成介质恢复。
SQL> alter database open;

数据库已更改。
发表在 Oracle备份恢复 | 标签为 , , | 评论关闭

ORA-12547: TNS:lost contact故障一例

一个朋友安装好aix平台11.2.0.4 rac,dbca创建库报ORA-12547错误.
ORA-12547


尝试sqlplus 登录数据库

$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Thu Nov 17 11:45:33 2022

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

ERROR:
ORA-12547: TNS:lost contact


Enter user-name: 


ERROR:
ORA-12547: TNS:lost contact


Enter user-name: ERROR:
ORA-12547: TNS:lost contact


SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus

使用truss跟踪sqlplus 执行过程,卡在kfork上面
20221116121953


检查oracle二进制文件权限,正常

$ ls -l $ORACLE_HOME/bin/oracle
-rwsr-s--x    1 oracle   oinstall  309671362 Nov 17 03:13 /u01/app/oracle/product/11.2.0/db/bin/oracle

经过分析确认是oracle home权限异常

$ ls -ld $ORACLE_HOME
drwxrwxr-x   74 grid     oinstall       4096 Nov 17 02:59 /u01/app/oracle/product/11.2.0/db
$ exit
# chown oracle:oinstall /u01/app/oracle/product/11.2.0/db

sqlplus和dbca都正常

$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.4.0 Production on Thu Nov 17 11:46:32 2022

Copyright (c) 1982, 2013, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> exit
Disconnected

发表在 Oracle | 标签为 , , | 评论关闭

rac kill 大事物后回滚慢,smon等待DFS lock handle和enq: TX – contention

Customer killed a large DELETE statement of a large table in instance 1 in high workload period in RAC environment of 2 instances.
Transaction recovery did not yet complete for more than 1 hour by querying V$FAST_START_TRANSACTIONS.
Update on the same table on instance 2 hung more than 1 hour.
Transaction recovery complete quickly after shutdown instance 2.
ASH shows high waiting on “enq: TX – row lock contention” and “transaction” without blocking session until instance 2 shutdown after DELETE session killed in instance 1

SQL> select instance_number,program,event,BLOCKING_INST_ID b_inst,count(session_id) session_cnt,
count(BLOCKING_SESSION) b_session_cnt,min(sample_time) start_time,max(sample_time) end_time,count(*) cnt
  2  from DBA_HIST_ACTIVE_SESS_HISTORY where sql_id='d1xg4v7rqx3kt' and instance_number=2
group by instance_number,program,event,BLOCKING_INST_ID
order by instance_number,start_time
  3    4    5  ;

INSTANCE_NUMBER PROGRAM            EVENT                         B_INST SESSION_CNT B_SESSION_CNT START_TIME END_TIME       CNT
--------------- ------------------ ----------------------------- ------ ----------- ------------- ---------- ---------- -------
              2 JDBC Thin Client   enq: TX - row lock contention             728939             0 19:20:34   21:04:25    728939
              2 JDBC Thin Client   transaction                               489837             0 20:12:44   21:04:25    489837

ASH shows that all sessions of the “enq: TX – row lock contention” without blocking session on instance 2 were waiting for the transaction of the Killed DELETE on instance 1

SQL> select instance_number,event,trunc(p2/65536) XIDUSN, trunc(mod(p2,65536)) XIDSLOT, 
p3 XIDSQN,count(session_id) session_cnt,count(BLOCKING_SESSION) b_session_cnt,min(sample_time) start_time,max(sample_time) end_time,count(*) cnt
from DBA_HIST_ACTIVE_SESS_HISTORY where sql_id='d1xg4v7rqx3kt' and instance_number=2 and 
to_char(sample_time,'HH24:MI') >= '20:12' AND BLOCKING_SESSION IS NULL AND EVENT='enq: TX - row lock contention'
group by instance_number,event,trunc(p2/65536) , trunc(mod(p2,65536)) , p3
order by start_time
;
  2    3    4    5

INSTANCE_NUMBER EVENT                            XIDUSN  XIDSLOT   XIDSQN SESSION_CNT B_SESSION_CNT START_TIME END_TIME      CNT
--------------- ------------------------------ -------- -------- -------- ----------- ------------- ---------- ---------- ------
              2 enq: TX - row lock contention       932       15   623055      717850             0 20:12:55   21:04:25   717850
SQL>

ASH shows that SMON of instance 1 was waiting for “DFS lock handle” and “enq: TX – contention” among DELETE session killed in instance 1 and instance 2 shutdown.

SQL> select instance_number,session_id,session_serial#,program,event,BLOCKING_SESSION b_sid,
BLOCKING_SESSION_SERIAL# b_serial,BLOCKING_INST_ID b_inst,min(sample_time) start_time,
max(sample_time) end_time,count(*) cnt
 from DBA_HIST_ACTIVE_SESS_HISTORY where upper(program) like '%SMON%'
group by instance_number,session_id,session_serial#,program,event,BLOCKING_SESSION ,BLOCKING_SESSION_SERIAL# ,BLOCKING_INST_ID
order by instance_number,start_time
;

INSTANCE_NUMBER SESSION_ID SESSION_SERIAL# PROGRAM                       EVENT                 B_SID  B_SERIAL B_INST START_TIME END_TIME   CNT
--------------- ---------- --------------- ----------------------------- --------------------- ----- --------- ------ ---------- ---------- ---
              1       3761               1 oracle@<instance_name> (SMON) DFS lock handle                              19:05:24   21:05:07   128
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention                         20:14:31   21:04:47    50

The P1 of “DFS lock handle” shows that SMON was requesting TA enqueue through cross instance calls which is used for serializing operations on undo segments and undo tablespaces

SQL> col event for a20
select instance_number,sample_time,session_id,session_serial#,program,event,p1,p2,p3
  from DBA_HIST_ACTIVE_SESS_HISTORY where upper(program) like '%SMON%' AND EVENT='DFS lock handle' ;
  SQL>   2
INSTANCE_NUMBER SAMPLE_TIME     SESSION_ID SESSION_SERIAL# PROGRAM                       EVENT                      P1     P2     P3
--------------- --------------- ---------- --------------- ----------------------------- ---------------- ------------ ------ ------
              1 19:05:24              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   4722
              1 20:12:27              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   6668
              1 20:13:40              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   5787
              1 20:13:50              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   5580
              1 20:14:00              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   5826
... ...
              1 21:04:57              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   1347
              1 21:05:07              3761               1 oracle@<instance_name> (SMON) DFS lock handle    1413545989      3   4582

144 rows selected.

* All of P1 is 1413545989, converted to hexadecimal: ‭54410005‬, 54 ASCII code: T, 41 ASCII code: A => TA enqueue

The P2,P3 of “enq: TX – row lock contention” shows that SMON process was requesting TX enqueue of the UNDO block of the transaction of the Killed DELETE on instance 1.

SQL> select instance_number,session_id,session_serial#,program,event,trunc(p2/65536) XIDUSN, trunc(mod(p2,65536)) XIDSLOT, p3 XIDSQN,COUNT(1) CNT,min(sample_time) start_time,max(sample_time) end_time
from DBA_HIST_ACTIVE_SESS_HISTORY WHERE session_id =3761 and SESSION_SERIAL# = 1 and instance_number=1 AND event like 'enq: TX - contention%'
group by instance_number,session_id,session_serial#,program,event,trunc(p2/65536) , trunc(mod(p2,65536)) , p3
;
  2    3    4
INSTANCE_NUMBER SESSION_ID SESSION_SERIAL# PROGRAM                        EVENT                  XIDUSN  XIDSLOT   XIDSQN CNT START_TIME END_TIME
--------------- ---------- --------------- ------------------------------ -------------------- -------- -------- -------- --- ---------- ----------
              1       3761               1 oracle@<instance_name> (SMON)  enq: TX - contention      932       15   623055  50 20:14:31   21:04:47

SQL>
SQL> select instance_number,session_id,session_serial#,program,event,CURRENT_OBJ#, CURRENT_FILE#, CURRENT_BLOCK#,CURRENT_ROW#,COUNT(1) CNT,min(sample_time) start_time,max(sample_time) end_time
  2  from DBA_HIST_ACTIVE_SESS_HISTORY WHERE session_id =3761 and SESSION_SERIAL# = 1 and instance_number=1 AND event like 'enq: TX - contention%'
group by instance_number,session_id,session_serial#,program,event,CURRENT_OBJ#, CURRENT_FILE#, CURRENT_BLOCK#,CURRENT_ROW#
order by start_time
;
  3    4    5
INSTANCE_NUMBER SESSION_ID SESSION_SERIAL# PROGRAM                       EVENT                CURRENT_OBJ# CURRENT_FILE# CURRENT_BLOCK# CURRENT_ROW#   CNT START_TIME END_TIME
--------------- ---------- --------------- ----------------------------- -------------------- ------------ ------------- -------------- ------------ ----- ---------- ----------
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            1            42              4            0     1 20:14:31   20:14:31
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            1             3              3            0     1 20:16:22   20:16:22 <<<<CURRENT_FILE#:3 ==>UNDO
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            1             3              5            0     4 20:22:17   20:25:49 <<<<CURRENT_FILE#:3 ==>UNDO
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            0             3           1920            0   3 20:30:13   20:30:33 <<<<CURRENT_FILE#:3 ==>UNDO
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            0             3          15728            0 23 20:37:58   20:52:39 <<<<CURRENT_FILE#:3 ==>UNDO
              1       3761               1 oracle@<instance_name> (SMON) enq: TX - contention            1          1002              2            0    18 20:53:09   21:04:47

6 rows selected.

SQL>

解决方案
Please avoid to kill large transaction during high workload period.
If has to kill large transaction, shutdown all remote instances can speed up the transaction recovery.
参考:Transaction Recovery Slow And High Row Lock Contention After Killed Large Transaction in RAC (Doc ID 2668617.1)

发表在 Oracle | 标签为 , , | 评论关闭