标签云
asm恢复 bbed bootstrap$ dul In Memory kcbzib_kcrsds_1 kccpb_sanity_check_2 MySQL恢复 ORA-00312 ORA-00607 ORA-00704 ORA-00742 ORA-01110 ORA-01555 ORA-01578 ORA-01595 ORA-08103 ORA-600 2131 ORA-600 2662 ORA-600 3020 ORA-600 4000 ORA-600 4137 ORA-600 4193 ORA-600 4194 ORA-600 16703 ORA-600 kcbzib_kcrsds_1 ORA-600 KCLCHKBLK_4 ORA-15042 ORA-15196 ORACLE 12C oracle dul ORACLE PATCH Oracle Recovery Tools oracle加密恢复 oracle勒索 oracle勒索恢复 oracle异常恢复 Oracle 恢复 ORACLE恢复 ORACLE数据库恢复 oracle 比特币 OSD-04016 YOUR FILES ARE ENCRYPTED 勒索恢复 比特币加密文章分类
- Others (2)
- 中间件 (2)
- WebLogic (2)
- 操作系统 (103)
- 数据库 (1,763)
- DB2 (22)
- MySQL (76)
- Oracle (1,605)
- Data Guard (52)
- EXADATA (8)
- GoldenGate (24)
- ORA-xxxxx (166)
- ORACLE 12C (72)
- ORACLE 18C (6)
- ORACLE 19C (15)
- ORACLE 21C (3)
- Oracle 23ai (8)
- Oracle ASM (69)
- Oracle Bug (8)
- Oracle RAC (54)
- Oracle 安全 (6)
- Oracle 开发 (28)
- Oracle 监听 (28)
- Oracle备份恢复 (588)
- Oracle安装升级 (97)
- Oracle性能优化 (62)
- 专题索引 (5)
- 勒索恢复 (86)
- PostgreSQL (30)
- pdu工具 (6)
- PostgreSQL恢复 (9)
- SQL Server (32)
- SQL Server恢复 (13)
- TimesTen (7)
- 达梦数据库 (3)
- 达梦恢复 (1)
- 生活娱乐 (2)
- 至理名言 (11)
- 虚拟化 (2)
- VMware (2)
- 软件开发 (39)
- Asp.Net (9)
- JavaScript (12)
- PHP (2)
- 小工具 (22)
-
最近发表
- .sstop勒索加密数据库恢复
- 解决一次硬件恢复之后数据文件0kb的故障恢复case
- Error in invoking target ‘libasmclntsh19.ohso libasmperl19.ohso client_sharedlib’问题处理
- ORA-01171: datafile N going offline due to error advancing checkpoint
- linux环境oracle数据库被文件系统勒索加密为.babyk扩展名溯源
- ORA-600 ksvworkmsgalloc: bad reaper
- ORA-600 krccfl_chunk故障处理
- Oracle Recovery Tools恢复案例总结—202505
- ORA-600 kddummy_blkchk 数据库循环重启
- 记录一次asm disk加入到vg通过恢复直接open库的案例
- CHECKDB 发现了 N 个分配错误和 M 个一致性错误
- 达梦数据库dm.ctl文件异常恢复
- Oracle Recovery Tools修复ORA-00742、ORA-600 ktbair2: illegal inheritance故障
- 可能是 tempdb 空间用尽或某个系统表不一致故障处理
- 11.2.0.4库中遇到ORA-600 kcratr_nab_less_than_odr报错
- [MY-013183] [InnoDB] Assertion failure故障处理
- Oracle 19c 202504补丁(RUs+OJVM)-19.27
- Oracle Recovery Tools修复ORA-600 6101/kdxlin:psno out of range故障
- pdu完美支持金仓数据库恢复(KingbaseES)
- 虚拟机故障引起ORA-00310 ORA-00334故障处理
年归档:2013
使用UltraEdit修改oracle二进制文件
对数据库OPEN的过程做10046,因为数据库没有open,那写着语句是写在什么地方的呢?通过使用UltraEdit打开$ORACLE_HOME/bin/oracle文件,可以发现很多sql语句,而且与open过程的10046很多一致.
044acdc0h: 73 65 6C 65 63 74 20 75 2E 6E 61 6D 65 2C 20 6F ; select u.name, o 044acdd0h: 32 2E 6E 61 6D 65 2C 20 6F 32 2E 6F 62 6A 23 20 ; 2.name, o2.obj# 044acde0h: 66 72 6F 6D 20 69 6E 64 24 20 69 2C 20 6F 62 6A ; from ind$ i, obj 044acdf0h: 24 20 6F 31 2C 20 6F 62 6A 24 20 6F 32 2C 20 75 ; $ o1, obj$ o2, u 044ace00h: 73 65 72 24 20 75 20 77 68 65 72 65 20 6F 31 2E ; ser$ u where o1. 044ace10h: 6F 77 6E 65 72 23 20 3D 20 3A 31 20 20 61 6E 64 ; owner# = :1 and 044ace20h: 20 6F 31 2E 74 79 70 65 23 20 3D 20 32 20 61 6E ; o1.type# = 2 an 044ace30h: 64 20 69 2E 74 79 70 65 23 20 3D 20 39 20 61 6E ; d i.type# = 9 an 044ace40h: 64 20 69 2E 62 6F 23 20 3D 20 6F 31 2E 6F 62 6A ; d i.bo# = o1.obj 044ace50h: 23 20 20 61 6E 64 20 69 2E 6F 62 6A 23 20 3D 20 ; # and i.obj# = 044ace60h: 6F 32 2E 6F 62 6A 23 20 61 6E 64 20 6F 32 2E 6F ; o2.obj# and o2.o 044ace70h: 77 6E 65 72 23 20 3D 20 75 2E 75 73 65 72 23 20 ; wner# = u.user# 044ace80h: 61 6E 64 20 69 2E 74 79 70 65 23 20 3D 20 39 20 ; and i.type# = 9 044ace90h: 61 6E 64 20 6F 31 2E 6F 77 6E 65 72 23 20 21 3D ; and o1.owner# != 044acea0h: 20 6F 32 2E 6F 77 6E 65 72 23 00 00 00 00 00 00 ; o2.owner#...... 044aceb0h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................ 044acec0h: 73 65 6C 65 63 74 20 75 2E 6E 61 6D 65 2C 20 6F ; select u.name, o 044aced0h: 2E 6E 61 6D 65 2C 20 6F 2E 6F 62 6A 23 20 66 72 ; .name, o.obj# fr 044acee0h: 6F 6D 20 6F 62 6A 24 20 6F 2C 20 75 73 65 72 24 ; om obj$ o, user$ 044acef0h: 20 75 2C 20 69 6E 64 24 20 69 20 77 68 65 72 65 ; u, ind$ i where 044acf00h: 20 6F 2E 6F 77 6E 65 72 23 3D 3A 31 20 61 6E 64 ; o.owner#=:1 and 044acf10h: 20 6F 2E 6F 77 6E 65 72 23 3D 75 2E 75 73 65 72 ; o.owner#=u.user 044acf20h: 23 20 61 6E 64 20 6F 2E 6F 62 6A 23 3D 69 2E 6F ; # and o.obj#=i.o 044acf30h: 62 6A 23 20 61 6E 64 20 69 2E 74 79 70 65 23 3D ; bj# and i.type#= 044acf40h: 39 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; 9............... 044acf50h: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ; ................ 044acf60h: 73 65 6C 65 63 74 20 76 6E 61 6D 65 20 66 72 6F ; select vname fro 044acf70h: 6D 20 73 79 73 2E 73 6E 61 70 24 20 77 68 65 72 ; m sys.snap$ wher 044acf80h: 65 20 73 6F 77 6E 65 72 20 3D 20 3A 31 20 61 6E ; e sowner = :1 an 044acf90h: 64 20 69 6E 73 74 73 69 74 65 20 3D 20 30 20 20 ; d instsite = 0 044acfa0h: 61 6E 64 20 28 62 69 74 61 6E 64 28 66 6C 61 67 ; and (bitand(flag 044acfb0h: 2C 20 32 36 38 34 33 35 34 35 36 29 20 3D 20 30 ; , 268435456) = 0 044acfc0h: 20 6F 72 20 62 69 74 61 6E 64 28 6F 62 6A 66 6C ; or bitand(objfl 044acfd0h: 61 67 2C 20 33 32 29 20 3D 20 30 29 00 00 00 00 ; ag, 32) = 0)....
既然数据库OPEN的过程很多sql是在oracle二进制文件中写的,那是否可以考虑通过修改二进制文件中的sql语句来改变执行计划,甚至可以尝试修改这些语句使得异常的数据库能够正常open.这里演示通过修改sql语句来展示改变执行计划(不使用index,而改用全表扫描)
10046 捕获语句和执行计划
PARSING IN CURSOR #2 len=116 dep=2 uid=0 oct=3 lid=0 tim=1345475346332403 hv=854877822 ad='2f2be060' select o.owner#,o.name,o.namespace,o.remoteowner,o.linkname,o.subname,o.dataobj#,o.flags from obj$ o where o.obj#=:1 END OF STMT PARSE #2:c=1000,e=1258,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,tim=1345475346332394 BINDS #2: kkscoacd Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 kxsbbbfp=b69ef000 bln=22 avl=03 flg=05 value=893 EXEC #2:c=2000,e=1382,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,tim=1345475346333978 WAIT #2: nam='db file sequential read' ela= 17820 file#=1 block#=220 blocks=1 obj#=-1 tim=1345475346351927 FETCH #2:c=1000,e=18054,p=1,cr=3,cu=0,mis=0,r=1,dep=2,og=4,tim=1345475346352100 STAT #2 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS BY INDEX ROWID OBJ$ (cr=3 pr=1 pw=0 time=18047 us)' STAT #2 id=2 cnt=1 pid=1 pos=1 obj=36 op='INDEX UNIQUE SCAN I_OBJ1 (cr=2 pr=1 pw=0 time=18014 us)'
使用UE查看该sql在二进制文件内容
0459c650h: 3A 35 2C 3A 36 2C 3A 37 29 00 00 00 00 00 00 00 ; :5,:6,:7)....... 0459c660h: 73 65 6C 65 63 74 20 6F 2E 6F 77 6E 65 72 23 2C ; select o.owner#, 0459c670h: 6F 2E 6E 61 6D 65 2C 6F 2E 6E 61 6D 65 73 70 61 ; o.name,o.namespa 0459c680h: 63 65 2C 6F 2E 72 65 6D 6F 74 65 6F 77 6E 65 72 ; ce,o.remoteowner 0459c690h: 2C 6F 2E 6C 69 6E 6B 6E 61 6D 65 2C 6F 2E 73 75 ; ,o.linkname,o.su 0459c6a0h: 62 6E 61 6D 65 2C 6F 2E 64 61 74 61 6F 62 6A 23 ; bname,o.dataobj# 0459c6b0h: 2C 6F 2E 66 6C 61 67 73 20 66 72 6F 6D 20 6F 62 ; ,o.flags from ob 0459c6c0h: 6A 24 20 6F 20 77 68 65 72 65 20 6F 2E 6F 62 6A ; j$ o where o.obj 0459c6d0h: 23 3D 3A 31 00 00 00 00 00 00 00 00 00 00 00 00 ; #=:1............ 0459c6e0h: 73 65 6C 65 63 74 20 74 79 70 65 23 2C 63 6F 6E ; select type#,con
修改二进制文件中sql语句
0459c650h: 3A 35 2C 3A 36 2C 3A 37 29 00 00 00 00 00 00 00 ; :5,:6,:7)....... 0459c660h: 73 65 6C 65 63 74 20 6F 2E 6F 77 6E 65 72 23 2C ; select o.owner#, 0459c670h: 6F 2E 6E 61 6D 65 2C 6F 2E 6E 61 6D 65 73 70 61 ; o.name,o.namespa 0459c680h: 63 65 2C 6F 2E 72 65 6D 6F 74 65 6F 77 6E 65 72 ; ce,o.remoteowner 0459c690h: 2C 6F 2E 6C 69 6E 6B 6E 61 6D 65 2C 6F 2E 73 75 ; ,o.linkname,o.su 0459c6a0h: 62 6E 61 6D 65 2C 6F 2E 64 61 74 61 6F 62 6A 23 ; bname,o.dataobj# 0459c6b0h: 2C 6F 2E 66 6C 61 67 73 20 66 72 6F 6D 20 6F 62 ; ,o.flags from ob 0459c6c0h: 6A 24 20 6F 20 77 68 65 72 65 20 6F 62 6A 23 2B ; j$ o where obj#+ 0459c6d0h: 30 3D 3A 31 00 00 00 00 00 00 00 00 00 00 00 00 ; 0=:1............ 0459c6e0h: 73 65 6C 65 63 74 20 74 79 70 65 23 2C 63 6F 6E ; select type#,con
这里通过对obj#加上一个常量0,使得该sql在执行之时不会使用obj$.obj#上的index,从而使用全表扫描,来实现我们修改执行计划,屏蔽该index的目的
修改后的10046观察
PARSING IN CURSOR #2 len=116 dep=2 uid=0 oct=3 lid=0 tim=1345475781593851 hv=493726595 ad='2f2ba76c' select o.owner#,o.name,o.namespace,o.remoteowner,o.linkname,o.subname,o.dataobj#,o.flags from obj$ o where obj#+0=:1 END OF STMT PARSE #2:c=1000,e=1095,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,tim=1345475781593840 BINDS #2: kkscoacd Bind#0 oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00 oacflg=08 fl2=0001 frm=00 csi=00 siz=24 off=0 kxsbbbfp=b6a86000 bln=22 avl=03 flg=05 value=893 EXEC #2:c=2999,e=1603,p=0,cr=0,cu=0,mis=1,r=0,dep=2,og=4,tim=1345475781595754 WAIT #2: nam='db file sequential read' ela= 18417 file#=1 block#=121 blocks=1 obj#=-1 tim=1345475781614344 WAIT #2: nam='db file sequential read' ela= 1000 file#=1 block#=123 blocks=1 obj#=-1 tim=1345475781615726 WAIT #2: nam='db file sequential read' ela= 1462 file#=1 block#=124 blocks=1 obj#=-1 tim=1345475781617720 WAIT #2: nam='db file sequential read' ela= 492 file#=1 block#=125 blocks=1 obj#=-1 tim=1345475781618452 WAIT #2: nam='db file sequential read' ela= 1358 file#=1 block#=126 blocks=1 obj#=-1 tim=1345475781620361 WAIT #2: nam='db file sequential read' ela= 500 file#=1 block#=127 blocks=1 obj#=-1 tim=1345475781621039 WAIT #2: nam='db file sequential read' ela= 63132 file#=1 block#=128 blocks=1 obj#=-1 tim=1345475781684316 WAIT #2: nam='db file sequential read' ela= 989 file#=1 block#=5233 blocks=1 obj#=-1 tim=1345475781685559 WAIT #2: nam='db file sequential read' ela= 792 file#=1 block#=5234 blocks=1 obj#=-1 tim=1345475781686583 FETCH #2:c=7999,e=90994,p=9,cr=13,cu=0,mis=0,r=1,dep=2,og=4,tim=1345475781686865 STAT #2 id=1 cnt=1 pid=0 pos=1 obj=18 op='TABLE ACCESS FULL OBJ$ (cr=13 pr=9 pw=0 time=90977 us)'
很明显,到这里我们通过UE修改oracle二进制文件,实现了sql语句执行计划的更改.
免责申明:本方法仅供在非常特殊或者测试环境下使用,一般生产环境请勿模仿,否则后果自负
误杀进程导致rac hang住
有客户反馈系统hang住,不能归档,需要我们紧急介入分析
节点1日志
出现redo不能归档,redo日志都已经被写满,人工执行了ALTER SYSTEM ARCHIVE LOG CURRENT,数据库就开始把redo全部归档,但是后面产生的redo又不能归档,当redo全部写满之后,数据库有出现大量log file switch (archiving needed)等待
Tue Sep 24 22:05:37 2013 Thread 1 advanced to log sequence 47282 (LGWR switch) Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:07:31 2013 ORACLE Instance q9db1 - Can not allocate log, archival required Thread 1 cannot allocate new log, sequence 47283 All online logs needed archiving Current log# 6 seq# 47282 mem# 0: +DATA/q9db/onlinelog/group_6.1244.818697409 Tue Sep 24 22:28:17 2013 ALTER SYSTEM ARCHIVE LOG Archived Log entry 259646 added for thread 1 sequence 47266 ID 0x354620c2 dest 1: Tue Sep 24 22:28:18 2013 Thread 1 advanced to log sequence 47283 (LGWR switch) Current log# 7 seq# 47283 mem# 0: +DATA/q9db/onlinelog/group_7.1243.818697415 Archived Log entry 259647 added for thread 1 sequence 47267 ID 0x354620c2 dest 1: Archived Log entry 259648 added for thread 1 sequence 47268 ID 0x354620c2 dest 1: Archived Log entry 259649 added for thread 1 sequence 47269 ID 0x354620c2 dest 1: Archived Log entry 259650 added for thread 1 sequence 47270 ID 0x354620c2 dest 1: Archived Log entry 259651 added for thread 1 sequence 47271 ID 0x354620c2 dest 1: Archived Log entry 259652 added for thread 1 sequence 47272 ID 0x354620c2 dest 1: Tue Sep 24 22:28:28 2013 Archived Log entry 259653 added for thread 1 sequence 47273 ID 0x354620c2 dest 1: Archived Log entry 259654 added for thread 1 sequence 47274 ID 0x354620c2 dest 1: Archived Log entry 259655 added for thread 1 sequence 47275 ID 0x354620c2 dest 1: Archived Log entry 259656 added for thread 1 sequence 47276 ID 0x354620c2 dest 1: Archived Log entry 259657 added for thread 1 sequence 47277 ID 0x354620c2 dest 1: Archived Log entry 259658 added for thread 1 sequence 47278 ID 0x354620c2 dest 1: Archived Log entry 259659 added for thread 1 sequence 47279 ID 0x354620c2 dest 1: Tue Sep 24 22:28:39 2013 Archived Log entry 259660 added for thread 1 sequence 47280 ID 0x354620c2 dest 1: Archived Log entry 259661 added for thread 1 sequence 47281 ID 0x354620c2 dest 1: Archived Log entry 259662 added for thread 1 sequence 47282 ID 0x354620c2 dest 1: Tue Sep 24 22:29:39 2013 Thread 1 advanced to log sequence 47284 (LGWR switch) Current log# 8 seq# 47284 mem# 0: +DATA/q9db/onlinelog/group_8.1242.818697417 Tue Sep 24 22:31:18 2013 Thread 1 advanced to log sequence 47285 (LGWR switch) Current log# 16 seq# 47285 mem# 0: +DATA/q9db/onlinelog/group_16.1884.827003545 Thread 1 advanced to log sequence 47286 (LGWR switch) Current log# 17 seq# 47286 mem# 0: +DATA/q9db/onlinelog/group_17.1885.827003587
节点2日志
节点2中出现大量的IPC Send timeout
Tue Sep 24 15:22:19 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] ………… Tue Sep 24 18:51:55 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Tue Sep 24 18:57:54 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Receiver: inst 1 binc 464003926 ospid 1566 Tue Sep 24 19:03:57 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] Receiver: inst 1 binc 464003926 ospid 1566 Tue Sep 24 19:09:53 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)] ………… Tue Sep 24 20:22:00 2013 IPC Send timeout detected. Sender: ospid 4008 [oracle@q9db02.800best.com (PING)]
节点1因为不能归档hang住,节点2紧接着也就hang住。对节点1hang住之时对两个节点分别做systemstate dump,使用ass进行分析得到节点1和节点2的记录大体如下:
节点1
393:waiting for 'log file switch (archiving needed)' 394:waiting for 'log file switch (archiving needed)' Cmd: Insert 395:waiting for 'log file switch (archiving needed)' Cmd: Insert 397:waiting for 'log file switch (archiving needed)' Cmd: Insert 398:waiting for 'log file switch (archiving needed)' Cmd: Insert 451:waiting for 'SQL*Net message from client' 469:waiting for 'log file switch (archiving needed)' Cmd: Insert 470:waiting for 'log file switch (archiving needed)' Cmd: Insert 471:waiting for 'log file switch (archiving needed)' Cmd: Insert 618:waiting for 'log file switch (archiving needed)' Cmd: Insert 626:waiting for 'log file switch (archiving needed)' Cmd: Insert NO BLOCKING PROCESSES FOUND
节点2
515:waiting for 'gc buffer busy acquire' Cmd: Insert 516:waiting for 'gc buffer busy acquire' Cmd: Insert 517:waiting for 'gc buffer busy acquire' Cmd: Insert 518:waiting for 'gc buffer busy acquire' Cmd: Insert 519:waiting for 'gc buffer busy acquire' Cmd: Insert 520:waiting for 'gc buffer busy acquire' Cmd: Select 521:waiting for 'gc current request' Cmd: Insert 522:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C] Cmd: Select 523:waiting for 'gc buffer busy acquire' Cmd: Insert 524:waiting for 'SQL*Net message from client' 525:waiting for 'gc buffer busy acquire' Cmd: Insert 526:waiting for 'gc buffer busy acquire' Cmd: Insert 527:waiting for 'enq: TX - row lock contention'[Enq TX-00BA0020-001E3E3C] Cmd: Select 528:waiting for 'SQL*Net message from client' 529:waiting for 'gc buffer busy acquire' Cmd: Select Resource Holder State Enq TX-0005001E-0022374F 223: waiting for 'gc current request' Enq TX-0047001B-002BCEB2 247: waiting for 'gc current request' Enq TX-015B001E-000041FF 330: waiting for 'gc current request' Enq TX-00010010-002EA7CD 179: waiting for 'gc current request' Enq TX-00BA0020-001E3E3C ??? Blocker Object Names ~~~~~~~~~~~~ Enq TX-0005001E-0022374F Enq TX-0047001B-002BCEB2 Enq TX-015B001E-000041FF Enq TX-00010010-002EA7CD Enq TX-00BA0020-001E3E3C
通过这里,我们可以明白,节点2的很多事务hang住是因为请求gc current request,而该等待是因为节点1无法归档,有些block无法正常传输到节点2,导致节点2一直hang在这里,然后就出现IPC Send timeout;节点1上的事务阻塞甚至hang住是因为无法归档导致.到此需要定位的问题是为什么节点1不能归档
继续分析节点1 alert日志
Tue Sep 24 15:18:20 2013 opidrv aborting process O000 ospid (7332) as a result of ORA-28 Immediate Kill Session#: 1904, Serial#: 1065 Immediate Kill Session: sess: 0x24a2522a38 OS pid: 7338 Immediate Kill Session#: 3597, Serial#: 11107 Immediate Kill Session: sess: 0x24c27cf498 OS pid: 7320 Tue Sep 24 15:18:23 2013 opidrv aborting process W000 ospid (7980) as a result of ORA-28 Tue Sep 24 15:18:23 2013 opidrv aborting process W001 ospid (8560) as a result of ORA-28 Tue Sep 24 15:18:35 2013 LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: Detected ARCH process failure LGWR: STARTING ARCH PROCESSES Tue Sep 24 15:18:35 2013 ARC0 started with pid=66, OS id=10793 Tue Sep 24 15:18:35 2013 Errors in file /u01/app/oracle/diag/rdbms/q9db/q9db1/trace/q9db1_nsa2_12635.trc: ORA-00028: your session has been killed LNS: Failed to archive log 8 thread 1 sequence 47156 (28) ARC0: Archival started LGWR: STARTING ARCH PROCESSES COMPLETE Thread 1 advanced to log sequence 47157 (LGWR switch) ARC0: STARTING ARCH PROCESSES Current log# 9 seq# 47157 mem# 0: +DATA/q9db/onlinelog/group_9.1241.818697421 Tue Sep 24 15:18:36 2013 ARC1 started with pid=81, OS id=10805 Tue Sep 24 15:18:36 2013 ARC2 started with pid=84, OS id=10807 Tue Sep 24 15:18:36 2013 ARC3 started with pid=87, OS id=10809 ARC1: Archival started ARC2: Archival started ARC2: Becoming the 'no FAL' ARCH ARC2: Becoming the 'no SRL' ARCH ARC1: Becoming the heartbeat ARCH Error 1031 received logging on to the standby PING[ARC1]: Heartbeat failed to connect to standby 'q9adgdg'. Error is 1031. ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Archived Log entry 259135 added for thread 1 sequence 47156 ID 0x354620c2 dest 1: Error 1031 received logging on to the standby FAL[server, ARC3]: Error 1031 creating remote archivelog file 'q9adgdg' FAL[server, ARC3]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance q9db1 - Archival Error. Archiver continuing. Tue Sep 24 15:18:46 2013 opidrv aborting process O001 ospid (9605) as a result of ORA-28 Tue Sep 24 15:18:46 2013 opidrv aborting process O000 ospid (10813) as a result of ORA-28 Tue Sep 24 15:18:46 2013 Immediate Kill Session#: 2909, Serial#: 369 Immediate Kill Session: sess: 0x24226c7200 OS pid: 9091 Immediate Kill Session#: 3380, Serial#: 30271 Immediate Kill Session: sess: 0x2422782c58 OS pid: 10265 Immediate Kill Session#: 3597, Serial#: 11109 Immediate Kill Session: sess: 0x24c27cf498 OS pid: 10267 Tue Sep 24 15:20:14 2013 Restarting dead background process DIAG Tue Sep 24 15:20:14 2013 DIAG started with pid=64, OS id=20568 Restarting dead background process PING Tue Sep 24 15:20:14 2013 PING started with pid=68, OS id=20570 Restarting dead background process LMHB Tue Sep 24 15:20:14 2013 LMHB started with pid=70, OS id=20572 Restarting dead background process SMCO ………… Tue Sep 24 15:23:13 2013 ARC0: Detected ARCH process failure Tue Sep 24 15:23:13 2013 Thread 1 advanced to log sequence 47158 (LGWR switch) Current log# 10 seq# 47158 mem# 0: +DATA/q9db/onlinelog/group_10.1240.818697423 ARC0: STARTING ARCH PROCESSES ARC0: STARTING ARCH PROCESSES COMPLETE ARC0: Becoming the heartbeat ARCH ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance q9db1 - Archival Error ORA-00028: your session has been killed
查看ARCn进程
[oracle@q9db01 ~]$ ps -ef|grep ora_ar oracle 20718 12870 0 22:07 pts/14 00:00:00 grep ora_ar [oracle@q9db01 ~]$ ps -ef|grep ora_ar oracle 25998 12870 0 22:07 pts/14 00:00:00 grep ora_ar
这里基本上明白了,因为客户的系统从15:15开始由于中间件程序异常,导致大量会话连接数据库,然后dba为了防止其他业务不受影响,然后开始大量通过alter system kill session,误杀了不少系统进程,包括ARCn(0,1,2,3)进程,在后面ARCn进程因为某种原因无法正常启动,导致redo无法归档,所有的redo组写满系统即hang住,该系统由于大量kill session已经导致了实例本身异常(正常情况ARCn进程kill之后会自动重启),处理方案:先增加redo组配合定时人工归档,等待业务低峰重启节点1,解决问题。温馨提醒:kill进程请小心
发表在 Oracle RAC
2 条评论
ALERT: Disable Transparent HugePages on SLES11, RHEL6, OEL6 and UEK2 Kernels
随着去ioe的潮流越来越汹涌,随着内存越来越便宜,使用Liunx大内存跑核心业务的系统越来越多。因为大内存,多会话,使得配置HugePages也成了必然的选择,在以前的文章中介绍过怎么去配置HugePages(Liunx系统中Oracle使用HugePages配置)以及因为没有配置Hugepage导致的故障(因未配置Hugepage会话数添增悲剧案例).但是随着Linux 6(RedHat 6, OEL 6, SLES 11 and UEK2 kernels等)的广泛应用,他们所在HugePages方面所具有的新特性(Transparent HugePages)很多人可能还没有注意到,Oracle强烈建议无论是rac还是单实例都关闭该特性.
检查是否启用Transparent HugePages方法
To check if the Transparent HugePages are enabled in your server execute the following: Default/Enabled setting is [always]: # cat /sys/kernel/mm/transparent_hugepage/enabled [always] never Disabled setting is [never]: # cat /sys/kernel/mm/transparent_hugepage/enabled always [never] If "enabled" is NOT set to "[never]", the Transparent HugePages are being used. You can also issue: # grep AnonHugePages /proc/meminfo If the output contains a line like "AnonHugepages: xxxx kB", with a value > 0kB the kernel is using Transparent HugePages.
禁用Transparent HugePages方法
Add the following to the kernel boot line in /etc/grub.conf (this is the preferred method) and reboot the server: transparent_hugepage=never Once modified the line will read similar to the following example: title Oracle Linux Server (2.6.32-300.25.1.el6uek.x86_64) root (hd0,0) kernel /vmlinuz-2.6.32-300.25.1.el6uek.x86_64 ro root=LABEL=/ transparent_hugepage=never initrd /initramfs-2.6.32-300.25.1.el6uek.x86_64.img OR Add the following lines in /etc/rc.local and reboot the server: if test -f /sys/kernel/mm/transparent_hugepage/enabled; then echo never > /sys/kernel/mm/transparent_hugepage/enabled fi if test -f /sys/kernel/mm/transparent_hugepage/defrag; then echo never > /sys/kernel/mm/transparent_hugepage/defrag fi
具体见文档:ALERT Disable Transparent HugePages on SLES11 RHEL6 OEL6 and UEK2 Kernels(Doc ID 1557478.1)