HP Itaniums上一次ORA-240异常处理

一、问题由来
今天有网友和我说，他的数据库始终只能mount，不能打开到open状态，具体状态为：

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01154: database busy. Open, close, mount, and dismount not allowed now

启动数据库一直停留在Database mounted.
最后报ORA-03113: end-of-file on communication channel

二、数据库环境

OS version：HP-UX Itanium  version 11.31
Oracle version：Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit

三、日志文件错误
查看alert日志

Fri Nov 11 09:31:01 2011
Completed: ALTER DATABASE   MOUNT
Fri Nov 11 09:31:08 2011
ALTER DATABASE OPEN
Fri Nov 11 09:40:32 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_dbw0_3122.trc:
ORA-00240: control file enqueue held for more than 120 seconds
Fri Nov 11 09:47:18 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_dbw0_3122.trc:
Fri Nov 11 10:02:53 2011
Errors in file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 3122'
Fri Nov 11 10:02:55 2011
System State dumped to trace file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc
Killing enqueue blocker (pid=3122) on resource CF-00000000-00000000
 by killing session 54.1
Fri Nov 11 10:07:57 2011
Errors in file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc:
ORA-00494: enqueue [CF] held for too long (more than 900 seconds) by 'inst 1, osid 3122'
Fri Nov 11 10:07:59 2011
System State dumped to trace file /opt/oracle/oracle1/admin/gminer/udump/gminer_ora_3555.trc
Killing enqueue blocker (pid=3122) on resource CF-00000000-00000000
 by terminating the process
USER: terminating instance due to error 2103
Instance terminated by USER, pid = 3555

四、分析ORA-00240和ORA-00494

[oracle@node1 ~]$ oerr ora 240
00240, 00000, "control file enqueue held for more than %s seconds"
// *Cause:  The current process did not release the control file enqueue within
//          the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
//          Services with the error information.
[oracle@node1 ~]$ oerr ora 494
00494, 00000, "enqueue%s held for too long%s by '%s'"
// *Cause:  The specified process did not release the enqueue within
//          the maximum allowed time.
// *Action: Reissue any commands that failed and contact Oracle Support
//          Services with the incident information.
ORA-00240: control file enqueue held for more than 120 seconds
ORA-00494: enqueue [CF] held for too long (more than 900 seconds)

dbw0进程持有control file enqueue超过了默认的900s，导致系统kill掉持有这个enqueue的会话，然后数据库自动终止。现在问题定位在control file enqueue的持有时间为什么会超时(ORA-00240)上面

五、深入理解ORA-00240

ORA-00240: control file enqueue held for more than 120 seconds
The Developers confirmed that this is just a warning to let the DBA know that a CF enqueue is being held for more than 120 seconds. 
This is not an error, and error will occur if a CF enqueue is held for more than 900 seconds (15 minutes) and this is not the case here.
The message occurs when there are many datafiles in the database. DBWriter (dbw0) is taking too much time to release the CF enqueue due to having to open these datafiles.

从mos的这几句中，可以分析出来，是因为dbw0在open数据文件的过程中占用了太多时间的CF enqueue。这说明和存储或者asm有关系

六、继续分析日志
分析日志，找出出现ORA-00240和ORA-00494之前出现过什么错误

Stopping Job queue slave processes, flags = 7
Thu Nov 10 17:52:26 2011
Job queue slave processes stopped
Thu Nov 10 17:54:23 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_asmb_26101.trc:
ORA-00600: internal error code, arguments: [kffmAllocate_1], [8], [2], [], [], [], [], []
Thu Nov 10 17:54:24 2011
Errors in file /opt/oracle/oracle1/admin/gminer/bdump/gminer_asmb_26101.trc:
ORA-00600: internal error code, arguments: [kffmAllocate_1], [8], [2], [], [], [], [], []
Thu Nov 10 17:54:24 2011
ASMB: terminating instance due to error 486
Termination issued to instance processes. Waiting for the processes to exit
Instance terminated by ASMB, pid = 26101

通过这个，发现和asm有关系，更加确定了自己的猜测

七、分析asm日志

Fri Nov 11 09:44:42 2011
Errors in file /opt/oracle/oracle1/admin/+ASM/bdump/+asm_gmon_24008.trc:
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
HPUX-ia64 Error: 11: Resource temporarily unavailable
Additional information: 4
Additional information: 2044
Additional information: -1

+asm_gmon_24008.trc中内容也就报这些错误，不再贴出

八、mos中查找ORA-27091和ORA-27072
Bug 8236874 – HP-Itanium: Intermittent ORA-27091, ORA-27072, Additional information: 4 on HPUX中有类此描述

  ORA-01115: IO error reading block from file fff (block # bbb) 
  ORA-01110: data file 5: '/dev/xxx'
  ORA-27091: unable to queue I/O 
  ORA-27072: File I/O error 
  Additional information: 4  
  Additional information: xxx
  Additional information: yyy

问题很类此，进一步确信是asm的问题

八、在查询mos中ORA-00240和asm相关主题
发现HP Itanium – ORA-240 or process on ASM & Database hang [ID 1105825.1]，阅读完毕，马上查去阅读gminer_ora_3555.trc文件
果然发现和mos中一致的描述

*** SESSION ID:(40.1) 2011-11-11 09:36:59.860
Waited for detached process: DBW0 for 300 seconds:
*** 2011-11-11 09:36:59.860
Dumping diagnostic information for DBW0:
OS pid = 3122
loadavg : 0.00 0.01 0.01
Swapinfo : 
	Avail = 124485.08Mb Used = 62062.20Mb
	Swap free = 62422.88Mb Kernel rsvd = 4492.72Mb
	Free Mem  = 60374.91Mb 
  F S      UID   PID  PPID  C PRI NI             ADDR   SZ            WCHAN    STIME TTY       TIME COMD
1401 S   oracle  3122     1  0 154 20 e00000090de8b980 55372 e0000009203b1340 09:19:12 ?         0:06 ora_dbw0_gminer
Attaching to program: /opt/oracle/oracle1/bin/oracle, process 3122
warning: The shared libraries were not privately mapped; setting a
breakpoint in a shared library will not work until you rerun the program;
stepping over longjmp calls will not work as expected.
Please set the kernel variable "shlib_debug_enable" to 1 to enable the shared library debugging
warning: Load module /opt/oracle/oracle1/lib/libskgxp10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libhasgen10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocr10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocrb10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libocrutl10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libdbcfg10.so has been stripped.  
Debugging information is not available.
warning: Load module /opt/oracle/oracle1/lib/libnnz10.so has been stripped.  
Debugging information is not available.
0xc00000000043e3f0:0 in pw_wait+0x30 () from /usr/lib/hpux64/libc.so.1
(gdb) (gdb) #0  0xc00000000043e3f0:0 in pw_wait+0x30 () from /usr/lib/hpux64/libc.so.1
#1  0x40000000094e5ee0:0 in pw_wait () at sskgp.c:2245
#2  0x4000000002cb36c0:0 in sskgpwwait () at sskgp.c:926
#3  0x4000000002f16440:0 in skgpwwait () at skgp.c:2137
#4  0x4000000002e28770:0 in ksliwat () at ksl.c:7428
#5  0x400000000402a7f0:0 in kslwaitns_timed () at ksl.c:7652
#6  0x4000000002d3de40:0 in kskthbwt () at ksk.c:2263
#7  0x4000000002d3c940:0 in kslwait () at ksl.c:7616
#8  0x40000000041780e0:0 in ksvsubmit () at ksv.c:1975
#9  0x4000000004d5b710:0 in kfncSlaveSubmit () at kfnc.c:2913
#10 0x4000000004d63120:0 in kfncFileIdentify () at kfnc.c:1638
#11 0x4000000004cfb890:0 in kfioIdentify () at kfio.c:1431
#12 0x4000000004137c20:0 in ksfd_osmopn () at ksfd.c:16583
#13 0x40000000044c9850:0 in $cold_ksfdopn+0xc0 () at kfio.c:1654
#14 0x4000000002630aa0:0 in kcfbid () at kcf.c:13447
#15 0x40000000025b72f0:0 in kcfida () at kcf.c:13599
#16 0x4000000002d3bc80:0 in ksbabs () at ksb.c:1390
#17 0x4000000003aed5f0:0 in ksbrdp () at ksb.c:2971
#18 0x4000000002807f50:0 in opirip () at opirip.c:283
#19 0x400000000200f680:0 in $cold_opidrv+0x580 ()
#20 0x400000000277edf0:0 in sou2o () at sou2o.c:86
#21 0x4000000001fff020:0 in $cold_opimai_real+0x280 ()
#22 0x400000000269c5e0:0 in main () at opimai.c:173
(gdb) Detaching from program: /opt/oracle/oracle1/bin/oracle, process 3122

九、HP官网描述

Issue
Any questions on the use of software developed and maintained by Oracle should involve Oracle technical support. Oracle, not HP, supports Oracle software.
All ITRC HP-UX version 11.31 March 2009 patch bundles were applied to two HP Integrity rx3600-based Oracle RAC nodes. After this installation, the crsd.bin process from the Oracle RAC Clustering software seemed to hang at random intervals, causing a cluster check process launched by a third party to wait indefinitely and to eventually consume all available RAM memory.
A situation where crsd.bin hangs may cause RAC to issue a Transfer of Control (TOC), that is, a system restart with crash dump.

Solution
If any patches related to Oracle RAC are included in a patch bundle (whether it is HP-UX specific patching that RAC depends upon or vice versa), RAC requires that the binaries be relinked prior to starting them.
An alternative might be to (temporarily) uninstall the patch bundle.
The relink of the binaries should be done by a DBA, they should know how to perform that action.
In the current situation, relinking the Oracle RAC binaries resolved the issue.

NOTE:
It seems this problem can also be caused by the installation of HP patches PHKL_38762 or PHKL_39145. QXCR1000940361 tracked the problem. This has been fixed now, and the problem is not present anymore when installing the superseded patch PHKL_40208.
The problem of QXCR1000940361 has existed before the fix of PHKL_39145 was introduced so we dont set a patch warning.

十、解决方案
问题定位完毕是因为缺少打HP patches PHKL_38762 or PHKL_39145的补丁导致
根据HP官网的建议，打HP patches PHKL_38762 or PHKL_39145解决这个问题，也可以打上PHKL_40208补丁问题解决

十一、补充说明
1、因为系统以前可以运行，那么我猜测这个bug应该有偶然性，让网友重启asm，然后启动数据库正常
2、在处理问题过程中，因缺少耐心，开始没有仔细阅读trace文件，导致问题分析过程中走了一些弯路
3、这个是我第一次发现因为系统补丁没有打，导致数据库异常

《HP Itaniums上一次ORA-240异常处理》有 2 条评论

问心说：

2011 年 11 月 13 日22:59

其实跟这里说有点像。
http://blog.sina.com.cn/s/blog_548668f10100obv2.html
只是我们是HP-UX 11.31和IBM DS4800的存储。跟他的现象是一样的，当时我们用OSM管理软件查看，发现LUN 在控制器A和控制器B 之间不断切换，导致48存储的读写性能很低。但当时我们是改了这里才行的。scsimgr set_attr -D /dev/rdisk/disk149 -a load_bal_policy=preferred_path
load_bal_policy 的策略可以设置为如下模式：
• round_robin
以循环方式选择路径。当设备的所有路径具有相似的I/O 循环特性时，该参数将更为合适。
• least_cmd_load
选择活动的I/O 请求数最少的LUN 路径来执行下一个I/O。当LUN 路径表现出不对称延迟特性时，该策略将比较合适。分配负载以优化每个LUN 路径上的带宽。
• cl_round_robin（可识别单元的循环）
此负载平衡策略适用于基于单元的HP 平台。在启动了I/O 的CPU 的位置内以循环方式选择LUN 路径，以确保优化内存访问延迟。
• preferred_path
I/O 传输优先选择preferred_path 属性中设置的I/O 路径。如果该I/O 路径不可用，或未设置preferred_path属性，则I/O 传输将选择任何其他路径。该策略对于某些磁盘阵列（这些磁盘阵列同时通过LUN 的多个I/O 路径传输I/O 时可能出现性能下降）很有用。
惜分飞说：

2011 年 11 月 11 日14:47

网友反馈他们的hp 系统和IBM ds48存储连接在一起的时候，本身访问就是非常慢，用dd测试只有几十k/s，后来根据ibm和hp的建议使用了scsimgr set_attr -D /dev/rdisk/disk149 -a load_bal_policy=preferred_path命令，修改了一下磁盘的选路方式，是的访问速度达到200m/s，这次出现这个问题也有可能是因为系统重启，磁盘的选路方式还原了导致系统读取存储数据异常的慢，然后oracle 出现ora-240d错误。

因为同时他们的hp的补丁也是没有打，所以不能完全的确定到底是因为磁盘访问路径还是补丁导致了这个问题

《HP Itaniums上一次ORA-240异常处理》有 2 条评论

文章分类

最近发表

HP Itaniums上一次ORA-240异常处理

《HP Itaniums上一次ORA-240异常处理》有 2 条评论

标签云

文章分类

最近发表