11.2.0.3 adg库因 bug 16427872 导致smon占用大量cpu

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:11.2.0.3 adg库因 bug 16427872 导致smon占用大量cpu

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

检查数据库发现客户有一套核心的ADG库smon进程负载异常,单进程一直持有cpu 100%

[oracle@q9adg01 trace]$ top -c
top - 14:00:14 up 83 days, 21:39,  4 users,  load average: 10.34, 11.55, 11.25
Tasks: 1162 total,   3 running, 1157 sleeping,   0 stopped,   2 zombie
Cpu(s):  1.7%us,  1.2%sy,  0.0%ni, 86.2%id, 10.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264253752k total, 200445076k used, 63808676k free,   757684k buffers
Swap: 33554424k total,        0k used, 33554424k free,  6529220k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND        
 5707 oracle    25   0  150g  20m  16m R 99.9  0.0  14273:00 ora_smon_q9db1 
 5285 oracle    16   0 13564 1952  820 R 31.5  0.0   0:02.49 top -c         
 5713 oracle    18   0  150g  20m  17m S  5.3  0.0 410:01.33 ora_asmb_q9db1 
 5821 oracle    15   0  150g  23m  17m S  5.3  0.0   4883:29 ora_lck0_q9db1 
 7596 oracle    15   0  150g  69m  37m S  5.3  0.0   5368:28 ora_pr00_q9db1 


[oracle@q9adg02 ~]$ top -c
top - 14:00:03 up 84 days, 19:36,  3 users,  load average: 6.46, 6.96, 6.76
Tasks: 1045 total,   5 running, 1040 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.8%us,  1.0%sy,  0.0%ni, 93.4%id,  3.7%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264253752k total, 196879216k used, 67374536k free,   425320k buffers
Swap: 33554424k total,        0k used, 33554424k free,  4727836k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND        
11615 oracle    25   0  150g  16m  14m R 100.0  0.0  14272:55 ora_smon_q9db2
18173 oracle    16   0  150g  73m  37m D 18.6  0.0  24:33.91 oracleq9db2 (LOCAL=NO) 
 6561 oracle    15   0  150g  31m  25m R 12.2  0.0   0:48.50 oracleq9db2 (LOCAL=NO)  

数据库版本和patch信息

14:18:05 sys@Q9DB>select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production
CORE    11.2.0.3.0      Production
TNS for Linux: Version 11.2.0.3.0 - Production
NLSRTL Version 11.2.0.3.0 - Production


SQL>  SELECT INST_ID,DATABASE_ROLE,OPEN_MODE FROM GV$DATABASE;

   INST_ID DATABASE_ROLE    OPEN_MODE
---------- ---------------- --------------------
         2 PHYSICAL STANDBY READ ONLY WITH APPLY
         1 PHYSICAL STANDBY READ ONLY WITH APPLY

SQL >select inst_id,STARTUP_TIME from gv$instance;

   INST_ID STARTUP_T
---------- ---------
         2 01-NOV-13
         1 01-NOV-13

[oracle@q9adg01 trace]$ /u01/app/oracle/product/11.2.0/db_1/OPatch/opatch  lspatches
16056266;Database Patch Set Update : 11.2.0.3.6 (16056266)
16315641;Grid Infrastructure Patch Set Update : 11.2.0.3.6 (16083653)

SYSAUX表空间增加数据文件

SQL> select ts# from v$tablespace where name='SYSAUX';

       TS#
----------
         2

SQL> select file#,name,creation_time from v$datafile where ts#=2;

     FILE# NAME                                               CREATION_
---------- -------------------------------------------------- ---------
         3 +DATA/q9db/datafile/sysaux.1412.818566605          12-MAR-08
       151 +DATA/q9db/datafile/sysaux.1431.818566885          26-MAR-12
       221 +DATA/q9db/datafile/sysaux.828.818547945           16-APR-12
      1744 +DATA/q9db_adg/datafile/sysaux.2050.835459505      29-DEC-13

核对数据库确实在2013年12月29日对SYSAUX表空间增加了数据文件而且未重启数据库,触发Bug 16427872 Standby SMON spins on CPU after add/drop SYSAUX datafile on primary
bug-16427872
Bug 16427872 Standby SMON spins on CPU after add/drop SYSAUX datafile on primary
在12.1.0.1中修复,在未修复前增加/删除sysaux的数据文件后,通过重启实例来解决该问题

此条目发表在 Oracle Bug 分类目录,贴了 标签。将固定链接加入收藏夹。

11.2.0.3 adg库因 bug 16427872 导致smon占用大量cpu》有 1 条评论

  1. 惜分飞 说:

    Bug 16427872 : SMON STUCK ON LATCH FREE ON ACTIVE DATA GUARD AFTER ADDING A DATAFILE IN PRIMARY

    Hdr: 16427872 11.2.0.3 RDBMS 11.2.0.3 SPACE PRODID-5 PORTID-226
    Abstract: SMON STUCK ON LATCH FREE ON ACTIVE DATA GUARD AFTER ADDING A DATAFILE IN PRIMARY
     
    *** 03/03/13 05:24 pm *** (ADD: Impact/Symptom->WRONG RESULT - INCO...)
    *** 03/03/13 05:24 pm ***
     
    BUG TYPE CHOSEN
    ===============
    Code
     
    == Component: RDBMS ==
    ======================
    DETAILED PROBLEM DESCRIPTION
    ============================
    The SMON process appears stuck on Latch Free on Active Data Guard after a
    datafile was added to the sysaux tablespace in the primary database.  The
    SMON is moving between these few functions: ktm_back_to_smon, ktm_set_curfun,
    ktmmon, ktmSmonMain.
     
    DIAGNOSTIC ANALYSIS
    ===================
    We obtained the pstack outputs of the smon process
        ps -ef | grep smon
        script /tmp/stack.<pid of smon>
        date
        pstack <pid of smon>
        -- wait for 15 seconds
        date
        pstack <pid of smon>
        -- wait for 15 seconds
        date
        pstack <pid of smon>
        exit
     
    In stack12124.htm we can find the pstack of smon as follow:
     
    876:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    date
    Thu Feb 28 23:22:57 GMT 2013
    877:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    pstack 12124
    #0  0x000000339f698b17 in times () from /lib64/libc.so.6
    #1  0x00000000092cde1f in sltrgatime64 ()
    #2  0x000000000107967a in ktm_set_curfun ()
    #3  0x00000000010744c3 in ktmmon ()
    #4  0x0000000001073198 in ktmSmonMain ()
    #5  0x00000000023343b3 in ksbrdp ()
    #6  0x000000000255a4ff in opirip ()
    #7  0x000000000180c839 in opidrv ()
    #8  0x0000000001e01f9b in sou2o ()
    #9  0x0000000000a0b3ea in opimai_real ()
    #10 0x0000000001e07f0c in ssthrdmain ()
    #11 0x0000000000a0b2d1 in main ()
    878:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    date
    Thu Feb 28 23:23:35 GMT 2013
    880:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    pstack 12124
    #0  0x000000339f698b17 in times () from /lib64/libc.so.6
    #1  0x00000000092cde1f in sltrgatime64 ()
    #2  0x00000000010798ef in ktm_back_to_smon ()
    #3  0x00000000010744e8 in ktmmon ()
    #4  0x0000000001073198 in ktmSmonMain ()
    #5  0x00000000023343b3 in ksbrdp ()
    #6  0x000000000255a4ff in opirip ()
    #7  0x000000000180c839 in opidrv ()
    #8  0x0000000001e01f9b in sou2o ()
    #9  0x0000000000a0b3ea in opimai_real ()
    #10 0x0000000001e07f0c in ssthrdmain ()
    #11 0x0000000000a0b2d1 in main ()
    881:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    pstack 12124
    #0  0x000000339f698b17 in times () from /lib64/libc.so.6
    #1  0x00000000092cde1f in sltrgatime64 ()
    #2  0x000000000107967a in ktm_set_curfun ()
    #3  0x0000000001074425 in ktmmon ()
    #4  0x0000000001073198 in ktmSmonMain ()
    #5  0x00000000023343b3 in ksbrdp ()
    #6  0x000000000255a4ff in opirip ()
    #7  0x000000000180c839 in opidrv ()
    #8  0x0000000001e01f9b in sou2o ()
    #9  0x0000000000a0b3ea in opimai_real ()
    #10 0x0000000001e07f0c in ssthrdmain ()
    #11 0x0000000000a0b2d1 in main ()
    883:skywp_nwk-p20adm-lapd04.corp.apple.com:/home/oracle/skywp/admin/backdump>
    exit
     
    This indicates that smon is moving between these few functions:
    ktm_back_to_smon, ktm_set_curfun, ktmmon and ktmSmonMain.  This is matched to
    the errorstack level 3.
     
    A repeated query from v$session and v$process shows:
    TIME			 SID	   SEQ# PNAME	      P1	 P2	    P3	WAIT_TIME SECONDS_IN_WAIT STATE
    	      EVENT
    ----------------- ---------- ---------- ----- ---------- ----------
    ---------- ---------- --------------- -------------------
    ----------------------------------------------------------------
    13-03-01:04:03:34	1639	   3702 SMON  1610802608	259	     0	       -1	   
    25277 WAITED SHORT TIME   latch free 
     
    TIME			 SID	   SEQ# PNAME	      P1	 P2	    P3	WAIT_TIME SECONDS_IN_WAIT STATE
    	      EVENT
    ----------------- ---------- ---------- ----- ---------- ----------
    ---------- ---------- --------------- -------------------
    ----------------------------------------------------------------
    13-03-01:04:03:22	1639	   3702 SMON  1610802608	259	     0	       -1	   
    25325 WAITED SHORT TIME   latch free 
     
    TIME			 SID	   SEQ# PNAME	      P1	 P2	    P3	WAIT_TIME SECONDS_IN_WAIT STATE
    	      EVENT
    ----------------- ---------- ---------- ----- ---------- ----------
    ---------- ---------- --------------- -------------------
    ----------------------------------------------------------------
    13-03-01:04:03:09	1639	   3702 SMON  1610802608	259	     0	       -1	   
    25433 WAITED SHORT TIME   latch free
     
    WORKAROUND?
    ===========
    No
     
    TECHNICAL IMPACT
    ================
    Standby database unusable.
     
    RELATED ISSUES (bugs, forums, RFAs)
    ===================================
    Few bugs found:
    Defect 13712579 SMON CPU SPIKES ON ADG PHYSICAL STANDBY
    Defect 13060518 ADG 11.2.0.2 - SMON TAKES 100% CPU ON STANDBY SINCE ADDING A
    DATAFILE ON PRIMARY
    Defect 16092439 CDB EXIT : SMON IN ADG AT 100% CPU IN KTM
    All of them are not closed.
     
     
    We have set the following event:
    alter session set tracefile_identifier='SMON';
    alter system set events ' 10500 level 15 ';
    After few minutes got a 1GB trace file smon.trc.gz, in it:
    ...
        0: waited for 'latch free'
            address=0x6002e5b0, number=0x103, tries=0x0
            wait_id=3701 seq_num=3702 snap_id=1
            wait times: snap=0.000069 sec, exc=0.000069 sec, total=0.000069 sec
            wait times: max=infinite
            wait counts: calls=0 os=0
            occurred after 287 min 5 sec of elapsed time
         1: waited for 'smon timer'
            sleep time=0x12c, failed=0x0, =0x0
            wait_id=3700 seq_num=3701 snap_id=1
            wait times: snap=4 min 43 sec, exc=4 min 43 sec, total=4 min 43 sec
            wait times: max=5 min 0 sec
            wait counts: calls=1 os=95
            occurred after 0.003400 sec of elapsed time
         2: waited for 'smon timer'
            sleep time=0x12c, failed=0x0, =0x0
            wait_id=3699 seq_num=3700 snap_id=1
            wait times: snap=5 min 0 sec, exc=5 min 0 sec, total=5 min 0 sec
            wait times: max=5 min 0 sec
            wait counts: calls=1 os=99
            occurred after 0.000032 sec of elapsed time
    ...
    The above looks very similar to defect 13060518 "ADG 11.2.0.2 - SMON TAKES
    100% CPU ON STANDBY SINCE ADDING A DATAFILE ON PRIMARY" (status 11).
     
    HOW OFTEN DOES THE ISSUE REPRODUCE AT CUSTOMER SITE?
    ====================================================
    Intermittent
     
    DOES THE ISSUE REPRODUCE INTERNALLY?
    ====================================
    No
     
    IS A TESTCASE AVAILABLE?
    ========================
    No
     
    Link to IPS Package:
    ====================
    "not available"