CURSOR_SHARING=SIMILAR引起的悲剧

联系:手机/微信(+86 17813235971) QQ(107644445)QQ咨询惜分飞

标题:CURSOR_SHARING=SIMILAR引起的悲剧

作者:惜分飞©版权所有[未经本人同意,不得以任何形式转载,否则有进一步追究法律责任的权利.]

一个客户反馈说有一系统经常性负载比较高,让我帮忙分析原因
系统负载情况

[oracle@zwq-kfdialdb ~]$ top -c
top - 17:11:06 up 78 days,  1:12,  5 users,  load average: 124.83, 125.90, 112.13
Tasks: 836 total, 152 running, 684 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.1%us,  0.1%sy,  0.0%ni,  1.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132070908k total, 90494280k used, 41576628k free,  1147384k buffers
Swap: 67108856k total,        0k used, 67108856k free, 79109904k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                
12179 oracle    15   0 32.5g 142m 139m R 39.0  0.1   6:09.20 oracleahdial (LOCAL=NO)
11530 oracle    16   0 32.5g 469m 466m R 36.1  0.4  11:14.23 oracleahdial (LOCAL=NO)
11816 oracle    15   0 32.5g 467m 463m R 36.1  0.4   6:33.86 oracleahdial (LOCAL=NO)
11577 oracle    15   0 32.5g 480m 477m R 34.7  0.4   7:15.98 oracleahdial (LOCAL=NO)
12136 oracle    16   0 32.5g 455m 452m R 31.9  0.4   9:07.88 oracleahdial (LOCAL=NO)
11237 oracle    16   0 32.5g 997m 992m R 31.2  0.8  20:53.50 oracleahdial (LOCAL=NO)
11427 oracle    16   0 32.5g 137m 135m R 31.2  0.1  11:50.16 oracleahdial (LOCAL=NO)
12051 oracle    16   0 32.5g 459m 456m R 31.2  0.4   6:12.67 oracleahdial (LOCAL=NO)

[oracle@zwq-kfdialdb ~]$ vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
137  0      0 41566600 1147384 79109888    0    0     0     3    0    0  0  0 99  0  0
146  0      0 41567548 1147384 79109888    0    0     0    47 1058 32190 98  0  2  0  0
150  0      0 41568056 1147384 79109888    0    0     0    39 1081 31543 98  0  2  0  0
12  0      0 41568464 1147384 79109888    0    0     0    23 1056 32111 98  0  2  0  0
62  0      0 41568776 1147384 79109888    0    0     0    11 1067 31474 98  0  2  0  0
108  0      0 41568304 1147384 79109888    0    0     0    39 1059 31193 98  0  2  0  0
140  0      0 41569280 1147384 79109888    0    0     0    48 1063 31171 98  0  2  0  0
140  0      0 41569444 1147384 79109888    0    0     0    40 1075 30508 98  0  2  0  0

通过top和vmstat看出系统现在负载很高,主要都是用户进程导致.

查询等待事件

SQL> select event from v$session where wait_class#<>6;

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> /

EVENT
----------------------------------------------------------------
cursor: mutex S

SQL> SELECT a.*, s.sql_text
  2    FROM v$sql s,
  3         (SELECT sid,
  4                 event,
  5                 wait_class,
  6                 p1 cursor_hash_value,
  7                 p2raw Mutex_value,
  8                 TO_NUMBER (SUBSTR (p2raw, 1, 8), 'xxxxxxxx') hold_mutex_x_sid
  9            FROM v$session_wait
 10           WHERE event LIKE 'cursor%') a
 11   WHERE s.HASH_VALUE = a.cursor_hash_value
 12  /

no rows selected

SQL>  select event from v$session where wait_class#<>6;

EVENT
----------------------------------------------------------------
SQL*Net message to client

数据库开始的等待事件只有cursor: mutex S,等该等待事件消失后系统负载也恢复正常

再次查看系统负载

[oracle@zwq-kfdialdb ~]$ top -c -i10
top - 17:13:51 up 78 days,  1:15,  6 users,  load average: 12.57, 78.21, 96.45
Tasks: 702 total,   2 running, 700 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.5%us,  0.2%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  132070908k total, 86477808k used, 45593100k free,  1147500k buffers
Swap: 67108856k total,        0k used, 67108856k free, 79116036k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND        
15045 oracle    16   0 13136 1476  724 R 10.6  0.0   0:00.12 top -c -i10    
10010 root      10  -5     0    0    0 D  0.0  0.0   0:00.68 [kondemand/4]  
10019 root      10  -5     0    0    0 D  0.0  0.0   1:41.58 [kondemand/13] 
10020 root      10  -5     0    0    0 D  0.0  0.0   1:52.28 [kondemand/14] 
10021 root      10  -5     0    0    0 R  0.0  0.0   2:01.54 [kondemand/15] 
12166 root      24   0 10084  300  216 D  0.0  0.0   0:00.00 /opt/VRTSgab/gablogd

[oracle@zwq-kfdialdb ~]$  vmstat 3 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 45484264 1147528 79117360    0    0     0     3    0    0  0  0 99  0  0
 2  0      0 45468452 1147528 79117456    0    0     4   161 1372 5369  2  2 97  0  0
 3  0      0 45463712 1147528 79117584    0    0     4   187 1602 7253  6  0 93  0  0
 1  0      0 45458220 1147528 79117648    0    0     1    99 1358 5821  2  0 98  0  0
 0  0      0 45475168 1147528 79117712    0    0     0    41 1321 5321  2  0 98  0  0
 0  0      0 45473624 1147528 79117744    0    0     3   104 1378 5455  2  0 98  0  0
 2  0      0 45474656 1147528 79117776    0    0     0    55 1196 4872  1  0 99  0  0
 0  0      0 45474376 1147532 79117824    0    0     8   113 1170 4990  2  0 98  0  0
 1  0      0 45475440 1147532 79117872    0    0     1    56 1187 5301  3  0 97  0  0
 1  0      0 45475824 1147532 79117888    0    0     3    99 1083 4643  3  0 97  0  0

结合上面的等待事件查询,我们可以大概评估出来,当cursor: mutex S等待消失后,系统负载也恢复正常,现在已经不存在环境,如果要找出问题只能够是借助AWR和ASH

分析ASH
Top User Events

Top SQL with Top Events

Activity Over Time

通过对ASH分析,可以大概确定,在这段时间内,引起系统负载高主要是cursor: mutex S导致

分析AWR
awr整体信息(从这里看数据库是相当的繁忙)

Load Profile(从这里看数据库业务比较小)

Top 5 Timed Foreground Events(主要等待事件cursor: mutex S,和前面分析相符)

OS LOAD(虽然和系统看到有一定出入,但是整体还是展示系统负载较高)

SQL ordered by Version Count(出现cursor: mutex S,因为load profile中解析不多,所以想到高版本问题,这里确实非常高)

通过这里的一些列分析,我们已经基本上可以确定,该数据库因为高版本问题导致cursor: mutex S以及library cache 相关等待严重,从而出现系统负载过高.

找出高版本原因
高版本相关信息和查询请见:关于High Versions Count总结

SQL> select * from table(version_rpt('f8b9tba7sfsb5')); 

COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:13
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 000000080FA4CEA0  Hash_Value: 2408014181  SQL_ID f8b9tba7sfsb5
Sharable_Mem: 206315729 bytes   Parses: 48689
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.MOBILE_TELE_NO=:"SYS_B_1"
3


COLUMN_VALUE
--------------------------------------------------------------------------------
Versions Summary
----------------
AUTH_CHECK_MISMATCH :4
TRANSLATION_MISMATCH :4
ROLL_INVALID_MISMATCH :10219
PURGED_CURSOR :9

Total Versions:10219

Plan Hash Value Summary
-----------------------

COLUMN_VALUE
--------------------------------------------------------------------------------
Plan Hash Value Count
=============== =====
      791727930 920
     2820478500 9300

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for AUTH_CHECK_MISMATCH :

  # of Ver PARSING_USER_ID PARSING_SCHEMA_ID PARSING_SCHEMA_NAME
========== =============== ================= ===================
     10218               75                75 HOLLYSP

COLUMN_VALUE
--------------------------------------------------------------------------------
         2              107               107 HOLLYSP_TEST
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for TRANSLATION_MISMATCH :

Summary of objects probably causing TRANSLATION_MISMATCH

  Object# Owner.Object_Name
========= =================
    76737 HOLLYSP.TBL_SP_SALES_RECORDS
 HOLLYSP.TBL_SP_SALES_RECORDS
 HOLLYSP_TEST.TBL_SP_SALES_RECORDS

COLUMN_VALUE
--------------------------------------------------------------------------------
   107043 HOLLYSP_TEST.TBL_SP_SALES_RECORDS
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for ROLL_INVALID_MISMATCH :

No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for PURGED_CURSOR :

No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.

COLUMN_VALUE
--------------------------------------------------------------------------------
alter session set events
 'immediate trace name cursortrace address 2408014181, level LLL';
To turn it off do use address 1, level 2147483648
================================================================

59 rows selected.

SQL> select * from table(version_rpt('6zhjf3qh8gyp9'));   

COLUMN_VALUE
--------------------------------------------------------------------------------
Version Count Report Version 3.2.1 -- Today's Date 19-jul-12 18:15
RDBMS Version :11.2.0.1.0 Host: zwq-kfdialdb Instance 1 : ahdial
==================================================================
Addr: 00000007FD46D2A8  Hash_Value: 2693266089  SQL_ID 6zhjf3qh8gyp9
Sharable_Mem: 111904227 bytes   Parses: 4880
Stmt:
0 select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales
1 0_ where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and tblsps
2 ales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2
3 " and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES
4 _PERSON=:"SYS_B_4"

COLUMN_VALUE
--------------------------------------------------------------------------------
5

Versions Summary
----------------
BIND_MISMATCH :4804
INCOMP_LTRL_MISMATCH :372
HASH_MATCH_FAILED :4936

Total Versions:4935

Plan Hash Value Summary

COLUMN_VALUE
--------------------------------------------------------------------------------
-----------------------
Plan Hash Value Count
=============== =====
     1645985080 3
     2040125427 4933

~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for BIND_MISMATCH :

Consolidated details for :
BIND_MISMATCH,USER_BIND_PEEK_MISMATCH,BIND_UACS_DIFF and

COLUMN_VALUE
--------------------------------------------------------------------------------
BIND_EQUIV_FAILURE (Mislabled as ROW_LEVEL_SEC_MISMATCH BY bug 6964441 in 11gR1)

from v$sql_bind_capture
COUNT(*) POSITION MIN(MAX_LENGTH) MAX(MAX_LENGTH) DATATYPE (PRECISION,SCALE)
======== ======== =============== =============== ======== ================
    4936        1              32              32        1 (,)
    4936        2              32              32        1 (,)
    4936        3              32              32        1 (,)
    4936        4              32              32        1 (,)
    4936        5              32              32        1 (,)


COLUMN_VALUE
--------------------------------------------------------------------------------
SUM(DECODE(column,Y, 1, 0) FROM V$SQL
IS_OBSOLETE IS_BIND_SENSITIVE IS_BIND_AWARE IS_SHAREABLE
=========== ================= ============= ============
          0              4309             0         4309
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for INCOMP_LTRL_MISMATCH :

No details available
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for HASH_MATCH_FAILED :


COLUMN_VALUE
--------------------------------------------------------------------------------
No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.
alter session set events
 'immediate trace name cursortrace address 2693266089, level LLL';
To turn it off do use address 1, level 2147483648
================================================================

62 rows selected.

根据经验,出现这么多的高版本情况,很可能是cursor_sharing参数设置问题

--对应sql语句
select count(*) as col_0_0_ from TBL_SP_SALES_RECORDS tblspsales0_ 
where tblspsales0_.SALES_RECORDS_STATUS=:"SYS_B_0" and 
tblspsales0_.CAMPAIGN_ID=:"SYS_B_1" and tblspsales0_.IS_CONN=:"SYS_B_2" 
and tblspsales0_.SALES_TIME>=:"SYS_B_3" and tblspsales0_.SALES_PERSON=:"SYS_B_4"

--cursor_sharing参数
SQL> show parameter cursor_sharing;

NAME                                 TYPE                             VALUE
------------------------------------ -------------------------------- ----------------
cursor_sharing                       string                           SIMILAR

根据oracle官方建议在11g中不推荐使用cursor_sharing=SIMILAR,其实在所有版本中都不推荐,设置为该值很容易导致高版本问题.而且该值会出现莫名其妙的,无法解释的高版本问题.而且根据oracle相关文档,在即将发布的12c版本中,将除掉SIMILAR值.对于客户库的该问题,因为很多sql未绑定参数,为了减少硬解析,建议在业务低谷时设置cursor_sharing=FORCE,并刷新shared pool.

此条目发表在 Oracle性能优化 分类目录。将固定链接加入收藏夹。

CURSOR_SHARING=SIMILAR引起的悲剧》有 2 条评论

  1. 惜分飞 说:

    Cursor Mutex S Wait Event And Too Many Child Cursors

    Applies to:
    Oracle Waveset - Version 8.1.1 and later
    Information in this document applies to any platform.
    
    Goal
    Cursor Mutex S wait event and too many child cursors open when cursor sharing is set to similar.
    Customer upgraded the DB Repository for Oracle Waveset from Oracle 10g to  Oracle 11g.  
    Cursor sharing was set to SIMILAR as per the Oracle Waveset Documentation in 10g.
    Child cursors were getting released in 10g. Customer noticed child cursors were not getting released in 11g.
    As per the documentation in MOS 1169017.1 cursor sharing should be exact or force
    
    Customer had set the following in their environment with Oracle Waveset.
    cursor_sharing String EXACT       *** similiar
    open_cursors Integer 1000          *** 1000
    The above setting uses more CPU but the child processes are released.
    
    Fix
    The recommendation to set the 'cursor_sharing' option as 'SIMILAR' 
    was to remedy the known issues when 'cursor_sharing' was set to be 
    the default 'EXACT' option in previous versions of Oracle Waveset and Oracle 10g as repository.
    
    Since the Oracle 11g database is deprecates the 'SIMILAR' option, 
    the recommendation now would be to set 'cursor_sharing' to be 'FORCE'.
    
  2. 惜分飞 说:

    关于高版本的一些mos文章
    Troubleshooting: High Version Count Issues [ID 296377.1]
    High SQL Version Counts – Script to determine reason(s) [ID 438755.1]
    ANNOUNCEMENT: Deprecating the cursor_sharing = ‘SIMILAR’ setting [ID 1169017.1]
    High Version Count with CURSOR_SHARING = SIMILAR or FORCE from pre 10g to 11.2.0.3 [ID 1401814.1]