Tragedy Caused by Redo and Undo Loss in a Group Company's EBS Database

Contact: phone/WeChat (+86 17813235971), QQ (107644445); consult 惜分飞 via QQ

Author: 惜分飞 © All rights reserved [Reproduction in any form is prohibited without the author's consent; the author reserves the right to pursue legal liability.]

Due to insufficient disk space, a group company's EBS system kept its redo and undo on a RAID 0 array, and the database had no backup of any kind. The tragedy duly arrived: a RAID 0 failure wiped out all of the redo and undo, and the database could no longer start (by the time I took over, the database had already been opened with resetlogs, without success).

Sun Jul 27 11:31:27 2014
SMON: enabling cache recovery
SMON: enabling tx recovery
Sun Jul 27 11:31:27 2014
Database Characterset is ZHS16GBK
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_454754.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 42 cannot be read at this time
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
Sun Jul 27 11:31:27 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_663670.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-00376: file 41 cannot be read at this time
ORA-01110: data file 41: '/prod/oracle/PROD/logdata/undo/undo2.dbf'
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 663670
ORA-1092 signalled during: ALTER DATABASE OPEN...

Checking the status of the relevant files shows that the undo tablespace datafiles were lost and had been taken offline:
[Screenshot: datafile status output showing the lost undo datafiles marked offline]
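
A sketch of the kind of query behind that screenshot (the exact columns in the original output may have differed):

select file#, status, enabled, name
  from v$datafile
 where name like '%undo%';
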
Because the earlier alert logs had been purged, I can only infer from this that the lost undo datafiles were taken offline and the database was then opened with resetlogs. The approach now is to mask the rollback segments with _corrupted_rollback_segments and then try to start the database.
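
A minimal sketch of that step, assuming the damaged segment names were first pulled from the data dictionary (the _SYSSMU*$ names are illustrative placeholders, not the actual segments from this case):

select segment_name, status from dba_rollback_segs where status = 'NEEDS RECOVERY';

-- pfile entry masking those segments at startup
_corrupted_rollback_segments = ('_SYSSMU1$','_SYSSMU2$','_SYSSMU3$')

SQL> startup pfile='/tmp/initPROD.ora';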

Tue Jul 29 11:40:39 2014
SMON: enabling cache recovery
SMON: enabling tx recovery
Tue Jul 29 11:40:39 2014
Database Characterset is ZHS16GBK
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_smon_569378.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Tue Jul 29 11:40:39 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_585786.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number  with name "" too small
Error 604 happened during db open, shutting down database
USER: terminating instance due to error 604
Instance terminated by USER, pid = 585786
ORA-1092 signalled during: alter database open...

This error arises because the database needs to locate the corresponding rollback segment during startup, but the undo damage makes that rollback segment impossible to find. The fix is to modify the SCN so that Oracle stops looking for the rollback segment, which suppresses the error. Once the database is open, drop the old undo and create a new undo tablespace.
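
One commonly used way to push the SCN forward on 10.2 is the hidden parameter _minimum_giga_scn (a sketch of that approach, not necessarily the exact method used here; its unit is 2^30 SCNs, and the value must land above any SCN recorded in the datafiles, so 10 below is purely illustrative):

-- pfile entry; remove it again once the database is stable
_minimum_giga_scn = 10

SQL> startup pfile='/tmp/initPROD.ora';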

Tue Jul 29 15:59:22 2014
drop tablespace undo2 including contents and datafiles
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01122: database file 41 failed verification check
ORA-01110: data file 41: '/prod/oracle/PROD/logdata/undo/undo2.dbf'
ORA-01565: error in identifying file '/prod/oracle/PROD/logdata/undo/undo2.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo2.dbf
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01122: database file 42 failed verification check
ORA-01110: data file 42: '/prod/oracle/PROD/logdata/undo/undo1.dbf'
ORA-01565: error in identifying file '/prod/oracle/PROD/logdata/undo/undo1.dbf'
ORA-27037: unable to obtain file status
IBM AIX RISC System/6000 Error: 2: No such file or directory
Additional information: 3
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo2.dbf
Tue Jul 29 15:59:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_782490.trc:
ORA-01259: unable to delete datafile /prod/oracle/PROD/logdata/undo/undo1.dbf
Tue Jul 29 15:59:23 2014
Completed: drop tablespace undo2 including contents and datafiles
Tue Jul 29 15:59:56 2014
create undo tablespace undotbs1 datafile '/prod/oracle/PROD/logdata/undo_new01.dbf' size 100M autoextend on next 128M maxsize 30G
Tue Jul 29 15:59:57 2014
Completed: create undo tablespace undotbs1 datafile '/prod/oracle/PROD/logdata/undo_new01.dbf' size 100M autoextend on next 128M maxsize 30G
Tue Jul 29 16:00:03 2014
alter tablespace undotbs1 add datafile '/prod/oracle/PROD/logdata/undo_new02.dbf' size 100M autoextend on next 128M maxsize 30G
Completed: alter tablespace undotbs1 add datafile '/prod/oracle/PROD/logdata/undo_new02.dbf' size 100M autoextend on next 128M maxsize 30G

While the business was running, the database threw large numbers of ORA-600 [4097], ORA-600 [kdsgrp1], and ORA-600 [kcfrbd_3] errors.

Tue Jul 29 16:07:03 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_950484.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:07:06 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_950484.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Tue Jul 29 16:10:06 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_917702.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:10:07 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_917702.trc:
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], []
Tue Jul 29 16:12:45 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/bdump/prod_m000_880692.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:21:23 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:21:37 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:21:56 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:22:18 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1040638.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [41], [231381], [1], [12800], [12800], [], []
Tue Jul 29 16:22:28 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1105950.trc:
ORA-00600: internal error code, arguments: [4097], [], [], [], [], [], [], []
Tue Jul 29 16:22:33 2014
Errors in file /prod/oracle/PROD/db/tech_st/10.2.0/admin/PROD_erpserver/udump/prod_ora_1159232.trc:
ORA-00600: internal error code, arguments: [kcfrbd_3], [42], [61235], [1], [12800], [12800], [], []

These errors have several causes and corresponding fixes:
ORA-600 [kdsgrp1] is raised by corrupt blocks (table, index, memory, CR blocks, and so on). Analyze the logs to work out why the object went bad, identify the object, and then choose a repair approach that fits the situation (see NOTE:1332252.1).
ORA-600 [4097] can be triggered by a bug when rollback segments are created after the database was shut down abnormally and reopened (although the bug is said to be fixed in this version, in practice I did resolve it by following NOTE:1030620.6).
ORA-600 [kcfrbd_3] occurs when a block carrying a transaction is accessed: Oracle follows the rollback-slot information to the corresponding rollback segment, and because the newly created rollback segments happen to share the names and numbers of the old ones, the lookup lands beyond the size of the new datafile and the error is raised (see NOTE:601798.1).
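
Given that explanation of [kcfrbd_3], a quick check for whether the recreated undo segments reused the old names and numbers could look like this (a sketch):

select segment_name, segment_id, file_id, status
  from dba_rollback_segs
 order by segment_id;
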
In the end the database was recovered and a great deal of data was salvaged, but for an EBS system the damage from losing redo and undo is still enormous. A friendly reminder once again: a database's redo and undo matter, and its backups matter even more.


3 comments on “Tragedy Caused by Redo and Undo Loss in a Group Company's EBS Database”

  1. 惜分飞 says:

    ORA-600[4097][][][][][][][] (Doc ID 1030620.6)

    Problem Description: 
    ==================== 
     
    An ORA-600 [4097] can be encountered through various activities that use 
    rollback segments.
    
    
    Solution Description: 
    ===================== 
     
    The most likely cause of this is BUG 427389.  This BUG is fixed in
    version 7.3.3.3.  The BUG is caused when Rollback Segments are dropped and 
    recreated after a shutdown abort.  It is encountered through a very specific 
    set of circumstances: 
     
    When an instance has a rollback segment offline and the instance crashes, or 
    the user does a shutdown abort, the rollback segment wrap number does not get 
    updated.  If that segment is then dropped and recreated immediately after the 
    instance is restarted, the wrap number could be lower than existing wrap 
    numbers.  This will cause the ORA-600[4097] to occur in subsequent 
    transactions using Rollback. 
     
    To avoid encountering this bug, rollback segments should only be dropped and 
    recreated after the instance has been shutdown normal and restarted.  If you 
    have already encountered the bug, use the following workaround:  
     
       Select segment_name, segment_id from dba_rollback_segs; 
     
       Drop all Rollback Segments except for SYSTEM.  
     
       Recreate dummy (small) rollback segments with the same names in their place. 
     
       Then, recreate additional rollback segments you want to keep with their 
       permanent storage parameters.   
     
       Now drop the dummy ones. This should ensure that the segment_ids are not 
       reused. 
    
    If you ever want to add a rollback segment you have to use the workaround steps
    again.  If you do not fill the dummy slots you may see the problem re-appear.
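
    In SQL the workaround above comes out roughly as follows (a sketch; segment names, tablespace, and storage clauses are illustrative):

    select segment_name, segment_id from dba_rollback_segs;

    alter rollback segment rbs01 offline;       -- must be offline before dropping
    drop rollback segment rbs01;                -- repeat for every non-SYSTEM segment

    create rollback segment rbs01               -- small dummy with the same name
      tablespace rbs storage (initial 64K next 64K minextents 2);

    create rollback segment rbs_keep01          -- the segments to keep, with permanent storage
      tablespace rbs storage (initial 1M next 1M minextents 20);

    drop rollback segment rbs01;                -- the dummies were never brought online, so just drop them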
    
  2. 惜分飞 says:

    ORA-00600: [kcfrbd_3]

    ksedmp: internal or fatal error
    
    ORA-00600: internal error code, arguments: [kcfrbd_3], [13], [64011], [1], [64000], [64000], [], []
    
    
    
    Arg [a] File number
    
    Arg [b] Block number
    
    Arg [c] Number of blocks
    
    Arg [d] Logical File Size
    
    Arg [e] File Size
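
    Reading this case's errors against that layout (an interpretation): ORA-600 [kcfrbd_3], [41], [231381], [1], [12800], [12800] is a read of 1 block at block# 231381 of file 41, whose logical and physical sizes are only 12800 blocks, i.e. the lookup points far past the end of the recreated 100M undo file.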
    
  3. 惜分飞 says:

    Causes and Solutions for ora-600 [kdsgrp1] (Doc ID 1332252.1)

    APPLIES TO:
    
    Oracle Database - Enterprise Edition - Version 10.2.0.4 and later
    Information in this document applies to any platform.
    ***Checked for relevance on 12-Dec-2012***
    PURPOSE
    
    This document discusses the ora-600 [kdsgrp1] error, its possible causes and the work around solutions that can be tried.
    
    TROUBLESHOOTING STEPS
    
    The ora-600 [kdsgrp1] error is thrown when a fetch operation fails to find the expected row. The error is hit in memory and so may 
    be a memory only error or an error that results from corruption on disk.
    
    This error may indicate (but is not restricted to) any of the following conditions:
    
    Lost writes
    Parallel DML issues
    Index corruption
    Data block corruption
    Consistent read [CR] issues 
    Buffer cache corruption  
    A full list of known issues is given in 
    Note 285586.1 - ORA-600 [kdsgrp1] 
    Each bug has a short description that indicates the circumstances where it is hit. The bug list can be shortened by 
    selecting your database release to show only those issues that may affect you.
    
    This issue may be intermittent or it may persist until the underlying disk level corruption is fixed. 
    Intermittent issues are likely to be memory based (however intermittent access to the corruption 
    can be confused with intermittent memory issues).
    
    Common Work Around Solutions
    
    If the issue is in memory only we can try to immediately resolve the issue by flushing the buffer 
    cache but remember to consider the performance impact on production systems:
    
    alter system flush buffer_cache;

    If we have an intermittent consistent read issue we can try disabling rowCR which is an optimization 
    to reduce consistent-read rollbacks during queries by setting _row_cr=FALSE in the initialization files. 
    However, this could lead to performance degradation of queries. Please check the ratio of the two 
    statistics "RowCR hits"/"RowCR attempts" to determine whether the workaround is to be used.
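
    The ratio mentioned above can be read from v$sysstat (a sketch):

    select name, value
      from v$sysstat
     where name in ('RowCR attempts', 'RowCR hits');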
    
    If this is a result of index corruption then we can drop and rebuild the index. Note that this will 
    require a maintenance window on production systems.
    
    Root Cause Determination
    Now let's look at how we discover the root cause of the problem: the first step in finding the root 
    cause of this issue is to inspect the generated trace file. The ora-600 will generate both a 
    trace file in the trace directory and an incident file under the incident id within the incident directory. 
    The top part of the trace file tells us the SQL that was being run when the error was hit:
    
    ----- Current SQL Statement for this session (sql_id=9mamr7xn4wg7x) ----- 
     
    This immediately shows us the data objects that were accessed. Searching the trace file for the 
    text string 'Plan Table' will locate the SQL execution plan that is dumped within this trace file. 
    For a persistent issue this allows us to determine which indexes have been accessed and so identify 
    indexes that should be validated to check for block corruption:
    
    SQL> analyze index scott.pk_dept validate structure online; 
    
    Index analyzed.
    
    Another approach we can take is to use the file and block information contained in the trace file. 
    At the top of the trace file we will find information on the block where the corruption was found:
    
    *** SESSION ID:(3202.5644) 2011-03-19 04:12:16.910 
    row 07c7c8c7.a continuation at 
    file# 31 block# 510151 slot 11 not found
    
    This information can be used to identify the object details in dba_extents:
    
    Select owner, segment_name, segment_type, partition_name,tablespace_name 
    From dba_extents 
    Where relative_fno = <file id> 
    And <block#> between block_id and (block_id+blocks-1);
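
    Filled in with the trace values above (file# 31, block# 510151), the query reads:

    Select owner, segment_name, segment_type, partition_name, tablespace_name 
    From dba_extents 
    Where relative_fno = 31 
    And 510151 between block_id and (block_id+blocks-1);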
    
    We can then validate this object, for example a table and all its indexes:
    
    Analyze table scott.dept validate structure cascade online;

    Remember that we may be dealing with a permanent corruption that is not located in the object blocks themselves. 
    Examples of this include:
    
    Dictionary corruption issue from transportable tablespace operations: check dba_tablespaces to 
    see if the tablespace has been plugged in.
    Lost writes in ASM diskgroup mirrors - most likely to be seen when there is heavy IO and disk resync activity. 
    To check this run dbms_diskgroup.checkfile to detect mirror discrepancies.
    If analyze reports no corruption then check if there are any chained rows on the table. If these exist then 
    we may have an undetected corruption and the issue should reproduce whenever the SQL is run. 
    Exporting the table will also detect this issue.
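
    Chained rows can be checked with the classic LIST CHAINED ROWS approach (a sketch; the CHAINED_ROWS table is created by utlchain.sql):

    SQL> @?/rdbms/admin/utlchain.sql
    SQL> analyze table scott.dept list chained rows into chained_rows;
    SQL> select count(*) from chained_rows;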
    
    If analyze and exporting the table (in the presence of chained rows) both report no errors 
    then this should be considered a consistent read issue.
    
    Once you understand the nature of the problem you can review the list of known bugs and determine 
    which one matches your condition. If you cannot determine which issue is affecting you then open 
    a service request with Oracle Support and upload the RDBMS and ASM (if applicable) instance alert 
    logs for all nodes, any trace and incident files generated and a full description of the nature of the problem.