分类目录归档:Oracle备份恢复

存储双活系统逻辑损坏数据库抢救恢复

计划休假的前一夜晚上节点朋友求救电话,说xx医院核心his系统的Oracle数据库很多表报ORA-8103错误,业务无法正常办理.
ora-8103


通过dbv检查文件发现连续坏块
2

根据以往经验数据库出现类似这样的错误,很可能是底层问题,查看系统日志发现大量磁盘错误
3

该报错时间和应用反馈系统异常时间基本上匹配,初步怀疑是硬件或者os异常导致.因为客户数据库大量表表ORA-8103,而且有文件出现连片被置空,无法准确定位数据库损坏情况(置空值数据库级别的物理损坏,ora-8103是逻辑错误在表不被访问的情况下无法检查出来),考虑分析客户的硬件环境,备份容灾情况,分析选择最佳方案.
通过和客户沟通以及检查数据库的相关情况发现信息如下:
1)存储使用的是xx厂商的双活方案,这种存储级解决方案对于该故障来说没用,因为是lun的逻辑级别损坏,损坏数据同时同步到两套存储上.
2)数据库库容灾使用的是某厂家的cdp同步容灾,客户对cdp库进行分析,发现数据同步异常,基本上该方案也无法使用
3)数据库的备份情况:由于存放数据库备份的存储电池异常和有坏盘导致存储写io效率非常低,客户在3天之前停止掉了文件系统中的rman备份;有tsm的带库备份,结果检查发现竟无一次备份成功.
故障进一步扩大
针对客户情况,确定是节点2有明显异常,准备停掉节点2的数据库和集群,然后看下在节点1上是否有改善,结果发现把节点2的crs停掉之后,节点1的库直接crash,通过分析发现asm disk有一块盘磁盘前几M表直接置空(应该是在关闭crs之前就已经异常,只是因为磁盘头部分数据没有相关操作,因此没有触发相关问题),当一个节点关闭会去写磁盘头信息,asm发现异常直接dismount 节点1的磁盘组了,从而使得节点1的库异常.
4

5

现在的情况:
1)现在的asm 磁盘组异常(其中一个磁盘头前几M损坏),也就是说在原库基础上直接修复的概率基本上没有可能
2)cdp数据异常,不可用
3)在数据库相关服务器中找到一份4天之前的一次全备
恢复思路:
1.客户准备新空间,直接把4天之前的备份还原到本地文件系统中
2.通过底层工具对于有磁盘损坏的asm磁盘组进行分析,尝试恢复归档日志和redo(尽可能做到最大限度恢复数据)
3.通过备份还原4天之前的备份结合我们恢复的归档日志和redo尝试完全恢复数据
4.问题风险,就算归档日志和redo从损坏的asm 磁盘组中恢复出来,但是也有可能损坏,导致后面无法恢复到最新数据(造成数据丢失)
实际操作:
1. 由于客户在昨天晚上故障之后增加了一些undo数据文件,使得无法正常全库restore database(因为ctl中数据文件信息比备份集中多)
2. 后续由于10204 rac还原到单机出现ORA-600 kgeade_is_1错误
6

3. 数据库恢复完成之后,出现sqlplus 操作数据库正常,plsql dev和应用访问数据库报ora-27092的问题
7

最后运气不错,经过一系列努力,数据库open成功,应用也正常访问,最初生产环境中损坏的表现在查询也不再报ORA-8103,dbv检查异常文件也ok
8

9

再次提醒各位朋友:
1)你的数据库备份是否正常,建议定期做故障演练
2)选择合适数据库的容灾方案,建议定期检查或者演练
3)存储双活可以解决硬件故障问题,但是还要有适当的解决方案来规避存储逻辑错误风险.

发表在 Oracle备份恢复 | 标签为 | 评论关闭

数据库open过程遭遇ORA-1555对应sql语句补充

在2015年的在数据库open过程中常遇到ORA-01555汇总文章中写过oracle open过程中可能会遇到ORA-01555错误,对应的sql语句.最近的恢复中又遇到两个新的,对其进行补充
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1

Thu May 09 02:10:27 2019
SMON: enabling cache recovery
ORA-01555 caused by SQL statement below (SQL ID: bqbdby3c400p7, SCN: 0x0000.3e785fc7):
select rowcnt,blkcnt,empcnt,avgspc,chncnt,avgrln,nvl(degree,1), nvl(instances,1) from tab$ where obj# = :1
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Errors in file /home/u01/diag/rdbms/orcl/orcl/trace/orcl_ora_15929.trc:
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 91 with name "_SYSSMU91_1360910548$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 15929): terminating the instance due to error 704
Instance terminated by USER, pid = 15929
ORA-1092 signalled during: alter database open resetlogs...
opiodr aborting process unknown ospid (15929) as a result of ORA-1092
Thu May 09 02:10:28 2019
ORA-1092 : opitsk aborting process

select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null

NSA2 started with pid=41, OS id=32571518 
ORA-01555 caused by SQL statement below (SQL ID: 3nkd3g3ju5ph1, Query Duration=0 sec, SCN: 0x0005.e4bea784):
select obj#,type#,ctime,mtime,stime, status, dataobj#, flags, oid$, spare1, spare2 from 
    obj$ where owner#=:1 and name=:2 and namespace=:3 and remoteowner is null and linkname is null and subname is null
Errors in file /u01/app/oracle/diag/rdbms/xifenfei_std/xifenfei/trace/xifenfei_ora_18939904.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 7 with name "_SYSSMU7_542380376$" too small
Errors in file /u01/app/oracle/diag/rdbms/xifenfei_std/xifenfei/trace/xifenfei_ora_18939904.trc:
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 2
ORA-01555: snapshot too old: rollback segment number 7 with name "_SYSSMU7_542380376$" too small
Error 704 happened during db open, shutting down database
USER (ospid: 18939904): terminating the instance due to error 704
Instance terminated by USER, pid = 18939904
ORA-1092 signalled during: alter database open RESETLOGS...
opiodr aborting process unknown ospid (18939904) as a result of ORA-1092
发表在 Oracle备份恢复 | 标签为 , | 评论关闭

ORA-600 kfrValAcd30 恢复

有客户由于存储控制器损坏,修复控制器之后,asm无法正常mount,报ORA-600 kfrValAcd30错误,让我们提供技术支持
kfrValAcd30


asm alert日志报错

Wed Apr 03 16:50:57 2019
SQL> alter diskgroup data mount 
NOTE: cache registered group DATA number=1 incarn=0x14248741
NOTE: cache began mount (first) of group DATA number=1 incarn=0x14248741
NOTE: Assigning number (1,0) to disk (ORCL:DATAVOL1)
Wed Apr 03 16:51:03 2019
NOTE: start heartbeating (grp 1)
kfdp_query(DATA): 15 
kfdp_queryBg(): 15 
NOTE: cache opening disk 0 of grp 1: DATAVOL1 label:DATAVOL1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache mounting (first) external redundancy group 1/0x14248741 (DATA)
Wed Apr 03 16:51:04 2019
* allocate domain 1, invalid = TRUE 
Wed Apr 03 16:51:04 2019
NOTE: attached to recovery domain 1
NOTE: starting recovery of thread=1 ckpt=27.2697 group=1 (DATA)
Errors in file /u01/app/grid/diag/asm/+asm/+ASM2/trace/+ASM2_ora_15951.trc  (incident=23394):
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM2/incident/incdir_23394/+ASM2_ora_15951_i23394.trc
Abort recovery for domain 1
NOTE: crash recovery signalled OER-600
ERROR: ORA-600 signalled during mount of diskgroup DATA
ORA-00600: internal error code, arguments: [kfrValAcd30], [DATA], [1], [27], [2697], [28], [2697], [], [], [], [], []
ERROR: alter diskgroup data mount
NOTE: cache dismounting (clean) group 1/0x14248741 (DATA) 
NOTE: lgwr not being msg'd to dismount
freeing rdom 1
Wed Apr 03 16:51:05 2019
Sweep [inc][23394]: completed
Sweep [inc2][23394]: completed
Wed Apr 03 16:51:05 2019
Trace dumping is performing id=[cdmp_20190403165105]
NOTE: detached from domain 1
NOTE: cache dismounted group 1/0x14248741 (DATA) 
NOTE: cache ending mount (fail) of group DATA number=1 incarn=0x14248741
kfdp_dismount(): 16 
kfdp_dismountBg(): 16 
NOTE: De-assigning number (1,0) from disk (ORCL:DATAVOL1)
ERROR: diskgroup DATA was not mounted
NOTE: cache deleting context for group DATA 1/337938241

mos相关记录
参考:ORA-600 [KFRVALACD30] in ASM (Doc ID 2123013.1)
kfrValAcd30-mos


ORA-00600: internal error code, arguments: [kfrValAcd30]可能是bug或者硬件故障导致.基于客户的情况,最大可能就是由于硬件故障导致asm 磁盘组的acd无法正常进行,从而无法mount成功.如果运气好,通过kfed相关修复可以正常mount成功,运气不好可以通过底层进行恢复数据文件,从而最大限度恢复数据.

发表在 Oracle ASM, 非常规恢复 | 标签为 , , | 评论关闭