Handling a "Control file mount id mismatch!" failure

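After talking with the customer, we confirmed that their active-active storage setup had failed and the workload was running on the primary array. Once the second array was repaired, an active-active resynchronization was started. During that process the database hit "Control file mount id mismatch!" and crashed.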

2023-05-03T20:21:07.446873+08:00
Archived Log entry 491897 added for T-1.S-246903 ID 0x97d92f0b LAD:1
2023-05-03T20:47:53.902701+08:00
Error: 2141
Control file mount id mismatch!
fhmid: 2592441863, SGA mid: 2624617448
Requesting DIAG on each RAC instance to dump the control file header block
2023-05-03T20:47:55.906490+08:00
Errors in file /opt/rac/oracle/diag/rdbms/xff/xff1/trace/xff1_rms0_20989.trc:
2023-05-03T20:47:56.521500+08:00
RMS0 (ospid: 20989): terminating the instance
2023-05-03T20:47:56.610656+08:00
System state dump requested by (instance=1, osid=20989 (RMS0)), summary=[abnormal instance termination].
System State dumped to trace file /opt/rac/oracle/diag/rdbms/xff/xff1/trace/xff1_diag_20912_20230503204756.trc
2023-05-03T20:47:58.480397+08:00
License high water mark = 395
2023-05-03T20:48:02.600203+08:00
Instance terminated by RMS0, pid = 20989
2023-05-03T20:48:02.601563+08:00
Warning: 2 processes are still attach to shmid 393226:
 (size: 28672 bytes, creator pid: 19941, last attach/detach pid: 20912)
2023-05-03T20:48:03.481726+08:00
USER (ospid: 967): terminating the instance
2023-05-03T20:48:03.483351+08:00
Instance terminated by USER, pid = 967
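The key evidence in the log above is the pair of mount ids: `fhmid` (read from the control file header) no longer matches `SGA mid` (the value the running instance holds in memory). As a minimal sketch, the pair can be pulled out of an alert log with a small shell function; the function name `mount_id_mismatch` is hypothetical, and the pattern assumes the line format shown above.

```shell
# Hypothetical helper: report every "fhmid ... SGA mid ..." line in an
# alert log whose two mount ids differ.
mount_id_mismatch() {
    local alert_log="$1"
    # Expected line format: "fhmid: 2592441863, SGA mid: 2624617448"
    sed -n 's/^fhmid: \([0-9][0-9]*\), SGA mid: \([0-9][0-9]*\).*$/\1 \2/p' "$alert_log" |
    while read -r fhmid sga_mid; do
        if [ "$fhmid" != "$sga_mid" ]; then
            echo "MISMATCH fhmid=$fhmid sga_mid=$sga_mid"
        fi
    done
}
```

Called as `mount_id_mismatch alert_xff1.log` (the file name is an assumption), it prints one line per mismatch and nothing when the ids agree.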

When the node restarted automatically, the mount failed with ORA-600 [kccsbck_first]

2023-05-03T20:48:34.870435+08:00
NOTE: ASMB mounting group 2 (FRA)
NOTE: ASM background process initiating disk discovery for grp 2 (reqid:0)
NOTE: Assigning number (2,1) to disk (/dev/asm_data0g)
NOTE: Assigning number (2,0) to disk (/dev/asm_data0f)
SUCCESS: mounted group 2 (FRA)
NOTE: grp 2 disk 1: FRA_0001 path:/dev/asm_data0g
NOTE: grp 2 disk 0: FRA_0000 path:/dev/asm_data0f
2023-05-03T20:48:34.919965+08:00
NOTE: dependency between database xff and diskgroup resource ora.FRA.dg is established
2023-05-03T20:48:38.983416+08:00
Errors in file /opt/rac/oracle/diag/rdbms/xff/xff1/trace/xff1_ora_2436.trc  (incident=1333249):
ORA-00600: ??????, ??: [kccsbck_first], [1], [2624617448], [], [], [], [], [], [], [], [], []
Incident details in: /opt/rac/oracle/diag/rdbms/xff/xff1/incident/incdir_1333249/xff1_ora_2436_i1333249.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ORA-600 signalled during: ALTER DATABASE MOUNT /* db agent *//* {0:8:116} */...

Restarting the database again failed with ORA-00742 and ORA-00312

2023-05-04T08:18:59.635790+08:00
Aborting crash recovery due to error 742
2023-05-04T08:18:59.635897+08:00
Errors in file /opt/rac/oracle/diag/rdbms/xff/xff1/trace/xff1_ora_80855.trc:
ORA-00742: ??????? 2 ?? 244996 ? 8262 ??????????
ORA-00312: ???? 7 ?? 2: '+FRA/xff/ONLINELOG/group_7.446.1059323695'
ORA-00312: ???? 7 ?? 2: '+DATA/xff/ONLINELOG/group_7.272.1059323695'
Abort recovery for domain 0, flags 4
2023-05-04T08:18:59.647994+08:00
Errors in file /opt/rac/oracle/diag/rdbms/xff/xff1/trace/xff1_ora_80855.trc:
ORA-00742: ??????? 2 ?? 244996 ? 8262 ??????????
ORA-00312: ???? 7 ?? 2: '+FRA/xff/ONLINELOG/group_7.446.1059323695'
ORA-00312: ???? 7 ?? 2: '+DATA/xff/ONLINELOG/group_7.272.1059323695'
ORA-742 signalled during: ALTER DATABASE OPEN /* db agent *//* {2:37368:2} */...
2023-05-04T08:19:00.820708+08:00
License high water mark = 33
2023-05-04T08:19:00.820936+08:00
USER (ospid: 82788): terminating the instance
2023-05-04T08:19:01.827132+08:00
Instance terminated by USER, pid = 82788
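The ORA-00312 lines above name the redo log members involved in the lost write, and both members of group 7 are affected. A minimal sketch for listing those members from an alert log follows; the function name `redo_members_in_error` is hypothetical, and it assumes the member path is the single-quoted string on each ORA-00312 line, as in the log above.

```shell
# Hypothetical helper: list the distinct online redo log members named in
# the ORA-00312 lines of an alert log.
redo_members_in_error() {
    local alert_log="$1"
    # Keep only the text between the single quotes on each ORA-00312 line.
    grep '^ORA-00312:' "$alert_log" |
        sed -n "s/.*'\([^']*\)'.*/\1/p" |
        sort -u
}
```

Because the pattern keys on the error number and the quotes, it works even when the message text itself is garbled by the character set, as it is here.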

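Clearly, instance recovery at startup detected a lost redo write, which prevented the database from opening normally. We have handled many failures of this kind; see these related posts:
ORA-00742 ORA-00312 failure recovery, part 1
ORA-00742 ORA-00312 failure recovery, part 2
ORA-00742: Log read detects lost write in thread %d sequence %d block %d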


Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.

On Zhonglian (中联) HIS systems, the alert log frequently shows warnings like the following:

[oracle@oracle1 trace]$ tail -f alert_orcl.log 
Tue May 02 22:06:46 2023
Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.
Tue May 02 22:06:50 2023
Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.
Tue May 02 22:06:50 2023
Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.
Tue May 02 22:06:50 2023
Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.
Tue May 02 22:06:50 2023
Maximum of 148 enabled roles exceeded for user ZLHIS. Not loading all the roles.
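To see which users trigger this warning and how often, the alert log can be summarized with a short shell function. This is only a sketch: the function name `count_role_warnings` is hypothetical, and the pattern assumes the exact message wording shown above.

```shell
# Hypothetical helper: count "Maximum of 148 enabled roles exceeded"
# warnings per user in an alert log. Output format: "<user> <count>".
count_role_warnings() {
    local alert_log="$1"
    sed -n 's/^Maximum of 148 enabled roles exceeded for user \([^.]*\)\. Not loading all the roles\.$/\1/p' "$alert_log" |
        sort | uniq -c | awk '{print $2, $1}'
}
```

For example, `count_role_warnings alert_orcl.log` prints one line per affected user.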

Query how many roles user ZLHIS currently holds, including roles granted through other roles:

SQL> select grantee, count(*) "Role Number" from
     (
       select distinct connect_by_root grantee grantee, granted_role
       from dba_role_privs
       connect by prior granted_role = grantee
     ) where grantee = 'ZLHIS'
     group by grantee
     /

GRANTEE                        Role Number
------------------------------ -----------
ZLHIS                                  149

The max_enabled_roles parameter is set to 150:

SQL> show parameter MAX_ENABLED_ROLES;

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
max_enabled_roles                    integer     150

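However, a user can have at most 148 roles enabled by default. To resolve the problem, merge the privileges of several roles into fewer roles and grant those to ZLHIS, or revoke role grants that are not needed. If all of these roles are required and must be enabled when the user logs in, there are two common approaches: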

alter user <username> default role <list of roles>;
-- <list of roles> can be ALL or an explicit list of roles

Or enable the roles within the session:

set role all;
or
execute dbms_session.set_role('ALL');
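When the default roles are given as an explicit list, the statement can be generated rather than typed by hand. The sketch below is a hypothetical helper (`build_default_role_stmt` is not an Oracle tool) that builds an `ALTER USER ... DEFAULT ROLE` statement and refuses lists longer than 148 entries. It counts only the roles passed in; roles granted through other roles also count toward the limit, so a real list may need to be shorter.

```shell
# Hypothetical helper: build an ALTER USER ... DEFAULT ROLE statement from
# a user name followed by role names, rejecting lists above the 148 limit.
build_default_role_stmt() {
    local user="$1"
    shift
    local max_roles=148
    if [ "$#" -eq 0 ] || [ "$#" -gt "$max_roles" ]; then
        echo "need between 1 and $max_roles roles, got $#" >&2
        return 1
    fi
    local list
    # Join the role names with ", " and trim the trailing separator.
    list=$(printf '%s, ' "$@")
    echo "ALTER USER $user DEFAULT ROLE ${list%, };"
}
```

For example, `build_default_role_stmt ZLHIS ROLE_A ROLE_B` (role names are placeholders) prints a statement that can be reviewed and then run in sqlplus.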

Reference: What to Check When Dealing With Ora-28031: Maximum Of 148 Enabled Roles Exceeded? (Doc ID 778785.1)


echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message

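The customer reported that they could not log in to the database and could not SSH to the server either, although the host still answered ping. Logging in with sqlplus sys/pwd@tns as sysdba succeeded, so we ran shutdown abort against the database, after which SSH logins worked again. Analysis pointed to I/O problems.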

Errors in the system messages log:
[Screenshot 20230430084031]


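By default, Linux uses up to 40% of available memory (the exact figure depends on system configuration) as file system cache. Once that threshold is exceeded, the file system flushes all of the cached data to disk, and subsequent I/O requests become synchronous. Writing the cache to disk is subject to a default 120-second timeout (kernel.hung_task_timeout_secs). The problem above arises because the I/O subsystem is too slow to write all of the cached data to disk within 120 seconds. With the I/O system responding slowly, more and more requests pile up until all system memory is consumed and the system stops responding.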

Checking system I/O:
[Screenshot 20230430084643]

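The disk was 100% busy even though the I/O request volume was very low, which is abnormal, so we asked the customer to have someone inspect the physical disks.
[Screenshot 20230430084802]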

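One disk in the RAID 5 array turned out to be faulty, which degraded performance. After the customer replaced the disk, the system returned to normal.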

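Mitigating with system parameter tuning
On Linux, the file system cache behavior can be tuned by setting vm.dirty_background_ratio and vm.dirty_ratio to suitable values, for example: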

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
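
To get a feel for what these ratios mean on a given host, the sketch below converts a ratio into an approximate amount of dirty cache. It is a rough estimate under a stated assumption: the kernel applies the ratios to dirtyable memory rather than to MemTotal, so the figures are upper bounds. Both function names are hypothetical.

```shell
# Rough estimate of how many kB of dirty page cache a ratio allows.
# Assumption: uses MemTotal, so the result is an upper bound on the
# threshold the kernel actually applies.
dirty_limit_kb() {
    local mem_kb="$1" ratio="$2"
    echo $(( mem_kb * ratio / 100 ))
}

# Print the estimates for the values currently in effect on this host.
show_dirty_limits() {
    local mem_kb bg ratio
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    bg=$(cat /proc/sys/vm/dirty_background_ratio)
    ratio=$(cat /proc/sys/vm/dirty_ratio)
    echo "background write-back starts near $(dirty_limit_kb "$mem_kb" "$bg") kB"
    echo "writers are throttled near $(dirty_limit_kb "$mem_kb" "$ratio") kB"
}
```

Lower ratios make write-back start earlier and in smaller batches, which shortens the time any single flush has to wait on a slow disk.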