11. kdbh 数据头信息
•kdbhflag bit位, KDBHFFK 0x01 Flushable Key
•kdbhntab Number of tables in the table index
•kdbhnrow Number of rows in the row index
•kdbhfrre first FRee Row index Entry
•kdbhfsbo Free Space Beginning Offset
•kdbhfseo Free Space Ending Offset
•kdbhavsp AVailable SPace in the block
•kdbhtosp TOtal Space that will be available
17. 存储/硬盘故障
•右图为36个月的硬盘留任率
•一些案例:
–ORA-1578 Transient Corruption – Caused by Parity Error on EMC DMX4 (Doc ID 1486903.1)
–ORA-1578 Data Block Corruptions when using EMC Storage. Blocks with 0xc9 byte (Doc ID 1323108.1)
–Database Corruption due to Lost IO on Hitachi storage. ORA-600 [kdsgrp1] ORA- 1499 ORA-1410 ORA-600 [3020] (Doc ID 1512717.1)
18. 卷管理软件或文件系统bug
•一些案例:
–ORA-354 Redo log corruption when using Xisgo Driver (Doc ID 1498389.1)
–ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1)
–ORA-1578 Misplaced blocks against datafiles stored in NFS filers (Doc ID 1525108.1)
–ORA-1578 ORA-354 ORA-600 [3020] Misplaced blocks by Symantec / Veritas after adding LUN (Doc ID 1323532.1)
–ORA-1578 Block overwritten with string “DiskDescription cyl alt hd sec” when using Symantec / Veritas (Doc ID 1313454.1)
–Block Corruption (ALL ZERO) detected after reclaiming space on Veritas Filesystem (Doc ID 1587427.1)
遇到过的VxCFS问题:
•32位指针问题导致内存泄露
•卷同步存在漏洞,将实际未完成同步的数据返回给Oracle 使用
20. 数据块物理损坏
•在读取数据块时发现,或者通过工具(dbv,rman)预先发现
•一般会报错ORA-01578,具体损坏信息在trace中
•可能是:块头错乱,块不完整,checksum验证错误
•物理损坏表现在磁盘上,不会自动消失,checksum要么读不出要么 不对
•Redo日志流中不存在损坏,可以通过blockrecover修复,前提是有备份 和归档
Header
Data
Footer
Corrupt Block:
Header
Data
Footer
Corrupt Block:
Header
Data
Corrupt Block:
Footer
Mismatch
21. 数据块物理损坏- 现象
•ORA-01578错误被报出,alert.log中出现相关损坏描述:
Corrupt block relative dba: 0x0380e573 (file 14, block 58739)
Fractured block found during buffer read
Data in bad block -
type: 6 format: 2 rdba: 0x0380e573
last change scn: 0x0288.8e5a2f78 seq: 0x1 flg: 0x04
consistency value in tail: 0x00780601 应为 0x2f780601
check value in block header: 0x8739, computed block checksum: 0x2f00
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Reread of rdba: 0x0380e573 (file 14, block 58739) found same corrupted data
22. ORA-1578
ORA-1578 "ORACLE data block corrupted (file # %s, block # %s)"
Information about object and corruption description is found in the alert log.
Normally is reported with an ORA-1110 reporting the Absolute File Number:
SQL> select * from scott.case1;
select * from scott.case1
*
ERROR at line 1:
ORA-01578: ORACLE data block corrupted (file # 11, block # 34) =>相对文件号RFN
ORA-01110: data file 6:‘/home/oracle/corrclass.dbf‘ =>绝对文件号AFN
23. 块断裂 Fractured Block
Corrupt block relative dba: 0x0380e573 (file 14, block 58739)
Fractured block found during buffer read
Data in bad block -
type: 6 format: 2 rdba: 0x0380e573
last change scn: 0x0288.8e5a2f78 seq: 0x1 flg: 0x04
consistency value in tail: 0x00780601
check value in block header: 0x8739, computed block checksum: 0x2f00
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Reread of rdba: 0x0380e573 (file 14, block 58739) found same corrupted data
24. 错误的校验码 bad checksum
Corrupt block relative dba: 0x0380a58f (file 14, block 42383)
Bad check value found during buffer read
Data in bad block -
type: 6 format: 2 rdba: 0x0380a58f
last change scn: 0x0288.7784c5ee seq: 0x1 flg: 0x06
consistency value in tail: 0xc5ee0601
check value in block header: 0x68a7, computed block checksum: 0x2f00
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Reread of rdba: 0x0380a58f (file 14, block 42383) found same corrupted data
25. 损坏的块头Bad header
Corrupt block relative dba: 0x0d805a89 (file 54, block 23177)
Bad header found during buffer read
Data in bad block -
type: 6 format: 2 rdba: 0x0d805b08
last change scn: 0x0692.86dc08e3 seq: 0x1 flg: 0x04
consistency value in tail: 0x08e30601
check value in block header: 0x2a6e, computed block checksum: 0x0
spare1: 0x0, spare2: 0x0, spare3: 0x0
***
Reread of rdba: 0x0d805a89 (file 54, block 23177) found valid data
26. 数据块损坏检测参数 DB_BLOCK_CHECKSUM
•三种不同作用的检测 保护机制,但都不负责修复 现有问题。
•DB_BLOCK_CHECKSUM负责控制块的cache层 的chkval_kcbh是否在块被写出时计算并写入到磁 盘中
BBED> sum
Check value for File 0, Block 500:
current = 0x5327, required = 0x5327
BBED> p chkval_kcbh
ub2 chkval_kcbh @16 0x5327
32. db_block_checksum=TRUE
Block#=X
Chkval=A
Server process reads block X.
Checksum is calculated and compared with the one in the block header. Block is modified and checksum=0
DBWn later calculates a new checksum , stores it in block header and writes the block
Buffer cache
Datafiles
Block#=XChkval=0
Block#=X
Chkval=B
33. db_block_checksum=FULL
Block#=X
Chkval=A
Server process reads block X.
Checksum is calculated and compared with the one in the block header. Block is modified and new checksum is calculated =B
DBWn later checks the checksum and writes the block
Buffer cache
Datafiles
Block#=XChkval=B
Block#=X
Chkval=B
37. 数据块逻辑损坏
•例子:
Bug 10113224 Index coalesce may generate invalid redo if blocks in the buffer cache are invalid/corrupted
若未启用db_block_checking,buffer cache中的block出现讹误,则索引index coalesce操作可能生成错乱的redo数据
对于oracle检测到的逻辑损坏,绝大多数情况可以从ORA-600 [kddummy_blkchk] (Doc ID 300581.1)
中找到
逻辑讹误也可能由于非Oracle的进程或硬件意外写入到buffer cache内存中引发
40. Buffer cache中的块逻辑损坏
•Bug 12314760 - bad check value found during preparing block for write for 32k block
•Bug 11799496 ORA-600 [kcbzpbuf_1] block corruption in buffer cache for 32k block size / ORA-7445 [kdb4cpss] by cache protect.
•db_block_checksum=full 避免损坏污染到磁盘上
•Oracle检测代码,参见文档ORA-600 [kcbzpbuf_1] (Doc ID 337018.1)
62. DB_LOST_WRITE_PROTECT 发现Lost Write的Standby端的alert.log
Wed Oct 23 19:18:08 2013
Hex dump of (file 7, block 131) in trace file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr02_1401.trc
Reading datafile '+DATA/orcls/datafile/lw.281.829593935' for corruption at rdba: 0x01c00083 (file 7, block 131)
Read datafile mirror 'DATA_0004' (file 7, block 131) found same corrupt data (logically corrupt)
Read datafile mirror 'DATA_0006' (file 7, block 131) found same corrupt data (logically corrupt)
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 131, FILE 7
NO REDO AT OR AFTER SCN 3987667 CAN BE USED FOR RECOVERY.
Recovery of Online Redo Log: Thread 1 Group 7 Seq 103 Reading mem 0
Slave exiting with ORA-752 exception
Errors in file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr02_1401.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 7, block# 131, file offset is 1073152 bytes)
ORA-10564: tablespace LW
ORA-01110: data file 7: '+DATA/orcls/datafile/lw.281.829593935'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 87637
Wed Oct 23 19:18:12 2013
Recovery Slave PR02 previously exited with exception 752
Wed Oct 23 19:18:12 2013
MRP0: Background Media Recovery terminated with error 448
Errors in file /u01/app/oracle/diag/rdbms/orcls/orcls1/trace/orcls1_pr00_1395.trc:
ORA-00448: normal completion of background process
MRP0: Background Media Recovery process shutdown (orcls1)
63. DB_LOST_WRITE_PROTECT PRXX进程TRACE
Hex dump of (file 7, block 131)
Dump of memory from 0x00000000F03B0000 to 0x00000000F03B2000
0F03B0000 0000A206 01C00083 003CC55A 04010000 [........Z.<.....]
...
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 131, FILE 7
The block read on the primary had SCN 3977658 (0x0000.003cb1ba) seq 1 (0x01)
while expected to have SCN 3982682 (0x0000.003cc55a) seq 1 (0x01)
The block was read at SCN 3987667 (0x0000.003cd8d3), BRR:
CHANGE #1 TYP:2 CLS:6 AFN:7 DBA:0x01c00083 OBJ:87637 SCN:0x0000.003cb1ba SEQ:1 OP:23.2 ENC:0 RBL:1
...
REDO RECORD - Thread:1 RBA: 0x000067.00000128.0010 LEN: 0x0034 VLD: 0x10
SCN: 0x0000.003cd8d3 SUBSCN: 1 10/23/2013 19:18:16
(LWN RBA: 0x000067.00000126.0010 LEN: 0003 NST: 0001 SCN: 0x0000.003cd8cf)
CHANGE #1 TYP:2 CLS:6 AFN:7 DBA:0x01c00083 OBJ:87637 SCN:0x0000.003cb1ba SEQ:1 OP:23.2 ENC:0 RBL:1
Block Read - afn: 7 rdba: 0x01c00083 BFT:(1024,29360259) non-BFT:(7,131)
scn: 0x0000.003cb1ba seq: 0x01
flags: 0x00000006 ( dlog ckval )
发现Lost Write的数据块
① Primary侧 Block的SCN
② Standby侧Block的SCN
Block Dump
Redo Dump
Main Message
③ Primary侧读取该block时的SCN
3977658(①) < 3982682(②) 可判定Primary发生Lost Write
3982682(②)的更新被丢失了
3987667(③)读取数据块时被检测到
69. 表和索引不一致的原因
•常见bug列表:
–OERR: ORA-1499 table/Index Cross Reference Failure - see trace file (Doc ID 1499.1)
–OERR: ORA-8102 "index key not found, obj# %s, file %s, block %s (%s)" (Doc ID 8102.1)
•由于写丢失lost write
70. 数据字典不一致
•数据字典记录错乱,字典表之间逻辑不一致
•属于SYS用户的系统核心表bootstrap对象,存放在system表空间上, 是Oracle数据库能被实例打开并正常运作的基础
•不一致的例子:
–'Problem: OBJ$.OWNER# not in USER$" which refers to a user in table OBJ$ that does not exist in USER$.
–ORPHAN TABPART$: OBJ=3168347 TS=33 RFILE/BLOCK=21 3316440 BO#=3168256 SegType= ORA-00600: internal error code, arguments: [ktsircinfo_num1], [33], [21], [3315720], [], [], [], []
–ORPHAN TAB$: OBJ=106713 DOBJ=106713 TS=45 RFILE/BLOCK=50 304491 BOBJ#= SegType= ORA-00600: internal error code, arguments: [ktssdrp1], [45], [50], [304491], [], [], [], []
• Identify the corruption extension using RMAN/DBV/ANALYZE etc (Doc ID 836658.1)
71. Redo在线日志损坏
损坏发生在redo日志文件的块上,逻辑或者物理损
Redo log中的无效块:
–ORA-353 log corruption near block %s change %s time %s
–ORA-368 checksum error in redo log block
–ORA-355 change numbers out of order
72. 控制文件损坏 control file
•发生在控制文件中的块损坏
•ORA-00207 "control files are not for the same database“
•ORA-00600: internal error code, arguments: [kccpb_sanity_check_2]
74. 针对损坏的恢复
•使用RMAN的BMR block media recovery
–RMAN : Block-Level Media Recovery - Concept & Example (Doc ID 144911.1)
–HOW TO PERFORM BLOCK MEDIA RECOVERY (BMR) WHEN BACKUPS ARE NOT TAKEN BY RMAN. (Doc ID 342972.1)
•DataGuard,从standby恢复或Switchover,Active Dataguard的Automatic Block Media recovery
•重建索引rebuild index
•对于已经传播到磁盘上的逻辑损坏,跳过损坏块或者从备 份中重建表
•DBMS_REPAIR包跳过损坏块
•采用10.2.0.4以后的43810事件或11g _index_scan_check_skip_corrupt参数来跳过IOT上的坏 块
81. TAB$中有记录,而SEG$中无对应记录的场景
SQL> connect test/test
Connected.
SQL> drop table EMP;
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [ktssdrp1], [11], [10], [9], [], [],[], []
SQL> select * FROM EMP;
select * FROM EMP
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [ktsircinfo_num1], [11], [10], [9]
ERROR:
ORA-600 [ktsircinfo_num1] [a] [b] [c] [d] [e]
DESCRIPTION:
This exception occurs when there are problems obtaining the row cache information correctly from sys.seg$
ARGUMENTS:
Arg [a] tablespace number
Arg [b] file number
Arg [c] block number
82. IND$中有记录,而SEG$中无对应记录的场景
SQL> select * from dept where dname = 'ACCOUNTS';
select * from dept where dname = 'ACCOUNTS'
*
ERROR at line 1:
ORA-00600: internal error code, arguments: [ktsircinfo_num1], [14], [2], [49],[], [], [], []
Problem: Orphaned IND$ (no SEG$) - See Note:65987.1 (Bug:624613)
ORPHAN IND$: OBJ=39442 DOBJ=39442 TS=14 RFILE/BLOCK=2 49 BO#=39438 SegType=
SQL> select * from sys.seg$ where ts#=14 and file#=2 and block#=49;
no rows selected
Ind$ with an entry:
SQL> select ts#, file#, block# from sys.ind$ where obj#=39442;
TS# FILE# BLOCK#
---------- ---------- ----------
14 2 49
83. SEG$中有记录,而OBJ$中无对应记录的场景
SQL> drop tablespace test_dict_corr including contents and datafiles;
drop tablespace test_dict_corr including contents and datafiles
*
ERROR at line 1:
ORA-01561: failed to remove all objects in the tablespace specified
SQL> select * from dba_segments where tablespace_name = 'TEST_DICT_CORR';
no rows selected
SQL> select ts#, name from v$tablespace where name = 'TEST_DICT_CORR';
TS# NAME
---------- ------------------------------
20 TEST_DICT_CORR
SQL> Select ts#, file#, block#, type# from seg$ where ts#=20;
TS# FILE# BLOCK# TYPE#
---------- ---------- ---------- ----------
20 7 17 6 ----? type#=6 (INDEX).
Hcheck8i output:
Problem: Orphaned SEG$ Entry ORPHAN SEG$: SegType=INDEX TS=20 RFILE/BLOCK=7 17
No orphaned IND$ entries
85. 若bootstrap$中存放的自举对象受损
•Bootstrap$中记录的数据字典索引不能被rebuild或drop
•Bootstrap$中记录的数据表不能被truncate或drop
•常规的DDL alter操作不能针对bootstrap$相关对象执行
SQL> truncate table obj$;
truncate table obj$
*
ERROR at line 1:
ORA-00701: object necessary for warmstarting database cannot be altered
SQL> alter index i_obj1 rebuild;
alter index i_obj1 rebuild
*
ERROR at line 1:
ORA-00604: error occurred at recursive SQL level 1
ORA-00060: deadlock detected while waiting for resource
SQL> drop index i_obj2;
drop index i_obj2
*
ERROR at line 1:
ORA-00701: object necessary for warmstarting database cannot be altered
86. 针对损坏收集诊断信息
•使用11g IPS incident package
•数据文件块损坏
–Identify the corruption extension using RMAN/DBV/ANALYZE etc (Doc ID 836658.1)
–How to identify the corrupt Object reported by ORA-1578 / RMAN / DBVERIFY (Doc ID 819533.1)
•RMAN检测数据文件上的坏块
•How to identify all the Corrupted Objects in the Database reported with RMAN (Doc ID 472231.1)
•Rman validate check logical database
•DBV工具
•问题数据库的redo dump,可以通过下列命令手动dump redo
•alter system dump logfile '&name' dba min <file#> <block#> dba max <file#> <block#>;
87. 针对损坏收集诊断信息
•通过dd命令复制oracle数据块:
–dd if=<datafile name> bs=<datafile blocksize> skip=<block#> count=1 of=<block#.dd>
•通过oracle命令dump数据块:
SQL> alter system dump datafile ‘&file_name’ block &blocknumber;
SQL> alter system dump datafile &file_number block &blocknumber;
•获得表/索引的DDL
–Dbms_metadata.get_ddl
–exp user/passwd file=<TableName> statistics=none rows=n grants=n tables=<TableName>
89. 常见错误
•ORA-1578 Note:28814.1
•ORA-1578 ORA-26040. Lob Note 293515.1
•ORA-8103 Note:268302.1
•ORA-1499 Mismatch between table/index (analyze validate structure cascade):Table/Index row count mismatch row not found in index
•ORA-1498 (analyze validate structure)
•ORA-600 [kcbzpb_1] / ORA-600 [kcbzpbuf_1]
•ORA-600 [12700] Mismatch between Table and Index
•ORA-600 [kdsgrp1] New in 10g similar to ORA-600 [12700]
•“Corrupt block relative dba: …” 信息