Slides from the MOW2010 presentation.
The presentation provides a practical understanding of Oracle Clusterware/CRS and the knowledge required to troubleshoot Clusterware issues independently: why nodes are evicted, why resources don't start or fail for no apparent reason. After the presentation, a DBA will know where to look for answers instead of blindly running the cluvfy utility. The session includes demos of troubleshooting Clusterware issues such as evictions. The presentation does go into Oracle Clusterware internals, but it is appropriate for all DBAs, from beginners to the experienced.
- A successful, growing business for more than 10 years
- Served many customers with complex requirements and infrastructure just like yours.
- Operates globally to provide 24x7 "always awake" services
Clusterware is generic, with customizations for Oracle resources.
Only Clusterware accesses the OCR and voting disk (VD).
Only DB instances access the shared database files.
The OCR is accessed by almost every Clusterware component - configuration is read from the OCR.
The VIP is part of Oracle Clusterware.
Emphasize shared access to data!
OPROCD - pre-10.2.0.4 on Linux, the hangcheck-timer kernel module was used instead.
Node membership, and group membership for instances and ASM disk groups.
CSSDs cannot talk to each other -> operations are not synchronized -> shared data access -> corruption
In addition to the network heartbeat (NHB), Oracle introduced the disk heartbeat (DHB).
I/O fencing is needed on split brain to prevent the evicted node from doing any further I/Os.
Oracle doesn't rely on any particular hardware - it needs compatibility with all platforms/hardware.
Oracle can't shoot another node without remote control, and can't rely on one type of I/O fencing (HBA/SCSI reservations).
What's left? Beg the other node - please shoot yourself!
What if CSSD itself is not healthy? It's very possible that it's not a network problem - CSSD just doesn't reply for some reason. OCLSOMON comes onto the scene.
Worse yet, the whole node may be sick and even OCLSOMON can't function properly - e.g., CPU execution has stalled.
Losing access to the voting disks - CSSD commits suicide.
Why? The cluster must have two communication paths, and the VD is the medium for I/O fencing.
All nodes can reboot if the voting disk is lost.
Good time to discuss voting disk redundancy: 1 vs 2 vs 3.
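A minimal sketch for the redundancy discussion - inspecting and changing the voting disk configuration with crsctl (10.2 syntax; run as root, and the device path below is a placeholder):

```shell
# List the currently configured voting disks
crsctl query css votedisk

# Add or remove a voting disk; in 10.2 the cluster should be down
# and -force is required. Path is hypothetical - use your own device.
crsctl add css votedisk /dev/raw/raw3 -force
crsctl delete css votedisk /dev/raw/raw3 -force
```

An odd number of voting disks (1 or 3) is used so a node must see a majority of them to stay in the cluster.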
diagwait -> not set by default (treated as 0)
reboottime -> 3 seconds
OPROCD margin = diagwait - reboottime
See init.cssd for more details
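For reference, the commonly documented way to change diagwait (Clusterware must be stopped on all nodes first; run as root):

```shell
# Set diagwait to 13 seconds (the value commonly recommended by Oracle
# for pre-11.2 Linux clusters); -force is required while CRS is down.
crsctl set css diagwait 13 -force

# Verify the new value
crsctl get css diagwait

# With reboottime = 3, this gives OPROCD a margin of 13 - 3 = 10 seconds.
```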
When Clusterware autostart is disabled (crsstart -> disable), "init.cssd autostart" doesn't do anything. In this case a DBA can initiate the start later using "init.crs start" (10.1+) or "crsctl start crs" (10.2+).
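The commands involved, as a quick sketch (10.2+ syntax; run as root):

```shell
# Disable automatic Clusterware startup at boot
crsctl disable crs

# ...later, start the stack manually on this node
crsctl start crs

# Re-enable autostart for future reboots
crsctl enable crs
```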
Configuration data - voting disks, ports, resource profiles (ASM, instances, listeners, VIPs, etc.).
DEMO - existing dependencies
DB is in CRS Home
Log files will be in the appropriate Oracle home:
{home}/log/{host}/racg/{resource_name}.log
DEMO - log files and action script home match!
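For example, tailing a resource's RACG log under the matching home (the CRS home path and resource name below are placeholders - substitute your own):

```shell
# Hypothetical values for illustration only
ORA_CRS_HOME=/u01/app/crs
HOST=$(hostname -s)

# Watch the RACG log for one resource as it starts/stops
tail -f "$ORA_CRS_HOME/log/$HOST/racg/ora.rac1.vip.log"
```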
DEMO - IMON logs
DEMO - stop DB + rename spfile + start DB
If time permits, show the old way with the .cap file.