TIP 107# : Virtualization on Oracle Database Appliance (ODA)


 As you know, earlier this year Oracle started supporting virtualization on the Oracle Database Appliance (ODA), which enhances the appliance's functionality. One of the main uses of this virtualization is to run WebLogic on the ODA alongside the database.
I had a chance to test various WebLogic offerings on the ODA, and I thought I would share my findings with readers; hopefully it helps with planning and architecting environments that use the ODA and virtualization. The two ODA software versions that run on ODA V1 and ODA X3-2 and support virtualization are 2.5 and 2.6. The following are my findings on WebLogic installation for each version:
ODA 2.5
  • The Oracle WebLogic template for the ODA is available on E-Delivery for WebLogic 10.3.6 and 12c.
  • The WebLogic template includes the WebLogic configurator, a new tool (specific to the ODA) that automates WebLogic installation on the ODA, similar to what OAM does for the database.
  • The WebLogic configurator installs WebLogic only in cluster mode; a non-clustered WebLogic installation is not supported with the configurator.
  • The WebLogic configurator installs OTD (Oracle Traffic Director), and as of now there is no way to skip it during installation. However, after installation it can be shut down if it is not needed.
  • The minimum installation with the WebLogic configurator includes two WebLogic cluster nodes and creates 6 VMs in total: 1 VM for the WebLogic admin server, 2 VMs (one per compute node) for the WebLogic managed servers, 1 VM for the OTD admin server, and 2 VMs (one per compute node) for OTD.
  • The WebLogic configurator also creates a database for internal use on ODA_BASE.
  • Due to a bug, the WebLogic configurator installs all components as root.
  • Since 6 VMs is the minimum possible configuration and each VM needs at least 2 vCPUs, a minimum installation consumes 12 vCPUs (6 cores) in total.
ODA 2.6
  • No WebLogic template has been released on the ODA for 2.6.
  • No WebLogic configurator for automating WebLogic installation has been released.
  • As of now, creating a Red Hat Linux VM and installing WebLogic manually is the only approach in this release.
Bottom line: 

Although it is recommended to use the automation on the ODA, so that the appliance stays on the Oracle standard template and future changes such as patching and upgrades have less impact, the WebLogic configurator does not look mature enough to rely on yet. First, to use it you have to stay on 2.5 (not the latest ODA release); second, the configurator in 2.5 is buggy (it installs WebLogic as root), has the shortcomings listed above, and is not very flexible. I recommend waiting (if you can), since an updated WebLogic configurator should be out soon, or using the manual process, with the risk that in the long run you may not be able to use the patch bundle (if Oracle includes WebLogic in the ODA bundle) or any further automation Oracle may plan for WebLogic. Also consider that installing OTD adds complexity to WebLogic licensing on the ODA.

In the end, I found the FAQ PDF on the OTN website helpful:

http://www.oracle.com/technetwork/middleware/weblogic-oda/overview/faq-weblogicserver-on-oda-1927929.pdf

If you have any questions or you want to see more on this topic, please drop me a line.

TIP 106# : Quick tips on Cloning Oracle

Oracle cloning is one of the quickest ways to create new software binaries with all the patches from a source home, and it comes in handy especially when the Oracle software installation media is not available. In this post, I am collecting some tips about Oracle cloning that are worth knowing, as a reminder for myself and for readers:

  • Cloning is not cross-platform.
  • A clone is usually done in two phases: prepare_clone.pl on the source and clone.pl on the target.
  • Clone takes care of the inventory and relinking, so further patches can be applied to the newly cloned home.
  • For Oracle Database and CRS homes, prepare_clone.pl is not required.
  • ORACLE_HOME and ORACLE_HOME_NAME must be specified when clone.pl is run; in 11g, ORACLE_BASE must be specified as well.
  • If the Oracle inventory is not in the default location, edit cs.properties before running clone.pl and add -ignoreSysPrereqs -invPtrLoc <oraInst_path>/oraInst.loc.
  • Make sure to run root.sh after clone.pl completes.
  • You may need to create the oraInst.loc file manually on the target if it does not exist; to do so, just create the file in the default location containing inventory_loc=<oraInventory_path>.
  • An alternative to running clone.pl is to use runInstaller for cloning; the format is: ./runInstaller -clone -silent -ignorePreReq ORACLE_HOME="" ORACLE_HOME_NAME="" ORACLE_BASE="" (on Windows it is setup.exe).
  • If the server has more than one Perl installed, it is recommended to set PERL5LIB.
  • If the clone is done on a target server that already has an inventory and the cloned ORACLE_HOME has been used at some point, the ORACLE_HOME first needs to be detached with: ./runInstaller -detachHome ORACLE_HOME=<home_path>. This is why it is recommended to clone into a brand-new directory, to avoid any issues.
  • The clone log on the target is created in $ORACLE_HOME/clone/logs.
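Putting the bullet points together, the runInstaller-based clone command can be assembled like this. This is a minimal sketch; the paths and the home name (`DbHomeClone`) are placeholders for your own environment, and the command is printed rather than executed here:

```shell
# build_clone_cmd assembles the runInstaller clone invocation described above.
# Arguments: new ORACLE_HOME path, ORACLE_HOME_NAME, ORACLE_BASE (all illustrative).
build_clone_cmd() {
  new_home=$1; home_name=$2; oracle_base=$3
  printf '%s -clone -silent -ignorePreReq ORACLE_HOME="%s" ORACLE_HOME_NAME="%s" ORACLE_BASE="%s"\n' \
    "$new_home/oui/bin/runInstaller" "$new_home" "$home_name" "$oracle_base"
}

# Example: clone into a brand-new directory (recommended above).
build_clone_cmd /u01/app/oracle/product/11.2.0/dbhome_clone DbHomeClone /u01/app/oracle
```

After running the printed command, remember to finish with root.sh as noted above.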
Enjoy cloning!

TIP 105# : ACFS and bugs



In Oracle Grid Infrastructure 11g Release 2, Oracle extends the capability of ASM by introducing the ASM
Cluster File System, or ACFS. In other words, Oracle ASM 11g Release 2 can manage all data: Oracle database files, Oracle Clusterware files, and non-structured, general-purpose data such as log files and text files. This sounds amazing; however, in my recent experience with ACFS on 11.2.0.3.4 on Linux 64-bit, I ran into some pretty nasty bugs that I think are worth sharing.
Bug                                         Workaround                                  Bug No.
----------------------------------------------------------------------------------------------
Copy (cp) performance on ACFS               Do not use ls while cp is running,          12626187
                                            or wait for 11.2.0.4
'ls -l' and 'find' commands slow on ACFS    No workaround; wait for 11.2.0.4            10418517

TIP 104#: Node Eviction in RAC 11gR2 due to temporary network hiccups on heartbeat communication

Over the next couple of months, I will examine different eviction scenarios in RAC 11gR2, and when I see some strange behavior I will do my best to report it on this blog. So, RAC lovers, please stay tuned. Here is scenario number 1:

As you know, in 11gR2 Oracle uses the UDP protocol for heartbeats between the nodes.
In this post, I present a node eviction scenario in which UDP communication is blocked between the nodes; you will see that, depending on where and how UDP is blocked, different situations can occur.
The test was done on a two-node RAC on version 11.2.0.3 PSU3 on Linux.

Scenario 1: When UDP communication is blocked on the second node


In this scenario, outgoing UDP traffic for the ocssd process on node2 is blocked.
To do so, we find the UDP port on which ocssd is listening and then disable any outgoing traffic on it:

netstat -a --inet |grep -i udp | grep -i racnode2

udp        0      0 racnode2-priv:14081         *:*                                     
udp        0      0 racnode2-priv:52358         *:*                                     
udp        0      0 racnode2-priv:52242         *:*                                     
udp        0      0 racnode2-priv:42517         *:*  --> ocssd
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:60741         *:*  

[root@racnode2 ~]# lsof -i :42517
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517 


Port 42517 is the port on which ocssd on racnode2 sends its heartbeat.
To break heartbeat communication between node2 and node1, all outgoing traffic on racnode2 for the ocssd process (port 42517) is blocked with this command:

iptables -A OUTPUT -s 192.168.2.152 -p udp --sport 42517 -j DROP
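The port lookup above can be scripted. A small sketch that pulls the ocssd UDP port out of lsof output; a captured sample line is embedded here so the pipeline is self-contained, but on a live node you would pipe `lsof -nP -i UDP` into it instead:

```shell
# Extract the heartbeat UDP port from an lsof line for ocssd.bin.
# The sample line below is the lsof output captured earlier in this post.
lsof_sample='ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517'
port=$(printf '%s\n' "$lsof_sample" | awk '/ocssd\.bin/ {n=split($NF,a,":"); print a[n]}')
echo "ocssd heartbeat port: $port"
# With the port in hand, the blocking rule becomes (run as root):
# iptables -A OUTPUT -s 192.168.2.152 -p udp --sport "$port" -j DROP
```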

The alert log on node1 shows that node2 is evicted (rebootless eviction), then reconfigured, and finally rejoins the cluster:

[cssd(3015)]CRS-1612:Network communication with node racnode2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.840 seconds
2012-12-31 10:50:54.213
[cssd(3015)]CRS-1611:Network communication with node racnode2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.820 seconds
2012-12-31 10:50:58.232
[cssd(3015)]CRS-1610:Network communication with node racnode2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.810 seconds
2012-12-31 10:51:01.056
[cssd(3015)]CRS-1607:Node racnode2 is being evicted in cluster incarnation 249572820; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/racnode1/cssd/ocssd.log.
2012-12-31 10:51:02.584
[cssd(3015)]CRS-1625:Node racnode2, number 2, was manually shut down
2012-12-31 10:51:02.590
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 .
2012-12-31 10:51:02.630
[crsd(3393)]CRS-5504:Node down event reported for node 'racnode2'.
2012-12-31 10:51:05.827
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'Generic'.
2012-12-31 10:51:05.829
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'ora.orcl'.
2012-12-31 10:51:37.987
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.

ocssd.log on racnode2 has more details; here are a couple of key lines:

2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
2012-12-31 10:51:01.129: [    CSSD][3019058064]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
.
.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: total iocapables 0
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: graceful shutdown completed.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssnmSendManualShut: Notifying all nodes that this node has been manually shut down
.
.
2012-12-31 10:51:25.352: [    CSSD][3040868032]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1356979885
2012-12-31 10:51:25.353: [    CSSD][3040868032]clssscmain: Environment is production
.
.
2012-12-31 10:51:26.167: [GIPCHTHR][3024477072] gipchaWorkerCreateInterface: created local interface for node 'racnode2', haName 'CSS_racnode-cluster', inf 'udp://192.168.2.152:29788'

A key point here is that UDP is reconfigured to run on a different port; after that, node2 is able to join the cluster and start up all its resources.
The following netstat output also confirms that ocssd.bin now listens on the new port:

[root@racnode2 ~]# netstat -a --inet |grep -i udp | grep -i racnode2
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:35489         *:*                                     
udp        0      0 racnode2-priv:38321         *:*                                     
udp        0      0 racnode2-priv:60741         *:*                                     
udp        0      0 racnode2-priv:10321         *:*                                     
udp        0      0 racnode2-priv:29788         *:*   --> new port is created.....
   
[root@racnode2 working]# lsof -i :29788
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 5919 grid   52u  IPv4 764918       UDP racnode2-priv:29788 

To sum up, the following sequence of events occurs:

  •  UDP communication for the heartbeat is blocked (outgoing UDP on the ocssd port).
  •  Node1 evicts node2.
  •  Node2 is able to stop all I/O-capable resources; as a result, there is no need to reboot the node (an 11g feature, rebootless restart).
  •  Node2 restarts CSSD and reconfigures the UDP port.
  •  Node2 is able to rejoin the cluster.

This sounds perfect, as node2 is able to recover by itself; the recovery is transparent and straightforward.
Let's see how this failure is recovered if the UDP hiccup occurs on node1 (the master node in a two-node RAC).


Scenario 2: When UDP communication is blocked on the first node

Following the same steps, the UDP port for the heartbeat is found and then blocked, as shown below:

bash-3.2$ netstat -a --inet |grep -i udp | grep -i racnode1
udp        0      0 racnode1-priv:36613         *:*   
udp        0      0 racnode1-priv:36892         *:*   
udp        0      0 racnode1-priv:26055         *:*   
udp        0      0 racnode1-priv:13167         *:*   
udp        0      0 racnode1-priv:17914         *:*   
udp        0      0 racnode1-priv:51067         *:*   

[root@racnode1 ~]# lsof -i :36613
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3010 grid   55u  IPv4  19676       UDP racnode1-priv:36613 

To block the heartbeat, all outgoing traffic on port 36613 is blocked:
iptables -A OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP

Based on scenario 1, I expected to see the same sequence of events; in other words, I expected node2 to be evicted, reconfigured, and rejoined to the cluster.
However, in this case node2 is evicted and then, as shown below, CSSD hangs while starting up and trying to join the cluster.

[root@racnode1 ~]# crsctl check cluster -all
**************************************************************
racnode1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
racnode2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
[root@racnode2 ~]# crsctl stat res -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Abnormal Termination
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       racnode2                                     
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING      
ora.cssdmonitor
      1        ONLINE  ONLINE       racnode2                                     
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  ONLINE       racnode2                                     
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       racnode2                                     
ora.gpnpd
      1        ONLINE  ONLINE       racnode2                                     
ora.mdnsd
      1        ONLINE  ONLINE       racnode2          

Even unblocking the same port by dropping the rule from iptables does not help; CSS on node2 is still not able to join the cluster.

iptables -L      
      
iptables -D OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP


[root@racnode1 ~]# iptables -L      
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination  


After reviewing all the logs (quite lengthy, so I will not copy them here!), it turns out that ocssd on node2 complains about the network heartbeat and no reconfiguration is attempted; on node1, after UDP was blocked, the interface was disabled and no attempt is made to set up communication differently.
As mentioned earlier, although the UDP port has been unblocked, the following errors are still reported on node2 and node1 repeatedly:

Node 2
==========
 [    CSSD][3013077904]clssnmvDHBValidateNcopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 249572810, wrtcnt, 181183, LATS 1471304, lastSeqNo 181182, uniqueness 1356788464, timestamp 1356790643/1482244
 
Node1 
=========
[GIPCHALO][3023862672] gipchaLowerProcessNode: no valid interfaces found to node for 25790 ms, node 0xa062a88 { host 'racnode2', haName 'CSS_racnode-cluster', srcLuid be28e076-9f3aafb1, dstLuid 61a1f895-ba260945 numInf 0, contigSeq 2639, lastAck 2626, lastValidAck 2638, sendSeq [2627 : 2683], createTime 4294328280, sentRegister 1, localMonitor 1, flags 0x2408 }

It turned out that the issue is reported as Bug 14281269: "NODE CAN'T REJOIN THE CLUSTER AFTER A TEMPORARY INTERCONNECT FAILURE - PROBLEM: after an interconnect failure on the first node, the second node restarts the clusterware (rebootless restart) as expected, but can't join the cluster again till the interconnect interface of node1 is shutdown/startup manually."
At the time of this posting, there is no patch available, and the suggested workaround is to bounce the interconnect interface.
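The workaround can be sketched in shell. This assumes the private interconnect NIC on node1 is eth1 (an assumption; check yours with `oifcfg getif`), and the actual bounce commands are left commented out since they must be run as root:

```shell
# IFACE is the private interconnect NIC on node1; eth1 is an assumed name.
IFACE=${IFACE:-eth1}
echo "bouncing interconnect interface $IFACE on node1"
# Run as root:
# ifdown "$IFACE" && ifup "$IFACE"
# or equivalently: ip link set "$IFACE" down && ip link set "$IFACE" up
```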

In my test, even rebooting node2 (the evicted node) did not help; I ended up killing the gipc daemon on node1 (the master/surviving node), which did help: the whole cluster recovered and node2 was able to join.

[root@racnode1 working]# ps -ef |grep -i gipc
grid      2961     1  0 05:40 ?        00:00:16 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7709  4792  0 06:44 pts/1    00:00:00 grep -i gipc
[root@racnode1 working]# kill -9 2961
[root@racnode1 working]# ps -ef |grep -i gipc
grid      7717     1 15 06:44 ?        00:00:00 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7755  4792  0 06:44 pts/1    00:00:00 grep -i gipc


[/u01/app/11.2.0/grid/bin/oraagent.bin(3528)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:5} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2012-12-29 06:44:56.972
[/u01/app/11.2.0/grid/bin/orarootagent.bin(3535)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:23} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-12-29 06:44:56.974
[/u01/app/11.2.0/grid/bin/oraagent.bin(3741)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:5:63} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2012-12-29 06:44:57.098
[ohasd(2414)]CRS-2765:Resource 'ora.ctssd' has failed on server 'racnode1'.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2401:The Cluster Time Synchronization Service started on host racnode1.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:01.164
[cssd(3010)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:02.363
[crsd(7759)]CRS-1012:The OCR service started on node racnode1.
2012-12-29 06:45:03.155
[evmd(7762)]CRS-1401:EVMD started on node racnode1.
2012-12-29 06:45:05.147
[crsd(7759)]CRS-1201:CRSD started on node racnode1.
2012-12-29 06:45:38.798
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-29 06:45:38.800
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.


===== alert for node2 =========

2012-12-29 06:39:38.132
[cssd(7700)]CRS-1605:CSSD voting file is online: /dev/sda1; details in /u01/app/11.2.0/grid/log/racnode2/cssd/ocssd.log.
2012-12-29 06:45:01.165
[cssd(7700)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2401:The Cluster Time Synchronization Service started on host racnode2.
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:05.257
[ohasd(2405)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2012-12-29 06:45:16.836
[ctssd(8061)]CRS-2408:The clock on host racnode2 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2012-12-29 06:45:25.475
[crsd(8199)]CRS-1012:The OCR service started on node racnode2.
2012-12-29 06:45:25.541
[evmd(8079)]CRS-1401:EVMD started on node racnode2.
2012-12-29 06:45:27.331
[crsd(8199)]CRS-1201:CRSD started on node racnode2.
2012-12-29 06:45:35.642
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-12-29 06:45:36.181
[/u01/app/11.2.0/grid/bin/oraagent.bin(8347)]CRS-5011:Check of resource "orcl" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2012-12-29 06:45:37.301
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"

To conclude, in a two-node RAC:

  1. A network hiccup on the heartbeat port of node2 is recovered automatically.
  2. A network hiccup on the heartbeat port of node1 requires manual intervention due to bug 14281269.
  3. Due to several reported bugs, it is recommended to be on at least 11.2.0.3 PSU3; check the My Oracle Support note "List of gipc defects that prevent GI from starting/joining after network hiccups" (ID 1488378.1) for other bugs.

 




TIP 103# : GoldenGate and statistics

After setting up GoldenGate in your environment, the main question you will most likely face is whether GoldenGate is working and, if so, how much work it has done.
GoldenGate provides the stats command to report the work done; the following shows how to use it properly and effectively.

To get stats from now onwards, follow the steps below:

1. Reset the numbers : stats extract ext_int, reset
2. Apply the changes
3. Retrieve stats    :
                     stats extract ext_int, table xxxx.yyyy, latest
                     stats extract ext_int, totalsonly xxxx.yyyy, latest


Please be aware that if you do not reset the numbers, all reported statistics are counted from the startup of the GoldenGate process. In other words, if the numbers are not reset, the stats commands behave as follows:

stats extract ext_int, totalsonly xxxx.yyyyy   --> since startup of ext_int
stats extract ext_int, daily, table xxxx.yyyy  --> since the start of the current day
stats extract ext_int, daily, table owner.*    --> for all tables since the start of the current day
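The reset-then-measure cycle in the steps above can be wrapped in a small helper. This sketch only prints the GGSCI command sequence for a given extract group; the group and table names are the same placeholders used above, and on a real system you would pipe the output into ggsci (e.g. `... | $OGG_HOME/ggsci`, where $OGG_HOME is an assumed install location):

```shell
# stats_window prints the GGSCI commands for a clean measurement window:
# reset first, run the workload, then report only the delta with "latest".
stats_window() {
  group=$1; table=$2
  echo "stats extract $group, reset"
  echo "-- ... apply the workload, then:"
  echo "stats extract $group, table $table, latest"
  echo "stats extract $group, totalsonly $table, latest"
}

stats_window ext_int xxxx.yyyy
```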


Hope this helps those who may otherwise find the numbers misleading or incorrect.