So what happened?
The SOA JMS store (handled as an XA resource by WLS) became unavailable (declared unhealthy) for 30 minutes. During this time the logs were full of the errors below.
start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable
After 30 minutes the persistent store was available again. Once it was, the JVM filled up with pending backlog messages, which resulted in a Full GC condition. Full GCs are stop-the-world and rendered the JVM inaccessible to the application. To make the server available for application work again, we had to restart the SOA server.
The above error applies to Oracle SOA Suite 11g/12c.
What were the errors in the logs?
The JTA health state has changed from HEALTH_OK to HEALTH_WARN with reason codes: Resource WLStore_XXX_base_domain_SOAJMSFileStore declared unhealthy
start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable
Exception occured when binding was invoked.
Exception occured during invocation of JCA binding: "JCA Binding execute of Reference operation 'Produce_Message' failed due to: ERRJMS_PROVIDER_ERR.
ERRJMS_PROVIDER_ERR.
Unable to produce message due to JMS provider internal error.
Please examine the log file to determine the problem.
".
The invoked JCA adapter raised a resource exception.
Please examine the above error message carefully to determine a resolution.
" . Root cause :
javax.transaction.SystemException: start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable
javax.transaction.xa.XAException: Internal error: XAResource 'WLStore_XXX_base_domain_SOAJMSFileStore' is unavailable
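To see quickly which resource is failing and how often, the failure line quoted above can be counted per resource name. This is a minimal sketch in Python, assuming the log has been read into a string; the function name `count_rmfail_by_resource` and the sample text are illustrative only and not part of any WebLogic tooling.

```python
import re
from collections import Counter

# Matches the JTA failure message quoted above and captures the resource name.
RMFAIL_PATTERN = re.compile(
    r"start\(\) failed on resource '([^']+)':\s*XAER_RMFAIL"
)


def count_rmfail_by_resource(log_text):
    """Return a Counter mapping resource name -> number of XAER_RMFAIL failures."""
    # Collapse all whitespace first, because long log messages are often wrapped.
    flattened = " ".join(log_text.split())
    return Counter(RMFAIL_PATTERN.findall(flattened))


# Illustrative sample built from the messages shown in this post.
sample_log = """
start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore':
XAER_RMFAIL : Resource manager is unavailable
javax.transaction.SystemException: start() failed on resource
'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable
"""

for resource, failures in count_rmfail_by_resource(sample_log).items():
    print(resource, failures)
```

A count that keeps climbing for a single store name during the outage window points at that store rather than at the JMS adapter itself.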
So why was the persistent store declared unhealthy?
By default, if an XA resource that is participating in a global
transaction fails to respond to an XA call from the WebLogic Server transaction
manager, WebLogic Server flags the resource as unhealthy and unavailable, and
blocks any further calls to the resource in an effort to preserve resource
threads. The failure can be caused by either an unhealthy transaction or an
unhealthy resource—there is no distinction between the two causes. In both
cases, the resource is marked as unhealthy (Doc ID 1484996.1).
Here, the JMS store (the XA resource) did not respond to a request from the WebLogic transaction manager within 120 seconds, the value of "MaxXACallMillis". When this happened, the WLS transaction manager marked that XA resource as unhealthy and stopped all further communication with it until the "MaxResourceUnavailableMillis" period had passed, which is 30 minutes in a default install.
Q. Why was the persistent store inaccessible for 30 minutes?
A. MaxResourceUnavailableMillis defines the maximum duration (in milliseconds) that an XA resource stays marked as unhealthy. By default it is set to 30 minutes. After this duration, the XA resource is declared available again.
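The interaction of the two timers can be pictured with a small toy model. This is only a sketch of the behavior described above, not WebLogic's actual implementation; the class and method names are made up for illustration, and the defaults are the 120 seconds and 30 minutes mentioned in this post.

```python
class XAResourceHealth:
    """Toy model of the unhealthy/available gating described above."""

    def __init__(self,
                 max_xa_call_millis=120 * 1000,
                 max_resource_unavailable_millis=30 * 60 * 1000):
        self.max_xa_call_millis = max_xa_call_millis
        self.max_resource_unavailable_millis = max_resource_unavailable_millis
        self.unhealthy_since = None  # time (ms) at which the resource was flagged

    def record_call(self, now_millis, call_duration_millis):
        """Flag the resource as unhealthy if an XA call exceeded the limit."""
        if call_duration_millis > self.max_xa_call_millis:
            self.unhealthy_since = now_millis

    def is_available(self, now_millis):
        """Calls are blocked until the unavailable window has fully elapsed."""
        if self.unhealthy_since is None:
            return True
        elapsed = now_millis - self.unhealthy_since
        if elapsed >= self.max_resource_unavailable_millis:
            self.unhealthy_since = None  # declared available again
            return True
        return False


health = XAResourceHealth()
# An XA call that took 125 seconds exceeds the 120 second limit.
health.record_call(now_millis=0, call_duration_millis=125 * 1000)
print(health.is_available(10 * 60 * 1000))  # still blocked after 10 minutes
print(health.is_available(30 * 60 * 1000))  # available again after 30 minutes
```

In this model, every producer that tries to use the store during the blocked window fails immediately, which matches the flood of XAER_RMFAIL errors seen in the logs.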
Q. Why did the JMS store not respond to the transaction manager in time?
A. There could be various reasons, for example:
1. As per Oracle Note 1358303.1, which covers the same error code we faced, the file store itself had an issue. It had grown very large, so it was flagged as unhealthy and compromised the JTA health, since it is a participating resource in the overall transaction.
2. A minor network issue could have caused an accessibility problem between the server and the JMS store, which resides on disk. I could not find any network connectivity errors in the logs so far.
3. The JMS store could have been busy processing other transactions and needed more time to respond than MaxXACallMillis allows. Talk to the developers to understand the code design and find out how busy the JMS queues/topics are.
Q. What are the tuning recommendations to prevent this issue in the future?
1. Set the WLS domain parameter MaxResourceUnavailableMillis to less than the existing 30 minutes; I would start with 10 (this recommendation is as per Metalink Note 1320141.1). This ensures that the resource is retried for availability after 10 minutes instead of the current 30, which keeps system downtime to a minimum. It also means fewer messages queue up for processing once the store comes back after a similar failure. A smaller backlog keeps the server from going into the long-duration GCs that happened in the case above.
2. If you see this issue reappearing and anticipate a busy store, increase MaxXACallMillis to 3 minutes and see whether the issue reappears. This change gives the store more time to respond before it is declared unhealthy. Keep tuning this parameter until you see optimal performance in your environment, based on the application design and usage. No one size fits all, so work out a value that suits your environment and application.
3. Compacting the file store defragments and reduces the space it occupies. The compact command does not delete current data, and it only works when the WebLogic Server that hosts the store is offline. Make sure you back up the old store file before you run the compact command. Refer here to see how you can run the compaction commands.
4. In most situations, file stores do not grow too large. After a message is consumed, it is deleted from the file store and the space it used is made available for other messages. However, if so many messages accumulate that the file store repeatedly grows too large, set lower quotas so that producers are blocked from sending more messages to the destination until the consumers have consumed and deleted the existing ones. Note that it is recommended to configure quotas on each JMS server. The quota can be set based on application requirements. I will try to discuss this at length in another post.
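Recommendations 1 and 2 come down to two millisecond values on the domain's JTA configuration. The sketch below is plain Python that converts minutes to milliseconds and prints WLST commands you could review and paste into an online WLST session. The domain name "base_domain" is an assumption taken from the store name in the logs, the helper names are made up for illustration, and the generated lines assume you have already run connect() against the admin server. Treat the output as a starting point and check it against your own domain before activating anything.

```python
def minutes_to_millis(minutes):
    """Convert minutes to the millisecond value the JTA settings expect."""
    return int(minutes * 60 * 1000)


def render_wlst_script(domain_name, max_unavailable_minutes=10, max_xa_call_minutes=3):
    """Build WLST edit-session commands for the two JTA timeouts.

    Assumes an already connected WLST session. The defaults are the starting
    values suggested in this post (10 and 3 minutes), not universal settings.
    """
    lines = [
        "edit()",
        "startEdit()",
        "cd('/JTA/%s')" % domain_name,
        "cmo.setMaxResourceUnavailableMillis(%d)"
        % minutes_to_millis(max_unavailable_minutes),
        "cmo.setMaxXACallMillis(%d)" % minutes_to_millis(max_xa_call_minutes),
        "save()",
        "activate()",
    ]
    return "\n".join(lines)


# "base_domain" is assumed here; replace it with your own domain name.
print(render_wlst_script("base_domain"))
```

The same two values can also be changed from the WebLogic administration console under the domain's JTA settings, which may be simpler for a one-off change.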
Please let me know in the comment section if the above tuning helped you. I will be glad to hear your stories and experiences. Happy learning!
Soumya Mishra
Thanks, exactly the same behavior happened in a Prod environment, for which I had to restart the server...