Thursday, June 29, 2017

JMS Store declared unhealthy and unavailable: start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable

So what happened? 

The SOA JMS store (handled as an XA resource by WLS) became unavailable (it was declared unhealthy) for 30 minutes. During this time the logs were full of the error below.

start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable

After 30 minutes the persistent store was available again. Once it was back, the JVM filled up with pending (backlogged) messages, which triggered Full GCs. Full GCs are stop-the-world events, and they made the JVM inaccessible to the application. To make the server available for application work again, we had to restart the SOA server.

The above issue applies to Oracle SOA Suite 11g/12c.

What were the errors in the logs?

The JTA health state has changed from HEALTH_OK to HEALTH_WARN with reason codes: Resource WLStore_XXX_base_domain_SOAJMSFileStore declared unhealthy

start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable

Exception occured when binding was invoked.
Exception occured during invocation of JCA binding: "JCA Binding execute of Reference operation 'Produce_Message' failed due to: ERRJMS_PROVIDER_ERR.
ERRJMS_PROVIDER_ERR.
Unable to produce message due to JMS provider internal error.
Please examine the log file to determine the problem.
".
The invoked JCA adapter raised a resource exception.
Please examine the above error message carefully to determine a resolution.
" . Root cause :
javax.transaction.SystemException: start() failed on resource 'WLStore_XXX_base_domain_SOAJMSFileStore': XAER_RMFAIL : Resource manager is unavailable
javax.transaction.xa.XAException: Internal error: XAResource 'WLStore_XXX_base_domain_SOAJMSFileStore' is unavailable


So why was the persistent store declared unhealthy?

By default, if an XA resource that is participating in a global transaction fails to respond to an XA call from the WebLogic Server transaction manager, WebLogic Server flags the resource as unhealthy and unavailable, and blocks any further calls to the resource in an effort to preserve resource threads. The failure can be caused by either an unhealthy transaction or an unhealthy resource; there is no distinction between the two causes. In both cases, the resource is marked as unhealthy (Doc ID 1484996.1).

Here the JMS store (the XA resource) did not respond to a request from the WebLogic transaction manager within 120 seconds, the default value of MaxXACallMillis. When this happened, the WLS transaction manager marked the XA resource as unhealthy and stopped all further communication with it until MaxResourceUnavailableMillis had elapsed, which is 30 minutes in a default install.

Q. Why did the persistent store become inaccessible for 30 minutes?

A. MaxResourceUnavailableMillis defines the maximum duration (in milliseconds) that an XA resource is marked as unhealthy. By default it is set to 30 minutes (1800000 ms). After this duration, the XA resource is declared available again.
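To make the interplay of the two timeouts concrete, here is a minimal sketch in Python. It is a simplified model of the behaviour described above, not WebLogic code: the class name, method names and the exact moment at which the resource is marked unhealthy are my own assumptions for illustration. Only the two default values come from the discussion above.

```python
# Illustrative model only; this is NOT how WebLogic is implemented internally.
MAX_XA_CALL_MILLIS = 120 * 1000                    # default: 120 seconds
MAX_RESOURCE_UNAVAILABLE_MILLIS = 30 * 60 * 1000   # default: 30 minutes


class XAResourceHealth(object):
    """Tracks whether a (hypothetical) XA resource may be called."""

    def __init__(self, max_xa_call_millis=MAX_XA_CALL_MILLIS,
                 max_unavailable_millis=MAX_RESOURCE_UNAVAILABLE_MILLIS):
        self.max_xa_call_millis = max_xa_call_millis
        self.max_unavailable_millis = max_unavailable_millis
        self.unhealthy_since = None  # None means the resource is healthy

    def record_xa_call(self, started_at_millis, duration_millis):
        """Mark the resource unhealthy if an XA call exceeded the limit."""
        if duration_millis > self.max_xa_call_millis:
            # Assumption: the resource is flagged when the limit is exceeded.
            self.unhealthy_since = started_at_millis + self.max_xa_call_millis

    def is_available(self, now_millis):
        """Return True if calls to the resource are allowed at now_millis."""
        if self.unhealthy_since is None:
            return True
        if now_millis - self.unhealthy_since >= self.max_unavailable_millis:
            # The unavailability window has passed; allow calls again.
            self.unhealthy_since = None
            return True
        return False


health = XAResourceHealth()
health.record_xa_call(started_at_millis=0, duration_millis=150 * 1000)
blocked_after_10_min = not health.is_available(10 * 60 * 1000)
available_after_32_min = health.is_available(32 * 60 * 1000)
```

With the defaults, a single XA call that hangs for more than two minutes keeps the store blocked for the next half hour, which matches what we saw in the logs.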

Q. Why did the JMS store not respond to the transaction manager in time?

A. There could be various reasons, for example:

1. As per Oracle Note 1358303.1, which covers the same error code we faced, the file store itself can be the problem: it had grown very big, so it showed as unhealthy and compromised the JTA health, since it is a participating resource in the overall transaction.

2. A minor network issue could have interrupted access between the server and the JMS store, which resides on disk. I could not find any network connectivity errors in the logs so far.

3. The JMS store could have been busy processing other transactions and needed more time to respond than MaxXACallMillis allows. Talk to the developers, understand the code design, and find out how busy the JMS queues/topics are.
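For the first possibility, a quick way to check whether the file store has grown large is to add up the sizes of its data files. The sketch below is only an illustration: it assumes the store files carry a .DAT extension and that you pass in the directory configured for the store, so verify both against your own environment.

```python
import os


def store_file_sizes(store_dir, suffix=".DAT"):
    """Return (file name, size in bytes) pairs for store files, largest first."""
    sizes = []
    for name in os.listdir(store_dir):
        path = os.path.join(store_dir, name)
        # Assumption: store data files end in .DAT (compared case-insensitively).
        if os.path.isfile(path) and name.upper().endswith(suffix):
            sizes.append((name, os.path.getsize(path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes


def total_store_size_mb(store_dir):
    """Return the combined size of all store files in megabytes."""
    total = sum(size for _, size in store_file_sizes(store_dir))
    return total / (1024.0 * 1024.0)
```

Call total_store_size_mb with the store directory of the affected server and compare the result over a few days; a number that only ever grows suggests consumers are not keeping up.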

Q. What are the tuning recommendations to prevent this issue in the future?

1. Set the WLS domain parameter MaxResourceUnavailableMillis to a lower value than the existing 30 minutes; I would start with 10 (this recommendation is as per Metalink Note 1320141.1). The unhealthy resource is then retried after 10 minutes instead of 30, which keeps system downtime to a minimum. It also means fewer messages queue up for processing once the store comes back after a similar failure, and a smaller backlog keeps the server from going into the long GCs seen in the case above.

2. If you see this issue reappear and anticipate a busy store, increase MaxXACallMillis to 3 minutes and check whether the issue comes back. This change gives the store more time to respond before it is declared unhealthy. Keep tuning this parameter until you see optimal performance in your environment, based on the application design and usage. No one size fits all, so work out a number that suits your environment and application.

3. Compacting the file store defragments the space it occupies. The compact command does not delete current data, and it only works when the WebLogic Server that hosts the store is offline. Make sure you back up the old store file before you run the compact command. Refer to the Oracle documentation on the persistent store administration utility to see how to run the compaction commands.

4. In most situations, file stores do not grow too large. After a message is consumed, it is deleted from the file store and the space it used is made available for other messages. However, if so many messages accumulate that the file store repeatedly grows too large, set lower quotas so that producers are blocked from sending more messages to the destination until the consumers have consumed and deleted the existing ones. It is recommended to configure quotas on each JMS server; the quota can be set based on application requirements. I will try to discuss this at length in another post.
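Recommendations 1 and 2 can be scripted. Below is a minimal sketch of a helper that could be called from a WLST (Jython) session. The function name and the default minute values are my own choices for illustration; the helper simply calls the two setters on whatever JTA configuration bean you hand it. The WLST session shown in the comments uses placeholder credentials, URL and domain name, so adapt it and test it in a non-production domain first.

```python
MILLIS_PER_MINUTE = 60 * 1000


def apply_jta_timeouts(jta_bean, unavailable_minutes=10, xa_call_minutes=3):
    """Set the two JTA limits discussed above on a JTA configuration bean.

    jta_bean is expected to expose setMaxResourceUnavailableMillis and
    setMaxXACallMillis, as the domain's JTA MBean does in WLST.
    """
    unavailable_millis = unavailable_minutes * MILLIS_PER_MINUTE
    xa_call_millis = xa_call_minutes * MILLIS_PER_MINUTE
    jta_bean.setMaxResourceUnavailableMillis(unavailable_millis)
    jta_bean.setMaxXACallMillis(xa_call_millis)
    return unavailable_millis, xa_call_millis


# Assumed WLST usage (credentials, URL and domain name are placeholders):
#   connect('weblogic', '<password>', 't3://adminhost:7001')
#   edit()
#   startEdit()
#   cd('/JTA/base_domain')
#   apply_jta_timeouts(cmo)   # cmo is the current MBean after cd()
#   save()
#   activate()
```

Keeping the values in minutes at the call site and converting once inside the helper avoids the easy mistake of typing the wrong number of zeros into a milliseconds field.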


Please let me know in the comment section if the above tuning helped you. I will be glad to hear your stories and experiences. Happy learning!

Soumya Mishra