Question

IFS Application Server fails after RAC node reboot?

  • 29 March 2022

We’ve recently seen a couple of occasions where a reboot of one of our Oracle RAC nodes causes the IFS Application Server to fail after some period of time. Does anyone know why this happens or how to prevent it? We have both Apps9 and Apps10 systems (two different production environments) and both versions of the Application Server fail in the same way but at different times, with Apps10 typically failing first.

It seems like the Apps Servers run OK with minimal session failure at the time of the RAC node reboot (which is good and is expected), but then some time later the Application Servers become unresponsive and nobody can connect. The fix is to restart the Application Server instance, after which it works fine again indefinitely.

There is very little logged in any of the Apps Server logs when this happens, so they don’t really show anything useful.

In the past we’ve seen it take a couple of days to reach the point of an Apps Server crash, but the last time it occurred it took about 8 hours for the Apps10 environment to fail and somewhere between 10 and 14 hours for Apps9.

Obviously we can avoid this by forcibly restarting the Application Server instances soon after an Oracle RAC node reboots unexpectedly, but that sort of defeats the purpose of having RAC for continuous operation.

Has anyone else seen this, or have any thoughts on what we might be able to do to solve it and maintain uptime?

It feels like there is some kind of pool in the Application Server that allows users to keep working after the node restarts but which eventually exhausts itself, or that something in the system breaks and then eventually times out even though users are not immediately impacted. As noted, the only log entries we have found are in ManagedServer1, for a timeout that happens around the time of the failure, indicating that some internal processes have been struggling to run since the RAC node went down. Here’s an example (36,012 seconds is almost exactly 10 hours, and it would have been about that long after the RAC node failed, but it isn’t clear whether this is exactly when the Apps Server stopped responding, as we did not have users in the system at the time):

####<Mar 24, 2022 12:04:12,574 AM EDT> <Error> <JTA> <EBAD-SIMS-IFSBC> <ManagedServer1> <[active] ExecuteThread: '27' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <5de80424-f95c-4999-86f1-7d9c0da9f0a0-0059c920> <1648094652574> <[severity-value: 8] [rid: 0] [partition-id: 0] [partition-name: DOMAIN] > <BEA-110423> <Abandoning transaction after 36,012 seconds: 

We see a few of these errors in the same log for different thread numbers.
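
For reference, BEA-110423 is the JTA subsystem abandoning an incomplete transaction, which fits the idea of something having quietly broken at the time of the node reboot. Here’s a rough WLST (online) sketch of how one could check the two settings that seem most relevant: the JTA abandon timeout, and whether the data source tests pooled connections before handing them out (so connections that died with the RAC node get discarded rather than reused). The admin URL, credentials, and the data source name IFSDS are placeholders, not our actual values.

# Rough WLST (online) sketch; run via wlst.sh from the WebLogic install.
# The URL, credentials, and data source name 'IFSDS' are placeholders.
connect('weblogic', '<password>', 't3://adminhost:7001')

# JTA abandon timeout: how long WebLogic keeps working on an incomplete
# transaction before logging BEA-110423 and abandoning it (default 86400 s).
cd('/JTA/' + domainName)
print 'AbandonTimeoutSeconds:', cmo.getAbandonTimeoutSeconds()

# Pool-level connection testing: when TestConnectionsOnReserve is enabled,
# a connection that died with the RAC node is tested and replaced before
# being handed to the application, instead of failing in user code.
cd('/JDBCSystemResources/IFSDS/JDBCResource/IFSDS/JDBCConnectionPoolParams/IFSDS')
print 'TestConnectionsOnReserve:', cmo.isTestConnectionsOnReserve()
print 'TestTableName:', cmo.getTestTableName()

disconnect()

If TestConnectionsOnReserve turned out to be disabled, that would at least be consistent with dead connections lingering in the pool after the reboot.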

Thanks in advance for any thoughts or input.

Nick


1 reply


Hi Nick

We have experienced the same in Apps9.

We now monitor the ManagedServer.log, and if it grows larger than around 200 KB we get an alert. If we see the Abandoning transaction errors (hopefully in time), we restart the affected ManagedServer.
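
A minimal sketch of that kind of check, runnable from cron (the log path and the alerting are illustrative, not our exact script):

#!/usr/bin/env python
# Minimal sketch of the log check described above; the log path and
# size threshold are illustrative, not our exact setup.
import os
import sys

LOG = '/u01/domains/ifs/servers/ManagedServer1/logs/ManagedServer1.log'
SIZE_LIMIT = 200 * 1024             # alert when the log exceeds ~200 KB
PATTERN = 'Abandoning transaction'  # the BEA-110423 message text

alerts = []
if os.path.getsize(LOG) > SIZE_LIMIT:
    alerts.append('log exceeds 200 KB')

with open(LOG) as f:
    if any(PATTERN in line for line in f):
        alerts.append('"Abandoning transaction" entries present')

if alerts:
    # Hook in real alerting here (mail, Nagios, etc.); the non-zero exit
    # code makes this usable from cron or a monitoring agent.
    print('ALERT: %s (%s)' % ('; '.join(alerts), LOG))
    sys.exit(1)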

This seems to happen to us when there has been some kind of disruption between the database and the middleware. Like you say, it catches you out, as it is often many hours after the actual network blip.

This posting suggests a potential fix, although we have not yet attempted it:

Error: "Abandoning transaction after xxx,xxx seconds" in Managed Server Logs causing system downs - IFSAPP9 | IFS Community
