Recently we came across a situation where customers noted significantly different performance depending on how a BA report is run. A scheduled report completes within an acceptable time frame, but the same report becomes sluggish when triggered through the Order Report functionality. Users literally experienced the IFS IEE Client hanging for hours.
Investigating further, we noticed that the behaviour differs between the customer's network and the support net. Moreover, the problem is not specific to BA reports; it affects other application flows as well. We observed this trend among managed-services customers.
Problem
When a client call is sent from the application, it goes to the middleware server and then to the database. The response is returned from the database to the middleware server, but the communication from the middleware server back to the client is lost. We therefore initiated a further investigation into the network configuration to understand why the response from the middleware server to the client behaves differently when public versus VPN access is used. The issue also does not occur on the support net.
- Client to middleware server: succeeds -> the DB gets called
- Middleware server to DB: succeeds -> the DB call completes
- Response from middleware server to client: fails (when accessed via a public network)
Furthermore, the issue does not occur when the report is scheduled, because no response needs to be returned to the client. This allowed us to narrow the problem down to the communication channel between the middleware server and the client.
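The failure mode above can be sketched with plain sockets (ports, payloads, and timings below are illustrative stand-ins, not the actual IFS protocol): a "middleware" that takes longer to reply than the connection is allowed to stay idle, and a client that blocks on the response and sees nothing but silence.

```python
import socket
import threading
import time

def demo(client_timeout):
    """Simulate a long-running report over a connection with an idle limit."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))  # ephemeral port
    srv.listen(1)
    port = srv.getsockname()[1]

    def middleware():
        conn, _ = srv.accept()
        conn.recv(1024)              # receive the "report request"
        time.sleep(0.5)              # the report takes a while to produce
        try:
            conn.sendall(b"report done")  # response the client may never see
        except OSError:
            pass                     # client already gave up
        conn.close()

    threading.Thread(target=middleware, daemon=True).start()

    cli = socket.create_connection(("127.0.0.1", port))
    cli.settimeout(client_timeout)   # stand-in for the LB's 4-minute idle timeout
    cli.sendall(b"run report")
    try:
        return cli.recv(1024).decode()
    except socket.timeout:
        return "hang: no response"
    finally:
        cli.close()
        srv.close()
```

With a generous timeout, `demo(2.0)` returns the report; with a timeout shorter than the processing time, `demo(0.2)` reports the hang, just as the IEE client waits indefinitely once the load balancer has silently dropped the idle flow.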
Solution
The Azure configuration for the VM's public IP turned out to be the cause: the TCP idle timeout for the Azure Load Balancer was set to 4 minutes.
Once we increased that value to 30 minutes and ran the report again from a public network, it worked perfectly.
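The change can be applied programmatically; a hedged sketch with the `azure-mgmt-network` SDK is below (the subscription, resource group, and public IP names are placeholders, and the same `idle_timeout_in_minutes` setting is also exposed on load-balancer rules).

```python
# Sketch only: raising the TCP idle timeout on a VM's public IP address.
# Requires azure-identity and azure-mgmt-network; names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Fetch the current public IP configuration, bump the idle timeout, and update.
pip = client.public_ip_addresses.get("my-resource-group", "my-vm-public-ip")
pip.idle_timeout_in_minutes = 30  # maximum value supported by Azure
client.public_ip_addresses.begin_create_or_update(
    "my-resource-group", "my-vm-public-ip", pip
).result()
```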
https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-tcp-idle-timeout
Suggestions for Improvements
If the TCP idle timeout is exceeded, the IFS application hangs during the communication from the middleware server back to the client. Although the IEE client keeps waiting for the response, the connection has already been lost due to the Azure Load Balancer's TCP idle timeout.
It may be that the .NET access provider cannot detect the timeout, or that the timeout notification is dropped. Introducing a 'keep-alive' message from the application server to the client would prevent the connection from timing out while the client is waiting for the report.
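One transport-level way to achieve this, sketched below with assumed interval values, is enabling TCP keep-alive on the socket so probes flow more frequently than the load balancer's idle timeout, keeping the connection alive while the client waits.

```python
import socket

def enable_keepalive(sock, idle=60, interval=30, probes=5):
    """Turn on TCP keep-alive with probe timing well under a 4-minute idle cap.

    The idle/interval/probes values are illustrative; the fine-grained options
    are Linux-specific and the constant names differ on Windows/macOS.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

An application-level heartbeat message from the server would serve the same purpose and works regardless of OS socket options, since any traffic on the flow resets the load balancer's idle timer.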
Summary
Once the TCP idle timeout of the Azure Load Balancer is exceeded, the whole IFS application gets stuck without any timeout message.
The maximum idle timeout that can be set is 30 minutes. If an operation that returns a response to the client runs for more than 30 minutes, the IFS application will hang. This is therefore a reliability concern for customers running IFS on Azure. For now, scheduling the report, or any other functional flow where that is possible, is a workable workaround. In conclusion, we suggest this is a valid point to consider when reviewing the existing architecture for future releases.