Question

ifsapp-connect pod

  • 25 March 2024


Hi, 

We had an issue with the ifsapp-connect pod constantly restarting, which caused a lot of problems in the system. For example, whenever ifsapp-connect restarted, our EDI orders were not processed. We increased the number of replicas to 2 and for now everything seems okay, but when I monitor our environment I can see that restarts still happen, only now without any impact (for example, on EDI). I have the following questions:

  • Why are these pods still restarting?
  • If a pod has restarted, should we delete it and let it come up again?
  • How can we see what is happening and what caused the restart of a particular pod?



Hi Kacper,

> Why is it caused that those pods are still restarting?

There are many reasons that a pod may restart, including an unhandled exception (i.e. an error), insufficient CPU or memory, or a failed liveness (health) probe.
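
To tell these causes apart, the container's last termination state is usually the quickest clue. Below is a minimal sketch, assuming `kubectl` is already configured for the cluster; `show_last_termination` and `explain_exit_code` are hypothetical helper names, and the exit-code notes are general Linux/Kubernetes conventions rather than IFS-specific diagnostics:

```shell
# Hypothetical helpers: source this file, then call them with your own pod/namespace.

# Print container name, restart count, and the reason/exit code of the last termination.
show_last_termination() {
  local pod="$1" ns="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Rough reading of common exit codes (128 + N means "killed by signal N").
explain_exit_code() {
  case "$1" in
    "")  echo "no previous termination recorded" ;;
    0)   echo "exited normally" ;;
    137) echo "SIGKILL (128+9): e.g. OOMKilled, or not stopping within the grace period" ;;
    143) echo "SIGTERM (128+15): e.g. stopped after a failed liveness probe or pod deletion" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error (non-zero exit code $1)"
      fi
      ;;
  esac
}
```

For example, run `show_last_termination <pod name> <ifs namespace>`, then pass any exit code it reports to `explain_exit_code`. An `OOMKilled` reason points at the memory limit, while failed probes are also listed under Events in `kubectl describe pod <pod name> -n <ifs namespace>`.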

> If POD has been restarted we should delete it and let it get up again?

Most pods are managed by a deployment, which manages the pod lifetime and replicas. When a container inside a pod crashes, Kubernetes restarts it in place and the pod's restart count increases; if a pod fails entirely and the replica count is not met, the deployment creates a replacement pod. You can optionally clean up failed pods, but it is not required. Typically you only delete a pod to force a new one to be created.
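
In kubectl terms, checking the deployment and optionally forcing a fresh pod might look like the sketch below. It assumes the deployment is named `ifsapp-connect` (as in the `kubectl logs deployments/ifsapp-connect` command in this thread); the helper names are hypothetical:

```shell
# Hypothetical helpers; assumes the deployment is named ifsapp-connect.

# Desired vs. ready replicas, then the ifsapp-connect pods with their RESTARTS column.
check_replicas() {
  local ns="$1"
  kubectl get deployment ifsapp-connect -n "$ns"
  # Keep the header line plus any ifsapp-connect pods.
  kubectl get pods -n "$ns" | grep -E '^NAME|ifsapp-connect'
}

# Optional: delete one pod; the deployment's ReplicaSet then creates a replacement.
recreate_pod() {
  local pod="$1" ns="$2"
  kubectl delete pod "$pod" -n "$ns"
}
```

`check_replicas <ifs namespace>` shows whether both replicas are ready. A RESTARTS value above zero on a pod that is still Running means its container was restarted in place, not that the pod was replaced.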

> How we can see what happening and what is the cause of restart on this certain pod?

Please check the pod logs and node description.

Node description:

kubectl describe node <node name if there are multiple>

Please pay particular attention to the CPU and memory resource sections and the Events section near the end.
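
The node description is long, so a sketch that keeps only the parts relevant to restarts may help: the Conditions block (MemoryPressure, DiskPressure, PIDPressure) and everything from Allocated resources onward, which includes the node's Events. A second helper lists the events recorded for a single pod, where failed probes and back-off restarts are typically reported. Both helper names are hypothetical, and the section headings are assumed to match the usual `kubectl describe node` layout:

```shell
# Hypothetical helpers; section names assume the standard `kubectl describe node` layout.

# Print Conditions..Addresses, then Allocated resources through the end (includes Events).
node_restart_info() {
  local node="$1"
  kubectl describe node "$node" \
    | sed -n '/^Conditions:/,/^Addresses:/p;/^Allocated resources:/,$p'
}

# Events for one pod, e.g. "Liveness probe failed" or "Back-off restarting failed container".
pod_events() {
  local pod="$1" ns="$2"
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"
}
```

Events expire after a while (one hour by default), so `pod_events <pod name> <ifs namespace>` is most useful shortly after a restart.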

Pod logs:

kubectl logs deployments/ifsapp-connect -n <ifs namespace>
kubectl logs pod/<pod name> -n <ifs namespace>

Pods may contain multiple containers. Please append "-c <container name>" if you need to focus on the logs of a particular container.
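
After a restart, `kubectl logs` shows the current container instance, which usually started cleanly; the output of the instance that actually failed is available with `--previous`. Also, with two replicas, `kubectl logs deployments/ifsapp-connect` reads from only one of the pods, so targeting the pod by name is safer. A minimal sketch with hypothetical helper names; the container name is optional, and `list_containers` prints the names you can pass to `-c`:

```shell
# Hypothetical helper: logs of the container instance that ran before the last restart.
previous_logs() {
  local pod="$1" ns="$2" container="${3:-}"
  if [ -n "$container" ]; then
    kubectl logs "pod/$pod" -n "$ns" -c "$container" --previous --timestamps
  else
    kubectl logs "pod/$pod" -n "$ns" --previous --timestamps
  fi
}

# Container names inside the pod, for use with -c.
list_containers() {
  local pod="$1" ns="$2"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[*].name}'
}
```

The `--timestamps` flag makes it easier to line up the last log lines with the restart time shown by Kubernetes.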

Best regards -- Ben


OK, I have attached logs from a restarting pod and from a pod that does not restart. Can you help me investigate them, or just point out the line that might be the reason for the restarts? I am attaching both dumps for comparison.


I used an AI tool to analyze these logs and it returned the following message:

The analysis of the logs from both Kubernetes pods shows that the restarting pod ("2_bad.txt") contains errors not present in the logs of the stable pod ("1_good.txt"). Key differences and issues detected in the "2_bad.txt" logs include:

  1. Errors related to ExecutionException and PermanentFailureException: There's an error message indicating a permanent failure during data transmission, suggesting that the pod attempted an operation that failed and cannot be automatically recovered. Specifically, the error points to a problem with ifs.fnd.connect.senders.ConnectSenderManager attempting to send data, resulting in an ExecutionException and a PermanentFailureException.
  2. ConcurrentAccessTimeoutException: Another critical error occurs when the pod tries to access a resource but cannot obtain it within the designated time. Notably, ConnectTimerPostForwarder cannot obtain a lock within 5000 milliseconds, leading to a ConcurrentAccessTimeoutException.

These errors are critical and can directly contribute to the pod's restarting, as they indicate problems that prevent the application from functioning properly. Specifically, errors related to the permanent operation failure and access time out to resources can result in pod instability, ultimately leading to its restart.

In summary, the main difference between the "1_good.txt" and "2_bad.txt" logs is the presence of critical errors in "2_bad.txt" that are not present in "1_good.txt". These errors are likely the cause of the stability issues and restarts of the pod represented by "2_bad.txt". Addressing these problems would require further analysis and possibly modifications to the configuration or application code to handle ExecutionException and ConcurrentAccessTimeoutException errors and ensure proper error handling.
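
The same comparison can also be done deterministically: extract the exception class names from each log and list the ones that occur only in the restarting pod's log. A minimal sketch using standard `grep`, `sort` and `comm`; the helper names are hypothetical, and the pattern only catches names ending in `Exception` or `Error`, so the result is a starting point for reading the stack traces, not proof of what caused the restart:

```shell
# Hypothetical helpers for comparing two pod logs.

# Unique, sorted class names ending in Exception or Error found in a log file.
exceptions_in() {
  grep -oE '[A-Za-z_][A-Za-z0-9_.$]*(Exception|Error)' "$1" | LC_ALL=C sort -u
}

# Names present in the second log (restarting pod) but not in the first (stable pod).
only_in_bad() {
  local good_list bad_list
  good_list="$(mktemp)"
  bad_list="$(mktemp)"
  exceptions_in "$1" > "$good_list"
  exceptions_in "$2" > "$bad_list"
  # -1 hides lines only in the good log, -3 hides lines common to both.
  LC_ALL=C comm -13 "$good_list" "$bad_list"
  rm -f "$good_list" "$bad_list"
}
```

For example, `only_in_bad 1_good.txt 2_bad.txt`, followed by `grep -n 'ConcurrentAccessTimeoutException' 2_bad.txt` to find where each name occurs, so its timestamps can be compared with the restart time reported by Kubernetes.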

 

To be clear, 1_good.txt is the log from the pod that is not restarting and 2_bad.txt is the log from the restarting pod. What operation in the system could cause the errors the AI pointed out in the text above?
