# help
t
Hi, me again! I'm seeing some strange behavior with Cerbos. Every now and then, a request to Cerbos gets "lost". I make a call to the API with the Python SDK and get the following error: `The read operation timed out`. I thought that increasing the `timeout_secs` param in the client would fix this issue, but it didn't. I enabled audit logs on my Cerbos container to see if there is any error on that side, but I discovered that whenever I get the `The read operation timed out` error, I don't get any logs, not even the one about `Handled Request`.
Could it be that the traffic on the Cerbos server is so high that some of the requests are lost?
But I don't think so, because right now I'm the only one making requests and I still get the error.
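For context, this is roughly how the client is set up (a minimal sketch using the Cerbos Python SDK HTTP client; the host and the principal/resource values are placeholders):

```python
from cerbos.sdk.client import CerbosClient
from cerbos.sdk.model import Principal, Resource

# Placeholder host; in our cluster the client apps call the Cerbos service directly.
with CerbosClient(host="http://cerbos:3592", timeout_secs=10) as client:
    principal = Principal(id="user-1", roles={"user"})
    resource = Resource(id="doc-1", kind="document")
    # This is the call that occasionally fails with "The read operation timed out".
    allowed = client.is_allowed("view", principal, resource)
```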
c
How is Cerbos deployed in your system? Do you know what the peak RPS is?
RPS = requests per second. Just wanted to understand how busy the Cerbos instance is.
Does this happen in production or while you're developing? Can it be a spotty WiFi connection or a slow VPN?
t
When I'm developing I have no problem. This happens in the Production and QA envs.
I have received reports from users with the same issue.
c
OK. How is Cerbos deployed to production? Is it running on a VM, Kubernetes cluster or some other mechanism?
t
It's running in a Kubernetes cluster
c
Cool. Just one pod or multiple pods? Accessed through a service or directly?
t
Right now we have just one pod, but it's configured to scale up to two replicas
c
Is the pod flapping because of a failed health check or because it's exceeding resource limits?
t
Hmm, I don't think it's flapping, and I haven't seen it exceed any resource limits
c
Kubernetes would kill and restart the pod if it exceeds its resource limits. You should be able to see that by looking at the deployment/statefulset object
The other thing to check is whether you have a service mesh or a network overlay that could be identifying the pod as unavailable intermittently.
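If it helps, a quick check like this with the official Kubernetes Python client will show restart counts and recent events for the Cerbos pods (the namespace and label selector here are just guesses, adjust them to your setup):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

namespace = "default"  # example namespace; use the one Cerbos runs in
pods = v1.list_namespaced_pod(namespace, label_selector="app.kubernetes.io/name=cerbos")

for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        print(pod.metadata.name, cs.name, "restarts:", cs.restart_count)

# Recent events (OOM kills, failed probes, etc.) would show up here too.
for ev in v1.list_namespaced_event(namespace).items:
    print(ev.last_timestamp, ev.reason, ev.message)
```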
t
Ok, thanks, let me check
I found that the pod has been up for the last 18 hours, which is when we made the last deploy
c
No restarts, no healthcheck failures?
t
Yes, no events
c
How about the service?
And the deployment?
This is a bit hard to debug like this. Based on the error message `The read operation timed out`, it sounds to me like there's something interrupting the connection. Could be a proxy, a load balancer, a network blip or any number of things 😞
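One way to narrow it down is to run a crude probe from inside the cluster and see when the blips happen. Just a sketch with `requests`; the URL assumes the default Cerbos HTTP port and health endpoint, so adjust it to your service address:

```python
import time
import requests

# Assumed in-cluster address for the Cerbos HTTP port; replace with your service URL.
URL = "http://cerbos:3592/_cerbos/health"

while True:
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        elapsed = time.monotonic() - start
        if elapsed > 1.0:  # log anything suspiciously slow
            print(f"{time.strftime('%H:%M:%S')} slow response: {elapsed:.2f}s status={resp.status_code}")
    except requests.RequestException as exc:
        print(f"{time.strftime('%H:%M:%S')} request failed: {exc}")
    time.sleep(1)
```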
t
@Salvador any ideas?
s
Let me review the service and deployment in the k8s cluster
Everything looks good: no restarts in the pods, and the services and deployments are ok
c
Interesting... Are the applications accessing Cerbos running in the same cluster as well?
s
Yeah, the pods and services are in the same cluster and namespace
c
I mean the client applications. They are on the same cluster as Cerbos?
s
Yes, they are too
c
Do you collect Cerbos metrics by any chance?
Or even better, do you have trace collection?
s
We have a Datadog agent to collect logs 🤔 is that what you mean?
c
No. Cerbos has a metrics endpoint that can be scraped by Prometheus. We also support distributed traces with Jaeger or OTLP.
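The metrics endpoint serves plain Prometheus text over HTTP, so even without a Prometheus server you can peek at it. A quick sketch (the `/_cerbos/metrics` path is the default on the HTTP port as far as I remember, so double-check it against your config):

```python
import requests

# Default metrics path on the Cerbos HTTP port (confirm against your server config).
resp = requests.get("http://cerbos:3592/_cerbos/metrics", timeout=5)
resp.raise_for_status()

# Print request-related metric lines to get a feel for load and latency.
for line in resp.text.splitlines():
    if "request" in line and not line.startswith("#"):
        print(line)
```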
s
Oh no, that’s not enabled or used
c
Those would have helped us see what was happening in the server during these blips. It doesn't sound like you're making many thousands of requests/second to Cerbos anyway. So I think it's quite unlikely that the server is too busy to handle requests. But, just to rule that out for sure, could you maybe scale up the deployment to 2 pods and see if that makes a difference?
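If you want to script it, something along these lines should do (a sketch with the Kubernetes Python client; the deployment name and namespace are assumptions, and plain `kubectl scale` works just as well):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Deployment name and namespace are assumptions; adjust to your cluster.
apps.patch_namespaced_deployment_scale(
    name="cerbos",
    namespace="default",
    body={"spec": {"replicas": 2}},
)
```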
s
Nice, I’ll scale up the deployment