# help
t
Hi, me again! I'm seeing some strange behavior with Cerbos. Every now and then, a request to Cerbos gets "lost". I make a call to the API with the Python SDK and get the following error: `The read operation timed out`. I thought that increasing the `timeout_secs` param in the client would fix this issue, but it didn't. I enabled audit logs on my Cerbos container to see if there is any error on that side, but I discovered that whenever I get the `The read operation timed out` error, I don't get any logs, not even the one about `Handled Request`.
Could it be that the traffic on the Cerbos server is so high that some of the requests are lost?
But I don't think so, because right now I'm the only one making requests and I still get the error.
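For context, this is roughly how the client is set up (a minimal sketch using the Cerbos Python SDK HTTP client; the host and the principal/resource values are placeholders):

```python
from cerbos.sdk.client import CerbosClient
from cerbos.sdk.model import Principal, Resource

# Placeholder host; in our cluster the client apps call the Cerbos service directly.
with CerbosClient(host="http://cerbos:3592", timeout_secs=10) as client:
    principal = Principal(id="user-1", roles={"user"})
    resource = Resource(id="doc-1", kind="document")
    # This is the call that occasionally fails with "The read operation timed out".
    allowed = client.is_allowed("view", principal, resource)
```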
c
How is Cerbos deployed in your system? Do you know what the peak RPS is?
RPS = requests per second. Just wanted to understand how busy the Cerbos instance is.
Does this happen in production or while you're developing? Can it be a spotty WiFi connection or a slow VPN?
t
When I'm developing I have no problem. This happens in the Production and QA envs.
I have received reports from users with the same issue.
c
OK. How is Cerbos deployed to production? Is it running on a VM, Kubernetes cluster or some other mechanism?
t
It's running in a Kubernetes cluster
c
Cool. Just one pod or multiple pods? Accessed through a service or directly?
t
Right now we have just one pod, but it's configured to scale up to two replicas
c
Is the pod flapping because of a failed health check or because it's exceeding resource limits?
t
Hmm, I don't think it's flapping, and I haven't seen it exceed any resource limits
c
Kubernetes would kill and restart the pod if it exceeds its resource limits. You should be able to see that by looking at the deployment/statefulset object
The other thing to check is whether you have a service mesh or a network overlay that could be identifying the pod as unavailable intermittently.
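If it helps, a quick check like this with the official Kubernetes Python client will show restart counts and recent events for the Cerbos pods (the namespace and label selector here are just guesses, adjust them to your setup):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

namespace = "default"  # example namespace; use the one Cerbos runs in
pods = v1.list_namespaced_pod(namespace, label_selector="app.kubernetes.io/name=cerbos")

for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        print(pod.metadata.name, cs.name, "restarts:", cs.restart_count)

# Recent events (OOM kills, failed probes, etc.) would show up here too.
for ev in v1.list_namespaced_event(namespace).items:
    print(ev.last_timestamp, ev.reason, ev.message)
```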
t
Ok, thanks, let me check
I found that the pod has been up for the last 18 hours, which is when we made the last deploy
c
No restarts, no healthcheck failures?
t
Yes, no events
c
How about the service?
And the deployment?
This is a bit hard to debug like this. Based on the error message `The read operation timed out`, it sounds to me like there's something interrupting the connection. Could be a proxy, a load balancer, a network blip or any number of things 😞
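One way to narrow it down is to run a crude probe from inside the cluster and see when the blips happen. Just a sketch with `requests`; the URL assumes the default Cerbos HTTP port and health endpoint, so adjust it to your service address:

```python
import time
import requests

# Assumed in-cluster address for the Cerbos HTTP port; replace with your service URL.
URL = "http://cerbos:3592/_cerbos/health"

while True:
    start = time.monotonic()
    try:
        resp = requests.get(URL, timeout=5)
        elapsed = time.monotonic() - start
        if elapsed > 1.0:  # log anything suspiciously slow
            print(f"{time.strftime('%H:%M:%S')} slow response: {elapsed:.2f}s status={resp.status_code}")
    except requests.RequestException as exc:
        print(f"{time.strftime('%H:%M:%S')} request failed: {exc}")
    time.sleep(1)
```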
t
@Salvador any ideas?
s
Let me review the service and deployment in the k8s cluster
Everything looks good: no restarts in the pods, and the services and deployments are ok
c
Interesting... Are the applications accessing Cerbos running in the same cluster as well?
s
Yeah, the pods and services are in the same cluster and namespace
c
I mean the client applications. They are on the same cluster as Cerbos?
s
Yes, they are too
c
Do you collect Cerbos metrics by any chance?
Or even better, do you have trace collection?
s
We have a Datadog agent to collect logs 🤔 is that what you mean?
c
No. Cerbos has a metrics endpoint that can be scraped by Prometheus. We also support distributed traces with Jaeger or OTLP.
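The metrics endpoint serves plain Prometheus text over HTTP, so even without a Prometheus server you can peek at it. A quick sketch (the `/_cerbos/metrics` path is the default on the HTTP port as far as I remember, so double-check it against your config):

```python
import requests

# Default metrics path on the Cerbos HTTP port (confirm against your server config).
resp = requests.get("http://cerbos:3592/_cerbos/metrics", timeout=5)
resp.raise_for_status()

# Print request-related metric lines to get a feel for load and latency.
for line in resp.text.splitlines():
    if "request" in line and not line.startswith("#"):
        print(line)
```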
s
Oh no, that’s not enabled or used
c
Those would have helped us see what was happening in the server during these blips. It doesn't sound like you're making many thousands of requests/second to Cerbos anyway. So I think it's quite unlikely that the server is too busy to handle requests. But, just to rule that out for sure, could you maybe scale up the deployment to 2 pods and see if that makes a difference?
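If you want to script it, something along these lines should do (a sketch with the Kubernetes Python client; the deployment name and namespace are assumptions, and plain `kubectl scale` works just as well):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Deployment name and namespace are assumptions; adjust to your cluster.
apps.patch_namespaced_deployment_scale(
    name="cerbos",
    namespace="default",
    body={"spec": {"replicas": 2}},
)
```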
s
Nice, I’ll scale up the deployment