# help
j
Is there some kind of caching in Cerbos? I changed my resource policy versions from "abc" to "def", and I find that when I call Cerbos check_resources, I can still get EFFECT.ALLOW for policy version "abc".
If I restart the Cerbos pods, the problem goes away.
c
Yes, policies are cached, but "abc" should have been evicted when you changed it. Which store are you using?
j
I am using the git driver; the backend is GitLab.
Cerbos version = 0.24.0.
I have 3 pods running. If I kill just 1 pod and let it restart, I find that some of my calls to Cerbos return EFFECT.ALLOW and some return EFFECT.DENY (no changes in code, just repeated runs of the same code).
e
How long is your updatePollInterval set to?
When you update the policy and restart only one pod, that pod will automatically start with the latest policy. However, the pods that stay alive will only update when the polling interval expires.
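For reference, the poll interval is set on the git storage driver in the Cerbos configuration. A minimal sketch, with the repository URL, branch, and checkout directory as placeholders:

```yaml
# Sketch of a Cerbos config using the git storage driver (values are illustrative).
storage:
  driver: git
  git:
    protocol: https
    url: https://gitlab.com/example/cerbos-policies.git  # placeholder repository
    branch: main
    checkoutDir: /policies
    updatePollInterval: 60s  # how often each pod polls the repo for policy changes
```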
j
1 minute.
e
Do the pods start responding consistently 1 minute after you make the changes to the policy?
j
Let me test that. I suspect the worst-case scenario is 2 minutes minus a few seconds.
e
We have users running in production with a 10-second update interval with no issues.
j
OK, I will test that.
I've just rerun the test with no changes. I'm still getting inconsistent results, and this is more than 10 minutes later.
Let me check the logs for the pods.
Interesting. The pod logs show a consistent result of EFFECT.DENY, but I'm getting different results in the Cloud Shell where I run the code. My only conclusion is that Google's Cloud Shell has some kind of caching mechanism in place.
c
Do you have a proxy or load balancer in front of the Cerbos service? Maybe a service mesh?
j
Yes. Istio.
I need to check there as well.
c
Do you have a traffic split that's potentially sending some of your requests to a different service?
j
No. It's a single GKE cluster with only Cerbos in it. Plus, I checked the output from the Python SDK and it's properly formatted as a Cerbos response. No errors from the SDK.
I can't find any cache config in the Istio virtual services and gateways.
I'll try the same Python code later outside of Google Cloud Shell, from a local laptop.
c
Hey, if you edit an existing policy file and change its version in place, the old version will still remain in the compile cache until either the store is reloaded using the Admin API or Cerbos itself is restarted. We hadn't anticipated that people would change the policy identifiers while the system is live. We'll put out a fix for that soon. Is your script doing something like that? If so, that could explain why you're getting inconsistent results. The instances where the old policy is cached will return results as if the policy still exists, while instances where it is not cached will simply return a DENY.
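To make the scenario concrete, here is a sketch of the kind of in-place edit being described; the resource kind, rule, and roles are illustrative:

```yaml
# Editing the version of an existing policy file in place, e.g. from "abc" to "def".
# Until the store is reloaded or Cerbos restarts, a pod that compiled the old file
# may still answer requests for version "abc" from its compile cache.
apiVersion: api.cerbos.dev/v1
resourcePolicy:
  resource: cspm        # illustrative resource kind
  version: "def"        # previously "abc", changed in place
  rules:
    - actions: ["view"]
      effect: EFFECT_ALLOW
      roles: ["user"]
```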
j
It's been 12 hours and I'm still getting the same problem. I just submitted 3 API calls to Cerbos. Two of them returned EFFECT.ALLOW for resource ID "cspm123"; one returned EFFECT.DENY.
Based on the pod logs, I can see that Cerbos IS evaluating the policies wrongly. The return values I get (EFFECT.ALLOW and EFFECT.DENY) in the logs are consistent with what my Python code is getting.
The strangest part of this is that in my resource policies, the resource kind is actually "cspm". There is no "cspm123" (there was; I defined it previously but have since changed it).
I suspect what I am experiencing is the behaviour you mentioned, Charith. I do remember modifying the policy versions in place (switching from the "development" version to the "production" version for testing).
c
Yeah, I think that's probably what's happening. If you have the Admin API enabled, you can force each pod to reload itself using cerbosctl store reload (https://docs.cerbos.dev/cerbos/latest/cli/cerbosctl.html#reload), or just roll the pods manually, and I think the issue will go away. We have a fix in progress. Will get it out soon.
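For completeness, a sketch of the server configuration that needs to be in place before cerbosctl store reload can be used; the credentials shown are placeholders, not real values:

```yaml
# Sketch: enabling the Cerbos Admin API so cerbosctl can trigger a store reload.
server:
  adminAPI:
    enabled: true
    adminCredentials:
      username: cerbos
      passwordHash: JDJ5JDEw...  # placeholder: base64-encoded bcrypt hash of the admin password
```

With that enabled, cerbosctl store reload can be pointed at each pod's address (or the pods can simply be rolled) so every instance drops its compile cache and picks up the current policies.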