Hi All, I'm experiencing a fairly slow startup of ...
# help
Hi All, I'm experiencing a fairly slow startup of cerbos docker containers. It takes about 60 seconds to transition from (health: starting) to {healthy). Is this normal?
No, it is not. Could you let me know if you are using the default health-check setup? If not, please share its configuration.
Copy code
  httpListenAddr: ":3592"
  grpcListenAddr: ":3593"

  cors: # CORS defines the CORS configuration for the server.
    allowedHeaders: [ 'content-type', 'user-agent' ] # AllowedHeaders is the contents of the allowed-headers header.
    allowedOrigins: [ '*' ] # AllowedOrigins is the contents of the allowed-origins header.
    disabled: false # Disabled sets whether CORS is disabled.
    maxAge: 10s # MaxAge is the max age of the CORS preflight check.

  defaultPolicyVersion: "default"

    disableVerification: true

  driver: "disk"
    directory: "/policies"
    watchForChanges: true

  disabled: true
container is started with docker compose:
Copy code
    image: ghcr.io/cerbos/cerbos:latest
    command: "server --config=/conf/config.yaml"
      - "3592:3592"
      - "3593:3593"
      - ../cerbos/conf:/conf
      - ../cerbos/policies:/policies
      - xxxx_network
    restart: unless-stopped
      - "traefik.enable=true"
      - "traefik.http.routers.cerbos.rule=Host(`xxx.xxx.com`)"
      - "traefik.http.routers.cerbos.entrypoints=websecure"
      - "traefik.http.routers.cerbos.tls.certresolver=myresolver"
      - "traefik.http.services.cerbos.loadbalancer.server.port=3592"
here is a log snippet:
Copy code
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.661Z","log.logger":"cerbos.server","message":"Shutting down"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.661Z","log.logger":"cerbos.dir.watch","message":"Stopped watching directory for changes","dir":"/policies"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.661Z","log.logger":"cerbos.http","message":"HTTP server stopped"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.661Z","log.logger":"cerbos.grpc","message":"gRPC server stopped"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.662Z","log.logger":"cerbos.server","message":"Shutdown complete"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:08.662Z","log.logger":"cerbos.server","message":"maxprocs: No GOMAXPROCS change to reset"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.030Z","log.logger":"cerbos.server","message":"maxprocs: Leaving GOMAXPROCS=20: CPU quota undefined"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.030Z","log.logger":"cerbos.server","message":"Loading configuration from /conf/config.yaml"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.032Z","log.logger":"cerbos.disk.store","message":"Initializing disk store from /policies"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.107Z","log.logger":"cerbos.index","message":"Found 50 executable policies"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.108Z","log.logger":"cerbos.telemetry","message":"Telemetry disabled"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.108Z","log.logger":"cerbos.dir.watch","message":"Watching directory for changes","dir":"/policies"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.108Z","log.logger":"cerbos.grpc","message":"Starting gRPC server at :3593"}
{"log.level":"info","@timestamp":"2024-02-06T23:52:22.109Z","log.logger":"cerbos.http","message":"Starting HTTP server at :3592"}
{"log.level":"info","@timestamp":"2024-02-06T23:53:38.309Z","log.logger":"cerbos.grpc","message":"Handled request","protocol":"grpc","grpc.component":"server","grpc.service":"cerbos.svc.v1.CerbosService","grpc.method":"CheckResources","grpc.method_type":"unary","cerbos":{"call_id":"01HP0DEV85Y1PPYA3Y74EWVTP1"},"http":{"x_forwarded_for":[""]
so it took a little bit over a minute to accept a first connection after a restart of container
and it's not traefik cause there are other services going through traefik that are up instantly
So, the PDP served the check request as soon as it got healthy. Right?
That’s interesting. The config looks pretty standard.
one thing: other services in the compose file depend on cerbos, traefik and cerbos are the only ones configured without dependencies
not sure if this plays a role...
if you remove Cerbos as a dependency for other services, will those get ready quickly?
The idea is there might be a hidden dependency for all services, but Cerbos is the first one to require it, so it waits. Once the dependency is ready, everyone starts quickly.
so every other container starts VERY quickly. There are 7 containers defined, all of them start within 2-5 seconds
but, yes, let me try removing dependency definitions...
did not help
Screenshot 2024-02-06 at 4.20.35 PM.png
So you removed dependency on cerbos for other services and they still start quickly. Cerbos should also get ready quickly, but let’s wait to hear what others say about the delay. The rest of our team will be online in a few hours.
I get same behavior when using docker run:
Copy code
docker run --name cerbos-test -d -v ./conf:/conf -v ./policies:/policies -p 3592:3592 -p 3593:3593 cerbos/cerbos:latest server --config=/conf/config.yaml
and result is:
Copy code
419776b826b7   cerbos/cerbos:latest   "/cerbos server --co…"   37 seconds ago   Up 37 seconds (health: starting)>3592-3593/tcp   cerbos-test
so it has to be my conf or my policies? (I used this on multiple hosts and they all exhibit same behavior so not sure if host factors in)
Did you try it with some basic policies?
I only have 50 policies and they are all super basic. But, no, have not tried it with any sample policies
50 policies is a small number. My next guess is that Cerbos has a bug revealed by a particular expression in your policies. You can get a basic policy here.
same issue with the demo policies. Command:
Copy code
docker run --rm --name cerbos-test -d -v ./conf:/conf -v ./test_policies:/policies -p 3592:3592 -p 3593:3593 cerbos/cerbos:latest server --config=/conf/config.yaml
test_policies only has the two yaml files from github
I tried attaching to the container to see if there are any additional clues but bash/sh is not available:
Copy code
docker exec -it cerbos-test /bin/sh
Do you have any other means of attaching to the container?
There’s no shell in there. You can only run cerbos itself. For example, a health-check.
I also see it takes 60 seconds for the Cerbos container to start. Looks like a timeout.
Hey @Rafael Szuminski, this is probably due to the default Docker healthcheck interval defined in the Cerbos container which has an interval of 60s. So, it'd take 60s for Docker to run the first healthcheck and mark the container as healthy. You can override the healthcheck interval in the compose file like this:
Copy code
    image: ghcr.io/cerbos/cerbos:latest
      test: ["CMD", "/cerbos", "healthcheck"]
      interval: 5s
Cerbos itself starts almost instantly. You can confirm this with
curl localhost:3592/_cerbos/health
. It's the Docker Compose orchestrator that's adding the extra delay until all healthchecks are accounted.
setting the healthcheck interval as above solves the issue. Thanks you for your help!
🙇 1