After upgrading dapr from 1.9.5 to 1.11.2, I ran into a problem. The corresponding pod log:
Container image "docker.io/daprio/daprd:1.11.2" already present on machine
Created container daprd
Started container daprd
Readiness probe failed: HTTP probe failed with statuscode: 500
Back-off restarting failed container daprd in pod i-api-2.1.465-7f998f6798-xtpql_production(63ba2bda-8659-4c4d-87e2-fc2ff2c63c22
How can this be solved?
Only pods that have dapr sub (subscriptions) enabled hit this problem
dapr.io/app-port: "80"
Running the command below
kubectl logs i-api-2.1.465-6b8ff4d6bd-6d479 -c daprd
turned up these error logs
time="2023-08-06T13:45:30.657281397Z" level=error msg="rabbitmq pub/sub: reset: channel.Close() failed: Exception (504) Reason: \"channel/connection is not open\"" app_id=i-api component="pubsub (pubsub.rabbitmq/v1)" instance=i-api-2.1.465-6b8ff4d6bd-6d479 scope=dapr.contrib type=log ver=1.11.2
time="2023-08-06T13:45:30.657844894Z" level=error msg="rabbitmq pub/sub: error in subscriber for i-api-snapshot.SnapshotFinishedIntegrationEvent in ensureSubscription: channel not initialized" app_id=i-api component="pubsub (pubsub.rabbitmq/v1)" instance=i-api-2.1.465-6b8ff4d6bd-6d479 scope=dapr.contrib type=log ver=1.11.2
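When the daprd log is long, it can help to filter it down to error-level entries and the component that emitted them. The following is a minimal sketch, assuming the key=value line format shown in the two log lines above; `parse_log_line` and `error_messages` are hypothetical helper names, not part of dapr.

```python
import re

# key=value pairs; a value is either double-quoted (with \" escapes) or a bare token.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_log_line(line: str) -> dict:
    """Split one daprd log line into a dict of its key=value fields."""
    fields = {}
    for key, raw in _PAIR.findall(line):
        if raw.startswith('"'):
            # Drop the surrounding quotes and unescape embedded quotes.
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def error_messages(lines):
    """Return (component, msg) for every line logged at level=error."""
    found = []
    for line in lines:
        fields = parse_log_line(line)
        if fields.get("level") == "error":
            found.append((fields.get("component", ""), fields.get("msg", "")))
    return found
```

Feeding the lines printed by the kubectl logs command above into `error_messages` would leave only the rabbitmq pub/sub errors and drop debug-level entries.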
After going into the rabbitmq container and deleting this queue, the errors above went away
rabbitmqctl delete_queue i-api-snapshot.SnapshotFinishedIntegrationEvent
But the problem remained. The log showed the error below; the 500 error is most likely caused by ERR_HEALTH_NOT_READY
level=debug msg="{ERR_HEALTH_NOT_READY dapr is not ready}" app_id=i-api instance=i-api-2.1.465-6f87869b68-6mhl4 scope=dapr.runtime.http type=log ver=1.11.2
The corresponding dapr source code: api_healthz.go#L43
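To see what the readiness probe sees, the sidecar's health endpoint can be queried directly. This is a rough sketch, assuming the sidecar HTTP API listens on http://localhost:3500 (the default port, which may differ in your deployment) and that a failing response carries a JSON body with an `errorCode` field; `probe` and `describe_health` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def describe_health(status: int, body: str) -> str:
    """Turn a /v1.0/healthz response into a short human-readable verdict."""
    if status == 204:
        return "ready"
    error_code = ""
    try:
        payload = json.loads(body)
        if isinstance(payload, dict):
            error_code = str(payload.get("errorCode", ""))
    except ValueError:
        pass  # body was empty or not JSON; report the status code only
    return f"not ready: HTTP {status} {error_code}".rstrip()


def probe(base_url: str = "http://localhost:3500", timeout: float = 2.0):
    """GET the health endpoint and return (status, body).

    The base URL is an assumption. A connection failure raises
    urllib.error.URLError, which is left to the caller.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1.0/healthz", timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # Non-2xx responses such as the 500 from the probe end up here.
        return err.code, err.read().decode()
```

Calling `describe_health(*probe())` from inside the pod (or through a port-forward) should show whether the sidecar still reports ERR_HEALTH_NOT_READY after a change.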
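Finally solved! The problem is related to daprd 1.11.2 being unable to connect, at startup, to the rabbitmq that already existed in the k8s cluster.
In the end I went with a "reinstall the OS" style fix: I deployed a new rabbitmq in a different namespace, switched pubsub.rabbitmq over to this new rabbitmq, and restarted the application pods. That solved the problem.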
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: pubsub
  namespace: production
spec:
  type: pubsub.rabbitmq
  version: v1
  metadata:
  - name: host
    value: "amqp://username:password@rabbitmq.dapr-system.svc.cluster.local:5672"
  - name: durable
    value: true
  - name: deletedWhenUnused
    value: false
Note: the dapr-system in rabbitmq.dapr-system.svc.cluster.local is the namespace name.
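Since the root cause was daprd failing to reach rabbitmq at startup, a quick reachability check against the `host` value can rule out DNS or network problems before restarting pods. The sketch below is only an illustration: `amqp_endpoint` and `can_connect` are hypothetical helpers, and a successful TCP connection says nothing about AMQP credentials, vhosts, or queue state.

```python
import socket
from urllib.parse import urlparse

# Standard AMQP ports, used when the URL does not specify one.
DEFAULT_PORTS = {"amqp": 5672, "amqps": 5671}


def amqp_endpoint(url: str):
    """Extract (host, port) from an amqp:// or amqps:// connection URL."""
    parsed = urlparse(url)
    if parsed.scheme not in DEFAULT_PORTS or not parsed.hostname:
        # Do not echo the URL here: it may contain a password.
        raise ValueError("expected an amqp:// or amqps:// URL with a host")
    return parsed.hostname, parsed.port or DEFAULT_PORTS[parsed.scheme]


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Run from a pod in the application's namespace, `can_connect(*amqp_endpoint(host_value))` with the component's `host` value indicates whether the service name resolves and the port is open from there.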
What helped most in solving this was a comment on the GitHub issue "daprd startup logs are not useful for debugging":
When the kafka pubsub component can't start a session with kafka because the server is in error, it fails silently and causes the sidecar to fail readiness with no logging.
Related link: https://stackoverflow.com/q/76362984
– dudu, 1 year ago
Related link: https://github.com/dapr/dapr/issues/6745
– dudu, 1 year ago
Important clue: https://github.com/dapr/dapr/issues/1688
– dudu, 1 year ago
Related link: https://github.com/dapr/components-contrib/issues/1371
– dudu, 1 year ago
Related link: https://github.com/dapr/dapr/issues/3531
– dudu, 1 year ago