Ceph RBD Problem Analysis
A Pod that uses a Ceph RBD PV with the ReadWriteOnce access mode fails to start after being "migrated" to another node; it gets stuck in the ContainerCreating state.
The PVC and Deployment are defined as follows:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/log/nginx
          name: log-vol
      volumes:
      - name: log-vol
        persistentVolumeClaim:
          claimName: pvc-test
When the node hosting the nginx Pod fails (its kubelet goes down):
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
mec51 Ready control-plane,worker 103d v1.27.2
mec52 NotReady control-plane,worker 103d v1.27.2
mec53 Ready control-plane,worker 103d v1.27.2
$ kubectl get po -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-5ccff8b49c-6w5p7 0/1 ContainerCreating 0 23s <none> mec51 <none> <none>
nginx-5ccff8b49c-pc2z4 1/1 Terminating 0 7m15s 172.10.0.2 mec52 <none> <none>
Because the PVC's access mode is ReadWriteOnce, the attachdetach-controller inside kube-controller-manager will not attach the PV to a new node until it has been detached from the original node.
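On the new Pod this shows up as a FailedAttachVolume event from the attachdetach-controller. The exact wording depends on the Kubernetes version, but it is roughly:
$ kubectl describe po nginx-5ccff8b49c-6w5p7
...
Events:
  Warning  FailedAttachVolume  attachdetach-controller  Multi-Attach error for volume "pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93"
                                                        Volume is already used by pod(s) nginx-5ccff8b49c-pc2z4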
The binding between a PV and a node is recorded in a VolumeAttachment object, so delete the corresponding VolumeAttachment:
$ kubectl get volumeattachments | grep pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93
csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c rook-ceph.rbd.csi.ceph.com pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93 mec52 true 9m41s
$ kubectl delete volumeattachments csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c
Although the VolumeAttachment is recreated by kube-controller-manager and bound to node mec51 along with the new Pod nginx-5ccff8b49c-6w5p7, the Ceph RBD CSI node plugin still refuses to attach the RBD image behind the PV to the new node:
# node mec51
$ journalctl -u kubelet -f
Feb 01 14:26:24 mec51 kubelet[4010]: E0201 14:26:24.206515 4010 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 podName: nodeName:}" failed. No retries permitted until 2024-02-01 14:26:40.206480124 +0800 CST m=+7759.293368765 (durationBeforeRetry 16s). Error: MountVolume.MountDevice failed for volume "pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998") pod "nginx-5ccff8b49c-6w5p7" (UID: "e85b2bad-d4d4-4739-b187-25d0a11008d3") : rpc error: code = Internal desc = rbd image mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 is still being used
This is the kubelet's NodeStageVolume RPC call to the CSI node plugin failing.
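The same error also appears in the CSI node plugin's own logs. Assuming a Rook deployment where the node-plugin DaemonSet Pods carry the app=csi-rbdplugin label and the driver container is named csi-rbdplugin, something like:
$ kubectl logs -n rook-ceph $(kubectl get po -n rook-ceph -l app=csi-rbdplugin --field-selector spec.nodeName=mec51 -o name) -c csi-rbdplugin | grep "still being used"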
Looking at the Ceph RBD CSI node plugin source code, the call chain is NodeStageVolume -> stageTransaction -> attachRBDImage -> waitForrbdImage:
func waitForrbdImage(ctx context.Context, backoff wait.Backoff, volOptions *rbdVolume) error {
    imagePath := volOptions.String()

    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        used, err := volOptions.isInUse()
        if err != nil {
            return false, fmt.Errorf("fail to check rbd image status: (%w)", err)
        }
        if (volOptions.DisableInUseChecks) && (used) {
            log.UsefulLog(ctx, "valid multi-node attach requested, ignoring watcher in-use result")
            return used, nil
        }
        return !used, nil
    })
    // return error if rbd image has not become available for the specified timeout
    if errors.Is(err, wait.ErrWaitTimeout) {
        return fmt.Errorf("rbd image %s is still being used", imagePath)
    }
    // return error if any other errors were encountered during waiting for the image to become available
    return err
}
The Ceph RBD CSI node plugin first checks whether the RBD image is currently in use:
// isInUse checks if there is a watcher on the image. It returns true if there
// is a watcher on the image, otherwise returns false.
func (ri *rbdImage) isInUse() (bool, error) {
    image, err := ri.open()
    if err != nil {
        if errors.Is(err, ErrImageNotFound) || errors.Is(err, util.ErrPoolNotFound) {
            return false, err
        }
        // any error should assume something else is using the image
        return true, err
    }
    defer image.Close()

    watchers, err := image.ListWatchers()
    if err != nil {
        return false, err
    }

    mirrorInfo, err := image.GetMirrorImageInfo()
    if err != nil {
        return false, err
    }
    ri.Primary = mirrorInfo.Primary

    // because we opened the image, there is at least one watcher
    defaultWatchers := 1
    if ri.Primary {
        // if rbd mirror daemon is running, a watcher will be added by the rbd
        // mirror daemon for mirrored images.
        defaultWatchers++
    }

    return len(watchers) > defaultWatchers, nil
}
Whether the image is in use (i.e. attached to a node) is decided by counting the watchers currently registered on the RBD image.
$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers:
watcher=192.168.73.52:0/319745466 client.2474196 cookie=18446462598732840999
192.168.73.52 is the IP of node mec52. That node is not completely down; only its kubelet has stopped, so the kernel RBD client on mec52 still holds a watcher on the image.
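If you can still SSH into mec52, you can confirm that the kernel RBD client keeps the image mapped even though the kubelet is down (the output below is illustrative; the device id may differ):
# on node mec52
$ rbd showmapped
id  pool          namespace  image                                         snap  device
0   mec-ecs-pool             csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998  -     /dev/rbd0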
Manually add node mec52's IP to the Ceph OSD blacklist:
$ ceph osd blacklist add 192.168.73.52
blocklisting 192.168.73.52:0/319745466 until 2024-02-01T07:57:05.125665+0000 (3600 sec)
$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers: none
After a short while the node plugin successfully attaches the RBD image to node mec51, and the new Pod becomes Running:
$ kubectl get po
NAME READY STATUS RESTARTS AGE
nginx-5ccff8b49c-6w5p7 1/1 Running 0 45m
Afterwards, clear the OSD blacklist:
$ ceph osd blacklist clear
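Instead of clearing the whole blacklist, a single entry can also be removed; depending on the Ceph version you may need to pass the exact address:nonce shown by ceph osd blacklist ls:
$ ceph osd blacklist rm 192.168.73.52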
Next, let's look at how to implement the blacklisting operation above using CSI-Addons and NetworkFence.
CSI-Addons
First, confirm that your CSI driver implements the CSI-Addons specification!
As the name suggests, CSI-Addons extends and enhances the capabilities that CSI already provides.
.------.  CR   .------------.
| User |-------| CSI-Addons |
'------'       | Controller |
               '------------'
                     |
                     | gRPC
                     |
           .---------+------------------------------.
           |         |                              |
           |  .------------.        .------------.  |
           |  | CSI-Addons |  gRPC  | CSI        |  |
           |  |  sidecar   |--------| Controller |  |
           |  '------------'        | NodePlugin |  |
           |                        '------------'  |
           |             CSI-driver Pod             |
           '----------------------------------------'
Just like the official Kubernetes CSI sidecars, CSI-Addons requires an extra sidecar container inside the CSI driver Pods:
# The csi-rbdplugin-provisioner Pods previously had 5 containers each
$ kubectl get po -n rook-ceph -l "app=csi-rbdplugin-provisioner"
NAME                                         READY   STATUS    RESTARTS   AGE
csi-rbdplugin-provisioner-77fbb96487-hg8jz   6/6     Running   0          2d19h
csi-rbdplugin-provisioner-77fbb96487-w2zpj   6/6     Running   0          2d19h
# The csi-rbdplugin Pods previously had 2 containers each
$ kubectl get po -n rook-ceph -l "app=csi-rbdplugin"
NAME                  READY   STATUS    RESTARTS   AGE
csi-rbdplugin-62h5f   3/3     Running   0          2d19h
csi-rbdplugin-qb7g9   3/3     Running   0          2d19h
csi-rbdplugin-tsphv   3/3     Running   0          2d19h
If you deploy the Ceph CSI driver with Rook, set CSI_ENABLE_CSIADDONS: true in the rook-ceph-operator configuration, or enable it directly at Helm install time.
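For reference, this is roughly what the setting looks like in the rook-ceph-operator-config ConfigMap (the ConfigMap name and namespace assume a default Rook install; adjust them to your deployment):
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-ceph
data:
  CSI_ENABLE_CSIADDONS: "true"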
The CSI-Addons controller still needs to be deployed separately:
$ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/crds.yaml
$ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/rbac.yaml
# Note: pin the controller image to v0.7.0
$ curl -s https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/setup-controller.yaml | sed 's/k8s-controller:latest/k8s-controller:v0.7.0/g' | kubectl create -f -
The CSIAddonsNode objects are created automatically by the CSI-Addons sidecar in the csi-rbdplugin Pods; there is no need to create them by hand:
$ kubectl get CSIAddonsNode -A
NAMESPACE NAME NAMESPACE AGE DRIVERNAME ENDPOINT NODEID
rook-ceph csi-rbdplugin-62h5f rook-ceph 32m rook-ceph.rbd.csi.ceph.com 172.10.0.73:9070 mec52
rook-ceph csi-rbdplugin-provisioner-77fbb96487-hg8jz rook-ceph 32m rook-ceph.rbd.csi.ceph.com 172.10.0.40:9070 mec51
rook-ceph csi-rbdplugin-provisioner-77fbb96487-w2zpj rook-ceph 32m rook-ceph.rbd.csi.ceph.com 172.10.0.46:9070 mec52
rook-ceph csi-rbdplugin-qb7g9 rook-ceph 32m rook-ceph.rbd.csi.ceph.com 172.10.0.69:9070 mec51
rook-ceph csi-rbdplugin-tsphv rook-ceph 32m rook-ceph.rbd.csi.ceph.com 172.10.0.50:9070 mec53
The CSI-Addons controller uses the Endpoint recorded in each CSIAddonsNode object to talk to the sidecar inside the corresponding CSI driver Pod.
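For example, the endpoint the controller will dial can be read directly from one of these objects (the field path below is taken from the v0.7.0 CRD and may differ in other versions):
$ kubectl get csiaddonsnode -n rook-ceph csi-rbdplugin-qb7g9 -o jsonpath='{.spec.driver.endpoint}'
172.10.0.69:9070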
Create a NetworkFence CR to automatically add node mec52's IP to the OSD blacklist:
$ cat <<EOF | kubectl apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: network-fence-sample
spec:
  driver: rook-ceph.rbd.csi.ceph.com   # fixed value
  fenceState: Fenced
  cidrs:
  - 192.168.73.52/32                   # node mec52's IP
  secret:
    name: rook-csi-rbd-provisioner     # fixed value
    namespace: rook-ceph-external      # fixed value
  parameters:
    clusterID: rook-ceph-external      # fixed value
EOF
$ kubectl get NetworkFence network-fence-sample
NAME DRIVER CIDRS FENCESTATE AGE RESULT
network-fence-sample rook-ceph.rbd.csi.ceph.com ["192.168.73.52/32"] Fenced 7s Succeeded
$ ceph osd blacklist ls
192.168.73.52:0/0 2029-02-06T08:13:22.359191+0000
listed 1 entries
NetworkFence uses .spec.fenceState to control whether the CIDRs are blacklisted. Change it to Unfenced to remove node mec52's IP from the blacklist:
$ kubectl patch NetworkFence network-fence-sample -p '{"spec":{"fenceState": "Unfenced"}}' --type=merge
$ kubectl get NetworkFence network-fence-sample
NAME DRIVER CIDRS FENCESTATE AGE RESULT
network-fence-sample rook-ceph.rbd.csi.ceph.com ["192.168.73.52/32"] Unfenced 2d18h Succeeded
$ ceph osd blacklist ls
listed 0 entries
It is a good idea to pre-create an Unfenced NetworkFence object for every node IP in the cluster, so that when needed you only have to patch it to Fenced; a sketch of how to do that follows below.
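A rough sketch of how this could be scripted, creating one Unfenced NetworkFence per node from its InternalIP (the driver, secret, and clusterID values are reused from the example above; adjust them to your cluster):
$ for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
    ip=$(kubectl get node "$node" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
    cat <<EOF | kubectl apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: network-fence-${node}
spec:
  driver: rook-ceph.rbd.csi.ceph.com
  fenceState: Unfenced
  cidrs:
  - ${ip}/32
  secret:
    name: rook-csi-rbd-provisioner
    namespace: rook-ceph-external
  parameters:
    clusterID: rook-ceph-external
EOF
  done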
How NetworkFence Works
As you would expect, the CSI-Addons controller watches NetworkFence objects:
It picks a live endpoint from the CSIAddonsNode objects as the target server, builds a FenceClusterNetwork request, and sends it over gRPC.
When the CSI-Addons sidecar receives the request, it forwards it to the CSI driver:
This is why the CSI driver must already implement the CSI-Addons specification. In Ceph RBD CSI, the FenceControllerServer handles the FenceClusterNetwork request:
func (fcs *FenceControllerServer) FenceClusterNetwork(
    ctx context.Context,
    req *fence.FenceClusterNetworkRequest) (*fence.FenceClusterNetworkResponse, error) {
    // a lot of code here

    nwFence, err := nf.NewNetworkFence(ctx, cr, req.Cidrs, req.GetParameters())
    if err != nil {
        return nil, status.Error(codes.Internal, err.Error())
    }

    err = nwFence.AddNetworkFence(ctx)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to fence CIDR block %q: %s", nwFence.Cidr, err.Error())
    }

    return &fence.FenceClusterNetworkResponse{}, nil
}
func (nf *NetworkFence) AddNetworkFence(ctx context.Context) error {
    // for each CIDR block, convert it into a range of IPs so as to perform blocklisting operation.
    for _, cidr := range nf.Cidr {
        // fetch the list of IPs from a CIDR block
        hosts, err := getIPRange(cidr)
        if err != nil {
            return fmt.Errorf("failed to convert CIDR block %s to corresponding IP range: %w", cidr, err)
        }

        // add ceph blocklist for each IP in the range mentioned by the CIDR
        for _, host := range hosts {
            err = nf.addCephBlocklist(ctx, host)
            if err != nil {
                return err
            }
        }
    }

    return nil
}
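Here getIPRange expands each CIDR block into individual IPs before they are blocklisted one by one. A minimal, illustrative version of such a helper (not the actual ceph-csi implementation) could look like this:

package main

import (
    "fmt"
    "net/netip"
)

// ipsInCIDR returns every address contained in the given CIDR block.
// For a /32 such as 192.168.73.52/32 this is just the single host IP.
func ipsInCIDR(cidr string) ([]string, error) {
    prefix, err := netip.ParsePrefix(cidr)
    if err != nil {
        return nil, fmt.Errorf("invalid CIDR %q: %w", cidr, err)
    }

    var ips []string
    // walk from the network address until we leave the prefix
    for addr := prefix.Masked().Addr(); prefix.Contains(addr); addr = addr.Next() {
        ips = append(ips, addr.String())
    }

    return ips, nil
}

func main() {
    ips, _ := ipsInCIDR("192.168.73.52/32")
    fmt.Println(ips) // [192.168.73.52]
}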
Finally we reach the addCephBlocklist method, which does exactly what our manual ceph osd blacklist add command did above:
func (nf *NetworkFence) addCephBlocklist(ctx context.Context, ip string) error {
    arg := []string{
        "--id", nf.cr.ID,
        "--keyfile=" + nf.cr.KeyFile,
        "-m", nf.Monitors,
    }
    // TODO: add blocklist till infinity.
    // Currently, ceph does not provide the functionality to blocklist IPs
    // for infinite time. As a workaround, add a blocklist for 5 YEARS to
    // represent infinity from ceph-csi side.
    // At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
    // This needs to be updated once ceph provides functionality for the same.
    cmd := []string{"osd", "blocklist", "add", ip, blocklistTime}
    cmd = append(cmd, arg...)
    _, _, err := util.ExecCommand(ctx, "ceph", cmd...)
    if err != nil {
        return fmt.Errorf("failed to blocklist IP %q: %w", ip, err)
    }

    log.DebugLog(ctx, "blocklisted IP %q successfully", ip)

    return nil
}
Here the Ceph RBD CSI driver simply shells out to the ceph osd blocklist add command and blacklists, one by one, the IP addresses covered by .spec.cidrs in the NetworkFence object. Personally I don't think this is ideal; it would be better to send the blocklist request to Ceph through an API instead of executing the CLI.
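As an example of what that might look like, here is a hedged sketch using the go-ceph bindings to send the equivalent mon command instead of exec'ing the CLI; the JSON argument names mirror the ceph osd blocklist add CLI and are an assumption that may need adjusting for your Ceph release:

package main

import (
    "encoding/json"
    "fmt"

    "github.com/ceph/go-ceph/rados"
)

// blocklistIP adds an IP to the Ceph OSD blocklist via the mon command
// interface rather than by running the ceph CLI.
// NOTE: "blocklistop", "addr" and "expire" mirror the
// "ceph osd blocklist add <addr> [<expire>]" command and are assumptions here.
func blocklistIP(conn *rados.Conn, ip string, expireSeconds float64) error {
    cmd, err := json.Marshal(map[string]interface{}{
        "prefix":      "osd blocklist",
        "blocklistop": "add",
        "addr":        ip,
        "expire":      expireSeconds,
    })
    if err != nil {
        return err
    }

    _, status, err := conn.MonCommand(cmd)
    if err != nil {
        return fmt.Errorf("failed to blocklist %q: %s: %w", ip, status, err)
    }

    return nil
}

func main() {
    conn, err := rados.NewConn()
    if err != nil {
        panic(err)
    }
    if err := conn.ReadDefaultConfigFile(); err != nil { // reads /etc/ceph/ceph.conf
        panic(err)
    }
    if err := conn.Connect(); err != nil {
        panic(err)
    }
    defer conn.Shutdown()

    // blocklist node mec52 for one hour
    if err := blocklistIP(conn, "192.168.73.52", 3600); err != nil {
        panic(err)
    }
}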
Summary
When a node fails (for example its kubelet goes down), a Pod using a ReadWriteOnce Ceph RBD PV is guaranteed to get stuck after being "migrated" to a new node. In that situation, consider using the CSI-Addons NetworkFence API to blacklist the failed node's IP so that the workload Pod can start first, and then investigate and repair the original node.