A Pod that uses a Ceph RBD PV with the ReadWriteOnce access mode fails to start after being "migrated" to another node and gets stuck in the ContainerCreating state.
The PVC and Deployment are defined as follows:
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /var/log/nginx
          name: log-vol
      volumes:
      - name: log-vol
        persistentVolumeClaim:
          claimName: pvc-test
When the node hosting the nginx Pod fails (kubelet goes down):
$ kubectl get nodes
NAME    STATUS     ROLES                  AGE    VERSION
mec51   Ready      control-plane,worker   103d   v1.27.2
mec52   NotReady   control-plane,worker   103d   v1.27.2
mec53   Ready      control-plane,worker   103d   v1.27.2
$ kubectl get po -o wide
NAME                     READY   STATUS              RESTARTS   AGE     IP           NODE    NOMINATED NODE   READINESS GATES
nginx-5ccff8b49c-6w5p7   0/1     ContainerCreating   0          23s     <none>       mec51   <none>           <none>
nginx-5ccff8b49c-pc2z4   1/1     Terminating         0          7m15s   172.10.0.2   mec52   <none>           <none>
Because the PVC's access mode is ReadWriteOnce, the attachdetach-controller in kube-controller-manager will not attach the PV to a new node until it has been detached from the original node.
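You can see this from the new Pod's events before touching anything; an illustrative check (the exact event text varies by Kubernetes version):
# The pending Pod on mec51 typically reports a FailedAttachVolume / Multi-Attach
# warning because the volume is still considered attached to mec52.
$ kubectl describe po nginx-5ccff8b49c-6w5p7 | grep -A5 Events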
The binding between a PV and a node is recorded in a VolumeAttachment object, so delete the relevant VolumeAttachment:
$ kubectl get volumeattachments | grep pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93
csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c   rook-ceph.rbd.csi.ceph.com      pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93   mec52   true       9m41s
$ kubectl delete volumeattachments csi-e94a3f5697b001e59f9f61f17f55d76d34b0e63468e00bc75571d5cbf3d2c79c
Although kube-controller-manager recreates the VolumeAttachment and binds it to node mec51 together with the new Pod nginx-5ccff8b49c-6w5p7, the Ceph RBD CSI node plugin still refuses to attach the RBD image behind the PV to the new node:
# node mec51
$ journalctl -u kubelet -f
Feb 01 14:26:24 mec51 kubelet[4010]: E0201 14:26:24.206515    4010 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 podName: nodeName:}" failed. No retries permitted until 2024-02-01 14:26:40.206480124 +0800 CST m=+7759.293368765 (durationBeforeRetry 16s). Error: MountVolume.MountDevice failed for volume "pvc-d04a31e5-bdcc-46ad-86c6-7e4bf91a8c93" (UniqueName: "kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com^0001-0012-rook-ceph-external-0000000000000002-10ff5cc1-c00b-11ee-bb20-fa163e1aa998") pod "nginx-5ccff8b49c-6w5p7" (UID: "e85b2bad-d4d4-4739-b187-25d0a11008d3") : rpc error: code = Internal desc = rbd image mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998 is still being used
This is the NodeStageVolume RPC call that kubelet issues to the CSI node plugin failing.
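To confirm that the error really originates in the RBD node plugin rather than in kubelet itself, the plugin's own log on mec51 contains the same message. A hedged example; the Pod and container names follow Rook's defaults and may differ in your deployment:
# Find the csi-rbdplugin DaemonSet Pod running on mec51, then grep its log.
$ kubectl get po -n rook-ceph -l app=csi-rbdplugin -o wide | grep mec51
$ kubectl logs -n rook-ceph <csi-rbdplugin-pod-on-mec51> -c csi-rbdplugin | grep "still being used"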
Looking at the Ceph RBD CSI node plugin source code, the call chain is NodeStageVolume -> stageTransaction -> attachRBDImage -> waitForrbdImage:
func waitForrbdImage(ctx context.Context, backoff wait.Backoff, volOptions *rbdVolume) error {
    imagePath := volOptions.String()
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        used, err := volOptions.isInUse()
        if err != nil {
            return false, fmt.Errorf("fail to check rbd image status: (%w)", err)
        }
        if (volOptions.DisableInUseChecks) && (used) {
            log.UsefulLog(ctx, "valid multi-node attach requested, ignoring watcher in-use result")
            return used, nil
        }
        return !used, nil
    })
    // return error if rbd image has not become available for the specified timeout
    if errors.Is(err, wait.ErrWaitTimeout) {
        return fmt.Errorf("rbd image %s is still being used", imagePath)
    }
    // return error if any other errors were encountered during waiting for the image to become available
    return err
}
The Ceph RBD CSI node plugin first checks whether the RBD image is currently in use:
// isInUse checks if there is a watcher on the image. It returns true if there
// is a watcher on the image, otherwise returns false.
func (ri *rbdImage) isInUse() (bool, error) {
    image, err := ri.open()
    if err != nil {
        if errors.Is(err, ErrImageNotFound) || errors.Is(err, util.ErrPoolNotFound) {
            return false, err
        }
        // any error should assume something else is using the image
        return true, err
    }
    defer image.Close()
    watchers, err := image.ListWatchers()
    if err != nil {
        return false, err
    }
    mirrorInfo, err := image.GetMirrorImageInfo()
    if err != nil {
        return false, err
    }
    ri.Primary = mirrorInfo.Primary
    // because we opened the image, there is at least one watcher
    defaultWatchers := 1
    if ri.Primary {
        // if rbd mirror daemon is running, a watcher will be added by the rbd
        // mirror daemon for mirrored images.
        defaultWatchers++
    }
    return len(watchers) > defaultWatchers, nil
}
Whether the image is in use (i.e. attached to some node) is decided by the current number of watchers on the RBD image.
$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers:
        watcher=192.168.73.52:0/319745466 client.2474196 cookie=18446462598732840999
192.168.73.52 is the IP of node mec52. That node is not completely down; only kubelet has stopped, so the kernel RBD client on mec52 still holds a watch on the image.
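If you can still reach mec52 over SSH, it is easy to see why the watcher survives: the image is still mapped by the kernel RBD client, which does not care whether kubelet is running. An illustrative check on mec52:
# On mec52: the krbd mapping (and therefore the watcher on the Ceph side) is still alive.
$ lsblk | grep rbd
$ cat /sys/bus/rbd/devices/*/name   # names of the RBD images mapped on this node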
Manually add node mec52 to the Ceph OSD blacklist:
$ ceph osd blacklist add 192.168.73.52
blocklisting 192.168.73.52:0/319745466 until 2024-02-01T07:57:05.125665+0000 (3600 sec)
$ rbd status mec-ecs-pool/csi-vol-10ff5cc1-c00b-11ee-bb20-fa163e1aa998
Watchers: none
After a short while the node plugin successfully attaches the RBD volume to node mec51 and the new Pod becomes Running:
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
nginx-5ccff8b49c-6w5p7   1/1     Running   0          45m
Afterwards, clear the OSD blacklist:
$ ceph osd blacklist clear
The rest of this post shows how to automate the blacklisting step above with CSI-Addons and NetworkFence.
CSI-Addons
First, make sure your CSI driver actually implements the CSI-Addons specification!
As the name suggests, CSI-Addons extends and enhances the existing capabilities of CSI.
.------.   CR  .------------.
| User |-------| CSI-Addons |
'------'       | Controller |
               '------------'
                      |
                      | gRPC
                      |
            .---------+------------------------------.
            |         |                              |
            |  .------------.        .------------.  |
            |  | CSI-Addons |  gRPC  |    CSI     |  |
            |  |  sidecar   |--------| Controller |  |
            |  '------------'        | NodePlugin |  |
            |                        '------------'  |
            | CSI-driver Pod                         |
            '----------------------------------------'
- Like the official Kubernetes CSI sidecar containers, CSI-Addons needs an additional sidecar container deployed inside the CSI driver Pod:
  # The csi-rbdplugin-provisioner Pods previously had 5 containers each
  $ kubectl get po -n rook-ceph -l "app=csi-rbdplugin-provisioner"
  NAME                                         READY   STATUS    RESTARTS   AGE
  csi-rbdplugin-provisioner-77fbb96487-hg8jz   6/6     Running   0          2d19h
  csi-rbdplugin-provisioner-77fbb96487-w2zpj   6/6     Running   0          2d19h
  # The csi-rbdplugin Pods previously had 2 containers each
  $ kubectl get po -n rook-ceph -l "app=csi-rbdplugin"
  NAME                  READY   STATUS    RESTARTS   AGE
  csi-rbdplugin-62h5f   3/3     Running   0          2d19h
  csi-rbdplugin-qb7g9   3/3     Running   0          2d19h
  csi-rbdplugin-tsphv   3/3     Running   0          2d19h
  If the Ceph CSI driver is deployed with Rook, set CSI_ENABLE_CSIADDONS: true in the rook-ceph-operator configuration, or enable it when installing via Helm.
- The CSI-Addons controller itself still has to be deployed separately (a quick verification sketch follows this list):
  $ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/crds.yaml
  $ kubectl apply -f https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/rbac.yaml
  # Note: pin the controller image to v0.7.0
  $ curl -s https://raw.githubusercontent.com/csi-addons/kubernetes-csi-addons/v0.7.0/deploy/controller/setup-controller.yaml | sed 's/k8s-controller:latest/k8s-controller:v0.7.0/g' | kubectl create -f -
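After applying the manifests above, it is worth a quick sanity check that the CRDs exist and the controller Pod is running; a minimal sketch, assuming the upstream manifest's default csi-addons-system namespace:
$ kubectl get crd | grep csiaddons.openshift.io
$ kubectl get po -n csi-addons-system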
The CSIAddonsNode objects are created automatically by the CSI-Addons sidecar in the csi-rbdplugin Pods; there is no need to create them by hand:
$ kubectl get CSIAddonsNode -A
NAMESPACE   NAME                                         NAMESPACE   AGE   DRIVERNAME                   ENDPOINT           NODEID
rook-ceph   csi-rbdplugin-62h5f                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.73:9070   mec52
rook-ceph   csi-rbdplugin-provisioner-77fbb96487-hg8jz   rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.40:9070   mec51
rook-ceph   csi-rbdplugin-provisioner-77fbb96487-w2zpj   rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.46:9070   mec52
rook-ceph   csi-rbdplugin-qb7g9                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.69:9070   mec51
rook-ceph   csi-rbdplugin-tsphv                          rook-ceph   32m   rook-ceph.rbd.csi.ceph.com   172.10.0.50:9070   mec53
The CSI-Addons controller talks to the sidecar inside each CSI driver Pod via the Endpoint recorded in the corresponding CSIAddonsNode object.
Create a NetworkFence CR to automatically add node mec52's IP to the OSD blacklist:
$ cat <<EOF | kubectl apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: network-fence-sample
spec:
  driver: rook-ceph.rbd.csi.ceph.com # fixed value
  fenceState: Fenced
  cidrs:
    - 192.168.73.52/32 # node mec52's IP
  secret:
    name: rook-csi-rbd-provisioner # fixed value
    namespace: rook-ceph-external # fixed value
  parameters:
    clusterID: rook-ceph-external # fixed value
EOF
$ kubectl get NetworkFence network-fence-sample
NAME                   DRIVER                       CIDRS                  FENCESTATE   AGE   RESULT
network-fence-sample   rook-ceph.rbd.csi.ceph.com   ["192.168.73.52/32"]   Fenced       7s    Succeeded
$ ceph osd blacklist ls
192.168.73.52:0/0 2029-02-06T08:13:22.359191+0000
listed 1 entries
A NetworkFence object uses .spec.fenceState to control whether its CIDRs are blacklisted; change it to Unfenced to remove node mec52's IP from the blacklist:
$ kubectl patch NetworkFence network-fence-sample -p '{"spec":{"fenceState": "Unfenced"}}' --type=merge
$ kubectl get NetworkFence network-fence-sample
NAME                   DRIVER                       CIDRS                  FENCESTATE   AGE     RESULT
network-fence-sample   rook-ceph.rbd.csi.ceph.com   ["192.168.73.52/32"]   Unfenced     2d18h   Succeeded
$ ceph osd blacklist ls
listed 0 entries
It is advisable to pre-create an Unfenced NetworkFence object for every node IP in the cluster, and simply patch it to Fenced when needed (see the sketch below).
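A minimal sketch of that approach; the object naming scheme and the use of each node's InternalIP are my own choices, and the secret/clusterID values are taken from the example above:
# Pre-create one Unfenced NetworkFence per node, named network-fence-<node>.
for node in $(kubectl get nodes -o jsonpath='{.items[*].metadata.name}'); do
  ip=$(kubectl get node "$node" -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}')
  cat <<EOF | kubectl apply -f -
apiVersion: csiaddons.openshift.io/v1alpha1
kind: NetworkFence
metadata:
  name: network-fence-$node
spec:
  driver: rook-ceph.rbd.csi.ceph.com
  fenceState: Unfenced
  cidrs:
    - $ip/32
  secret:
    name: rook-csi-rbd-provisioner
    namespace: rook-ceph-external
  parameters:
    clusterID: rook-ceph-external
EOF
done

# When a node (e.g. mec52) fails, fence it with a single patch:
$ kubectl patch NetworkFence network-fence-mec52 -p '{"spec":{"fenceState":"Fenced"}}' --type=merge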
How NetworkFence works
As you would expect, the CSI-Addons controller watches NetworkFence objects:
it picks one live endpoint from the CSIAddonsNode objects as the target server, wraps the change into a FenceClusterNetwork request, and sends it over gRPC.
Once the CSI-Addons sidecar receives the request, it forwards it to the CSI driver container in the same Pod.
This is why the CSI driver must already implement the CSI-Addons specification; here, the FenceControllerServer in Ceph RBD CSI receives the FenceClusterNetwork request:
func (fcs *FenceControllerServer) FenceClusterNetwork(
    ctx context.Context,
    req *fence.FenceClusterNetworkRequest) (*fence.FenceClusterNetworkResponse, error) {
    // a lot of code here
    nwFence, err := nf.NewNetworkFence(ctx, cr, req.Cidrs, req.GetParameters())
    if err != nil {
        return nil, status.Error(codes.Internal, err.Error())
    }
    err = nwFence.AddNetworkFence(ctx)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "failed to fence CIDR block %q: %s", nwFence.Cidr, err.Error())
    }
    return &fence.FenceClusterNetworkResponse{}, nil
}
func (nf *NetworkFence) AddNetworkFence(ctx context.Context) error {
    // for each CIDR block, convert it into a range of IPs so as to perform blocklisting operation.
    for _, cidr := range nf.Cidr {
        // fetch the list of IPs from a CIDR block
        hosts, err := getIPRange(cidr)
        if err != nil {
            return fmt.Errorf("failed to convert CIDR block %s to corresponding IP range: %w", cidr, err)
        }
        // add ceph blocklist for each IP in the range mentioned by the CIDR
        for _, host := range hosts {
            err = nf.addCephBlocklist(ctx, host)
            if err != nil {
                return err
            }
        }
    }
    return nil
}
Finally we arrive at the addCephBlocklist method, which does exactly the same thing as the ceph osd blacklist add command we ran by hand earlier:
func (nf *NetworkFence) addCephBlocklist(ctx context.Context, ip string) error {
    arg := []string{
        "--id", nf.cr.ID,
        "--keyfile=" + nf.cr.KeyFile,
        "-m", nf.Monitors,
    }
    // TODO: add blocklist till infinity.
    // Currently, ceph does not provide the functionality to blocklist IPs
    // for infinite time. As a workaround, add a blocklist for 5 YEARS to
    // represent infinity from ceph-csi side.
    // At any point in this time, the IPs can be unblocked by an UnfenceClusterReq.
    // This needs to be updated once ceph provides functionality for the same.
    cmd := []string{"osd", "blocklist", "add", ip, blocklistTime}
    cmd = append(cmd, arg...)
    _, _, err := util.ExecCommand(ctx, "ceph", cmd...)
    if err != nil {
        return fmt.Errorf("failed to blocklist IP %q: %w", ip, err)
    }
    log.DebugLog(ctx, "blocklisted IP %q successfully", ip)
    return nil
}
So the Ceph RBD CSI driver simply shells out to ceph osd blocklist add and blacklists the IP addresses from the NetworkFence object's .spec.cidrs one by one. Personally I don't think this is the nicest approach; it would be cleaner to send the blocklist request to Ceph through an API instead of exec'ing the CLI.
Summary
When a node fails (for example, kubelet dies), a Pod using a ReadWriteOnce Ceph RBD PV is guaranteed to fail to start after being "migrated" to a new node. In that case, consider using the CSI-Addons NetworkFence API to blacklist the faulty node's IP so that the workload Pod can come up first, and then investigate and repair the original node.