Notes on etcd Cluster Backup and Disaster Recovery in k8s (Code Snippets)


Preface


  • If your cluster's power is unstable, or nodes keep going down, make sure you take backups: etcd snapshot files are easily corrupted.
  • I only came to appreciate how important backups are after resetting the cluster many times.
  • This post covers:
    • etcd operations basics
    • a disaster backup/restore demo for an etcd cluster run as static pods
    • writing a scheduled backup job
    • a disaster backup/restore demo for a binary etcd cluster
  • Corrections for anything I've gotten wrong are welcome.

I wanted only to try to live in accord with the promptings which came from my true self. Why was that so very difficult? ------ Hermann Hesse, Demian


etcd Overview

etcd is an open-source project started by the CoreOS team in June 2013; its goal is to build a highly available distributed key-value database.

Internally, etcd uses the Raft protocol as its consensus algorithm, and it is implemented in Go.

  • Fully replicated: every node in the cluster holds the complete data set
  • Highly available: etcd can be used to avoid hardware single points of failure and network problems
  • Consistent: every read returns the latest write across hosts
  • Simple: a well-defined, user-facing API (gRPC)
  • Secure: automated TLS with optional client certificate authentication
  • Fast: benchmarked at 10,000 writes per second
  • Reliable: the Raft algorithm provides a strongly consistent, highly available store

Basics of etcd cluster operations (a quick health-check example follows this list):

  • Client read/write port: 2379; peer data-sync port: 2380
  • An etcd cluster is a distributed system that uses the Raft protocol to keep the state of its member nodes consistent.
  • Member states: Leader, Follower, Candidate
  • When the cluster initializes, every node starts in the Follower role and syncs data with the other nodes via heartbeats
  • Reads can be served by Followers; writes go through the Leader
  • A Follower that receives no heartbeat from the Leader within the election timeout switches to Candidate and starts a leader election
  • Configure an odd number of nodes rather than an even number; 3, 5, or 7 members per cluster is recommended.
  • Use etcd's built-in backup/restore tooling to back up data from the source deployment and restore it into a new one. Clean the data directory before restoring.
  • snap/ under the data directory holds snapshot data; etcd takes these snapshots to keep WAL files from piling up, and they store the etcd data state.
  • wal/ under the data directory holds the write-ahead log, whose main job is to record the full history of data changes. In etcd, every modification must be written to the WAL before it is committed.
  • An etcd cluster should probably not exceed seven nodes, or write performance suffers; five is recommended. A 5-member etcd cluster tolerates two member failures; a 3-member cluster tolerates one.
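As referenced above, a minimal health check against a kubeadm-style etcd, using the same endpoint and certificate paths that appear throughout this post (a sketch, not the only way):

ETCDCTL_API=3 etcdctl --endpoints="https://127.0.0.1:2379" \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key" \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
 endpoint health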

Common configuration parameters (a sample config file follows this list):

  • ETCD_NAME node name; defaults to default
  • ETCD_DATA_DIR path where the service's data is stored
  • ETCD_LISTEN_PEER_URLS addresses to listen on for peer traffic, e.g. http://ip:2380; comma-separated if there are several. All nodes must be able to reach them, so don't use localhost
  • ETCD_LISTEN_CLIENT_URLS addresses to listen on for client traffic
  • ETCD_ADVERTISE_CLIENT_URLS the client URLs this node advertises; this value is announced to the other nodes in the cluster
  • ETCD_INITIAL_ADVERTISE_PEER_URLS the peer URLs this node advertises; this value is announced to the other nodes in the cluster
  • ETCD_INITIAL_CLUSTER information about all the nodes in the cluster
  • ETCD_INITIAL_CLUSTER_STATE new when bootstrapping a fresh cluster; existing when joining a cluster that already exists
  • ETCD_INITIAL_CLUSTER_TOKEN cluster ID; when running multiple clusters, each cluster's ID must be unique
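To make these parameters concrete, here is a hypothetical environment file for one member of a 3-node binary deployment (the names, IPs, and token are made up for illustration):

# /etc/etcd/etcd.conf -- member etcd-1 of a hypothetical 3-node cluster
ETCD_NAME="etcd-1"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="http://192.168.26.101:2380"
ETCD_LISTEN_CLIENT_URLS="http://192.168.26.101:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.26.101:2379"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://192.168.26.101:2380"
ETCD_INITIAL_CLUSTER="etcd-1=http://192.168.26.101:2380,etcd-2=http://192.168.26.102:2380,etcd-3=http://192.168.26.103:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster-demo"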

Backup and Restore with the Static Pod Deployment

Single-node etcd backup and restore

If etcd is deployed as a single node, you can take a physical backup by copying the data file directory directly; to restore, copy the backed-up etcd data directory back to the directory etcd expects. Once the restore is done, put the /etc/kubernetes/manifests/etcd.yaml file back to its original state.
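A minimal sketch of that physical backup/restore flow, assuming the kubeadm default paths (/etc/kubernetes/manifests and /var/lib/etcd):

# Stop etcd by moving its static pod manifest out of the watched directory
mv /etc/kubernetes/manifests/etcd.yaml /tmp/
# Physical backup: copy the entire data directory, preserving attributes
cp -a /var/lib/etcd /backup_$(date +%Y%m%d)/
# To restore: copy the backed-up directory back, then restore the manifest
# cp -a /backup_20230127/etcd /var/lib/
# mv /tmp/etcd.yaml /etc/kubernetes/manifests/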

You can also back up from a snapshot.

Backup command

┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$ETCDCTL_API=3 etcdctl --endpoints="https://127.0.0.1:2379" \
 --cert="/etc/kubernetes/pki/etcd/server.crt"  \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 snapshot save snap-$(date +%Y%m%d%H%M).db
Snapshot saved at snap-202301272133.db

Restore command

┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$ETCDCTL_API=3 etcdctl snapshot restore ./snap-202301272133.db \
 --name vms81.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt"  \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
 --initial-advertise-peer-urls=https://192.168.26.81:2380  \
 --initial-cluster="vms81.liruilongs.github.io=https://192.168.26.81:2380"  \
 --data-dir=/var/lib/etcd
2023-01-27 21:40:01.193420 I | mvcc: restore compact to 484325
2023-01-27 21:40:01.199682 I | etcdserver/membership: added member cbf506fa2d16c7 [https://192.168.26.81:2380] to cluster 46c9df5da345274b
┌──[root@vms81.liruilongs.github.io]-[/backup_20230127]
└─$

The exact values for these parameters can be read from the etcd static pod's YAML:

┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$kubectl describe  pods etcd-vms81.liruilongs.github.io | grep -e "--"
      --advertise-client-urls=https://192.168.26.81:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --initial-advertise-peer-urls=https://192.168.26.81:2380
      --initial-cluster=vms81.liruilongs.github.io=https://192.168.26.81:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.81:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.26.81:2380
      --name=vms81.liruilongs.github.io
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
┌──[root@vms81.liruilongs.github.io]-[/var/lib/etcd/member]
└─$

Cluster etcd backup and restore

Cluster member status

┌──[root@vms100.liruilongs.github.io]-[~/ansible/helm]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" member list -w table
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |            NAME             |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
|  ee392e5273e89e2 | started | vms100.liruilongs.github.io | https://192.168.26.100:2380 | https://192.168.26.100:2379 |
| 11486647d7f3a17b | started | vms102.liruilongs.github.io | https://192.168.26.102:2380 | https://192.168.26.102:2379 |
| e00e3877df8f76f4 | started | vms101.liruilongs.github.io | https://192.168.26.101:2380 | https://192.168.26.101:2379 |
+------------------+---------+-----------------------------+-----------------------------+-----------------------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible/helm]

Version and leader information:

┌──[root@vms100.liruilongs.github.io]-[~/ansible/kubescape]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |   3.5.4 |   37 MB |     false |       100 |    3152364 |
| https://192.168.26.102:2379 | 11486647d7f3a17b |   3.5.4 |   36 MB |     false |       100 |    3152364 |
| https://192.168.26.101:2379 | e00e3877df8f76f4 |   3.5.4 |   36 MB |      true |       100 |    3152364 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible/kubescape]
└─$

In a cluster, a backup taken on a single node is sufficient: as mentioned earlier, an etcd cluster is fully replicated, so any one member holds the complete data set.

┌──[root@vms100.liruilongs.github.io]-[~]
└─$yum -y install etcd

If the etcdctl tool is missing, install etcd or copy the binary over from somewhere else. Here we install it, then copy etcdctl to the other cluster nodes, as sketched below.
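A sketch of pushing the binary out with the same Ansible inventory (host.yaml) and k8s_master group used later in this post:

# Copy the locally installed etcdctl to every control-plane node
ansible k8s_master -m copy -a "src=/usr/bin/etcdctl dest=/usr/bin/etcdctl mode=0755" -i host.yaml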

Backup

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ENDPOINT=https://127.0.0.1:2379
┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" snapshot save snapshot.db
Snapshot saved at snapshot.db

Verify the snapshot hash:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl --write-out=table snapshot status snapshot.db
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| 46aa26ed |   217504 |       2711 |      27 MB |
+----------+----------+------------+------------+
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

Restore

The etcd cluster here is deployed in stacked mode, running as a static pod on each control-plane node.

Always take a backup first. Before restoring, move the original data files aside, and make sure the etcd and kube-apiserver pods have been stopped. First, collect the necessary parameters:

┌──[root@vms100.liruilongs.github.io]-[~]
└─$kubectl describe pod etcd-vms100.liruilongs.github.io -n kube-system  | grep -e '--'
      --advertise-client-urls=https://192.168.26.100:2379
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
      --client-cert-auth=true
      --data-dir=/var/lib/etcd
      --experimental-initial-corrupt-check=true
      --experimental-watch-progress-notify-interval=5s
      --initial-advertise-peer-urls=https://192.168.26.100:2380
      --initial-cluster=vms100.liruilongs.github.io=https://192.168.26.100:2380
      --key-file=/etc/kubernetes/pki/etcd/server.key
      --listen-client-urls=https://127.0.0.1:2379,https://192.168.26.100:2379
      --listen-metrics-urls=http://127.0.0.1:2381
      --listen-peer-urls=https://192.168.26.100:2380
      --name=vms100.liruilongs.github.io
      --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
      --peer-client-cert-auth=true
      --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --snapshot-count=10000
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
┌──[root@vms100.liruilongs.github.io]-[~]
└─$

To restore, stop the kube-apiserver and etcd static pods on all master nodes. The kubelet rescans the manifests directory every 20 s for static pod changes, so moving the YAML files out of it stops the pods.

Here we use Ansible to do this on every node in the cluster.

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv  /etc/kubernetes/manifests/etcd.yaml  /tmp/ " -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>

192.168.26.101 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv   /etc/kubernetes/manifests/kube-apiserver.yaml  /tmp/ " -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>

192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

Confirm the static pod YAML files really have been moved:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.102 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.100 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
haproxy.yaml
keepalived.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$
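Optionally, before wiping the data directories, confirm the etcd containers have actually exited. A sketch assuming a Docker runtime (with containerd, use crictl ps instead):

ansible k8s_master -m shell -a "docker ps | grep etcd || echo etcd stopped" -i host.yaml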

Clear the etcd data directory on all cluster nodes:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "rm -rf /var/lib/etcd/" -i host.yaml
[WARNING]: Consider using the file module with state=absent rather than running 'rm'.  If you need to use command because file is insufficient you can add 'warn:
false' to this command task or set 'command_warnings=False' in ansible.cfg to get rid of this message.
192.168.26.101 | CHANGED | rc=0 >>

192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>
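A less destructive variant is to move the old data directory aside rather than delete it, so you can roll back if the restore goes wrong (the suffix is just an example):

# $(date ...) expands on the control node, so all members get the same suffix
ansible k8s_master -m shell -a "mv /var/lib/etcd /var/lib/etcd.bak-$(date +%Y%m%d)" -i host.yaml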

Copy the snapshot backup file to all cluster nodes:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master  -m copy -a "src=snap-202302070000.db dest=/root/" -i host.yaml

Restore on vms100.liruilongs.github.io:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db \
 --name vms100.liruilongs.github.io  \
 --cert="/etc/kubernetes/pki/etcd/server.crt" \
 --key="/etc/kubernetes/pki/etcd/server.key"  \
 --cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
 --endpoints="https://127.0.0.1:2379" \
 --initial-advertise-peer-urls="https://192.168.26.100:2380"  \
 --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" \
 --data-dir=/var/lib/etcd
2023-02-08 12:50:27.598250 I | mvcc: restore compact to 2837993
2023-02-08 12:50:27.609440 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:50:27.609480 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:50:27.609487 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

Restore on vms101.liruilongs.github.io:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ssh 192.168.26.101
Last login: Wed Feb  8 12:48:31 2023 from 192.168.26.100
┌──[root@vms101.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db --name vms101.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.101:2380"  --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd
2023-02-08 12:52:21.976748 I | mvcc: restore compact to 2837993
2023-02-08 12:52:21.991588 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:52:21.991622 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:52:21.991629 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7

Restore on vms102.liruilongs.github.io:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ssh 192.168.26.102
Last login: Wed Feb  8 12:48:31 2023 from 192.168.26.100
┌──[root@vms102.liruilongs.github.io]-[~]
└─$ETCDCTL_API=3 etcdctl snapshot restore snap-202302070000.db --name vms102.liruilongs.github.io  --cert="/etc/kubernetes/pki/etcd/server.crt" --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt"   --endpoints="https://127.0.0.1:2379" --initial-advertise-peer-urls="https://192.168.26.102:2380" --initial-cluster="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380" --data-dir=/var/lib/etcd
2023-02-08 12:53:32.338663 I | mvcc: restore compact to 2837993
2023-02-08 12:53:32.354619 I | etcdserver/membership: added member ee392e5273e89e2 [https://192.168.26.100:2380] to cluster 4816f346663d82a7
2023-02-08 12:53:32.354782 I | etcdserver/membership: added member 70059e836d19883d [https://192.168.26.101:2380] to cluster 4816f346663d82a7
2023-02-08 12:53:32.354790 I | etcdserver/membership: added member b8cb9f66c2e63b91 [https://192.168.26.102:2380] to cluster 4816f346663d82a7
┌──[root@vms102.liruilongs.github.io]-[~]
└─$
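The three restores above differ only in --name and --initial-advertise-peer-urls. A hedged sketch that drives all of them from one node over SSH (hostnames, IPs, and the snapshot path as above; etcdctl snapshot restore is an offline operation, so the TLS flags passed earlier are not strictly required):

#!/bin/bash
# Restore the same snapshot on every member; only --name and the
# advertised peer URL change per node.
SNAP=/root/snap-202302070000.db
CLUSTER="vms100.liruilongs.github.io=https://192.168.26.100:2380,vms101.liruilongs.github.io=https://192.168.26.101:2380,vms102.liruilongs.github.io=https://192.168.26.102:2380"
declare -A NODES=(
  [vms100.liruilongs.github.io]=192.168.26.100
  [vms101.liruilongs.github.io]=192.168.26.101
  [vms102.liruilongs.github.io]=192.168.26.102
)
for NAME in "${!NODES[@]}"; do
  IP=${NODES[$NAME]}
  ssh "root@$IP" ETCDCTL_API=3 etcdctl snapshot restore "$SNAP" \
    --name "$NAME" \
    --initial-advertise-peer-urls "https://$IP:2380" \
    --initial-cluster "$CLUSTER" \
    --data-dir /var/lib/etcd
done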

After the restores complete, move the etcd and kube-apiserver static pod manifests back:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /tmp/kube-apiserver.yaml  /etc/kubernetes/manifests/ " -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>

192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "mv /tmp/etcd.yaml  /etc/kubernetes/manifests/etcd.yaml " -i host.yaml
192.168.26.101 | CHANGED | rc=0 >>

192.168.26.102 | CHANGED | rc=0 >>

192.168.26.100 | CHANGED | rc=0 >>

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Confirm the move succeeded:

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ansible k8s_master -m command -a "ls /etc/kubernetes/manifests/" -i host.yaml
192.168.26.100 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.101 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
192.168.26.102 | CHANGED | rc=0 >>
etcd.yaml
haproxy.yaml
keepalived.yaml
kube-apiserver.yaml
kube-controller-manager.yaml
kube-scheduler.yaml
┌──[root@vms100.liruilongs.github.io]-[~/ansible]

Check the etcd cluster from any node: the restore succeeded. (The kubectl call below was refused, likely because the apiserver behind the VIP had not come back up yet.)

┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$kubectl get pods
The connection to the server 192.168.26.99:30033 was refused - did you specify the right host or port?
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |   3.5.4 |   37 MB |     false |         2 |        146 |
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   37 MB |      true |         2 |        146 |
| https://192.168.26.102:2379 | b8cb9f66c2e63b91 |   3.5.4 |   37 MB |     false |         2 |        146 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
┌──[root@vms100.liruilongs.github.io]-[~/ansible]
└─$

Problems encountered:

If a node reports the error below, or fails to join the cluster (only two members show up, as in the table below), repeat the steps above for that node.

Handling the panic: tocommit(258) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost? error

┌──[root@vms100.liruilongs.github.io]-[~/back]
└─$ETCDCTL_API=3 etcdctl  --endpoints https://127.0.0.1:2379  --cert="/etc/kubernetes/pki/etcd/server.crt"  --key="/etc/kubernetes/pki/etcd/server.key"  --cacert="/etc/kubernetes/pki/etcd/ca.crt" endpoint status --cluster  -w table
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+
| https://192.168.26.100:2379 |  ee392e5273e89e2 |   3.5.4 |   37 MB |      true |         2 |      85951 |
| https://192.168.26.101:2379 | 70059e836d19883d |   3.5.4 |   37 MB |     false |         2 |      85951 |
+-----------------------------+------------------+---------+---------+-----------+-----------+------------+

Writing the scheduled backup job

The scheduled backup here is implemented with systemd.service and systemd.timer units that periodically run the etcd_back.sh backup script, enabled to start at boot.

This part is simple; there isn't much to explain.

┌──[root@vms81.liruilongs.github.io]-[~/back]
└─$systemctl cat etcd-backup
# /usr/lib/systemd/system/etcd-backup.service
[Unit]
Description="ETCD backup"
After=network-online.target

[Service]
Type=oneshot
Environment=ETCDCTL_API=3
ExecStart=/usr/bin/bash /usr/lib/systemd/system/etcd_back.sh


[Install]
WantedBy=multi-user.target
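Since the service is Type=oneshot, you can trigger a test run directly and inspect the result before relying on the timer:

systemctl start etcd-backup.service
journalctl -u etcd-backup.service -n 20 --no-pager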

Runs once a day at midnight:

┌──[root@vms81.liruilongs.github.io]-[~/back]
└─$systemctl cat etcd-backup.timer
# /usr/lib/systemd/system/etcd-backup.timer
[Unit]
Description="Back up ETCD once a day"

[Timer]
OnBootSec=3s
OnCalendar=*-*-* 00:00:00
Unit=etcd-backup.service

[Install]
WantedBy=multi-user.target
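After enabling the timer (see the deployment script below), you can confirm the next scheduled run:

systemctl list-timers etcd-backup.timer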

The backup script:

┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$cat etcd_back.sh
#!/bin/bash

#@File    :   etcd_back.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD backup
#@Contact :   1224965096@qq.com

if [ ! -d /root/back/ ];then
   mkdir -p /root/back/
fi
STR_DATE=$(date +%Y%m%d%H%M)

ETCDCTL_API=3 etcdctl \
--endpoints="https://127.0.0.1:2379"  \
--cert="/etc/kubernetes/pki/etcd/server.crt"  \
--key="/etc/kubernetes/pki/etcd/server.key"  \
--cacert="/etc/kubernetes/pki/etcd/ca.crt"   \
snapshot save /root/back/snap-$STR_DATE.db

ETCDCTL_API=3 etcdctl --write-out=table snapshot status /root/back/snap-$STR_DATE.db

sudo chmod  o-w,u-w,g-w  /root/back/snap-$STR_DATE.db
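The script keeps every snapshot forever. A hypothetical retention rule you could append to it (the 7-day window is arbitrary):

# Delete snapshots older than 7 days
find /root/back/ -name 'snap-*.db' -mtime +7 -delete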

Deploying the service and timer:

┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$cat deply.sh
#!/bin/bash

#@File    :   deply.sh
#@Time    :   2023/01/27 23:00:27
#@Author  :   Li Ruilong
#@Version :   1.0
#@Desc    :   ETCD backup deployment
#@Contact :   1224965096@qq.com

cp ./* /usr/lib/systemd/system/
systemctl enable etcd-backup.timer --now
systemctl enable etcd-backup.service --now
ls /root/back/

Viewing the logs:

┌──[root@vms100.liruilongs.github.io]-[~/ansible/backup]
└─$journalctl -u etcd-backup.service -o cat
...................
Starting "ETCD backup"...
Snapshot saved at /root/back/snap-202301290120.db
+----------+----------+------------+-
