1. Scaling up compute nodes
Before scaling up, check the following on every node being added (a minimal preflight sketch follows this checklist):
kernel version
SELinux is enabled and in enforcing mode
the Docker data disk is ready
/etc/resolv.conf is configured correctly
the hostname is set
time synchronization is configured
every node can resolve the new nodes' hostnames; if resolution is done through /etc/hosts, restart the dnsmasq service on all nodes after updating the file
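A minimal preflight sketch along these lines can be run on each new node before the scale-up; the specific commands and the example hostname are assumptions, adjust them to your environment:

# preflight checks on a node about to join the cluster
uname -r                                  # kernel version
getenforce                                # should print Enforcing
lsblk                                     # confirm the Docker data disk is present
cat /etc/resolv.conf                      # resolvers are correct
hostnamectl status                        # hostname is set
chronyc tracking                          # time synchronization is healthy (or: ntpstat)
getent hosts node04.internal.aws.testdrive.openshift.com   # the new node resolves from every host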
Docker certificate handling, especially certificates for private image registries, should be folded into the automation (a hedged example follows). Three places are involved:
the /etc/sysconfig/docker configuration,
the certificates under /etc/pki/ca-trust/source/anchors/,
the registry authentication certificates under /etc/docker/certs.d
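A rough sketch of what the automation needs to do on each node for a private registry CA, assuming the registry is reachable as registry.example.com and its CA certificate is in ca.crt (both names are placeholders):

# trust the registry CA system-wide
cp ca.crt /etc/pki/ca-trust/source/anchors/registry.example.com.crt
update-ca-trust extract
# let docker verify pulls from this registry
mkdir -p /etc/docker/certs.d/registry.example.com
cp ca.crt /etc/docker/certs.d/registry.example.com/ca.crt
# any registry-related OPTIONS (e.g. ADD_REGISTRY) live in /etc/sysconfig/docker
systemctl restart docker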
# /etc/ansible/hosts
[OSEv3:children]
masters
nodes
etcd
new_nodes
...

[new_nodes]
node04.internal.aws.testdrive.openshift.com openshift_node_labels="{'region': 'apps'}" openshift_hostname=node04.internal.aws.testdrive.openshift.com openshift_public_hostname=node04.580763383722.aws.testdrive.openshift.com
node05.internal.aws.testdrive.openshift.com openshift_node_labels="{'region': 'apps'}" openshift_hostname=node05.internal.aws.testdrive.openshift.com openshift_public_hostname=node05.580763383722.aws.testdrive.openshift.com
node06.internal.aws.testdrive.openshift.com openshift_node_labels="{'region': 'apps'}" openshift_hostname=node06.internal.aws.testdrive.openshift.com openshift_public_hostname=node06.580763383722.aws.testdrive.openshift.com
...
Add the new nodes to DNS, then run the scale-up playbook:
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-node/scaleup.yml
Note: if the cluster resolves hostnames through the /etc/hosts file, restart dnsmasq on all nodes after adding the new entries; otherwise the playbook fails with the error "could not find csr for nodes".
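One way to restart dnsmasq on every node is an ad-hoc Ansible call against the existing inventory; the group name nodes is an assumption, use whatever group your inventory defines:

ansible nodes -i /etc/ansible/hosts -m service -a "name=dnsmasq state=restarted"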
2. OpenShift Metrics
...
[OSEv3:vars]
...
openshift_metrics_install_metrics=true
openshift_metrics_cassandra_storage_type=pv
openshift_metrics_cassandra_pvc_size=10Gi
openshift_metrics_hawkular_hostname=metrics.apps.580763383722.aws.testdrive.openshift.com
...
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml
3. OpenShift Logging
...
[OSEv3:vars]
...
openshift_logging_install_logging=true
openshift_logging_namespace=logging
openshift_logging_es_pvc_size=10Gi
openshift_logging_kibana_hostname=kibana.apps.580763383722.aws.testdrive.openshift.com
openshift_logging_public_master_url=https://kibana.apps.580763383722.aws.testdrive.openshift.com
...
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-logging.yml
4. OpenShift Multitenant Networking
os_sdn_network_plugin_name=redhat/openshift-ovs-multitenant
# net-proj.sh
#!/bin/bash
# create NetworkA, NetworkB projects
/usr/bin/oc new-project netproj-a
/usr/bin/oc new-project netproj-b

# deploy the DC definition into the projects
/usr/bin/oc create -f /opt/lab/support/ose.yaml -n netproj-a
/usr/bin/oc create -f /opt/lab/support/ose.yaml -n netproj-b
#ose.yaml
apiVersion: v1
kind: DeploymentConfig
metadata:
  name: ose
  labels:
    run: ose
spec:
  strategy:
    type: Rolling
    rollingParams:
      updatePeriodSeconds: 1
      intervalSeconds: 1
      timeoutSeconds: 600
      maxUnavailable: 25%
      maxSurge: 25%
    resources:
  triggers:
    - type: ConfigChange
  replicas: 1
  test: false
  selector:
    run: ose
  template:
    metadata:
      creationTimestamp: null
      labels:
        run: ose
    spec:
      containers:
        - name: ose
          image: 'registry.access.redhat.com/openshift3/ose:v3.5'
          command:
            - bash
            - '-c'
            - 'while true; do sleep 60; done'
          resources:
          terminationMessagePath: /dev/termination-log
          imagePullPolicy: IfNotPresent
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      securityContext:
#podbip.sh
#!/bin/bash
/usr/bin/oc get pod -n netproj-b $(oc get pod -n netproj-b | awk '/ose-/ {print $1}') -o jsonpath='{.status.podIP}{"\n"}'
Join the netproj-a network to the netproj-b network:
oc adm pod-network join-projects netproj-a --to=netproj-b
oc get netnamespace
Isolate the netproj-a network again:
oc adm pod-network isolate-projects netproj-a
oc get netnamespace
oc exec -n netproj-a $POD_A_NAME -- ping -c1 -W1 $POD_B_IP
5. Node management
Mark a node unschedulable (isolating it from the cluster):
oc adm manage-node node02.internal.aws.testdrive.openshift.com --schedulable=false
List the pods running on a given node:
oc adm manage-node node02.internal.aws.testdrive.openshift.com --list-pods
Evacuate the pods on a node, first as a dry run:
oc adm manage-node node02.internal.aws.testdrive.openshift.com --evacuate --dry-run
Then evacuate for real:
oc adm manage-node node02.internal.aws.testdrive.openshift.com --evacuate
Make the node schedulable again:
oc adm manage-node node02.internal.aws.testdrive.openshift.com --schedulable=true
Create a volume:
oc volume dc/file-uploader --add --name=my-shared-storage \
  -t pvc --claim-mode=ReadWriteMany --claim-size=5Gi \
  --claim-name=my-shared-storage --mount-path=/opt/app-root/src/uploaded
Increasing Storage Capacity in CNS
[...]
[cns]
node01.580763383722.aws.testdrive.openshift.com
node02.580763383722.aws.testdrive.openshift.com
node03.580763383722.aws.testdrive.openshift.com
node04.580763383722.aws.testdrive.openshift.com
node05.580763383722.aws.testdrive.openshift.com
node06.580763383722.aws.testdrive.openshift.com
[...]
ansible-playbook /opt/lab/support/configure-firewall.yaml
oc label node/node04.internal.aws.testdrive.openshift.com storagenode=glusterfs
oc label node/node05.internal.aws.testdrive.openshift.com storagenode=glusterfs
oc label node/node06.internal.aws.testdrive.openshift.com storagenode=glusterfs
export HEKETI_CLI_SERVER=http://heketi-container-native-storage.apps.580763383722.aws.testdrive.openshift.com
export HEKETI_CLI_USER=admin
export HEKETI_CLI_KEY=myS3cr3tpassw0rd
#/opt/lab/support/topology-extended.json
{
  "clusters": [
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["node01.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.1.30"]
            },
            "zone": 1
          },
          "devices": ["/dev/xvdd"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node02.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.3.130"]
            },
            "zone": 2
          },
          "devices": ["/dev/xvdd"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node03.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.4.150"]
            },
            "zone": 3
          },
          "devices": ["/dev/xvdd"]
        }
      ]
    },
    {
      "nodes": [
        {
          "node": {
            "hostnames": {
              "manage": ["node04.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.1.23"]
            },
            "zone": 1
          },
          "devices": ["/dev/xvdd"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node05.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.3.141"]
            },
            "zone": 2
          },
          "devices": ["/dev/xvdd"]
        },
        {
          "node": {
            "hostnames": {
              "manage": ["node06.internal.aws.testdrive.openshift.com"],
              "storage": ["10.0.4.234"]
            },
            "zone": 3
          },
          "devices": ["/dev/xvdd"]
        }
      ]
    }
  ]
}
heketi-cli topology load --json=/opt/lab/support/topology-extended.json
heketi-cli topology info   ## note the Cluster ID it reports
# /opt/lab/support/second-cns-storageclass.yaml
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: cns-silver
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi-container-native-storage.apps.580763383722.aws.testdrive.openshift.com"
  restauthenabled: "true"
  restuser: "admin"
  volumetype: "replicate:3"
  clusterid: "INSERT-CLUSTER-ID-HERE"
  secretNamespace: "default"
  secretName: "cns-secret"
Add a disk to an existing node:
# get the node IDs of the cluster (CLUSTERID)
heketi-cli node list | grep ca777ae0285ef6d8cd7237c862bd591c
# add the device to the node (NODEID)
heketi-cli device add --name=/dev/xvde --node=33e0045354db4be29b18728cbe817605
Remove a faulty disk. First inspect the node (NODEID):
heketi-cli node info 33e0045354db4be29b18728cbe817605
The output looks like this:
Node Id: 33e0045354db4be29b18728cbe817605
State: online
Cluster Id: ca777ae0285ef6d8cd7237c862bd591c
Zone: 1
Management Hostname: node04.internal.aws.testdrive.openshift.com
Storage Hostname: 10.0.1.23
Devices:
Id:01c94798bf6b1af87974573b420c4dff   Name:/dev/xvdd   State:online   Size (GiB):9   Used (GiB):1   Free (GiB):8
Id:da91a2f1c9f62d9916831de18cc09952   Name:/dev/xvde   State:online   Size (GiB):9   Used (GiB):1   Free (GiB):8
Then disable the device:
heketi-cli device disable 01c94798bf6b1af87974573b420c4dff
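heketi can also migrate the data off a disabled device and drop it from the topology; a sketch with the same device ID (verify the exact sequence against your heketi-cli version before running it):

heketi-cli device disable 01c94798bf6b1af87974573b420c4dff   # stop placing new bricks on the device
heketi-cli device remove 01c94798bf6b1af87974573b420c4dff    # migrate existing bricks off it
heketi-cli device delete 01c94798bf6b1af87974573b420c4dff    # remove it from the topology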
6. Adding a volume to the registry component
oc volume dc/docker-registry --add --name=registry-storage -t pvc \
  --claim-mode=ReadWriteMany --claim-size=5Gi \
  --claim-name=registry-storage --overwrite
7. Changing the image of a DeploymentConfig
oc patch dc nginx -p '{"spec":{"template":{"spec":{"containers":[{"name":"nginx","image":"harbor.apps.example.com/public/nginx:1.14"}]}}}}'
8. Granting project A permission to pull ImageStreams from project B
oc policy add-role-to-user system:image-puller system:serviceaccount:A:default -n B
9. Granting Jenkins permission to manage resources in project A
oc policy add-role-to-user edit system:serviceaccount:jenkins:jenkins -n A
10. Manual etcd maintenance
export ETCDCTL_API=3
etcdctl --cacert=/etc/origin/master/master.etcd-ca.crt --cert=/etc/origin/master/master.etcd-client.crt --key=/etc/origin/master/master.etcd-client.key --endpoints=https://master1.os10.openshift.com:2379,https://master2.os10.openshift.com:2379,https://master3.os10.openshift.com:2379 endpoint health
ETCDCTL_API=3 etcdctl --cacert=/etc/origin/master/master.etcd-ca.crt --cert=/etc/origin/master/master.etcd-client.crt --key=/etc/origin/master/master.etcd-client.key --endpoints=https://master1.os10.openshift.com:2379,https://master2.os10.openshift.com:2379,https://master3.os10.openshift.com:2379 get / --prefix --keys-only
ETCDCTL_API=3 etcdctl --cacert=/etc/origin/master/master.etcd-ca.crt --cert=/etc/origin/master/master.etcd-client.crt --key=/etc/origin/master/master.etcd-client.key --endpoints=https://master1.os10.openshift.com:2379,https://master2.os10.openshift.com:2379,https://master3.os10.openshift.com:2379 del /kubernetes.io/pods/bookinfo/nginx-4-bkdb4
11. Running a task from an image
--restart=Always (the default): creates a DeploymentConfig
--restart=OnFailure: creates a Job (in practice it shows up as a plain Pod)
--restart=OnFailure --schedule="0/5 * * * *": creates a CronJob (a scheduled example follows below)
--restart=Never: creates a standalone Pod
oc run nginx -it --rm --image=nginx --restart=OnFailure ls
oc run nginx -it --rm --image=nginx --restart=OnFailure bash
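For the scheduled variant, a minimal sketch (image, name and schedule are placeholders; older oc clients may not support --schedule):

oc run nginx-cron --image=nginx --restart=OnFailure --schedule="0/5 * * * *" -- date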
12. Cleaning up container storage on a host
When docker-storage uses the devicemapper storage driver you may hit this error: devmapper: Thin Pool has 162394 free data blocks which is less than minimum required 163840 free data blocks. Create more free space in thin pool or use dm.min_free_space option to change behavior. When that happens, clean up storage on the container host as follows:
# remove exited containers:
exited_containers=$(docker ps -q -f status=exited); if [ "$exited_containers" != "" ]; then docker rm $exited_containers; fi
# remove dangling volumes:
dangling_volumes=$(docker volume ls -qf dangling=true); if [ "$dangling_volumes" != "" ]; then docker volume rm $dangling_volumes; fi
# remove dangling images:
dangling_images=$(docker images --filter "dangling=true" -q --no-trunc); if [ "$dangling_images" != "" ]; then docker rmi $dangling_images; fi
Reference: http://www.cnblogs.com/mhc-fly/p/9324425.html
You can also run prune under the individual subcommands, which removes only that kind of resource:
$ docker container prune -f
$ docker volume prune -f
$ docker image prune -f
13. Reserving memory and CPU on nodes
Edit /etc/origin/node/node-config.yaml:
kubeletArguments:
  system-reserved:
  - cpu=200m
  - memory=1G
  kube-reserved:
  - cpu=200m
  - memory=1G
14. Showing only a DC's image name with oc get
[root@master]$ oc get dc test-app --template={{range .spec.template.spec.containers}}{{.image}}{{end}}
registry.example.com/test/test-app:1.13
Get the image of the first container of the first DC:
[root@master]$ oc get dc --template='{{with $dc:=(index .items 0)}}{{with $container:=(index $dc.spec.template.spec.containers 0)}}{{$container.image}}{{"\n"}}{{end}}{{end}}'
Or use -o jsonpath:
[root@master]$ oc get dc -o jsonpath='{range .items[*]}{range .spec.template.spec.containers[*]}{.image}{"\n"}{end}{end}'
[root@master]$ oc get dc -o jsonpath='{.items[0].spec.template.spec.containers[0].image}{"\n"}'
15. Making the OpenShift web console work with a private image registry
Create a certificate for the private registry:
[root@registry ~]# mkdir /etc/crts/ && cd /etc/crts
[root@registry ~]# openssl req \
  -newkey rsa:2048 -nodes -keyout example.com.key \
  -x509 -days 365 -out example.com.crt -subj \
  "/C=CN/ST=GD/L=SZ/O=Global Security/OU=IT Department/CN=*.example.com"
Copy the registry's CA file into the /etc/pki/ca-trust/source/anchors/ directory on the registry server.
Configure TLS in the registry. For docker-distribution, edit /etc/docker-distribution/registry/config.yml:
http:
  addr: :443
  tls:
    certificate: /etc/crts/example.com.crt
    key: /etc/crts/example.com.key
Then restart docker-distribution:
[root@registry ~]# systemctl daemon-reload && systemctl restart docker-distribution && systemctl enable docker-distribution
Run update-ca-trust extract on the registry server.
Copy the registry's CA file into the /etc/pki/ca-trust/source/anchors/ directory on every OpenShift node.
Run update-ca-trust extract on every OpenShift node.
16. Letting Docker authenticate a private registry over TLS
Under /etc/docker/certs.d, create a directory named after the registry's domain; for a registry at example.harbor.com:
$ mkdir -p /etc/docker/certs.d/example.harbor.com
Then copy the registry's CA file into that directory.
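To verify the setup, a pull from the registry should now succeed without marking it insecure; the CA file name and image path below are only examples:

cp ca.crt /etc/docker/certs.d/example.harbor.com/ca.crt   # registry CA into the per-registry directory
docker pull example.harbor.com/public/nginx:latest        # should no longer fail with an x509 error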
17. Inspecting etcd data
etcdctl --cert-file=/etc/origin/master/master.etcd-client.crt --key-file /etc/origin/master/master.etcd-client.key --ca-file /etc/origin/master/master.etcd-ca.crt --endpoints="https://master1.os10.openshift.example.com:2379,https://master2.os10.openshift.example.com:2379,https://master3.os10.openshift.example.com:2379"
export ETCDCTL_API=3
etcdctl --cacert=/etc/origin/master/master.etcd-ca.crt --cert=/etc/origin/master/master.etcd-client.crt --key=/etc/origin/master/master.etcd-client.key --endpoints=https://master1.os10.openshift.example.com:2379,https://master2.os10.openshift.example.com:2379,https://master3.os10.openshift.example.com:2379 endpoint health
ETCDCTL_API=3 etcdctl --cacert=/etc/origin/master/master.etcd-ca.crt --cert=/etc/origin/master/master.etcd-client.crt --key=/etc/origin/master/master.etcd-client.key --endpoints=https://master1.os10.openshift.example.com:2379,https://master2.os10.openshift.example.com:2379,https://master3.os10.openshift.example.com:2379 get / --prefix --keys-only
18. Summing the CPU/memory limits of all pods in a project
## sum of the pods' CPU limits
data=$(pods=`oc get pod|awk '{print $1}'|grep -v NAME`; for pod in $pods; do oc get pod $pod --template={{range .spec.containers}}{{.resources.limits.cpu}}{{println}}{{end}}; done); i=0; for j in $(echo $data); do i=$(($i+$j)); done; echo $i;
## sum of the pods' memory limits
data=$(pods=`oc get pod|awk '{print $1}'|grep -v NAME`; for pod in $pods; do oc get pod $pod --template={{range .spec.containers}}{{.resources.limits.memory}}{{println}}{{end}}; done); i=0; for j in $(echo $data); do mj=$(echo $j|cut -dG -f1); i=$(($i+$mj)); done; echo $i;
19. dnsmasq fails to start with: DBus error: Connection ":1.180" is not allowed to own the service "uk.org.thekelleys.dnsmasq"
$ cat /etc/dbus-1/system.d/dnsmasq.conf
<!DOCTYPE busconfig PUBLIC "-//freedesktop//DTD D-BUS Bus Configuration 1.0//EN" "http://www.freedesktop.org/standards/dbus/1.0/busconfig.dtd">
<busconfig>
  <policy user="root">
    <allow own="uk.org.thekelleys.dnsmasq"/>
    <allow send_destination="uk.org.thekelleys.dnsmasq"/>
  </policy>
  <policy context="default">
    <allow own="uk.org.thekelleys.dnsmasq"/>
    <allow send_destination="uk.org.thekelleys.dnsmasq"/>
  </policy>
</busconfig>
$ systemctl daemon-reload
$ systemctl restart dbus
$ systemctl restart dnsmasq
20. SSH is very slow and hangs at debug1: pledge: network
Restart systemd-logind:
$ systemctl restart systemd-logind
If it hangs at authentication instead, set StrictHostKeyChecking to no on the SSH client:
$ cat /etc/ssh/ssh_config
Host *
    GSSAPIAuthentication no
    StrictHostKeyChecking no
21. Pruning the internal image registry
$ cat > /usr/bin/cleanregistry.sh <<EOF
#!/bin/bash
oc login -u admin -p password
oc adm prune builds --orphans --keep-complete=25 --keep-failed=5 --keep-younger-than=60m --confirm
oc adm prune deployments --orphans --keep-complete=25 --keep-failed=10 --keep-younger-than=60m --confirm
# oc rollout latest docker-registry -n default
# sleep 20
oc adm prune images --keep-younger-than=400m --confirm
EOF
$ crontab -l
0 0 * * * /usr/bin/cleanregistry.sh >> /var/log/cleanregistry.log 2>&1
22. Overriding the entrypoint with docker run
$ docker run --entrypoint="/bin/bash" --rm -it xhuaustc/nginx-openshift-router:1.15
23. Mirroring an image with oc image mirror
$ oc image mirror myregistry.com/myimage:latest docker.io/myrepository/myimage:stable --insecure=true
24. Opening a port in the firewall
# vi /etc/sysconfig/iptables
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT
# systemctl restart iptables
25. Checking a certificate's validity period
$ openssl x509 -noout -text -in ca.crt | grep Validity -A2
Validity
    Not Before: Sep  7 08:48:13 2018 GMT
    Not After : Sep  6 08:48:14 2020 GMT
26. Marking a node unschedulable
Option 1:
$ oc adm cordon $nodename
Option 2:
$ oc adm manage-node --schedulable=false $nodename
27. Evacuating the pods on a node
$ oc adm manage-node --evacuate $nodename
28. Service DNS names
Normally a Service's DNS name has the form service-name.project-name.svc.cluster.local and resolves to the Service's cluster IP.
If you set the Service's clusterIP to None (a headless Service), the backing pods need a subdomain field (for a StatefulSet, the serviceName field). The name service-name.project-name.svc.cluster.local then resolves to the IPs of the backing pods, and each pod additionally gets its own DNS record of the form pod-name.service-name.project-name.svc.cluster.local.
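A small sketch of a headless Service plus a StatefulSet that uses it, illustrating the fields mentioned above (all names are invented for the example; on older 3.x clusters the StatefulSet apiVersion may be apps/v1beta1):

oc apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  clusterIP: None          # headless: the name resolves to the pod IPs
  selector:
    app: web
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web         # gives each pod a web-N.web.<project>.svc.cluster.local record
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
EOF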
29. Viewing the commands in a Docker image's build history
docker history ${IMAGE_NAME_OR_ID} -H --no-trunc | awk -F"[ ]{3,}" '{print $3}' | sed -n -e "s#/bin/sh -c##g" -e "s/#(nop) //g" -e '2,$p' | sed '1!G;h;$!d'
For example, to see how the mysql:5.6.41 image was built:
$ docker history mysql:5.6.41 -H --no-trunc | awk -F"[ ]{3,}" '{$1="";$2="";$(NF-1)="";print $0}' | sed -n -e "s#/bin/sh -c##g" -e "s/#(nop) //g" -e '2,$p' | sed '1!G;h;$!d'
#(nop) ADD file:f8f26d117bc4a9289b7cd7447ca36e1a70b11701c63d949ef35ff9c16e190e50 in /
CMD ["bash"]
groupadd -r mysql && useradd -r -g mysql mysql
apt-get update && apt-get install -y --no-install-recommends gnupg dirmngr && rm -rf /var/lib/apt/lists/*
ENV GOSU_VERSION=1.7
set -x && apt-get update && apt-get install -y --no-install-recommends ca-certificates wget && rm -rf /var/lib/apt/lists/* && wget -O /usr/local/bin/gosu "https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg --print-architecture)" && wget -O /usr/local/bin/gosu.asc "https://github.com/tianon/gosu/releases/download/$GOSU_VERSION/gosu-$(dpkg --print-architecture).asc" && export GNUPGHOME="$(mktemp -d)" && gpg --keyserver ha.pool.sks-keyservers.net --recv-keys B42F6819007F00F88E364FD4036A9C25BF357DD4 && gpg --batch --verify /usr/local/bin/gosu.asc /usr/local/bin/gosu && gpgconf --kill all && rm -rf "$GNUPGHOME" /usr/local/bin/gosu.asc && chmod +x /usr/local/bin/gosu && gosu nobody true && apt-get purge -y --auto-remove ca-certificates wget
mkdir /docker-entrypoint-initdb.d
apt-get update && apt-get install -y --no-install-recommends pwgen perl && rm -rf /var/lib/apt/lists/*
set -ex; key='A4A9406876FCBD3C456770C88C718D3B5072E1F5'; export GNUPGHOME="$(mktemp -d)"; gpg --keyserver ha.pool.sks-keyservers.net --recv-keys "$key"; gpg --export "$key" > /etc/apt/trusted.gpg.d/mysql.gpg; gpgconf --kill all; rm -rf "$GNUPGHOME"; apt-key list > /dev/null
ENV MYSQL_MAJOR=5.6
ENV MYSQL_VERSION=5.6.41-1debian9
echo "deb http://repo.mysql.com/apt/debian/ stretch mysql-${MYSQL_MAJOR}" > /etc/apt/sources.list.d/mysql.list
{ echo mysql-community-server mysql-community-server/data-dir select ''; echo mysql-community-server mysql-community-server/root-pass password ''; echo mysql-community-server mysql-community-server/re-root-pass password ''; echo mysql-community-server mysql-community-server/remove-test-db select false; } | debconf-set-selections && apt-get update && apt-get install -y mysql-server="${MYSQL_VERSION}" && rm -rf /var/lib/apt/lists/* && rm -rf /var/lib/mysql && mkdir -p /var/lib/mysql /var/run/mysqld && chown -R mysql:mysql /var/lib/mysql /var/run/mysqld && chmod 777 /var/run/mysqld && find /etc/mysql/ -name '*.cnf' -print0 | xargs -0 grep -lZE '^(bind-address|log)' | xargs -rt -0 sed -Ei 's/^(bind-address|log)/#&/' && echo '[mysqld]\nskip-host-cache\nskip-name-resolve' > /etc/mysql/conf.d/docker.cnf
VOLUME [/var/lib/mysql]
#(nop) COPY file:b79e447a4154d7150da6897e9bfdeac5eef0ebd39bb505803fdb0315c929d983 in /usr/local/bin/
ln -s usr/local/bin/docker-entrypoint.sh /entrypoint.sh # backwards compat
ENTRYPOINT ["docker-entrypoint.sh"]
EXPOSE 3306/tcp
CMD ["mysqld"]
30. After a build completes, pushing the image to the internal registry fails with the following error:
Pushing image docker-registry.default.svc:5000/apb/my-test-apb:latest ...
Pushed 0/15 layers, 0% complete
Registry server Address:
Registry server User Name: serviceaccount
Registry server Email: serviceaccount@example.org
Registry server Password: <<non-empty>>
error: build error: Failed to push image: unauthorized: unable to validate token
This is most likely because some change invalidated the registry's token while the registry itself was not restarted; restarting the registry pod restores pushes.
$ oc get pod -n default | grep docker-registry
docker-registry-1-8tjhk 1/1 Running 0 4m
$ oc delete pod `oc get pod -n default | grep docker-registry | awk '{print $1}'`
31. Giving the container user a username
During the image build, make /etc/passwd writable by the group the container runs with:
RUN chmod g=u /etc/passwd
Then set the username when the container starts, by adding this to the ENTRYPOINT/CMD script:
USER_NAME=${USER_NAME:-ocpuid}
USER_ID=$(id -u)
if ! whoami &> /dev/null; then
  if [ -w /etc/passwd ]; then
    echo "${USER_NAME}:x:${USER_ID}:0:${USER_NAME} user:${HOME}:/sbin/nologin" >> /etc/passwd
  fi
fi
exec "$@"
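A compact way to wire both pieces into an image build; the base image, script name and command below are assumptions, not taken from the original:

cat > Dockerfile <<'EOF'
FROM registry.access.redhat.com/rhel7
COPY uid_entrypoint /usr/local/bin/uid_entrypoint
# make /etc/passwd group-writable so an arbitrary UID can register itself at start-up
RUN chmod g=u /etc/passwd && chmod +x /usr/local/bin/uid_entrypoint
USER 1001
ENTRYPOINT ["/usr/local/bin/uid_entrypoint"]
CMD ["sleep", "infinity"]
EOF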
32. Upgrading Docker
The same approach applies to upgrading the other OpenShift components; the main steps are as follows.
Update the docker package in the yum repository:
$ cp docker-rpm/* ./extras/Packages/d/
$ createrepo --update extras
Drain the pods from the node and mark it unschedulable:
$ oc adm drain <node_name> --force --delete-local-data --ignore-daemonsets
Exclude the packages that must not be upgraded:
$ atomic-openshift-docker-excluder exclude
$ atomic-openshift-excluder exclude
Upgrade docker:
$ yum clean all
$ yum update docker
Restart the services, or reboot the host.
On master nodes:
$ systemctl restart docker
$ master-restart api
$ master-restart controllers
$ systemctl restart origin-node
On other nodes:
$ systemctl restart docker
$ systemctl restart origin-node
Finally, make the node schedulable again:
$ oc adm uncordon <node_name>
33. Example: obtaining a token and calling the OpenShift Ansible Service Broker
$ curl -k -H "Authorization: Bearer `oc serviceaccounts get-token asb-client`" https://$(oc get routes -n openshift-ansible-service-broker --no-headers | awk '{print $2}')/osb/v2/catalog
{
  "paths": [
    "/ansible-service-broker/",
    "/apis",
    "/healthz",
    "/healthz/ping",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/metrics"
  ]
}
34. Calling the OpenShift API to get pod information
$ oc get --raw /api/v1/namespaces/<namespace-name>/pods/<pod-name> | json_reformat
35. Mounting a local directory with HostPath
$ chcon -Rt svirt_sandbox_file_t /testHostPath
or
$ chcon -R unconfined_u:object_r:svirt_sandbox_file_t:s0 /testHostPath
or
$ semanage fcontext -a -t svirt_sandbox_file_t '/testHostPath(/.*)?'
$ restorecon -Rv /testHostPath
# verify the fcontext rule
semanage fcontext -l | grep testHostPath
# verify the file labels
ls -Z /testHostPath
# delete the configuration:
semanage fcontext -d '/testHostPath(/.*)?'
36. Script to save matching images to local tar files
$ docker images | grep redis | awk '{image=$1; gsub(/.*\//, "", $1); printf("docker save -o %s.tar %s:%s\n", $1, image, $2)}' | xargs -i bash -c "{}"
37. Docker logs show the error: container kill failed because of 'container not found' or 'no such process'
$ journalctl -r -u docker --since '1 day ago' --no-pager | grep -i error
$ systemctl restart docker
38. Listing the restart counts of all pods, sorted
$ oc get pod --sort-by='.status.containerStatuses[0].restartCount' --all-namespaces | sort -rn -k10
39. docker pull fails with: 400 unsupported docker v1 repository request
Add --disable-legacy-registry to the docker configuration:
$ cat /etc/sysconfig/docker
...
OPTIONS='... --disable-legacy-registry ...'
...
Cause: when the docker client requests an image through the v2 API and the image does not exist, the client falls back to the v1 API; if the registry does not support v1 requests, it returns this error.
40. Application logs cannot be viewed and oc exec cannot enter the container, with the error: Error from server: Get https://master.example.com:8443/containerLogs/namespace/pod-name/console: remote error: tls: internal error
Fix: check the CSRs and approve them manually.
$ oc get csr
$ oc get csr -o name | xargs oc adm certificate approve
41. NetworkManager manages DNS, so /etc/resolv.conf cannot simply be edited by hand
$ nmcli con show                                               # list all connections
$ nmcli con show <net-connect-name>                            # show connection details, including the DNS settings
$ nmcli con mod <net-connect-name> -ipv4.dns <dns-server-ip>   # remove a DNS server
$ nmcli con mod <net-connect-name> +ipv4.dns <dns-server-ip>   # add a DNS server
42. Checking the resource allocation of the cluster's compute nodes
$ nodes=$(oc get node --selector=node-role.kubernetes.io/compute=true --no-headers | awk '{print $1}'); for i in $nodes; do echo $i; oc describe node $i | grep Resource -A 3 | grep -v '\-\-\-'; done
node1
  Resource  Requests        Limits
  cpu       10445m (65%)    25770m (161%)
  memory    22406Mi (34%)   49224Mi (76%)
node2
  Resource  Requests        Limits
  cpu       8294m (51%)     25620m (160%)
  memory    18298Mi (28%)   48600Mi (75%)
43. During installation the master API cannot reach etcd
If the master hosts have multiple network interfaces, specify etcd_ip in /etc/ansible/hosts, as shown below:
[etcd]
master.example.com etcd_ip=10.1.2.3
Also make sure that the IP the etcd host's hostname resolves to is exactly the IP given in etcd_ip (a quick check sketch follows).
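A quick check on the etcd host; the hostname and address are the example values from the inventory snippet above:

getent hosts $(hostname -f)     # the resolved address should be exactly the etcd_ip (10.1.2.3 here)
ip addr | grep 10.1.2.3         # and that address should exist on one of the host's interfaces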
44. The master node has multiple NICs at install time: how to control masterIP
During installation, the masterIP written to master-config.yml is openshift.common.ip, which is taken from the node's default interface. One option is to edit the roles/openshift_facts/library/openshift_facts.py file to set this IP:
def get_defaults(self, roles):
    """ Get default fact values

        Args:
            roles (list): list of roles for this host

        Returns:
            dict: The generated default facts
    """
    defaults = {}
    ip_addr = self.system_facts['ansible_default_ipv4']['address']

    exit_code, output, _ = module.run_command(['hostname', '-f'])  # noqa: F405
    hostname_f = output.strip() if exit_code == 0 else ''
    hostname_values = [hostname_f, self.system_facts['ansible_nodename'],
                       self.system_facts['ansible_fqdn']]
    hostname = choose_hostname(hostname_values, ip_addr).lower()
    exit_code, output, _ = module.run_command(['hostname'])  # noqa: F405
    raw_hostname = output.strip() if exit_code == 0 else hostname

    defaults['common'] = dict(ip=ip_addr,
                              public_ip=ip_addr,
                              raw_hostname=raw_hostname,
                              hostname=hostname,
                              public_hostname=hostname,
Alternatively, make the target interface the default interface. OpenShift can also be configured through the inventory: set kubeletArguments.node-ip via openshift_node_groups, for example:
{'name': 'node-config-node1', 'labels': ['...,...'], 'edits': [{ 'key': 'kubeletArguments.node-ip','value': ['x.x.x.x']}]}
45. When deploying the cluster with custom certificates, the first master reports x509: certificate signed by unknown authority
Check whether the custom certificate file names in the ansible inventory hosts file clash with the names of OpenShift's default component certificates, such as ca.crt.
46. Network errors during deployment: check whether a default route is configured, and add one if it is missing
$ ip route
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 102
172.16.10.0/24 dev eth1 proto kernel scope link src 172.16.10.11 metric 101
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1
$ ip route add default via 172.16.10.1
47. Deleting files older than one month in a given directory
$ find /dir -type f -mtime +30 -exec rm -rf {} \;
48. Pods are evicted with: the node was low on resource ephemeral-storage
The pod ran short of ephemeral storage. Check the node's local disks, in particular the free space on the filesystem holding /var/lib/origin.
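Two things that usually help, sketched below: checking the backing filesystem on the node, and declaring ephemeral-storage requests/limits in the pod spec so eviction decisions have explicit numbers to work with (the sizes are made-up examples):

df -h /var/lib/origin            # free space on the node's local container/volume filesystem
# in the container spec of the pod/deployment, e.g.:
#   resources:
#     requests:
#       ephemeral-storage: 1Gi
#     limits:
#       ephemeral-storage: 2Gi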
49. Self-signed certificates
Create the root certificate:
$ openssl genrsa -out ca.key 2048
$ openssl req -new -x509 -days 36500 -key ca.key -out ca.crt -subj "/C=CN/ST=shanxi/L=taiyuan/O=cn/OU=test/CN=example.com"
Create a certificate and sign it with the root CA:
$ openssl genrsa -out app.key 2048
$ openssl req -new -key app.key -out app.csr
$ openssl x509 -req -in app.csr -CA ca.crt -CAkey ca.key -out app.crt -days 3650 -CAcreateserial
Inspect certificate information with openssl:
$ openssl x509 -in signed.crt -noout -dates
$ openssl x509 -in signed.crt -noout -subject
$ openssl x509 -in signed.crt -noout -text
50. One etcd member will not restart, logging: rafthttp: the clock difference against peer 27de23fad174dca is too high [1m16.89887s > 1s]
Check whether the etcd servers' clocks are in sync; after forcing a time synchronization, etcd recovers on its own.
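A sketch of forcing the synchronization with chrony, assuming chronyd is the time service on the etcd hosts (use the ntpd/ntpdate equivalents otherwise):

systemctl restart chronyd
chronyc -a makestep      # step the clock immediately instead of slewing slowly
chronyc tracking         # confirm the offset is back near zero before restarting etcd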
51. Viewing the last hour's warning Events
By default the cluster keeps Events for the most recent hour; use a field selector to filter out the normal ones:
$ oc get event --field-selector type=Warning --all-namespaces
52. Getting alerts from Alertmanager
$ oc exec -it alertmanager-main-0 -c alertmanager -n openshift-monitoring -- amtool alert query 'severity=critical' --alertmanager.url http://localhost:9093
53. Getting a pod's ordinal within a StatefulSet
[[ $(hostname) =~ -([0-9]+)$ ]] || exit
ordinal=${BASH_REMATCH[1]}
The variable ordinal holds the pod's index within the StatefulSet. This is typically used in an initContainer to give each pod its own initialization, and can be adapted as needed in production (see the sketch below).
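A minimal sketch of the pattern in an init script run by an initContainer; the file path and role names are invented for the example:

#!/bin/bash
# derive the ordinal from the pod's hostname, e.g. web-0 -> 0
[[ $(hostname) =~ -([0-9]+)$ ]] || exit 1
ordinal=${BASH_REMATCH[1]}
if [ "$ordinal" -eq 0 ]; then
  echo "role=primary" > /etc/podconfig/role     # pod -0 becomes the primary
else
  echo "role=replica" > /etc/podconfig/role     # all other ordinals become replicas
fi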
54. Deleting images from an image registry
The registry must have deletion enabled.
# curl -k -I -H "Accept: application/vnd.docker.distribution.manifest.v2+json" http://localhost:5000/v2/openshift/ocp-router/manifests/v3.11.129
(the response headers contain the manifest's sha256 digest)
# curl -X DELETE http://localhost:5000/v2/openshift/ocp-router/manifests/sha256:39ad17c3e10f902d8b098ee5128a87d4293b6d07cbc2d1e52ed9ddf0076e3cf9
# registry garbage-collect /etc/docker-distribution/registry/config.yml
55. NFS served from AIX: pods fail to mount with mount.nfs: Remote I/O error
By default the NFS client mounts over NFSv4. If the AIX NFS server does not support NFSv4, the mount fails with mount.nfs: Remote I/O error. You can check which versions the server supports with nfsstat -s. There are two fixes:
Reconfigure the NFS server so that it supports NFSv4.
Configure the PV to force NFSv3 via spec.mountOptions, for example:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0003
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  mountOptions:
  - hard
  - nfsvers=3
  nfs:
    path: /tmp
    server: 172.17.0.2
Alternatively, set it through the volume.beta.kubernetes.io/mount-options annotation:
oc patch pv pv0003 -p '{"metadata":{"annotations":{"volume.beta.kubernetes.io/mount-options":"rw,nfsvers=3"}}}'
Reference: https://kubernetes.io/zh/docs/concepts/storage/persistent-volumes/
56. Detaching a pod from its replication controller
In some situations a running pod needs to be taken out of the traffic flow. For example, during troubleshooting, restarting the application destroys the very state you want to inspect, so you want to restore service quickly while keeping the problem pod alive. The trick is labels: in Kubernetes/OpenShift the relationships between resources are built on labels, so removing a pod's labels turns it into a standalone pod; application rollouts no longer affect its lifecycle, and service traffic is no longer routed to it.
# oc label pod xxx-pod --list                  // list all labels on the pod
# oc label pod xxx-pod <LABEL-A>- <LABEL-B>-   // remove the associated labels
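Since the replication controller immediately starts a replacement for the detached pod, capacity stays the same; when the investigation is over, the orphaned pod is simply deleted (the pod name is the placeholder used above):

oc get pod              # the detached pod and its freshly created replacement are both listed
oc delete pod xxx-pod   # clean up the orphaned pod once you are done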
57. A node goes NotReady and its status shows Unknown
Check the CSRs for any stuck in Pending, and approve them:
$ oc get csr -o name | xargs oc adm certificate approve
or
$ kubectl get csr -o name | xargs kubectl certificate approve
58. No space left on device, although df -h shows plenty of free space
Check the inodes as well:
$ df -ih
Filesystem  Inodes  IUsed  IFree  IUse%  Mounted on
/dev/sdb    16M     502K   16M    4%     /
If IUse% has reached 100%, no new files can be created. The fix is to remove large numbers of small files, such as old logs, with rm -rf.
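A quick way to see where the inodes went, i.e. which directories contain the most files; the starting path /var is only an example:

for d in /var/*; do echo "$(find "$d" -xdev 2>/dev/null | wc -l) $d"; done | sort -rn | head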