2023-11-25 ks-installer Tracing

0. Background

Blog post: https://blog.hylstudio.cn/archives/1343

Feishu doc: https://paraparty.feishu.cn/docx/FTM7d1TIcoxL83xfRWzc34omnFe

The last step of installing k8s is particularly slow. I traced half of it before; today I'll finish the trace and figure out what is actually happening.

I didn't bother with layout here; for the best reading experience, read the Feishu doc. This copy is a backup.

1. Tracing Process

Prerequisites: kubectl, shell-operator, python, ansible, helm
From https://github.com/kubesphere/ks-installer/blob/20a6daa18adf10410a385b48ab2769e55d8bdee2/Dockerfile.complete#L8 we can see that the main entrypoint program comes from https://github.com/flant/shell-operator, a tool for running event-driven scripts inside a k8s cluster.
According to https://github.com/flant/shell-operator/blob/55ca7a92c873cccfdad0c7591048eaeb8cf0dd4b/docs/src/HOOKS.md?plain=1#L11, shell-operator by default recursively scans the files under the hooks directory and treats each file's standard output (when the file is invoked with --config) as its declaration of which events to listen for.
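To make the hook contract concrete, here is a minimal sketch of a shell-operator hook in python (an illustration of the mechanism, not code from ks-installer; the ConfigMap binding is a hypothetical example):

#!/usr/bin/env python3
# Minimal shell-operator hook sketch: shell-operator invokes each hook once
# with --config to learn its bindings, then re-invokes it when an event fires.
import json
import os
import sys

if '--config' in sys.argv:
    # Declare which events to listen for (hypothetical example binding).
    print(json.dumps({
        'configVersion': 'v1',
        'kubernetes': [{
            'name': 'watch-configmaps',
            'kind': 'ConfigMap',
            'executeHookOnEvent': ['Added', 'Modified'],
        }],
    }))
else:
    # On an event, shell-operator passes the binding contexts as JSON in the
    # file named by the BINDING_CONTEXT_PATH environment variable.
    with open(os.environ['BINDING_CONTEXT_PATH']) as f:
        contexts = json.load(f)
    for ctx in contexts:
        print('triggered by binding:', ctx.get('binding'))

installRunner.py follows this same two-phase pattern.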
From https://github.com/kubesphere/ks-installer/blob/20a6daa18adf10410a385b48ab2769e55d8bdee2/Dockerfile#L5 we know the files under controller/ are placed into /hooks and registered as hooks, among them installRunner.py.
https://github.com/kubesphere/ks-installer/blob/20a6daa18adf10410a385b48ab2769e55d8bdee2/controller/installRunner.py#L30C12-L30C32 shows that installRunner.py is triggered whenever a ClusterConfiguration is created or updated.
  • TODO: take a closer look at how the yaml in the trigger passes the KubeSphere version information along
https://github.com/kubesphere/ks-installer/blob/20a6daa18adf10410a385b48ab2769e55d8bdee2/controller/installRunner.py#L338C41-L338C53 shows the actual work is done by ansible, through https://ansible.readthedocs.io/projects/runner/en/stable/index.html (ansible-runner).
Docs for the run method: https://ansible.readthedocs.io/projects/runner/en/stable/ansible_runner/#ansible_runner.interface.run

verbosity (int) – Control how verbose the output of ansible-playbook is
_input (io.FileIO) – An optional file or file-like object for use as input in a streaming pipeline
_output (io.FileIO) – An optional file or file-like object for use as output in a streaming pipeline
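For orientation, a minimal sketch of driving ansible-runner with these parameters (the playbook path and private_data_dir are placeholders, not values from ks-installer):

import ansible_runner

# Run a playbook at maximum ansible verbosity (-vvvvv). quiet=True only
# suppresses echoing to our own stdout; the transcript is still written
# to <private_data_dir>/artifacts/<ident>/stdout.
r = ansible_runner.run(
    playbook='site.yaml',            # resolved against <private_data_dir>/project/
    private_data_dir='/tmp/runner',
    ident='demo',
    verbosity=5,
    quiet=True,
)
print(r.status, r.rc)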

  • TODO: find which variable in the ansible playbooks carries the KubeSphere version information
  • TODO: confirm the data flow of the version information from kubekey to the ansible playbooks
Following https://www.cnblogs.com/sylvia-liu/p/14933776.html, force-override the entrypoint to get a shell and modify the python:

docker pull harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker run -it --entrypoint /bin/bash harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
cp /hooks/kubesphere/installRunner.py /hooks/kubesphere/installRunner.py.bak
vi /hooks/kubesphere/installRunner.py

Inside the shell, modify the python script:

import io
ansible_log = io.FileIO('/home/kubesphere/ansible.log', 'w')
# added to the ansible_runner.run(...) call:
_output=ansible_log, verbosity=5

Commit a new image as v3.3.3:

docker commit 5274b25c35d5 harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.3
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.3

Provision the VMs:

./vm.sh -c 4 -m 8 -d 80 -p on k8s1
./vm.sh -c 2 -m 4 -d 40 -p on k8s2
./vm.sh -c 2 -m 4 -d 40 -p on k8s3

Assign domain names:

#dns config
192.168.2.206 k8s-control.hylstudio.local
192.168.2.206 k8s1.k8s.local
192.168.2.177 k8s2.k8s.local
192.168.2.203 k8s3.k8s.local

Force-editing the ClusterConfiguration makes kk report an error, so change it back to v3.3.2 and install first:

apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.3.2

Try replacing the v3.3.2 image content with v3.3.3's; back up v3.3.2 before replacing:

# Make a fake 3.3.2 to fool the version check
docker tag harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2 harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2-bak
docker image rm harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker tag harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.3 harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2

When the installation reaches the final step, enter the ks-installer pod:

namespace/kubesphere-system unchanged
serviceaccount/ks-installer unchanged
customresourcedefinition.apiextensions.k8s.io/clusterconfigurations.installer.kubesphere.io unchanged
clusterrole.rbac.authorization.k8s.io/ks-installer unchanged
clusterrolebinding.rbac.authorization.k8s.io/ks-installer unchanged
deployment.apps/ks-installer unchanged
clusterconfiguration.installer.kubesphere.io/ks-installer created
13:55:05 UTC success: [k8s1]
Please wait for the installation to complete: >>--->

Inspecting the python file shows the running image is not the one pulled from the local registry, yet describe shows both the address and the tag are correct, and the sha on harbor checks out too. Suspicion: some other logic at startup modifies the python file.
The relevant processes:

ks-installer-566ffb8f44-ml9gm:/kubesphere$ ps aux
PID   USER     TIME  COMMAND
    1 kubesphe  0:00 /shell-operator start
   56 kubesphe  0:06 python3 /hooks/kubesphere/installRunner.py
 2501 kubesphe  1:21 {ansible-playboo} /usr/local/bin/python /usr/local/bin/ansible-playbook -e @/kubespher
 4348 kubesphe  0:01 {ansible-playboo} /usr/local/bin/python /usr/local/bin/ansible-playbook -e @/kubespher
 5261 kubesphe  0:00 /bin/sh -c /usr/local/bin/python /home/kubesphere/.ansible/tmp/ansible-tmp-1700921243.
 5262 kubesphe  0:00 /usr/local/bin/python /home/kubesphere/.ansible/tmp/ansible-tmp-1700921243.7008557-434
 5263 kubesphe  0:00 /usr/local/bin/kubectl apply -f /kubesphere/kubesphere/ks-core/crds/iam.kubesphere.io_
 5287 kubesphe  0:00 bash
 5299 kubesphe  0:00 ps aux

The log shows the following:

Start installing monitoring
Start installing multicluster
Start installing openpitrix
Start installing network
**************************************************
Waiting for all tasks to be completed ...
task network status is successful (1/4)
task openpitrix status is successful (2/4)
task multicluster status is successful (3/4)

That leaves only monitoring to suspect.

But running it from outside showed the entrypoint was now wrong (docker commit had baked in the overridden /bin/bash entrypoint), so rebuild a test image from a Dockerfile instead of using docker commit.
Manually cp out the modified python file:

docker cp xxx:/hooks/kubesphere/installRunner.py .

Write the DockerFile:

FROM harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2-bak
RUN rm -rf /hooks/kubesphere/installRunner.py
COPY installRunner.py /hooks/kubesphere/

Build the new image; note: create a fresh folder so the docker build context stays small.

mkdir imgbuild
mv installRunner.py DockerFile imgbuild
cd imgbuild
docker build -f DockerFile -t harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.4 .
docker image rm harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker tag harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.4 harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.4
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2

Trying again reveals that kubekey touches the tags on harbor, so what ends up being used is not the replaced image.
Try applying the installer without the -a parameter, and watch whether the tag on harbor stays correct.
Manually delete the image on the node and retry. ansible.log had no effect; revisit that later.
stdout now really does contain the -vvvvv output; wait and watch for the final result.
There are some cleaner approaches:
  1. Before kubekey touches harbor, i.e. before the yaml is applied, manually crictl pull the correct image on each machine; an image that already exists locally will not be overwritten by the wrong remote one
  2. Install in two passes: first install k8s and manually crictl pull, then install kubesphere in the second pass
  3. Trace the kubekey code and bypass the check so the correct version number can be used to install directly
  4. In the end, the source at https://github.com/kubesphere/kubekey/blob/e755baf67198d565689d7207378174f429b508ba/cmd/kk/cmd/create/cluster.go#L141C43-L141C59 shows that --skip-push-images avoids having the harbor images overwritten
From the detailed output above, we can confirm it is not stuck in the ansible tasks called by the run method modified at the beginning:

Start installing monitoring
Start installing multicluster
Start installing openpitrix
Start installing network
**************************************************
Waiting for all tasks to be completed ...

From the log output, the actual tasks can be traced to https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/controller/installRunner.py#L277C10-L277C20
The component list:

readyToEnabledList = [
    'monitoring', 'multicluster', 'openpitrix', 'network']

From the code, what gets executed is https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/playbooks/monitoring.yaml
The instantiation call is at https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/controller/installRunner.py#L57
with parameters coming from https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/controller/installRunner.py#L261
Note the hardcoded quiet=True: https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/controller/installRunner.py#L266C3-L266C24
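Putting the pieces together, the launch-and-poll pattern behind the "Waiting for all tasks to be completed" output looks roughly like this (a sketch of the pattern as I understand it, not the verbatim ks-installer code):

import time

import ansible_runner

# Launch the four component playbooks in parallel, quiet like the real code.
runners = {}
for name in ['monitoring', 'multicluster', 'openpitrix', 'network']:
    thread, runner = ansible_runner.run_async(
        playbook=f'/kubesphere/playbooks/{name}.yaml',
        private_data_dir='/kubesphere',
        ident=name,
        quiet=True,  # the hardcoded flag that hides all progress
    )
    runners[name] = runner

# Poll until every task reports a terminal status.
done = set()
while len(done) < len(runners):
    for name, runner in runners.items():
        if name not in done and runner.status in ('successful', 'failed', 'timeout', 'canceled'):
            done.add(name)
            print(f'task {name} status is {runner.status} ({len(done)}/{len(runners)})')
    time.sleep(5)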
Next, look into why the monitoring install is so slow:
  1. Statically analyze the monitoring install flow
  2. Keep modifying this spot to add the same verbose parameter and inspect the output? Where does the log of an asynchronous ansible call end up?
  3. The _output parameter appears to have no effect; how to debug then? That parameter is meant for streaming pipelines
  4. status_handler looks usable (see the sketch below)
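A minimal sketch of option 4, wiring a status_handler into run_async (handler signature per the ansible-runner interface; the playbook path mirrors the process list later in these notes):

import ansible_runner

def on_status(status_data, runner_config=None):
    # Called on every status transition: starting -> running -> successful/failed.
    print('status changed:', status_data.get('status'))

thread, runner = ansible_runner.run_async(
    playbook='/kubesphere/playbooks/monitoring.yaml',
    private_data_dir='/kubesphere',
    ident='monitoring',
    status_handler=on_status,
    quiet=True,
)
thread.join()
print('final status:', runner.status)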
TODO: why does ks-installer run these 4 components through asynchronous ansible calls without emitting any intermediate progress?
Because quiet=True, nothing shows on stdout; figure out how far it has gotten by inspecting the processes inside the current ks-installer pod:

ks-installer-566ffb8f44-zft9h:/hooks/kubesphere$ ps aux|more
PID   USER     TIME  COMMAND
    1 kubesphe  0:00 /shell-operator start
   18 kubesphe  5:49 python3 /hooks/kubesphere/installRunner.py
 2171 kubesphe  0:00 bash
 4053 kubesphe  1:54 {ansible-playboo} /usr/local/bin/python /usr/local/bin/ansible-playbook -e @/kubesphere/config/ks-config.json -e @/kubesphere/config/ks-status.json -e @/kubesphere/results/env/extravars /kubesphere/playbooks/monitoring.yaml
 8876 kubesphe  0:00 {ansible-playboo} /usr/local/bin/python /usr/local/bin/ansible-playbook -e @/kubesphere/config/ks-config.json -e @/kubesphere/config/ks-status.json -e @/kubesphere/results/env/extravars /kubesphere/playbooks/monitoring.yaml
 8899 kubesphe  0:00 /bin/sh -c /usr/local/bin/python /home/kubesphere/.ansible/tmp/ansible-tmp-1700926548.0542264-8876-155728454178142/AnsiballZ_command.py && sleep 0
 8900 kubesphe  0:01 /usr/local/bin/python /home/kubesphere/.ansible/tmp/ansible-tmp-1700926548.0542264-8876-155728454178142/AnsiballZ_command.py
 8926 kubesphe  0:00 /usr/local/bin/kubectl apply -f /kubesphere/kubesphere/prometheus/alertmanager
 8957 kubesphe  0:00 ps aux
 8958 kubesphe  0:00 more

The monitoring tasks are defined in https://github.com/kubesphere/ks-installer/blob/ef79beead3285698cdce559dd5505c79fe11dbff/roles/ks-monitor/tasks/main.yaml
watch -n 1 'ps aux|more' shows the execution moving along; it is not actually stuck, just slow to install.
Per https://github.com/ansible/ansible-runner/blob/e81b02cae85f7c3e402fcb1cc1512da5ee3bcf35/src/ansible_runner/interface.py#L228C28-L228C28, the first element of the return value is the thread that runs asynchronously; modify the code to try redirecting its output to a log file.
Also change quiet = False:

def generateTaskLists():
    readyToEnabledList, readyToDisableList = getComponentLists()
    tasksDict = {}
    for taskName in readyToEnabledList:
        playbookPath = os.path.join(playbookBasePath, str(taskName) + '.yaml')
        artifactDir = os.path.join(privateDataDir, str(taskName))
        if os.path.exists(artifactDir):
            shutil.rmtree(artifactDir)
        tasksDict[str(taskName)] = component(
            playbook=playbookPath,
            private_data_dir=privateDataDir,
            artifact_dir=artifactDir,
            ident=str(taskName),
            quiet=False,
            rotate_artifacts=1
        )
    return tasksDict

def installRunner(self):
    installer = ansible_runner.run_async(
        playbook=self.playbook,
        private_data_dir=self.private_data_dir,
        artifact_dir=self.artifact_dir,
        ident=self.ident,
        quiet=self.quiet,
        rotate_artifacts=self.rotate_artifacts,
        verbosity=5
    )
    task_name = self.ident
    thread = installer[0]
    log_file = open('/tmp/' + task_name + '.debug.log', 'w')
    thread.stdout = log_file
    return installer[1]

The final diff of the changes:

--- a/installRunner.py.bak
+++ b/installRunner.py
@@ -90,8 +90,13 @@ class component():
             artifact_dir=self.artifact_dir,
             ident=self.ident,
             quiet=self.quiet,
-            rotate_artifacts=self.rotate_artifacts
+            rotate_artifacts=self.rotate_artifacts,
+            verbosity=5
         )
+        task_name = self.ident
+        thread = installer[0]
+        log_file = open('/tmp/'+task_name+'.debug.log', 'w')
+        thread.stdout = log_file
         return installer[1]
@@ -263,7 +268,7 @@ def generateTaskLists():
             private_data_dir=privateDataDir,
             artifact_dir=artifactDir,
             ident=str(taskName),
-            quiet=True,
+            quiet=False,
             rotate_artifacts=1
         )

Build the image with only the minimal modification:

mkdir imgbuild
mv installRunner.py DockerFile imgbuild
cd imgbuild
docker build -f DockerFile -t harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.7 .
docker image rm harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker tag harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.7 harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.7
docker push harbor.hylstudio.local/kubesphereio/ks-installer:v3.3.2
docker image ls --digests | grep installer

Force-delete the kubesphere monitoring namespace, then force-delete the ks-installer pod and the installer image on the host, so k8s has to recreate the pod; at that point it re-pulls according to the image tag on harbor.
harbor shows no imageId, because Image ID is a docker client-side concept; the registry simply doesn't track it
https://github.com/goharbor/harbor/issues/10293
https://github.com/goharbor/harbor/issues/2469
For the next install: after kubekey applies the installer, manually edit the image to the intended tag and leave everything else alone.
Setting thread.stdout = log_file produces no output. From https://github.com/ansible/ansible-runner/blob/e0371d634426dfbdb9d3bfacb20e2dd4b039b499/src/ansible_runner/runner.py#L155C28-L155C48
we can see that output is written to the stdout and stderr files under self.config.artifact_dir only when self.config.suppress_output_file is false; artifact_dir comes from /kubesphere/results/{task_name}, and both suppress_output_file and suppress_ansible_output default to False.
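Given that, the transcript should be readable straight from the artifact dir. A small sketch of following it (the path is assumed from the layout above, i.e. artifact_dir = /kubesphere/results/{task_name} plus the ident subdirectory):

import time

# ansible-runner streams the transcript into <artifact_dir>/<ident>/stdout
# as long as suppress_output_file stays False.
stdout_path = '/kubesphere/results/monitoring/monitoring/stdout'

# Poor man's `tail -f` to follow the install progress.
with open(stdout_path) as f:
    while True:
        line = f.readline()
        if line:
            print(line, end='')
        else:
            time.sleep(1)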
Add --skip-push-images, change quiet=False, and then check under /kubesphere/results for stdout and stderr:

%s/quiet=True/quiet=False/g

--- a/installRunner.py.bak
+++ b/installRunner.py
@@ -85,6 +85,7 @@ class component():
     def installRunner(self):
         installer = ansible_runner.run_async(
+            verbosity=5,
             playbook=self.playbook,
             private_data_dir=self.private_data_dir,
             artifact_dir=self.artifact_dir,
@@ -263,7 +264,7 @@ def generateTaskLists():
             private_data_dir=privateDataDir,
             artifact_dir=artifactDir,
             ident=str(taskName),
-            quiet=True,
+            quiet=False,
             rotate_artifacts=1
         )
@@ -341,6 +342,7 @@ def preInstallTasks():
     for task, paths in preInstallTasks.items():
         pretask = ansible_runner.run(
+            verbosity=5,
             playbook=paths[0],
             private_data_dir=privateDataDir,
             artifact_dir=paths[1],
@@ -353,11 +355,12 @@ def resultInfo(resultState=False, api=None):
     ks_config = ansible_runner.run(
+        verbosity=5,
         playbook=os.path.join(playbookBasePath, 'ks-config.yaml'),
         private_data_dir=privateDataDir,
         artifact_dir=os.path.join(privateDataDir, 'ks-config'),
         ident='ks-config',
-        quiet=True
+        quiet=False
     )
     if ks_config.rc != 0:
@@ -365,11 +368,12 @@ def resultInfo(resultState=False, api=None):
         exit()
     result = ansible_runner.run(
+        verbosity=5,
         playbook=os.path.join(playbookBasePath, 'result-info.yaml'),
         private_data_dir=privateDataDir,
         artifact_dir=os.path.join(privateDataDir, 'result-info'),
         ident='result',
-        quiet=True
+        quiet=False
     )
     if result.rc != 0:
@@ -380,6 +384,7 @@ def resultInfo(resultState=False, api=None):
     if "migration" in resource['status']['core'] and resource['status']['core']['migration'] and resultState == False:
         migration = ansible_runner.run(
+            verbosity=5,
             playbook=os.path.join(playbookBasePath, 'ks-migration.yaml'),
             private_data_dir=privateDataDir,
             artifact_dir=os.path.join(privateDataDir, 'ks-migration'),
@@ -395,11 +400,12 @@ def resultInfo(resultState=False, api=None):
     logging.info(info)
     telemeter = ansible_runner.run(
+        verbosity=5,
         playbook=os.path.join(playbookBasePath, 'telemetry.yaml'),
         private_data_dir=privateDataDir,
         artifact_dir=os.path.join(privateDataDir, 'telemetry'),
         ident='telemetry',
-        quiet=True
+        quiet=False
     )
     if telemeter.rc != 0:

Now the real-time logs of the currently running tasks can be seen under /kubesphere/results:

drwxr-xr-x    3 kubesphe kubesphe      4.0K Dec  2 13:37 common
drwxr-xr-x    1 kubesphe kubesphe      4.0K Feb  3  2023 env
drwxr-xr-x    3 kubesphe kubesphe      4.0K Dec  2 13:41 ks-core
drwxr-xr-x    3 kubesphe kubesphe      4.0K Dec  2 13:37 metrics_server
drwxr-xr-x    3 kubesphe kubesphe      4.0K Dec  2 13:37 preinstall

In summary: when kubekey prints "Please wait for the installation to complete", control has been handed over to ks-installer.

namespace/kubesphere-system unchanged
serviceaccount/ks-installer unchanged
customresourcedefinition.apiextensions.k8s.io/clusterconfigurations.installer.kubesphere.io unchanged
clusterrole.rbac.authorization.k8s.io/ks-installer unchanged
clusterrolebinding.rbac.authorization.k8s.io/ks-installer unchanged
deployment.apps/ks-installer unchanged
clusterconfiguration.installer.kubesphere.io/ks-installer created
13:35:26 UTC success: [k8s1]
Please wait for the installation to complete: >>--->

And when ks-installer prints the lines below, go to /kubesphere/results to keep tracking the status of the remaining 4 tasks running in parallel:

Start installing monitoring
Start installing multicluster
Start installing openpitrix
Start installing network
**************************************************
Waiting for all tasks to be completed ...

Going by the earlier pace, /kubesphere/results/monitoring/monitoring is the slowest, so watch it closely. Manual checks show the other three have already succeeded; only monitoring is still installing.
Surprising discovery: the install process even uses helm internally. Nesting upon nesting; the tech stack is remarkably complex.

PLAY RECAP *********************************************************************
localhost                  : ok=24   changed=22   unreachable=0    failed=0    skipped=24   rescued=0    ignored=0

task monitoring status is successful (4/4)

For a task as complex as monitoring, I don't know why upstream chose to hide the install process completely, with no progress indication whatsoever. Only after monitoring finished installing did the kubesphere-config apply below show up; until then the kubesphere web UI is unreachable, because this configmap contains the crucial jwt token for accessing the kubesphere api, and the containers keep erroring out while waiting for it to be initialized. With that, all of my earlier questions are answered.

changed: [localhost] => (item=kubesphere-config.yaml) => {
    "ansible_loop_var": "item",
    "changed": true,
    "cmd": "/usr/local/bin/kubectl apply -f /kubesphere/kubesphere/kubesphere-config.yaml",
    "delta": "0:00:07.874273",
    "end": "2023-12-02 14:09:16.591793",
    "failed_when_result": false,
    "invocation": {
        "module_args": {
            "_raw_params": "/usr/local/bin/kubectl apply -f /kubesphere/kubesphere/kubesphere-config.yaml",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "item": "kubesphere-config.yaml",
    "rc": 0,
    "start": "2023-12-02 14:09:08.717520",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "configmap/kubesphere-config created",
    "stdout_lines": [
        "configmap/kubesphere-config created"
    ]
}
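A quick way to confirm this configmap has landed, i.e. that the UI's prerequisite is satisfied (a sketch using the kubernetes python client; assumes kubeconfig access to the cluster and that the configmap lives in kubesphere-system, as the apply above suggests):

from kubernetes import client, config

# The UI waits for the kubesphere-config ConfigMap created by the step above.
config.load_kube_config()
v1 = client.CoreV1Api()
cm = v1.read_namespaced_config_map('kubesphere-config', 'kubesphere-system')
print(sorted(cm.data.keys()))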

TODO: collect all the event json files under the results folder and draw a scatter plot to study them (see the sketch below)
TODO: add a flag to kubekey that prints verbose output; once the flag is read, modifications like the code above would do it
TODO: monitoring should not be a hard dependency for starting the kubesphere UI; map out the dependencies and add a parameter to install monitoring asynchronously so kubesphere becomes usable sooner. The ansible scripts need reading to confirm. Per the logs, the telemetry install is already completely independent of the kubesphere install and can serve as a reference
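For the scatter-plot TODO, a starting sketch (assumes the per-event JSON layout ansible-runner leaves under each task's artifact dir, with a 'created' timestamp field per event):

import glob
import json
from datetime import datetime

import matplotlib.pyplot as plt

# Each runner task leaves one JSON file per event under
# <artifact_dir>/<ident>/job_events/; collect their timestamps per task.
points = []
for path in glob.glob('/kubesphere/results/*/*/job_events/*.json'):
    with open(path) as f:
        event = json.load(f)
    task = path.split('/')[3]  # e.g. 'monitoring'
    points.append((datetime.fromisoformat(event['created']), task))

tasks = sorted({t for _, t in points})
plt.scatter([c for c, _ in points], [tasks.index(t) for _, t in points], s=4)
plt.yticks(range(len(tasks)), tasks)
plt.xlabel('event time')
plt.savefig('events.png')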