Troubleshooting
For any problem with DOK or with its services and components, first check the logs and events carefully. This is the most basic way to troubleshoot. You can use the dok longAnalyze command provided by DOK to analyze the logs, or search them for keywords such as Error and Warn.
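The keyword search can also be done by hand. The following is a minimal sketch, not part of DOK: the function name scan_dok_log is hypothetical, and the default path /tmp/dok.log is the log file named later in this document.

```shell
# Print log lines that mention an error or a warning, prefixed with their line number.
# scan_dok_log is a hypothetical helper; pass another file as the first argument if needed.
scan_dok_log() {
  local log_file="${1:-/tmp/dok.log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # -n: show line numbers, -i: ignore case, -E: extended regular expression
  grep -niE 'error|warn' "$log_file"
}
```

Call it as scan_dok_log, or as scan_dok_log /path/to/other.log. It returns a non-zero status when no line matches, which is how grep signals "nothing found".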
Some apps used by DOK have requirements on system software. The most common problem is that the iscsi software used by Longhorn is not installed successfully. Such failures have many possible causes, such as system software dependency conflicts or version conflicts, and they are hard to resolve in general. DOK cannot guarantee a successful installation on every type of machine and system, so when a problem occurs you need to analyze it case by case.
You may find that Longhorn has not been created successfully, so Minio, Prometheus, Harbor, n9e, and other apps that rely on Longhorn for storage cannot be created either.
You can see that DOK reports timeout errors when it installs the various apps.
Here you can see that DOK ran into problems installing iscsi.
Generally, if iscsi runs into a problem, you need to execute the following command on all worker nodes. The command is not guaranteed to succeed; the result depends on the software already on the machine, the packages available in the intranet yum source, and so on. If you run into further problems, more analysis is required.
yum install -y bash curl iscsi-initiator-utils util-linux-ng
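To confirm on a node that the installation worked, a quick check such as the following may help. This is a sketch under one assumption: that the iscsiadm binary shipped by iscsi-initiator-utils being on the PATH is a good enough signal. The function name check_iscsi is hypothetical.

```shell
# Report whether the iSCSI initiator tooling is present on this node.
# check_iscsi is a hypothetical helper; it only looks for the iscsiadm binary on the PATH.
check_iscsi() {
  if command -v iscsiadm >/dev/null 2>&1; then
    echo "iscsiadm found: $(command -v iscsiadm)"
    return 0
  fi
  echo "iscsiadm not found; iscsi-initiator-utils is probably not installed" >&2
  return 1
}
```

Run check_iscsi on each worker node after the yum command finishes; a non-zero status means that node still needs attention.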
After installing iscsi, you need to uninstall every app that failed to install, usually with helm uninstall. If PV/PVC objects are involved, you need to delete them manually, and then install the apps again. For related content, refer to the helm chapter of this document.
This is also a typical ssh problem. Check whether all machines in the cluster can be reached over ssh normally. In the figure below, the machine at the reported IP clearly has a problem.
This is a typical ssh command timeout error. Whether DOK can be re-run directly in this case is covered in the section “Can DOK Rerun Directly When an Error Is Encountered” in this article. If a process has already started, it is best to stop that process and then re-run DOK.
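The ssh check can be done for all machines in one pass. The sketch below is not part of DOK: check_ssh_hosts and the SSH_CMD override are hypothetical names, and because BatchMode=yes forbids password prompts it only verifies key-based (password-free) login.

```shell
# Try a non-interactive SSH login to each host and report the ones that fail.
# SSH_CMD is a hypothetical override used for testing; it defaults to the real ssh client.
check_ssh_hosts() {
  local ssh_cmd="${SSH_CMD:-ssh}"
  local failed=0
  local host
  for host in "$@"; do
    # BatchMode=yes: never prompt for a password; ConnectTimeout=5: give up after 5 seconds
    if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null >/dev/null 2>&1; then
      echo "ok      $host"
    else
      echo "FAILED  $host"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, check_ssh_hosts root@192.168.0.11 root@192.168.0.12 (placeholder addresses) prints one line per host and returns non-zero if any host could not be reached.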
You can use the command kubectl get node -o wide to check whether the first controlplane has been installed successfully. The figure below shows that the first controlplane has been deployed, so the error was probably caused by the second controlplane or by the installation of the workers. In that situation DOK does not support a direct re-run. If the first controlplane was not installed successfully, DOK can be re-run directly.
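The same decision can be read from the node list in a script. This is a sketch with assumptions: count_controlplanes is a hypothetical helper, and it assumes the default kubectl column order (NAME STATUS ROLES ...) with a role named control-plane or, on older clusters, master.

```shell
# Count nodes carrying a control-plane role, whatever their status.
# Reads `kubectl get node --no-headers` output on stdin; ROLES is the third column.
# count_controlplanes is a hypothetical helper, not a DOK command.
count_controlplanes() {
  awk '$3 ~ /control-plane|master/ { n++ } END { print n + 0 }'
}
```

Use it as kubectl get node --no-headers | count_controlplanes. A result of 0 corresponds to the case where DOK can be re-run directly; 1 or more means at least one controlplane already exists, so a direct re-run is not supported.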
If you want to rerun DOK when at least one controlplane has already been installed, first execute yes | kubeadm reset on each controlplane node and delete the DOK installation package on each node. After that you can rerun DOK directly.
If you don’t want to run DOK again, you have to pick up at the step where it stopped and complete the remaining deployment work manually, such as installing the network plug-in Flannel and the various Helm applications by hand.
# the operation on the worker node is just an example. The specific command will be given in the log output of dok
kubeadm join 127.0.0.1:8443 --token 0n05f6.hpdp60u0m7jxr9vb --discovery-token-ca-cert-hash sha256:37227e3bc98db919b156f74ac1b1691018b3202e530e2242051d31ae034bf985 --certificate-key 0037c3eedc51e918b671ce77beb3198a9716e11119389a1fffceca5f4e2aec73
# on master0
# manually install flannel
cd /root/dok-release/network/
kubectl apply -f kube-flannel.yml
# manually install various helm components, for example, manually install ingress-nginx
cd /root/dok-release/app
helm install ingress-nginx -n ingress-nginx --create-namespace ingress-nginx --wait -f ingress-nginx/dok-values.yaml
During execution, DOK tries to modify some kernel parameters to meet the needs of kubeadm, but the modification is not guaranteed to succeed (one possible reason is that the machine's operations team enforces measures that prevent kernel parameters from being changed). If an error is reported during installation, check the log, then either modify the kernel parameters on the affected host yourself or ask the operations team to configure them. The following figure shows that /proc/sys/net/bridge/bridge-nf-call-iptables cannot be set to 1.
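To see the current value on a host without waiting for the installer to fail, something like the following can be used. It is a sketch: check_kernel_param is a hypothetical helper, and the hint about the bridge netfilter module is an assumption about why the file might be absent.

```shell
# Check that a kernel parameter file holds the expected value.
# Defaults target the bridge-nf-call-iptables setting named in this document.
# check_kernel_param is a hypothetical helper, not a DOK command.
check_kernel_param() {
  local param_file="${1:-/proc/sys/net/bridge/bridge-nf-call-iptables}"
  local expected="${2:-1}"
  local actual
  if [ ! -r "$param_file" ]; then
    echo "missing: $param_file (the bridge netfilter module may not be loaded)" >&2
    return 2
  fi
  actual="$(cat "$param_file")"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $param_file = $actual"
  else
    echo "mismatch: $param_file = $actual, expected $expected" >&2
    return 1
  fi
}
```

If the check reports a mismatch, sysctl -w net.bridge.bridge-nf-call-iptables=1 is the usual way to set the value by hand, provided the host allows it.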
If an error is reported at the kubeadm level, you can check Troubleshooting kubeadm.
The node stays in the NotReady state, and the Kubelet log keeps reporting errors. Check the Containerd log. If it contains errors about CNI, check the two directories /opt/cni/bin/ and /etc/cni/net.d. If the CNI components are installed normally, just restart Containerd.
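The directory check can be scripted before restarting Containerd. This is a minimal sketch: cni_installed is a hypothetical helper, and "installed normally" is approximated here as "both directories exist and are not empty", which is an assumption.

```shell
# Succeed only if both CNI directories exist and contain at least one entry.
# cni_installed is a hypothetical helper; the default paths are the ones named in this document.
cni_installed() {
  local bin_dir="${1:-/opt/cni/bin}"
  local conf_dir="${2:-/etc/cni/net.d}"
  local dir
  for dir in "$bin_dir" "$conf_dir"; do
    if [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
      echo "CNI files missing or empty: $dir" >&2
      return 1
    fi
  done
  echo "CNI plugin binaries and configuration are present"
}
```

A typical use is cni_installed && systemctl restart containerd, so that the restart only happens when the files are in place.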
Grafana has no monitoring panel for components such as Minio. The ConfigMap may not have been created successfully. Check the DOK log to see whether the relevant configuration was generated.
Running kubectl-view-allocations reports an error because the plug-in does not support CentOS 7 by default (see kubectl-view-allocations#135). The program has not been recompiled to fix this.
kubeadm checks the host name. If the host name does not meet the rules, this step reports an error. After changing the host name, execute dok createCluster again.
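Host names can be checked before running DOK. The sketch below assumes the rule being enforced is the DNS-1123 subdomain format used for Kubernetes node names (lowercase letters, digits, '-' and '.'); this document does not state the exact rule, and valid_hostname is a hypothetical helper.

```shell
# Return 0 if the name looks like a DNS-1123 subdomain: lowercase letters, digits, '-' and '.',
# each dot-separated label starting and ending with a letter or digit, at most 253 characters.
# valid_hostname is a hypothetical helper; the rule itself is an assumption about kubeadm's check.
valid_hostname() {
  local name="$1"
  [ "${#name}" -le 253 ] || return 1
  printf '%s\n' "$name" | grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}
```

For example, valid_hostname "$(hostname)" || echo "rename this host first" flags names that contain uppercase letters or underscores.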
The log displayed by dok is written to both the terminal and the file /tmp/dok.log. Exiting the terminal that ran the dok command does not affect the process; you can continue to follow the log with tailf /tmp/dok.log. The dok process keeps executing in the background until it succeeds or exits with an error.
By default, DOK tries to install Nginx. The Nginx rpm package shipped with DOK is version 1.20.1. If Nginx cannot be installed normally, look for the cause in the log. One possibility is that Nginx is already installed on the machine with a version lower than 1.10, in which case upgrading is recommended.
By default, DOK performs a series of checks on the cluster machines. Modify the host configuration according to the error message. If you do not want the checks, you can add --noCheck.
Swap is not turned off on the host. Find the affected host in the error report and execute swapoff -a on that host.
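Whether swap is really off can be verified from /proc/swaps. This is a minimal sketch; swap_is_off is a hypothetical helper, and the optional file argument exists only so the function can be exercised against a sample file.

```shell
# Succeed if no swap area is active. /proc/swaps has one header line;
# every additional non-empty line describes an active swap area.
# swap_is_off is a hypothetical helper, not a DOK command.
swap_is_off() {
  local swaps_file="${1:-/proc/swaps}"
  local active
  if [ ! -r "$swaps_file" ]; then
    echo "cannot read $swaps_file" >&2
    return 2
  fi
  active="$(awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$swaps_file")"
  [ "$active" -eq 0 ]
}
```

Run swap_is_off after swapoff -a; a zero exit status means no swap area is active any more.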
By default, DOK performs a series of checks on the cluster machines. Correct the host configuration according to the error message. If you don’t want the checks, you can add --noCheck. Skipping the machine check is not recommended, because a cluster created on non-standard machines can run into various problems.
This happens because a wrong host password was entered. If DOK installs with a password, all hosts must use the same password.
In password-free mode, one point is easy to overlook: if master0 is also used as the execution machine, master0 also needs password-free ssh access to localhost.
If you do not fill in a password, DOK reads /root/.ssh/id_rsa by default. If password-free login has not been configured for the machines, an error is reported. You can either add --password or configure password-free login. Run dok createCluster -h to view the specific options.