Troubleshooting

Overview

For any problem with DOK or with the services and components it installs, start by checking the logs and events carefully. This is the most basic way to troubleshoot. You can use the dok longAnalyze command provided by DOK to analyze the logs, or search them for keywords such as Error and Warn.
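
As a minimal sketch of the keyword search, the helper below prints every log line that mentions an error or a warning, together with its line number. The function name scan_log is hypothetical and not part of DOK; the path /tmp/dok.log in the example is the log location described later in this document.

```shell
# scan_log FILE: print the lines of FILE that contain "error" or "warn"
# (case-insensitive), each prefixed with its line number.
# Hypothetical helper, not part of DOK.
scan_log() {
    grep -n -i -E 'error|warn' "$1"
}

# Example: scan_log /tmp/dok.log
```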

error failed dependencies

Some apps installed by DOK depend on system software. The most common problem is that the iscsi software required by Longhorn fails to install. There are many possible causes, such as system package dependency conflicts or version conflicts, and they can be hard to resolve. DOK cannot guarantee that installation succeeds on every type of machine and system, so when a problem occurs you need to analyze it case by case.

When this happens, Longhorn may fail to be created, and the apps that rely on Longhorn for storage, such as Minio, Prometheus, Harbor, and n9e, cannot be created either.

While installing these apps, DOK reports timeout errors.

The log then shows that DOK encountered a problem installing iscsi.

Generally, if the iscsi installation fails, execute the following command on all worker nodes. The command is not guaranteed to succeed: the result depends on the software already on the machine and on the packages available in the intranet yum repository. If you encounter further problems, more analysis is required.

yum install -y bash curl iscsi-initiator-utils util-linux-ng

After installing iscsi, uninstall every app that failed to install, usually with helm uninstall. If PVs or PVCs are involved, delete them manually, and then install the apps again. For details, refer to the Helm chapter of this document.
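
The cleanup could be sketched as follows. The helper cleanup_release is hypothetical, and the release, namespace, and PVC names in the example are placeholders rather than values taken from DOK. By default the helper only prints the commands so that you can review them first; set DRY_RUN=0 to execute them.

```shell
# cleanup_release RELEASE NAMESPACE [PVC...]
# Prints (DRY_RUN=1, the default) or runs the commands that remove a failed
# Helm release and any leftover PVCs. Hypothetical helper, not part of DOK.
cleanup_release() (
    release=$1
    namespace=$2
    shift 2
    run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }
    run helm uninstall "$release" -n "$namespace"
    for pvc in "$@"; do
        run kubectl delete pvc "$pvc" -n "$namespace"
    done
)

# Example with placeholder names:
#   cleanup_release minio minio data-minio-0
```

To find candidates, helm list -A --failed lists the releases that Helm considers failed, and kubectl get pvc -A shows the remaining volume claims.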

Couldn’t establish a connection to the remote server 172.22.1.17

This is a typical SSH problem. Check whether every machine in the cluster can be reached over SSH. As shown in the figure below, the machine with that IP address clearly has a problem.
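
A quick way to test SSH access to several machines is sketched below. The helper check_ssh is hypothetical; it assumes login as root with key-based authentication, because BatchMode=yes disables password prompts. For password-based clusters, remove that option and answer the prompt manually.

```shell
# check_ssh HOST...: try a non-interactive SSH login to each host and report
# the result. Assumes root login with key-based authentication.
# Hypothetical helper, not part of DOK.
check_ssh() (
    for host in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$host" true 2>/dev/null; then
            echo "$host ok"
        else
            echo "$host FAILED"
        fi
    done
)

# Example (addresses are placeholders): check_ssh 172.22.1.16 172.22.1.17
```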

failed: wait: remote command exited without exit status or exit signal

This is a typical SSH command timeout error. To decide whether DOK can be re-run directly in this case, refer to the section “Can DOK run again if it encounters an error?” in this document. If a process has already started, it is best to stop that process before re-running DOK.

Can DOK run again if it encounters an error?

You can use the command kubectl get node -o wide to check whether the first controlplane has been installed successfully. The figure below shows a case in which the first controlplane has been deployed, so the error was probably caused by the second controlplane or by the installation of the workers. In this situation, DOK does not support re-running directly. If the first controlplane has not been installed successfully, DOK can be re-run directly.
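
The check can also be scripted. The sketch below counts the control-plane nodes that kubectl reports as Ready; the helper name is hypothetical, and it assumes kubectl's default column layout, in which the second column is STATUS and the third is ROLES.

```shell
# count_ready_controlplanes: read `kubectl get node --no-headers` output on
# stdin and print how many control-plane nodes have a status starting with
# "Ready". Column 2 is STATUS and column 3 is ROLES in the default output.
# Hypothetical helper, not part of DOK.
count_ready_controlplanes() {
    awk '$2 ~ /^Ready/ && $3 ~ /control-plane|master/ { n++ } END { print n + 0 }'
}

# Example: kubectl get node --no-headers | count_ready_controlplanes
```

A result of 0 corresponds to the case in which the first controlplane is not yet installed; any other value means at least one controlplane is up.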

If you want to rerun DOK when at least one controlplane has been installed, first execute yes | kubeadm reset on each controlplane node and delete the DOK installation package on each node. After that, you can rerun DOK directly.

If you do not want to rerun DOK, you have to pick up at the step where it failed and complete the remaining deployment work manually, for example by manually installing the network plug-in Flannel and the various Helm applications.

# On the worker node. This command is only an example; the exact command is printed in the dok log output.
kubeadm join 127.0.0.1:8443 --token 0n05f6.hpdp60u0m7jxr9vb --discovery-token-ca-cert-hash sha256:37227e3bc98db919b156f74ac1b1691018b3202e530e2242051d31ae034bf985 --certificate-key 0037c3eedc51e918b671ce77beb3198a9716e11119389a1fffceca5f4e2aec73
# on master0
# manually install flannel
cd /root/dok-release/network/
kubectl apply -f kube-flannel.yml
# manually install various helm components, for example, manually install ingress-nginx
cd /root/dok-release/app
helm install ingress-nginx -n ingress-nginx --create-namespace ingress-nginx --wait -f ingress-nginx/dok-values.yaml

/proc/sys/net/bridge/bridge-nf-call-iptables contents are not set to 1

During execution, DOK tries to modify some kernel parameters to meet the requirements of kubeadm, but the modification is not guaranteed to succeed (for example, the operations team responsible for the machine may enforce measures that prevent kernel parameters from being changed). If an error is reported during installation, check the log, and then either modify the kernel parameters on the affected host yourself or ask the operations team to configure them. The figure below shows a case in which /proc/sys/net/bridge/bridge-nf-call-iptables could not be set to 1.

If the error is reported by kubeadm itself, see Troubleshooting kubeadm.
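
To check the bridge parameter by hand, a sketch such as the following may help. The helper param_is_one is hypothetical. The commands in the example are standard Linux tools, but whether they succeed depends on the restrictions in place on the host.

```shell
# param_is_one FILE: succeed only if FILE is readable and contains exactly "1".
# Hypothetical helper for checking an entry under /proc/sys.
param_is_one() {
    [ -r "$1" ] && [ "$(cat "$1")" = "1" ]
}

# Example (run as root on the affected host):
#   param_is_one /proc/sys/net/bridge/bridge-nf-call-iptables || {
#       modprobe br_netfilter    # the /proc entry exists only while this module is loaded
#       sysctl -w net.bridge.bridge-nf-call-iptables=1
#   }
```

Note that sysctl -w changes the value only until the next reboot; a persistent setting is usually placed in a file under /etc/sysctl.d/.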

cni plugin not initialized

The node stays in the NotReady state, and the Kubelet log keeps reporting errors. Check the Containerd log. If it contains errors about CNI, inspect the two directories /opt/cni/bin/ and /etc/cni/net.d. If the CNI components are installed correctly, restarting Containerd is enough.
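
The directory check can be sketched as below. Both helper names are hypothetical, and the example assumes that Containerd runs as a systemd service named containerd.

```shell
# dir_has_files DIR: succeed if DIR exists and contains at least one entry.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# cni_installed [BIN_DIR] [CONF_DIR]: check that both CNI directories are
# populated. Defaults are the two directories named in this section.
# Hypothetical helper, not part of DOK.
cni_installed() {
    dir_has_files "${1:-/opt/cni/bin}" && dir_has_files "${2:-/etc/cni/net.d}"
}

# Example (run on the NotReady node; assumes a systemd unit named containerd):
#   journalctl -u containerd --no-pager | grep -i cni
#   cni_installed && systemctl restart containerd
```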

Error from server (NotFound): configmaps “extra-grafana-dashboard” not found

Grafana has no monitoring dashboards for components such as Minio. The ConfigMap may not have been created successfully. Check the DOK log to see whether the relevant configuration was generated.

error execution phase preflight: couldn’t validate the identity of the API Server

version `GLIBC_2.18' not found

Running kubectl-view-allocations reports this error because the plug-in does not support CentOS 7 by default (see kubectl-view-allocations#135). The program has not been recompiled to fix the problem.

nodeRegistration.name: Invalid value

kubeadm validates the host name. If the name does not comply with the rules, this step reports an error. Change the host name and then execute dok createCluster again.
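
To test a host name before rerunning, the sketch below applies the RFC 1123 subdomain rule, which is, to the best of our knowledge, what kubeadm enforces for node names. The helper valid_node_name is hypothetical; treat the kubeadm error text as the authority if the two disagree.

```shell
# valid_node_name NAME: succeed if NAME is a lowercase RFC 1123 subdomain:
# only lowercase letters, digits, '-' and '.', every dot-separated label
# starting and ending with a letter or digit, and at most 253 characters.
# Hypothetical helper, not part of DOK or kubeadm.
valid_node_name() {
    [ "${#1}" -le 253 ] &&
        printf '%s\n' "$1" |
        grep -q -E '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}

# Example:
#   valid_node_name "$(hostname)" || echo "host name needs to be changed"
#   hostnamectl set-hostname master0    # master0 is a placeholder name
```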

The terminal running dok exited unexpectedly

The log produced by dok is written to both the terminal and the file /tmp/dok.log, and closing the terminal that runs the dok command does not affect the process. You can continue to follow the log with tailf /tmp/dok.log (or tail -f /tmp/dok.log on systems without tailf). The dok process keeps running in the background until it succeeds or exits with an error.

nginx: [emerg] “log_format” directive is not allowed here

By default, DOK tries to install Nginx. The Nginx rpm package shipped with DOK is version 1.20.1. If Nginx cannot be installed, look for the cause in the log. Nginx may already be installed on the machine in a version lower than 1.10, in which case upgrading is recommended.
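
To find out whether an older Nginx is already present, a version comparison along these lines can be used. Both helper names are hypothetical, and the comparison relies on sort -V from GNU coreutils.

```shell
# version_lt A B: succeed if version A sorts strictly before version B.
# Relies on `sort -V` (GNU coreutils).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# nginx_version: print the installed Nginx version, for example "1.20.1".
# `nginx -v` writes a line such as "nginx version: nginx/1.20.1" to stderr.
nginx_version() {
    nginx -v 2>&1 | sed -n 's|^nginx version: nginx/\([0-9.]*\).*|\1|p'
}

# Example:
#   v=$(nginx_version) && version_lt "$v" 1.10 && echo "Nginx $v is older than 1.10"
```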

kernel version

By default, DOK performs a series of checks on the cluster machines. Correct the host configuration according to the error message. If you do not want the checks to run, add --noCheck.

swap haven’t been turn off

Swap is still enabled on a host. Find the affected host in the error report and execute swapoff -a on it.
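
Whether swap is active can be verified before and after with a sketch like the one below. The helper swap_active is hypothetical; it reads /proc/swaps, whose first line is a header.

```shell
# swap_active [FILE]: succeed if FILE (default /proc/swaps) lists at least one
# active swap device. The first line of /proc/swaps is a header and is skipped.
# Hypothetical helper, not part of DOK.
swap_active() {
    [ "$(tail -n +2 "${1:-/proc/swaps}" 2>/dev/null | grep -c .)" -gt 0 ]
}

# Example (run as root on the host named in the error report):
#   swap_active && swapoff -a
```

swapoff -a lasts only until the next reboot. If swap is configured in /etc/fstab, that entry typically needs to be commented out as well to keep swap disabled.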

Like the other machine checks that DOK performs by default, this check can be skipped by adding --noCheck. Skipping the machine checks is not recommended, because a cluster created on non-standard hosts can run into various problems.

SSH dial error

This error occurs because the host password was entered incorrectly. If DOK installs through password authentication, all hosts must have the same password.

In password-free mode, one point is easy to overlook: if master0 is also the machine that runs DOK, master0 must be able to log in to localhost without a password as well.

ssh key file read failed

If you do not supply a password, DOK reads /root/.ssh/id_rsa by default. If password-free login has not been configured on the machines, an error is reported. To fix this, either add --password or configure password-free login. Run dok createCluster -h to see the available options.
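
A minimal sketch for checking the key file and setting up password-free login follows. The helper key_readable is hypothetical, and the address in the example is a placeholder; ssh-keygen and ssh-copy-id are the standard OpenSSH tools.

```shell
# key_readable [FILE]: succeed if the private key FILE (default
# /root/.ssh/id_rsa, the path DOK reads) exists, is readable, and is not empty.
# Hypothetical helper, not part of DOK.
key_readable() {
    [ -r "${1:-/root/.ssh/id_rsa}" ] && [ -s "${1:-/root/.ssh/id_rsa}" ]
}

# Example: create a key if none exists, then copy the public key to each host
# (the address below is a placeholder; repeat for every machine in the cluster):
#   key_readable || ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa
#   ssh-copy-id -i /root/.ssh/id_rsa.pub root@172.22.1.17
```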