Press "Enter" to skip to content

66.k8s运行tensorflow

本站内容均来自兴趣收集,如不慎侵害的您的相关权益,请留言告知,我们将尽快删除.谢谢.

1. 安装Nvidia显卡驱动

 

检查显卡

 

##安装依赖包
yum -y install pciutils gcc gcc-c++ wget kernel-devel kernel-headers
##检查Nvidia硬件
lspci | grep VGA
#出现类似以下的结果,如果服务器上没有NVIDIA显卡,就可以终止了
#04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

 

检查内核版本

 

uname -sr
#Linux 3.10.0-1160.6.1.el7.x86_64
ls /boot | grep vmlinuz
#vmlinuz-3.10.0-1160.6.1.el7.x86_64
rpm -aq |grep kernel-devel
#kernel-devel-3.10.0-1160.6.1.el7.x86_64

 

注意:保证内核版本和源码版本一样,否则安装报错!!!!!!

 

若内核和源码需要从FC官方网站上下载与内核版本对应的源码包进行安装.

 

可以在以下网站下载并安装:

http://rpmfind.net/linux/rpm2html/search.php?query=kernel-devel

下载显卡驱动

 

驱动下载https://www.nvidia.com/Download/index.aspx

 

勾选相应型号,并下载安装脚本

禁用自带的驱动

 

vi /lib/modprobe.d/dist-blacklist.conf
#blacklist nvidiafb
blacklist nouveau
options nouveau modeset=0

 

 

##重建 initramfs image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)
##切换到文本模式
systemctl set-default multi-user.target
##重启
init 6
##重启之后,检查是否禁用,如果没有显示相关的内容,说明已禁用
lsmod | grep nouveau

 

安装驱动

 

##安装下载的驱动
./NVIDIA-Linux-x86_64-455.45.01.run
##如果找不到kernel的源码,可以通过命令指定
#./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64  -k $(uname -r)
## 执行后,开始解压驱动包,进入安装步骤,可能中间会出现一些警告,但是不影响

 

 

 

 

 

 

安装完成!

 

检查驱动

 

# 查看GPU相关配置
nvidia-smi
# 如果正常显示安装的显卡信息,则说明驱动安装成功; 
# 如果提示找不到该指令,或什幺信息都没有显示,则驱动安装失败,可以卸载驱动后重新安装

 

 

2.安装nvidia-docker2

 

##删除老的nvidia-docker,安装nvidia-docker2
##docker volume ls -q -f driver=nvidia-docker |xargs -r -I{} -n1 docker ps -q -a -f volume={} |xargs -r docker rm -f
sudo yum remove nvidia-docker
distribution=$(. /etc/os-release;echo$ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo |\
sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
##Docker测试Nvidia驱动
docker run --rm --gpus all nvidia/cuda:11.1-devel-centos7 nvidia-smi

 

3.安装k8s-device-plugin

 

参考:https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes

 

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.2/nvidia-device-plugin.yml

 

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.2
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

 

gpu-pod.yaml 进行测试

 

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:11.1-devel-centos7
    - name: digits-container
      image: nvidia/digits:6.0

 

编写测试yaml,执行 kubectl create -f gpu-pod.yaml
发现pod调度到GPU服务器执行,并显示结果说明成功

 

4.k8s安装tensorflow

 

参考:https://blog.csdn.net/vah101/article/details/108098827

 

创建tensorflow.yaml

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it mangages
    matchLabels:
      app: tensorflow-gpu-jupyter
  template: # define the pods specifications
    metadata:
      labels:
        app: tensorflow-gpu-jupyter
    spec:
      containers:
      - name: tensorflow-gpu-jupyter
        image: tensorflow/tensorflow:latest-gpu-jupyter
        #resources:
        #  limits:
        #    nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
 name: tensorflow-gpu-jupyter
 labels:
  app: tensorflow-gpu-jupyter
spec:
 type: NodePort
 ports:
 - port: 8888
   targetPort: 8888
   nodePort: 30999
 selector:
  app: tensorflow-gpu-jupyter

 

执行 kubectl create -f tensorflow.yaml
等待服务正常启动,就可以访问30999端口的jupyter服务了.

Be First to Comment

发表回复

您的电子邮箱地址不会被公开。 必填项已用*标注