Press "Enter" to skip to content

Compiling TensorFlow Serving (GPU)


Background

 

The business team deployed a BERT model, so I needed to build a GPU version of the TFServing sidecar image.

I found an approach that builds entirely inside Docker, with nothing to install on the host: use the devel variant of the image.

 

The devel image

 

The devel variant of the TFServing image ships with everything needed to compile TensorFlow Serving (bazel, gcc, glibc, and so on), which makes it very large. Once the build is done, we copy the binary into the non-devel image and run it there.

 

Pull the images from Docker Hub: https://hub.docker.com/r/tensorflow/serving/tags?page=1&name=2.0

 

Pull these two:

 

tensorflow/serving   2.0.0-gpu           af288d8e0730        11 months ago       2.49GB
tensorflow/serving   2.0.0-devel-gpu     111028dae1da        11 months ago       11.8GB
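
The listing above is docker images output; the pull commands themselves are just:

docker pull tensorflow/serving:2.0.0-gpu
docker pull tensorflow/serving:2.0.0-devel-gpu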

 

Run

 

Start the container and get a shell in it:

 

docker run -itd --name tfs --network=host tensorflow/serving:2.0.0-devel-gpu /bin/bash
docker exec -it tfs /bin/bash

 

Modify the code inside the container, then compile:

 

bazel build -c opt --config=cuda //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures
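
A successful build leaves the binary under bazel-bin; a quick smoke test (assuming the --version flag, which is what the display_version branch shown later in main.cc handles):

bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --version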

 

Problems

 

The build still runs into assorted problems along the way; here are a few representative ones:

 

no such package

 

Several errors like the following show up during the build:

 

ERROR: /tensorflow-serving/tensorflow_serving/model_servers/BUILD:318:1: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it and referenced by '//tensorflow_serving/model_servers:server_lib'
ERROR: Analysis of target '//tensorflow_serving/model_servers:tensorflow_model_server' failed; build aborted: no such package '@grpc//': java.io.IOException: Error downloading [https://storage.googleapis.com/mirror.tensorflow.org/github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz, https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz] to /root/.cache/bazel/_bazel_root/e53bbb0b0da4e26d24b415310219b953/external/grpc/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz: Tried to reconnect at offset 5,847,203 but server didn't support it
INFO: Elapsed time: 1346.453s

 

Fix: retry a few times, or use one of the following two approaches.

 

Set up a file server on the host

 

Set one up with nginx:

 

vim /usr/local/etc/nginx/nginx.conf
http {
    autoindex on;
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    keepalive_timeout  65;
    server {
        listen       8001;
        server_name  127.0.0.1;
        location / {
            root   <your_path>;
            index  index.html index.htm;
        }
    }
}
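
One way to use it (a sketch, not something this build verified): download the archive that failed by hand into the nginx root, then point the corresponding url in the bazel workspace files at the local server (see the bazel file-server reference at the end).

# Hypothetical walkthrough using the gRPC archive from the error above;
# <your_path> is the nginx root from the config.
cd <your_path>
curl -LO https://github.com/grpc/grpc/archive/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz
# Check that nginx serves it:
curl -I http://127.0.0.1:8001/4566c2a29ebec0835643b972eb99f4306c4234a3.tar.gz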

 

Use the host machine's proxy

 

 

    1. Find the host's IP:

 

 

(base) ➜  bin ifconfig | grep "inet " | grep -v 127.0.0.1
	inet xxx.xxx.xxx.xxx netmask 0xfffffff0 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx
	inet xxx.xxx.xxx.xxx netmask 0xffffff00 broadcast xxx.xxx.xxx.xxx

 

 

    2. Set the proxy inside the container:

 

 

export ALL_PROXY='socks5://xxx.xxx.xxx.xxx:1080'

 

 

    3. Check that it took effect:

 

 

curl cip.cc
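
cip.cc prints the IP your request comes from, so with the proxy working it should show the proxy's egress IP rather than your own. One extra note that is an assumption on my part, not part of the original setup: some tools read only the HTTP-style proxy variables rather than ALL_PROXY, so it may help to export those as well:

# Assumption: the tools you run accept a socks5:// scheme here; if not,
# point these at an HTTP proxy instead.
export http_proxy='socks5://xxx.xxx.xxx.xxx:1080'
export https_proxy='socks5://xxx.xxx.xxx.xxx:1080'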

 

gcc: Internal error: Killed (program cc1)

 

Out of memory. Increase the memory given to the Docker VM:

 

Preferences -> Advanced

 

I had to raise it to 12 GB, with 2 GB of swap, before the build went through.
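
If you cannot spare more memory for Docker, reducing bazel's parallelism also lowers peak memory, since fewer cc1 processes run at once; a sketch (the --jobs value is illustrative, not what I used):

bazel build -c opt --config=cuda --jobs=4 \
  //tensorflow_serving/model_servers:tensorflow_model_server --verbose_failures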

 

can not be used when making a shared object; recompile with -fPIC

 

/usr/bin/ld: bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: relocation R_X86_64_32 against `.rodata' can not be used when making a shared object; recompile with -fPIC
bazel-out/k8-opt/bin/tensorflow_serving/model_servers/_objs/tensorflow_model_server/tensorflow_serving/model_servers/version.o: error adding symbols: Bad value
collect2: error: ld returned 1 exit status

 

I found an issue on GitHub: https://github.com/netfs/serving/commit/be7c70d779a39fad73a535185a4f4f991c1d859a
but the local code already contained that fix. What finally got it to compile was removing the version linkstamp:

 

Changes to the BUILD file:

 

cc_library(
    name = "tensorflow_model_server_main_lib",
    srcs = [
        "main.cc",
    ],
    #hdrs = [
    #    "version.h",
    #],
    #linkstamp = "version.cc",
    visibility = [
        ":tensorflow_model_server_custom_op_clients",
        "//tensorflow_serving:internal",
    ],
    deps = [
        ":server_lib",
        "@org_tensorflow//tensorflow/c:c_api",
        "@org_tensorflow//tensorflow/core:lib",
        "@org_tensorflow//tensorflow/core/platform/cloud:gcs_file_system",
        "@org_tensorflow//tensorflow/core/platform/hadoop:hadoop_file_system",
        "@org_tensorflow//tensorflow/core/platform/s3:s3_file_system",
    ],
)

 

Changes to main.cc:

 

//#include "tensorflow_serving/model_servers/version.h"
...
  if (display_version) {
    // version.h is gone, so print a hardcoded version string instead.
    std::cout << "TensorFlow ModelServer: " << "r1.12" << "\n"
              << "TensorFlow Library: " << TF_Version() << "\n";
    return 0;
  }

 

Saving the image

 

When the build is done, commit the container:

 

docker commit -a "xxx" -m "tfserving gpu build" b629d5936020 tensorflow/serving:2.0.0-devel-gpu-build
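
Since the point of the devel image is only to build, you can instead copy the binary into the slim 2.0.0-gpu image pulled earlier. A sketch, assuming the source tree sits at /tensorflow-serving in the devel container (as the error paths above suggest) and that the slim image keeps its server at /usr/bin/tensorflow_model_server:

# -L follows the bazel-bin symlink into bazel's output tree.
docker cp -L tfs:/tensorflow-serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server .

# Create (without starting) a container from the slim image, overwrite the
# binary, and commit the result as a new image.
docker create --name tfs-slim tensorflow/serving:2.0.0-gpu
docker cp tensorflow_model_server tfs-slim:/usr/bin/tensorflow_model_server
docker commit tfs-slim tensorflow/serving:2.0.0-gpu-custom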

 

To export and import the image:

 

docker save -o xxx.tar tensorflow/serving:mkl
docker load -i xxx.tar

 

Startup flags

 

sudo nvidia-docker run -p 8500:8500 \
  --mount type=bind,source=xxx/models,target=xxx \
  -t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu \
  --port=8500 --per_process_gpu_memory_fraction=0.5 \
  --enable_batching=true --model_name=east --model_base_path=/models/east_model &

 

What the flags mean:

-p 8500:8500: publish the 8500 gRPC port.
--mount type=bind,source=/your/local/model,target=/models: mount your exported model directory into /models inside the container; TensorFlow Serving looks for models under the container's /models directory.
-t --entrypoint=tensorflow_model_server tensorflow/serving:latest-gpu: with a non-devel image you cannot get a bash shell inside the container; --entrypoint lets you "indirectly" enter it by invoking tensorflow_model_server directly, which is what makes the flags that follow work. After it comes the image to use, tensorflow/serving:latest-gpu, which you can swap for any version you want.
--port=8500: serve gRPC on port 8500 (requires the entrypoint flag above; the same goes for the flags below).
--per_process_gpu_memory_fraction=0.5: the fraction of GPU memory the model may use, a value in [0, 1].
--enable_batching: enable batched inference, which improves GPU utilization.
--model_name: the model's name, as set when the model was exported.
--model_base_path: the model's path inside the container; the mount above maps everything to /models, and this narrows it to one model directory, e.g. /models/east_model serves the model under that folder.
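
A quick way to confirm the server came up (the container name/ID is whatever docker assigned; just read the last lines of the log):

docker logs <container_id> 2>&1 | tail
# The tail should report the model loading and the gRPC server
# listening on port 8500.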

Building with the tools in the repo

 

You can also build the TFServing image with the tooling that ships with the code:

 

Clone the code

 

git clone --recurse-submodules https://github.com/tensorflow/serving.git
cd serving
git checkout r2.0

 

Build the ModelServer

 

After modifying the code, build an optimized ModelServer.

 

CPU version:

 

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel .

 

If the machine has Intel's MKL library installed (reportedly faster than the open-source OpenBLAS), you can use:

 

docker build --pull -t $USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-mkl .

 

GPU version:

 

docker build --pull -t $USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.devel-gpu .

 

Whichever variant you picked, the process above produces the $USER/tensorflow-serving-devel image (tagged $USER/tensorflow-serving-devel-gpu for the GPU build).

 

Build TensorFlow Serving

 

Next, use the $USER/tensorflow-serving-devel image built above to build the actual TensorFlow Serving image.

 

CPU version:

 

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile .

 

For the MKL CPU version:

 

docker build -t $USER/tensorflow-serving \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel \
  -f tensorflow_serving/tools/docker/Dockerfile.mkl .

 

GPU version:

 

docker build -t $USER/tensorflow-serving-gpu \
  --build-arg TF_SERVING_BUILD_IMAGE=$USER/tensorflow-serving-devel-gpu \
  -f tensorflow_serving/tools/docker/Dockerfile.gpu .

 

References

 

File server: https://blog.csdn.net/qq_39567427/article/details/104877041

bazel: https://blog.gmem.cc/bazel-study-note

https://www.cnblogs.com/zjutzz/p/10305995.html

Pointing bazel at a file server for downloads: http://www.jeepxie.net/article/392509.html

Using the host's proxy inside a Docker container: https://arminli.com/blog/183

https://www.jianshu.com/p/01f0ee9086e2

fPIC: https://www.cnblogs.com/zl1991/p/11465111.html

http://webcache.googleusercontent.com/search?q=cache:ZulKFDzVupwJ:fancyerii.github.io/books/tfserving-docker/+&cd=4&hl=zh-CN&ct=clnk&gl=us
