TensorFlow on Docker: A Roundup of Build Issues

1. Where Docker stores its data

By default Docker stores everything under /var/lib/docker/; if the system disk is small, it fills up easily. To change the storage location you need to modify the startup script. On CentOS, edit /usr/lib/systemd/system/docker.service and add a -g flag as follows:

ExecStart=/usr/bin/dockerd-current \
          -g /mnt/disk1/docker_home \
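
After editing the unit file, reload systemd and restart the daemon so the new storage location takes effect:

systemctl daemon-reload
systemctl restart docker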

2. Opening the management port

For security reasons, port 2375 is not open by default. To use management tools such as docker-java, you need to open it. The method is the same as above: edit /usr/lib/systemd/system/docker.service and add the -H flags after the existing options:

    --userland-proxy-path=/usr/libexec/docker/docker-proxy-current \
    -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock   \
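
After restarting the daemon, you can verify it is listening; the Docker Engine API exposes a /version endpoint:

curl http://localhost:2375/version

Keep in mind that an unauthenticated TCP endpoint grants full control of the host, so only open it on a trusted network.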

3. Environment variable issues

Environment variables set via commit or passed in at run time are unreliable in one way or another, so they should be set with ENV in the Dockerfile. For example, TensorFlow needs HADOOP_HDFS_HOME, LD_LIBRARY_PATH, and CLASSPATH configured in order to read data from HDFS, but passing them with -e simply does not take effect; see the ENV lines in the full Dockerfile in section 5.

4. Installing the NVIDIA driver

If you want to use the GPU inside Docker, the host and the container must have the same CUDA and cuDNN versions installed. In addition, when starting the container you must map the GPU devices into it; the devices that need mapping are:

--device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia-uvm:/dev/nvidia-uvm
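
Put together, a typical docker run invocation looks like this (the image name is only a placeholder):

docker run -it \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidiactl:/dev/nvidiactl \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  my-tf-gpu-image /bin/bash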

There is one catch, though: after a host reboot these three device files are not created by default. The following script loads the kernel modules and creates the device nodes:

#!/bin/bash

/sbin/modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  NVDEVS=`lspci | grep -i NVIDIA`
  N3D=`echo "$NVDEVS" | grep "3D controller" | wc -l`
  NVGA=`echo "$NVDEVS" | grep "VGA compatible controller" | wc -l`

  N=`expr $N3D + $NVGA - 1`
  for i in `seq 0 $N`; do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

else
  exit 1
fi

/sbin/modprobe nvidia-uvm

if [ "$?" -eq 0 ]; then
  # Find out the major device number used by the nvidia-uvm driver
  D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`

  mknod -m 666 /dev/nvidia-uvm c $D 0
else
  exit 1
fi
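
To have this script run automatically at boot, one option is to hook it into rc.local (the script path here is illustrative, and this assumes your distribution still executes rc.local):

cp nvidia-devices.sh /usr/local/sbin/nvidia-devices.sh
chmod +x /usr/local/sbin/nvidia-devices.sh /etc/rc.d/rc.local
echo '/usr/local/sbin/nvidia-devices.sh' >> /etc/rc.d/rc.local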

5. The complete Dockerfile

FROM centos:7.3.1611

# Base packages: JDK for Hadoop, pip/python-devel for TensorFlow
RUN     yum update -y && \
        yum install -y java-1.8.0-openjdk-devel.x86_64 vim wget && \
        yum install -y epel-release && \
        yum install -y python-pip python-devel && \
        pip install --upgrade pip

ADD ./hadoop-2.7.2-1.2.8.tar.gz /usr/local

RUN     mkdir /install

COPY ./cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm /install
COPY ./cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm /install

RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-ga2-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda
RUN     rpm -i /install/cuda-repo-rhel7-8-0-local-cublas-performance-update-8.0.61-1.x86_64-rpm
RUN     yum -y install cuda-cublas-8-0

ADD ./cudnn-8.0-linux-x64-v6.0.tar.gz /install

RUN     cp /install/cuda/include/cudnn.h /usr/local/cuda/include/
RUN     cp -d /install/cuda/lib64/libcudnn* /usr/local/cuda/lib64/
RUN     chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*


ENV JAVA_HOME /etc/alternatives/java_sdk_1.8.0
ENV HADOOP_HOME /usr/local/hadoop-2.7.2-1.2.8
ENV HADOOP_HDFS_HOME $HADOOP_HOME
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:${JAVA_HOME}/jre/lib/amd64/server:$LD_LIBRARY_PATH
ENV PATH $JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
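
With the referenced archives placed next to the Dockerfile, build the image as usual (the tag is only an example):

docker build -t tf-hadoop-gpu:latest .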

Resolving Jersey 2.x vs. Jersey 1.x Conflicts: the UriBuilder Problem

Recently, while using docker-java to control Docker, I ran into a problem: docker-java depends on Jersey 2.x. Its Maven POM entry is:

<dependency>
    <groupId>com.github.docker-java</groupId>
    <artifactId>docker-java</artifactId>
    <version>3.0.13</version>
</dependency>

The Hadoop dependencies on our cluster, however, are all Jersey 1.x, so referencing both produced the following error:

Exception in thread "main" java.lang.AbstractMethodError: javax.ws.rs.core.UriBuilder.uri(Ljava/lang/String;)Ljavax/ws/rs/core/UriBuilder;
        at javax.ws.rs.core.UriBuilder.fromUri(UriBuilder.java:119)
        at org.glassfish.jersey.client.JerseyWebTarget.<init>(JerseyWebTarget.java:71)
        at org.glassfish.jersey.client.JerseyClient.target(JerseyClient.java:290)
        at org.glassfish.jersey.client.JerseyClient.target(JerseyClient.java:76)
        at com.github.dockerjava.jaxrs.JerseyDockerCmdExecFactory.init(JerseyDockerCmdExecFactory.java:237)
        at com.github.dockerjava.core.DockerClientImpl.withDockerCmdExecFactory(DockerClientImpl.java:161)
        at com.github.dockerjava.core.DockerClientBuilder.build(DockerClientBuilder.java:45)

This is clearly caused by a Jersey version mismatch. I took quite a few detours trying to fix it, including the maven-shade-plugin and excluding some jars, none of which worked.
In the end I went back to the error itself: an AbstractMethodError means the method invoked by fromUri exists only as an abstract declaration, with no implementation class on the classpath. In Jersey 2.x the declaration is:

public abstract UriBuilder uri(String var1);

Its implementation class is org.glassfish.jersey.uri.internal.JerseyUriBuilder. At this point the cause was clear: the implementation class was missing. So I added the corresponding dependencies to the POM:

        <dependency>
            <groupId>org.glassfish.jersey.core</groupId>
            <artifactId>jersey-server</artifactId>
            <version>2.23.1</version>
        </dependency>
        <dependency>
            <groupId>org.glassfish.jersey.containers</groupId>
            <artifactId>jersey-container-servlet-core</artifactId>
            <version>2.23.1</version>
        </dependency>

After adding them and repackaging, the problem was solved.

Download-Only for Various Linux Dependencies

1. Python package dependencies: pip download

pip download <package>
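
pip download can also place the package and all of its dependencies into a directory, which is convenient for offline installs; the package and version below are only an example:

pip download -d ./pkgs tensorflow==1.4.0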

2. Yum package dependencies: yum --downloadonly

First install the downloadonly plugin:

yum install yum-plugin-downloadonly

Usage:

yum install --downloadonly --downloaddir=<directory> <package>

To install all the downloaded packages locally in one go:

yum --nogpgcheck localinstall *.rpm

Install Nvidia CUDA on Aliyun ECS

1. First, confirm the machine actually has an NVIDIA GPU and that the OS version is supported. We chose CentOS 7, so there is no problem there:

lspci | grep -i nvidia

2. Install the CUDA toolkit
Download it from http://developer.nvidia.com/cuda-downloads and pick the matching platform. I chose CentOS 7, which corresponds to CUDA toolkit 9; this choice affects which driver is installed, more on that below.

rpm -i cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64.rpm
yum install cuda

This completes the CUDA installation.

3. Add CUDA to PATH and LD_LIBRARY_PATH

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
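
These exports only affect the current shell; to make them permanent, append them to a profile script, for example (the file name is arbitrary):

cat >> /etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
EOF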

4. Reboot the server, then verify:

[root@emr-worker-1 release]# nvidia-smi -L
GPU 0: Tesla M40 (UUID: GPU-5ac5582f-b7cb-225a-b698-2c1da3cb1646)

That completes the installation.

Aliyun EMR On-Call Problem Roundup

I joined the EMR team almost two months ago and have started taking on-call shifts, mainly solving problems users hit with the Aliyun EMR product. Unlike my Hadoop-admin days at Sina, where I only handled user problems in HDFS, YARN, OLAP, and streaming, I now need to understand and solve issues across the entire EMR product stack, or find the right colleague when I cannot.
This post records some of the issues that came up, along with their solutions:
1. The problem reported by the user:

Using Sqoop on an EMR cluster to import an entire MySQL database from RDS into Hive. The Hive configuration was changed so that hive.metastore.warehouse.dir points to OSS. Tables with large volumes of data now fail to import: the MySQL-to-HDFS step succeeds, but the HDFS-to-OSS step fails with the following error:
[error message screenshot]

The error clearly shows that the DistCp jar is missing from the classpath. It lives under Hadoop tools, so the first fix that came to mind was adding the corresponding jar to the exported CLASSPATH, but the job still failed. Reading through the code path, I found that Sqoop invokes the hive command, and at that point the CLASSPATH is defined by Hive itself. So the simplest fix is to copy the DistCp jar into Hive's lib directory, as sketched below.
The proper fix is to add the Hadoop tools jars to the CLASSPATH set in the Hive and YARN startup scripts.
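
A minimal sketch of the quick fix; the install paths below are illustrative and vary by distribution:

cp /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-distcp-*.jar /usr/lib/hive-current/lib/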

Machine Learning Notes: Linear Regression

I have recently started studying machine learning and will post summaries from time to time to consolidate what I learn.
Machine learning comes in two flavors, supervised and unsupervised:
1. Supervised learning has both input and output data; a model is built on top of them and then used for prediction.
2. Unsupervised learning has input data but no output data; it models the inputs directly.
There are many machine-learning algorithms; linear regression is usually the first one people meet and also the most basic. Here is a summary of linear regression, focusing on the concepts.

Overview

Linear regression is a supervised learning algorithm, and this kind of prediction method has been around for more than 200 years. I will only cover the simple single-variable model here.
Simple linear regression assumes

y=b0+b1*x

i.e. a linear relationship between the output Y and the input X, with b0 and b1 as parameters. The goal is to choose parameters that minimize the error on the training data. The error is measured by summing the squared differences between the actual outputs Y and the outputs predicted by the regression; the smaller the sum, the smaller the error:

J = Σ (Y_i - (b0 + b1*x_i))^2 / (2m)

where m is the number of training examples; dividing by 2m just makes the later derivations cleaner.

How to compute the parameters

  1. Ordinary least squares
    Given the training data and the model, there are two ways to compute the parameters. The first is ordinary least squares. It is documented in detail elsewhere, so I only reference its formula here:
    [figure: least-squares formulas]
    From the training data alone we can solve for the coefficients that minimize the error, and we take them as the best solution.
  2. Gradient descent
    The second method finds a local optimum by iterating repeatedly. The model is the same; define the error function:
    [figure: error function]
    where
    [figure: definitions]
    To approach the optimum we descend along the gradient, that is, we step in the direction of the negative gradient to obtain the next parameter values:
    [figure: gradient-descent update]
    The update formula above reduces to evaluating two partial derivatives; applying the rules of differentiation yields:
    [figure: gradient-descent formulas]
    So the next parameter values can be computed entirely from the training data; iteration stops once the stopping condition is met, yielding the parameters we want (the standard formulas behind these figures are reconstructed after this list).
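
Since the original figures did not survive, here is a reconstruction of the standard formulas they showed, in the notation above (with prediction \hat{y}_i = b_0 + b_1 x_i and learning rate \alpha):

\hat{b}_1 = \frac{\sum_{i=1}^{m}(x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^{m}(x_i-\bar{x})^2}, \qquad \hat{b}_0 = \bar{Y} - \hat{b}_1\bar{x}

b_0 \leftarrow b_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - Y_i), \qquad b_1 \leftarrow b_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - Y_i)\,x_i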

Redirecting System.out in Java

While developing tensorflow-on-yarn recently, I needed to set environment variables for the launched TensorFlow process, such as the full Hadoop CLASSPATH. On the Hadoop command line, running

hadoop classpath --glob

prints every classpath entry. Looking at the script, it invokes the org.apache.hadoop.util.Classpath class, so I figured I could call that class directly to obtain the CLASSPATH. The catch is that it prints straight to stdout, so System.out has to be redirected to capture the output.
The following code performs the redirection and captures the CLASSPATH:

        // Capture the classpath printed by Classpath.main() by temporarily
        // redirecting System.out to an in-memory buffer.
        PrintStream oldOut = System.out;
        ByteArrayOutputStream pipeOut = new ByteArrayOutputStream();
        System.setOut(new PrintStream(pipeOut));
        try {
            Classpath.main(new String[]{"--glob"});
        } finally {
            System.setOut(oldOut);   // restore stdout even if main() fails
        }
        String classpath = pipeOut.toString();
        // Join with the platform path separator (java.io.File.pathSeparator).
        env.put("CLASSPATH",
                System.getenv("CLASSPATH") + File.pathSeparator + classpath);

Using maven-shade-plugin to Resolve Jar Dependency Conflicts

When we submit a job to the cluster, the application's jars often conflict with jars on the Hadoop CLASSPATH, producing NoSuchMethodError and similar failures. The root cause is that Hadoop's launcher scripts set the CLASSPATH before the user jar starts; if some library versions there differ from the versions used at development time, the application loads the wrong version at run time, hence the NoSuchMethodError.
To solve this we can use maven-shade-plugin. Its key feature is relocating packages inside the jar to new names during packaging, remapping the references so that clashes with other versions are avoided. For example, if the org.codehaus.plexus.util package conflicts with the cluster's copy, we can relocate it to org.shaded.plexus.util:

<project>
  ...
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.0.0</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <relocations>
                <relocation>
                  <pattern>org.codehaus.plexus.util</pattern>
                  <shadedPattern>org.shaded.plexus.util</shadedPattern>
                  <excludes>
                    <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                    <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                  </excludes>
                </relocation>
              </relocations>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  ...
</project>

Maven Classifier Reference Issue

I was recently writing unit tests for the cluster, using MiniDFSCluster and MiniYARNCluster to spin up a simulated test cluster, and hit a problem when declaring the dependency in the POM. The usual declaration looks like this:

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-tests</artifactId>
            <version>2.7.2</version>
            <scope>test</scope>
        </dependency>

After adding it, the classes I needed were nowhere to be found. Comparing jars showed that Maven had pulled in hadoop-yarn-server-tests-2.7.2.jar, while what we need is hadoop-yarn-server-tests-2.7.2-tests.jar. It turns out a classifier is required: a classifier is a POM attribute that identifies an artifact produced alongside the project's main artifact. Changing the POM as follows, with the classifier set to tests, pulls in the right jar and solves the problem.

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-server-tests</artifactId>
            <version>2.7.2</version>
            <classifier>tests</classifier>
            <scope>test</scope>
        </dependency>

ResourceManager Dispatcher Slowed by a Synchronized RMStateStore Method

Recently I noticed that the AsyncDispatcher in our ResourceManager sometimes builds up a backlog of pending events.
Here are some logs:

2016-05-24 00:46:20,398 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 24000
2016-05-24 00:46:21,008 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 25000
2016-05-24 00:46:21,632 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 26000
2016-05-24 00:46:22,251 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 27000
2016-05-24 00:46:22,873 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 28000
2016-05-24 00:46:23,501 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 29000
2016-05-24 00:46:24,109 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Size of event-queue is 30000  

Normally the async dispatcher in the ResourceManager handles events quickly enough, but the log shows a serious backlog.
So we investigated the problem and ran jstack against the ResourceManager process while it was backed up. Here is the jstack output:

"AsyncDispatcher event handler" prio=10 tid=0x00007f4d6db10000 nid=0x5bca waiting for monitor entry [0x00007f4d3aa8c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.storeNewApplication(RMStateStore.java:375)
        - waiting to lock <0x00000003bae88af0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppNewlySavingTransition.transition(RMAppImpl.java:881)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppNewlySavingTransition.transition(RMAppImpl.java:872)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        - locked <0x0000000394cbae40> (a org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:645)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:82)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:690)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:674)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:662)

"AsyncDispatcher event handler" daemon prio=10 tid=0x00007f4d6d8f6000 nid=0x5c32 in Object.wait() [0x00007f4d3a183000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at org.apache.hadoop.hdfs.DFSOutputStream.waitForAckedSeqno(DFSOutputStream.java:2031)
        - locked <0x000000032bc7bd58> (a java.util.LinkedList)
        at org.apache.hadoop.hdfs.DFSOutputStream.flushInternal(DFSOutputStream.java:2015)
        at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2113)
        - locked <0x000000032bc7ba80> (a org.apache.hadoop.hdfs.DFSOutputStream)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.writeFile(FileSystemRMStateStore.java:528)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:329)
        - locked <0x00000003bae88af0> (a org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:625)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:770)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:765)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:662)

It seems the ResourceManager's async dispatcher is blocked by the storeNewApplication method of RMStateStore.
From the code we know there are two async dispatchers in the ResourceManager process. One is the main dispatcher for the whole ResourceManager, handling application submission, scheduling, and other work. The other is the dispatcher inside the RMStateStore; the role of the state store is explained in this [blog](https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html). Because the state store persists to HDFS or ZooKeeper, which is slow, it has its own dispatcher so that it does not stall the ResourceManager's main dispatcher.
Unfortunately, we use HDFS to back our RMStateStore. Deep inside the code:

  public synchronized void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationState appState =
        new ApplicationState(app.getSubmitTime(), app.getStartTime(), context,
          app.getUser());
    dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
  }

The main dispatcher handles the event that stores the new application's context; this method only forwards the event to the state-store dispatcher and returns immediately, yet it is synchronized. And in the FileSystemRMStateStore subclass, the code that stores the application to HDFS is as follows:

  @Override
  public synchronized void storeApplicationStateInternal(ApplicationId appId,
      ApplicationStateData appStateDataPB) throws Exception {
    String appIdStr = appId.toString();
    Path appDirPath = getAppDir(rmAppRoot, appIdStr);
    fs.mkdirs(appDirPath);
    Path nodeCreatePath = getNodePath(appDirPath, appIdStr);

    LOG.info("Storing info for app: " + appId + " at: " + nodeCreatePath);
    byte[] appStateData = appStateDataPB.getProto().toByteArray();
    try {
      // currently throw all exceptions. May need to respond differently for HA
      // based on whether we have lost the right to write to FS
      writeFile(nodeCreatePath, appStateData);
    } catch (Exception e) {
      LOG.info("Error storing info for app: " + appId, e);
      throw e;
    }
  }

This method is also synchronized, on the same object. So if the state-store dispatcher holds the lock while a slow HDFS write is in flight, the main dispatcher blocks waiting for the lock. The lock is pointless on the main dispatcher's path, so we can simply drop it; a sketch follows.
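
A minimal sketch of the change suggested above: since storeNewApplication only enqueues an event on the store's own dispatcher, the synchronized modifier can be removed from it (this mirrors the idea discussed in the JIRA below, not a verbatim patch):

  // No longer synchronized: the method merely enqueues an event and
  // returns, so it need not contend with storeApplicationStateInternal.
  public void storeNewApplication(RMApp app) {
    ApplicationSubmissionContext context = app
                                            .getApplicationSubmissionContext();
    assert context instanceof ApplicationSubmissionContextPBImpl;
    ApplicationState appState =
        new ApplicationState(app.getSubmitTime(), app.getStartTime(), context,
          app.getUser());
    dispatcher.getEventHandler().handle(new RMStateStoreAppEvent(appState));
  }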
There is a related JIRA for this problem: YARN-4398.