6. Singularity with GPUs: installing and using TensorFlow (part 2)
1. Search
docker search tensorflow
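2. Download the image
We recommend the Beijixing (北极星) cluster's Docker server, which is faster. Run the command on login06; the image is about 6 GB.
singularity build --no-https --sandbox tensorflow_gpu docker://bjxdockerfast:5000/tensorflow_gpu:latest
Or, from another login node:
singularity build --no-https --sandbox tensorflow_gpu docker://bjxdocker:5000/tensorflow_gpu:latest
Or pull from the official registry (not recommended here, because the official yum repositories cannot be used; you can test this yourself):
singularity build --sandbox tensorflow_gpu docker://tensorflow/tensorflow:latest-gpu
Available tags: https://hub.docker.com/r/tensorflow/tensorflow/tags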
3. Test on debuggpu01 or login12 (optional; only if you are interested)
ssh debuggpu01
1} Use all GPUs
singularity shell -B /appsnew/:/appsnew/ --nv --nvccli -w tensorflow_gpu
INFO: Setting 'NVIDIA_VISIBLE_DEVICES=all' to emulate legacy GPU binding.
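## --nv enables NVIDIA GPU support; --nvccli brings in the host NVIDIA environment so that it works with a writable (-w) container
2} Select specific GPUs
# First select GPUs 0, 1, and 3 (3-GPU node)
export NVIDIA_VISIBLE_DEVICES=0,1,3
singularity shell -B /appsnew/:/appsnew/ --contain --nv --nvccli -w tensorflow_gpu
--contain makes the container use only the specified GPUs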
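As a rough illustration of how the value of NVIDIA_VISIBLE_DEVICES maps to GPU indices, the following Python sketch parses such a value. The helper name parse_visible_devices is hypothetical, and it only handles "all", an empty value, and comma-separated integer indices; the NVIDIA runtime also accepts other forms (for example GPU UUIDs), which are not covered here.

```python
def parse_visible_devices(value):
    """Return "all", or a list of integer GPU indices.

    Hypothetical helper for illustration only: GPU UUIDs and other
    special values accepted by the NVIDIA runtime are not handled.
    """
    value = (value or "").strip()
    if value == "all":
        return "all"
    if value == "":
        return []
    # Split "0,1,3" into its parts and convert each one to an integer index.
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_visible_devices("0,1,3"))
```

Calling it with os.environ.get("NVIDIA_VISIBLE_DEVICES") inside the container would show which indices were requested before the shell was started.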
Test:
Apptainer> python
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
...
device_type: "GPU"
memory_limit: 14474280960
locality {
bus_id: 1
links {
}
}
incarnation: 13349913758992036690
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
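If all GPUs are bound, all of them are listed. If one GPU is bound, only that one is listed, for example NVIDIA_VISIBLE_DEVICES=0 (GPU 0); for two GPUs, use for example NVIDIA_VISIBLE_DEVICES=1,2 (GPUs 1 and 2).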
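To pick out the GPU model or compute capability from output like the above, a small parser can split the physical_device_desc string into fields. This is a minimal sketch that assumes the "key: value, key: value" layout seen in this output; the function name parse_device_desc is hypothetical and the string format is not a guaranteed interface.

```python
def parse_device_desc(desc):
    """Split a physical_device_desc string into a dict of its fields."""
    fields = {}
    for item in desc.split(", "):
        # Split on the first ": " only, so PCI bus ids such as
        # "0000:00:1e.0" stay in one piece.
        key, _, value = item.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


desc = "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
info = parse_device_desc(desc)
print(info["name"], info["compute capability"])  # prints "Tesla T4 7.5"
```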
4. Test on the cluster:
1} Create a Python script
Create a directory and enter it:
mkdir gputest
cd gputest
[gao_pkuhpc@login06 gputest]$ cat gpus.py
# coding: utf-8
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
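A variant of this script could use the public tf.config API instead of the internal device_lib module. The sketch below assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available; the function name list_gpu_names is hypothetical, and it returns an empty list when TensorFlow cannot be imported so that it can also be run outside the container.

```python
# coding: utf-8
def list_gpu_names():
    """Return the names of GPUs visible to TensorFlow, or [] without TensorFlow."""
    try:
        import tensorflow as tf  # assumption: TensorFlow >= 2.1 inside the container
    except ImportError:
        return []
    # Each entry is a PhysicalDevice whose name looks like "/physical_device:GPU:0".
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


print(list_gpu_names())
```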
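2} Create a job script that uses all GPUs (for GPU job scripts, see the Beijixing cluster manual)
Alternatively, generate a script first and then edit it: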
[gao_pkuhpc@login06 gputest]$ pkurun-g4c 1 4 sleep 1
Submitted batch job 16640868
[gao_pkuhpc@login06 gputest]$ ls
job.srp001448
[gao_pkuhpc@login06 gputest]$ vi job.srp001448
[gaog_pkuhpc@login06 gputest]$ pkurun-g4c 1 1 sleep 1
gpu=1 4
Submitted batch job 16641507
[gaog_pkuhpc@login06 gputest]$ vi job.srp00
job.srp001448 job.srp004604
[gaog_pkuhpc@login06 gputest]$ vi job.srp004604
#!/bin/bash
#SBATCH -J sle004604
#SBATCH -p gpu_4l
#SBATCH -N 1
#SBATCH -o sle004604_%j.out
#SBATCH -e sle004604_%j.err
#SBATCH --no-requeue
#SBATCH -A gaog_g1
#SBATCH --qos=gaogg4c
#SBATCH --gres=gpu:1
#SBATCH --overcommit
#SBATCH --mincpus=7
export NVIDIA_VISIBLE_DEVICES=0
singularity exec -B /appsnew/:/appsnew/ --contain --nv --nvccli -w tensorflow_gpu nvidia-smi
singularity exec -B /appsnew/:/appsnew/ --contain --nv --nvccli -w tensorflow_gpu python gpus.py
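When several commands have to be run in the same container, the command line can be assembled once and reused. The following sketch only builds and prints the command; it assumes that `singularity exec` is used to run a single command non-interactively, and the helper name build_exec_command is hypothetical. The bind path, flags, and image name are the ones used in this section.

```python
import shlex


def build_exec_command(image, command, bind="/appsnew/:/appsnew/"):
    """Build the argument list for running `command` inside the sandbox image."""
    return [
        "singularity", "exec",
        "-B", bind,                              # bind mount used in this section
        "--contain", "--nv", "--nvccli", "-w",   # same flags as the job script
        image,
        *command,
    ]


cmd = build_exec_command("tensorflow_gpu", ["python", "gpus.py"])
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) from a driver script on a node where singularity is installed.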
7. Run on login12:
singularity shell -B /appsnew/:/appsnew/ --nv --nvccli -w tensorflow_gpu
Apptainer> pip3 install torch