Queries
Query the system
|
(base) root@ubuntu:/home/ubuntu# hostnamectl
Static hostname: ubuntu
Icon name: computer-desktop
Chassis: desktop 🖥️
Machine ID: xxx
Boot ID: xxx
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-1012-oem
Architecture: x86-64
Hardware Vendor: Dell Inc.
Hardware Model: Dell Pro Max Tower T2 FCT2250
Firmware Version: 1.8.1
Firmware Date: Fri 2025-08-15
Firmware Age: 1month 2w 4d
|
Query the CPU
|
(base) root@ubuntu:/home/ubuntu# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: Intel(R) Core(TM) Ultra 9 285K
BIOS Model name: Intel(R) Core(TM) Ultra 9 285K CPU @ 5.3GHz
BIOS CPU family: 775
CPU family: 6
Model: 198
Thread(s) per core: 1
Core(s) per socket: 24
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 31%
CPU max MHz: 6500.0000
CPU min MHz: 800.0000
BogoMIPS: 7372.80
...
|
Query memory
|
(base) root@ubuntu:/home/ubuntu# free -h
total used free shared buff/cache available
Mem: 125Gi 4.4Gi 117Gi 129Mi 4.7Gi 120Gi
Swap: 8.0Gi 0B 8.0Gi
|
Query disk usage
|
(base) root@ubuntu:/home/ubuntu# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 13G 2.6M 13G 1% /run
/dev/nvme0n1p2 3.6T 71G 3.4T 3% /
tmpfs 63G 6.6M 63G 1% /dev/shm
tmpfs 5.0M 12K 5.0M 1% /run/lock
efivarfs 438K 271K 163K 63% /sys/firmware/efi/efivars
/dev/nvme0n1p1 1.1G 6.2M 1.1G 1% /boot/efi
tmpfs 13G 192K 13G 1% /run/user/1000
|
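The listing above can also be filtered automatically. The following is a minimal sketch, assuming a hypothetical helper named `over_threshold` and a 60% limit chosen only for illustration. It reads `df -P` style output on stdin and prints every mount point at or above the limit.

```shell
# over_threshold LIMIT: read `df -P` output on stdin and print
# "mountpoint use%" for every filesystem whose use% is >= LIMIT.
# The helper name and the default limit are illustrative assumptions.
over_threshold() {
    awk -v limit="${1:-60}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use on a live system:
#   df -P | over_threshold 60
```

Mount points that contain spaces are not handled by this sketch.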
Query the GPU
|
(base) root@ubuntu:/home/ubuntu# nvidia-smi
Thu Oct 2 20:44:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.172.08 Driver Version: 570.172.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:02:00.0 On | N/A |
| 0% 53C P8 38W / 575W | 433MiB / 32607MiB | 4% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 2062 G /usr/lib/xorg/Xorg 263MiB |
| 0 N/A N/A 2338 G /usr/bin/gnome-shell 32MiB |
| 0 N/A N/A 2581 G .../sunloginclient --cmd=autorun 22MiB |
| 0 N/A N/A 2911 G ...me/58.0.3029.81 Safari/537.36 6MiB |
| 0 N/A N/A 3015 G ...267E0D65798C2E81534660824D3E7 16MiB |
| 0 N/A N/A 3124 G ...exec/xdg-desktop-portal-gnome 8MiB |
+-----------------------------------------------------------------------------------------+
|
Install drivers
Update the system
|
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git wget curl software-properties-common
|
Install the NVIDIA driver
Automatically detect and install the recommended driver:
|
sudo ubuntu-drivers autoinstall
sudo reboot
|
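After rebooting, verify the driver, for example with nvidia-smi as in the GPU query above.
If the GPU model and driver version are displayed, the driver was installed successfully.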
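The post-reboot verification can be scripted so that it fails cleanly when the driver is absent. This is a sketch under assumptions: `gpu_summary` is a hypothetical helper name, and only the `--query-gpu` and `--format` options of `nvidia-smi` are used.

```shell
# gpu_summary: print "name, driver_version" for each GPU, or a message
# and a non-zero status when nvidia-smi is not on PATH.
# The function name is a hypothetical helper, not part of the driver.
gpu_summary() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver is probably not installed"
        return 1
    fi
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
}
```

Calling `gpu_summary` after the reboot should then print one line per GPU if the driver loaded.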
Download CUDA:
|
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda_12.8.0_570.86.10_linux.run
chmod +x cuda_12.8.0_570.86.10_linux.run
sudo sh cuda_12.8.0_570.86.10_linux.run
|
Note: during installation, choose not to install the NVIDIA driver.
Configure the environment variables:
|
echo 'export PATH=/usr/local/cuda-12.8/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
|
Check that the installation works.
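The source does not show the command used for this check. One possible approach, offered only as an assumption, is to confirm that the CUDA bin directory from the exports above is on `PATH`; `path_has` is a hypothetical helper name.

```shell
# path_has DIR: succeed if DIR is one of the colon-separated PATH entries.
# The helper name is an illustrative assumption.
path_has() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the CUDA 12.8 bin directory exported above is active.
if path_has /usr/local/cuda-12.8/bin; then
    echo "CUDA bin directory is on PATH"
else
    echo "CUDA bin directory is missing from PATH"
fi
```

If the directory is present, running `nvcc --version` would be expected to report release 12.8, assuming the toolkit was installed to the default location.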
Install cuDNN
Download the cuDNN build that matches the CUDA version:
|
wget https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-9.13.1.26_cuda12-archive.tar.xz
tar -xvf cudnn-linux-x86_64-9.13.1.26_cuda12-archive.tar.xz
cd cudnn-linux-x86_64-9.13.1.26_cuda12-archive
|
Copy the libraries and header files into the CUDA path:
|
sudo cp include/cudnn*.h /usr/local/cuda/include/
sudo cp -P lib/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn*.h /usr/local/cuda/lib64/libcudnn*
sudo ldconfig
|
Verify the version:
|
grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn_version.h
|
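The three `#define` lines that the command above prints can be folded into a single version string. A minimal sketch, assuming a hypothetical helper named `cudnn_version` that takes the header path as its argument:

```shell
# cudnn_version HEADER: print MAJOR.MINOR.PATCHLEVEL read from the
# #define lines of a cudnn_version.h file. The helper name is hypothetical.
cudnn_version() {
    awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { major = $3 }
         $1 == "#define" && $2 == "CUDNN_MINOR"      { minor = $3 }
         $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { patch = $3 }
         END { print major "." minor "." patch }' "$1"
}

# Typical use:
#   cudnn_version /usr/local/cuda/include/cudnn_version.h
```

For the 9.13.1.26 archive downloaded above, this would be expected to print 9.13.1.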
Install Anaconda
Download the installer (2025.06-1, the latest version at the time of writing):
|
wget https://repo.anaconda.com/archive/Anaconda3-2025.06-1-Linux-x86_64.sh
chmod +x Anaconda3-2025.06-1-Linux-x86_64.sh
bash Anaconda3-2025.06-1-Linux-x86_64.sh
|
Configure automatic activation of the base environment:
|
conda config --set auto_activate_base true
source ~/.bashrc
|
Verify:
|
conda --version
python --version
|
Check drivers
Check hardware
- Wired/wireless network adapters
- Bluetooth
- Display outputs: HDMI 2.1, DP 2.1
- USB/Type-C
Check software
- Anaconda
- PyTorch
- TensorFlow
- JAX
- Slurm
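Part of the software checklist can be automated by testing whether each command-line tool is on `PATH`. The sketch below assumes a hypothetical helper named `report_tools`. PyTorch, TensorFlow and JAX are Python packages rather than commands, so they would need a separate import test inside Python.

```shell
# report_tools NAME...: print "NAME: found" or "NAME: missing" for each command.
# The helper name is an illustrative assumption.
report_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Commands installed elsewhere in these notes.
report_tools conda python sinfo sbatch
```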
Input method
|
sudo apt-get remove --purge fcitx
sudo apt-get autoremove
|
Install fcitx5:
|
sudo apt-get install fcitx5 fcitx5-configtool fcitx5-chinese-addons
|
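Search for the fcitx5 configuration tool and adjust the settings there.
- After rebooting, however, ibus could switch to Chinese normally (possibly because removing fcitx4 left the ibus environment clean, so ibus-daemon started correctly at boot and Chinese input was restored). I am still using ibus, with the Chinese/English toggle changed to Shift to match Sogou Input Method.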
Editors
Install:
|
sudo apt install pluma -y
sudo apt install gedit -y
|
Searching the system then shows two gedit entries, one from apt and one from snap; uninstall the snap version.
Clash
Download clash-verge-rev and install it:
|
sudo apt install -y ./Clash.Verge_x.x.x-_xxx.deb
|
Chrome browser
|
wget -O google.deb https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install -y ./google.deb
|
Sunlogin remote desktop
|
sudo snap install sunlogin-client
|
Fonts
Install Windows fonts (copy the contents of C:\Windows\Fonts into a folder named windows11):
|
sudo cp -r windows11/ /usr/share/fonts
sudo fc-cache -fsv
|
Install GNOME Tweaks and switch the font to Microsoft YaHei Mono:
|
sudo apt install -y gnome-tweaks
|
Change the hostname
|
sudo hostnamectl set-hostname ubuntu
sudo gedit /etc/hosts
# /etc/hosts should then contain these two entries:
# 127.0.0.1 localhost
# 127.0.1.1 <new hostname>
|
Change passwords
|
sudo passwd ubuntu
sudo passwd root
|
Slurm
Installation
- Install dependencies and components. Update the system, then install Slurm and the Munge authentication tool:
|
sudo apt update
sudo apt install -y slurm-wlm slurmctld slurmd munge
|
|
sudo useradd -r -m -s /usr/sbin/nologin slurm
sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
sudo chown -R slurm:slurm /var/spool/slurmctld /var/spool/slurmd /var/log/slurm
|
- Configure Munge (the authentication service); Slurm authenticates through Munge:
|
sudo /usr/sbin/create-munge-key
sudo chown -R munge:munge /etc/munge /var/lib/munge /var/log/munge
sudo systemctl enable --now munge
systemctl status munge --no-pager
|
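Beyond the service status, a local round trip through `munge` and `unmunge` is a common way to confirm that the key is usable. A minimal sketch, with `munge_selftest` as a hypothetical helper name:

```shell
# munge_selftest: encode an empty credential with `munge -n` and decode it
# locally with `unmunge`. Prints "munge ok" on success; returns non-zero
# otherwise. The function name is an illustrative assumption.
munge_selftest() {
    if ! command -v munge >/dev/null 2>&1 || ! command -v unmunge >/dev/null 2>&1; then
        echo "munge tools not found"
        return 2
    fi
    if munge -n | unmunge >/dev/null 2>&1; then
        echo "munge ok"
    else
        echo "munge round-trip failed"
        return 1
    fi
}
```

A failure here usually points at the key file or the munged service rather than at Slurm itself.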
|
sudo gedit /etc/slurm/slurm.conf
|
|
# Basic cluster information
ClusterName=slurm-cluster
ControlMachine=ubuntu # change this to your hostname
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=0
# Scheduling policy
SchedulerType=sched/backfill
SelectType=select/cons_tres
# Logging
SlurmctldDebug=info
SlurmdDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/none
# Node definition (single-node configuration)
NodeName=ubuntu Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=125000 Gres=gpu:1 State=UNKNOWN
# Partition definition
PartitionName=debug Nodes=ubuntu Default=YES MaxTime=INFINITE State=UP
|
|
sudo systemctl enable slurmctld slurmd
sudo systemctl start slurmctld slurmd
|
Check:
|
(base) root@ubuntu:/home/ubuntu# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle ubuntu
|
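Notes
- Either SelectType=select/cons_tres or SelectType=select/linear works; the former is recommended.
- There is no need to write NodeName=ubuntu Name=gpu File=/dev/nvidia0 into /etc/slurm/gres.conf, nor to add #SBATCH --gres=gpu:1 to job.sh; a Python script called directly still detects the GPU correctly.
- CoresPerSocket=8 rather than 24. The reason is that slurmd -C reports 8 instead of 24: slurmd's CPU topology detection is inaccurate, usually because slurmd/hwloc does not fully recognize new CPUs, so it sees only the 8 Performance-cores (P-cores) and ignores the 16 Efficient-cores (E-cores).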
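The core-count mismatch described in the last note can be checked by pulling individual values out of the `KEY=VALUE` line that `slurmd -C` prints. A minimal sketch, assuming a hypothetical helper named `conf_value`:

```shell
# conf_value KEY: read a line of space-separated KEY=VALUE pairs (the
# format printed by `slurmd -C`) on stdin and print the value of KEY.
# The helper name is an illustrative assumption.
conf_value() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2; exit }'
}

# On a live node, compare Slurm's view with the kernel's:
#   slurmd -C | conf_value CPUs
#   nproc
```

If the two numbers differ, the value from `slurmd -C` is the one that slurm.conf has to match, as in the node definition above.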
Submission script job.sh
|
#!/bin/bash
#SBATCH --job-name=name
#SBATCH --output=%j.out
#SBATCH --partition=debug
#SBATCH --ntasks=1
#SBATCH --exclusive
python xx.py > log
|
Some commands
|
# Submit a job
sbatch job.sh
# List jobs
squeue
# Cancel a job
scancel id
|
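The id passed to scancel comes from the sbatch output. The sketch below, with `job_id_from` as a hypothetical helper name, extracts it from either the default "Submitted batch job N" message or from `sbatch --parsable` output, which has the form N or N;cluster.

```shell
# job_id_from TEXT: print the numeric job ID found in sbatch output.
# Accepts "Submitted batch job N", "N", or "N;cluster"; prints nothing
# for anything else. The helper name is an illustrative assumption.
job_id_from() {
    printf '%s\n' "$1" |
        sed -n 's/^Submitted batch job //; s/;.*$//; /^[0-9][0-9]*$/p'
}

# Typical use on a live cluster:
#   jid=$(job_id_from "$(sbatch --parsable job.sh)")
#   scontrol show job "$jid"
#   scancel "$jid"
```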
Command aliases
|
alias qva='sbatch job.sh'
alias q='squeue'
alias qd='scancel'
alias sob='source ~/.bashrc'
alias nv='nvidia-smi'
alias mk='mkdir'
alias cp='cp -r'
alias hi='history'
alias tary='tar -zcvf'
alias tarj='tar -zxvf'
alias show='scontrol show job'
alias x.sh='chmod +x x.sh&&./x.sh'
alias matlab='matlab -nodesktop -nojvm -nosplash -nodisplay'
|