Kernel error reported when running ms-swift on a MetaX C500 64G device (log lines 1332–3323)
(swift) root@jupyter-xtuner-2-5449cfb555-gvhwq:/sys/kernel/boot_params# mx-smi
mx-smi version: 2.2.3
=================== MetaX System Management Interface Log ===================
Timestamp : Fri May 23 17:58:15 2025
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.3 Kernel Mode Driver Version: 2.12.13 |
| MACA Version: 2.32.0.6 BIOS Version: 1.16.2.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Pwr:Usage/Cap | Memory-Usage | |
|====================================+=====================+======================|
| 0 MetaX C500 | 0000:10:00.0 | 0% |
| 37C 57W / NA | 826/65536 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
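Since the report is about a kernel error, the kernel-side messages can be collected alongside the mx-smi snapshot above. A minimal sketch using standard Linux tooling; the grep filter terms are only guesses for the MetaX driver and may need adjusting:
# Grab recent kernel messages with human-readable timestamps
dmesg --ctime | tail -n 300 > kernel_msgs.txt
# Filter terms below are assumptions, not the exact driver names
dmesg | grep -iE "metax|maca|error|fault" >> kernel_msgs.txt
mx-smi >> kernel_msgs.txt   # device snapshot to attach to the report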
The training script is as follows:
#!/bin/bash
# Create the log directory
LOG_DIR="logs"
mkdir -p $LOG_DIR # make sure the log directory exists, creating it if needed
# Get the current timestamp to build a unique log file name
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/internlm3-8b_lora_sft_${TIMESTAMP}.log" # path to the log file
# Set CUDA environment variables
export NPROC_PER_NODE=1 # use 1 process per node
export OMP_NUM_THREADS=1 # limit OpenMP to 1 thread to avoid thread contention
export CUDA_VISIBLE_DEVICES=0 # use GPU 0
# Run the training job in the background with nohup so it survives terminal disconnects
nohup swift sft \
--model ~/public-model/models/Shanghai_AI_Laboratory/internlm3-8b \
--train_type lora \
--dataset '~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl' \
--torch_dtype float16 \
--num_train_epochs 2 \
--per_device_train_batch_size 4 \
--learning_rate 5e-5 \
--warmup_ratio 0.1 \
--split_dataset_ratio 0 \
--report_to wandb \
--lora_rank 8 \
--lora_alpha 32 \
--use_chat_template false \
--target_modules all-linear \
--gradient_accumulation_steps 2 \
--save_steps 1000 \
--save_total_limit 5 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--logging_steps 5 \
--max_length 4096 \
--output_dir ./swift_output/InternLM3-8B-Lora \
--dataset_num_proc 16 \
--dataloader_num_workers 16 \
--model_author FreshLittleLemon \
--model_name InternLM3-8B-Lora \
> "$LOG_FILE" 2>&1 &
# Print the PID and the log file location so the run can be tracked
echo "Training started with PID $!" # PID of the background process
echo "Log file: $LOG_FILE" # where the log is written
# Show how to follow the log in real time
echo "To view logs in real-time, use:"
echo "tail -f $LOG_FILE"
tail -f $LOG_FILE
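Two quick checks that can help narrow the failure down before digging through the nohup log. A minimal sketch with the paths copied from the script above; --max_steps is assumed to be forwarded by swift sft to the underlying HF TrainingArguments:
# Verify the first record of the training set parses as JSON
head -n 1 ~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl | python -m json.tool
# Re-run the same command in the foreground for a few steps so the error shows up
# directly in the terminal instead of only in the nohup log
swift sft \
    --model ~/public-model/models/Shanghai_AI_Laboratory/internlm3-8b \
    --train_type lora \
    --dataset '~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl' \
    --torch_dtype float16 \
    --per_device_train_batch_size 1 \
    --max_steps 5 \
    --output_dir ./swift_output/debug-run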
The environment was set up as follows:
mx-smi
conda create -n swift python=3.10 -y
conda activate swift
wget -O maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz "https://wheel-pub.oss-cn-shanghai.aliyuncs.com/mxc500/2.32.2.x/x86_64/maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1748011284&Signature=hIZQHjrcey8SVfL6HP3mXPSZJq8%3D"
tar -xvf maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz
cd 2.32.0.3/wheel
pip install apex-0.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torch-2.6.0+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torchaudio-2.4.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.15.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install triton-3.0.0+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
cd
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install wandb
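After installing the MACA wheels and ms-swift, a quick import check confirms that the build is active and that the C500 is visible to PyTorch. A minimal sketch, assuming the MACA build exposes the standard torch.cuda API (which the CUDA_VISIBLE_DEVICES setting in the script suggests):
# Assumes the MACA build exposes the standard torch.cuda API
python -c "import torch; print(torch.__version__, torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no device visible')"
pip show ms-swift | head -n 2   # confirm the editable ms-swift install is picked up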