Kernel error reported when running ms-swift on a MetaX C500 64G device (log lines 1332–3323)
(swift) root@jupyter-xtuner-2-5449cfb555-gvhwq:/sys/kernel/boot_params# mx-smi
mx-smi version: 2.2.3
=================== MetaX System Management Interface Log ===================
Timestamp : Fri May 23 17:58:15 2025
Attached GPUs : 1
+---------------------------------------------------------------------------------+
| MX-SMI 2.2.3 Kernel Mode Driver Version: 2.12.13 |
| MACA Version: 2.32.0.6 BIOS Version: 1.16.2.0 |
|------------------------------------+---------------------+----------------------+
| GPU NAME | Bus-id | GPU-Util |
| Temp Pwr:Usage/Cap | Memory-Usage | |
|====================================+=====================+======================|
| 0 MetaX C500 | 0000:10:00.0 | 0% |
| 37C 57W / NA | 826/65536 MiB | |
+------------------------------------+---------------------+----------------------+
+---------------------------------------------------------------------------------+
| Process: |
| GPU PID Process Name GPU Memory |
| Usage(MiB) |
|=================================================================================|
| no process found |
+---------------------------------------------------------------------------------+
End of Log
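Since the report is about a kernel error, the kernel-side messages can be collected alongside the mx-smi snapshot above. A minimal sketch using standard Linux tooling; the grep filter terms are only guesses for the MetaX driver and may need adjusting:
# Grab recent kernel messages with human-readable timestamps
dmesg --ctime | tail -n 300 > kernel_msgs.txt
# Filter terms below are assumptions, not the exact driver names
dmesg | grep -iE "metax|maca|error|fault" >> kernel_msgs.txt
mx-smi >> kernel_msgs.txt   # device snapshot to attach to the report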
The training script is as follows:
#!/bin/bash
# Create the log directory
LOG_DIR="logs"
mkdir -p $LOG_DIR # make sure the log directory exists, creating it if needed
# Get the current timestamp to build a unique log file name
TIMESTAMP=$(date +"%Y%m%d_%H%M%S")
LOG_FILE="$LOG_DIR/internlm3-8b_lora_sft_${TIMESTAMP}.log" # path to the log file
# Set CUDA environment variables
export NPROC_PER_NODE=1 # use 1 process per node
export OMP_NUM_THREADS=1 # limit OpenMP to 1 thread to avoid thread contention
export CUDA_VISIBLE_DEVICES=0 # use GPU 0
# Run the training job in the background with nohup so it survives terminal disconnects
nohup swift sft \
--model ~/public-model/models/Shanghai_AI_Laboratory/internlm3-8b \
--train_type lora \
--dataset '~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl' \
--torch_dtype float16 \
--num_train_epochs 2 \
--per_device_train_batch_size 4 \
--learning_rate 5e-5 \
--warmup_ratio 0.1 \
--split_dataset_ratio 0 \
--report_to wandb \
--lora_rank 8 \
--lora_alpha 32 \
--use_chat_template false \
--target_modules all-linear \
--gradient_accumulation_steps 2 \
--save_steps 1000 \
--save_total_limit 5 \
--gradient_checkpointing_kwargs '{"use_reentrant": false}' \
--logging_steps 5 \
--max_length 4096 \
--output_dir ./swift_output/InternLM3-8B-Lora \
--dataset_num_proc 16 \
--dataloader_num_workers 16 \
--model_author FreshLittleLemon \
--model_name InternLM3-8B-Lora \
> "$LOG_FILE" 2>&1 &
# Print the PID and the log file location so the run can be tracked
echo "Training started with PID $!" # PID of the background process
echo "Log file: $LOG_FILE" # where the log is written
# Show how to follow the log in real time
echo "To view logs in real-time, use:"
echo "tail -f $LOG_FILE"
tail -f $LOG_FILE
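Two quick checks that can help narrow the failure down before digging through the nohup log. A minimal sketch with the paths copied from the script above; --max_steps is assumed to be forwarded by swift sft to the underlying HF TrainingArguments:
# Verify the first record of the training set parses as JSON
head -n 1 ~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl | python -m json.tool
# Re-run the same command in the foreground for a few steps so the error shows up
# directly in the terminal instead of only in the nohup log
swift sft \
    --model ~/public-model/models/Shanghai_AI_Laboratory/internlm3-8b \
    --train_type lora \
    --dataset '~/public-model/models/Shanghai_AI_Laboratory/swift/dataset/train.jsonl' \
    --torch_dtype float16 \
    --per_device_train_batch_size 1 \
    --max_steps 5 \
    --output_dir ./swift_output/debug-run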
The environment was set up as follows:
mx-smi
conda create -n swift python=3.10 -y
conda activate swift
wget -O maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz "https://wheel-pub.oss-cn-shanghai.aliyuncs.com/mxc500/2.32.2.x/x86_64/maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz?OSSAccessKeyId=LTAI5t8HeoJo71RpDsrCMZbQ&Expires=1748011284&Signature=hIZQHjrcey8SVfL6HP3mXPSZJq8%3D"
tar -xvf maca-pytorch2.6-py310-2.32.0.3-x86_64.tar.xz
cd 2.32.0.3/wheel
pip install apex-0.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torch-2.6.0+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torchaudio-2.4.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install torchvision-0.15.1+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
pip install triton-3.0.0+metax2.32.0.3-cp310-cp310-linux_x86_64.whl
cd
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .
pip install wandb
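After installing the MACA wheels and ms-swift, a quick import check confirms that the build is active and that the C500 is visible to PyTorch. A minimal sketch, assuming the MACA build exposes the standard torch.cuda API (which the CUDA_VISIBLE_DEVICES setting in the script suggests):
# Assumes the MACA build exposes the standard torch.cuda API
python -c "import torch; print(torch.__version__, torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no device visible')"
pip show ms-swift | head -n 2   # confirm the editable ms-swift install is picked up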