This post was last edited by 沐曦-马天舒 on 2025-3-26 15:15
Version information:
Component Name | Version Information
Metax vbios | 1.23.1.0
Metax Driver | 2.31.0.6
MACA SDK | 2.31.0.6
MACA Pytorch | 2.31.0.4
Release contents:
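[Feature] [KMD] KMD adds a version compatibility check between ccx fw and ccx boot (vbios)
[Feature] [FW] Support version compatibility between ccx firmware and ccx boot
[Feature] [UMD] Provide apt and yum installation packages of the MACA SDK for x86 and arm
[Feature] [UMD] UMD supports GPU topology awareness
[Feature] [UMD] Support online installation; the SDK installation path and the installed version number change to a 3-part form
[Feature] [mccl] [SDMA reuse] Develop a multi-node SDMA sendRecv parallel scheme
[Feature] [mccl] [Tools] Develop/optimize the cluster status check script
[Feature] [mccl] MCCL adds RAS functionality
[Feature] [Compiler] The OpenACC compiler adds support for the semantic features used in the WRF project
[Feature] [Compiler] Add an option to display the list of Python packages that the OpenACC compiler depends on at runtime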
[Feature] [ACL] Add headdim 512 in FlashAttention
[Feature] [ACL] Add w8a8 azp functionality
[Feature] [ACL] blasLT support for the GEMM+Communication API
[Feature] [ACL] blasLT performance improvement for the gemm+bias API
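[Feature] [ACL] mcDNN performance optimization for fp32/tf32 unidirectional LSTM
[Feature] [ACL] flashAttn provides a kernel selector plugin that picks a better-performing kernel for specific shapes
[Feature] [ACL] flashAttn provides int8 KV cache dequantization to improve performance in the decoder stage
[Feature] [ACL] flashInfer upgraded to version 0.2.x to support deepseekV3 inference
[Feature] [ACL] mcFft library: performance for prime sizes up to 8192 improved from 30% to 60%
[Feature] [DC-Tools] mx-smi supports force-flashing a vbios of the same SKU type when a vbios update has failed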
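[Improvement] [UMD] memcpy in Graph ib mode supports enabling profiling after graphInstantiate
[Improvement] [UMD] Reduce GraphLaunch time
[Improvement] [UMD] Support the mcDeviceSetGraphMemAttribute API in direct dispatch mode
[Improvement] [UMD] Support the mcDeviceGraphMemTrim API in direct dispatch mode
[Improvement] [UMD] Support the mcDeviceGetGraphMemAttribute API in direct dispatch mode
[Improvement] [UMD] Enable the pre-kernel L2 flush optimization by default
[Improvement] [UMD] Support obtaining basic debug information when an MXMACA application hangs [V1]
[Improvement] [UMD] Refine the trap kernel precise-localization scheme (command-level precise localization)
[Improvement] [UMD] The mcTracer tool supports a configurable profiling range
[Improvement] [mccl] [C500X] Optimize the Dragonfly 8-card communication algorithm
[Improvement] [mccl] [C550 Switch] Develop a hierarchical algorithm to improve 16-card performance on the C550 Switch Box topology
[Improvement] [mccl] [C550 Switch] Develop a hierarchical algorithm to improve 64-card performance on the C550 Switch Box topology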
[Improvement] [Compiler] Improve flash attn2 performance by over 1% with a fast div solution
[Improvement] [Compiler] Improve vllm kDequantize performance by over 4% with a table lookup algorithm solution
[Improvement] [Compiler] Improve flash attn2 performance through B16 register allocation
[Improvement] [Compiler] Improve triton-flash attention2 bwd performance with the metax gpu sip solution
[Improvement] [ACL] TF32 & Int8 gemm performance optimization for new fused id + 2m pagesize
[Improvement] [ACL] blas performance improvement for LLM inference (including deepseek r1)
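[Improvement] [ACL] deepseek mla triton operator performance optimization; average performance is 70%+ of a100
[Improvement] [ACL] deepseek fused moe triton operator performance optimization; average performance is 70% of a100 when batch_size <= 256
[Improvement] [ACL] pytorch cat operator performance optimization; average performance improved by 15%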
[Improvement] [ACL] blas performance improvement for gemv
[Improvement] [DC-Tools] mxvs tool traffic generation supports switch address traversal
[Improvement] [DC-Tools] cubridge compilation reports that NVML_DEVICE_PCI_BUS_ID_FMT is not declared