CycleUser

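Yesterday the Qwen team released its latest models, Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507. Two highlights stand out:

Qwen3-4B-Instruct-2507 outperforms the closed-source commercial small model GPT-4.1-nano in general capability and comes close to the mid-sized Qwen3-30B-A3B (non-thinking).

Qwen3-4B-Thinking-2507 rivals the mid-sized Qwen3-30B-Thinking in reasoning, scoring 81.3 on the math-focused AIME25 benchmark with only 4B parameters. On the general-capability benchmarks it surpasses the larger Qwen3-30B-Thinking across the board.

This guide shows how to deploy Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507 with llama.cpp and Ollama.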

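Using the pre-converted models

The models have already been converted, so you can run them directly with Ollama.

ollama run hopephoto/qwen3-4b-thinking-2507_q8 # thinking (reasoning) model
ollama run hopephoto/qwen3-4b-instruct-2507_q8 # instruct model

Expect to need roughly 8 GB of disk space.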

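Performance evaluation

The tests ran on a laptop with an AMD 7950X, 128 GB of DDR5 RAM, and a 4060 GPU with 8 GB of VRAM.

Benchmark command for the thinking model (the prompt asks, in Chinese, for a Python function that computes the Fibonacci sequence):

ollama run hopephoto/Qwen3-4B-Thinking-2507_q8 "请写一个Python函数来计算斐波那契数列" --verbose

Results: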

total duration:       1m20.0929826s
load duration:        3.9914417s
prompt eval count:    26 token(s)
prompt eval duration: 518.139ms
prompt eval rate:     50.18 tokens/s
eval count:           3032 token(s)
eval duration:        1m15.5805682s
eval rate:            40.12 tokens/s

Benchmark command for the instruct model (same Chinese prompt):

ollama run hopephoto/Qwen3-4B-Instruct-2507_q8 "请写一个Python函数来计算斐波那契数列" --verbose

Results:

total duration:       13.281791s
load duration:        3.6564452s
prompt eval count:    53 token(s)
prompt eval duration: 268.922ms
prompt eval rate:     197.08 tokens/s
eval count:           394 token(s)
eval duration:        9.3551987s
eval rate:            42.12 tokens/s

As expected for a 4B model with Q8 quantization, VRAM usage stays under 6 GB, and both models generate at roughly 40 tokens/s on the 4060.
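The reported rates can be cross-checked from the token counts and durations in the `--verbose` output. The following is a minimal sketch; the helper names `parse_duration` and `tokens_per_second` are made up for this illustration and are not part of Ollama.

```python
import re

def parse_duration(text: str) -> float:
    """Convert a duration such as '1m15.5805682s' or '518.139ms' to seconds."""
    units = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}
    total = 0.0
    # "ms" is listed before "m" and "s" so that '518.139ms' is not misread.
    for value, unit in re.findall(r"([\d.]+)(ms|h|m|s)", text):
        total += float(value) * units[unit]
    return total

def tokens_per_second(count: int, duration: str) -> float:
    """Rate = token count divided by elapsed seconds."""
    return count / parse_duration(duration)

# Values copied from the two benchmark runs above.
print(round(tokens_per_second(3032, "1m15.5805682s"), 2))  # thinking model, generation
print(round(tokens_per_second(394, "9.3551987s"), 2))      # instruct model, generation
print(round(tokens_per_second(26, "518.139ms"), 2))        # thinking model, prompt eval
```

The thinking model's much longer total duration comes from generating far more tokens (3032 versus 394), not from a slower generation rate.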

Converting the models yourself

Prerequisites:

Python 3.10+
Git
Ollama
Enough disk space (about 30 GB or more; each original model is roughly 7.5 GB, and roughly 4 GB after q8 quantization)

Step 1: Install dependencies

1.1 Install ModelScope

pip install modelscope

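1.2 Clone the llama.cpp repository and install the Python dependencies of its conversion script

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
pip install -r requirements.txt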

Step 2: Download the models

Download the Qwen3-4B-2507 models with ModelScope:

modelscope download --model Qwen/Qwen3-4B-Thinking-2507  # thinking model
modelscope download --model Qwen/Qwen3-4B-Instruct-2507  # instruct model

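By default the thinking model and the instruct model are downloaded to the following locations. Replace <username> with your actual user name.

C:\Users\<username>\.cache\modelscope\hub\models\qwen\Qwen3-4B-Thinking-2507

C:\Users\<username>\.cache\modelscope\hub\models\qwen\Qwen3-4B-Instruct-2507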
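To avoid typing the user name by hand, the download directory can be derived from the home directory. This is a small sketch that assumes the cache layout shown in the paths above; `modelscope_model_dir` is a hypothetical helper, and the layout may differ between ModelScope versions or if a custom cache directory is configured.

```python
from pathlib import Path

def modelscope_model_dir(model_name: str, org_dir: str = "qwen") -> Path:
    """Build the default ModelScope cache path for a model.

    Assumes the layout used in this guide:
    <home>/.cache/modelscope/hub/models/<org_dir>/<model_name>
    """
    return Path.home() / ".cache" / "modelscope" / "hub" / "models" / org_dir / model_name

for name in ("Qwen3-4B-Thinking-2507", "Qwen3-4B-Instruct-2507"):
    path = modelscope_model_dir(name)
    # exists() only reports whether the directory is present; it does not validate the files.
    print(path, "(found)" if path.exists() else "(not found)")
```

If a path is reported as not found, check the final lines printed by `modelscope download`, which show where the files were actually written.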

Step 3: Convert the model format

Convert the Hugging Face format to GGUF, replacing <username> with your actual user name. The trailing # comments work in PowerShell; omit them if you run the commands in cmd.exe.

python convert_hf_to_gguf.py "C:\Users\<username>\.cache\modelscope\hub\models\qwen\Qwen3-4B-Thinking-2507" --outfile models\qwen3-4b-thinking-2507_q8.gguf --verbose --outtype q8_0 # thinking model

python convert_hf_to_gguf.py "C:\Users\<username>\.cache\modelscope\hub\models\qwen\Qwen3-4B-Instruct-2507" --outfile models\qwen3-4b-instruct-2507_q8.gguf --verbose --outtype q8_0 # instruct model

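Parameters

--outfile: path of the output file
--verbose: print detailed conversion information
--outtype q8_0: output quantization type; q8_0 offers a good balance between quality and file size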
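A rough way to see where the file sizes come from: q8_0 stores weights in blocks of 32 signed 8-bit values plus one 16-bit scale, so 34 bytes per 32 weights (8.5 bits per weight). The sketch below assumes a nominal parameter count of 4.0e9 and a 16-bit original checkpoint; both are assumptions for illustration, and real GGUF files differ somewhat because of metadata and tensors that are not quantized.

```python
Q8_0_BLOCK_WEIGHTS = 32      # weights per q8_0 block
Q8_0_BLOCK_BYTES = 32 + 2    # 32 int8 values + one 16-bit scale

GIB = 1024 ** 3

def q8_0_size_gib(n_params: float) -> float:
    """Approximate size of a q8_0-quantized model in GiB."""
    return n_params / Q8_0_BLOCK_WEIGHTS * Q8_0_BLOCK_BYTES / GIB

def fp16_size_gib(n_params: float) -> float:
    """Approximate size of a checkpoint stored with 2 bytes per weight, in GiB."""
    return n_params * 2 / GIB

N_PARAMS = 4.0e9  # assumed nominal parameter count for a "4B" model

print(f"16-bit checkpoint: {fp16_size_gib(N_PARAMS):.2f} GiB")
print(f"q8_0 GGUF:         {q8_0_size_gib(N_PARAMS):.2f} GiB")
```

Under these assumptions the estimates land near the 7.5 GB and 4 GB figures quoted in the prerequisites.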

Step 4: Create the Ollama Modelfiles

Create a configuration file named Modelfile_qwen3_coder_thinking for the thinking model:

FROM models/qwen3-4b-thinking-2507_q8.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"

SYSTEM """你是Qwen，由阿里云开发的AI助手。你对用户的问题和请求总是有帮助、准确和诚实的。"""

Create a configuration file named Modelfile_qwen3_coder_instruct for the instruct model:

FROM models/qwen3-4b-instruct-2507_q8.gguf

TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
PARAMETER stop "<|im_start|>"

SYSTEM """你是Qwen，由阿里云开发的AI助手。你对用户的问题和请求总是有帮助、准确和诚实的。"""

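Configuration notes

FROM: path to the model file
TEMPLATE: the chat template, using the Qwen-specific format
PARAMETER stop: stop tokens that end generation
SYSTEM: the system prompt (the Chinese text above reads: "You are Qwen, an AI assistant developed by Alibaba Cloud. You are always helpful, accurate, and honest in responding to users' questions and requests.")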
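To make the TEMPLATE concrete, the sketch below reproduces in plain Python the text that the template in the Modelfiles above would produce for one system prompt and one user prompt. `render_chatml` is a hypothetical helper written for this illustration; Ollama performs this substitution itself.

```python
def render_chatml(system: str, prompt: str) -> str:
    """Fill the single-turn template from the Modelfile with a system and user message."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # left open: the model continues from here
    )

text = render_chatml(
    "You are Qwen, an AI assistant developed by Alibaba Cloud.",
    "Write a Python function that computes the Fibonacci sequence",
)
print(text)
```

The prompt ends with an open assistant turn, and the model's reply ends when it emits `<|im_end|>`, which is why that token is registered as a stop parameter.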

File layout

llama.cpp/
├── models/
│   ├── qwen3-4b-thinking-2507_q8.gguf
│   └── qwen3-4b-instruct-2507_q8.gguf
├── Modelfile_qwen3_coder_thinking
├── Modelfile_qwen3_coder_instruct
└── convert_hf_to_gguf.py

Step 5: Create the models in Ollama

ollama create Qwen3-4B-Thinking-2507_q8 -f Modelfile_qwen3_coder_thinking

ollama create Qwen3-4B-Instruct-2507_q8 -f Modelfile_qwen3_coder_instruct

Step 6: Verify the installation

Run the following commands to test the models:

ollama run Qwen3-4B-Thinking-2507_q8
ollama run Qwen3-4B-Instruct-2507_q8

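Usage examples

Code generation

ollama run Qwen3-4B-Thinking-2507_q8 "Write a Python function that computes the Fibonacci sequence"

ollama run Qwen3-4B-Instruct-2507_q8 "Write a Python function that computes the Fibonacci sequence"

Code explanation

ollama run Qwen3-4B-Thinking-2507_q8 "Explain what this code does: def quicksort(arr): ..."

ollama run Qwen3-4B-Instruct-2507_q8 "Explain what this code does: def quicksort(arr): ..."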
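The same prompts can also be sent from a script through Ollama's local HTTP API. This is a minimal sketch using only the standard library. It assumes Ollama is listening on its default address `http://localhost:11434` and that the model names match those created in step 5; `build_request` and `generate` are helper names chosen for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request for the given model and prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the generated text (requires a running Ollama server)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

With the server running, `print(generate("Qwen3-4B-Instruct-2507_q8", "Write a Python function that computes the Fibonacci sequence"))` prints the model's answer. The thinking model usually needs a longer timeout because it generates many more tokens.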