Fin-R1 GGUF Models

Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – Use if BF16 acceleration is available

A 16-bit floating-point format designed for faster computation while retaining good precision.
Offers a dynamic range similar to FP32 with lower memory usage.
Recommended if your hardware supports BF16 acceleration (check your device’s specs).
Ideal for high-performance inference with reduced memory footprint compared to FP32.

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

F16 (Float 16) – More widely supported than BF16

A 16-bit floating-point format with a narrower range of values than BF16.
Works on most devices with FP16 acceleration support (including many GPUs and some CPUs).
Has more mantissa bits than BF16 (10 vs. 7), so it is slightly more precise within its range, and is generally sufficient for inference.

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computations.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ Memory is tight (a quantized model will be much smaller).
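
The range and precision trade-off between the two 16-bit formats follows directly from how their bits are split. As a minimal sketch, assuming the standard layouts (FP16: 5 exponent and 10 mantissa bits; BF16: 8 exponent and 7 mantissa bits), the helper `float_format_stats` below (a name made up for this example) derives the largest finite value and the spacing around 1.0:

```python
# Derive range and precision of a binary floating-point format from its
# exponent and mantissa widths (one sign bit, IEEE-754-style layout).

def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest finite value: all mantissa bits set, largest non-reserved exponent.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Gap between 1.0 and the next representable number.
    epsilon = 2.0 ** -mantissa_bits
    return {"max_finite": max_finite, "epsilon": epsilon}


FP16 = float_format_stats(exponent_bits=5, mantissa_bits=10)
BF16 = float_format_stats(exponent_bits=8, mantissa_bits=7)

for name, stats in (("FP16", FP16), ("BF16", BF16)):
    print(f"{name}: max={stats['max_finite']:.4g}  epsilon={stats['epsilon']:.3g}")
```

BF16 reaches roughly the same magnitude as FP32 but with coarser steps, while FP16 has finer steps and overflows above 65504.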

Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

Lower-bit models (Q4_K) → Best for minimal memory usage, may have lower precision.
Higher-bit models (Q6_K, Q8_0) → Better accuracy, requires more memory.

📌 Use Quantized Models if:
✔ You are running inference on a CPU and need an optimized model.
✔ Your device has low VRAM and cannot load full-precision models.
✔ You want to reduce memory footprint while keeping reasonable accuracy.

📌 Avoid Quantized Models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
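
To make the idea of block-wise quantization concrete, here is a simplified sketch in the spirit of Q8_0: each block of 32 weights shares one scale and the values are stored as 8-bit integers. It is an illustration only, not llama.cpp's actual implementation (which stores the scale in 16 bits and packs the data in binary), and all function names are made up for this example.

```python
import math

BLOCK_SIZE = 32  # weights per block, as in Q8_0


def quantize_block(block):
    """Map one block of floats to (scale, list of int8-range values)."""
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, q


def quantize(weights):
    """Split weights into blocks and quantize each block independently."""
    blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
    return [quantize_block(b) for b in blocks]


def dequantize(qblocks):
    """Reconstruct approximate floats from (scale, values) pairs."""
    out = []
    for scale, q in qblocks:
        out.extend(scale * v for v in q)
    return out


# Deterministic toy weights; the second block has a larger magnitude.
weights = [0.1 * math.sin(i) for i in range(32)] + [2.0 * math.cos(i) for i in range(32)]
qblocks = quantize(weights)
restored = dequantize(qblocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks={len(qblocks)}  max abs error={max_error:.5f}")
```

Because every block has its own scale, a few large weights in one block do not degrade the resolution of the others. Lower-bit formats follow the same idea with fewer bits per value and therefore a larger rounding error.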

Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, making them ideal for low-power devices or large-scale deployments where memory is a critical constraint.

IQ3_XS: Ultra-low-bit quantization (3-bit) with extreme memory efficiency.
- Use case: Best for ultra-low-memory devices where even Q4_K is too large.
- Trade-off: Lower accuracy compared to higher-bit quantizations.
IQ3_S: The small 3-bit variant, slightly larger than IQ3_XS.
- Use case: Best for low-memory devices where IQ3_XS is too aggressive.
IQ3_M: The medium 3-bit variant, trading a little more memory for better accuracy than IQ3_S.
- Use case: Suitable for low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: Best for low-memory devices where Q6_K is too large.
Q4_0: Pure 4-bit quantization, optimized for ARM devices.
- Use case: Best for ARM-based devices or low-memory environments.

Summary Table: Model Format Selection

Model Format	Precision	Memory Usage	Device Requirements	Best Use Case
BF16	Highest	High	BF16-supported GPU/CPUs	High-speed inference with reduced memory
F16	High	High	FP16-supported devices	GPU inference when BF16 isn’t available
Q4_K	Medium Low	Low	CPU or Low-VRAM devices	Best for memory-constrained environments
Q6_K	Medium	Moderate	CPU with more memory	Better accuracy while still being quantized
Q8_0	High	Moderate	CPU or GPU with enough VRAM	Best accuracy among quantized models
IQ3_XS	Very Low	Very Low	Ultra-low-memory devices	Extreme memory efficiency and low accuracy
Q4_0	Low	Low	ARM or low-memory devices	llama.cpp can optimize for ARM devices
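
A rough way to compare the memory columns above is to multiply the parameter count by the nominal bits per weight of each format. The sketch below assumes the 7.62B parameter count reported for this model and the nominal per-block storage cost of each llama.cpp tensor type. Real files mix tensor types (for example, embeddings kept at higher precision), so actual sizes will differ somewhat; `estimate_gib` is a helper made up for this example.

```python
# Nominal bits per weight, including per-block scales. Mixed-precision
# files deviate from these figures, so treat the results as estimates.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,     # 32 x 8-bit values + one 16-bit scale per block
    "Q6_K": 6.5625,  # 210 bytes per 256-weight super-block
    "Q4_K": 4.5,     # 144 bytes per 256-weight super-block
    "Q4_0": 4.5,     # 32 x 4-bit values + one 16-bit scale per block
}

N_PARAMS = 7.62e9  # parameter count reported for this model


def estimate_gib(fmt: str, n_params: float = N_PARAMS) -> float:
    """Approximate weight storage in GiB for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:<5} ~{estimate_gib(fmt):5.1f} GiB")
```

Leave extra headroom on top of these figures for the KV cache and runtime buffers, which grow with context length.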

Included Files & Details

`Fin-R1-bf16.gguf`

Model weights preserved in BF16.
Use this if you want to requantize the model into a different format.
Best if your device supports BF16 acceleration.

`Fin-R1-f16.gguf`

Model weights stored in F16.
Use if your device supports FP16, especially if BF16 is not available.

`Fin-R1-bf16-q8_0.gguf`

Output & embeddings remain in BF16.
All other layers quantized to Q8_0.
Use if your device supports BF16 and you want a quantized version.

`Fin-R1-f16-q8_0.gguf`

Output & embeddings remain in F16.
All other layers quantized to Q8_0.

`Fin-R1-q4_k.gguf`

Output & embeddings quantized to Q8_0.
All other layers quantized to Q4_K.
Good for CPU inference with limited memory.

`Fin-R1-q4_k_s.gguf`

Smallest Q4_K variant, using less memory at the cost of accuracy.
Best for very low-memory setups.

`Fin-R1-q6_k.gguf`

Output & embeddings quantized to Q8_0.
All other layers quantized to Q6_K.

`Fin-R1-q8_0.gguf`

Fully Q8 quantized model for better accuracy.
Requires more memory but offers higher precision.

`Fin-R1-iq3_xs.gguf`

IQ3_XS quantization, optimized for extreme memory efficiency.
Best for ultra-low-memory devices.

`Fin-R1-iq3_m.gguf`

IQ3_M quantization, the medium 3-bit variant, for better accuracy than IQ3_S.
Suitable for low-memory devices.

`Fin-R1-q4_0.gguf`

Pure Q4_0 quantization, optimized for ARM devices.
Best for low-memory environments.
Prefer IQ4_NL for better accuracy.
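
The guidance above can be condensed into a small helper. This is only a sketch: the memory thresholds are rough assumptions for a model of this size, not measured values, and `suggest_file` is a hypothetical function rather than part of any tool.

```python
def suggest_file(free_memory_gib: float, bf16: bool = False,
                 fp16: bool = False, arm: bool = False) -> str:
    """Suggest one of the listed files for the available memory and hardware.

    Thresholds are illustrative assumptions; adjust them for your context
    length and runtime overhead.
    """
    if free_memory_gib >= 18:
        if bf16:
            return "Fin-R1-bf16.gguf"
        if fp16:
            return "Fin-R1-f16.gguf"
    if free_memory_gib >= 10:
        return "Fin-R1-q8_0.gguf"
    if free_memory_gib >= 8:
        return "Fin-R1-q6_k.gguf"
    if free_memory_gib >= 6:
        # Q4_0 is the variant described above as optimized for ARM devices.
        return "Fin-R1-q4_0.gguf" if arm else "Fin-R1-q4_k.gguf"
    return "Fin-R1-iq3_xs.gguf"


print(suggest_file(24, bf16=True))
print(suggest_file(12))
print(suggest_file(6.5, arm=True))
print(suggest_file(4))
```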

🚀 If you find these models useful

Please click like ❤. I’d also really appreciate it if you could test my Network Monitor Assistant at 👉 Network Monitor Assistant.

💬 Click the chat icon (bottom right of the main and dashboard pages), then choose an LLM. You can toggle between the LLM types: TurboLLM -> FreeLLM -> TestLLM.

What I'm Testing

I'm experimenting with function calling against my network monitoring service using small open-source models. The question I'm exploring is: how small can a model go and still function?

🟡 TestLLM – Runs the current testing model using llama.cpp on 6 threads of a CPU VM. It should take about 15s to load. Inference is quite slow and it processes only one user prompt at a time (still working on scaling!). If you're curious, I'd be happy to share how it works!

Other Available AI Assistants

🟢 TurboLLM – Uses gpt-4o-mini. Fast! Note: tokens are limited since OpenAI models are pricey, but you can log in or download the Quantum Network Monitor agent to get more tokens. Alternatively, use TestLLM.

🔵 HugLLM – Runs open-source Hugging Face models. Fast, but it runs small models (≈8B), hence lower quality. Get 2x more tokens (subject to Hugging Face API availability).

Final Word

I fund the servers used to create these model files, run the Quantum Network Monitor service, and pay for inference from Novita and OpenAI—all out of my own pocket. All the code behind the model creation and the Quantum Network Monitor project is open source. Feel free to use whatever you find helpful.

If you appreciate the work, please consider buying me a coffee ☕. Your support helps cover service costs and allows me to raise token limits for everyone.

I'm also open to job opportunities or sponsorship.

Thank you! 😊

Fin-R1: A Financial Reasoning Large Language Model Driven by Reinforcement Learning

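Fin-R1 is a large language model for complex reasoning in the financial domain. It was developed and open-sourced by the Financial Large Language Model research group (SUFE-AIFLM-Lab), led by Professor Liwen Zhang (张立文) of the School of Statistics and Data Science at Shanghai University of Finance and Economics, together with 财跃星辰. The model is built on Qwen2.5-7B-Instruct and fine-tuned on high-quality, verifiable financial questions. On several financial benchmarks it reaches state-of-the-art results among the evaluated models.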

Code: https://github.com/SUFE-AIFLM-Lab/Fin-R1

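💡 Application Scenarios

Fin-R1 is a large language model designed for financial reasoning, built on a lightweight 7B-parameter architecture. While significantly reducing deployment cost, the model is trained in two stages, SFT (supervised fine-tuning) and RL (reinforcement learning), on high-quality chain-of-thought data targeted at financial reasoning scenarios. This gives it a solid grounding in theory, business rules, decision logic, and technical implementation for financial applications. The result is a stronger ability to perform complex financial reasoning in support of core financial businesses such as banking, securities, insurance, and trusts.

Financial Code

Financial code is computer code used in finance to implement financial models, algorithms, and analysis tasks. It ranges from simple financial calculations to complex derivatives pricing, risk assessment, and portfolio optimization, and helps financial professionals with data processing, statistical analysis, numerical computation, and visualization.

Financial Computation

Financial computation is the quantitative analysis and calculation of problems in finance. At its core, it solves practical financial problems by building mathematical models and applying numerical methods. It provides a scientific basis for financial decisions and helps financial institutions and investors manage risk, allocate resources, and improve investment returns.

Financial Computation in English

Financial computation in English emphasizes building and computing financial models in English in a cross-language setting, as well as writing financial analysis reports in English and communicating with international peers.

Financial Security and Compliance

Financial security and compliance focuses on preventing financial crime and meeting regulatory requirements. It helps companies establish a sound compliance management system and carry out regular compliance checks and audits so that business operations conform to the relevant regulations.

Intelligent Risk Control

Intelligent risk control uses AI and big data to identify and manage financial risk. Compared with traditional risk control, it is more efficient, more accurate, and closer to real time. By mining and analyzing large volumes of financial data, it can uncover latent risk patterns and abnormal trading behavior, enabling timely warnings and appropriate risk control measures.

ESG Analysis

ESG analysis evaluates a company's environmental, social, and governance performance to measure its capacity for sustainable development, so that investments not only earn financial returns but also promote sustainability and social responsibility. Financial institutions and companies also improve their own ESG performance to meet the higher expectations of investors and society.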

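Overall Workflow

We built a data distillation framework based on DeepSeek-R1 and processed the data strictly according to the official parameter settings. A two-stage data screening method improves the quality of the financial data and produces an SFT dataset and an RL dataset. For training, we start from Qwen2.5-7B-Instruct and apply supervised fine-tuning (SFT) followed by reinforcement learning (RL) to obtain the financial reasoning model Fin-R1, improving accuracy and generalization on financial reasoning tasks.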

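🛠️ Data Construction

To transfer DeepSeek-R1's reasoning ability to financial scenarios and address the shortage of high-quality financial reasoning data, we used the full DeepSeek-R1 model to distill and screen domain knowledge from multiple datasets. These cover industry corpora (FinCorpus, Ant_Finance), professional knowledge (FinPEE), business knowledge (FinCUGE, FinanceIQ, Finance-Instruct-500K), table parsing (FinQA), market insight (TFNS), multi-turn interaction (ConvFinQA), and quantitative investment (FinanceQT). The result is Fin-R1-Data, a high-quality CoT dataset of about 60k samples for professional financial reasoning. The dataset spans multi-dimensional expertise in Chinese and English financial verticals and is divided by task content into four modules: financial code, financial professional knowledge, non-reasoning financial business knowledge, and reasoning financial business knowledge. It supports core financial scenarios such as banking, funds, and securities. The distillation framework is based on DeepSeek-R1, and chains of thought are screened with a two-round "answer + reasoning" quality scoring method: the first round scores answer accuracy using rule matching and Qwen2.5-72B-Instruct, and the second round checks the reasoning chain in depth for logical consistency, terminology compliance, and related properties to ensure data quality.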

Data Distillation

During distillation, we configured the process strictly according to the details officially provided for DeepSeek-R1.

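Data Screening

Because financial data has a complex structure, we screen chains of thought with a two-round "answer + reasoning logic" quality scoring method. The first round scores answer accuracy using rule matching and Qwen2.5-72B-Instruct. The second round checks the reasoning chain in depth for logical consistency, terminology compliance, and related properties. After each round, the data is labeled good or bad:

1) Answer scoring: For the distilled data, objective questions (such as multiple-choice and true/false questions) are checked for correctness with rule-based matching. For results that cannot be matched by rules, Qwen2.5-72B-Instruct scores the model-generated answer against the correct answer: 1 point if correct, 0 points if incorrect.

2) Reasoning process scoring: For chains of thought that passed the previous step, Qwen2.5-72B-Instruct is used again to score the reasoning trace: 1 point for high-quality data, 0 points for low-quality data. Scoring uses the following criteria:

1. Internal consistency: Check whether the steps of the reasoning process are consistent with one another and lead logically, step by step, to the reference answer.

2. Term overlap: Check how much the terminology used in the reasoning overlaps with the terminology in the reference answer. Higher overlap is better.

3. Number of reasoning steps: Assess whether the reasoning contains enough steps (at least 3).

4. Logical consistency: Ensure that the steps of the reasoning are logically consistent with the reference answer, and check for obvious errors or omissions.

5. Content diversity: Check whether the reasoning contains a large number of repeated steps.

6. Relevance to the task domain: Check whether the reasoning involves content related to the task domain (task domain: {task_domain}). Reasoning that reflects the task domain receives a higher score.

7. Consistency with the task instruction: Check whether the reasoning is closely related to the task instruction. Higher relevance is better, and reasoning that fully matches the task instruction receives a higher score.

Data labeled good in both rounds is used as high-quality CoT data for SFT. Data that failed screening and was labeled bad is used as reasoning QA data for reinforcement learning (RL).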
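
The two-round screening described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the judges are passed in as plain callables (in the described setup both roles are played by Qwen2.5-72B-Instruct), the stub judges and the label set are assumptions made for the example, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

AnswerJudge = Callable[[str, str], int]     # (predicted, gold) -> 0 or 1
ReasoningJudge = Callable[[str, str], int]  # (reasoning, gold) -> 0 or 1

OBJECTIVE_LABELS = {"A", "B", "C", "D", "TRUE", "FALSE"}  # assumed label set


def score_answer(predicted: str, gold: str, judge: AnswerJudge) -> int:
    """Round 1: rule matching for objective questions, judge otherwise."""
    pred, ref = predicted.strip().upper(), gold.strip().upper()
    if pred in OBJECTIVE_LABELS and ref in OBJECTIVE_LABELS:
        return int(pred == ref)
    return judge(predicted, gold)


def label_sample(sample: Dict[str, str], answer_judge: AnswerJudge,
                 reasoning_judge: ReasoningJudge) -> str:
    """Return 'good' only if both the answer and the reasoning score 1."""
    if score_answer(sample["predicted"], sample["gold"], answer_judge) != 1:
        return "bad"
    if reasoning_judge(sample["reasoning"], sample["gold"]) != 1:
        return "bad"
    return "good"


def split_for_training(samples: List[Dict[str, str]], answer_judge: AnswerJudge,
                       reasoning_judge: ReasoningJudge) -> Tuple[list, list]:
    """Good samples go to the SFT set, bad samples to the RL set."""
    sft, rl = [], []
    for s in samples:
        target = sft if label_sample(s, answer_judge, reasoning_judge) == "good" else rl
        target.append(s)
    return sft, rl


# Stub judges for demonstration only; a real setup would call an LLM.
def stub_answer_judge(predicted: str, gold: str) -> int:
    return int(gold.lower() in predicted.lower())


def stub_reasoning_judge(reasoning: str, gold: str) -> int:
    return int(len(reasoning.splitlines()) >= 3)  # mimics the "at least 3 steps" criterion


SAMPLES = [
    {"id": "s1", "predicted": "B", "gold": "B",
     "reasoning": "Step 1: read the table\nStep 2: compute the ratio\nStep 3: pick B"},
    {"id": "s2", "predicted": "A", "gold": "B",
     "reasoning": "Step 1: guess\nStep 2: guess\nStep 3: pick A"},
    {"id": "s3", "predicted": "The margin is 12%", "gold": "12%",
     "reasoning": "margin = profit / revenue"},
]

sft_data, rl_data = split_for_training(SAMPLES, stub_answer_judge, stub_reasoning_judge)
print("SFT:", [s["id"] for s in sft_data], "RL:", [s["id"] for s in rl_data])
```

Here `s2` is rejected in round 1 by rule matching, while `s3` passes the answer check through the judge but is rejected in round 2 because its reasoning is too short.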

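The distribution of Fin-R1-Data is as follows:

Fin-R1-Data spans multi-dimensional expertise in Chinese and English financial verticals and is divided by task content into four modules: financial code, financial professional knowledge, non-reasoning financial business knowledge, and reasoning financial business knowledge. It supports core financial scenarios such as banking, securities, and trusts.

Dataset	Samples
ConvFinQA-R1-Distill	7629
Finance-Instruct-500K-R1-Distill	11300
FinCUGE-R1-Distill	2000
FinQA-R1-Distill	2948
TFNS-R1-Distill	2451
FinanceIQ-R1-Distill	2596
FinanceQT-R1-Distill	152
Ant_Finance-R1-Distill	1548
FinCorpus-R1-Distill	29288
FinPEE-R1-Distill	179
Total	60091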
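
As a quick sanity check on the table, the short script below sums the per-dataset counts and prints each dataset's share of the total. The numbers are copied from the table; nothing else is assumed.

```python
COUNTS = {
    "ConvFinQA-R1-Distill": 7629,
    "Finance-Instruct-500K-R1-Distill": 11300,
    "FinCUGE-R1-Distill": 2000,
    "FinQA-R1-Distill": 2948,
    "TFNS-R1-Distill": 2451,
    "FinanceIQ-R1-Distill": 2596,
    "FinanceQT-R1-Distill": 152,
    "Ant_Finance-R1-Distill": 1548,
    "FinCorpus-R1-Distill": 29288,
    "FinPEE-R1-Distill": 179,
}

total = sum(COUNTS.values())
shares = {name: n / total for name, n in COUNTS.items()}

# Print datasets from largest to smallest with their share of the total.
for name, n in sorted(COUNTS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<34}{n:>7}{shares[name]:>8.1%}")
print(f"{'Total':<34}{total:>7}")
```

The counts add up to the stated total, and FinCorpus-R1-Distill alone accounts for close to half of the samples.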

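🚀 Fine-Tuning and Training

Two-Stage Pipeline

For complex reasoning tasks in finance, we fine-tune Qwen2.5-7B-Instruct in two stages to obtain the financial reasoning model Fin-R1. First, SFT (Supervised Fine-Tuning) on high-quality financial reasoning data gives the model an initial boost in financial reasoning. Then, reinforcement learning based on the GRPO (Group Relative Policy Optimization) algorithm, combined with a format reward and an accuracy reward, further improves accuracy and generalization on financial reasoning tasks.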
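
The core idea of GRPO is that each sampled response is judged relative to the other responses drawn for the same prompt, so no separate value model is needed. A minimal sketch of that group-relative normalization is shown below. It uses the population standard deviation and a small `eps` for numerical stability; both are choices made for this example, and implementations differ on such details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one prompt's sampled group to zero mean, unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four responses sampled for the same prompt (toy values).
group_rewards = [2.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

Responses that score above the group mean receive a positive advantage and are reinforced, and those below it are discouraged. If every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.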

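Stage 1: Reasoning Ability Injection

To handle the complex reasoning in financial reasoning tasks, the first stage applies supervised fine-tuning to Qwen2.5-7B-Instruct using the ConvFinQA and FinQA financial datasets. After one round of fine-tuning, the model can understand and work through complex financial reasoning problems.

Stage 2: Reinforcement Learning Optimization

Once the model has acquired complex reasoning skills, we use GRPO (Group Relative Policy Optimization) as the core framework, with a dual reward mechanism that optimizes both the format and the accuracy of the model output. On top of this, we introduce a model-based verifier that uses Qwen2.5-Max to evaluate answers. This corrects the bias that a purely regex-based reward may introduce and yields a more precise and reliable reward signal, improving the effectiveness and stability of reinforcement learning.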
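
The dual reward can be sketched as a format check plus an accuracy check that delegates to a verifier. The `<think>`/`<answer>` template, the 0/1 reward values, and the stub verifier below are all assumptions made for illustration; the text above only states that a format reward and an accuracy reward are combined and that Qwen2.5-Max acts as the verifier.

```python
import re
from typing import Callable

# Assumed output template: reasoning in <think>, final answer in <answer>.
FORMAT_RE = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

Verifier = Callable[[str, str], bool]  # (answer, reference) -> is it correct?


def format_reward(output: str) -> float:
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.fullmatch(output.strip()) else 0.0


def accuracy_reward(output: str, reference: str, verifier: Verifier) -> float:
    """1.0 if the extracted answer is judged correct by the verifier."""
    match = FORMAT_RE.fullmatch(output.strip())
    if not match:
        return 0.0
    return 1.0 if verifier(match.group(1).strip(), reference) else 0.0


def total_reward(output: str, reference: str, verifier: Verifier) -> float:
    return format_reward(output) + accuracy_reward(output, reference, verifier)


def stub_verifier(answer: str, reference: str) -> bool:
    """Stand-in for a model-based verifier: compares numeric values."""
    def to_number(text: str) -> float:
        return float(re.sub(r"[^\d.\-]", "", text))
    try:
        return abs(to_number(answer) - to_number(reference)) < 1e-9
    except ValueError:
        return answer.strip() == reference.strip()


good = "<think>rate = 1 / 8</think>\n<answer>12.5%</answer>"
wrong = "<think>rate = 1 / 4</think>\n<answer>25%</answer>"
unformatted = "The answer is 12.5%"

for output in (good, wrong, unformatted):
    print(total_reward(output, "12.50 %", stub_verifier))
```

An exact string comparison would reject `12.5%` against the reference `12.50 %`. A verifier that understands equivalent answers avoids this kind of false negative, which is the motivation given above for moving beyond a purely regex-based reward.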

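🚨 Model Evaluation Results

We evaluated the model on benchmarks covering multiple financial business scenarios. Fin-R1-SFT, which received only instruction fine-tuning (SFT), already improves on the base model in financial scenarios but still trails DeepSeek-R1. We therefore continued with reinforcement learning on top of Fin-R1-SFT. The resulting Fin-R1 (SFT plus RL) shows a clear advantage at a lightweight 7B parameter scale. Its average score of 75.2 ranks second overall, ahead of every evaluated model of the same scale. It is only 3.0 points behind the industry benchmark DeepSeek-R1 and 6.0 points ahead of DeepSeek-R1-Distill-Llama-70B (69.2). Fin-R1 also ranks first among the evaluated models on two key tasks, FinQA (numerical reasoning over real financial tables) and ConvFinQA (multi-turn reasoning), with scores of 76.0 and 85.0 respectively, demonstrating strong capability in both reasoning and non-reasoning financial scenarios.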

Model	Parameters	FinQA	ConvFinQA	Ant_Finance	TFNS	Finance-Instruct-500k	Average
DeepSeek-R1	671B	71.0	82.0	90.0	78.0	70.0	78.2
Fin-R1	7B	76.0	85.0	81.0	71.0	62.9	75.2
Qwen-2.5-32B-Instruct	32B	72.0	78.0	84.0	77.0	58.0	73.8
DeepSeek-R1-Distill-Qwen-32B	32B	70.0	72.0	87.0	79.0	54.0	72.4
Fin-R1-SFT	7B	73.0	81.0	76.0	68.0	61.0	71.9
Qwen-2.5-14B-Instruct	14B	68.0	77.0	84.0	72.0	56.0	71.4
DeepSeek-R1-Distill-Llama-70B	70B	68.0	74.0	84.0	62.0	56.0	69.2
DeepSeek-R1-Distill-Qwen-14B	14B	62.0	73.0	82.0	65.0	49.0	66.2
Qwen-2.5-7B-Instruct	7B	60.0	66.0	85.0	68.0	49.0	65.6
DeepSeek-R1-Distill-Qwen-7B	7B	55.0	62.0	71.0	60.0	42.0	58.0
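
The Average column can be cross-checked against the per-task scores. The sketch below copies the table's values, recomputes each mean, and flags rows where the recomputed mean differs from the reported average by more than 0.05. The tolerance is an arbitrary choice for this example.

```python
# (per-task scores, reported average), copied from the table above.
RESULTS = {
    "DeepSeek-R1": ([71.0, 82.0, 90.0, 78.0, 70.0], 78.2),
    "Fin-R1": ([76.0, 85.0, 81.0, 71.0, 62.9], 75.2),
    "Qwen-2.5-32B-Instruct": ([72.0, 78.0, 84.0, 77.0, 58.0], 73.8),
    "DeepSeek-R1-Distill-Qwen-32B": ([70.0, 72.0, 87.0, 79.0, 54.0], 72.4),
    "Fin-R1-SFT": ([73.0, 81.0, 76.0, 68.0, 61.0], 71.9),
    "Qwen-2.5-14B-Instruct": ([68.0, 77.0, 84.0, 72.0, 56.0], 71.4),
    "DeepSeek-R1-Distill-Llama-70B": ([68.0, 74.0, 84.0, 62.0, 56.0], 69.2),
    "DeepSeek-R1-Distill-Qwen-14B": ([62.0, 73.0, 82.0, 65.0, 49.0], 66.2),
    "Qwen-2.5-7B-Instruct": ([60.0, 66.0, 85.0, 68.0, 49.0], 65.6),
    "DeepSeek-R1-Distill-Qwen-7B": ([55.0, 62.0, 71.0, 60.0, 42.0], 58.0),
}

TOLERANCE = 0.05  # arbitrary threshold for this check

recomputed = {model: sum(scores) / len(scores) for model, (scores, _) in RESULTS.items()}
mismatches = {
    model: (round(recomputed[model], 2), reported)
    for model, (_, reported) in RESULTS.items()
    if abs(recomputed[model] - reported) > TOLERANCE
}

for model, (calc, reported) in mismatches.items():
    print(f"{model}: recomputed {calc} vs reported {reported}")
```

Most rows agree with the displayed scores. For the rows that are flagged, the reported average is kept as published. The per-task scores shown may be rounded, but the source does not say.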

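Statement and Future Outlook

This project was completed by the Financial Large Language Model research group (SUFE-AIFLM-Lab) of the School of Statistics and Data Science at Shanghai University of Finance and Economics, together with 财跃星辰. As a reasoning-oriented large language model for finance, Fin-R1 handles many financial tasks well and can provide professional assistance, but it still has technical bottlenecks and application limits at this stage. Its suggestions and analyses are for reference only and are not a substitute for the judgment of professional financial analysts or experts. We sincerely hope users will examine the model's output critically and combine it with their own expertise and experience when making decisions. Going forward, we will continue to improve Fin-R1 and explore its potential in frontier financial scenarios, helping the financial industry become more intelligent and compliant.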

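📫 Contact Us

We invite industry peers to explore new paradigms for the deep integration of AI and finance and to build a new smart-finance ecosystem together. Contact us by email at [email protected].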

GGUF

Model size: 7.62B params
Architecture: qwen2
