Models need feedback on what makes outputs “good” or “bad.” Policy optimization (PO) turns preferences and rewards into actual training signals. This field is evolving quickly, moving far beyond classics like PPO and GRPO. Here's our overview of the 10 newest PO methods:
3. DCPO (Dynamic Clipping Policy Optimization) → DCPO: Dynamic Clipping Policy Optimization (2509.02333) Uses dynamic clipping, which adjusts probability limits per token for better token exploration, and smooth reward standardization to balance rewards over training steps and prevent wasted updates
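To make the dynamic-clipping idea concrete, here's a minimal sketch. It is our illustration, not DCPO's exact rule: the function names, the widening formula, and the constants `eps`/`alpha` are assumptions. The intuition it shows: rare tokens (low old probability) get a wider clip range so they keep receiving gradient signal, while confident tokens fall back to standard PPO-style bounds.

```python
# Hypothetical sketch of per-token dynamic clipping (not the paper's exact
# rule): widen the clip range for low-probability tokens so they still get
# a learning signal, instead of one fixed epsilon for every token.

def dynamic_clip_bounds(old_prob: float, eps: float = 0.2, alpha: float = 0.5):
    """Return (lower, upper) clip bounds for a single token.

    Rare tokens (small old_prob) get extra slack; confident tokens
    (old_prob near 1) reduce to the standard 1 - eps / 1 + eps bounds.
    """
    widen = alpha * (1.0 - old_prob)  # extra slack for rare tokens
    return 1.0 - (eps + widen), 1.0 + (eps + widen)

def clipped_ratio(new_prob: float, old_prob: float) -> float:
    """Importance ratio clipped to the token's dynamic bounds."""
    ratio = new_prob / old_prob
    lo, hi = dynamic_clip_bounds(old_prob)
    return max(lo, min(hi, ratio))
```

With `old_prob = 1.0` the bounds are the familiar (0.8, 1.2); with `old_prob = 0.1` they widen to roughly (0.35, 1.65), so a rare token's update is clipped less aggressively.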
4. ARPO (Agentic Reinforced Policy Optimization) → Agentic Reinforced Policy Optimization (2507.19849) Optimizes multi-turn LLM agents that use external tools. It uses an entropy-based adaptive rollout to explore after tool calls and an advantage attribution method to better assign credit across steps, leading to more efficient tool use with fewer resources
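A toy sketch of the entropy-based branching idea, under our own simplifying assumptions (the helper names, the branch counts, and the threshold are illustrative, not ARPO's actual values): after a tool call, spawn extra rollout branches only when the model's next-token distribution is uncertain, so the exploration budget goes where it matters.

```python
import math

# Illustrative sketch: branch additional rollouts after a tool call only
# when the post-tool token distribution has high entropy (i.e., the model
# is uncertain about how to continue). Constants are assumptions.

def entropy(probs):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def branches_to_spawn(probs, base=1, extra=3, threshold=1.0):
    """Spawn `extra` branches on top of `base` when entropy is high."""
    return base + (extra if entropy(probs) > threshold else 0)
```

A uniform distribution over four continuations (entropy ≈ 1.39 nats) triggers extra branching; a sharply peaked one does not.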
5. GRPO-RoC (Group Relative Policy Optimization with Resampling-on-Correct) → rStar2-Agent: Agentic Reasoning Technical Report (2508.20722) Oversamples rollouts, then resamples them to keep diverse mistakes and only the highest-quality correct answers. This reduces noise and produces stronger reasoning in a code environment
✨ Efficiency leads the month - At scale: optimizing compute use in massive MoE models e.g. DeepSeek v3.1 - In small models: lightweight & deployable e.g. MiniCPM V 4.5, Step Audio 2-mini, Intern S1-mini, Ovis2.5-9B, etc.
✨ Reasoning + Agentic wave 🌊 Not just demos, but real product use cases. - Meituan, DeepSeek: large-scale models tuned for reasoning & tools - Qwen, GLM, InternLM: multimodal reasoning + agentic interaction - CodeAgent, Prover, Baichuan-M2-32B: domain-focused (coding, logic, specialized reasoning)
✨ Open source is exploding across all types of companies!! - Big tech: Tencent, ByteDance, Xiaomi, Kuaishou, Alibaba/Qwen, Skywork, Ant Group - Startups: DeepSeek (yes, still a startup!), Zhipu, Baichuan, StepFun, OpenBMB - New entrants: Meituan, RedNote - Research labs: Shanghai AI Lab (InternLM, OpenGVLab)
✨ Open source was explicitly mentioned in the State Council’s new guidance on deepening the "AI+" strategy. - Open-source: support communities, encourage contributions (incl. university credits & recognition), foster new application approaches, and build globally impactful ecosystems 👀
💡 The Chinese community didn’t slow down at all in August 🤯 September, the last month before the Golden Week holiday, may bring even more surprises.
✨ Supports 33 languages, including 5 ethnic minority languages in China 👀 ✨ Including a translation ensemble model: Chimera-7B ✨ Full pipeline: pretrain > CPT > SFT > enhancement > ensemble refinement > SOTA performance at similar scale
Everyone is buzzing around image generation this week, or more specifically, Google's Nano-Banana. So today we want to share a list of models that can be your great toolkit for image generation + editing + multi-turn refinement.
1. Gemini 2.5 Flash Image, or Nano-Banana → https://deepmind.google/models/gemini/image/ Google’s newest image model with conversational editing, character consistency, and multi-image fusion. Available in AI Studio and the Gemini API. Price: $2.50 per 1M tokens
2. FLUX (Black Forest Labs) → https://bfl.ai/ A family of models known for rich detail, excellent prompt adherence, and fast iterative generation. Offered in several variants, from Pro to open-source, it's accessible via Hugging Face, Replicate, Azure AI Foundry, etc., and used as a base in many pipelines. Price: $0.025-0.08 per image
3. Midjourney v7 → https://www.midjourney.com/ Enhanced image fidelity, prompt comprehension, and anatomical coherence (hands, bodies, objects) + provides a smart lightbox editor. The Omni-reference tool improves character and object consistency in your images. It remains accessible via Discord with a supporting web interface. Price: $10-60/month
4. Stable Diffusion 3.5 (Stability AI) → https://stability.ai/stable-image Open-weights line with improved text rendering, photorealism, and prompt adherence compared to earlier versions. It introduces technical innovations through its MMDiT architecture. Price: $0.025-0.065 per image
5. OpenAI GPT-Image-1 → https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1 It's the same multimodal model that powers ChatGPT's image capabilities, offering high-fidelity image generation, precise edits (including inpainting), and accurate text rendering. Available via the Images API. Price: $40 per 1M tokens
MiniCPM-V 4.5 🚀 New MLLM for image, multi-image & video understanding, running even on your phone, released by OpenBMB openbmb/MiniCPM-V-4_5
✨ SOTA vision language capability ✨ 96× video token compression > high-FPS & long video reasoning ✨ Switchable fast vs deep thinking modes ✨ Strong OCR, document parsing, supports 30+ languages
✨ 36B - Base & Instruct ✨ Apache 2.0 ✨ Native 512K long context ✨ Strong reasoning & agentic intelligence ✨ 2 Base versions: with & without synthetic data
Sharing some free, useful resources for you. In this collection, we’ve gathered the most recent books to give you up-to-date information on key fundamental topics. Hope this helps you master AI and machine learning:
1. Machine Learning Systems by Vijay Janapa Reddi → https://www.mlsysbook.ai/ Provides a framework for building effective ML solutions, covering data engineering, optimization, hardware-aware training, inference acceleration, architecture choice, and other key principles
2. Generative Diffusion Modeling: A Practical Handbook by Zihan Ding, Chi Jin → https://arxiv.org/abs/2412.17162 Offers a unified view of diffusion models: probabilistic, score-based, consistency, rectified flow, pre/post-training. It aligns notations with code to close the “paper-to-code” gap.
3. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges → https://arxiv.org/abs/2104.13478 Explores unified geometric principles to analyze neural network architectures (CNNs, RNNs, GNNs, Transformers) and guide the design of future ones
4. Mathematical Foundations of Geometric Deep Learning by Haitz Saez de Ocariz Borde and Michael Bronstein → https://arxiv.org/abs/2508.02723 Dives into the key math concepts behind geometric deep learning: geometric and analytical structures, vector calculus, differential geometry, etc.
5. Interpretable Machine Learning by Christoph Molnar → https://github.com/christophM/interpretable-ml-book Practical guide to simple, transparent models (e.g., decision trees) and model-agnostic methods like LIME, Shapley values, permutation importance, and accumulated local effects.
6. Understanding Deep Learning by Simon J.D. Prince → https://udlbook.github.io/udlbook/ Explores core deep learning concepts: models, training, evaluation, RL, and architectures for images, text, and graphs, while addressing open theoretical questions
World models are one of the most challenging areas in AI, pushing the boundaries of reasoning, perception, and planning. They're gen AI systems that help models and agents learn internal representations of real-world environments.
Today, we invite you to take a look at 12 standout examples:
1. WorldVLA → WorldVLA: Towards Autoregressive Action World Model (2506.21539) This autoregressive world model integrates action prediction and visual world modeling in a single framework, allowing each to enhance the other. It introduces an attention masking strategy to reduce action prediction errors
2. SimuRA → https://arxiv.org/abs/2507.23773 A generalized agent architecture that uses a language-based world model to simulate and plan actions before execution, enabling more general and flexible reasoning
3. PAN (Physical, Agentic, and Nested) world models → Critiques of World Models (2507.05169) Has a hybrid architecture that combines discrete concept-based reasoning (via LLMs) with continuous perceptual simulation (via diffusion models), enabling rich multi-level, multimodal understanding and prediction
5. WorldMem → WORLDMEM: Long-term Consistent World Simulation with Memory (2504.12369) Uses a memory bank with attention over time-stamped frames and states to maintain long-term and 3D spatial consistency in scene generation. This lets it reconstruct past scenes and simulate dynamic world changes across large temporal gaps
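To illustrate memory-bank attention over time-stamped entries, here's a self-contained sketch (not WorldMem's actual architecture; the `decay` recency bias and all names are our assumptions): the current query attends over stored frame embeddings, so states from far in the past can still be retrieved when they match.

```python
import math

# Illustrative sketch: attend over a memory bank of time-stamped frame
# embeddings so the current query retrieves consistent past scene states
# across large temporal gaps. `decay` adds a simple optional recency bias.

def memory_attention(query, mem_keys, mem_values, timestamps, decay=0.0):
    """query: list[d]; mem_keys/mem_values: list of list[d]; timestamps: list."""
    dim = len(query)
    newest = max(timestamps)
    scores = [
        sum(k * q for k, q in zip(key, query)) / math.sqrt(dim)
        - decay * (newest - t)  # older entries optionally down-weighted
        for key, t in zip(mem_keys, timestamps)
    ]
    peak = max(scores)  # subtract max for numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of memory values = the retrieved state
    return [sum(w * v[i] for w, v in zip(weights, mem_values))
            for i in range(len(mem_values[0]))]
```

With `decay=0` this is plain scaled dot-product attention over the bank; a positive `decay` trades long-range recall for recency.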
✨ The multimodal wave🌊 - GLM-4.1V-Thinking: Image+Text > Text - Intern-S1: Image+Text > Text - Wan 2.2: Text+Image > Video - Skywork-R1V3: Image+Text > Text - Skywork-UniPic: Text > Image / Image > Text - Tar-7B: Any-to-Any - Ming-Lite-Omni-1.5: Any-to-Any - Step3: Image+Text > Text - HunyuanWorld-1: Image > 3D - ThinkSound: Video > Audio - Neta-Lumina: Text > Image
✨ Big month not only for models, but for policy too🏛️ - Announced Global Action Plan for AI Governance - Proposes to set up a World AI Cooperation Organization in Shanghai - Released International AI Open Source Collaboration Initiative - Published Risk Assessment Guidelines for Endpoint AI Agents
✨ Big event - WAIC - 355K offline visitors - 108 new releases in 4 days - 145 sessions across key domains
I’ve been tracking things closely, but July’s open-source wave still blew me away. Can’t wait to see what’s coming next! 🚀