meraGPT (meraGPT)

codelion

posted an update 6 days ago

Post

3184

Implemented Test-Time Diffusion Deep Researcher (TTD-DR) in OptiLLM! 🚀

Just shipped a game-changing feature that turns any LLM into a powerful research agent. TTD-DR applies diffusion-inspired techniques to iteratively refine research reports while grounding them in real web sources.

How it works:
• Generates initial draft
• Identifies knowledge gaps
• Searches web for missing info
• Iteratively refines through "denoising" steps
• Produces comprehensive reports with 15-30+ sources

The magic? It works with ANY model so you can choose your favorite open-source models on HF!

Key results:
- 47 complex research queries tested
- Every report backed by real web sources
- Quality rivals human research analysts
- No more hallucinations on current events!

Try it:
pip install optillm
Then use "deep_research-your-model-name" as the model identifier

- Implementation: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research
- Paper: https://arxiv.org/abs/2507.16075v1
- Sample reports: https://github.com/codelion/optillm/tree/main/optillm/plugins/deep_research/sample_reports

Special thanks to the TTD-DR paper authors for this brilliant approach!

#research #llm #opensource #inference

codelion

posted an update 9 days ago

Post

1544

New research: Understanding how different LLMs approach reasoning through "thought anchors"

I just published a comparative study analyzing the reasoning patterns of Qwen3-0.6B vs DeepSeek-R1-Distill-1.5B using thought anchors - critical sentences that significantly impact task success probability.

Key findings:
- DeepSeek-R1: Uses concentrated reasoning with fewer, high-impact steps (0.408 avg impact)
- Qwen3: Employs distributed reasoning spreading impact across multiple steps (0.278 avg impact)
- Different risk-reward profiles: DeepSeek more consistent (82.7% positive steps), Qwen3 more exploratory (71.6% positive)

This reveals different cognitive architectures rather than simple performance differences. The models optimize for different reasoning strategies - consistency vs exploration.

Both datasets are now available on HF:
- Qwen3 thought anchors: codelion/Qwen3-0.6B-pts-thought-anchors
- DeepSeek-R1 thought anchors: codelion/DeepSeek-R1-Distill-Qwen-1.5B-pts-thought-anchors

Built using our open-source PTS library for mechanistic interpretability analysis. All methodology is fully reproducible.

Full article: https://huggingface.co/blog/codelion/understanding-model-reasoning-thought-anchors

What reasoning patterns have you noticed in your model experiments? Would love to hear about other architectures showing similar cognitive diversity!

codelion

posted an update 11 days ago

Post

435

New SOTA for 26-circle packing problem! ypwang61 achieved 2.635977 sum of radii using OpenEvolve evolutionary optimization framework.

Progress: AlphaEvolve originally reported 2.635 in their paper, OpenEvolve made improvements, and now we have this new record at 2.635977.

The solution uses multi-stage optimization with specialized pattern initialization and enhanced penalty functions. Circle packing is a notoriously hard optimization problem where these small improvements actually represent significant algorithmic advances.

Great example of how evolutionary algorithms can push boundaries in computational geometry optimization. The implementation and results are shared openly on GitHub.

Link: https://github.com/codelion/openevolve/issues/156

codelion

updated a Space about 1 month ago

README

👀

codelion

posted an update about 1 month ago

Post

2369

🚀 Just published: "OpenEvolve: Open-Source Evolutionary Code Optimization with Real-World GPU Kernel Discovery"

We built the first open-source implementation of Google's AlphaEvolve system and used it to automatically discover GPU kernel optimizations that outperform human engineers!

Key results:

- 21.8% average decode speed improvement on Apple Silicon
- 36.7% improvement on long-context transformer attention
- Discovered novel vectorization patterns and 2-pass softmax algorithm

The system evolved a Metal kernel for Qwen3's Grouped Query Attention from a basic 3-pass implementation into something with sophisticated Apple Silicon optimizations that would take experts months to discover manually. The evolved kernel automatically found the optimal vec<T,8> operations for 128-dim attention heads and fused softmax computation with value accumulation.

Really excited about the potential here - imagine evolutionary algorithms automatically discovering optimizations across all our AI infrastructure. What would you want to optimize with this approach?

Full write-up: https://huggingface.co/blog/codelion/openevolve-gpu-kernel-discovery

GitHub: https://github.com/codelion/openevolve

#AI #MachineLearning #GPU #OpenSource #Evolution #CodeOptimization #TransformerOptimization

1 reply

·

codelion

posted an update about 1 month ago

Post

2544

Adaptive Classifier: Dynamic Text Classification with Strategic Learning

New text classification system that learns continuously without catastrophic forgetting. Achieved 22.2% robustness improvement on adversarial datasets while maintaining clean data performance.

🎯 THE PROBLEM
Traditional classifiers require complete retraining when adding new classes. Expensive and time-consuming, especially with adversarial users trying to game the system.

🚀 KEY INNOVATIONS
• Hybrid memory-neural architecture (prototype-based + neural adaptation)
• Strategic classification using game theory to predict and defend against manipulation
• Elastic Weight Consolidation prevents catastrophic forgetting

📊 RESULTS
Tested on AI-Secure/adv_glue dataset:
• Clean data: 80.0% → 82.2% (+2.2%)
• Manipulated data: 60.0% → 82.2% (+22.2%)
• Zero performance drop under adversarial attacks

🔬 APPLICATIONS
• Hallucination detection: 80.7% recall for RAG safety
• LLM routing: 26.6% cost optimization improvement
• Content moderation: Robust against gaming attempts

⚙️ USAGE
pip install adaptive-classifier

from adaptive_classifier import AdaptiveClassifier
classifier = AdaptiveClassifier("bert-base-uncased")
classifier.add_examples(texts, labels)
predictions = classifier.predict("New text")

🔗 RESOURCES
Blog: https://huggingface.co/blog/codelion/adaptive-classifier
Code: https://github.com/codelion/adaptive-classifier
Models:

adaptive-classifier

Available models: llm-hallucination-detector, llm-config-optimizer, llm-router

Works with any HuggingFace transformer. Fully open source and production-ready!

codelion

posted an update about 1 month ago

Post

1602

DeepThink Plugin: Bringing Gemini 2.5's Parallel Reasoning to Open Models

Just released an open-source plugin that implements Google's "Deep Think" reasoning approach for models like DeepSeek R1, Qwen3, and other open models.

Google's recent Gemini 2.5 report introduced Deep Think - a technique where models generate multiple hypotheses in parallel and critique them before arriving at final answers. It achieves SOTA results on math olympiads and competitive coding benchmarks.

Our implementation works by modifying the inference pipeline to explore multiple solution paths simultaneously, then synthesizing the best approach. Instead of single-pass generation, models run an internal debate before responding.

Key features:
- Works with any model that supports structured reasoning patterns
- Implements parallel thinking during response generation
- Particularly effective for complex reasoning tasks, math, and coding problems
- Increases inference time but significantly improves answer quality

The plugin won the Cerebras & OpenRouter Qwen 3 Hackathon, validating that this approach works well beyond Google's proprietary implementation.

GitHub: https://github.com/codelion/optillm/tree/main/optillm/plugins/deepthink
Demo: https://www.youtube.com/watch?v=b06kD1oWBA4

The goal is democratizing advanced reasoning capabilities that were previously locked behind APIs. Perfect for researchers and practitioners working with local deployments who want enhanced reasoning without dependency on proprietary services.

Performance notes: Currently about 2-3x slower inference but much better results on complex problems. Working on adaptive triggering to only activate when problems benefit from parallel reasoning.

Would love feedback from the HF community and collaborations on optimizing the approach further. Open to PRs and always interested in making open models more capable.

codelion

posted an update about 2 months ago

Post

2042

New Research: Theoretical Foundations for In-Context Learning in Transformers

I'm excited to share our latest theoretical work that formally proves an interesting property of large language models: base transformer models can approximate fine-tuned capabilities using only inference-time techniques like in-context learning.

The core question we investigated: Can specialized behaviors typically acquired through expensive supervised fine-tuning be elicited from base models without any parameter updates?

Our theoretical contribution: We provide a formal proof, grounded in the Turing completeness of transformers, showing that this is indeed possible under certain assumptions. The work establishes mathematical bounds on the minimal dataset sizes needed for approximation.

Key theoretical results:

- For text generation tasks: O(mV/ε²) examples suffice (where m = number of contexts, V = vocabulary size, ε = error tolerance)
- For linear classification: O(d/ε) examples (where d = input dimension)
- Extensions to finite context scenarios with practical bounds

This work helps explain why techniques like few-shot prompting, retrieval-augmented generation, and in-context learning work so effectively in practice. It bridges formal computer science theory with empirical observations about modern language models.

While the assumptions are idealized (unbounded computational resources, full dataset access), the results provide mathematical foundations for understanding inference-time adaptation strategies that are increasingly important in AI deployment.

Paper: Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques (2506.08060)

1 reply

·

codelion

authored 2 papers about 2 months ago

Humanity's Last Exam

Paper • 2501.14249 • Published Jan 24 • 75

Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques

Paper • 2506.08060 • Published Jun 9 • 8

codelion

posted an update 2 months ago

Post

3434

🧠 We just implemented Andrej Karpathy's "third paradigm" for LLM learning!

System Prompt Learning (SPL) enables LLMs to automatically learn problem-solving strategies from experience, rather than relying on static prompts.

🚀 How it works:
Your LLM builds a database of effective strategies, selects the best ones for each problem, and refines them over time based on success rates.

📊 Results across math benchmarks:
Arena Hard: 29% → 37.6% (+8.6%)
AIME24: 23.33% → 30% (+6.67%)
OptILLMBench: 61% → 65% (+4%)

The best part? All strategies are human-readable and the system gets progressively better at problem types you use frequently.

✨ Key benefits:
🔄 Cumulative learning over time
📖 Transparent, inspectable strategies
🔌 Works with any OpenAI-compatible API
⚡ Simple integration: just add "spl-" prefix to your model

Built as an open-source plugin in optillm. After 500 queries, our system developed 129 strategies and refined 97 of them!

This feels like a genuine step toward AI that learns from experience while staying completely interpretable.

🔗 GitHub: https://github.com/codelion/optillm/tree/main/optillm/plugins/spl
📖 Full article: https://huggingface.co/blog/codelion/system-prompt-learning
🐦 Original Karpathy tweet: https://x.com/karpathy/status/1921368644069765486

Have you experimented with advanced system prompting? What strategies would you want your LLM to learn?

codelion

posted an update 2 months ago

Post

2358

Introducing AutoThink: Adaptive reasoning for LLMs that improves performance by 43% on reasoning benchmarks!

Instead of using fixed thinking budgets, AutoThink:
- Classifies query complexity (HIGH/LOW) using adaptive classification
- Dynamically allocates thinking tokens based on complexity
- Uses steering vectors derived from Pivotal Token Search to guide reasoning patterns

Results on DeepSeek-R1-Distill-Qwen-1.5B:
- GPQA-Diamond: 31.06% vs 21.72% baseline (+9.34 points)
- MMLU-Pro: 26.38% vs 25.58% baseline (+0.8 points)
- Uses fewer tokens than baseline approaches

Works with any local reasoning model - DeepSeek, Qwen, Llama, custom models. The technique combines our research on Pivotal Token Search (PTS) implementation and adaptive classification frameworks.

Paper: AutoThink: efficient inference for reasoning LLMs
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327

Code and examples:
https://github.com/codelion/optillm/tree/main/optillm/autothink

PTS implementation and technical details:
https://github.com/codelion/pts
https://huggingface.co/blog/codelion/pts

Adaptive classifier framework:
https://github.com/codelion/adaptive-classifier

Would love to hear your thoughts on adaptive resource allocation for LLM reasoning! Have you experimented with similar approaches?

5 replies

·

codelion

posted an update 2 months ago

Post

2856

🧬 Hey everyone! Just released **OpenEvolve** - an open-source implementation of Google DeepMind's AlphaEvolve system.

It's an evolutionary coding agent that uses LLMs to discover and optimize algorithms. I successfully replicated DeepMind's results on circle packing (99.97% match!) and evolved a random search into a simulated annealing algorithm.

✨ Key features:
- Evolves entire codebases (not just single functions)
- Works with any OpenAI-compatible API
- LLM ensemble approach for better results
- Multi-objective optimization

👉 Check it out:
GitHub: https://github.com/codelion/openevolve
Blog post: https://huggingface.co/blog/codelion/openevolve

Would love to hear your thoughts or answer any questions about it!

codelion

posted an update 3 months ago

Post

2471

Introducing Pivotal Token Search (PTS): A new technique for targeted LLM alignment

Excited to share Pivotal Token Search (PTS), a technique for identifying and optimizing critical decision points in LLM generations!

GitHub repository: https://github.com/codelion/pts

What is PTS?
PTS helps identify specific "pivotal tokens" that dramatically shift the probability of a successful generation. Unlike traditional DPO which treats all tokens equally, PTS focuses optimization on the tokens that actually matter for success.

Inspired by Microsoft's recent Phi-4 paper (which used this technique to achieve SOTA reasoning with only 14B parameters), PTS is especially effective for:
- Mathematical reasoning
- Coding tasks
- Multi-step problem solving
- Any domain where specific decision points strongly impact outcomes

What we're releasing today: codelion/pivotal-token-search-68241145d8b8502122f3ce4f

1. Open-source code:
- Complete implementation of the PTS algorithm
- Data generation pipelines
- Usage examples and documentation

2. Huggingface resources:
- Datasets collection: https://huggingface.co/datasets?other=pts
* Pre-generated preference pairs for various domains
* Ready to use in your DPO training pipelines

- Models collection: https://huggingface.co/models?other=pts
* Pre-trained models fine-tuned with PTS
* Specialized versions for different reasoning tasks

The algorithm is straightforward to implement and can significantly improve your model's reasoning capabilities. Check out the repository for details on getting started!

We welcome feedback, contributions, and collaborations. Let us know if you use PTS in your projects!

codelion

updated a Space 6 months ago

7

MeraKB

📚

Store and chat with your documents

codelion

authored 3 papers 11 months ago

codelion

posted an update 11 months ago

Post

2327

We recently worked with OpenAI to fine-tune gpt-4o and built the SOTA model for the patched-codes/static-analysis-eval benchmark. All the code and data patched-codes/synth-vuln-fixes on how we did it is available on their GitHub - https://github.com/openai/build-hours/tree/main/5-4o_fine_tuning.

Here are some tips based on our experience:

→ Establish baseline with "conditioning" / prompting

→ Task-specific datasets are ideal for PEFT; hard to beat gpt-4o on "broad" tasks

→ Add your best system prompt to each example

→ Ensure training data distribution is similar to inference data

→ Shorten instructions with concise prompts; may require more examples.

→ Define clear evaluation metrics (seriously, please eval!)

You can see more details on the benchmark and process here - https://www.patched.codes/blog/the-static-analysis-evaluation-benchmark-measuring-llm-performance-in-fixing-software-vulnerabilities

codelion

posted an update about 1 year ago

Post

2888

A new paper titled "STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis" shows the benefits of integrating static analysis with LLMs. (https://arxiv.org/abs/2406.10018)

Authors evaluate 4 key questions:

- How does each static analysis integration strategy perform in LLM-based repository-level code completion?
> They found that integrating static analysis in the prompting phase (especially with file-level dependencies) can achieve the substantially larger improvements than other phases.

- How do different combinations of integration strategies affect LLM-based repository-level code completion?
> Languages that are easier to analyze like Java show more improvements compared to dynamic languages like Python.

- How do static analysis integration strategies perform when compared or combined with RAG in LLM-based repository-level code completion?
> Static analysis and RAG are complementary and boost the overall accuracy.

- What are the online costs of different integration strategies in LLM-based repository-level code completion?
> Combining prompting-phase static analysis and RAG is the best option for cost-effectiveness.

In my @owasp App Sec keynote last year, I had described how one can do static analysis augmented generation (SaAG) to boost the accuracy of LLM based patches for vulnerability remediation. (you can see the talk here - https://www.youtube.com/watch?v=Cw4-ZnUNVLs)

meraGPT

AI & ML interests

Recent Activity

README

Humanity's Last Exam

Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques

MeraKB

Patched MOA: optimizing inference for diverse software development tasks

Patched RTC: evaluating LLMs for diverse software development tasks

Evaluating Pre-trained Language Models for Repairing API Misuses

AI & ML interests

Recent Activity

Team members 1

meraGPT's activity

README

MeraKB