YerbaPage (Yuling)

reacted to their post with 👀👍🤗🔥 1 day ago

Latest work on SWE-Bench 🐛

Two new papers from SJTU & Huawei: powered by DeepSeek-V3, we've achieved a new SOTA on SWE-Bench!

We introduce two innovative approaches:
⚔️ SWE-Debate: AI agents compete and "debate" to generate the best code fix.
🧠 SWE-Exp: An AI agent learns from past repair "experience" to solve new issues more efficiently.

👇 Explore the future of software development:

SWE-Debate
📄 Paper: https://arxiv.org/abs/2507.23348
💻 Code: https://github.com/YerbaPage/SWE-Debate

SWE-Exp
📄 Paper: https://arxiv.org/abs/2507.23361
💻 Code: https://github.com/YerbaPage/SWE-Exp

reacted to their post with 🚀🤗👀🔥 18 days ago

Is your code written by a human or an AI? 🤖

With the rise of AI coding assistants, this question is more critical than ever. Our new tool, DetectCodeGPT, effectively identifies AI-generated code, outperforming SOTA methods with a 7.6% increase in AUC!

How? By analyzing unique stylistic and syntactic patterns in code, not just the text.

👇 Explore more:

Paper: https://arxiv.org/html/2401.06461v2

Code: https://github.com/YerbaPage/DetectCodeGPT

replied to their post 24 days ago

Hey! Your Gopilot Assist looks really cool! 🎮

Regarding tool calling formats - I actually use the standard OpenAI format that gets sent directly to the API server. Looking at my chat history structure, the overall format is:

{
  "timestamp": "...",
  "model": "...", // model name
  "temperature": 0.0,
  "message_count": 5, // number of messages
  "messages": [
    {
      "role": "system",
      "content": "..." // system prompt
    },
    {
      "role": "user", 
      "content": "..." // user message
    },
    {
      "role": "assistant",
      "content": "",
      "tool_calls": [
        {
          "id": "call_...",
          "type": "function",
          "function": {
            "name": "...",
            "arguments": "{\"command\":\"...\",\"description\":\"...\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Command: ...\nDescription: ...\nExit Code: 0\nStdout: ...\nStderr: ...",
      "tool_call_id": "call_..."
    },
    {
      "role": "user",
      "content": "..." // next user message
    }
  ],
  "tools_count": 10,
  "tools": ["run_shell_command", "list_directory", "read_file", ...]
}

So yeah, I just send this JSON structure directly to the API server - no custom parsing needed on my end.
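To make the structure above concrete, here is a minimal Python sketch that assembles a log of the same shape. The field names come from the example above; the helper names (`tool_call_turns`, `build_chat_log`) and all sample values are hypothetical and are not taken from the Terminal-Agent code. The main point it illustrates is that `arguments` is a JSON-encoded string rather than a nested object, so it has to be encoded on the way in and decoded on the way out.

```python
import json
from datetime import datetime, timezone


def tool_call_turns(call_id, name, arguments, tool_output):
    """Build the assistant turn that requests a tool and the tool turn that answers it."""
    assistant_turn = {
        "role": "assistant",
        "content": "",
        "tool_calls": [
            {
                "id": call_id,
                "type": "function",
                "function": {
                    "name": name,
                    # The arguments field is a JSON string, not a dict.
                    "arguments": json.dumps(arguments),
                },
            }
        ],
    }
    # The tool result is linked back to the request through tool_call_id.
    tool_turn = {"role": "tool", "content": tool_output, "tool_call_id": call_id}
    return assistant_turn, tool_turn


def build_chat_log(model, messages, tool_names, temperature=0.0):
    """Wrap a message list in the log structure shown above."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "temperature": temperature,
        "message_count": len(messages),
        "messages": messages,
        "tools_count": len(tool_names),
        "tools": tool_names,
    }


# Hypothetical example values, for illustration only.
assistant_turn, tool_turn = tool_call_turns(
    "call_demo",
    "run_shell_command",
    {"command": "ls", "description": "List files"},
    "Command: ls\nDescription: List files\nExit Code: 0\nStdout: README.md\nStderr: ",
)
messages = [
    {"role": "system", "content": "You are a terminal coding agent."},
    {"role": "user", "content": "What files are here?"},
    assistant_turn,
    tool_turn,
]
log = build_chat_log(
    "example-model", messages, ["run_shell_command", "list_directory", "read_file"]
)

# Reading a call back: decode the arguments string with json.loads.
decoded = json.loads(assistant_turn["tool_calls"][0]["function"]["arguments"])
print(decoded["command"])
```

Keeping the counts derived from the lists (`len(messages)`, `len(tool_names)`) avoids the log drifting out of sync with its own contents.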

Your simplified format looks really clean for rapid development! I've also run into problems with tool-call formats when switching between the DeepSeek, Gemini, and Anthropic APIs, and handling parentheses in multiline strings is definitely a challenge. Since my main focus is replicating the thinking and planning abilities of Claude Code, I haven't spent much time on the tool-calling format yet. Thanks for sharing, and I'll keep an eye on this part!
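One example of the format differences: as far as I understand the public API docs, an OpenAI-style tool definition nests the schema under `function.parameters`, while Anthropic's Messages API expects a flat object with `input_schema`. Below is a rough Python sketch of a converter under that assumption. The function name `openai_tool_to_anthropic` and the sample tool definition are hypothetical, and this is not what Terminal-Agent currently does.

```python
def openai_tool_to_anthropic(tool):
    """Convert one OpenAI-style tool definition to the Anthropic-style shape.

    Assumes the input looks like {"type": "function", "function": {...}}.
    """
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Fall back to an empty object schema if no parameters are declared.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }


# Hypothetical tool definition, for illustration only.
openai_tool = {
    "type": "function",
    "function": {
        "name": "run_shell_command",
        "description": "Run a shell command (hypothetical description).",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {"type": "string"},
                "description": {"type": "string"},
            },
            "required": ["command"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
print(anthropic_tool["name"])
```

This only covers tool definitions; the shape of tool calls and tool results in the message history also differs between providers and would need its own mapping.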

My background: I'm an NLP researcher working on LLMs for code (e.g., the SWE-bench task), so I've been exploring different approaches to LLM agent planning and tool calling. I hope the Terminal-Agent project can serve as a starting point for the community to build more advanced features into terminal-based coding agents from the ground up.

BTW, if you're interested in joining our project development, we can share some API budget and GPU support (although the GPU part may not be useful, lol) to work together and make the agent more powerful! 🚀

reacted to their post with 🤗👀 26 days ago

**Build Claude Code from Scratch Together**

I'm building a minimal, open-source AI agent that runs in the terminal to help with programming tasks. Think of a simplified, from-scratch version of the Claude Code agent 💻.

You can check out the basic framework I've already started here: [github.com/YerbaPage/Terminal-Agent](https://github.com/YerbaPage/Terminal-Agent) 🔥

To be clear, the goal isn't to create a perfect, commercial-grade tool. This is a for-fun project geared towards research and exploration. The main idea is to build a **flexible and minimal framework** that helps us understand the modules and functionalities of sophisticated agents like Claude Code. It's a playground to explore how we can enable agents to handle more complex tasks effectively.

I'm looking for a few people who'd be interested in exploring this together. We could experiment with ideas like adding a to-do list manager to improve planning or finding better ways for the agent to manage its memory like some MCPs do. 🤔

If you're into LLMs and want to build a cool, exploratory project from the ground up, leave a comment or check out the repo! 🤗

2 replies

reacted to their post with 👀🔥😎 about 1 month ago

How to defend benchmarks against knowledge leakage? 🛡️

LastingBench is a framework that mitigates memorization in QA benchmarks by identifying and rewriting leakage points, thereby improving the robustness and fairness of model evaluations. 🚀✨

Paper: LastingBench: Defend Benchmarks Against Knowledge Leakage (2506.21614) 📚
Code and benchmark: https://github.com/Seriousss/LastingBench 🧑‍💻

YerbaPage's activity