I also use MI100s, and I also had a very frustrating time trying to get bnb to work, even with their new branch. It is frustrating that I can fine-tune larger models on my Mac than I can with 8x MI100s.
rdsm
AI & ML interests
None yet
Recent Activity
new activity
about 10 hours ago
Qwen/QwQ-32B: Does MacBook M1 Max 64GB run this model well?
new activity
about 16 hours ago
mlx-community/Kokoro-82M-4bit: Update README.md
replied to csabakecskemeti's post 13 days ago
Testing Training on AMD/ROCm for the first time!
I've got my hands on an AMD Instinct MI100. Used, it's about the same price as a V100, but on paper it has more TOPS (V100: 14 TOPS vs MI100: 23 TOPS), and the HBM has a faster clock, so the memory bandwidth is 1.2 TB/s.
For quantized inference it's a beast (the MI50 was also surprisingly fast).
For LoRA training with this quick test I could not make the bnb config work, so I'm running the FT on the full-size model (a sketch of that fallback follows below).
I'll share all the install, setup, and settings I've learned in a blog post, together with the cooling shroud 3D design.
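As a reference for that fallback, here is a minimal sketch assuming a transformers + peft stack; the model id and LoRA hyperparameters are illustrative, not from the original run:

```python
# Minimal LoRA fine-tuning setup without bitsandbytes (the model id
# is a placeholder; swap in whatever you are actually training).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-3.1-8B"  # hypothetical choice for the example

# No BitsAndBytesConfig here: since bnb fails on ROCm in this setup,
# load the full-size weights in bf16 instead of a 4-/8-bit quant.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```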
Organizations
None yet
rdsm's activity
Does MacBook M1 Max 64GB run this model well?
1
#44 opened 3 days ago by mrk83
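For what it's worth, a 4-bit quant of a 32B model is roughly 18-20 GB of weights, which should fit comfortably in 64 GB of unified memory. A minimal sketch, assuming mlx-lm is installed; the exact repo id below is an assumption, so check mlx-community for the actual name:

```python
# Hedged sketch: run a 4-bit MLX quant of QwQ-32B on an M-series Mac.
from mlx_lm import load, generate

# Repo id is an assumption, not confirmed from the thread.
model, tokenizer = load("mlx-community/QwQ-32B-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly explain what QwQ-32B is."}],
    add_generation_prompt=True,
    tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```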
Update README.md
#2 opened about 16 hours ago by rdsm
replied to csabakecskemeti's post 13 days ago
reacted to mmhamdy's post with 🔥 15 days ago
Post
We're excited to introduce MemoryCode, a novel synthetic dataset designed to rigorously evaluate LLMs' ability to track and execute coding instructions across multiple sessions. MemoryCode simulates realistic workplace scenarios where a mentee (the LLM) receives coding instructions from a mentor amidst a stream of both relevant and irrelevant information.
💡 But what makes MemoryCode unique?! The combination of the following:
- Multi-Session Dialogue Histories: MemoryCode consists of chronological sequences of dialogues between a mentor and a mentee, mirroring real-world interactions between coworkers.
- Interspersed Irrelevant Information: Critical instructions are deliberately interspersed with unrelated content, replicating the information overload common in office environments.
- Instruction Updates: Coding rules and conventions can be updated multiple times throughout the dialogue history, requiring LLMs to track and apply the most recent information.
- Prospective Memory: Unlike previous datasets that cue information retrieval, MemoryCode requires LLMs to spontaneously recall and apply relevant instructions without explicit prompts.
- Practical Task Execution: LLMs are evaluated on their ability to use the retrieved information to perform practical coding tasks, bridging the gap between information recall and real-world application.
Our Findings
1️⃣ While even small models can handle isolated coding instructions, the performance of top-tier models like GPT-4o dramatically deteriorates when instructions are spread across multiple sessions.
2️⃣ This performance drop isn't simply due to the length of the context. Our analysis indicates that LLMs struggle to reason compositionally over sequences of instructions and updates. They have difficulty keeping track of which instructions are current and how to apply them.
📄 Paper: From Tools to Teammates: Evaluating LLMs in Multi-Session Coding Interactions (2502.13791)
📦 Code: https://github.com/for-ai/MemoryCode
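To make the "Instruction Updates" and "Prospective Memory" points concrete, here is a hedged sketch (not the authors' code; the session contents are invented) of the bookkeeping the benchmark stresses: later instructions supersede earlier ones, and the model must apply the latest rule without being reminded of it:

```python
# Toy model of MemoryCode-style instruction tracking (illustrative only).
sessions = [
    {"instructions": {"naming": "use snake_case for functions"},
     "filler": "small talk about the coffee machine"},
    {"instructions": {},
     "filler": "unrelated discussion of a meeting schedule"},
    {"instructions": {"naming": "use camelCase for functions"},  # an update
     "filler": "more small talk"},
]

def current_rules(history: list[dict]) -> dict[str, str]:
    """Ground truth: the most recent instruction per topic wins."""
    rules: dict[str, str] = {}
    for session in history:
        rules.update(session["instructions"])
    return rules

# The mentee must apply the *latest* naming rule, unprompted.
assert current_rules(sessions)["naming"] == "use camelCase for functions"
```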
Unable to Load GAIA benchmark leaderboard
3
#31 opened 17 days ago by manojbajaj95
upvoted a paper 21 days ago
Tool Calling
#1 opened about 1 month ago by rdsm
No tool calling
#2 opened about 1 month ago by rdsm