leonardlin's activity
When AutoTrain creates a new Space to train your model, it does so via the Hugging Face API. If you modify the code so that it includes a premade README.md file, you can add these two lines to its front matter:
---
app_port: 8080 # or any integer besides 7860 that's greater than 2 ** 10
startup_duration_timeout: 350m
---
This tells Hugging Face to serve the Space iframe from your port instead of the one AutoTrain is actually hosting on, and because startup time isn't billed, you get the compute for free. (You can take this even further by switching the compute type to an A100 or something.)
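For reference, here's a minimal sketch (not AutoTrain's own code) of how a premade README.md with that front matter could be pushed to a Space repo with huggingface_hub; the Space id is a placeholder:

# Minimal sketch: upload a premade README.md to a Space repo.
# The Space id below is hypothetical.
from huggingface_hub import HfApi

readme = """---
app_port: 8080
startup_duration_timeout: 350m
---
"""

api = HfApi()
api.upload_file(
    path_or_fileobj=readme.encode(),
    path_in_repo="README.md",
    repo_id="your-username/your-autotrain-space",  # hypothetical Space id
    repo_type="space",
)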
📝 Blog post: https://haystack.deepset.ai/blog/rag-evaluation-with-prometheus-2
📓 Notebook: https://github.com/deepset-ai/haystack-cookbook/blob/main/notebooks/prometheus2_evaluation.ipynb
─── ⋆⋅☆⋅⋆ ───
When evaluating LLMs' responses, 𝐩𝐫𝐨𝐩𝐫𝐢𝐞𝐭𝐚𝐫𝐲 𝐦𝐨𝐝𝐞𝐥𝐬 like GPT-4 are commonly used due to their strong performance.
However, relying on closed models presents challenges related to data privacy 🔒, transparency, controllability, and cost 💸.
On the other hand, 𝐨𝐩𝐞𝐧 𝐦𝐨𝐝𝐞𝐥𝐬 typically do not correlate well with human judgments and lack flexibility.
🔥 Prometheus 2 is a new family of open-source models designed to address these gaps:
🔹 two variants: prometheus-eval/prometheus-7b-v2.0; prometheus-eval/prometheus-8x7b-v2.0
🔹 trained on open-source data
🔹 high correlation with human evaluations and proprietary models
🔹 highly flexible: capable of performing direct assessments and pairwise rankings, and allowing the definition of custom evaluation criteria.
See my experiments with RAG evaluation in the links above.
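For a quick local test outside of Haystack, here's a minimal sketch of querying Prometheus 2 for a direct assessment with transformers; the prompt below is illustrative only and not the official Prometheus template, so check the model card for the exact format:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 7B judge model (the 8x7B variant works the same way)
model_id = "prometheus-eval/prometheus-7b-v2.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative direct-assessment prompt; see the model card for the official template
prompt = (
    "You are a fair judge. Score the response from 1 to 5 against the rubric "
    "and explain your reasoning.\n\n"
    "Instruction: What is the capital of France?\n"
    "Response: The capital of France is Paris.\n"
    "Rubric: Is the response factually correct and complete?"
)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))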
Tonight I wrote up a WandB report (the panel editor is super broken in Firefox 😔) that sums up some of the more interesting bits from the results: https://wandb.ai/augmxnt/train-bench/reports/torchtune-vs-axolotl-vs-unsloth-Trainer-Comparison--Vmlldzo4MzU3NTAx
I also have an accompanying model and dataset (and codebase) for those curious to poke around:
* augmxnt/Qwen2-7B-Instruct-deccp
* augmxnt/deccp
Inspired by the distill.pub interactive-graphics papers, we set out to write the most extensive, enjoyable, and in-depth tech report we could draft, so prepare for a 45-min read with interactive graphics and all.
And that's not all: in this article we also introduce 📚FineWeb-Edu, a filtered subset of Common Crawl with 1.3T tokens containing only web pages with very high educational content. To the best of our knowledge, FineWeb-Edu outperforms all openly released web-scale datasets by a significant margin on knowledge- and reasoning-intensive benchmarks like MMLU, ARC, and OpenBookQA.
We also make a number of surprising observations on the "quality" of the internet itself, which may challenge some of the general assumptions about web data (not saying more, I'll let you draw your own conclusions ;)
HuggingFaceFW/blogpost-fineweb-v1
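If you want to poke at the data, here's a minimal sketch for streaming FineWeb-Edu with the datasets library; the "sample-10BT" config name is an assumption, so check the dataset card for the available subsets:

from datasets import load_dataset

# Stream a small sample subset rather than downloading the full 1.3T-token dataset.
# "sample-10BT" is an assumed config name; see the dataset card for available subsets.
fw_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True)

for i, doc in enumerate(fw_edu):
    print(doc["text"][:200])  # each record carries the page text plus metadata
    if i >= 2:
        break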
I'll just add that I'm sure it's spam now; that Space is attached to another one of my models as well (and obviously not running either). Also, the user's other Space is linking straight out to something shady: https://huggingface.co/spaces/elseodelasgalletas/detector-de-ia (I can't report it as I'm rate limited)
I mean, it's obviously not running my model (it's a brand new JA/EN ablation), so not sure why it'd be attached...
Also, I tested the new https://huggingface.co/DataPilot/ArrowPro-7B-KUJIRA model and it appears to be the real deal: very impressive performance, trained by a 15-year-old (!) @Holy-fox. Note that using the sampler settings detailed below improved its score too, as it otherwise suffered from looping errors as well.
I'll be aiming to beat that with the Llama 3 8B, and to beat Command R Plus with the 70B, in the coming days.
I'll also add a note on the sampler parameters I found improved performance for virtually every model I tested: temperature 0.2, min_p 0.1, frequency_penalty 0.5 (a frequency/repetition penalty is required to minimize the looping errors that otherwise creep into most of these models).
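As a minimal sketch, here's how those sampler settings could be passed to a local OpenAI-compatible server such as vLLM; the endpoint and model name are placeholders, and min_p goes through extra_body since it isn't a standard OpenAI parameter:

from openai import OpenAI

# Point the client at a local OpenAI-compatible server (endpoint and model are placeholders)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="DataPilot/ArrowPro-7B-KUJIRA",
    messages=[{"role": "user", "content": "日本の首都はどこですか？"}],
    temperature=0.2,
    frequency_penalty=0.5,      # repetition penalty to keep the model from looping
    extra_body={"min_p": 0.1},  # min_p is a vLLM sampling extension, not a standard OpenAI param
)
print(resp.choices[0].message.content)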
shisa-ai/shisa-v1-llama3-70b beats gpt-3.5-turbo-0125's JA performance, which is worth noting, and is tuned *exclusively* with the old shisa-v1 dataset augmxnt/ultra-orca-boros-en-ja-v1 (so its chart position will be very short lived).
I've set up a fork of Lightblue's Shaberi testing framework, which uses LLM-as-a-Judge style benchmarks as something probably more representative of real-world LLM strength in Japanese. Here's how the new base model ablations are looking:
Dual-licensed under MIT/Apache 2.0.
Model Weights: mrfakename/styletts2-detector
Spaces: mrfakename/styletts2-detector
Here's also a simple script for checking what the output looks like:
from transformers import AutoTokenizer

# Load the tokenizer, which carries the model's chat template
tokenizer = AutoTokenizer.from_pretrained("augmxnt/shisa-7b-v1")

# A short multi-turn conversation to format
messages = [
    {'role': 'user', 'content': 'This is the first user input.'},
    {'role': 'assistant', 'content': 'This is the first assistant response.'},
    {'role': 'user', 'content': 'This is the second user input.'},
]

# Print the raw Jinja chat template...
print()
print('Chat Template:')
print(tokenizer.chat_template)
print()
print('---')
print()
# ...and then the conversation rendered through it
print(tokenizer.apply_chat_template(messages, tokenize=False))
BTW, I was trying to get a tree for https://huggingface.co/mlabonne/AlphaMonarch-7B and it kept getting caught in a recursion loop. I first added caching on the ModelCard, assuming that would sort things out, but it didn't, so I hacked in some logic to prevent revisits (and also added some weak handling for missing models, since that caused looping too; AIDC-ai-business/Marcoroni-7B-v3, for example, has disappeared).
Anyway, my updated code still has broken chart rendering (the cyclic graph is what was causing the looping issues), but at least it will produce a list of the model lineage, which was good enough for my purposes... In case anyone wants to move this forward or needs a reference if they run into looping issues: https://colab.research.google.com/drive/1-7w_pPWPCCQQpQ7LrvlKIdhyHsoCHH4E?usp=sharing
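Here's a minimal sketch of the general idea (not the notebook's code): walk base_model metadata via ModelCard while tracking visited repos so cycles and missing models don't loop forever.

from huggingface_hub import ModelCard

def lineage(model_id, visited=None):
    """Return the chain of base models for model_id, skipping repeats and missing repos."""
    visited = set() if visited is None else visited
    chain = []
    while model_id and model_id not in visited:
        visited.add(model_id)
        chain.append(model_id)
        try:
            card = ModelCard.load(model_id)
        except Exception:
            # model card missing or repo deleted; stop here
            break
        base = card.data.to_dict().get("base_model")
        # base_model may be a list (e.g. merges); just follow the first parent for this sketch
        model_id = base[0] if isinstance(base, list) else base
    return chain

print(lineage("mlabonne/AlphaMonarch-7B"))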
upstage/Solar LLM will soon be available for LG gram Laptops as an on-device LLM. 💻🌞🎉
Upstage makes LLMs accessible to everyone and every device. We'd love to see more on-device LLMs.
https://koreajoongangdaily.joins.com/news/2024-02-06/business/industry/LG-Electronics-signs-partnership-with-generative-AI-startup-Upstage-/1975528
🎉 Welcome Distilabel Capybara DPO, a multi-turn, high-quality preference dataset.
argilla/distilabel-capybara-dpo-7k-binarized
Why?
The best closed chat models are built on top of multi-turn dialogue preference data, which the OSS community largely lacks. This dataset is the first in a series aimed at closing this gap.
Is this dataset useful?
To test this dataset, we've built our virtual launching partner:
🎉 Welcome CapybaraHermes, a preference-tuned OpenHermes with improved second-turn capabilities on MT-Bench
argilla/CapybaraHermes-2.5-Mistral-7B
As usual, models are the least important to us. We like to focus on the data. Our mission is to build and share high-quality datasets, sharing our methods in the open so the community can improve upon them.
That's why we took some time to describe the full methodology on the dataset card; check it out and give us feedback! Data and methods are never perfect!
Finally, this is just a preview version, and we'd love to collaborate with you on adding more benchmarking results, which hyperparams work for DPO'ing models, which mix of datasets, etc.
Expect some more datasets in the coming weeks. Let's build the best data for AI, together.
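For a quick look at the data, here's a minimal sketch for loading it with the datasets library; the chosen/rejected column names are an assumption based on typical binarized DPO datasets, so check the dataset card for the exact schema:

from datasets import load_dataset

# Load the preference dataset from the Hub
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

# Inspect one record; column names assumed from typical binarized DPO datasets
example = ds[0]
print(example.keys())
print(example.get("chosen"))    # preferred multi-turn conversation
print(example.get("rejected"))  # dispreferred conversation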