AI & ML interests

Superintelligence Alignment

Recent Activity

Open-Orca's activity

not-lain
posted an update about 23 hours ago
We now have more than 2,000 public AI models using ModelHubMixin 🤗
not-lain
posted an update 6 days ago
Published a new blog post 📖
In this post I walk through the transformer architecture, emphasizing how tensor shapes propagate through each layer.
🔗 https://huggingface.co/blog/not-lain/tensor-dims
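As a rough illustration of the kind of shape bookkeeping the post is about, the sketch below traces tensor shapes through one transformer block and the output head using plain tuples. It is not taken from the blog post: the function name `trace_shapes` and the default sizes are assumptions chosen for the example.

```python
def trace_shapes(batch, seq_len, d_model=512, n_heads=8, d_ff=2048, vocab=32000):
    """Return (stage, shape) pairs for one transformer block plus the LM head."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads  # each head works on a slice of the model width
    return [
        ("token ids", (batch, seq_len)),
        ("embeddings", (batch, seq_len, d_model)),
        ("Q/K/V per head", (batch, n_heads, seq_len, d_head)),
        ("attention scores", (batch, n_heads, seq_len, seq_len)),
        ("attention output per head", (batch, n_heads, seq_len, d_head)),
        ("heads merged", (batch, seq_len, d_model)),
        ("feed-forward hidden", (batch, seq_len, d_ff)),
        ("block output", (batch, seq_len, d_model)),
        ("logits", (batch, seq_len, vocab)),
    ]


if __name__ == "__main__":
    for stage, shape in trace_shapes(batch=2, seq_len=16):
        print(f"{stage:<28}{shape}")
```

The only shape that grows quadratically with the sequence length is the attention score tensor, which is why it is usually the first thing to check when memory runs out.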
Some interesting takeaways:
not-lain
posted an update 2 months ago
Ever wondered how to make an API call to a visual-question-answering model without sending an image URL? 👀

You can do that by converting your local image to base64 and sending it to the API.
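For reference, here is a minimal standard-library sketch of that conversion. It assumes the endpoint accepts a data URL of the form `data:<mime>;base64,<payload>`; the helper name `image_to_data_url` is made up for this example.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(str(path))
    if mime_type is None:
        # Fallback when the file extension is not recognized.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The resulting string can be placed wherever the request expects an image URL, so the image never has to be hosted anywhere.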

I recently updated my library "loadimg" so that converting images to base64 is a breeze.
🔗 https://github.com/not-lain/loadimg

API request example 🛠️:
from loadimg import load_img
from huggingface_hub import InferenceClient

# load_img accepts a local path, a URL, a Pillow image, or a NumPy array;
# replace imgPath_url_pillow_or_numpy with your own input
my_b64_img = load_img(imgPath_url_pillow_or_numpy, output_type="base64")

client = InferenceClient(api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Describe this image in one sentence."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": my_b64_img  # base64 allows using images without uploading them to the web
                }
            }
        ]
    }
]

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=messages,
    max_tokens=500,
    stream=True
)

for chunk in stream:
    # delta.content can be None on some chunks, so fall back to an empty string
    print(chunk.choices[0].delta.content or "", end="")
louisbrulenaudet
posted an update 2 months ago
I've published a new dataset to simplify model merging 🤗

This dataset makes it easier to find compatible architectures for model merging with @arcee_ai's mergekit, streamlining the automation of high-performance merge searches 📖

Dataset: louisbrulenaudet/mergekit-configs
Alignment-Lab-AI
posted an update 3 months ago
Remember, boys and girls: always keep all your data. It's never a waste of time!
louisbrulenaudet
posted an update 3 months ago
Introducing Lemone-router, a series of classification models designed to produce an optimal multi-agent system for different branches of tax law.

The models were trained on a base of 49k lines comprising synthetic questions generated by GPT-4 Turbo and Llama 3.1 70B, further refined through evol-instruction tuning and manual curation, together with authority documents. They are based on an 8-category decomposition of the classification scheme derived from the Bulletin officiel des finances publiques - impôts:

label2id = {
    "Bénéfices professionnels": 0,
    "Contrôle et contentieux": 1,
    "Dispositifs transversaux": 2,
    "Fiscalité des entreprises": 3,
    "Patrimoine et enregistrement": 4,
    "Revenus particuliers": 5,
    "Revenus patrimoniaux": 6,
    "Taxes sur la consommation": 7
}

id2label = {
    0: "Bénéfices professionnels",
    1: "Contrôle et contentieux",
    2: "Dispositifs transversaux",
    3: "Fiscalité des entreprises",
    4: "Patrimoine et enregistrement",
    5: "Revenus particuliers",
    6: "Revenus patrimoniaux",
    7: "Taxes sur la consommation"
}
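
As a small sketch of how such a mapping is typically used, the snippet below derives `id2label` by inverting `label2id` and picks the label with the highest score. The helper `predicted_label` and the example scores are hypothetical; they are not part of the released models.

```python
label2id = {
    "Bénéfices professionnels": 0,
    "Contrôle et contentieux": 1,
    "Dispositifs transversaux": 2,
    "Fiscalité des entreprises": 3,
    "Patrimoine et enregistrement": 4,
    "Revenus particuliers": 5,
    "Revenus patrimoniaux": 6,
    "Taxes sur la consommation": 7,
}

# Deriving the inverse mapping avoids keeping two dictionaries in sync by hand.
id2label = {index: label for label, index in label2id.items()}


def predicted_label(scores):
    """Return the label whose index holds the highest score."""
    if len(scores) != len(id2label):
        raise ValueError("expected one score per label")
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return id2label[best_index]


# Hypothetical scores for illustration only (not real model output).
example_scores = [0.1, 0.2, 0.1, 2.3, 0.0, 0.4, 0.3, 0.2]
print(predicted_label(example_scores))
```

In a multi-agent setup, the returned label would decide which specialized agent receives the question.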

It achieves the following results on the evaluation set:
- Loss: 0.4734
- Accuracy: 0.9191

Link to the collection: louisbrulenaudet/lemone-router-671cce21d6410f3570514762
louisbrulenaudet
posted an update 3 months ago
🚨 I have $3,500 in Azure credits, including access to an H100 (96 GB), expiring on November 12, 2024.

I won't be able to use it all myself, so I'm reaching out to the @huggingface community: are there any open-source projects with data ready for some compute power?

Let's collaborate and make the most of it together 🔗
louisbrulenaudet
posted an update 4 months ago
My biggest release of the year, a series of 7 specialized embedding models for information retrieval within tax documents, is now available for free on Hugging Face 🤗

These new models aim to offer an open-source alternative for in-domain semantic search over large text corpora, and to improve RAG systems and context addition for large language models.

Trained on more than 43 million tax tokens derived from semi-synthetic and raw-synthetic data, enriched by various methods (in particular MSFT's evol-instruct by @intfloat), and corrected by humans, this project is the fruit of hundreds of hours of work and is the culmination of a global effort to open up legal technologies that has only just begun.

A big thank you to Microsoft for Startups for giving me access to state-of-the-art infrastructure to train these models, and to @julien-c, @clem 🤗, @thomwolf and the whole HF team for the inference endpoint API and the generous provision of Meta Llama-3.1-70B. Special thanks also to @tomaarsen for his invaluable advice on training embedding models and loss functions ❤️

The models are available on my personal HF page, in the Lemone-embed collection: louisbrulenaudet/lemone-embed-66fdc24000df732b395df29b
louisbrulenaudet
posted an update 4 months ago
The Romulus model series has been released on Hugging Face. The models were continually pre-trained on 34,864,949 tokens of French laws and are intended to serve as a foundation for fine-tuning on labeled data 🤗

The training code, dataset and model weights are open and freely available on HF. Training ran on H100 hardware provided by Microsoft for Startups, using Unsloth AI by @danielhanchen and @shimmyshimmer 🦥

Link to the base model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1

Link to the instruct model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct

Link to the dataset: louisbrulenaudet/Romulus-cpt-fr

Please note that these models have not been aligned for the production of usable texts as they stand, and will certainly need to be refined for the desired tasks in order to produce satisfactory results.
louisbrulenaudet
posted an update 4 months ago
One application of LegalKit is the production of knowledge graphs; here is a demo Space 🔗

With the update of the French legal code data model uploaded to 🤗 and the introduction of a column dedicated to HTML text, it's now easy to extract links between different articles and produce complex graphs with just a few lines of Python.
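
To give an idea of what those few lines could look like, here is a hedged sketch using only the standard library. The input shape (a dict of article id to HTML text), the sample fragments, and the names `LinkCollector` and `build_citation_graph` are assumptions for the example, not the dataset's actual schema or the Space's code.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def build_citation_graph(articles):
    """Map each article id to the sorted list of targets it links to."""
    graph = defaultdict(set)
    for article_id, html_text in articles.items():
        collector = LinkCollector()
        collector.feed(html_text)
        # update() also registers articles that link to nothing.
        graph[article_id].update(collector.links)
    return {node: sorted(targets) for node, targets in graph.items()}


# Made-up fragments for illustration only.
sample = {
    "art-1": '<p>See <a href="art-2">article 2</a> and <a href="art-3">article 3</a>.</p>',
    "art-2": '<p>As stated in <a href="art-1">article 1</a>.</p>',
    "art-3": "<p>No references.</p>",
}
print(build_citation_graph(sample))
```

The resulting adjacency mapping can be serialized to JSON and handed to a front-end graph library for display.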

This simplified demo highlights the ease of implementation and the creative potential, and it can generate complete datasets, although displaying them requires a powerful graphics card. The framework used for now is D3.js, but other solutions may be possible. I'd be delighted to hear your suggestions and look forward to hearing from the community.

Link to the 🤗 Space: louisbrulenaudet/legalkit-knowledge-graph
louisbrulenaudet
posted an update 5 months ago
Understanding the JSON response format with HF's Serverless Inference API 🤗

As it stands, the InferenceClient completion API seems to be inconsistent with the OpenAI documentation when it comes to implementing the JSON response format.

After investigating the InferenceClient source code, I am sharing the official solution, which uses a JSON Schema. It fixes the structure of the response and simplifies parsing in automated pipelines that extract metadata or other information:
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

messages = [
    {
        "role": "user",
        "content": "I saw a puppy, a cat and a raccoon during my bike ride in the park. What did I see and when?",
    },
]

# "value" holds the JSON Schema that the reply must follow
response_format = {
    "type": "json",
    "value": {
        "properties": {
            "location": {"type": "string"},
            "activity": {"type": "string"},
            "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
            "animals": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["location", "activity", "animals_seen", "animals"],
    },
}

response = client.chat_completion(
    messages=messages,
    response_format=response_format,
    max_tokens=500,
)

# The message content is a JSON string shaped by the schema above
print(response.choices[0].message.content)
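
Since the point of the schema is easier parsing downstream, here is a minimal sketch of that step. It checks the required keys and basic types with the standard library only; the helper `parse_structured_reply` and the hand-written reply are illustrative assumptions, not real model output or part of `huggingface_hub`.

```python
import json

# Required keys from the schema above, mapped to the Python types they decode to.
REQUIRED_FIELDS = {
    "location": str,
    "activity": str,
    "animals_seen": int,
    "animals": list,
}


def parse_structured_reply(raw_text):
    """Parse a JSON reply and check required keys and basic types."""
    data = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be of type {expected_type.__name__}")
    return data


# Hand-written reply for illustration only.
reply = (
    '{"location": "park", "activity": "bike ride", '
    '"animals_seen": 3, "animals": ["puppy", "cat", "raccoon"]}'
)
print(parse_structured_reply(reply)["animals"])
```

A full JSON Schema validator would also enforce constraints such as `minimum` and `maximum`; this sketch only covers presence and type.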

As a reminder, JSON mode is activated with the OpenAI client as follows:
# here, client is an OpenAI client instance, not the InferenceClient
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[...],
    response_format={"type": "json_object"}
)

One question remains open, and the community may be able to answer it: generating a list of dictionaries still seems to be unsupported, and for now producing a single dictionary appears to be the only working option.
louisbrulenaudet
posted an update 5 months ago
🚀 RAGoon is now available on PyPI, GitHub, and as a Space on Hugging Face for batched embeddings generation 🤗

RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping and data augmentation techniques.

At this stage, 5 major classes are available via RAGoon to facilitate:
- the production of chain embeddings for several models, to simplify a continuous deployment process;
- the production of LLM requests for web querying and content retrieval via the Google API;
- recursive chunking by tokens;
- data visualization: loading embeddings from a FAISS index, reducing their dimensionality using PCA and/or t-SNE, and visualizing them in an interactive 3D graph;
- the creation of binary indexes for search with scalar (int8) rescoring.
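
To make the last item more concrete, here is a generic pure-Python sketch of binary search followed by int8 rescoring. It illustrates the general technique only and is not RAGoon's implementation; the function names, the shared-scale quantization, and the toy vectors are all assumptions.

```python
def to_binary(vector):
    """Pack the sign of each dimension into the bits of a single integer."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def to_int8(vector, scale):
    """Quantize floats to integers in [-127, 127] using a shared scale."""
    return [max(-127, min(127, round(value / scale * 127))) for value in vector]


def search(query, corpus, top_k=1, shortlist=2):
    """Shortlist by Hamming distance on binary codes, then rescore with int8 vectors."""
    scale = max(abs(value) for vector in corpus for value in vector) or 1.0
    binary_index = [to_binary(vector) for vector in corpus]
    int8_index = [to_int8(vector, scale) for vector in corpus]

    query_bits = to_binary(query)
    # Stage 1: cheap candidate retrieval with XOR + popcount.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: bin(query_bits ^ binary_index[i]).count("1"),
    )[:shortlist]
    # Stage 2: rescore the shortlist with the float query against int8 documents.
    rescored = sorted(
        candidates,
        key=lambda i: sum(q * d for q, d in zip(query, int8_index[i])),
        reverse=True,
    )
    return rescored[:top_k]


# Toy vectors for illustration only.
corpus = [[0.9, 0.1, -0.4], [0.2, 0.8, -0.1], [-0.7, -0.3, 0.6]]
query = [0.3, 0.9, -0.2]
print(search(query, corpus))
```

The binary stage keeps memory and comparison cost low, while the int8 stage recovers some of the ranking precision lost by keeping only the signs.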

Link to GitHub: https://github.com/louisbrulenaudet/ragoon
Link to the 🤗 Space: louisbrulenaudet/ragoon