C4AI-Community (C4AI Community)

posted an update 1 day ago

I’ve just released logfire-callback on PyPI, a callback for monitoring Hugging Face Transformers training loops with Pydantic Logfire 🤗

The callback automatically logs the start of training (with its configuration parameters), periodic metrics, and training completion ⏱️

Install the package using pip:

pip install logfire-callback

First, ensure you have a Logfire API token and set it as an environment variable:

export LOGFIRE_TOKEN=your_logfire_token

Then use the callback in your training code:

from transformers import Trainer, TrainingArguments
from logfire_callback import LogfireCallback

# Initialize your model, dataset, etc.

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    # ... other training arguments
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[LogfireCallback()]  # Add the Logfire callback here
)

trainer.train()
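
To illustrate the three events described above (training start, periodic metrics, training completion), here is a minimal sketch of a callback shaped like one. It is not the logfire-callback source: the class name `LoggingCallbackSketch` and the event names are hypothetical, the standard `logging` module stands in for Logfire, and the class only mimics the `on_train_begin` / `on_log` / `on_train_end` hook names of `transformers.TrainerCallback` so that it runs without any third-party install.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("training_monitor")


class LoggingCallbackSketch:
    """Hypothetical stand-in: mirrors TrainerCallback hook names, logs via stdlib."""

    def __init__(self):
        self.events = []  # kept in memory so the behaviour is easy to inspect

    def _emit(self, name, **fields):
        self.events.append((name, fields))
        logger.info("%s %s", name, fields)

    def on_train_begin(self, args, state, control, **kwargs):
        # Log the start of training together with a few configuration values.
        self._emit(
            "training_started",
            output_dir=args.output_dir,
            num_train_epochs=args.num_train_epochs,
        )

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Called periodically by the trainer with the latest metrics dict.
        if logs:
            self._emit("metrics", step=state.global_step, metrics=dict(logs))

    def on_train_end(self, args, state, control, **kwargs):
        self._emit("training_finished", step=state.global_step)


# Simulate the calls a trainer would make (stand-in objects, not real Trainer state).
args = SimpleNamespace(output_dir="./results", num_train_epochs=3)
state = SimpleNamespace(global_step=0)

cb = LoggingCallbackSketch()
cb.on_train_begin(args, state, control=None)
state.global_step = 10
cb.on_log(args, state, control=None, logs={"loss": 0.42})
cb.on_train_end(args, state, control=None)
```

In real use such a class would subclass `transformers.TrainerCallback` and be passed through the `callbacks` argument, as in the snippet above.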

If you have any feedback, please reach out to @louisbrulenaudet

tellarin

authored a paper 6 days ago

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Paper • 2503.12533 • Published 8 days ago • 60

not-lain

posted an update 12 days ago

🚀 AraClip is now fully integrated with Hugging Face 🤗

AraClip is a specialized CLIP model that was created by @pain and optimized for Arabic text-image retrieval tasks 🔥

🔗 Try it out 🔗
🤖 model: Arabic-Clip/araclip
🧩 Gradio demo: Arabic-Clip/Araclip-Simplified
🌐 website: https://arabic-clip.github.io/Arabic-CLIP/

tellarin

authored a paper 12 days ago

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Paper • 2503.07920 • Published 14 days ago • 95

tellarin

authored 3 papers 13 days ago

jjzha

authored a paper 14 days ago

How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale

Paper • 2503.04290 • Published 18 days ago

yihaopeng

authored a paper 14 days ago

CodeA11y: Making AI Coding Assistants Useful for Accessible Web Development

Paper • 2502.10884 • Published Feb 15

singhsidhukuldeep

posted an update 22 days ago

Exciting New Tool for Knowledge Graph Extraction from Plain Text!

I just came across a groundbreaking new tool called KGGen that's solving a major challenge in the AI world - the scarcity of high-quality knowledge graph data.

KGGen is an open-source Python package that leverages language models to extract knowledge graphs (KGs) from plain text. What makes it special is its innovative approach to clustering related entities, which significantly reduces sparsity in the extracted KGs.

The technical approach is fascinating:

1. KGGen uses a multi-stage process involving an LLM (GPT-4o in their implementation) to extract entities and relations from source text
2. It aggregates graphs across sources to reduce redundancy
3. Most importantly, it applies iterative LM-based clustering to refine the raw graph

The clustering stage is particularly innovative - it identifies which nodes and edges refer to the same underlying entities or concepts. This normalizes variations in tense, plurality, stemming, and capitalization (e.g., "labors" clustered with "labor").
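
The three stages above can be illustrated with a toy sketch. It is not KGGen's API or implementation: the `normalize` function is a crude string rule standing in for the LM-based clustering step, the extraction step is skipped (triples are given by hand), and every function name and example triple here is hypothetical.

```python
from collections import defaultdict


def normalize(label):
    """Crude stand-in for LM-based clustering: lowercase and drop a plural 's'."""
    key = label.strip().lower()
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key


def aggregate(graphs):
    """Stage 2: union the triples extracted from each source."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def cluster_entities(triples):
    """Stage 3: group surface forms that map to the same canonical key."""
    clusters = defaultdict(set)
    for subj, _, obj in triples:
        clusters[normalize(subj)].add(subj)
        clusters[normalize(obj)].add(obj)
    return dict(clusters)


def canonicalize(triples):
    """Rewrite every triple using canonical entity keys, removing duplicates."""
    return {(normalize(s), p.strip().lower(), normalize(o)) for s, p, o in triples}


# Hand-written triples standing in for what an LLM might extract from two texts.
doc_a = {("Labors", "requires", "Wages")}
doc_b = {("labor", "requires", "wage")}

merged = aggregate([doc_a, doc_b])
clusters = cluster_entities(merged)
graph = canonicalize(merged)
```

After clustering, the two surface-level triples collapse into a single edge, which is the sparsity reduction the post describes.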

The researchers from Stanford and University of Toronto also introduced MINE (Measure of Information in Nodes and Edges), the first benchmark for evaluating KG extractors. When tested against existing methods like OpenIE and GraphRAG, KGGen outperformed them by up to 18%.

For anyone working with knowledge graphs, RAG systems, or KG embeddings, this tool addresses the fundamental challenge of data scarcity that's been holding back progress in graph-based foundation models.

The package is available via pip install kg-gen, making it accessible to everyone. This could be a game-changer for knowledge graph applications!

singhsidhukuldeep

posted an update 24 days ago

Exciting Research Alert: Enhancing Dense Retrieval with Deliberate Thinking

I just came across a fascinating new paper titled "Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Search" that introduces DEBATER (Deliberate Thinking based Dense Retriever), a novel approach to improve information retrieval using large language models.

The research team from Northeastern University and Tsinghua University has developed a method that significantly outperforms existing dense retrieval systems by enabling LLMs to "think deliberately" before generating document representations.

>> Technical Details

DEBATER enhances LLM-based retrievers through two key mechanisms:

1. Chain-of-Deliberation (CoD): This approach delays the computation of document embeddings by performing several steps of reasoning. It incorporates a sequence of prompt tokens that stimulate the reasoning capability of LLMs, encouraging the model to think step-by-step before producing the final document embedding.

2. Self Distillation (SD): This mechanism distills knowledge from different thinking steps into the final document representation. It identifies the most informative thinking steps and integrates them into a unified text embedding.

The implementation uses cosine similarity to measure the similarity between queries and documents. During training, DEBATER calculates similarity scores between query representation and document representations at each thinking step, then selects the most useful thinking step from CoD.
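
The step-selection idea can be sketched in a few lines. This is only an illustration under stated assumptions: the vectors are made-up numbers rather than LLM embeddings, "most useful" is taken to mean "highest cosine similarity with the query", and the function names are hypothetical rather than taken from the DEBATER code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_thinking_step(query, step_embeddings):
    """Score the document embedding from each thinking step; return the best index."""
    scores = [cosine(query, emb) for emb in step_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: one query vector and the document embedding after each of 3 steps.
query = [1.0, 0.0]
steps = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.1]]
best, scores = best_thinking_step(query, steps)
```

In the paper's setting the selected step would then serve as the teacher signal for self distillation; that training loop is omitted here.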

>> Performance

What's particularly impressive is that DEBATER-4B outperforms larger 7B-scale LLM-based dense retrievers while using significantly fewer parameters. In experiments on the BEIR benchmark, DEBATER achieved more than a 2% improvement over baseline retrievers.

The researchers found that an appropriate thinking depth (around 4-8 steps) effectively activates the reasoning capabilities of LLM-based retrievers.

jjzha

authored a paper 24 days ago

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Paper • 2502.15411 • Published Feb 21 • 2

singhsidhukuldeep

posted an update 26 days ago

O1 Embedder: Transforming Retrieval Models with Reasoning Capabilities

Researchers from University of Science and Technology of China and Beijing Academy of Artificial Intelligence have developed a novel retrieval model that mimics the slow-thinking capabilities of reasoning-focused LLMs like OpenAI's O1 and DeepSeek's R1.

Unlike traditional embedding models that directly match queries with documents, O1 Embedder first generates thoughtful reflections about the query before performing retrieval. This two-step process significantly improves performance on complex retrieval tasks, especially those requiring intensive reasoning or zero-shot generalization to new domains.

The technical implementation is fascinating:

- The model integrates two essential functions: Thinking and Embedding
- It uses an "Exploration-Refinement" data synthesis workflow where initial thoughts are generated by an LLM and refined by a retrieval committee
- A multi-task training method fine-tunes a pre-trained LLM to generate retrieval thoughts via behavior cloning while simultaneously learning embedding capabilities through contrastive learning
- Memory-efficient joint training enables both tasks to share encoding results, dramatically increasing batch size
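
As a rough illustration of the contrastive half of that multi-task objective, here is a sketch of an InfoNCE-style loss over similarity scores. The post does not say which contrastive loss O1 Embedder uses, so InfoNCE, the temperature value, and the function name are all assumptions made for this example.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one query: -log softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Numerically stable log-sum-exp over all candidate logits.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


# Toy similarity scores between a (thought-augmented) query and candidate documents.
uninformative = info_nce(1.0, [1.0, 1.0, 1.0], temperature=1.0)
well_separated = info_nce(0.9, [0.1, 0.2, 0.0], temperature=0.05)
```

The loss shrinks as the positive document's score pulls away from the negatives; in joint training it would be added to the language-modeling loss used for behavior cloning of the thoughts.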

The results are impressive - O1 Embedder outperforms existing methods across 12 datasets in both in-domain and out-of-domain scenarios. For example, it achieves a 3.9% improvement on Natural Questions and a 3.0% boost on HotPotQA compared to models without thinking capabilities.

This approach represents a significant paradigm shift in retrieval technology, bridging the gap between traditional dense retrieval and the reasoning capabilities of large language models.

What do you think about this approach? Could "thinking before retrieval" transform how we build search systems?

singhsidhukuldeep

posted an update 27 days ago

I just came across a groundbreaking paper titled "Hypencoder: Hypernetworks for Information Retrieval" by researchers from the University of Massachusetts Amherst that introduces a fundamentally new paradigm for search technology.

Most current retrieval models rely on simple inner product calculations between query and document vectors, which severely limits their expressiveness. The authors prove theoretically that inner product similarity functions fundamentally constrain what types of relevance relationships can be captured.

Hypencoder takes a radically different approach: instead of encoding a query as a vector, it generates a small neural network (called a "q-net") that acts as a learned relevance function. This neural network takes document representations as input and produces relevance scores.

Under the hood, Hypencoder uses:
- Attention-based hypernetwork layers (hyperhead layers) that transform contextualized query embeddings into weights and biases for the q-net
- A document encoder that produces vector representations similar to existing models
- A graph-based greedy search algorithm for efficient retrieval that can search 8.8M documents in under 60ms
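
The query-to-network idea can be sketched with plain Python lists. This is a toy under loud assumptions: a single fixed random linear map stands in for the attention-based hyperhead layers, the q-net is a one-hidden-layer ReLU network, and the names `make_hypernetwork` and `build_q_net` are hypothetical, not the paper's code.

```python
import random


def make_hypernetwork(query_dim, doc_dim, hidden, seed=0):
    """Return a function mapping a query vector to a small scoring network (q-net)."""
    rng = random.Random(seed)
    n_params = hidden * doc_dim + hidden + hidden + 1  # W1, b1, w2, b2
    # Hypothetical stand-in for hyperhead layers: one fixed random linear projection.
    proj = [[rng.uniform(-0.5, 0.5) for _ in range(query_dim)] for _ in range(n_params)]

    def build_q_net(query):
        # Generate all q-net parameters from the query embedding.
        flat = [sum(w * q for w, q in zip(row, query)) for row in proj]
        w1 = [flat[i * doc_dim:(i + 1) * doc_dim] for i in range(hidden)]
        off = hidden * doc_dim
        b1 = flat[off:off + hidden]
        w2 = flat[off + hidden:off + 2 * hidden]
        b2 = flat[off + 2 * hidden]

        def q_net(doc):
            # One hidden ReLU layer followed by a scalar relevance score.
            h = [max(0.0, sum(w * d for w, d in zip(row, doc)) + b)
                 for row, b in zip(w1, b1)]
            return sum(w * x for w, x in zip(w2, h)) + b2

        return q_net

    return build_q_net


build_q_net = make_hypernetwork(query_dim=3, doc_dim=4, hidden=2, seed=0)
q_net = build_q_net([0.2, -0.1, 0.7])          # one network per query
docs = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.4]]
ranked = sorted(docs, key=q_net, reverse=True)  # score each document vector
```

The document side stays an ordinary vector, as in the post; only the query side changes from "a vector to dot with" to "a function to apply".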

The results are impressive - Hypencoder significantly outperforms strong dense retrieval models on standard benchmarks like MS MARCO and TREC Deep Learning Track. The performance gap widens even further on complex retrieval tasks like tip-of-the-tongue queries and instruction-following retrieval.

What makes this approach particularly powerful is that neural networks are universal approximators, allowing Hypencoder to express far more complex relevance relationships than inner product similarity functions. The framework is also flexible enough to replicate any existing neural retrieval method while adding the ability to learn query-dependent weights.

Cartinoe5930

authored 2 papers 27 days ago

Multi-Step Reasoning in Korean and the Emergent Mirage

Paper • 2501.05712 • Published Jan 10

Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Paper • 2502.17407 • Published 28 days ago • 25

nathanaelc

published a dataset 28 days ago

C4AI-Community/memorycode

Updated 28 days ago • 92 • 2

nathanaelc

updated a dataset 28 days ago

C4AI-Community/memorycode

Updated 28 days ago • 92 • 2

ehristoforu

posted an update 28 days ago

Introducing our first standalone model – FluentlyLM Prinum

Introducing the first standalone model from Project Fluently LM! We worked on it for several months, tried different approaches, and eventually settled on the one that worked best.

General characteristics:
- Model type: Causal language model (QwenForCausalLM, LM Transformer)
- Number of parameters: 32.5B
- Number of parameters (non-embedding): 31.0B
- Number of layers: 64
- Context: 131,072 tokens
- Language(s) (NLP): English, French, Spanish, Russian, Chinese, Japanese, Persian (officially supported)
- License: MIT

Creation strategy:
The basis of the strategy is shown in Pic. 2.
We used Axolotl & Unsloth for SFT fine-tuning with PEFT LoRA (rank=64, alpha=64) and Mergekit for SLERP and TIES merges.
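
For readers unfamiliar with SLERP merging, here is a sketch of spherical linear interpolation applied to two flattened weight vectors. It shows the textbook formula only and is not Mergekit's implementation; the function name, the `eps` threshold, and the fallback to linear interpolation for near-parallel vectors are choices made for this example.

```python
import math


def slerp(t, a, b, eps=1e-8):
    """Spherical linear interpolation between vectors a and b at fraction t."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))

    def lerp():
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]

    if norm_a < eps or norm_b < eps:
        return lerp()

    # Angle between the two vectors, clamped for numerical safety.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)

    if abs(sin_omega) < eps:
        # Vectors are (anti)parallel: the spherical formula degenerates.
        return lerp()

    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy "weights" from two models; t=0.5 gives the midpoint along the arc.
model_a = [1.0, 0.0]
model_b = [0.0, 1.0]
midpoint = slerp(0.5, model_a, model_b)
```

Unlike plain averaging, the interpolated vector keeps unit length when both inputs have unit length, which is the usual motivation for SLERP over a linear blend.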

Evaluation:
🏆 12th place in the Open LLM Leaderboard (open-llm-leaderboard/open_llm_leaderboard) (21.02.2025)

Detailed results and comparisons are presented in Pic. 3.

Links:
- Model: fluently-lm/FluentlyLM-Prinum
- GGUF version: mradermacher/FluentlyLM-Prinum-GGUF
- Demo on ZeroGPU: ehristoforu/FluentlyLM-Prinum-demo

alielfilali01

posted an update about 1 month ago

🚨 Arabic LLM Evaluation 🚨

A few models joined the inceptionai/AraGen-Leaderboard ranking today.

The new MistralAI model, Saba, is quite impressive, landing in the top 10! Well done @arthurmensch and team.

Sadly, Mistral did not follow its usual strategy of releasing public weights this time. We hope this changes soon and that we get the model under a permissive license.

We added other Mistral models and, apparently, we have been sleeping on mistralai/Mistral-Large-Instruct-2411!

Another impressive model that joined the ranking today is ALLaM-AI/ALLaM-7B-Instruct-preview. After a long wait, ALLaM is finally here, and it is IMPRESSIVE given its size!

ALLaM is ranked on OALL/Open-Arabic-LLM-Leaderboard as well.

C4AI Community

AI & ML interests

Recent Activity

C4AI-Community's activity

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text

Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning

How Do Hackathons Foster Creativity? Towards AI Collaborative Evaluation of Creativity at Scale

CodeA11y: Making AI Coding Assistants Useful for Accessible Web Development

HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Multi-Step Reasoning in Korean and the Emergent Mirage

Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

C4AI-Community/memorycode

Team members 166
