
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

😃 TOP 3 on HuggingFace for posts 🤗 Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io


Posts 84

Good folks at @nvidia and @Tsinghua_Uni have released LLaMA-Mesh - A Revolutionary Approach to 3D Content Generation!

This innovative framework enables the direct generation of 3D meshes from natural language prompts while maintaining strong language capabilities.

Here is the Architecture & Implementation!

>> Core Components

Model Foundation
- If you haven't guessed it yet, it's built on the LLaMA-3.1-8B-Instruct base model
- Maintains original language capabilities while adding 3D generation
- Context length is set to 8,000 tokens
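
For anyone who wants to poke at the backbone, here is a minimal sketch of loading that base model with Hugging Face transformers. The model ID below is just the base model named above; the released LLaMA-Mesh weights themselves may live under a different repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"  # base model named in the post

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

prompt = "Create a 3D model of a simple wooden table."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096)  # mesh text is long; the model uses an 8k context
print(tokenizer.decode(output[0], skip_special_tokens=True))
```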

3D Representation Strategy
- Uses the OBJ file format for mesh representation
- Quantizes vertex coordinates into 64 discrete bins per axis
- Sorts vertices by z-y-x coordinates, from lowest to highest
- Sorts faces by the lowest vertex indices for consistency
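
To make that representation concrete, here is a rough sketch of the serialization step: quantize each vertex coordinate into 64 bins, sort vertices by (z, y, x), sort faces by their lowest vertex index, and emit plain OBJ text for the LLM to consume. Function and variable names are mine, not the paper's.

```python
import numpy as np

def mesh_to_obj_text(vertices: np.ndarray, faces: np.ndarray, bins: int = 64) -> str:
    # Normalize coordinates to [0, 1], then quantize to integer bins per axis.
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    norm = (vertices - lo) / np.maximum(hi - lo, 1e-8)
    quant = np.clip((norm * (bins - 1)).round().astype(int), 0, bins - 1)

    # Sort vertices by z, then y, then x (lowest to highest).
    order = np.lexsort((quant[:, 0], quant[:, 1], quant[:, 2]))
    quant = quant[order]

    # Remap face indices to the new vertex order, then sort faces by their
    # lowest vertex index for a consistent token sequence.
    remap = np.empty(len(order), dtype=int)
    remap[order] = np.arange(len(order))
    faces = remap[faces]                       # assumes 0-indexed input faces
    faces = faces[np.argsort(faces.min(axis=1))]

    lines = [f"v {x} {y} {z}" for x, y, z in quant]
    lines += [f"f {a + 1} {b + 1} {c + 1}" for a, b, c in faces]  # OBJ is 1-indexed
    return "\n".join(lines)
```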

Data Processing Pipeline
- Filters meshes to a maximum of 500 faces for computational efficiency
- Applies random rotations (0°, 90°, 180°, 270°) for data augmentation
- Generates ~125k mesh variations from 31k base meshes
- Uses Cap3D-generated captions for text descriptions
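
A toy version of that filter-and-rotate pipeline might look like the sketch below; the helper names, mesh dictionary layout, and choice of rotation axis are assumptions on my part.

```python
import numpy as np

MAX_FACES = 500
ANGLES = [0, 90, 180, 270]  # degrees; axis of rotation assumed vertical

def rotate_z(vertices: np.ndarray, degrees: float) -> np.ndarray:
    t = np.deg2rad(degrees)
    rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                    [np.sin(t),  np.cos(t), 0.0],
                    [0.0,        0.0,       1.0]])
    return vertices @ rot.T

def augment(meshes, caption_lookup):
    samples = []
    for mesh in meshes:
        if len(mesh["faces"]) > MAX_FACES:      # keep only low-poly meshes
            continue
        for angle in ANGLES:                    # four rotated copies per mesh
            samples.append({
                "vertices": rotate_z(mesh["vertices"], angle),
                "faces": mesh["faces"],
                "caption": caption_lookup[mesh["id"]],  # Cap3D caption
            })
    return samples
```

Four rotated copies per retained mesh is also consistent with the roughly 4x growth from 31k base meshes to ~125k variations.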

>> Training Framework

Dataset Composition
- 40% Mesh Generation tasks
- 20% Mesh Understanding tasks
- 40% General Conversation (UltraChat dataset)
- 8 dialogue turns per generation sample, 4 per understanding sample
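
To make the mixture concrete, here is a tiny sampling sketch of those ratios; the task labels are placeholders, not the authors' dataset names.

```python
import random

MIXTURE = {
    "mesh_generation": 0.4,     # 40%
    "mesh_understanding": 0.2,  # 20%
    "general_chat": 0.4,        # 40%, UltraChat-style conversations
}

def sample_tasks(n: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    tasks, weights = zip(*MIXTURE.items())
    return rng.choices(list(tasks), weights=weights, k=n)

print(sample_tasks(8))  # e.g. one batch-worth of task labels to interleave
```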

Training Configuration
- Deployed on 32 A100 GPUs (for NVIDIA, that's literally in-house hardware)
- 21,000 training iterations
- Global batch size: 128
- AdamW optimizer with a 1e-5 learning rate
- 30-step warmup with cosine scheduling
- Total training time: approximately 3 days (based on the paper)
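
Expressed as a heavily simplified PyTorch sketch of just the optimizer and schedule, with a stand-in module instead of the 8B model and no distributed setup:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

TOTAL_STEPS = 21_000   # training iterations from the post
WARMUP_STEPS = 30      # warmup steps from the post

# Stand-in module; in the paper this would be the LLaMA-3.1-8B-Instruct backbone.
model = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=TOTAL_STEPS
)

for step in range(TOTAL_STEPS):
    optimizer.zero_grad()
    batch = torch.randn(4, 16)          # dummy batch; the real global batch size is 128
    loss = model(batch).pow(2).mean()   # placeholder loss, not the paper's objective
    loss.backward()
    optimizer.step()
    scheduler.step()
```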

This research opens exciting possibilities for intuitive 3D content creation through natural language interaction. The future of digital design is conversational!

It's not every day you see the No. 1 ranked paper of the day open-sourcing a very powerful image editing app!

Fascinating to see MagicQuill - a groundbreaking interactive image editing system that makes precise photo editing effortless through advanced AI!

The system's architecture features three sophisticated components:

1. Editing Processor:
- Implements a dual-branch architecture integrated into a latent diffusion framework
- Utilizes PiDiNet for edge map extraction and content-aware per-pixel inpainting
- Features a specialized UNet architecture with zero-convolution layers for feature insertion
- Employs denoising score matching for training the control branch
- Processes both structural modifications via scribble guidance and color manipulation through downsampled color blocks
- Maintains pixel-level control through VAE-based latent space operations
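
For readers unfamiliar with zero-convolutions, here is an illustrative ControlNet-style sketch of the idea, not MagicQuill's actual code: a 1x1 convolution initialized to zero, so the control branch contributes nothing at the start of training and is blended into the frozen UNet's features gradually.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at zero."""
    def __init__(self, channels: int):
        super().__init__(channels, channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlledBlock(nn.Module):
    """Adds control-branch features into a frozen UNet block's output."""
    def __init__(self, unet_block: nn.Module, control_block: nn.Module, channels: int):
        super().__init__()
        self.unet_block = unet_block        # frozen, pretrained
        self.control_block = control_block  # trainable copy fed with scribble/color hints
        self.zero_conv = ZeroConv2d(channels)

    def forward(self, x: torch.Tensor, hint: torch.Tensor) -> torch.Tensor:
        return self.unet_block(x) + self.zero_conv(self.control_block(hint))

# Example wiring with toy blocks:
block = ControlledBlock(nn.Conv2d(8, 8, 3, padding=1), nn.Conv2d(8, 8, 3, padding=1), channels=8)
x = torch.randn(1, 8, 32, 32)     # latent features
hint = torch.randn(1, 8, 32, 32)  # encoded scribble / color-block hint
out = block(x, hint)              # equals unet_block(x) at init, since zero_conv outputs zeros
```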

2. Painting Assistor:
- Powered by a fine-tuned LLaVA multimodal LLM using Low-Rank Adaptation (LoRA)
- Trained on a custom dataset derived from Densely Captioned Images (DCI)
- Processes user brushstrokes through specialized Q&A tasks for add/subtract/color operations
- Features bounding box coordinate normalization for precise stroke localization
- Implements streamlined single-word/phrase outputs for real-time performance
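
As a concrete example of that coordinate normalization, here is a tiny sketch in which a brushstroke's pixel-space bounding box is rescaled to [0, 1] relative to the image size; the function name and rounding are my own, not the paper's.

```python
def normalize_bbox(x0: int, y0: int, x1: int, y1: int,
                   width: int, height: int, ndigits: int = 3) -> list[float]:
    # Rescale pixel coordinates to the unit square so stroke localization
    # is independent of the image resolution.
    return [
        round(x0 / width, ndigits),
        round(y0 / height, ndigits),
        round(x1 / width, ndigits),
        round(y1 / height, ndigits),
    ]

# Example: a stroke box on a 512x512 canvas.
print(normalize_bbox(64, 128, 200, 300, 512, 512))  # -> [0.125, 0.25, 0.391, 0.586]
```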

3. Idea Collector:
- Built as a modular ReactJS component library
- Supports cross-platform deployment via HTTP protocols
- Compatible with Gradio and ComfyUI frameworks
- Features comprehensive layer management and parameter adjustment capabilities
- Implements real-time canvas updates and preview generation
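
As a hedged illustration of that kind of HTTP deployment, a minimal Gradio endpoint that a ReactJS frontend could call might look like this; `edit_image` is a placeholder, not MagicQuill's actual API.

```python
import gradio as gr
from PIL import Image

def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    # Placeholder: a real backend would run the Editing Processor here.
    return image

demo = gr.Interface(
    fn=edit_image,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Edit instruction")],
    outputs=gr.Image(type="pil"),
)

if __name__ == "__main__":
    demo.launch()  # serves both a web UI and a programmatic HTTP API
```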

The system outperforms existing solutions like SmartEdit and BrushNet in edge alignment and color fidelity while maintaining seamless integration with popular AI frameworks.

What are your thoughts on AI-powered creative tools?

Models

None public yet

Datasets

None public yet