Sparsh-X (tactile img-mic-imu-pressure) base-sized model
Sparsh-X is a transformer-based backbone that fuses the touch sensing modalities available in the Digit 360 sensor (tactile image, audio, IMU, and pressure). The model is trained using self-distillation SSL (a DINO loss) and bottleneck fusion, adapted specifically for the Digit 360 touch sensor.
Disclaimer: This model card was written by the Sparsh-X authors. The Transformer architecture and DINO objectives have been adapted for the multisensory touch sensing use case.
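For reference, here is a minimal sketch of a DINO-style self-distillation objective of the kind the training uses; the function names, temperatures, and momentum value are illustrative, not the exact Sparsh-X training code:

```python
# Minimal DINO-style self-distillation sketch (illustrative, not the
# released training code). The student matches a centered, sharpened
# teacher distribution; the teacher is an EMA copy of the student.
import torch
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student output distributions."""
    teacher_probs = F.softmax((teacher_logits - center) / teacher_temp, dim=-1)
    student_logp = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters track the student via an exponential moving average."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```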
Model description
Sparsh-X is a transformer-based backbone where each input signal is first processed independently for $L_f$ layers through self-attention. Thereafter, we allow cross-modal information flow via attention bottlenecks. Specifically, we concatenate $B$ bottleneck fusion tokens to each modality's embedding for the subsequent $L_b$ blocks. After each cross-modal update, the fusion tokens are averaged across modalities to promote information sharing. Intuitively, the bottleneck tokens act as multimodal summarizers, distilling and exchanging information between tactile modalities within each transformer block.
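A minimal sketch of one bottleneck-fusion block, assuming standard PyTorch transformer layers; the module names and hyperparameters are illustrative, not the released implementation:

```python
# Illustrative attention-bottleneck fusion layer (names/shapes assumed,
# not the released code). Each modality attends jointly over its own
# tokens plus the shared fusion tokens; the fusion tokens are then
# averaged across modalities.
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    def __init__(self, dim=768, num_heads=12, num_modalities=4):
        super().__init__()
        # One transformer encoder block per modality.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
            for _ in range(num_modalities)
        )

    def forward(self, tokens_per_mod, fusion):
        # tokens_per_mod: list of (batch, n_m, dim); fusion: (batch, B, dim)
        updated_tokens, updated_fusion = [], []
        for block, x in zip(self.blocks, tokens_per_mod):
            z = block(torch.cat([x, fusion], dim=1))   # joint self-attention
            updated_tokens.append(z[:, : x.shape[1]])  # modality tokens
            updated_fusion.append(z[:, x.shape[1]:])   # this modality's bottleneck view
        # Average fusion tokens across modalities to share information.
        fusion = torch.stack(updated_fusion).mean(dim=0)
        return updated_tokens, fusion
```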
The inputs to Sparsh-X are the image, audio, accelerometer, and pressure streams recorded by the Digit 360 sensor. Tactile images are sampled at 30 fps and passed to the model with a temporal stride of 5, concatenated along the channel dimension. We crop the fish-eye image to zoom in and resize it to 224 × 224 × 3. Image patches (16 × 16) are then tokenized into 768-dimensional embeddings through a linear projection layer. Audio comes from two contact microphones sampled at 48 kHz. A 0.55 s window of the audio signal is converted into a 128-channel log-mel spectrogram computed with a 5 ms Hamming window and a 2.5 ms hop length. We concatenate the spectrograms from both microphones, resulting in an audio input of 224 × 256, which is further tokenized with a patch size of 16. IMU data from the 3-axis accelerometer is sampled at 400 Hz over a 0.55 s window. The pressure signal is sampled at 200 Hz over a 1.1 s window. Both signals are tokenized as 224 × 3 and 224 × 1 temporal inputs, respectively.
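As a concrete illustration of the audio front end, the following sketch reproduces the stated parameters with torchaudio; the FFT size, padding, and normalization are our assumptions, not the released preprocessing code:

```python
# Sketch of the audio front end (48 kHz, 5 ms Hamming window, 2.5 ms hop,
# 128 mel channels). n_fft, padding, and the log epsilon are assumptions.
import torch
import torch.nn.functional as F
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=48_000,
    n_fft=512,          # FFT size (assumed); window is zero-padded to it
    win_length=240,     # 5 ms window at 48 kHz
    hop_length=120,     # 2.5 ms hop
    n_mels=128,
    window_fn=torch.hamming_window,
)

wave = torch.randn(2, 26_400)            # two contact mics, 0.55 s at 48 kHz
spec = (mel(wave) + 1e-6).log()          # (2, 128, ~221) log-mel frames
spec = F.pad(spec, (0, 224 - spec.shape[-1]))     # pad time axis to 224
audio_input = torch.cat(list(spec), dim=0).T      # (224, 256): both mics stacked
```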
Intended uses & limitations
You can use the Sparsh-X model to extract touch representations for the Digit 360 sensor, fusing all sensing modalities. You have two options:
- Use the frozen Sparsh-X encoder: This allows you to leverage the pre-trained weights of the Sparsh-X model without modifying them.
- Fine-tune the Sparsh-X encoder: You can fine-tune the Sparsh-X encoder along with the training of your downstream task, allowing the model to adapt to your specific use case.
Both options enable you to take advantage of the powerful touch representations learned by the Sparsh-X model; a minimal sketch of each is shown below.
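As a rough illustration of the two options, here is a hedged PyTorch sketch with a stand-in encoder and task head; the actual loading API is in the GitHub repository referenced below:

```python
# Hedged sketch of both usage modes. The encoder and head here are
# generic stand-ins, not the Sparsh-X API -- see the repo for the real one.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(           # stand-in for the Sparsh-X encoder
    nn.TransformerEncoderLayer(768, 12, batch_first=True), num_layers=2
)
task_head = nn.Linear(768, 10)             # example downstream head
tokens = torch.randn(8, 16, 768)           # placeholder fused touch tokens

# Option 1: frozen encoder -- extract features without updating weights.
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)
with torch.no_grad():
    feats = encoder(tokens)

# Option 2: fine-tune -- optimize encoder and task head jointly.
encoder.train()
for p in encoder.parameters():
    p.requires_grad_(True)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(task_head.parameters()), lr=1e-4
)
```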
How to Use
For detailed instructions on how to load the encoder and integrate it into your downstream task, please refer to our GitHub repository.
Citation
@inproceedings{higuera2025tactile,
  title={Tactile Beyond Pixels: Multisensory Touch Representations for Robot Manipulation},
  author={Carolina Higuera and Akash Sharma and Taosha Fan and Chaithanya Krishna Bodduluri and Byron Boots and Michael Kaess and Mike Lambeta and Tingfan Wu and Zixi Liu and Francois Robert Hogan and Mustafa Mukadam},
  booktitle={9th Annual Conference on Robot Learning},
  year={2025},
  url={https://openreview.net/forum?id=sMs4pJYhWi}
}

@article{lambeta2024digitizing,
  title={Digitizing touch with an artificial multimodal fingertip},
  author={Lambeta, Mike and Wu, Tingfan and Sengul, Ali and Most, Victoria Rose and Black, Nolan and Sawyer, Kevin and Mercado, Romeo and Qi, Haozhi and Sohn, Alexander and Taylor, Byron and others},
  journal={arXiv preprint arXiv:2411.02479},
  year={2024}
}