Processing and acquisition traces in visual encoders: What does CLIP know about your camera?
Abstract
Visual encoders systematically encode subtle image-acquisition parameters, and these traces can significantly affect semantic predictions depending on how they correlate with semantic labels.
Prior work has analyzed the robustness of visual encoders to image transformations and corruptions, particularly when such alterations are not seen during training. In that case, these alterations introduce a distribution shift at test time that often degrades performance. The primary focus has been on severe corruptions that, applied aggressively, distort the signals needed for accurate semantic predictions. We take a different perspective and analyze parameters of the image acquisition process, along with transformations that may be subtle or even imperceptible to the human eye. We find that such parameters are systematically encoded in the learned visual representations and can be easily recovered. More strikingly, their presence can have a profound impact, positive or negative, on semantic predictions, depending on whether the semantic labels are strongly correlated or anti-correlated with these acquisition- or processing-based labels. Our code and data are available at: https://github.com/ryan-caesar-ramos/visual-encoder-traces
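As a concrete illustration of the "easily recovered" claim, the sketch below fits a linear probe on frozen CLIP image embeddings to predict one processing trace, the JPEG re-encoding quality. This is a minimal sketch, not the authors' exact protocol: the checkpoint, the quality levels, and the `images` placeholder (any list of PIL images) are assumptions chosen for illustration.

```python
# Illustrative sketch (not the paper's exact protocol): probe frozen CLIP
# embeddings for a subtle processing trace, here JPEG re-encoding quality.
# Requires: torch, transformers, scikit-learn, Pillow.
import io

import torch
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(pil_images):
    """Return L2-normalized CLIP image embeddings for a list of PIL images."""
    inputs = processor(images=pil_images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def jpeg_roundtrip(img, quality):
    """Re-encode an image at a given JPEG quality to imprint a processing trace."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# `images` is an assumed placeholder: any list of PIL images from a dataset.
# Label each image by the JPEG quality level used to re-encode it.
qualities = [30, 95]  # two quality classes, "low" vs. "high"
X_imgs, y = [], []
for img in images:
    for label, q in enumerate(qualities):
        X_imgs.append(jpeg_roundtrip(img, q))
        y.append(label)
X = embed(X_imgs)

# A simple linear probe: accuracy above chance means the trace is
# linearly recoverable from the frozen embedding.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```

If the probe scores well above chance, the processing trace is linearly decodable from the embedding, which is the kind of recoverability the abstract describes.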
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Perceptual Classifiers: Detecting Generative Images using Perceptual Features (2025)
- CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions (2025)
- Self-Supervised YOLO: Leveraging Contrastive Learning for Label-Efficient Object Detection (2025)
- CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models (2025)
- DIP: Unsupervised Dense In-Context Post-training of Visual Representations (2025)
- Not All Attention Heads Are What You Need: Refining CLIP's Image Representation with Attention Ablation (2025)
- Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning (2025)