- LPF: A Language-Prior Feedback Objective Function for De-biased Visual Question Answering Most existing Visual Question Answering (VQA) systems tend to rely too heavily on language bias and hence fail to reason from visual clues. To address this issue, we propose a novel Language-Prior Feedback (LPF) objective function that re-balances the proportion of each answer's loss value in the total VQA loss. The LPF first calculates a modulating factor to determine the language bias using a question-only branch. Then, the LPF assigns a self-adaptive weight to each training sample in the training process. With this reweighting mechanism, the LPF ensures that the total VQA loss can be reshaped to a more balanced form. In this way, the samples that require visual information to predict are used efficiently during training. Our method is simple to implement, model-agnostic, and end-to-end trainable. We conduct extensive experiments and the results show that the LPF (1) brings a significant improvement over various VQA models, and (2) achieves competitive performance on the bias-sensitive VQA-CP v2 benchmark. 3 authors · May 29, 2021
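
The reweighting idea in the LPF abstract above can be illustrated with a minimal sketch. The abstract does not give the formula, so everything here is an assumption: the weight is taken to be a focal-loss-style factor `(1 - beta) ** gamma`, where `beta` is the probability that a question-only branch assigns to the ground-truth answer. The function names `softmax` and `lpf_style_loss` and the parameter `gamma` are hypothetical and may not match the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lpf_style_loss(vqa_logits, question_only_logits, answer_idx, gamma=2.0):
    """Reweighted cross-entropy for one sample (illustrative assumption).

    beta is the probability the question-only branch gives the true answer,
    used here as a proxy for how strongly language priors explain the sample.
    The weight (1 - beta) ** gamma is an assumed focal-loss-style factor.
    """
    beta = softmax(question_only_logits)[answer_idx]
    weight = (1.0 - beta) ** gamma
    cross_entropy = -math.log(softmax(vqa_logits)[answer_idx])
    return weight * cross_entropy, weight


# A sample the question-only branch already answers gets a small weight;
# a sample it gets wrong keeps a weight close to 1.
loss_biased, w_biased = lpf_style_loss([2.0, 0.5, 0.1], [5.0, 0.0, 0.0], 0)
loss_hard, w_hard = lpf_style_loss([2.0, 0.5, 0.1], [0.0, 0.0, 5.0], 0)
```

Under this assumed form, setting `gamma=0` recovers plain cross-entropy, and larger values shrink the contribution of samples that language priors alone can answer.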
- OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person Virtual Try-On (VTON) has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle with generating high-fidelity and detail-consistent results. While diffusion models, such as Stable Diffusion series, have shown their capability in creating high-quality and photorealistic images, they encounter formidable challenges in conditional generation scenarios like VTON. Specifically, these models struggle to maintain a balance between control and consistency when generating images for virtual clothing trials. OutfitAnyone addresses these limitations by leveraging a two-stream conditional diffusion model, enabling it to adeptly handle garment deformation for more lifelike results. It distinguishes itself with scalability-modulating factors such as pose, body shape and broad applicability, extending from anime to in-the-wild images. OutfitAnyone's performance in diverse scenarios underscores its utility and readiness for real-world deployment. For more details and animated results, please see https://humanaigc.github.io/outfit-anyone/. 11 authors · Jul 23, 2024
- Modulated Diffusion: Accelerating Generative Modeling with Modulated Quantization Diffusion models have emerged as powerful generative models, but their high computation cost in iterative sampling remains a significant bottleneck. In this work, we present an in-depth and insightful study of state-of-the-art acceleration techniques for diffusion models, including caching and quantization, revealing their limitations in computation error and generation quality. To break these limits, this work introduces Modulated Diffusion (MoDiff), an innovative, rigorous, and principled framework that accelerates generative modeling through modulated quantization and error compensation. MoDiff not only inherits the advantages of existing caching and quantization methods but also serves as a general framework to accelerate all diffusion models. The advantages of MoDiff are supported by solid theoretical insight and analysis. In addition, extensive experiments on CIFAR-10 and LSUN demonstrate that MoDiff significantly reduces activation quantization from 8 bits to 3 bits without performance degradation in post-training quantization (PTQ). Our code implementation is available at https://github.com/WeizhiGao/MoDiff. 6 authors · Jun 17
- Unveiling Hallucination in Text, Image, Video, and Audio Foundation Models: A Comprehensive Survey The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area. 6 authors · May 15, 2024
- Interventional Causal Representation Learning Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect do interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure. 4 authors · Sep 24, 2022
- Modeling Temperature, Frequency, and Strain Effects on the Linear Electro-Optic Coefficients of Ferroelectric Oxides An electro-optic modulator modulates the propagation of light in a material with an electric field and enables seamless connection between electronics-based computing and photonics-based communication. The search for materials with large electro-optic coefficients and low optical loss is critical to increase the efficiency and minimize the size of electro-optic devices. We present a semi-empirical method to compute the electro-optic coefficients of ferroelectric materials by combining first-principles density-functional theory calculations with Landau-Devonshire phenomenological modeling. We apply the method to study the electro-optic constants, also called Pockels coefficients, of three paradigmatic ferroelectric oxides: BaTiO3, LiNbO3, and LiTaO3. We present their temperature-, frequency- and strain-dependent electro-optic tensors calculated using our method. The predicted electro-optic constants agree with the experimental results, where available, and provide benchmarks for experimental verification. 5 authors · Jun 5, 2021
- Boosting Multi-modal Model Performance with Adaptive Gradient Modulation While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods can only apply to late fusion models. More importantly, the mechanism of the modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to have a quantitative understanding of the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function that is designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality on which the competition is weaker than other modalities. However, this preferred modality need not dominate others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023. 6 authors · Aug 15, 2023
- Practical considerations for high-fidelity wavefront shaping experiments Wavefront shaping is a technique for directing light through turbid media. The theoretical aspects of wavefront shaping are well understood, and under near-ideal experimental conditions, accurate predictions for the expected signal enhancement can be given. In practice, however, there are many experimental factors that negatively affect the outcome of the experiment. Here, we present a comprehensive overview of these experimental factors, including the effect of sample scattering properties, noise, and response of the spatial light modulator. We present simple means to identify experimental imperfections and to minimize their negative effect on the outcome of the experiment. This paper is accompanied by Python code for automatically quantifying experimental problems using the OpenWFS framework for running and simulating wavefront shaping experiments. 3 authors · Mar 22, 2024
- Probing solar modulation of AMS-02 time-dependent D, ^3He and ^4He fluxes with modified force field approximation The AMS-02 experiment recently published time-dependent fluxes of deuterons (D) from May 2011 to April 2021, divided into 33 periods of four Bartels rotations each. These temporal structures are associated with solar modulation. In this study, three modified force-field approximations (FFA) are employed to examine the long-term behavior of cosmic-ray (CR) isotopes such as D, ^3He, and ^4He, as well as the ratios D/^3He and ^3He/^4He. The solar modulation potential is rigidity-dependent in these modified force-field approximation models. Because the local interstellar spectrum (LIS) is unknown for these isotopes, we use the Non-LIS method for solar modulation. By fitting to the AMS-02 time-dependent fluxes, we derive the solar modulation parameters. Our findings confirm the assumption in the literature that all isotopes can be fitted using the same solar modulation parameters, and show that the modified FFA models are valid parametrizations of solar modulation. Based on these, we forecast the daily fluxes of D, ^3He and ^4He from 2011 to 2020. 2 authors · Feb 14
- Wave optics lensing of gravitational waves: theory and phenomenology of triple systems in the LISA band We study lensing of gravitational waves by a black hole in the deep wave optics regime, i.e. when the wavelength is much larger than the black hole Schwarzschild radius. We apply it to triple systems, with a binary of stellar mass objects in the inspiraling phase orbiting around a central massive black hole. We describe the full polarisation structure of the wave and derive predictions for the polarisation modes of the scattered wave measured by the observer. We show that lensing in the wave optics regime is not helicity preserving, as opposed to lensing in the geometric optics regime. The amplitude of the total wave is modulated due to interference between the directly transmitted and lensed components. The relative amplitude of the modulation is fixed by the lensing geometry and can reach unity in the most favourable settings. This indicates that wave optics lensing is potentially detectable by LISA for sufficiently high SNR systems. Our findings show that in the wave optics regime it is necessary to go beyond the usual lensing description where the amplification factor is assumed to be the same for both helicity modes. While motivated by GW190521 and the AGN formation scenario, our results apply more broadly to stellar-mass binaries orbiting a third body described as a Schwarzschild black hole, with a period comparable to the GW observation time. 4 authors · Apr 10, 2024
- MetaModulation: Learning Variational Feature Hierarchies for Few-Shot Learning with Fewer Tasks Meta-learning algorithms are able to learn a new task using previously learned knowledge, but they often require a large number of meta-training tasks which may not be readily available. To address this issue, we propose a method for few-shot learning with fewer tasks, which we call MetaModulation. The key idea is to use a neural network to increase the density of the meta-training tasks by modulating batch normalization parameters during meta-training. Additionally, we modify parameters at various network levels, rather than just a single layer, to increase task diversity. To account for the uncertainty caused by the limited training tasks, we propose a variational MetaModulation where the modulation parameters are treated as latent variables. We also introduce learning variational feature hierarchies by the variational MetaModulation, which modulates features at all layers and can consider task uncertainty and generate more diverse tasks. The ablation studies illustrate the advantages of utilizing a learnable task modulation at different levels and demonstrate the benefit of incorporating probabilistic variants in few-task meta-learning. Our MetaModulation and its variational variants consistently outperform state-of-the-art alternatives on four few-task meta-learning benchmarks. 6 authors · May 17, 2023
- Feature diversity in self-supervised learning Many studies on scaling laws consider basic factors such as model size, model shape, dataset size, and compute power. These factors are easily tunable and represent the fundamental elements of any machine learning setup. But researchers have also employed more complex factors to estimate the test error and generalization performance with high predictability. These factors are generally specific to the domain or application. For example, feature diversity was primarily used for promoting syn-to-real transfer by Chen et al. (2021). With numerous scaling factors defined in previous works, it would be interesting to investigate how these factors may affect overall generalization performance in the context of self-supervised learning with CNN models. How do individual factors, such as depth, width, or the number of training epochs with early stopping, promote generalization? For example, does the finding that higher feature diversity results in higher accuracy hold in complex settings other than syn-to-real transfer? How do these factors depend on each other? We found that the last layer is the most diversified throughout the training. However, while the model's test error decreases with increasing epochs, its diversity drops. We also discovered that diversity is directly related to model width. 2 authors · Sep 2, 2022