new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Oct 29

GW-YOLO: Multi-transient segmentation in LIGO using computer vision

Time series data and their time-frequency representation from gravitational-wave interferometers present multiple opportunities for the use of artificial intelligence methods associated with signal and image processing. Closely connected with this is the real-time aspect associated with gravitational-wave interferometers and the astrophysical observations they perform; the discovery potential of these instruments can be significantly enhanced when data processing can be achieved in O(1s) timescales. In this work, we introduce a novel signal and noise identification tool based on the YOLO (You Only Look Once) object detection framework. For its application into gravitational waves, we will refer to it as GW-YOLO. This tool can provide scene identification capabilities and essential information regarding whether an observed transient is any combination of noise and signal. Additionally, it supplies detailed time-frequency coordinates of the detected objects in the form of pixel masks, an essential property that can be used to understand and characterize astrophysical sources, as well as instrumental noise. The simultaneous identification of noise and signal, combined with precise pixel-level localization, represents a significant advancement in gravitational-wave data analysis. Our approach yields a 50\% detection efficiency for binary black hole signals at a signal-to-noise ratio (SNR) of 15 when such signals overlap with transient noise artifacts. When noise artifacts overlap with binary neutron star signals, our algorithm attains 50\% detection efficiency at an SNR of 30. This presents the first quantitative assessment of the ability to detect astrophysical events overlapping with realistic, instrument noise present in gravitational-wave interferometers.

  • 3 authors
·
Aug 24

Peakbagging the K2 KEYSTONE sample with PBjam: characterising the individual mode frequencies in solar-like oscillators

The pattern of individual mode frequencies in solar-like oscillators provides valuable insight into their properties and interior structures. The identification and characterisation of these modes requires high signal-to-noise and frequency resolution. The KEYSTONE project unlocks the asteroseismic potential of the K2 mission by providing individually reduced, high-quality time series data, global asteroseismic parameters, and spectroscopic analysis for 173 solar-like oscillators. In this work, we build on the KEYSTONE project and present the first analysis of the pattern of individual modes in the oscillation spectra for the K2 KEYSTONE stars. We perform a robust identification and characterisation of the modes through peakbagging methods in the open-source analysis tool PBjam. We present over 6000 mode frequencies, widths, and heights for 168 stars in the sample, covering the HR diagram from FGK dwarfs to sub-giants and the lower red giant branch, providing a significant increase in the number of individual mode frequency detections for main sequence and sub-giant oscillators. This study also presents sample-wide trends of oscillation patterns as a function of the fundamental stellar properties, and improves the precision of the global asteroseismic parameters. These measurements are part of the legacy of the K2 mission, and can be used to perform detailed modelling to improve the precision of fundamental properties of these stars. The results of this analysis provides evidence for the validity of using PBjam to identify and characterise the modes resulting from the observations of the future PLATO mission.

  • 8 authors
·
Oct 24

When the signal is in the noise: Exploiting Diffix's Sticky Noise

Anonymized data is highly valuable to both businesses and researchers. A large body of research has however shown the strong limits of the de-identification release-and-forget model, where data is anonymized and shared. This has led to the development of privacy-preserving query-based systems. Based on the idea of "sticky noise", Diffix has been recently proposed as a novel query-based mechanism satisfying alone the EU Article~29 Working Party's definition of anonymization. According to its authors, Diffix adds less noise to answers than solutions based on differential privacy while allowing for an unlimited number of queries. This paper presents a new class of noise-exploitation attacks, exploiting the noise added by the system to infer private information about individuals in the dataset. Our first differential attack uses samples extracted from Diffix in a likelihood ratio test to discriminate between two probability distributions. We show that using this attack against a synthetic best-case dataset allows us to infer private information with 89.4% accuracy using only 5 attributes. Our second cloning attack uses dummy conditions that conditionally strongly affect the output of the query depending on the value of the private attribute. Using this attack on four real-world datasets, we show that we can infer private attributes of at least 93% of the users in the dataset with accuracy between 93.3% and 97.1%, issuing a median of 304 queries per user. We show how to optimize this attack, targeting 55.4% of the users and achieving 91.7% accuracy, using a maximum of only 32 queries per user. Our attacks demonstrate that adding data-dependent noise, as done by Diffix, is not sufficient to prevent inference of private attributes. We furthermore argue that Diffix alone fails to satisfy Art. 29 WP's definition of anonymization. [...]

  • 5 authors
·
Apr 18, 2018

An OFDM Signal Identification Method for Wireless Communications Systems

Distinction of OFDM signals from single carrier signals is highly important for adaptive receiver algorithms and signal identification applications. OFDM signals exhibit Gaussian characteristics in time domain and fourth order cumulants of Gaussian distributed signals vanish in contrary to the cumulants of other signals. Thus fourth order cumulants can be utilized for OFDM signal identification. In this paper, first, formulations of the estimates of the fourth order cumulants for OFDM signals are provided. Then it is shown these estimates are affected significantly from the wireless channel impairments, frequency offset, phase offset and sampling mismatch. To overcome these problems, a general chi-square constant false alarm rate Gaussianity test which employs estimates of cumulants and their covariances is adapted to the specific case of wireless OFDM signals. Estimation of the covariance matrix of the fourth order cumulants are greatly simplified peculiar to the OFDM signals. A measurement setup is developed to analyze the performance of the identification method and for comparison purposes. A parametric measurement analysis is provided depending on modulation order, signal to noise ratio, number of symbols, and degree of freedom of the underlying test. The proposed method outperforms statistical tests which are based on fixed thresholds or empirical values, while a priori information requirement and complexity of the proposed method are lower than the coherent identification techniques.

  • 2 authors
·
Dec 29, 2014 2

RFUAV: A Benchmark Dataset for Unmanned Aerial Vehicle Detection and Identification

In this paper, we propose RFUAV as a new benchmark dataset for radio-frequency based (RF-based) unmanned aerial vehicle (UAV) identification and address the following challenges: Firstly, many existing datasets feature a restricted variety of drone types and insufficient volumes of raw data, which fail to meet the demands of practical applications. Secondly, existing datasets often lack raw data covering a broad range of signal-to-noise ratios (SNR), or do not provide tools for transforming raw data to different SNR levels. This limitation undermines the validity of model training and evaluation. Lastly, many existing datasets do not offer open-access evaluation tools, leading to a lack of unified evaluation standards in current research within this field. RFUAV comprises approximately 1.3 TB of raw frequency data collected from 37 distinct UAVs using the Universal Software Radio Peripheral (USRP) device in real-world environments. Through in-depth analysis of the RF data in RFUAV, we define a drone feature sequence called RF drone fingerprint, which aids in distinguishing drone signals. In addition to the dataset, RFUAV provides a baseline preprocessing method and model evaluation tools. Rigorous experiments demonstrate that these preprocessing methods achieve state-of-the-art (SOTA) performance using the provided evaluation tools. The RFUAV dataset and baseline implementation are publicly available at https://github.com/kitoweeknd/RFUAV/.

  • 7 authors
·
Mar 11

Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought

During both pretraining and fine-tuning, Large Language Models (LLMs) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out ``low-quality'' or noisy training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (CoT) impacts task performance in the highly-controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (TInt) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: static noise, a local form of noise which is applied after the CoT trace is computed, and dynamic noise, a global form of noise which propagates errors in the trace as it is computed. We then evaluate the test performance of pretrained models both prompted and fine-tuned on noised datasets with varying levels of dataset contamination and intensity. We find fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise. In contrast, few-shot prompted models appear more sensitive to even static noise. We conclude with a discussion of how our findings impact noise filtering best-practices, in particular emphasizing the importance of removing samples containing destructive dynamic noise with global errors.

  • 2 authors
·
Feb 6, 2024

Noise in Relation Classification Dataset TACRED: Characterization and Reduction

The overarching objective of this paper is two-fold. First, to explore model-based approaches to characterize the primary cause of the noise. in the RE dataset TACRED Second, to identify the potentially noisy instances. Towards the first objective, we analyze predictions and performance of state-of-the-art (SOTA) models to identify the root cause of noise in the dataset. Our analysis of TACRED shows that the majority of the noise in the dataset originates from the instances labeled as no-relation which are negative examples. For the second objective, we explore two nearest-neighbor-based strategies to automatically identify potentially noisy examples for elimination and reannotation. Our first strategy, referred to as Intrinsic Strategy (IS), is based on the assumption that positive examples are clean. Thus, we have used false-negative predictions to identify noisy negative examples. Whereas, our second approach, referred to as Extrinsic Strategy, is based on using a clean subset of the dataset to identify potentially noisy negative examples. Finally, we retrained the SOTA models on the eliminated and reannotated dataset. Our empirical results based on two SOTA models trained on TACRED-E following the IS show an average 4% F1-score improvement, whereas reannotation (TACRED-R) does not improve the original results. However, following ES, SOTA models show the average F1-score improvement of 3.8% and 4.4% when trained on respective eliminated (TACRED-EN) and reannotated (TACRED-RN) datasets respectively. We further extended the ES for cleaning positive examples as well, which resulted in an average performance improvement of 5.8% and 5.6% for the eliminated (TACRED-ENP) and reannotated (TACRED-RNP) datasets respectively.

  • 3 authors
·
Nov 20, 2023

A Dataset of Dynamic Reverberant Sound Scenes with Directional Interferers for Sound Event Localization and Detection

This report presents the dataset and baseline of Task 3 of the DCASE2021 Challenge on Sound Event Localization and Detection (SELD). The dataset is based on emulation of real recordings of static or moving sound events under real conditions of reverberation and ambient noise, using spatial room impulse responses captured in a variety of rooms and delivered in two spatial formats. The acoustical synthesis remains the same as in the previous iteration of the challenge, however the new dataset brings more challenging conditions of polyphony and overlapping instances of the same class. The most important difference of the new dataset is the introduction of directional interferers, meaning sound events that are localized in space but do not belong to the target classes to be detected and are not annotated. Since such interfering events are expected in every real-world scenario of SELD, the new dataset aims to promote systems that deal with this condition effectively. A modified SELDnet baseline employing the recent ACCDOA representation of SELD problems accompanies the dataset and it is shown to outperform the previous one. The new dataset is shown to be significantly more challenging for both baselines according to all considered metrics. To investigate the individual and combined effects of ambient noise, interferers, and reverberation, we study the performance of the baseline on different versions of the dataset excluding or including combinations of these factors. The results indicate that by far the most detrimental effects are caused by directional interferers.

  • 6 authors
·
Jun 13, 2021

Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise T affects performance as the size of the training set P and the scale of initialization alpha are varied. For gradient descent, alpha is a key parameter that controls if the network is `lazy'(alphagg1) or instead learns features (alphall1). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the (alpha,T) plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing T or decreasing alpha both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that the characteristic temperature T_c where the noise of SGD starts affecting the trained model (and eventually performance) is a power law of P. We relate this finding with the observation that key dynamical quantities, such as the total variation of weights during training, depend on both T and P as power laws. These results indicate that a key effect of SGD noise occurs late in training by affecting the stopping process whereby all data are fitted. Indeed, we argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. A stronger signal and a longer training time are also required when the size of the training set P increases. We confirm these views in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.

  • 3 authors
·
Jan 31, 2023

Look Once to Hear: Target Speech Hearing with Noisy Examples

In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing to ignore all interfering speech and noise, but the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. This is however not well aligned with the hearable application domain since obtaining a clean example is challenging in real world scenarios, creating a unique user interface problem. We present the first enrollment interface where the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy, binaural example of the target speaker. This noisy example is used for enrollment and subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process 8 ms of audio chunks in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not cause performance degradation compared to clean examples, while being convenient and user-friendly. Taking a step back, this paper takes an important step towards enhancing the human auditory perception with artificial intelligence. We provide code and data at: https://github.com/vb000/LookOnceToHear.

  • 5 authors
·
May 10, 2024

Exploring Quality and Generalizability in Parameterized Neural Audio Effects

Deep neural networks have shown promise for music audio signal processing applications, often surpassing prior approaches, particularly as end-to-end models in the waveform domain. Yet results to date have tended to be constrained by low sample rates, noise, narrow domains of signal types, and/or lack of parameterized controls (i.e. "knobs"), making their suitability for professional audio engineering workflows still lacking. This work expands on prior research published on modeling nonlinear time-dependent signal processing effects associated with music production by means of a deep neural network, one which includes the ability to emulate the parameterized settings you would see on an analog piece of equipment, with the goal of eventually producing commercially viable, high quality audio, i.e. 44.1 kHz sampling rate at 16-bit resolution. The results in this paper highlight progress in modeling these effects through architecture and optimization changes, towards increasing computational efficiency, lowering signal-to-noise ratio, and extending to a larger variety of nonlinear audio effects. Toward these ends, the strategies employed involved a three-pronged approach: model speed, model accuracy, and model generalizability. Most of the presented methods provide marginal or no increase in output accuracy over the original model, with the exception of dataset manipulation. We found that limiting the audio content of the dataset, for example using datasets of just a single instrument, provided a significant improvement in model accuracy over models trained on more general datasets.

  • 2 authors
·
Jun 9, 2020

Through the Haze: a Non-Convex Approach to Blind Gain Calibration for Linear Random Sensing Models

Computational sensing strategies often suffer from calibration errors in the physical implementation of their ideal sensing models. Such uncertainties are typically addressed by using multiple, accurately chosen training signals to recover the missing information on the sensing model, an approach that can be resource-consuming and cumbersome. Conversely, blind calibration does not employ any training signal, but corresponds to a bilinear inverse problem whose algorithmic solution is an open issue. We here address blind calibration as a non-convex problem for linear random sensing models, in which we aim to recover an unknown signal from its projections on sub-Gaussian random vectors, each subject to an unknown positive multiplicative factor (or gain). To solve this optimisation problem we resort to projected gradient descent starting from a suitable, carefully chosen initialisation point. An analysis of this algorithm allows us to show that it converges to the exact solution provided a sample complexity requirement is met, i.e., relating convergence to the amount of information collected during the sensing process. Interestingly, we show that this requirement grows linearly (up to log factors) in the number of unknowns of the problem. This sample complexity is found both in absence of prior information, as well as when subspace priors are available for both the signal and gains, allowing a further reduction of the number of observations required for our recovery guarantees to hold. Moreover, in the presence of noise we show how our descent algorithm yields a solution whose accuracy degrades gracefully with the amount of noise affecting the measurements. Finally, we present some numerical experiments in an imaging context, where our algorithm allows for a simple solution to blind calibration of the gains in a sensor array.

  • 2 authors
·
Oct 27, 2016

Golden Noise for Diffusion Models: A Learning Framework

Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are ``golden noises'' that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the noise prompt learning framework that systematically learns ``prompted'' golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset~(NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network~(NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a golden noise instead of a random noise without accessing the original pipeline.

  • 6 authors
·
Nov 14, 2024

STARSS22: A dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events

This report presents the Sony-TAu Realistic Spatial Soundscapes 2022 (STARS22) dataset for sound event localization and detection, comprised of spatial recordings of real scenes collected in various interiors of two different sites. The dataset is captured with a high resolution spherical microphone array and delivered in two 4-channel formats, first-order Ambisonics and tetrahedral microphone array. Sound events in the dataset belonging to 13 target sound classes are annotated both temporally and spatially through a combination of human annotation and optical tracking. The dataset serves as the development and evaluation dataset for the Task 3 of the DCASE2022 Challenge on Sound Event Localization and Detection and introduces significant new challenges for the task compared to the previous iterations, which were based on synthetic spatialized sound scene recordings. Dataset specifications are detailed including recording and annotation process, target classes and their presence, and details on the development and evaluation splits. Additionally, the report presents the baseline system that accompanies the dataset in the challenge with emphasis on the differences with the baseline of the previous iterations; namely, introduction of the multi-ACCDOA representation to handle multiple simultaneous occurences of events of the same class, and support for additional improved input features for the microphone array format. Results of the baseline indicate that with a suitable training strategy a reasonable detection and localization performance can be achieved on real sound scene recordings. The dataset is available in https://zenodo.org/record/6387880.

  • 10 authors
·
Jun 4, 2022

Adversarial Approximate Inference for Speech to Electroglottograph Conversion

Speech produced by human vocal apparatus conveys substantial non-semantic information including the gender of the speaker, voice quality, affective state, abnormalities in the vocal apparatus etc. Such information is attributed to the properties of the voice source signal, which is usually estimated from the speech signal. However, most of the source estimation techniques depend heavily on the goodness of the model assumptions and are prone to noise. A popular alternative is to indirectly obtain the source information through the Electroglottographic (EGG) signal that measures the electrical admittance around the vocal folds using dedicated hardware. In this paper, we address the problem of estimating the EGG signal directly from the speech signal, devoid of any hardware. Sampling from the intractable conditional distribution of the EGG signal given the speech signal is accomplished through optimization of an evidence lower bound. This is constructed via minimization of the KL-divergence between the true and the approximated posteriors of a latent variable learned using a deep neural auto-encoder that serves an informative prior. We demonstrate the efficacy of the method at generating the EGG signal by conducting several experiments on datasets comprising multiple speakers, voice qualities, noise settings and speech pathologies. The proposed method is evaluated on many benchmark metrics and is found to agree with the gold standard while proving better than the state-of-the-art algorithms on a few tasks such as epoch extraction.

  • 3 authors
·
Mar 28, 2019 2

Noise2Score: Tweedie's Approach to Self-Supervised Image Denoising without Clean Images

Recently, there has been extensive research interest in training deep networks to denoise images without clean reference. However, the representative approaches such as Noise2Noise, Noise2Void, Stein's unbiased risk estimator (SURE), etc. seem to differ from one another and it is difficult to find the coherent mathematical structure. To address this, here we present a novel approach, called Noise2Score, which reveals a missing link in order to unite these seemingly different approaches. Specifically, we show that image denoising problems without clean images can be addressed by finding the mode of the posterior distribution and that the Tweedie's formula offers an explicit solution through the score function (i.e. the gradient of log likelihood). Our method then uses the recent finding that the score function can be stably estimated from the noisy images using the amortized residual denoising autoencoder, the method of which is closely related to Noise2Noise or Nose2Void. Our Noise2Score approach is so universal that the same network training can be used to remove noises from images that are corrupted by any exponential family distributions and noise parameters. Using extensive experiments with Gaussian, Poisson, and Gamma noises, we show that Noise2Score significantly outperforms the state-of-the-art self-supervised denoising methods in the benchmark data set such as (C)BSD68, Set12, and Kodak, etc.

  • 2 authors
·
Jun 13, 2021

Validation of artificial neural networks to model the acoustic behaviour of induction motors

In the last decade, the sound quality of electric induction motors is a hot topic in the research field. Specially, due to its high number of applications, the population is exposed to physical and psychological discomfort caused by the noise emission. Therefore, it is necessary to minimise its psychological impact on the population. In this way, the main goal of this work is to evaluate the use of multitask artificial neural networks as a modelling technique for simultaneously predicting psychoacoustic parameters of induction motors. Several inputs are used, such as, the electrical magnitudes of the motor power signal and the number of poles, instead of separating the noise of the electric motor from the environmental noise. Two different kind of artificial neural networks are proposed to evaluate the acoustic quality of induction motors, by using the equivalent sound pressure, the loudness, the roughness and the sharpness as outputs. Concretely, two different topologies have been considered: simple models and more complex models. The former are more interpretable, while the later lead to higher accuracy at the cost of hiding the cause-effect relationship. Focusing on the simple interpretable models, product unit neural networks achieved the best results: for MSE and for SEP. The main benefit of this product unit model is its simplicity, since only 10 inputs variables are used, outlining the effective transfer mechanism of multitask artificial neural networks to extract common features of multiple tasks. Finally, a deep analysis of the acoustic quality of induction motors in done using the best product unit neural networks.

  • 5 authors
·
Jan 27, 2024

FSD50K: An Open Dataset of Human-Labeled Sound Events

Most existing datasets for sound event recognition (SER) are relatively small and/or domain-specific, with the exception of AudioSet, based on over 2M tracks from YouTube videos and encompassing over 500 sound classes. However, AudioSet is not an open dataset as its official release consists of pre-computed audio features. Downloading the original audio tracks can be problematic due to YouTube videos gradually disappearing and usage rights issues. To provide an alternative benchmark dataset and thus foster SER research, we introduce FSD50K, an open dataset containing over 51k audio clips totalling over 100h of audio manually labeled using 200 classes drawn from the AudioSet Ontology. The audio clips are licensed under Creative Commons licenses, making the dataset freely distributable (including waveforms). We provide a detailed description of the FSD50K creation process, tailored to the particularities of Freesound data, including challenges encountered and solutions adopted. We include a comprehensive dataset characterization along with discussion of limitations and key factors to allow its audio-informed usage. Finally, we conduct sound event classification experiments to provide baseline systems as well as insight on the main factors to consider when splitting Freesound audio data for SER. Our goal is to develop a dataset to be widely adopted by the community as a new open benchmark for SER research.

  • 5 authors
·
Oct 1, 2020

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. Opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluating on a different corpus than used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online, see https://github.com/sp-uhh/sgmse

  • 5 authors
·
Aug 11, 2022

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.

  • 5 authors
·
Sep 6, 2020

Deciphering GunType Hierarchy through Acoustic Analysis of Gunshot Recordings

The escalating rates of gun-related violence and mass shootings represent a significant threat to public safety. Timely and accurate information for law enforcement agencies is crucial in mitigating these incidents. Current commercial gunshot detection systems, while effective, often come with prohibitive costs. This research explores a cost-effective alternative by leveraging acoustic analysis of gunshot recordings, potentially obtainable from ubiquitous devices like cell phones, to not only detect gunshots but also classify the type of firearm used. This paper details a study on deciphering gun type hierarchies using a curated dataset of 3459 recordings. We investigate the fundamental acoustic characteristics of gunshots, including muzzle blasts and shockwaves, which vary based on firearm type, ammunition, and shooting direction. We propose and evaluate machine learning frameworks, including Support Vector Machines (SVMs) as a baseline and a more advanced Convolutional Neural Network (CNN) architecture for joint gunshot detection and gun type classification. Results indicate that our deep learning approach achieves a mean average precision (mAP) of 0.58 on clean labeled data, outperforming the SVM baseline (mAP 0.39). Challenges related to data quality, environmental noise, and the generalization capabilities when using noisy web-sourced data (mAP 0.35) are also discussed. The long-term vision is to develop a highly accurate, real-time system deployable on common recording devices, significantly reducing detection costs and providing critical intelligence to first responders.

  • 4 authors
·
Jun 25

Are we certain it's anomalous?

The progress in modelling time series and, more generally, sequences of structured data has recently revamped research in anomaly detection. The task stands for identifying abnormal behaviors in financial series, IT systems, aerospace measurements, and the medical domain, where anomaly detection may aid in isolating cases of depression and attend the elderly. Anomaly detection in time series is a complex task since anomalies are rare due to highly non-linear temporal correlations and since the definition of anomalous is sometimes subjective. Here we propose the novel use of Hyperbolic uncertainty for Anomaly Detection (HypAD). HypAD learns self-supervisedly to reconstruct the input signal. We adopt best practices from the state-of-the-art to encode the sequence by an LSTM, jointly learned with a decoder to reconstruct the signal, with the aid of GAN critics. Uncertainty is estimated end-to-end by means of a hyperbolic neural network. By using uncertainty, HypAD may assess whether it is certain about the input signal but it fails to reconstruct it because this is anomalous; or whether the reconstruction error does not necessarily imply anomaly, as the model is uncertain, e.g. a complex but regular input signal. The novel key idea is that a detectable anomaly is one where the model is certain but it predicts wrongly. HypAD outperforms the current state-of-the-art for univariate anomaly detection on established benchmarks based on data from NASA, Yahoo, Numenta, Amazon, and Twitter. It also yields state-of-the-art performance on a multivariate dataset of anomaly activities in elderly home residences, and it outperforms the baseline on SWaT. Overall, HypAD yields the lowest false alarms at the best performance rate, thanks to successfully identifying detectable anomalies.

  • 7 authors
·
Nov 16, 2022

CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos

Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.

  • 5 authors
·
Dec 14, 2022

Dehazing Ultrasound using Diffusion Models

Echocardiography has been a prominent tool for the diagnosis of cardiac disease. However, these diagnoses can be heavily impeded by poor image quality. Acoustic clutter emerges due to multipath reflections imposed by layers of skin, subcutaneous fat, and intercostal muscle between the transducer and heart. As a result, haze and other noise artifacts pose a real challenge to cardiac ultrasound imaging. In many cases, especially with difficult-to-image patients such as patients with obesity, a diagnosis from B-Mode ultrasound imaging is effectively rendered unusable, forcing sonographers to resort to contrast-enhanced ultrasound examinations or refer patients to other imaging modalities. Tissue harmonic imaging has been a popular approach to combat haze, but in severe cases is still heavily impacted by haze. Alternatively, denoising algorithms are typically unable to remove highly structured and correlated noise, such as haze. It remains a challenge to accurately describe the statistical properties of structured haze, and develop an inference method to subsequently remove it. Diffusion models have emerged as powerful generative models and have shown their effectiveness in a variety of inverse problems. In this work, we present a joint posterior sampling framework that combines two separate diffusion models to model the distribution of both clean ultrasound and haze in an unsupervised manner. Furthermore, we demonstrate techniques for effectively training diffusion models on radio-frequency ultrasound data and highlight the advantages over image data. Experiments on both in-vitro and in-vivo cardiac datasets show that the proposed dehazing method effectively removes haze while preserving signals from weakly reflected tissue.

  • 6 authors
·
Jul 20, 2023

Differentially Private Sequential Learning

In a differentially private sequential learning setting, agents introduce endogenous noise into their actions to maintain privacy. Applying this to a standard sequential learning model leads to different outcomes for continuous vs. binary signals. For continuous signals with a nonzero privacy budget, we introduce a novel smoothed randomized response mechanism that adapts noise based on distance to a threshold, unlike traditional randomized response, which applies uniform noise. This enables agents' actions to better reflect both private signals and observed history, accelerating asymptotic learning speed to Theta_{epsilon}(log(n)), compared to Theta(log(n)) in the non-private regime where privacy budget is infinite. Moreover, in the non-private setting, the expected stopping time for the first correct decision and the number of incorrect actions diverge, meaning early agents may make mistakes for an unreasonably long period. In contrast, under a finite privacy budget epsilon in (0,1), both remain finite, highlighting a stark contrast between private and non-private learning. Learning with continuous signals in the private regime is more efficient, as smooth randomized response enhances the log-likelihood ratio over time, improving information aggregation. Conversely, for binary signals, differential privacy noise hinders learning, as agents tend to use a constant randomized response strategy before an information cascade forms, reducing action informativeness and hampering the overall process.

  • 2 authors
·
Feb 26

Prediction of speech intelligibility with DNN-based performance measures

This paper presents a speech intelligibility model based on automatic speech recognition (ASR), combining phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities. This model does not require the clean speech reference nor the word labels during testing as the ASR decoding step, which finds the most likely sequence of words given phoneme posterior probabilities, is omitted. The model is evaluated via the root-mean-squared error between the predicted and observed speech reception thresholds from eight normal-hearing listeners. The recognition task consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with eight noise maskers covering different modulation types, from speech-shaped stationary noise to a single-talker masker. The prediction performance is compared to five established models and an ASR-model using word labels. Two combinations of features and networks were tested. Both include temporal information either at the feature level (amplitude modulation filterbanks and a feed-forward network) or captured by the architecture (mel-spectrograms and a time-delay deep neural network, TDNN). The TDNN model is on par with the DNN while reducing the number of parameters by a factor of 37; this optimization allows parallel streams on dedicated hearing aid hardware as a forward-pass can be computed within the 10ms of each frame. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.

  • 5 authors
·
Mar 17, 2022

Physics-guided Noise Neural Proxy for Practical Low-light Raw Image Denoising

Recently, the mainstream practice for training low-light raw image denoising methods has shifted towards employing synthetic data. Noise modeling, which focuses on characterizing the noise distribution of real-world sensors, profoundly influences the effectiveness and practicality of synthetic data. Currently, physics-based noise modeling struggles to characterize the entire real noise distribution, while learning-based noise modeling impractically depends on paired real data. In this paper, we propose a novel strategy: learning the noise model from dark frames instead of paired real data, to break down the data dependency. Based on this strategy, we introduce an efficient physics-guided noise neural proxy (PNNP) to approximate the real-world sensor noise model. Specifically, we integrate physical priors into neural proxies and introduce three efficient techniques: physics-guided noise decoupling (PND), physics-guided proxy model (PPM), and differentiable distribution loss (DDL). PND decouples the dark frame into different components and handles different levels of noise flexibly, which reduces the complexity of noise modeling. PPM incorporates physical priors to constrain the generated noise, which promotes the accuracy of noise modeling. DDL provides explicit and reliable supervision for noise distribution, which promotes the precision of noise modeling. PNNP exhibits powerful potential in characterizing the real noise distribution. Extensive experiments on public datasets demonstrate superior performance in practical low-light raw image denoising. The code will be available at https://github.com/fenghansen/PNNP.

  • 6 authors
·
Oct 13, 2023

Unsupervised Real-World Denoising: Sparsity is All You Need

Supervised training for real-world denoising presents challenges due to the difficulty of collecting large datasets of paired noisy and clean images. Recent methods have attempted to address this by utilizing unpaired datasets of clean and noisy images. Some approaches leverage such unpaired data to train denoisers in a supervised manner by generating synthetic clean-noisy pairs. However, these methods often fall short due to the distribution gap between synthetic and real noisy images. To mitigate this issue, we propose a solution based on input sparsification, specifically using random input masking. Our method, which we refer to as Mask, Inpaint and Denoise (MID), trains a denoiser to simultaneously denoise and inpaint synthetic clean-noisy pairs. On one hand, input sparsification reduces the gap between synthetic and real noisy images. On the other hand, an inpainter trained in a supervised manner can still accurately reconstruct sparse inputs by predicting missing clean pixels using the remaining unmasked pixels. Our approach begins with a synthetic Gaussian noise sampler and iteratively refines it using a noise dataset derived from the denoiser's predictions. The noise dataset is created by subtracting predicted pseudo-clean images from real noisy images at each iteration. The core intuition is that improving the denoiser results in a more accurate noise dataset and, consequently, a better noise sampler. We validate our method through extensive experiments on real-world noisy image datasets, demonstrating competitive performance compared to existing unsupervised denoising methods.

  • 2 authors
·
Mar 27

Spectral Bottleneck in Deep Neural Networks: Noise is All You Need

Deep neural networks are known to exhibit a spectral learning bias, wherein low-frequency components are learned early in training, while high-frequency modes emerge more gradually in later epochs. However, when the target signal lacks low-frequency components and is dominated by broadband high frequencies, training suffers from a 'spectral bottleneck', and the model fails to reconstruct the entire signal, including the frequency components that lie within the network's representational capacity. We examine such a scenario in the context of implicit neural representations (INRs) with sinusoidal representation networks (SIRENs), focusing on the challenge of fitting high-frequency-dominant signals that are susceptible to spectral bottleneck. To effectively fit any target signal irrespective of it's frequency content, we propose a generalized target-aware 'weight perturbation scheme' (WINNER - weight initialization with noise for neural representations) for network initialization. The scheme perturbs uniformly initialized weights with Gaussian noise, where the noise scales are adaptively determined by the spectral centroid of the target signal. We show that the noise scales can provide control over the spectra of network activations and the eigenbasis of the empirical neural tangent kernel. This method not only addresses the spectral bottleneck but also yields faster convergence and with improved representation accuracy, outperforming state-of-the-art approaches in audio fitting and achieving notable gains in image fitting and denoising tasks. Beyond signal reconstruction, our approach opens new directions for adaptive weight initialization strategies in computer vision and scientific machine learning.

  • 5 authors
·
Sep 9

Early warning signals: The charted and uncharted territories

The realization that complex systems such as ecological communities can collapse or shift regimes suddenly and without rapid external forcing poses a serious challenge to our understanding and management of the natural world. The potential to identify early warning signals that would allow researchers and managers to predict such events before they happen has therefore been an invaluable discovery that offers a way forward in spite of such seemingly unpredictable behavior. Research into early warning signals has demonstrated that it is possible to define and detect such early warning signals in advance of a transition in certain contexts. Here we describe the pattern emerging as research continues to explore just how far we can generalize these results. A core of examples emerges that shares three properties: the phenomenon of rapid regime shifts, a pattern of 'critical slowing down' that can be used to detect the approaching shift, and a mechanism of bifurcation driving the sudden change. As research has expanded beyond these core examples, it is becoming clear that not all systems that show regime shifts exhibit critical slowing down, or vice versa. Even when systems exhibit critical slowing down, statistical detection is a challenge. We review the literature that explores these edge cases and highlight the need for (a) new early warning behaviors that can be used in cases where rapid shifts do not exhibit critical slowing down, (b) the development of methods to identify which behavior might be an appropriate signal when encountering a novel system; bearing in mind that a positive indication for some systems is a negative indication in others, and (c) statistical methods that can distinguish between signatures of early warning behaviors and noise.

  • 3 authors
·
May 29, 2013

Taming Visually Guided Sound Generation

Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds. Moreover, sampling 1 second of audio from the state-of-the-art model takes minutes on a high-end GPU. In this work, we propose a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU. We train a transformer to sample a new spectrogram from the pre-trained spectrogram codebook given the set of video features. The codebook is obtained using a variant of VQGAN trained to produce a compact sampling space with a novel spectrogram-based perceptual loss. The generated spectrogram is transformed into a waveform using a window-based GAN that significantly speeds up generation. Considering the lack of metrics for automatic evaluation of generated spectrograms, we also build a family of metrics called FID and MKL. These metrics are based on a novel sound classifier, called Melception, and designed to evaluate the fidelity and relevance of open-domain samples. Both qualitative and quantitative studies are conducted on small- and large-scale datasets to evaluate the fidelity and relevance of generated samples. We also compare our model to the state-of-the-art and observe a substantial improvement in quality, size, and computation time. Code, demo, and samples: v-iashin.github.io/SpecVQGAN

  • 2 authors
·
Oct 17, 2021

BearLLM: A Prior Knowledge-Enhanced Bearing Health Management Framework with Unified Vibration Signal Representation

We propose a bearing health management framework leveraging large language models (BearLLM), a novel multimodal model that unifies multiple bearing-related tasks by processing user prompts and vibration signals. Specifically, we introduce a prior knowledge-enhanced unified vibration signal representation to handle various working conditions across multiple datasets. This involves adaptively sampling the vibration signals based on the sampling rate of the sensor, incorporating the frequency domain to unify input dimensions, and using a fault-free reference signal as an auxiliary input. To extract features from vibration signals, we first train a fault classification network, then convert and align the extracted features into word embedding, and finally concatenate these with text embedding as input to an LLM. To evaluate the performance of the proposed method, we constructed the first large-scale multimodal bearing health management (MBHM) dataset, including paired vibration signals and textual descriptions. With our unified vibration signal representation, BearLLM using one set of pre-trained weights achieves state-of-the-art performance on nine publicly available fault diagnosis benchmarks, outperforming specific methods designed for individual datasets. We provide a dataset, our model, and code to inspire future research on building more capable industrial multimodal models (https://github.com/hatton613/BearLLM).

  • 5 authors
·
Aug 20, 2024

Self-Calibration and Bilinear Inverse Problems via Linear Least Squares

Whenever we use devices to take measurements, calibration is indispensable. While the purpose of calibration is to reduce bias and uncertainty in the measurements, it can be quite difficult, expensive, and sometimes even impossible to implement. We study a challenging problem called self-calibration, i.e., the task of designing an algorithm for devices so that the algorithm is able to perform calibration automatically. More precisely, we consider the setup y = A(d) x + epsilon where only partial information about the sensing matrix A(d) is known and where A(d) linearly depends on d. The goal is to estimate the calibration parameter d (resolve the uncertainty in the sensing process) and the signal/object of interests x simultaneously. For three different models of practical relevance, we show how such a bilinear inverse problem, including blind deconvolution as an important example, can be solved via a simple linear least squares approach. As a consequence, the proposed algorithms are numerically extremely efficient, thus potentially allowing for real-time deployment. We also present a variation of the least squares approach, which leads to a~spectral method, where the solution to the bilinear inverse problem can be found by computing the singular vector associated with the smallest singular value of a certain matrix derived from the bilinear system. Explicit theoretical guarantees and stability theory are derived for both techniques; and the number of sampling complexity is nearly optimal (up to a poly-log factor). Applications in imaging sciences and signal processing are discussed and numerical simulations are presented to demonstrate the effectiveness and efficiency of our approach.

  • 2 authors
·
Nov 13, 2016

Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling

Reasoning from sequences of raw sensory data is a ubiquitous problem across fields ranging from medical devices to robotics. These problems often involve using long sequences of raw sensor data (e.g. magnetometers, piezoresistors) to predict sequences of desirable physical quantities (e.g. force, inertial measurements). While classical approaches are powerful for locally-linear prediction problems, they often fall short when using real-world sensors. These sensors are typically non-linear, are affected by extraneous variables (e.g. vibration), and exhibit data-dependent drift. For many problems, the prediction task is exacerbated by small labeled datasets since obtaining ground-truth labels requires expensive equipment. In this work, we present Hierarchical State-Space Models (HiSS), a conceptually simple, new technique for continuous sequential prediction. HiSS stacks structured state-space models on top of each other to create a temporal hierarchy. Across six real-world sensor datasets, from tactile-based state prediction to accelerometer-based inertial measurement, HiSS outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba by at least 23% on MSE. Our experiments further indicate that HiSS demonstrates efficient scaling to smaller datasets and is compatible with existing data-filtering techniques. Code, datasets and videos can be found on https://hiss-csp.github.io.

  • 7 authors
·
Feb 15, 2024 1

ClearBuds: Wireless Binaural Earbuds for Learning-Based Speech Enhancement

We present ClearBuds, the first hardware and software system that utilizes a neural network to enhance speech streamed from two wireless earbuds. Real-time speech enhancement for wireless earbuds requires high-quality sound separation and background cancellation, operating in real-time and on a mobile phone. Clear-Buds bridges state-of-the-art deep learning for blind audio source separation and in-ear mobile systems by making two key technical contributions: 1) a new wireless earbud design capable of operating as a synchronized, binaural microphone array, and 2) a lightweight dual-channel speech enhancement neural network that runs on a mobile device. Our neural network has a novel cascaded architecture that combines a time-domain conventional neural network with a spectrogram-based frequency masking neural network to reduce the artifacts in the audio output. Results show that our wireless earbuds achieve a synchronization error less than 64 microseconds and our network has a runtime of 21.4 milliseconds on an accompanying mobile phone. In-the-wild evaluation with eight users in previously unseen indoor and outdoor multipath scenarios demonstrates that our neural network generalizes to learn both spatial and acoustic cues to perform noise suppression and background speech removal. In a user-study with 37 participants who spent over 15.4 hours rating 1041 audio samples collected in-the-wild, our system achieves improved mean opinion score and background noise suppression. Project page with demos: https://clearbuds.cs.washington.edu

  • 7 authors
·
Jun 27, 2022

Spatial-frequency channels, shape bias, and adversarial robustness

What spatial frequency information do humans and neural networks use to recognize objects? In neuroscience, critical band masking is an established tool that can reveal the frequency-selective filters used for object recognition. Critical band masking measures the sensitivity of recognition performance to noise added at each spatial frequency. Existing critical band masking studies show that humans recognize periodic patterns (gratings) and letters by means of a spatial-frequency filter (or "channel'') that has a frequency bandwidth of one octave (doubling of frequency). Here, we introduce critical band masking as a task for network-human comparison and test 14 humans and 76 neural networks on 16-way ImageNet categorization in the presence of narrowband noise. We find that humans recognize objects in natural images using the same one-octave-wide channel that they use for letters and gratings, making it a canonical feature of human object recognition. On the other hand, the neural network channel, across various architectures and training strategies, is 2-4 times as wide as the human channel. In other words, networks are vulnerable to high and low frequency noise that does not affect human performance. Adversarial and augmented-image training are commonly used to increase network robustness and shape bias. Does this training align network and human object recognition channels? Three network channel properties (bandwidth, center frequency, peak noise sensitivity) correlate strongly with shape bias (53% variance explained) and with robustness of adversarially-trained networks (74% variance explained). Adversarial training increases robustness but expands the channel bandwidth even further away from the human bandwidth. Thus, critical band masking reveals that the network channel is more than twice as wide as the human channel, and that adversarial training only increases this difference.

  • 4 authors
·
Sep 22, 2023

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators are currently unable to easily offer an out-of-the-box budget-efficient alternative to their users who might not need high-resolution images. We identify a key technical insight in diffusion models that when addressed can help tackle this limitation: Noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

  • 4 authors
·
Oct 2

Removing Neural Signal Artifacts with Autoencoder-Targeted Adversarial Transformers (AT-AT)

Electromyogenic (EMG) noise is a major contamination source in EEG data that can impede accurate analysis of brain-specific neural activity. Recent literature on EMG artifact removal has moved beyond traditional linear algorithms in favor of machine learning-based systems. However, existing deep learning-based filtration methods often have large compute footprints and prohibitively long training times. In this study, we present a new machine learning-based system for filtering EMG interference from EEG data using an autoencoder-targeted adversarial transformer (AT-AT). By leveraging the lightweight expressivity of an autoencoder to determine optimal time-series transformer application sites, our AT-AT architecture achieves a >90% model size reduction compared to published artifact removal models. The addition of adversarial training ensures that filtered signals adhere to the fundamental characteristics of EEG data. We trained AT-AT using published neural data from 67 subjects and found that the system was able to achieve comparable test performance to larger models; AT-AT posted a mean reconstructive correlation coefficient above 0.95 at an initial signal-to-noise ratio (SNR) of 2 dB and 0.70 at -7 dB SNR. Further research generalizing these results to broader sample sizes beyond these isolated test cases will be crucial; while outside the scope of this study, we also include results from a real-world deployment of AT-AT in the Appendix.

  • 1 authors
·
Feb 7

Forecasting Thermoacoustic Instabilities in Liquid Propellant Rocket Engines Using Multimodal Bayesian Deep Learning

The 100 MW cryogenic liquid oxygen/hydrogen multi-injector combustor BKD operated by the DLR Institute of Space Propulsion is a research platform that allows the study of thermoacoustic instabilities under realistic conditions, representative of small upper stage rocket engines. We use data from BKD experimental campaigns in which the static chamber pressure and fuel-oxidizer ratio are varied such that the first tangential mode of the combustor is excited under some conditions. We train an autoregressive Bayesian neural network model to forecast the amplitude of the dynamic pressure time series, inputting multiple sensor measurements (injector pressure/ temperature measurements, static chamber pressure, high-frequency dynamic pressure measurements, high-frequency OH* chemiluminescence measurements) and future flow rate control signals. The Bayesian nature of our algorithms allows us to work with a dataset whose size is restricted by the expense of each experimental run, without making overconfident extrapolations. We find that the networks are able to accurately forecast the evolution of the pressure amplitude and anticipate instability events on unseen experimental runs 500 milliseconds in advance. We compare the predictive accuracy of multiple models using different combinations of sensor inputs. We find that the high-frequency dynamic pressure signal is particularly informative. We also use the technique of integrated gradients to interpret the influence of different sensor inputs on the model prediction. The negative log-likelihood of data points in the test dataset indicates that predictive uncertainties are well-characterized by our Bayesian model and simulating a sensor failure event results as expected in a dramatic increase in the epistemic component of the uncertainty.

  • 5 authors
·
Jul 1, 2021

Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain

The selection of maskers and playback gain levels in a soundscape augmentation system is crucial to its effectiveness in improving the overall acoustic comfort of a given environment. Traditionally, the selection of appropriate maskers and gain levels has been informed by expert opinion, which may not representative of the target population, or by listening tests, which can be time-consuming and labour-intensive. Furthermore, the resulting static choices of masker and gain are often inflexible to the dynamic nature of real-world soundscapes. In this work, we utilized a deep learning model to perform joint selection of the optimal masker and its gain level for a given soundscape. The proposed model was designed with highly modular building blocks, allowing for an optimized inference process that can quickly search through a large number of masker and gain combinations. In addition, we introduced the use of feature-domain soundscape augmentation conditioned on the digital gain level, eliminating the computationally expensive waveform-domain mixing process during inference time, as well as the tedious pre-calibration process required for new maskers. The proposed system was validated on a large-scale dataset of subjective responses to augmented soundscapes with more than 440 participants, ensuring the ability of the model to predict combined effect of the masker and its gain level on the perceptual pleasantness level.

  • 6 authors
·
Apr 29, 2022

ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North, Central and temperate Western Europe

Currently available tools for the automated acoustic recognition of European insects in natural soundscapes are limited in scope. Large and ecologically heterogeneous acoustic datasets are currently needed for these algorithms to cross-contextually recognize the subtle and complex acoustic signatures produced by each species, thus making the availability of such datasets a key requisite for their development. Here we present ECOSoundSet (European Cicadidae and Orthoptera Sound dataSet), a dataset containing 10,653 recordings of 200 orthopteran and 24 cicada species (217 and 26 respective taxa when including subspecies) present in North, Central, and temperate Western Europe (Andorra, Belgium, Denmark, mainland France and Corsica, Germany, Ireland, Luxembourg, Monaco, Netherlands, United Kingdom, Switzerland), collected partly through targeted fieldwork in South France and Catalonia and partly through contributions from various European entomologists. The dataset is composed of a combination of coarsely labeled recordings, for which we can only infer the presence, at some point, of their target species (weak labeling), and finely annotated recordings, for which we know the specific time and frequency range of each insect sound present in the recording (strong labeling). We also provide a train/validation/test split of the strongly labeled recordings, with respective approximate proportions of 0.8, 0.1 and 0.1, in order to facilitate their incorporation in the training and evaluation of deep learning algorithms. This dataset could serve as a meaningful complement to recordings already available online for the training of deep learning algorithms for the acoustic classification of orthopterans and cicadas in North, Central, and temperate Western Europe.

  • 26 authors
·
Apr 29