ortal1602 committed
Commit 926529d · verified · 1 Parent(s): d323fd3

Update index.html

Files changed (1)
  1. index.html +13 -4
index.html CHANGED
@@ -32,25 +32,29 @@
  <!-- Hero Section -->
  <div class="container text-center">
  <img src="figures/ARvsFM.png" alt="AR vs FM" style="width: 100%; border-radius: 20px; box-shadow: 0 4px 16px rgba(0,0,0,0.2); margin-bottom: 20px;">
- <h1>AR vs FM: A Comparative Study on Audio Modeling Paradigms</h1>
+ <h1>Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation</h1>
  <p>
  <a href="https://scholar.google.com/citations?user=QK3_J9IAAAAJ" target="_blank">Or Tal</a> ·
  <a href="https://scholar.google.com/citations?user=UiERcYsAAAAJ" target="_blank">Felix Kreuk</a> ·
  <a href="https://scholar.google.com/citations?user=ryMtc7sAAAAJ" target="_blank">Yossi Adi</a>
  </p>
+ <br>
+ <p>
+ <a href="https://arxiv.org/abs/2506.08570" target="_blank">Full Paper</a>
+ </p>
  </div>

  <!-- Abstract Section -->
- <div class="container">
+ <div class="container text-center">
  <h2>Abstract</h2>
  <p>
- We compare two major paradigms for text-to-music generation—Auto-Regressive (AR) and Flow-Matching (FM)—under tightly controlled settings. All models are trained from scratch with the same dataset, representations, and backbone architecture. We evaluate on fidelity, control adherence, inpainting, inference efficiency, and robustness to training scale. Our results reveal clear trade-offs: AR achieves better fidelity and control accuracy, while FM enables faster inference and smoother inpainting. This study helps guide future decisions in music generation research and development.
+ Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments, full compositions, and even respond to fine-grained control signals, e.g., chord progressions. State-of-the-art (SOTA) systems differ significantly across many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and pinpoint which design choices most influence performance. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare the two arguably most common modeling paradigms: Auto-Regressive decoding and Conditional Flow-Matching. We conduct a controlled comparison by training all models from scratch using identical datasets, training configurations, and similar backbone architectures. Performance is evaluated across multiple axes, including generation quality, robustness to inference configurations, scalability, adherence to both textual and temporally aligned conditioning, and editing capabilities in the form of audio inpainting. This comparative study sheds light on distinct strengths and limitations of each paradigm, providing actionable insights that can inform future architectural and training decisions in the evolving landscape of text-to-music generation.
  </p>
  </div>

  <!-- Unified Paper Highlights Section -->
  <div class="container">
- <h2>Paper Highlights</h2>
+ <h2>Key Take-aways</h2>
  <div id="highlight-box" style="padding: 20px; border: 1px solid #ccc; border-radius: 12px; background: #fefefe; box-shadow: 0 2px 8px rgba(0,0,0,0.05);">
  <div style="text-align: center;">
  <img id="highlight-image" src="" alt="Highlight figure"
@@ -113,6 +117,11 @@
  // Initialize
  showHighlight(highlightIndex);
  </script>
+
+ <div class="container">
+ <h1>Bibtex</h1>
+ <p>TODO</p>
+ </div>

  <!-- Section 1 -->
  <div class="container">