in a somewhat similar vein (not really, though): has anyone over there experimented with taking a current encoder arch (i.e. ModernBERT), ripping out the self-attention blocks, replacing them with something like Mamba2/Griffin temporal-mixing layers, and then distilling the original model onto the result? it seems like that could be a lot less lossy than a straight static embedding layer while still beating self-attention on complexity
i was trying this earlier, but the first shape error in the forward pass made me give up
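
fwiw, here's a minimal self-contained sketch of the layer-swap + distillation idea in PyTorch. everything in it is made up for illustration (the `ToyEncoder`, the `GatedLinearRecurrence` stand-in for a real Mamba2/Griffin block, the `.layers[i].attn` attribute path) — for the real thing you'd index into the actual HF ModernBERT modules instead. the key point is that the replacement mixer keeps the same `(batch, seq, dim) -> (batch, seq, dim)` interface as the attention module it replaces, which is exactly what avoids the shape errors:

```python
import copy
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Plain self-attention with a (B, T, D) -> (B, T, D) interface."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        out, _ = self.mha(x, x, x, need_weights=False)
        return out

class GatedLinearRecurrence(nn.Module):
    """Toy stand-in for a Mamba2/Griffin temporal-mixing block: an
    elementwise gated linear recurrence with the same (B, T, D) interface
    as the attention module it replaces. A real encoder distillation would
    want a bidirectional scan; this toy only runs left-to-right."""
    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.log_decay = nn.Parameter(torch.zeros(d_model))  # per-channel decay

    def forward(self, x):
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        a = torch.sigmoid(self.log_decay)       # decay in (0, 1)
        h = x.new_zeros(x.size(0), x.size(2))   # (B, D) running state
        outs = []
        for t in range(x.size(1)):              # naive O(T) sequential scan
            h = a * h + (1 - a) * u[:, t]
            outs.append(h)
        y = torch.stack(outs, dim=1) * torch.sigmoid(gate)
        return self.out_proj(y)

class Layer(nn.Module):
    """Pre-norm encoder layer; `attn` is the swappable temporal mixer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = SelfAttention(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

class ToyEncoder(nn.Module):
    def __init__(self, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([Layer(d_model) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# build the student by copying the teacher, then swap only the mixing layers
teacher = ToyEncoder().eval()
student = copy.deepcopy(teacher)
for layer in student.layers:
    layer.attn = GatedLinearRecurrence(64)

# one distillation step: match the teacher's final hidden states
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
x = torch.randn(8, 32, 64)                      # stand-in input embeddings
with torch.no_grad():
    target = teacher(x)
loss = nn.functional.mse_loss(student(x), target)
loss.backward()
opt.step()
```

the deepcopy keeps the teacher's MLPs and norms in the student, so only the new mixing layers start from scratch — which is presumably why this should distill a lot more faithfully than collapsing the whole model into static embeddings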