Time to go Big/SOTA!!!

#13
by Daemontatox

Time for Hugging Face to go big and compete with other SOTA models. The quality and performance of the Smol family have been outstanding; time to try and compete with the big models.

I partially agree. I just hope someone steps up to pay the costs for that. I only say partially, because I think a novel architecture paradigm shift is around the corner (e.g. not a multi-head-attention transformer, but something that scales in linear or substantially sub-quadratic time, ...), and it may be better to put the big chips on the table when we get there.

I think part of the point of staying small is that those who don't have a lot of resources can still learn from these models. Going big cuts many of them out.

@ZiggyS I can definitely relate to resource constraints being a bottleneck.

Nonetheless, it would be great, though overly optimistic, to have a source of full-scale SOTA models that are truly open source and platform independent. I know the EU is working on a project to that effect, but it would be good to also have one commercial, non-government-affiliated source that is fully transparent and free of international conflicts of interest.

As I mentioned, though, it would be a big risk to develop one now: everything is moving so fast that whatever you build could be fundamentally obsolete in a week. It is inevitable that someone somewhere will publish more effective solutions (actual published models and code, not just theoretical papers) to problems like these any day now:

  • A more robust capability for a model to continuously re-train / fine-tune itself at inference time, replicating its user's behavior with granularity and learning from past user corrections and clarifications, especially one that can update itself on a user-specific basis with controlled multi-tenancy
  • A generative model that scales in linear or substantially sub-quadratic time. A proof of concept in text classification that scales in O(n) to O(n log n) time with increasing sequence length already exists (see the sketch after this list)
  • A generative model architecture with a truly unlimited context window and no performance degradation as the context grows
  • ...
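
For intuition on the sub-quadratic point above, here is a minimal NumPy sketch of kernelized linear attention (in the spirit of the linear-transformer line of work). The feature map `phi` and the shapes are illustrative assumptions, not any particular model's implementation:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention sketch: O(n * d^2) instead of the O(n^2 * d)
    cost of standard softmax attention. phi is an assumed positive
    feature map so the normalizer stays > 0."""
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # (d, d) summary -- size independent of n
    Z = Qp @ Kp.sum(axis=0)          # per-query normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]    # (n, d)

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (1024, 64)
```

The trick is associativity: computing `phi(K).T @ V` first yields a fixed-size (d, d) summary, so the cost grows linearly with sequence length rather than quadratically, at the price of approximating the softmax weighting.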
