Office space meme:
“If y’all could stop calling an LLM “open source” just because they published the weights… that would be great.”
Office space meme:
“If y’all could stop calling an LLM “open source” just because they published the weights… that would be great.”
They could disclose how they sourced the training data, what the training data is and how you could source it. Also, did they publish their hyperparameters?
They could jpst not call it Open Source, if you can’t open source it.
For neural nets the method matters more. Data would be useful, but at the amount these things get trained on the specific data matters little.
They can be trained on anything, and a diverse enough data set would end up making it function more or less the same as a different but equally diverse set. Assuming publicly available data is in the set, there would also be overlap.
The training data is also by necessity going to be orders of magnitude larger than the model itself. Sharing becomes impractical at a certain point before you even factor in other issues.
That… Doesn’t align with years of research. Data is king. As someone who specifically studies long tail distributions and few-shot learning (before succumbing to long COVID, sorry if my response is a bit scattered), throwing more data at a problem always improves it more than the method. And the method can be simplified only with more data. Outside of some neat tricks that modern deep learning has decided is hogwash and “classical” at least, but most of those don’t scale enough for what is being looked at.
Also, datasets inherently impose bias upon networks, and it’s easier to create adversarial examples that fool two networks trained on the same data than the same network twice freshly trained on different data.
Sharing metadata and acquisition methods is important and should be the gold standard. Sharing network methods is also important, but that’s kind of the silver standard just because most modern state of the art models differ so minutely from each other in performance nowadays.
Open source as a term should require both. This was the standard in the academic community before tech bros started running their mouths, and should be the standard once they leave us alone.