I assume they all crib from the same training sets, but surely one of the billion dollar companies behind them can make their own?
This is due to the training sets. One of them is Common Crawl, which is disgusting. Chinese LLMs like DeepSeek R1 and Qwen 3 use a different set of training materials that is actually good, despite being censored too.
wdym ‘disgusting’? isn’t common crawl just popular websites (alexa ranking? idk) crawled and provided raw?
What’s common crawl?