Office space meme:

“If y’all could stop calling an LLM ‘open source’ just because they published the weights… that would be great.”

  • Prunebutt@slrpnk.netOP · 2 days ago

    > You could train it yourself too.

    How, without information on the dataset and the training code?

    • Pennomi@lemmy.world · 2 days ago

      Training code created by the community always pops up shortly after release; it has happened for every major model so far. Additionally, you have never needed the original training dataset to continue training a model.
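
      For what it’s worth, here is roughly what “continue training from the published weights” looks like in practice with the Hugging Face transformers library. This is a minimal sketch, not anyone’s actual recipe: example/released-llm and my_corpus.txt are placeholders, and the point is only that the continued-training data is your own; the original dataset never enters the picture.

      ```python
      from datasets import load_dataset
      from transformers import (
          AutoModelForCausalLM,
          AutoTokenizer,
          DataCollatorForLanguageModeling,
          Trainer,
          TrainingArguments,
      )

      # Published weights + tokenizer are all you need to resume training.
      tokenizer = AutoTokenizer.from_pretrained("example/released-llm")  # placeholder id
      model = AutoModelForCausalLM.from_pretrained("example/released-llm")
      if tokenizer.pad_token is None:  # some checkpoints ship without a pad token
          tokenizer.pad_token = tokenizer.eos_token

      # Any text corpus you control works; the original training set is never involved.
      dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
      dataset = dataset.map(
          lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
          batched=True,
          remove_columns=["text"],
      )

      trainer = Trainer(
          model=model,
          args=TrainingArguments(output_dir="continued-training", num_train_epochs=1),
          train_dataset=dataset,
          # mlm=False means plain next-token prediction, i.e. continued pretraining.
          data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
      )
      trainer.train()
      ```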

      • Prunebutt@slrpnk.netOP · 2 days ago

        So, Ocarina of Time is considered open source now, since it’s been decompiled by the community, or what?

        Community effort and the ability to build on top of stuff don’t make anything open source.

        Also: initial training data is important.

    • WraithGear@lemmy.world · edited · 2 days ago

      So I am learning as much as I can here, so bear with me. It accepts tokenized data and structures it via a transformer; the tokenizer setup is a JSON file or some such, while the weights are a separate binary file that’s used to, well, modify the tokenized data to generate outcomes. As long as you use a compatible tokenization structure and weights structure, you could create a new training set. But that can be done with any LLM. You can’t pull the training data back out of it, just as you can’t recover the wheat by dissecting the bread.

      But they provide the tools to supply your own data, and the way this LLM handles that data is novel, because it was hamstrung by US sanctions. “Necessity is the mother of invention” and all that. Running comparable AIs on inferior hardware and a much smaller budget is what makes this one stand out, not the training data.
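
      To make that separation concrete, here is a hedged sketch using the Hugging Face transformers API, with the same placeholder repo id as above: the tokenizer ships as JSON/config files, the weights as a separate binary checkpoint, and the two only have to be compatible with each other.

      ```python
      from transformers import AutoModelForCausalLM, AutoTokenizer

      # Tokenizer: vocabulary + rules, typically shipped as tokenizer.json and friends.
      tokenizer = AutoTokenizer.from_pretrained("example/released-llm")  # placeholder id
      inputs = tokenizer("bread, but no wheat", return_tensors="pt")

      # Weights: a separate binary checkpoint (e.g. model.safetensors).
      model = AutoModelForCausalLM.from_pretrained("example/released-llm")

      # Inference pushes the token ids through the weights; nothing in either
      # artifact reveals what text the model was originally trained on.
      output = model.generate(**inputs, max_new_tokens=20)
      print(tokenizer.decode(output[0], skip_special_tokens=True))
      ```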