Our models extend Mistral’s capabilities into German through continued pretraining on a large corpus of German-language, mostly locality-specific text.

Other fine-tuned models for foreign languages:

Florian’s German model

Vigo: French model

  • justynasty@lemmy.kya.moe (OP):

    TL;DR You’re right, PyTorch and transformers need more memory.

    I’ll respond to the CPU inference question first, for the transformers library.

    Do you use it on CPU? If yes, do you use quantization? Which one? I’d like to try myself. But I don’t want to try all the possibilities to find out which one works.

    In transformers, I don’t use quant. :L If you’re used to the Q_4 speed, then it will be slower than that. For a 7B it’s almost okay on CPU.

    And I seriously doubt it’s getting as fast now that I’m squeezing 15GB of data through the memory bus instead of like 4GB.

    Yeah, it seems like you use low-bit quant downloads. D: Then it’s not for you.

    But you were on the right track with that 15GB download, because that’s the raw release, not the GPTQ or AWQ quants, and the raw weights are what we use in transformers (for new releases). ^^

    With llama.cpp, or KoboldCpp in my case, it’s just git pull; make clean; make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 and you go.

    That’s why I prefer C/Rust code; it just works. It will always be faster than whatever HF releases, with or without quantization.

    Regarding CPU inference and quantization, they are ahead…

    Right, C/Rust code is more optimized.

    If you’re only using HF Transformers for one of its well-supported use cases, without a complex project around it, and on the same hardware as everyone else, it’s easy.

    PyTorch without an Nvidia card is less common. >:D That’s how I started.

    Imho most GitHub sources release buggy code; they do not set device('cpu') for CPU users. Avoiding dependency hell is a must. I prefer a commented single file, not a complex Python project that spits out “bitsandbytes” errors.
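
    A minimal sketch of that fix, assuming the usual torch/transformers APIs and reusing the model name from the examples below, just to show the explicit CPU placement:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Put everything on the CPU explicitly instead of assuming CUDA exists.
    device = torch.device("cpu")
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Capybara-7B").to(device)

    inputs = tokenizer("Here is the prompt.", return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))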

    They can do nice things like reuse the KV cache and avoid recalculating the whole context…

    So does HF, in their *cough* code *cough*. It is likely that the same code in C is also more readable.
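
    For what it’s worth, a rough sketch of reusing the KV cache with plain transformers (assuming the standard past_key_values/use_cache forward arguments; greedy decoding just to keep it short):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained("NousResearch/Nous-Capybara-7B")
    model.eval()

    ids = tokenizer("Here is the prompt.", return_tensors="pt").input_ids
    past = None
    with torch.no_grad():
        for _ in range(16):
            # Once the cache holds the context, feed only the newest token.
            step_input = ids if past is None else ids[:, -1:]
            out = model(step_input, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
            ids = torch.cat([ids, next_id], dim=-1)
    print(tokenizer.decode(ids[0], skip_special_tokens=True))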

    The reason I mentioned transformers is that these two lines take care of new model releases, with all the bugfixes, just as the *.cpp projects do:

    from transformers import pipeline
    # Pulls the model and tokenizer from the Hub, then generates.
    generator = pipeline('text-generation', model='NousResearch/Nous-Capybara-7B')
    generator('Here is the prompt.')
    

    We run out of context?! Fix that with rope scaling (dynamic NTK scaling of the rotary embeddings).

    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    # Dynamic NTK rope scaling stretches the usable context window.
    tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Capybara-7B")
    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Nous-Capybara-7B",
        rope_scaling={"type": "dynamic", "factor": 2.0},
    )
    generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
    generator('Here is the prompt.')
    

    Does it eat all your RAM? It does. It just works™, with fine print.
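
    If the fine print bothers you, one hedged way to trim the footprint (assuming your torch build handles bfloat16 on CPU; otherwise stick with the float32 default):

    import torch
    from transformers import AutoModelForCausalLM

    # Half the memory of float32 weights; low_cpu_mem_usage (needs accelerate)
    # avoids holding a second full copy of the weights while loading.
    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Nous-Capybara-7B",
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
    )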

    How to train it? Download another tool! With transformers plus TRL’s SFTTrainer:

    from transformers import AutoModelForCausalLM
    from trl import SFTTrainer

    model = AutoModelForCausalLM.from_pretrained(
        "NousResearch/Nous-Capybara-7B",
        use_flash_attention_2=True,  # needs the flash-attn package installed
    )
    # `dataset` is any datasets.Dataset with a plain-text "text" column.
    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=512,  # context size is 512
    )

    trainer.train()
    

    Does it eat all your RAM? Yup, it goes beyond 64GB.

    from peft import LoraConfig

    # LoRA: train small low-rank adapters instead of all 7B weights.
    peft_config = LoraConfig(r=16, task_type="CAUSAL_LM")
    trainer = SFTTrainer(
        model,
        train_dataset=dataset,
        dataset_text_field="text",
        peft_config=peft_config,
    )
    

    Now it eats less RAM, with LoRA.
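
    If you want to see where the savings come from, a small sketch with peft directly (get_peft_model and print_trainable_parameters are peft helpers; this reuses the model loaded above):

    from peft import LoraConfig, get_peft_model

    peft_config = LoraConfig(r=16, task_type="CAUSAL_LM")
    peft_model = get_peft_model(model, peft_config)
    # Only the small low-rank adapter matrices get gradients and optimizer
    # state; the base weights stay frozen.
    peft_model.print_trainable_parameters()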