SlimOrca 7B: Trained on 1GB dataset

justynasty@lemmy.kya.moe · 2 years ago

You can download 7B, 13B Q_8 models for such gpu. 30B Q_2 models would probably run out of memory.

This shows that larger models have lower perplexity (i.e. more coherent). You can run conversational models, but not those with infinite knowledge base.

Most of the paid services that provide open-source models use 13B models (for $15 per month); you can run those for free on your card.

Someone else needs to recommend a tool to run models locally.

justynasty@lemmy.kya.moe · 2 years ago

I meant smaller models profit more from the stable perplexity in a long prompt with the recently released code changes. Because the paper(s) mention that some of these changes do not require further fine-tuning, we can use small models in a text that is longer than their context size.

justynasty@lemmy.kya.moe · edit-2 2 years ago

TL;DR You’re right, pytorch and transformers need more memory.

I will respond to the CPU inference first, for the transformers library.

Do you use it on CPU? If yes, do you use quantization? Which one? I’d like to try myself. But I don’t wan’t to try all the possibilities to find out which one works.

In transformers, I don’t use quant. :L If you’re used to the Q_4 speed, then it will be slower than that. For a 7B it’s almost okay on CPU.

And I seriously doubt it’s getting as fast now that I’m squeezing 15GB of data through the memory bus instead of like 4GB.

Yeah, it seems like you use low quant downloads. D: It’s not for you.

But you were on the right track with that 15GB download because you downloaded the raw release. Since not the GPTQ, nor AWS what we use in transformers (for new releases). ^^

With llama.cpp or KoboldCpp in my case, it’s just git pull; make clean make LLAMA_OPENBLAS=1 LLAMA_CLBLAST=1 and you go.

That’s why I prefer c/rust code, it just works. It will always be faster whatever HF will release, with or without quant.

Regarding CPU inference and quantization, they are ahead…

Right, C/Rust code is more optimized.

If you’re only using HF Transformers for one of its well supported use cases, without a complex project around it and using the same hardware as everone else does, it’s easy.

Pytorch w/o Nvidia card is less common >:D That’s how I started.

Imho most github sources release buggy code, they do not set device(‘cpu’) for cpu users. Avoiding dependency hell is a must. I prefer a commented single file, not a complex python project that spits out “bitsandbytes” errors.

They can do nice things like reuse the KV cache and avoid recalculating the whole context…

So as HF, in their cough code cough. It is likely that the same code in C is also more readable.

The reason I mentioned transformers because this line takes care of new model releases with all the bugfixes, just as ***cpp projects do.

generator= pipeline('text-generation', model='NousResearch/Nous-Capybara-7B')
generator('Here is the prompt.')

We run out of context?! Fix that. Use rotary embedding.

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Nous-Capybara-7B",
    rope_scaling={"type": "dynamic", "factor": 2.0},
)
generator= pipeline('text-generation', model=model)
generator('Here is the prompt.')

Does it eat all your RAM? It does. It just works ™ with fineprint.

How to train with tools? Download another tool! With transformers:

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Nous-Capybara-7B",
    use_flash_attention_2=True,
)
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,  # context size is 512
)

trainer.train()

Does it eat all you RAM? Yup, it goes beyond 64GB.

peft_config = LoraConfig(r=16, task_type="CAUSAL_LM")
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config
)

Now it eats less RAM with Lora.

justynasty@lemmy.kya.moe · edit-2 2 years ago

I have less faith in llamacpp, new papers pop up every day, HF has more resource to keep up with everything.

Pytorch works well on CPU (without cuda), people would need a tutorial on how to set up venv (or smth better), use git clone, then assign the AutoModelForCausalLM to the downloaded model, then open a port on 5001. The code fits on the screen without having to scroll.

Any UI can connect to such a program. Character cards (from chub or other sources) are injected to the prompt from the frontend side. I used to download GGML files because they are lighter, and my download speed is from the past century.

justynasty@lemmy.kya.moe · 2 years ago

Quote from the newly released paper

Mistral 7B, we fine-tuned it on instruction datasets publicly available on the Hugging Face repository. No proprietary data or training tricks were utilized.

justynasty@lemmy.kya.moe · edit-2 2 years ago

SlimOrca 7B: Trained on 1GB dataset

justynasty@lemmy.kya.moe · 2 years ago

justynasty@lemmy.kya.moe · edit-2 2 years ago

deleted by creator

justynasty@lemmy.kya.moe · edit-2 2 years ago

German GGUF from other repo

I have only seen Mistral models speaking German and French. If I had some free time (I don’t have rn) I could compare it to M2M100.

justynasty@lemmy.kya.moe · edit-2 2 years ago

LLaMA2 7B PoSE YaRN 16k: LLMs via Positional Skip-wise Training

justynasty@lemmy.kya.moe · edit-2 2 years ago

This is from March 31, 2021

justynasty@lemmy.kya.moe · edit-2 2 years ago

This is the document from one of their links on the page, and the quotes are from there: Direct Preference Optimization: Your Language Model is Secretly a Reward Model.

The official doc is less boring. The doc refers to the paper again.

TRL supports the DPO Trainer for training language models from preference data, as described in the paper Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafailov et al., 2023.

<3

justynasty@lemmy.kya.moe · edit-2 2 years ago

Zephyr 7B: A model that people like, but it has biases too

justynasty@lemmy.kya.moe · 2 years ago

Self-attention struggles with long sequences, due to its quadratic dependency on the sequence length. One query would attend to all keys and values, leading to computational inefficiencies. Sparse attention alleviates this issue by restricting the query’s access to a subset of keys and values.

I think this reduces the training time (improve on quadratic time complexity), but the attention is still not spread out unevenly in an infinitely long text - the question that future models need to answer.

justynasty@lemmy.kya.moe · 2 years ago

This is a fix for a problem that shouldn’t have been there in the first place. One of the many architectural oversights in the Llama model (and its predecessors).

There isn’t much to see for an end-user. This change won’t mean much to those who use (koboldcpp’s) smart context, summarization or character cards.

Smaller models (7B down to 350M) can handle long conversations better, and they won’t produce garbage output without the truncation of text.

I am still waiting for the breakthrough in large models.

justynasty@lemmy.kya.moe · edit-2 2 years ago

Pay more attention: Recap of the last week

justynasty@lemmy.kya.moe · 2 years ago

LeoLM: German Foundation Language Model

justynasty@lemmy.kya.moe · 2 years ago

Speechless 7B: coding, reasoning and planning

justynasty@lemmy.kya.moe · 2 years ago

deleted by creator

justynasty@lemmy.kya.moe · edit-2 2 years ago

Shurale: an open-domain dialogue model for chit-chat conversations

justynasty@lemmy.kya.moe · edit-2 2 years ago

Not yet. Tabby has context search on its roadmap:

https://tabby.tabbyml.com/docs/roadmap

justynasty@lemmy.kya.moe · 2 years ago

deleted by creator

justynasty@lemmy.kya.moe · edit-2 2 years ago

Llama Nous-Capybara-7B, Mistral airoboros, Mistral Aria

justynasty@lemmy.kya.moe · 2 years ago

I also have a de-googled smartphone, with a firewall installed (without a jailbrake). My name doesn’t show up on Google. I use generic usernames, not unique ones. I don’t upload photographs of my relatives to the cloud, as services acquire fingerprint (hash) of their faces and extract metadata from the uploaded jpegs. …and I’m not hiding from anyone, I don’t like the unremovable (unforgettable) traces we leave here.

translates between arbitraty languages on the fly

That’s what Firefox has in its browser now. :D desktop version…

hallucinates less and gets adapters for specific tasks and multimodal capabilities

People will have less time to talk to other people because they’ll exchange pics with their favorite agent. xd

And that’s where I expect their gifts to stop. I will still have my chatbot / AI companion.

There are already services that charge for ML tasks. “You want a calendar notification from AI?” - pay more.

“You want to summarize your daily emails” - pay double, save more.

“You want to talk to your friend, who is asleep.” - talk to a virtual AI character, that looks and sounds like your friend. It even remembers your past conversations! /s

justynasty@lemmy.kya.moe · 2 years ago

I was concerned that a large dataset with low sentence similarity may take longer to train. I’m not sure if my idea that novels take less time to train than a Q&A dataset with detailed answers is true: generic roleplay vs encyclopedic knowledge.

Reading these datasets, I think these GPT3/4 conversations go into too much detail, and current (1-40B) language models cannot be trained in such detail. These conversations would be only useful for humans. But I might be wrong about training because I don’t have experience with 100B+ models, and how they scale down.

justynasty@lemmy.kya.moe · edit-2 2 years ago

OpenOrca and Dolphin seem to have the same purpose, with different flavors. There is already a Mistral fine tune for roleplay, nsfw :/ (for both Orca and Dolphin) People mix and upload releases faster than we can post news about them. ^^

It took 48 hours to train 10 epochs on 4x A100s.

Does anyone know why some releases only take 1 epoch to train and others take up to 10 epochs?

justynasty@lemmy.kya.moe · edit-2 2 years ago

He said let himself prepare first 🤣

Chat is a friendly man, age 21, who talks to the user, please roleplay.

USER: Hi there!

CHAT: Chat the friendly man Sure, I'd be happy to roleplay as Chat, the friendly man. Just give me a moment to prepare myself for our conversation. 😊

Hello! So, nice to meet you! My name is Chat, and I'm really excited to talk with you today. What's your name, and tell me a little bit about yourself?

justynasty@lemmy.kya.moe · edit-2 2 years ago

Rotary positional embeddings for NeoX, Falcon and LLama

justynasty@lemmy.kya.moe · edit-2 2 years ago

Mistral 7B uses a sliding window attention (SWA) mechanism (Child et al., Beltagy et al.), in which each layer attends to the previous 4,096 hidden states. The main improvement, and reason for which this was initially investigated, is a linear compute cost of O(sliding_window.seq_len). In practice, changes made to FlashAttention and xFormers yield a 2x speed improvement for sequence length of 16k with a window of 4k. Source: Mistral 7B news For longer prompts.

Talk about merging changes

justynasty@lemmy.kya.moe · edit-2 2 years ago

Mistral 7B OpenOrca released

justynasty@lemmy.kya.moe · 2 years ago

NexusRaven 13B released

justynasty@lemmy.kya.moe · 2 years ago

Sliding attention: Sliding window to speed-up inference

justynasty@lemmy.kya.moe · 2 years ago

Thoughts about making a multi-bot chat with fictional characters