This release is trained on a curated, filtered subset of most of our GPT-4-augmented data.
HF Leaderboard evals place this model at #2 among all models smaller than 30B at release time, outperforming all but one 13B model.
GGUF files:
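If you want to try a GGUF quant locally, here's a minimal sketch using llama-cpp-python. The file name, context size, and GPU offload settings below are illustrative assumptions, not taken from the repo, and the ChatML prompt format is what OpenOrca tunes typically use.

```python
# Rough sketch of running a GGUF quant with llama-cpp-python
# (pip install llama-cpp-python). The path below is hypothetical;
# point it at whichever quant you actually downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-openorca.Q4_K_M.gguf",  # hypothetical local path
    n_ctx=4096,       # context window to allocate
    n_gpu_layers=-1,  # offload all layers to GPU if available, otherwise use 0
)

out = llm(
    "<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\n",
    max_tokens=128,
    stop=["<|im_end|>"],
)
print(out["choices"][0]["text"])
```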
Warning (if I’m not mistaken):
Llama.cpp hasn't assigned a high-priority tag to sliding window attention, and Axolotl replaced Mistral's attention block with a "simple" flash attention implementation.
In my opinion, that means these new releases don't capitalize on the speedup claimed by the Mistral developers.
We can't expect the new versions to be faster than Llama, because without sliding window attention there's nothing extra to speed up inference.
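For anyone unfamiliar with what's being skipped: here's a rough, purely illustrative PyTorch sketch of how a sliding-window causal mask differs from a full causal mask. This isn't llama.cpp's or Axolotl's actual code; the window size just mirrors Mistral 7B's published config.

```python
# Illustrative only: contrast a full causal mask with a sliding-window one.
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Standard causal mask: token i attends to all tokens j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def sliding_window_mask(seq_len: int, window: int = 4096) -> torch.Tensor:
    # Sliding-window causal mask: token i only attends to the last `window`
    # tokens, i.e. j in [i - window + 1, i]. Per-token attention cost becomes
    # O(window) instead of O(seq_len), which is where the claimed speedup
    # (and the smaller KV cache) comes from.
    full = causal_mask(seq_len)
    idx = torch.arange(seq_len)
    too_old = idx.unsqueeze(0) < (idx.unsqueeze(1) - window + 1)
    return full & ~too_old

# Small example so the band structure is visible.
print(sliding_window_mask(seq_len=8, window=4).int())
```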
I LOVE Orca tunes; they almost always end up feeling like smarter versions of the base model, so I'm looking forward to trying this one out when the GPTQ is finished.
GPTQ/AWQ links:
https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GPTQ
https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ
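In case it helps, here's a minimal sketch of loading the GPTQ quant with transformers. It isn't copied from the model card, just the usual pattern for TheBloke's GPTQ repos, and it assumes transformers, optimum, and auto-gptq are installed along with a CUDA GPU.

```python
# Sketch of loading the GPTQ quant via transformers
# (pip install transformers optimum auto-gptq).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-OpenOrca-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# OpenOrca tunes generally use ChatML-style prompts; adjust if the card says otherwise.
prompt = "<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```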
Does sliding window attention actually speed up inference? I thought it was more about extending the model's capabilities to contexts longer than what it was trained on. I suppose I could see it being used to drop old context, which would save on memory/inference, but I didn't think that was the point of it, just a happy side effect. I could be wrong though.
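To illustrate the memory side of that trade-off, here's a toy sketch of the rolling-buffer KV cache idea described in the Mistral paper. The class and names are made up for illustration and aren't any library's actual API.

```python
# Toy rolling KV cache: with a fixed attention window you only ever need
# the last `window` keys/values, so cache memory stays constant no matter
# how long generation runs. Purely illustrative.
from collections import deque

class RollingKVCache:
    def __init__(self, window: int = 4096):  # 4096 matches Mistral 7B's window
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        # Once the deque is full, the oldest entry is dropped automatically,
        # so memory use is O(window) instead of O(total tokens generated).
        self.keys.append(k)
        self.values.append(v)

cache = RollingKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(list(cache.keys))  # only the last 4 keys are retained
```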
There's talk about merging the changes.
Ah, good point. Definitely looking forward to it being implemented then.