Thursday, January 18, 2024

Getting Mixtral to run on your local machine.

This post is about getting a Mixtral 8x7B model running on your local machine. There are probably a lot of different ways of doing this, but this is what worked for me. Right now this is the smartest, fastest, most state-of-the-art open-source model you can run locally.

This is the GitHub branch for getting the Mixtral model running with llama.cpp:

https://github.com/ggerganov/llama.cpp/tree/mixtral

Follow the directions at the bottom of that page to build and install it. There are ways to compile llama.cpp to take advantage of different GPUs and CPUs, but the -Ofast option is almost a requirement. Read the docs to figure out which build flags fit your hardware; right now there are just too many options for me to cover anything beyond the most standard CPU-only build with no speedups, sketched below.
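Here is a rough sketch of that plain CPU-only route as a small Python script (you could just as easily type the equivalent git and make commands in a terminal). It assumes git and make are installed and on your PATH; the LLAMA_FAST=1 setting is what turned on -Ofast in the Makefile when I did this, so check the README if your version complains about it.

```python
# Sketch: clone the mixtral branch of llama.cpp and do a plain CPU-only build.
# Assumes git and make are installed and on your PATH.
import subprocess

# Grab the mixtral branch linked above.
subprocess.run(
    ["git", "clone", "--branch", "mixtral",
     "https://github.com/ggerganov/llama.cpp.git"],
    check=True,
)

# Standard CPU-only build. LLAMA_FAST=1 was the Makefile switch for -Ofast
# when I did this; drop it (or check the README) if your version doesn't have it.
subprocess.run(["make", "-j", "LLAMA_FAST=1"], cwd="llama.cpp", check=True)
```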

Then go to Hugging Face, find a Mixtral GGUF file, and put it in the llama.cpp/models folder.

https://huggingface.co/models?search=mixtral%20gguf

I have been using this:

TheBloke/dolphin-2.7-mixtral-8x7b-GGUF

It is uncensored, so it doesn't complain if you want to write a PG-13 story or have a grown-up discussion. This seems to make the model feel more human and more natural to talk to.

The GGUF format stores the model weights and biases in fewer bits; the fewer bits you use, the less memory the model needs, but at the cost of it not performing as well. I recommend going with the medium 5-bit version:

https://huggingface.co/TheBloke/dolphin-2.7-mixtral-8x7b-GGUF/resolve/main/dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf?download=true
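If you would rather script the download than click the link, here is a sketch using the huggingface_hub package (pip install huggingface_hub). The file is a little over 30 GB, so make sure you have the disk space, and the llama.cpp/models path assumes you cloned llama.cpp into the current directory.

```python
# Sketch: download the Q5_K_M GGUF straight into llama.cpp's models folder.
# Requires: pip install huggingface_hub   (the file is a little over 30 GB)
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/dolphin-2.7-mixtral-8x7b-GGUF",
    filename="dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf",
    local_dir="llama.cpp/models",  # adjust if you cloned llama.cpp elsewhere
)
print("model saved to", path)
```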

Go ahead and run this model using llama.cpp as described on the llama.cpp GitHub page, and you should be chatting with something that is about as smart as a bright five-year-old, but with an encyclopedic knowledge of a ton of things.
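For a first test, here is a sketch that calls llama.cpp's main binary with a one-shot prompt. The flag names are from the version I used, so run ./main --help to confirm yours; for a back-and-forth chat, use the interactive options described in the llama.cpp README instead.

```python
# Sketch: run a one-shot prompt through llama.cpp's main binary.
# Assumes llama.cpp was built in ./llama.cpp and the model was downloaded as above.
import subprocess

subprocess.run(
    [
        "./main",
        "-m", "models/dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf",
        "-p", "Explain why the sky is blue in two sentences.",
        "-n", "256",  # maximum number of tokens to generate
    ],
    cwd="llama.cpp",
    check=True,
)
```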


