Many businesses hear "GenAI" and immediately think of ChatGPT, but the world of open source models offers even more opportunity, with greater flexibility and lower cost. As the name implies, open source models are machine learning models whose inner workings are openly shared. This means that anyone can access details like the model's structure, its weights and biases, and other key parameters.
The release of Meta's LLaMA and LLaMA 2 models significantly boosted the popularity of open-source models. Since then, hundreds of different models have been made available to the public. Hosting these models is challenging and time-consuming without the right knowledge or tools. This is what we set out to solve when building Verta's GenAI Workbench.
In this post, we'll explore how we at Verta host open source LLMs, providing insights into our approach and methodology.
Hosting a large language model (LLM) means operating an instance of the model using your own resources and making it accessible to your users.
Like all computer programs, LLMs need CPUs and RAM to operate. The required amount of each resource varies depending on the specific models being hosted. Additionally, most LLMs (but not all) need a GPU or another type of accelerator to achieve satisfactory performance.
To make the models accessible to others, some form of network server architecture is needed. This can be as simple as an HTTP server with a REST interface.
LLMs can now be run in a variety of environments, from small devices like a Raspberry Pi, to your laptop, to GPU clusters in the cloud.
At Verta, we utilize llama.cpp, an open source project that enables LLMs to operate across a range of environments.
llama.cpp is an open-source project created by Georgi Gerganov from Sofia, Bulgaria. It evolved from Georgi's earlier project, whisper.cpp, which is an open-source implementation of the Whisper speech-to-text model.
Llama.cpp aims to bring model inference to less powerful, commonly available hardware, as stated in its "manifesto." Initially designed for CPU-only inference, it has expanded to support various GPU architectures, including CUDA, Apple's Metal Performance Shaders, and hybrid CPU+GPU inference. It offers advanced features like quantization to lower memory requirements and boasts high portability across macOS, Linux, Windows, BSD, and containerized environments like Docker.
Distributed under the MIT license, llama.cpp now enjoys contributions from nearly 600 developers.
The easiest way to get started with llama.cpp is to clone it with `git clone` and build it from source (the llama.cpp `README.md` has extensive details on how to build the project).
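A minimal sketch of the default CPU build; depending on your llama.cpp version and platform you may build with CMake or enable GPU backends instead, as described in the README:

```bash
# Clone the repository and build with the default (CPU) configuration
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```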
Llama.cpp provides various interaction methods, including command-line arguments, an interactive loop, and its own HTTP server implementation.
A key feature of the llama.cpp server is its compatibility with OpenAI client libraries, enabling a wide range of applications to run against locally hosted LLMs with minimal changes. This interoperability was a significant factor in our decision to use a llama.cpp-based server at Verta. It allows us to maintain a single application codebase that can integrate with both OpenAI models like ChatGPT and locally hosted models, offering a streamlined and efficient solution.
Another option is the llama-cpp-python bindings library, which brings most of llama.cpp's functionality into Python for easy integration into existing Python codebases. We chose the llama-cpp-python bindings because they fit seamlessly into our Python environment, allowing us to reuse our existing libraries and tooling.
The library can be used by adding `llama-cpp-python` to your project's requirements.txt or Poetry file, or by installing it directly with pip:
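For instance, a plain CPU build can be installed directly (GPU-accelerated builds require extra build flags described in the llama-cpp-python documentation):

```bash
pip install llama-cpp-python
```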
When using llama.cpp to run a model, a few key factors must be considered, the most important being the model format: llama.cpp works with models in its GGUF file format. One way to get your model data into llama.cpp is to download a model that has already been published as a GGUF file (more on that below).
The other way to get your model data into llama.cpp is by running the conversion script that ships with the project.
The quantization level or half-precision output can be controlled with the `--outtype` argument:
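A sketch of what this might look like; the script name and paths are illustrative, and depending on your llama.cpp version the script may be `convert.py` or `convert_hf_to_gguf.py`:

```bash
# Convert a Hugging Face-format model directory to GGUF at 16-bit precision
python convert.py ./models/my-model --outtype f16
```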
Conversion is a CPU-intensive process. If your machine has many cores, you can speed it up by setting the `--concurrency` parameter to the number of cores you have.
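For example, on a 16-core machine (same hypothetical paths as above):

```bash
python convert.py ./models/my-model --outtype f16 --concurrency 16
```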
Many developers and organizations have begun to publish models in GGUF as a first-class file type, which removes the conversion step completely.
You can also often find GGUF versions of models on HuggingFace, from developers such as Tom Jobbins (TheBloke) who run conversions and upload the results.
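If a GGUF build is available, downloading a single file can be done with the Hugging Face CLI. The repository id and file name below are placeholders for the model you actually want:

```bash
# Both the repo id and the file name here are placeholders
huggingface-cli download TheBloke/SomeModel-GGUF somemodel.Q4_K_M.gguf --local-dir ./models
```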
Before running the model in server mode, it's helpful to interact with it directly and ensure it is working as expected.
Below is an example of running llama.cpp with the Nous Hermes 2 - Mixtral 8x7B - SFT model, which has been quantized down to 2 bits per parameter.
Here's a breakdown of the command-line arguments, shown as comments in the sketch below:
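A minimal sketch of such an invocation; the model file name is a placeholder, and exact flags vary with the llama.cpp version (older builds name the binary `main`, newer ones `llama-cli`):

```bash
# -m       path to the quantized GGUF model file (placeholder name)
# -ngl     number of layers to offload to the GPU (omit for CPU-only inference)
# -c       context window size in tokens
# -n       maximum number of tokens to generate per response
# --color  colorize output to distinguish prompt text from generated text
# -ins     instruct mode with an interactive prompt loop
./main -m ./models/nous-hermes-2-mixtral-8x7b-sft.Q2_K.gguf \
  -ngl 99 -c 4096 -n 256 --color -ins
```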
You will see various debug messages as the model is loaded.
Finally, you will be presented with an interactive prompt.
At this point, you can have a conversation with your model!
Running the model in server mode
The llama.cpp server command is similar to the interactive command:
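A sketch with placeholder paths; in older builds the binary is named `server`, in newer ones `llama-server`:

```bash
./server -m ./models/nous-hermes-2-mixtral-8x7b-sft.Q2_K.gguf \
  -ngl 99 -c 4096 --host 0.0.0.0 --port 41430
```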
This starts an OpenAI-compatible server listening on port 41430. You can now point the OpenAI client library at it by setting the library's API base URL to that host and port.
For example, using the Python OpenAI library, there are several ways that this URL can be defined:
```python
import openai

openai.base_url = "http://192.168.1.2:41430/v1"  # where the IP is the external IP of the llama.cpp server
openai.api_key = "any"  # the server defaults to allowing any key; this can be changed with server arguments
```
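Once the base URL points at the llama.cpp server, the rest of the code is unchanged. A minimal sketch using the client interface of openai>=1.0 (the IP, port, and model name are placeholders; the server runs whichever model it was started with):

```python
from openai import OpenAI

# Point the client at the llama.cpp server instead of api.openai.com
client = OpenAI(base_url="http://192.168.1.2:41430/v1", api_key="any")

response = client.chat.completions.create(
    model="local-model",  # placeholder; llama.cpp serves the model it was launched with
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```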
When using a GPU with substantial VRAM, or when hosting multiple small models, it's feasible to co-host several models on a single GPU. The primary limitation is the total available GPU VRAM: you can operate multiple instances of the llama.cpp server against a single graphics card until its memory is fully allocated. In CUDA environments, GPU memory usage can be monitored with `nvidia-smi`. Alternatively, you can enable Prometheus metrics on the server with the `--metrics` flag for memory usage tracking.
For instance, each model can be hosted on a server that listens on a distinct TCP port. A reverse proxy, such as nginx, can be configured to sit in front of this group of servers, effectively combining them into a single virtual server.
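For example, two hypothetical servers sharing one GPU (model files and ports are placeholders):

```bash
# Each server loads its own model and listens on its own port
./server -m ./models/hermes.Q2_K.gguf    -ngl 99 --host 127.0.0.1 --port 41430 &
./server -m ./models/mistral.Q4_K_M.gguf -ngl 99 --host 127.0.0.1 --port 41431 &
```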
An nginx configuration can use `proxy_pass` to route requests to different models based on the request path:
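A sketch of such a configuration, matching the hypothetical pair of servers above (ports and path prefixes are placeholders):

```nginx
server {
    listen 8080;

    # Requests under /hermes/ go to the server on port 41430
    location /hermes/ {
        proxy_pass http://127.0.0.1:41430/;
    }

    # Requests under /mistral/ go to the server on port 41431
    location /mistral/ {
        proxy_pass http://127.0.0.1:41431/;
    }
}
```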
Llama.cpp is a versatile tool that can be utilized at every stage of LLM deployment. It's great for early-stage testing in interactive mode and can also power your Python apps that rely on OpenAI libraries.
Its ability to run almost anywhere greatly lowers the barrier to entry for anyone looking to get started with LLMs beyond just ChatGPT. If you want to see this work in action, try out our GenAI Workbench at app.verta.ai.