Local AI: With LM Studio

Introduction

What is a local LLM? When people talk about AI they are often referring to chatbots such as Claude, ChatGPT or Gemini. These chatbots are user interfaces for one or more Large Language Models: huge, advanced, highly trained AI models that sit in the cloud and serve thousands of people or more at a time.

A local LLM is a model that sits on your own computer or within your local network. These models are significantly smaller, and therefore slower and less capable, than their cloud-hosted cousins. They do, however, have a few advantages that the larger models lack:

  • They are private and secure. They don’t generally send or receive information via the Internet.
  • They cost nothing to run, except for the electricity to power them.
  • They are yours and can be configured any way you want.

There are several different ways to run these local LLMs and several hardware platforms you can use. Software options include Ollama, vLLM and LM Studio. The hardware is usually either a fast Linux or Windows PC with a powerful GPU with at least 16GB of VRAM, or an Apple Silicon Mac. My preferred platform, and the one I bought for just this purpose, is an M4 Max Mac Studio with 64GB of RAM. The fact that it uses very little electricity, is virtually silent, and doesn’t heat my office goes some way towards justifying the cost!

To be honest, go with your favourite platform, or what you already have. It is even possible to run smaller models on a laptop with some success. Keep your expectations realistic and you will have fun, learn a lot and maybe find that you don’t need to rely on subscriptions to the cloud models, or at least not as often.


Above you can see a high-spec PC configured for AI workloads, next to an Apple Mac Studio. Both are great machines for what we need.

Getting Started

Once you have identified the hardware you wish to use for your local LLM, and confirmed you have enough space for the models you want to try (local LLM sizes vary from a few MB to 100GB+), you need to look at the software you are going to use.

I started with Ollama and recently moved to LM Studio; I’ve not personally tried vLLM. I found LM Studio to be a good choice, and for the time being I intend to stick with it. Screenshots and examples in this article (and probably future ones) will be based on LM Studio.

All of these options are free to download and under active development. LM Studio can be downloaded from here:

https://lmstudio.ai/download

Select your operating system and download.

After a simple installation you can run LM Studio, and one of the first things you need to do is download one or more models. There are hundreds of different models available; you could start with one whose name, or whose creator, you recognise, or just pick one. This is the model selection window:

There is a lot of terminology packed into the file names of some of these models; they often include the creator, the number of parameters, the quantisation level and so on. There is no need to worry about any of these for now, but going forward it would pay to learn what these terms mean and how they can affect your experience.
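
As an illustration, take a hypothetical file name such as Llama-3.1-8B-Instruct-Q4_K_M.gguf: Llama 3.1 is the model family, 8B means roughly eight billion parameters, Instruct means the model has been tuned to follow instructions, Q4_K_M describes the quantisation (roughly 4 bits per weight), and .gguf is the file format LM Studio and similar tools use.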

The most important thing is the size of the model and whether it will operate comfortably within the amount of RAM (usually VRAM on Windows/Linux PCs) you have available. If the selected model will fit comfortably then you will be given the opportunity to download it; if not, LM Studio will warn you first, and you can choose a different model instead.
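
As a rough rule of thumb, the weights alone need about (number of parameters × bits per weight ÷ 8) bytes, plus some headroom for the context window and the runtime. Here is a minimal sketch of that estimate, treating the figures as illustrative assumptions rather than exact values:

# Rough estimate of the memory needed to hold a model's weights.
# The figures below are illustrative assumptions, not exact values.
params_billions = 8      # e.g. an "8B" model
bits_per_weight = 4      # e.g. a Q4 quantisation

weights_gb = params_billions * bits_per_weight / 8   # 8 x 4 / 8 = ~4 GB of weights
print(f"Weights alone: ~{weights_gb:.1f} GB")
print("Allow roughly 20-50% extra for the context window and runtime overhead.")

So an 8-billion-parameter model at 4-bit quantisation wants somewhere in the region of 5 to 6 GB of free memory in practice, which is why model size is the first thing to check.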

The way RAM and VRAM are used differs significantly between PCs and Macs. The Mac uses unified memory, meaning the total RAM is shared by the operating system between the GPU and the CPU. On a PC you will have two separate values, RAM and VRAM: the graphics card containing your GPU has its own dedicated memory (VRAM), while the computer itself has a separate pool of system RAM. This is an important distinction at the moment, but it should become less relevant over time as some PC manufacturers start to introduce unified memory, or at least hardware changes that give the appearance and functionality of unified memory.

Note: The GPU is the Graphics Processing Unit and the CPU is the computer’s Central Processing Unit. LLM calculations generally take place on the GPU because it is faster at performing the kind of numerical calculations required by the LLM.

Your First Prompt

Once you have downloaded a model, you need to select it before you can run a prompt against it.

Once you have a model selected, you can type a prompt just as you would with ChatGPT or Claude.

Note that LM Studio is still under active development, and I have seen issues where you sometimes have to select the model twice before LM Studio realises it is loaded.
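
LM Studio also ships a small command-line tool called lms; assuming it is installed and on your PATH, running lms ls lists the models you have downloaded, and lms ps shows which ones are currently loaded into memory, which is a handy way to confirm that the model really did load.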

Below you can see two example chats that I tested against this model.


You have much more control over your models in LM Studio than you do with most commercial chatbots, so feel free to experiment. You can change parameters such as temperature, context length, system prompts, and the choice of model weights.
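
If you would rather drive the model from code than from the chat window, LM Studio can also expose the loaded model through a local, OpenAI-compatible server (started from within the app). The sketch below assumes the server is running on its default port 1234, that a model is already loaded, and that the openai Python package is installed; the model identifier and prompts are placeholders, not fixed values.

from openai import OpenAI

# LM Studio's local server speaks the OpenAI API; the key is ignored,
# but the client library still requires one to be set.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",   # placeholder: use the identifier LM Studio shows for your model
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain unified memory in one short paragraph."},
    ],
    temperature=0.7,       # lower = more predictable, higher = more varied
)
print(response.choices[0].message.content)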

Summary

There are many additional things you can do with your own hosted LLMs, including accessing them across your network, or even across the internet using an integrated Tailscale VPN. This also allows you to use larger models from a computer with much less VRAM, because the compute is carried out on your main computer, where the larger models reside. This technology is quite recent and is called LM Link.
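
Assuming the serving machine is reachable over your LAN or tailnet under a hostname such as mac-studio (a placeholder here), a client on another computer can use the same OpenAI-style calls shown earlier, simply replacing localhost with that hostname or IP address, for example http://mac-studio:1234/v1.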

There is also a fascinating technology called ‘speculative decoding’, which allows you to pair a smaller LLM (the draft model) with a larger one: the draft model proposes tokens which the larger model then verifies. In many circumstances this can significantly increase performance.
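
For example, you might pair a small draft model with a much larger main model from the same family, say a 1B-parameter draft alongside an 8B-parameter main model; the draft races ahead suggesting tokens, and because every suggestion is checked by the larger model, the output quality remains that of the larger model while arriving faster.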