In-Depth

Pros and Cons of Running a Large Language Model Locally

It's safe to say that hundreds of companies (at least) are implementing AI large language model systems. The standard way to do this is to use an LLM such as GPT-4 from OpenAI, Llama from Meta, or Claude from Anthropic. These systems are hosted in the Cloud, and so using them requires an active Internet connection and, in most cases, a paid subscription. But it is also possible to run an LLM system locally on company server machines in a completely isolated manner, free of charge.

Briefly, advantages of running an LLM system locally include:

  1. Local systems are free to use, decreasing usage costs.
  2. Local systems are less likely to suffer a network outage, increasing reliability.
  3. Local systems maintain all data on-premises, increasing security.

Each of these three advantages holds only under specific usage conditions.

The disadvantages of running a local LLM system include:

  1. Relatively high initial costs to purchase powerful hardware.
  2. The need for moderate-to-high levels of in-house technical expertise.
  3. Limited ability to use the most advanced LLM models.

In short, the decision of whether to use a local LLM system, a cloud LLM system, or a combination of the two depends on the usage scenario. As a general rule of thumb, relatively straightforward scenarios, such as document summarization, question-answering, and sentiment analysis, are best suited for implementation as a local LLM system. Complex scenarios, such as computer programming code generation, data analysis, and chain-of-thought questions, are least suited for local LLM implementation.

An Example
The pros and cons of using a local LLM system are perhaps best illustrated by a concrete example. The screenshot in Figure 1 shows a local LLM answering the question, "What is a good science fiction movie from 1956? Tell me just the movie title and year it was released." The system responds with "Forbidden Planet (1956)".

Figure 1: A Local LLM System in Action

The demo is run using Ollama, an open source application. In spite of its name, Ollama is not developed or maintained by Meta, the company that creates the Llama family of large language models. An alternative application that hosts locally running LLMs is LM Studio, which is developed by Element Labs and is not open source. Although accurate usage data is hard to come by, Ollama and LM Studio appear to be roughly equal in popularity. A third application that runs LLMs locally is GPT4All, an open source effort from Nomic AI.

The left side of Figure 1 shows how Ollama runs behind the scenes. Ollama downloads open source LLMs using the "pull" command, as sketched below. In this example, the system has previously downloaded the gpt-oss:20b LLM, an open source model from OpenAI that has 20 billion parameters and produces results roughly similar to GPT-4. Other open source LLMs that can be run locally include Llama (from Meta), Mistral (from Mistral AI), and Gemma (from Google). LLMs that are not open source, such as GPT-4 itself and Claude, cannot be run locally.
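For example, assuming Ollama is installed and on the system path, a model can be fetched ahead of time at a shell prompt or from a short Python script. A minimal sketch:

import subprocess
# download the gpt-oss:20b weights into the local Ollama model store;
# equivalent to typing "ollama pull gpt-oss:20b" at a shell prompt
subprocess.run(["ollama", "pull", "gpt-oss:20b"], check=True)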

The right side of Figure 1 shows how Ollama programmatically accepts a request and returns a response. Each LLM has its own application programming interface. Because the example system is running a gpt-oss LLM from OpenAI, the demo program uses the OpenAI API.

The demo program is written using the Python language and begins with:

from openai import OpenAI
print("Begin run LLM locally demo ")
# set the OpenAI client to point to the local Ollama server
client = OpenAI(
  base_url="http://localhost:11434/v1",  # default endpoint
  api_key="dummy_key"  # required but not used by Ollama
)

Notice that the system points to localhost, which is an alias for the machine that the program is running on, rather than to the OpenAI servers in the Cloud. Also, no real API key is needed, as would be required for a paid Cloud subscription; Ollama accepts a placeholder value.

The demo program uses the client.chat.completions.create() function to send a question and fetch the response from the gpt-oss LLM, as sketched below. At the time this article was written, OpenAI had introduced an improved client.responses.create() function, but Ollama did not yet support it. This kind of development lag is another disadvantage of running a local LLM system.
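A minimal sketch of the call, assuming the client object shown above and a previously pulled gpt-oss:20b model (the prompt text matches Figure 1):

# send the question to the locally hosted model and display the reply
response = client.chat.completions.create(
  model="gpt-oss:20b",
  messages=[{"role": "user",
             "content": "What is a good science fiction movie from 1956? "
                        "Tell me just the movie title and year it was released."}]
)
print(response.choices[0].message.content)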

Some Observations
When the Pure AI editors asked Dr. James McCaffrey to comment, he noted, "In general, for the client companies I work with, running a local LLM system is less preferred than running a standard Cloud-based LLM system. Three scenarios where companies choose to run an LLM system locally are 1.) small companies that have in-house technical expertise and have a large volume of requests, 2.) companies that need an isolated system without a guaranteed Internet connection, 3.) companies that want completely local systems to strive for a high level of security."

He added, "Many companies are hedging their bets by looking at both standard Cloud-based and locally based systems. AI systems are developing so rapidly that it's a good idea to architect systems with maximum flexibility. For many startups and enterprise companies, a hybrid approach works well -- run the core model locally for sensitive or latency-critical tasks, and offload heavy or high-volume workloads to a managed API."
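Because Ollama exposes an OpenAI-compatible endpoint, that kind of flexibility can be kept in code by constructing the client from configuration, so the same request logic targets either the local server or a Cloud service. A minimal sketch; the routing flag and environment variable name are illustrative assumptions, not part of the demo program:

import os
from openai import OpenAI

def make_client(use_local: bool) -> OpenAI:
  # same client class, different target: local Ollama server or the Cloud service
  if use_local:
    return OpenAI(base_url="http://localhost:11434/v1", api_key="dummy_key")
  return OpenAI(api_key=os.environ["OPENAI_API_KEY"])  # real key needed for the Cloud

# example: route sensitive or latency-critical work to the local model
client = make_client(use_local=True)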
