Summary: GPT-4o vs Local Language Models in Ollama


This entry was intended to be Part-2 of my most recent report, but on review I realized it is probably the natural follow-up to the other writings on the current state of LLMs in the first half of 2024 (we've not even gone down the road of Small Language Models yet!). Also, isn't it apt that Star Wars: Episode II is titled "Attack of the Clones"?

So, how do we do this?

Let's start with the overall measurements. The performance metrics can be grouped into four (4) buckets and are described as follows:

  • Total Duration: The total time taken for each model to complete the task.
  • Load Duration: The time taken to load the model into memory.
  • Prompt Eval Count: The number of prompt tokens evaluated.
    • Prompt Eval Duration: The time taken to evaluate the prompt.
    • Prompt Eval Rate: The rate at which prompt tokens are evaluated (tokens per second).
  • Eval Count: The number of tokens generated in the response.
    • Eval Duration: The time taken to generate the response tokens.
    • Eval Rate: The rate at which response tokens are generated (tokens per second).
Ollama performance metrics across selected normalized models on Apple Intel hardware.
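All of these counters come straight back from Ollama itself. Below is a minimal sketch of pulling them from the `/api/generate` endpoint of a local Ollama server on the default port; the model name and prompt are illustrative, and the durations Ollama reports are in nanoseconds.

```python
# Sketch: reading Ollama's timing counters from /api/generate.
# Assumes a local Ollama server on the default port and that the model
# below has already been pulled; model name and prompt are placeholders.
import json
import urllib.request

payload = json.dumps({
    "model": "llama2",
    "prompt": "Explain quantization in one paragraph.",
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

ns = 1e9  # Ollama reports durations in nanoseconds
metrics = {
    "total_duration_s": data["total_duration"] / ns,
    "load_duration_s": data["load_duration"] / ns,
    "prompt_eval_count": data["prompt_eval_count"],
    "prompt_eval_rate_tps": data["prompt_eval_count"] / (data["prompt_eval_duration"] / ns),
    "eval_count": data["eval_count"],
    "eval_rate_tps": data["eval_count"] / (data["eval_duration"] / ns),
}
print(json.dumps(metrics, indent=2))
```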

Here are two (2) really good articles on measuring inference from Baseten and Databricks. While there is guidance in terms of hardware, quantization (model re/compression), power, batch sizing, network speed, and the like, remember that we're somewhat stuck on some of these given the aforementioned base hardware. Again, that's fine, since it affords us a semi-stable baseline to compare against in this discussion. When and if we need to map it against cloud cost, we can simply make an assumption for an apples-to-apples comparison.

Moving on. We're going to concentrate on these specific metrics:

  1. Time To First Token (TTFT) - When streaming, this is literally how soon a user starts seeing the first characters of a model's output after entering a query. Low waiting times are a must in real-time interactions, but less critical in offline workloads. The metric is derived from the time required to process the prompt plus the time to generate the first output token.
  2. Time Per Output Token (TPOT) - The amount of time needed to generate each output token for a user querying your system. This metric corresponds to how each user perceives the "speed" of the model: it translates directly into the cadence, in milliseconds per token, at which a response appears on screen.
  3. Words Per Minute (WPM) - In Part-1 of this performance test we used a formula for calculating Words per Minute (WPM) from Tokens per Second (TPS); see the sketch after this list.
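Here is a small sketch of how those three numbers can be derived from Ollama's raw counters. The TTFT approximation (load time plus prompt evaluation time) and the 0.75 words-per-token factor are assumptions on my part; swap in whatever ratio Part-1 used.

```python
# Sketch: deriving TTFT, TPOT, and WPM from Ollama's counters.
# The 0.75 words-per-token factor is a rough rule of thumb, not an
# official conversion, and TTFT here is only an approximation.
def derive_metrics(load_duration_ns, prompt_eval_duration_ns,
                   eval_count, eval_duration_ns, words_per_token=0.75):
    ns = 1e9
    # TTFT: time before the first output token appears
    # (model load + prompt processing), in seconds.
    ttft_s = (load_duration_ns + prompt_eval_duration_ns) / ns
    # TPOT: average time to produce each output token, in milliseconds.
    tpot_ms = (eval_duration_ns / ns) / eval_count * 1000
    # WPM: tokens/second converted to words/minute via the assumed ratio.
    tps = eval_count / (eval_duration_ns / ns)
    wpm = tps * 60 * words_per_token
    return {"ttft_s": ttft_s, "tpot_ms": tpot_ms, "wpm": wpm}
```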

To estimate the potential cost of running the models based on the data, we need to consider several factors, including compute time, instance type, and the hourly cost of the compute resources. Since performance metrics such as total duration, load duration, and evaluation duration are available, we can use them to approximate the compute time.

Assumptions: Google Cloud Platform
Instance Type: NVIDIA V100 GPU instance.
Hourly Cost: Approximate cost of an NVIDIA V100 GPU instance is $3/hour.
Total Duration: Includes the load duration and evaluation duration.

Example calculations: myllama2
Total Duration: 56.044795989 seconds
Compute Time (hours): \( \frac{56.044795989}{3600} \approx 0.01557 \) hours
Cost: \( 0.01557 \times 3.00 = \$0.04671 \)
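The same arithmetic as a reusable sketch, using the $3/hour V100 figure assumed above (not a quoted price):

```python
# Sketch: approximating cloud cost from wall-clock compute time.
# The $3/hour V100 rate is the rough assumption stated above.
def estimate_cost(total_duration_s, hourly_rate_usd=3.00):
    compute_hours = total_duration_s / 3600
    return compute_hours * hourly_rate_usd

# myllama2 example from above:
print(f"${estimate_cost(56.044795989):.5f}")  # ~$0.04670
```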

When Cost ($) is factored in, however, the sorting goes like this:

  • Primary Criteria: Cost (lower is better)
  • Secondary Criteria: TTFT (lower is better)
  • Tertiary Criteria: TPOT (lower is better)
  • Quaternary Criteria: WPM (higher is better)
Models sorted by Cost > TTFT > TPOT > WPM
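That ordering maps neatly onto a single multi-key sort. The sketch below shows the idea with made-up sample rows (the numbers are not from the benchmark table); negating WPM turns "higher is better" into an ascending sort so one call handles all four criteria.

```python
# Sketch: the Cost > TTFT > TPOT > WPM ordering as a Python sort key.
# The sample rows are illustrative placeholders, not measured results.
models = [
    {"name": "mistral",  "cost": 0.031, "ttft": 1.9, "tpot": 95,  "wpm": 310},
    {"name": "myllama2", "cost": 0.047, "ttft": 2.4, "tpot": 110, "wpm": 280},
    {"name": "llama3",   "cost": 0.052, "ttft": 2.1, "tpot": 105, "wpm": 295},
]

# Lower cost, TTFT, TPOT first; higher WPM first (hence the negation).
ranked = sorted(models, key=lambda m: (m["cost"], m["ttft"], m["tpot"], -m["wpm"]))
for m in ranked:
    print(m["name"], m["cost"], m["ttft"], m["tpot"], m["wpm"])
```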

All things considered, Mistral 7B is the most economical, with metric performance around the upper third of the pack. This is unsurprising: apart from cost, its performance in combination with Llama 2 is what used to power Perplexity AI.

Llama 3 and its derivatives rank lower on the cost spectrum. What needs to be noted here, however, is that this model shines on the conversational-length side of things when temperature and top_p are tuned. I was really impressed by its chat capability the first time I tested this model on Ollama, granted it was at temperature = 0.8 (see the sketch below).
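For reference, here is a minimal sketch of passing those sampling parameters to Ollama's `/api/chat` endpoint. The temperature of 0.8 mirrors the chat test mentioned above; the top_p value and the prompt are illustrative, not recommendations.

```python
# Sketch: tuning temperature and top_p via Ollama's /api/chat options.
# Assumes a local Ollama server; top_p=0.9 is an illustrative value.
import json
import urllib.request

payload = json.dumps({
    "model": "llama3",
    "messages": [{"role": "user", "content": "Tell me a long story about clones."}],
    "stream": False,
    "options": {"temperature": 0.8, "top_p": 0.9},
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["message"]["content"])
```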

Some of the key takeaways here are:

  1. Use the right model for the right task. Function-calling agents are now becoming more practical, so consider mixing and matching sections of your flow to use the most appropriate model for each.
  2. Sacrificing a little cost to gain better performance could be worth it in the long run. Model training is a different class all by itself, and while you might get away with using the cheapest model, it may come back to bite you later when you have to validate against hallucinations or other unintended artifacts.

(this entry is published in Draft; details to be filled-in as time permits this week; and after a nap)