Community post by Anup Ghatage

Log messages are essential for debugging and monitoring applications, but they can often be overly verbose and cluttered, making it difficult to quickly identify and understand critical information. This is especially true in large codebases with multiple contributors over time, leading to inconsistencies in log line formats, lengths, and language due to employee churn and transient open-source contributions.

Most organizations are ‘AI curious’ about using it to solve some of their existing workflows and problems, but they face a common conundrum: concerns about software and model licensing, about sending sensitive data to external services, and about the cost of the GPU hardware these models seem to require.

These reasons alone are enough for most enterprises to ‘stop and wait till things get clearer.’ It’s true that all things AI are moving at breakneck speed and will certainly get better, but today’s customers already demand the productivity gains that AI promises. That makes waiting precarious: companies that sit out the start of a new tech cycle lose the opportunity to capture the majority of its value.

In this post, we’ll explore how to use open source, Apache 2.0 licensed local LLMs to tackle the task of streamlining log lines. By reducing the verbosity of log lines while maintaining the context and details, we can not only improve and maintain their readability but also realize cost savings when using log management tools like Splunk.

So how do we get started?

When working at the enterprise level, third party software has to be used with caution, mainly because of licensing restrictions. One of the most popular open source licenses is the Apache 2.0 license. It is very permissive and allows the software to be used commercially. Other licenses like the MIT license are similarly permissive and allow free commercial use. So whatever tools and models we use, we must ensure that they are either Apache 2.0 or MIT licensed.

Another problem remains: how do we use these models (that is, run inference on them) if we don’t have any GPUs? The answer is quantization.

Large Language Models (LLMs) are incredibly powerful tools, but their computational demands often limit them to powerful GPUs in cloud environments. This can be a barrier for many users, especially those without access to expensive hardware or the expertise to manage it.

Here’s where quantization comes in. It’s a technique that optimizes LLMs for local inference by reducing the precision of their weights. Traditionally, these weights are stored as high-precision floating-point numbers. Quantization converts them to a lower-precision format, like 8-bit integers in our case. This significantly reduces the model’s size and memory footprint.

The benefit? By using 8-bit quantization, we can run LLMs on standard server CPUs. This opens doors for wider adoption, particularly for enterprises: smaller model files, a lower memory footprint, no dependence on GPUs, and inference that stays entirely on your own infrastructure.

In essence, quantization acts as a bridge, allowing even those without top-tier hardware to leverage the power of LLMs directly on their own machines.

There is, however, a trade-off between size and quality. Lower quantizations like 4-bit or 2-bit make the model files very small, but the quality of their output degrades as well. An 8-bit quantization therefore gives us a relatively slower but comparatively more accurate option for hosting these models locally.
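As a rough back-of-the-envelope sketch (these numbers ignore per-block scaling factors and runtime overhead such as the KV cache), here is why an 8-bit quantization of a 7-billion-parameter model fits comfortably in ordinary server RAM:

# Approximate memory footprint of a 7B-parameter model at different weight precisions.
# These are estimates only; they ignore quantization scaling factors and the KV cache.
params = 7_000_000_000

bytes_per_weight = {
    "fp16 (original)": 2.0,
    "Q8_0 (8-bit)": 1.0,
    "Q4_0 (4-bit)": 0.5,
}

for name, size in bytes_per_weight.items():
    print(f"{name:16s} ~{params * size / 1024**3:5.1f} GiB")

Running this prints roughly 13 GiB for fp16, 6.5 GiB for 8-bit, and 3.3 GiB for 4-bit, which is why the 8-bit file is a practical sweet spot for CPU-only machines.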

There are several different quantization formats, namely GGUF, AWQ, and GPTQ.
For the purposes of this article, we will use GGUF models, as they have the most active community around them.

Where can I get these models? How do I host them? How do I query them?

HuggingFace is one of the most popular repositories of machine learning models and datasets on the internet. Many companies and users upload their models and experiments there for anyone to use. Each model on HuggingFace comes with a ‘Model Card’ – a sort of instruction recipe covering how to use the model, its prompt format, licensing requirements, and limitations.

While looking at models, we must look for GGUF quantizations that are offered under Apache 2.0 or MIT licenses, and even then, be sure to read the fine print. One of the most popular open source LLMs is Mistral-7B-Instruct-v0.2, whose weights were released under the Apache 2.0 license.
We will use this for our sample use case. Let’s go over to the HuggingFace page for a GGUF quantization of this model and try to understand it.
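As an aside, the quantized weights can also be fetched programmatically with the huggingface_hub library. The repository name below (TheBloke’s GGUF repackaging) is an assumption that matches the model file used later in this post; adjust the repo and file name for whichever quantization you pick.

# Download an 8-bit GGUF quantization of Mistral-7B-Instruct-v0.2 from HuggingFace.
# Requires: pip install huggingface_hub
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",  # assumed GGUF repackaging of the model
    filename="mistral-7b-instruct-v0.2.Q8_0.gguf",     # the 8-bit quantization used below
)
print(model_path)  # local path to the downloaded .gguf file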

Licensing

The quantizations automatically have to follow the licensing requirements of the original models. In our case, Mistral AI’s model carries the Apache 2.0 license, which can also be seen at the top of the page for the quantized model.

The prompt template

<s>[INST] {prompt} [/INST]

The prompt template is the format in which the model expects to receive its input, most likely because the model was trained with these special characters and tokens. In this case, we have to package our query in place of ‘{prompt}’ in the template shown above. For example, if we were to ask “Why is the sea blue?”, the actual instruction we send would be:

<s>[INST] Why is the sea blue? [/INST]
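A tiny helper makes this packaging explicit; the function name here is just for illustration.

def to_mistral_prompt(query: str) -> str:
    # Wrap a raw query in the Mistral instruct prompt template.
    return f"<s>[INST] {query} [/INST]"

print(to_mistral_prompt("Why is the sea blue?"))
# -> <s>[INST] Why is the sea blue? [/INST]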

Hosting these models locally can be done in a variety of ways, but we will go over two of the easiest options today: llama.cpp and Ollama.

Both of these projects are free, open source, and MIT licensed. You can read about how to build and run them on their respective project pages.

Once up and running, both of these provide OpenAI-compatible API access, and we can query their endpoints via standard REST API calls.
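As a quick sanity check, a request to the OpenAI-compatible chat endpoint might look like the sketch below. The URL assumes llama.cpp’s default port of 8080 (Ollama listens on 11434 by default), so adjust it for your setup.

# Query the OpenAI-compatible chat endpoint of a locally hosted model.
# URL assumes llama.cpp's default port (8080); Ollama's default is 11434.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-7b-instruct-v0.2",  # informational; the server answers with whatever model it loaded
        "messages": [{"role": "user", "content": "Why is the sea blue?"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])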

In summary, so far we’ve got open source commercially viable LLM model files and ways to host them and query them – all locally, without the need for GPUs and for free. Now let’s jump into the use case that we started the article with.

Reducing Splunk costs with succinct logs

Log management platforms like Splunk can be expensive to operate, with costs often tied to the volume of log data ingested and stored. By shrinking log lines using LLMs, we can significantly reduce the amount of data that needs to be processed and retained, translating into cost savings.

Shorter, more concise log lines not only consume less storage space but also require less bandwidth for transmission and reduce the computational load on Splunk’s indexers and search heads. This can lead to lower infrastructure requirements and reduced operational expenses.

Leveraging llama.cpp for Local LLM Inference

To run inference on our local LLMs, we’ll be using llama.cpp. However, we want to think of the llama.cpp server endpoint as a ‘reasoning and intelligence’ endpoint, rather than a standard REST endpoint. Using it in this way will enable us to build applications around it.


To begin, let’s start the llama.cpp server with our downloaded model.

./server -m /Users/aghatage/mistral-7b-instruct-v0.2.Q8_0.gguf -c 2048 --port 8000

The above command starts the server with the 8-bit quantized Mistral-7B-Instruct-v0.2 model. We can now make REST API calls to the endpoints hosted at http://localhost:8000/; specifically, we will leverage the http://localhost:8000/completion endpoint. Most enterprises can host this kind of server safely within their own environments, or even on a developer’s laptop.

The Log Line shrinking process

As discussed before, if we want to use the LLM as an intelligence endpoint, we must speak its language: we will have to do some scripting to hand-deliver exactly the information we want the LLM to process.

1. **Log Line Extraction**: We’ll start by extracting all single-line log lines from our code base using a Python script.

import os
import re

def find_log_lines(codebase_path):
    # Regex to match a log line, modified to be case-insensitive
    log_line_pattern = re.compile(r'\b(log\.\w+)\(.*?\);', re.IGNORECASE)

    for root, dirs, files in os.walk(codebase_path):
        for file in files:
            if file.endswith(".java"):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    lines = f.readlines()
                    for i, line in enumerate(lines):
                        if log_line_pattern.search(line):
                            print(f"{file_path} : {i+1} : {line.strip()}")

# Replace '/path/to/java/codebase' with the actual path to your Java codebase
codebase_path = '/path/to/java/codebase'
find_log_lines(codebase_path)

The above script prints all the log lines in the following format:

<file path> : <line number> : <actual log line>

2. **Prompt Preparation**: For each log line, we’ll query the local LLM to attempt to make it ‘shorter’. Each of these REST calls needs a prompt that includes the log line itself and a request to reduce its verbosity while preserving key details. The prompt also includes a worked example (one-shot learning) to show the LLM what to expect and what the output should look like. The CODE_LINE placeholder in the prompt is replaced with a new log line on every call. Finally, we reiterate the expected JSON structure to make sure the output is JSON.

base_prompt = """<s>[INST]
Here is an example of shortening a log line
input: log.error("Encountered error at writing records", t);
output:
{
    "fixed" : "log.error("Write error", t);"
}

Now reword the following log line to be shorter if possible:
```
CODE_LINE
```
Please rewrite the log line to be shorter, do not add anything else.
Make sure response is in JSON format like this:
```
{
    "fixed" : "<actual shortened log line>"
}
```[/INST]
"""

3. **Inference via llama.cpp**: We’ll send these prompts to a local llama.cpp server running our quantized LLM model. To constrain the output format, we’ll use llama.cpp’s grammar feature, ensuring that the model’s responses are structured as valid JSON.

The ‘GBNF’ form of the grammar that we will use is the standard JSON grammar that comes with llama.cpp.

root   ::= object
value  ::= object | array | string | number | ("true" | "false" | "null") ws

object ::=
  "{" ws (
            string ":" ws value
    ("," ws string ":" ws value)*
  )? "}" ws

array  ::=
  "[" ws (
            value
    ("," ws value)*
  )? "]" ws

string ::=
  "\"" (
    [^"\\\x7F\x00-\x1F] |
    "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F]) # escapes
  )* "\"" ws

number ::= ("-"? ([0-9] | [1-9] [0-9]*)) ("." [0-9]+)? ([eE] [-+]? [0-9]+)? ws

# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= ([ \t\n] ws)?

Finally, when sending the prompt to the llama.cpp server, we must also include the grammar so that inference is constrained to generate only the JSON we initially prompted it for.
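Before calling the endpoint, the script needs a few pieces of setup. The sketch below assumes the grammar above has been saved locally as json.gbnf (it ships with llama.cpp under grammars/json.gbnf); the variable names match the function that follows.

import json
import requests

# Endpoint of the llama.cpp server we started earlier (note the --port 8000 flag).
url = "http://localhost:8000/completion"
headers = {"Content-Type": "application/json"}

# Load the JSON grammar shown above so it can be sent with every request.
with open("json.gbnf", "r") as f:
    list_grammar = f.read()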

def get_log_fix_response(code_line):
    final_prompt = base_prompt.replace("CODE_LINE", code_line)

    data = {"prompt": final_prompt, "n_predict": -1, "grammar": list_grammar}

    try:
        response = requests.post(url, headers=headers, json=data)
        response.raise_for_status()
        json_output = json.loads(response.json().get("content"))
        return json_output["fixed"]
    except requests.RequestException as e:
        print(f"Error making HTTP request: {e}")
        return None
    except (json.JSONDecodeError, KeyError, TypeError) as e:
        print(f"Unexpected response from the model: {e}")
        return None

We can now use the JSON responses from the LLM to rewrite the original log lines in our source files, effectively shrinking them while retaining critical information.
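A minimal sketch of that last step is shown below. It reuses the file-path and line-number format printed by the extraction script along with the get_log_fix_response helper above; the rewrite_log_line function itself is purely illustrative.

def rewrite_log_line(file_path, line_number):
    # Shorten one log line via the local LLM and write the file back in place.
    with open(file_path, "r") as f:
        lines = f.readlines()

    original = lines[line_number - 1]
    shortened = get_log_fix_response(original.strip())
    if shortened:
        indent = original[: len(original) - len(original.lstrip())]  # preserve the original indentation
        lines[line_number - 1] = indent + shortened + "\n"
        with open(file_path, "w") as f:
            f.writelines(lines)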

Pros and Cons

Using local LLMs to shrink log lines offers several advantages:

– **Improved Readability**: By reducing the verbosity of log lines, we can make them easier to read and understand, facilitating more efficient debugging and monitoring.

– **Cost Savings**: Shorter log lines translate into reduced storage, bandwidth, and computational requirements for log management platforms like Splunk, leading to significant cost savings.

– **Localized Processing**: With local LLMs, all processing happens on-premises, eliminating the need to send sensitive data to external services.

– **Cost-effective**: By leveraging open source tools and running on CPUs, this approach can be more cost-effective than using cloud-based LLM services.

However, there are also some potential downsides to consider:

– **Model Quality**: While open source LLMs can be highly capable, their performance may not match that of proprietary, state-of-the-art models.

– **Maintenance**: As LLM models and frameworks evolve, maintaining and updating local setups may require ongoing effort.

To summarize…

By combining open source local LLMs, quantization techniques, and tools like llama.cpp or Ollama, we can develop cost-effective solutions for enterprise-level requirements.
In this example, in addition to improving log readability and efficiency, the approach can lead to substantial cost savings when using log management platforms like Splunk. While there are trade-offs to consider, the potential benefits make it a compelling option for organizations looking to optimize their logging infrastructure and reduce operational expenses. This post also provides a cost-efficient, low-risk blueprint for organizations looking to infuse large language models into their existing workflows.

Anup has over 10 years of experience in building data infrastructure at large enterprises. His expertise is in database internals, data storage and generative AI.