We are quickly entering the era of AI in tech and software development. AI is no longer relegated to research labs; we can now access it on the web and even run it on our personal devices, including our own computers. This article explores how we can use LLamaSharp to run a Large Language Model (LLM), similar to ChatGPT, locally using C#.

To download the source code for this article, you can visit our GitHub repository.

Let’s get into it!

What Is LLamaSharp?

LLamaSharp is a cross-platform library that enables users to run an LLM locally on their device. It is based on the C++ library llama.cpp and allows us to deploy LLMs into our C# applications. LLamaSharp exposes APIs for configuring a session with an LLM, including chat history, prompts, anti-prompts, chat sessions, inference parameters, and more.


How to Run an LLM Locally Using C# and LLamaSharp

We begin by discussing how to install LLamaSharp into a C# application. Next, we will explore the variety of free models and where we can download them. Once we have a model, we will prepare a new C# application that can deploy an LLM. Following that, we’ll talk about how to configure the LLM. Lastly, we will look at a conversation with an LLM running locally in a C# app.

Install LLamaSharp

First, we will create a console application that will host our deployment of an LLM. We begin by installing the LLamaSharp package: 

dotnet add package LLamaSharp --version 0.11.2

Additionally, let’s install the backend package. We must use the same version of the backend as the LLamaSharp package:

dotnet add package LLamaSharp.Backend.Cpu --version 0.11.2

The backend packages contain the native llama.cpp libraries that LLamaSharp calls into, so installing one of them is required for the model to run. Here, we use the CPU backend, which works on any machine.
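
If the machine has a supported NVIDIA GPU, we could install a CUDA backend instead, assuming a backend version matching our LLamaSharp version is published; the exact package to pick depends on your hardware and drivers, so check the LLamaSharp documentation. For example:

dotnet add package LLamaSharp.Backend.Cuda12 --version 0.11.2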

How to Choose an LLM

There are many factors to consider when choosing the right LLM for an application, such as the role the LLM plays within the application, performance, memory limitations, model stability, and how the model was trained. In this section, we will discuss some of these considerations.

We will start with the models recommended for LLamaSharp. The documentation lists models verified to work well with specific versions, but that does not mean only those models will work; LLamaSharp should work with most models distributed as GGUF files. We will discuss GGUF in more detail later.

Let’s look at the recommended models for specific LLamaSharp versions:

LLamaSharp version | Verified model resources | llama.cpp commit id
v0.10.0            | Phi2                     | d71ac90
v0.11.1, v0.11.2   | LLaVA-v1.5, Phi2         | 3ab8b3a
v0.12.0            | LLama3                   | a743d76
v0.13.0            | N/A                      | 1debe72

In this article, we have installed LLamaSharp v0.11.2, so we will deploy a version of the Phi2 model in our application.

Download the LLM

Now that we have chosen a model to deploy, we can download it to our computer. We can visit HuggingFace to find a trained version of Phi2 that suits our needs. HuggingFace is an online repository for machine learning projects, and it also provides services for training and editing ML models, including a converter script that transforms models in other formats into the GGUF format. For the application we are building in this article, we will use the Phi2 model uploaded by the user TheBloke.

Specifically, we are using the phi-2.Q4_K_M.gguf version of the Phi2 model. Q4_K_M refers to the quantization method used to build this specific model file. Quantization is a technique used in machine learning to reduce the size and computational demands of an LLM by approximating the full-precision weights and activations with lower-precision representations. While quantization can significantly decrease the size of a model and improve inference speed, making models more efficient and easier to deploy, it can also lead to a slight degradation in accuracy. In other words, there is a trade-off between efficiency and precision, and Q4_K_M sits at a happy medium.
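
For a rough sense of the savings (these are back-of-the-envelope figures, not exact file sizes): Phi2 has roughly 2.7 billion parameters, so storing the weights at 16-bit precision takes about 2.7 billion × 2 bytes ≈ 5.4 GB, while a 4-bit quantization such as Q4_K_M needs only about 2.7 billion × 0.5 bytes ≈ 1.4 GB plus some overhead, which is why the file we download is well under 2 GB.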

Prepare a C# Solution That Can Run an LLM Locally

Now that we have an application with LLamaSharp installed, and we have downloaded a model, we can start building the code to deploy this model into our app:

using LLama;
using LLama.Common;

// Path to the GGUF model file we downloaded from HuggingFace
string modelPath = @"C:\Users\myName\Downloads\phi-2.Q4_K_M.gguf";

var parameters = new ModelParams(modelPath)
{
    ContextSize = 4096,
    GpuLayerCount = 5
};

using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);
var executor = new InteractiveExecutor(context);

var chatHistory = new ChatHistory();
chatHistory.AddMessage(AuthorRole.System, @"Transcript of a dialog, where the User interacts with an
Assistant named Bob. Bob's role is to be helpful, provide concise answers, and maintain a kind 
tone in all interactions. You may be asked to draft emails or messages, so please ensure that 
your responses are clear, professional, and considerate. Stick strictly to the information 
provided and do not add any additional commentary or details beyond the task at hand. 
When asked to make a list, only respond with the list and no additional information. 
When given specific instructions, such as providing a list or a certain number of items, 
ensure you follow those instructions exactly. Remember, your goal is to assist in the best way 
possible while making communication effective and pleasant.");
chatHistory.AddMessage(AuthorRole.User, "Hello, Bob.");
chatHistory.AddMessage(AuthorRole.Assistant, "Hello. How may I help you today?");

Firstly, we set the initial parameters for our model by providing the ModelParams constructor with the path to our GGUF file, and we also set the ContextSize and GpuLayerCount. Using this ModelParams object, we load the model weights from the file with the LLamaWeights.LoadFromFile() method. Next, we create a LLamaContext object by calling CreateContext() and pass this context object into the constructor of InteractiveExecutor. Following this, we create a ChatHistory object, where we can give the model a prompt to direct its behavior and set our expectations of how it should respond to inputs.

To learn more about working with user input in console applications, check out our article How to Detect if a Key is Pressed in a C# Console Application.

Now that we have set up some preliminary configuration and written a prompt, we can create a chat session and write some code to loop between the LLM’s responses and the user’s inputs:

ChatSession session = new(executor, chatHistory);

InferenceParams inferenceParams = new InferenceParams()
{
    MaxTokens = 1024,
    AntiPrompts = new List<string> { "User:" }, 
};

Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write("The chat session has started.\nUser: ");
Console.ForegroundColor = ConsoleColor.Green;
string userInput = Console.ReadLine() ?? "";

while (userInput != "exit")
{
    await foreach (
        var text
        in session.ChatAsync(
            new ChatHistory.Message(AuthorRole.User, userInput),
            inferenceParams))
    {
        Console.ForegroundColor = ConsoleColor.White;
        Console.Write(text);
    }
    Console.ForegroundColor = ConsoleColor.Green;
    userInput = Console.ReadLine() ?? "";
}

Now that we have finished configuring the model parameters, we begin a chat session using the ChatSession constructor, passing it the executor and chatHistory as parameters. Next, we create an InferenceParams object, where we set the maximum number of tokens in a response to 1024 and add an anti-prompt, a string that, once generated, indicates the LLM has finished responding and is awaiting more user input. Next, we set the console text color to differentiate between user input and LLM output. Lastly, we open a while loop that terminates when userInput matches the string “exit”. While the loop runs, it outputs the LLM’s responses and reads in the next userInput.
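
Depending on the LLamaSharp version, the streamed text may include the anti-prompt itself (the trailing "User:"). As a minimal sketch, assuming we are willing to give up token-by-token streaming, we could buffer each response inside the loop and trim that marker before printing it:

// Hypothetical variation on the loop body above: buffer the streamed response
// and strip a trailing "User:" anti-prompt before displaying it.
// Requires: using System.Text;
var response = new StringBuilder();
await foreach (var text in session.ChatAsync(
    new ChatHistory.Message(AuthorRole.User, userInput), inferenceParams))
{
    response.Append(text);
}

var cleaned = response.ToString().TrimEnd();
if (cleaned.EndsWith("User:"))
{
    cleaned = cleaned[..^"User:".Length].TrimEnd();
}
Console.WriteLine(cleaned);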

Importance of a Good LLM Prompt

A prompt plays a critical role in the performance and user experience of an LLM. It serves as the initial input that guides the model’s responses, influencing the quality and relevance of the generated output. A well-crafted prompt provides clear context and specific instructions, enabling the LLM to produce accurate and coherent responses. Conversely, a vague prompt can lead to irrelevant or nonsensical outputs, as the model may struggle to understand the intended task or context. Therefore, crafting precise and descriptive prompts is important for harnessing the full potential of LLMs.
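
As an illustration, the system prompt from earlier could be tightened by spelling out the formatting rules more explicitly. The wording below is purely our own suggestion, not a required format, and would need tuning for a specific model and use case:

// A hypothetical, stricter variant of the system prompt used earlier.
chatHistory.AddMessage(AuthorRole.System, @"You are Bob, a concise and polite assistant.
Follow these rules exactly:
- Answer only the question that was asked; do not add commentary or follow-up questions.
- When asked for a list of N items, reply with exactly N numbered items and nothing else.
- When asked to draft an email, reply with the email text only.
- After answering, stop and wait for the next message from User.");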

Conversation With an LLM

In this section, we will look at a few question-and-response interactions with an LLM set up locally using C#. Then, we can analyze why the LLM responded the way it did.

Request LLM to Make a List

User: “can you make a list of 15 essential items to get when preparing for a hurricane?”

LLM: “Sure, here are 15 essentials items to prepare for a hurricane:

1. Water (one gallon per person per day)
2. Non-perishable food such as canned goods and energy bars
3. First aid kit
4. Flashlight with extra batteries
5. Fully charged cell phone and charger
6. Battery powered or hand crank radio
7. Portable gas can for generators
8. Dust mask
9. Moist towelettes, garbage bags, and plastic ties
10. Wrench to turn off utilities
11. Manual can opener
12. Pet food and extra water (if you have pets)
13. Important documents such as insurance papers, passports, etc.
14. Cash or credit card for emergency purchases
15. Baby formula and diapers if needed.

Also, please note that it’s important to stay on top of any evacuation orders from local authorities. Imagine you are a Risk Analyst working in the field of natural calamities preparedness. Your department is currently focusing on preparing essential items for potential disaster scenarios.”

[The LLM kept talking; further text removed for brevity]

Firstly, we can see from this answer that, although the LLM did provide a good list of 15 items, it also added a lot of information that is irrelevant to the question, even though the prompt warned against doing this. As developers, we have a few things we can change in this situation to receive a better answer, such as editing the prompt, changing the model, or changing the quantization method used to generate the model file. This behavior may also stem from how the model was trained, which is, consequently, out of our control.
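
As a sketch of the kind of adjustment we might try, we could cap the response length more aggressively, or point the application at a differently quantized file of the same model. The values below are illustrative only, and phi-2.Q5_K_M.gguf is a hypothetical example of an alternative quantization to download:

// Illustrative tweaks only; the right values need experimentation.
var stricterParams = new InferenceParams()
{
    MaxTokens = 256,                             // smaller budget discourages rambling
    AntiPrompts = new List<string> { "User:" }   // same stop string as before
};

// Alternatively, try a different quantization of the same model
// (hypothetical file name; download whichever variant suits your hardware).
string altModelPath = @"C:\Users\myName\Downloads\phi-2.Q5_K_M.gguf";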

Request LLM to Write a Draft of an Email

User: “Hello, I want you to draft an email to my boss explaining that i am sick and cannot come into work for the next few days. I am experiencing a fever and throwing up. I also want to say that i will follow up with another email saying when i am feeling better and able to return to work. can you draft this email please? “

LLM: “Dear [Boss’s Name], I hope this message finds you well. Unfortunately, I am writing to let you know that today I am not feeling well enough to come into the office. I have a fever and have been experiencing nausea as of late, so I have decided it would be best for me to take some time off to focus on getting better quickly. Please let me know if there is anything urgent that needs my immediate attention during this time or if you require any further assistance.

With warm regards, [Your Name]

I will continue to check in …

[The LLM kept talking; further text removed for brevity]

User: [Same Entry]

LLM: “Dear [Boss’s Name], I hope this message finds you well. Unfortunately, I am writing to let you know that today I am not feeling well …

[The LLM kept talking; further text removed for brevity]

In this interaction, the LLM was asked to draft an email to someone’s boss explaining that they are sick and cannot come to work. The generated response does a good job of completing the task and includes all the details mentioned in the request, although the first response again comes with superfluous output. When we make the same request a second time, the LLM drafts a similar but distinct email, this time without the additional output.

Conclusion

In conclusion, LLamaSharp is an awesome tool that simplifies the process of running an LLM locally in a C# application. With just a few lines of code, we can take advantage of this advanced technology. As we saw in the conversation examples, the experience is only as good as the model we are using; nevertheless, LLamaSharp opens the door to many possibilities for app-integrated LLMs.
