Using the Unsloth Library on Google Colab to Fine Tune Models for Ollama
This is a book about running local LLMs using Ollama. That said, I use a Mac M2 Pro with 32GB of memory, and while my computer could be used for fine tuning models, I prefer using cloud resources. I frequently use Google’s Colab for running deep learning and other experiments.
We will be using three Colab notebooks in this chapter:
- Colab notebook 1: Colab URI for this chapter is a modified copy of an Unsloth demo notebook. Here we create simple training data to quickly verify the process of fine tuning on Colab using Unsloth and exporting to a local Ollama model on a laptop. We fine tune the 1B model unsloth/Llama-3.2-1B-Instruct.
- Colab notebook 2: Colab URI uses my dataset on fun things to do in Arizona. We fine tune the model unsloth/Llama-3.2-1B-Instruct.
- Colab notebook 3: Colab URI is identical to the example in Colab notebook 2 except that we fine tune the larger 3B model unsloth/Llama-3.2-3B-Instruct.
The Unsloth fine-tuning library is a Python-based toolkit designed to simplify and accelerate the process of fine-tuning large language models (LLMs). It offers a streamlined interface for applying popular techniques like LoRA (Low-Rank Adaptation), prefix-tuning, and full-model fine-tuning, catering to both novice and advanced users. The library integrates seamlessly with Hugging Face Transformers and other prominent model hubs, providing out-of-the-box support for many state-of-the-art pre-trained models. By focusing on ease of use, Unsloth reduces the boilerplate code needed for training workflows, allowing developers to focus on task-specific adaptation rather than low-level implementation details.
One of Unsloth’s standout features is its efficient resource utilization, enabling fine-tuning even on limited hardware such as single-GPU setups. It achieves this through parameter-efficient fine-tuning techniques and gradient checkpointing, which minimize memory overhead. Additionally, the library supports mixed-precision training, significantly reducing computational costs without compromising model performance. With robust logging and built-in tools for hyperparameter optimization, Unsloth empowers developers to achieve high-quality results with minimal experimentation. It is particularly well-suited for applications like text summarization, chatbots, and domain-specific language understanding tasks.
Colab Notebook 1: A Quick Test of Fine Tuning and Deployment to Ollama on a Laptop
We start by installing the Unsloth library and all of its dependencies, then uninstall just the unsloth package and reinstall the latest version from the source code on GitHub:
1 !pip install unsloth
2 !pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
Now create a model and tokenizer:
1 from unsloth import FastLanguageModel
2 import torch
3 max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
4 dtype = None # None for auto detection.
5 load_in_4bit = True # Use 4bit quantization to reduce memory usage.
6
7 # More models at https://huggingface.co/unsloth
8
9 model, tokenizer = FastLanguageModel.from_pretrained(
10 model_name = "unsloth/Llama-3.2-1B-Instruct",
11 max_seq_length = max_seq_length,
12 dtype = dtype,
13 load_in_4bit = load_in_4bit,
14 )
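Colab GPUs have limited VRAM, so it can be useful to check how much memory the 4-bit quantized model occupies before adding adapters and training. A quick check using standard torch.cuda calls looks like this:

1 import torch
2
3 # Report the Colab GPU and the memory reserved after loading the
4 # 4-bit quantized model.
5 gpu_stats = torch.cuda.get_device_properties(0)
6 total_gb = round(gpu_stats.total_memory / 1024**3, 3)
7 reserved_gb = round(torch.cuda.max_memory_reserved() / 1024**3, 3)
8 print(f"GPU = {gpu_stats.name}, total memory = {total_gb} GB")
9 print(f"{reserved_gb} GB of memory reserved.")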
Now add LoRA adapters:
1 model = FastLanguageModel.get_peft_model(
2 model,
3 r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
4 target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
5 "gate_proj", "up_proj", "down_proj",],
6 lora_alpha = 16,
7 lora_dropout = 0, # Supports any, but = 0 is optimized
8 bias = "none", # Supports any, but = "none" is optimized
9 # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
10 use_gradient_checkpointing = "unsloth", # "unsloth" for very long context
11 random_state = 3407,
12 use_rslora = False, # We support rank stabilized LoRA
13 loftq_config = None, # And LoftQ
14 )
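LoRA adapters train only a small fraction of the model’s weights. The object returned by get_peft_model is a PEFT-wrapped model, so you should be able to confirm this with PEFT’s print_trainable_parameters method (a quick sanity check, assuming your Unsloth version exposes the underlying PEFT model this way):

1 # Sanity check: print how many parameters LoRA will actually train.
2 # Assumes the PEFT wrapper's print_trainable_parameters() method is available
3 # on the model returned by FastLanguageModel.get_peft_model.
4 model.print_trainable_parameters()
5 # Expect only a small fraction of all parameters to be trainable.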
The original Unsloth example notebook used Maxime Labonne’s FineTome-100k dataset as fine tuning data. Since I wanted to fine tune with my own test data, I printed out some of Maxime Labonne’s data after it was loaded into a Dataset object. Here are a few snippets to show you, dear reader, the format of the data that I will reproduce:
1 {'conversations': [{'content': 'Give three tips for staying healthy.', 'role': 'user'}, {'content': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats...', 'role': 'assistant'},
2 ...
3 ]}
4 {'conversations': [{ ... etc.
5 {'conversations': [{ ... etc.
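For reference, here is roughly how that data can be loaded and inspected in a notebook. This is a sketch that assumes the dataset is published on Hugging Face as mlabonne/FineTome-100k and that the unsloth.chat_templates helper standardize_sharegpt is available to convert its ShareGPT-style records into the role/content layout shown above:

1 from datasets import load_dataset
2 from unsloth.chat_templates import standardize_sharegpt
3
4 # Load Maxime Labonne's FineTome-100k dataset and convert its ShareGPT-style
5 # records into the role/content conversation format shown above.
6 dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
7 dataset = standardize_sharegpt(dataset)
8 print(dataset[0])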
I used a small Python script on my laptop to get the format correct for my test data:
1 from datasets import Dataset
2
3 json_data = [
4 {"conversations": [
5 {"content": "What are the two partitioned colors?", "role": "user"},
6 {"content": "The two partitioned colors are brown, and grey.",
7 "role": "assistant"},
8 {"content": "What are the two partitioned colors?", "role": "user"},
9 {"content": "The two partitioned colors are brown, and grey.",
10 "role": "assistant"}
11 ]},
12 {"conversations": [
13 {"content": "What is the capital of Underworld?", "role": "user"},
14 {"content": "The capital of Underworld is Sharkville.",
15 "role": "assistant"}
16 ]},
17 {"conversations": [
18 {"content": "Who said that the science of economics is bullshit?",
19 "role": "user"},
20 {"content": "Malcom Peters said that the science of economics is bullshit.",
21 "role": "assistant"}
22 ]}
23 ]
24
25 # Convert JSON data to Dataset
26 dataset = Dataset.from_list(json_data)
27
28 # Display the Dataset
29 print(dataset)
30 print(dataset[0])
31 print(dataset[1])
32 print(dataset[2])
Output is:
1 Dataset({
2 features: ['conversations'],
3 num_rows: 3
4 })
5 {'conversations': [{'content': 'What are the two partitioned colors?', 'role': 'user'}, {'content': 'The two partitioned colors are brown, and grey.', 'role': 'assistant'}, {'content': 'What are the two partitioned colors?', 'role': 'user'}, {'content': 'The two partitioned colors are brown, and grey.', 'role': 'assistant'}]}
6 {'conversations': [{'content': 'What is the capital of Underworld?', 'role': 'user'}, {'content': 'The capital of Underworld is Sharkville.', 'role': 'assistant'}]}
7 {'conversations': [{'content': 'Who said that the science of economics is bullshit?', 'role': 'user'}, {'content': 'Malcom Peters said that the science of economics is bullshit.', 'role': 'assistant'}]}
If you look at the notebook for this chapter on Colab you will see that I copied the last Python script as-is into the notebook, replacing the dataset creation code in the original Unsloth demo notebook.
The following code (copied from the Unsloth demo notebook) slightly reformats the prompts and then trains using the modified dataset:
1 chat_template = """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.
2
3 ### Instruction:
4 {INPUT}
5
6 ### Response:
7 {OUTPUT}"""
8
9 from unsloth import apply_chat_template
10 dataset = apply_chat_template(
11 dataset,
12 tokenizer = tokenizer,
13 chat_template = chat_template,
14 )
15
16 from trl import SFTTrainer
17 from transformers import TrainingArguments
18 from unsloth import is_bfloat16_supported
19 trainer = SFTTrainer(
20 model = model,
21 tokenizer = tokenizer,
22 train_dataset = dataset,
23 dataset_text_field = "text",
24 max_seq_length = max_seq_length,
25 dataset_num_proc = 2,
26 packing = False, # for short segments
27 args = TrainingArguments(
28 per_device_train_batch_size = 2,
29 gradient_accumulation_steps = 4,
30 warmup_steps = 5,
31 max_steps = 60,
32 # num_train_epochs = 1, # For longer training runs!
33 learning_rate = 2e-4,
34 fp16 = not is_bfloat16_supported(),
35 bf16 = is_bfloat16_supported(),
36 logging_steps = 1,
37 optim = "adamw_8bit",
38 weight_decay = 0.01,
39 lr_scheduler_type = "linear",
40 seed = 3407,
41 output_dir = "outputs",
42 report_to = "none", # Use this for WandB etc
43 ),
44 )
45 trainer_stats = trainer.train()  # run the fine tuning
46 # Now run the trained model on Google Colab with question
47 # from fine tuning data:
48
49 FastLanguageModel.for_inference(model)
50 messages = [
51 {"role": "user",
52 "content": "What are the two partitioned colors?"},
53 ]
54 input_ids = tokenizer.apply_chat_template(
55 messages,
56 add_generation_prompt = True,
57 return_tensors = "pt",
58 ).to("cuda")
59
60 from transformers import TextStreamer
61 text_streamer = TextStreamer(tokenizer, skip_prompt = True)
62 _ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
The output is (edited for brevity and to remove a token warning):
1 The two partitioned colors are brown, and grey.
The notebook has a few more tests:
1 messages = [ # Change below!
2 {"role": "user", "content": "What is the capital of Underworld?"},
3 ]
4 input_ids = tokenizer.apply_chat_template(
5 messages,
6 add_generation_prompt = True,
7 return_tensors = "pt",
8 ).to("cuda")
9
10 from transformers import TextStreamer
11 text_streamer = TextStreamer(tokenizer, skip_prompt = True)
12 _ = model.generate(input_ids,
13 streamer = text_streamer,
14 max_new_tokens = 128,
15 pad_token_id = tokenizer.eos_token_id)
The output is:
1 The capital of Underworld is Sharkville.<|eot_id|>
Warning on Limitations of this Example
We used very little training data, and in the call to SFTTrainer we ran only 60 optimization steps rather than training for a set number of epochs:
1 max_steps = 60, # a very short training run for this demo
2 # num_train_epochs = 1, # For longer training runs!
This allows us to fine tune a previously trained model very quickly for this short demo.
We will use much more training data later in this chapter to fine tune a model to be an expert on recreational locations in the state of Arizona.
Saving the Trained Model and Tokenizer to a GGUF File on the Colab Notebook’s File System
To experiment in the Colab notebook’s Linux environment we can save the LoRA model and tokenizer locally:
1 model.save_pretrained("lora_model") # Local saving
2 tokenizer.save_pretrained("lora_model")
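If you later restart the Colab runtime, the saved adapters can be reloaded from the lora_model directory with the same FastLanguageModel API; a short sketch:

1 from unsloth import FastLanguageModel
2
3 # Reload the LoRA adapters saved above from the local "lora_model" directory
4 # and switch the model into inference mode.
5 model, tokenizer = FastLanguageModel.from_pretrained(
6     model_name = "lora_model",
7     max_seq_length = 2048,
8     dtype = None,
9     load_in_4bit = True,
10 )
11 FastLanguageModel.for_inference(model)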
To run this fine tuned model on our laptop, we export a local GGUF file that can then be downloaded:
1 model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
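The q4_k_m setting is a good balance of file size and quality for laptop use. If you have more disk space and want higher fidelity, save_pretrained_gguf accepts other quantization methods as well; for example (method names assumed to be supported by your Unsloth version):

1 # Optional alternatives (assumed supported quantization_method values):
2 # model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")
3 # model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")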
In the demo notebook, you can see where the GGUF file was written:
1 !ls -lh /content/model/unsloth.Q4_K_M.gguf
2 771M 771M Dec 5 15:51 /content/model/unsloth.Q4_K_M.gguf
Copying the GGUF File to Your Laptop and Creating an Ollama Modelfile
Depending on your Internet speed, it might take five or ten minutes to download the GGUF file since it is nearly 1GB in size:
1 from google.colab import files
2 files.download("/content/model/unsloth.Q4_K_M.gguf")
We also need to copy the generated Ollama Modelfile (which the Unsloth library created for us):
1 !cat model/Modelfile
The contents of the file are:
1 FROM /content/model/unsloth.F16.gguf
2
3 TEMPLATE """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.{{ if .Prompt }}
4
5 ### Instruction:
6 {{ .Prompt }}{{ end }}
7
8 ### Response:
9 {{ .Response }}<|eot_id|>"""
10
11 PARAMETER stop "<|end_of_text|>"
12 PARAMETER stop "<|start_header_id|>"
13 PARAMETER stop "<|end_header_id|>"
14 PARAMETER stop "<|finetune_right_pad_id|>"
15 PARAMETER stop "<|python_tag|>"
16 PARAMETER stop "<|eom_id|>"
17 PARAMETER stop "<|eot_id|>"
18 PARAMETER stop "<|reserved_special_token_"
19 PARAMETER temperature 1.5
20 PARAMETER min_p 0.1
After downloading the GGUF file to my laptop I made a slight edit to the generated Modelfile, changing the path to the GGUF file on line 1 so it points at the quantized file I downloaded:
1 FROM ./unsloth.Q4_K_M.gguf
2
3 TEMPLATE """Below are some instructions that describe some tasks. Write responses that appropriately complete each request.{{ if .Prompt }}
4
5 ### Instruction:
6 {{ .Prompt }}{{ end }}
7
8 ### Response:
9 {{ .Response }}<|eot_id|>"""
10
11 PARAMETER stop "<|end_of_text|>"
12 PARAMETER stop "<|start_header_id|>"
13 PARAMETER stop "<|end_header_id|>"
14 PARAMETER stop "<|finetune_right_pad_id|>"
15 PARAMETER stop "<|python_tag|>"
16 PARAMETER stop "<|eom_id|>"
17 PARAMETER stop "<|eot_id|>"
18 PARAMETER stop "<|reserved_special_token_"
19 PARAMETER temperature 1.5
20 PARAMETER min_p 0.1
Once the model is downloaded to your laptop, create a local Ollama model to use:
1 $ ls -lh
2 -rw-r--r-- 1 markw staff 580B Dec 5 09:26 Modelfile
3 -rw-r--r--@ 1 markw staff 770M Dec 5 09:19 unsloth.Q4_K_M.gguf
4 $ ollama create unsloth -f Modelfile
I can now use the model unsloth that was just created on my laptop:
1 $ ollama run unsloth
2 >>> what is 2 + 5?
3 two plus five equals eight.
4
5 >>> What are the two partitioned colors?
6 The two partitioned colors are brown, and grey.
7
8 >>> Who said that the science of economics is bullshit?
9 Malcom Peters said that the science of economics is bullshit.
10
11 >>> write a Python program to sum and print a list of numbers
12 ```python
13 # list of numbers
14 numbers = [1, 2, 3, 4, 5]
15
16 # use the built-in function sum()
17 sum_of_numbers = sum(numbers)
18
19 # print the sum
20 print(sum_of_numbers)
21 >>> /bye
Notice that the fine tuned model has learned the new data while retaining the general functionality of the original model.
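Beyond the interactive ollama run session, you can also call the new unsloth model from Python code on your laptop. Here is a short sketch using the ollama Python package (pip install ollama); the model name unsloth matches the ollama create command above, and depending on the package version the reply is returned as a dict or a typed response object:

1 import ollama  # pip install ollama
2
3 # Query the locally created "unsloth" model through the Ollama Python client.
4 response = ollama.chat(
5     model = "unsloth",
6     messages = [{"role": "user",
7                  "content": "What are the two partitioned colors?"}],
8 )
9 # Newer versions of the package also support response.message.content
10 print(response["message"]["content"])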
Fine Tuning Test Wrap Up
This was a short example that can be run on a free Google Colab notebook. Next we will use a larger fine tuning data set.
Fine Tuning Using a Fun Things To Do in Arizona Data Set
Here we use the slightly modified Colab notebook 2.
I created a GitHub repository for the Arizona fine tuning data set that contains small individual JSON files and a larger file ArizonaFun.json that is a concatenation of the smaller files. Let’s look at az_flagstaff_parks.json (edited to remove some text for brevity):
1 [
2 {
3 "conversations": [
4 {
5 "content": "Can you list some city parks in Flagstaff, Arizona?",
6 "role": "user"
7 },
8 {
9 "content": "Here are some city parks in Flagstaff, Arizona:\n\n- **Bushmaster Park**: Located at 3150 N Alta Vista Dr, this 20-acre park features playgrounds, lighted tennis and basketball courts, a skate park, roller skating arena, dog park, picnic ramadas, and permanent restrooms.\n\n- **Thorpe Park**: Situated at 191 N Thorpe Rd, Thorpe Park offers playgrounds, lighted tennis courts, sand volleyball court, basketball court, soccer fields, youth baseball fields, adult softball fields, a disc golf course, dog park, picnic ramada, and both portable and permanent restrooms.\n\n- **Foxglenn Park**: Found at 4200 E Butler Ave, this park includes a playground, basketball courts, soccer fields, a youth softball field, skate track, picnic ramadas, and permanent restrooms.\n\n- **Buffalo Park**: Located at 2400 N Gemini Rd, Buffalo Park is a 215-acre open space with a 2-mile loop trail, ...",
10 "role": "assistant"
11 }
12 ]
13 }, ...
There are a total of 40 fine tuning examples in the file ArizonaFun.json. You can see that in the second and third Colab notebooks for this chapter I simply pasted the JSON data from the file ArizonaFun.json into a cell (an alternative that loads the file directly is shown after this listing):
1 from datasets import Dataset
2
3 json_data = [
4 {
5 "conversations": [
6 {
7 "content": "Can you list some city parks in Flagstaff, Arizona?",
8 "role": "user"
9 },
10 ... } ]
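Rather than pasting the JSON into a cell, the file can also be loaded directly. This sketch assumes ArizonaFun.json has been uploaded to the Colab runtime (for example with google.colab’s files.upload() or by cloning the GitHub repository):

1 import json
2 from datasets import Dataset
3
4 # Load the Arizona fine tuning examples from a local copy of ArizonaFun.json
5 # instead of pasting the JSON into a notebook cell.
6 with open("ArizonaFun.json") as f:
7     json_data = json.load(f)
8
9 dataset = Dataset.from_list(json_data)
10 print(dataset)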
Unfortunately, while the fine tuned model often performs well, it also sometimes hallucinates. Here is an example of using the fine tuned model in the Colab notebook:
1 messages = [ # Change below!
2 {"role": "user", "content": "Where is Petrified Forest National Park located?"},
3 ]
4 input_ids = tokenizer.apply_chat_template(
5 messages,
6 add_generation_prompt = True,
7 return_tensors = "pt",
8 ).to("cuda")
9
10 from transformers import TextStreamer
11 text_streamer = TextStreamer(tokenizer, skip_prompt = True)
12 _ = model.generate(input_ids, streamer = text_streamer, max_new_tokens = 128, pad_token_id = tokenizer.eos_token_id)
The output is:
1 Petrified Forest National Park is situated in northeastern Arizona, near the town of Holbrook. [oai_citation_attribution:2‡National Park Service](https://www.nps.gov/state/az/index.htm)<|eot_id|>
This answer is correct.
The second Colab notebook also contains code cells for downloading the fine tuned model, and the directions we saw earlier for importing the model into Ollama apply here as well.
Third Colab Notebook That Fine Tunes a Larger Model
There are only two changes made to the second notebook:
- We now fine tune a 3B model unsloth/Llama-3.2-3B-Instruct.
- Because the fine tuned model is large, I added code to store the model in Google Drive:
1 from google.colab import drive
2 drive.mount('/content/drive')
3
4 import shutil
5 shutil.move("/content/model/unsloth.Q4_K_M.gguf", '/content/drive/My Drive/LLM/')
I created an empty folder LLM on my Google Drive before running this code.
Fine Tuning Wrap Up
I don’t usually fine tune models; I usually use larger prompt contexts and include one-shot or two-shot examples. That said, there are good use cases for fine tuning small models with your own data, and I hope the simple examples in this chapter will save you time if you have an application requiring fine tuning.