An LLM fine-tuning course turned online conference for everything LLMs.
Build skills to be effective with LLMs
Course website: https://maven.com/parlance-labs/fine-tuning
<<< Fine-Tuning Workshop 1 | Conference: From prompt to model (with Kyle Corbitt) >>>
Guest speakers: Wing Lian, Zach Mueller
Syllabus:
We will show you how to fine-tune with Axolotl with guest speakers Wing Lian, creator of Axolotl, and Zach Mueller, lead maintainer of Hugging Face Accelerate.
Today is going to be a little more technical: once you've decided you're going to fine-tune, how do you actually do that?
Slide 2
Backstory
This is a list of guest speakers, and we have added so many of them. For some, their area of expertise is narrowly on fine-tuning; for many, it is all the other pieces. We saw from the questions in the previous session that you have questions about so many things, so we've brought in instructors across many areas. As a result, we changed the name to focus on all of LLMs instead of just narrowly on fine-tuning.
Many of you who follow us on Twitter have probably also seen that the amount of energy and excitement this has brought is far greater than any of us imagined possible.
Jonathan Larkin shared this tweet, and we've seen many tweets like it, about this being the Woodstock for AI, or "join the revolution". This has really grown from a course that was going to be 4 lessons on very specific topics into, at the very least, a conference.
Maybe it's a little too much to call it a movement, but it has really grown into something much more. The way we structure everything going forward and the breadth of talks you'll see all reflect that.
Hamel: We have 3 kinds of sessions:
- Fine tuning workshops (this is the first kind)
- Conference talks
- Office hours
The first kind, the fine-tuning workshops, are the original course. The conference talks and office hours are everything else we've added around it.
(watch the video at 00:04:50)
Slide 3:
- Tinker with the tools we show you -> Office Hours
- The importance of blogging. Tips.
- How to share your work (blogs, projects, etc)
- Axolotl - @winglian, @axolotl_ai, @hamelhusain
- Deepspeed/FSDP/Accelerate - @TheZachMueller
- Modal - @charles_irl
How to take this event and use it as a turning point, career-wise.
Hamel: So what I suggest is, make sure you take some time and tinker with the tools that we're showing you. We deliberately don't go too in-depth, tutorial-style, on the tools themselves. We will show you how to use them, point you to the documentation, and give you a guideline of the workflow and things like that. But most people don't learn just by watching; you have to look at it and try it yourself.
Hamel: Then if you have questions, we've reformatted the course so that we have office hours. For example, today we're going to show you Axolotl, and we're using Accelerate with Axolotl. Zach and Wing will have office hours about that, and we'll be in the office hours too. The idea is that everyone will come to those office hours with a lot more context after having used the tools. That's really important: try the tools before the office hours, because the office hours are an opportunity you don't want to miss.
(watch the video at 00:06:24)
Hamel: Next, what I recommend in what I've experienced in other courses like fast.ai, which I think a lot of people have taken here is, the importance of blogging. What I suggest is, if you learn something here today or anytime, it doesn't matter what it is, Axolotl, Accelerate, any kind of tools, anything you learn, it's actually really helpful if you write it down, blog about it, and share it with other people, especially the people that run these projects. For example, if you're going to do something really fun with Axolotl, you should share it with Wing.
It's not really about us. We want to make you successful. If I see somebody out there writing a really good blog, I'll definitely share it as well. This changed my career. I started out like this.
How do you get started with blogging? If you click on that "Tips" link, there's a blog post by Rachel Thomas (fast.ai co-instructor at the time), "Why you (yes, you) should blog". It's a really helpful guide for people who are starting out with blogging or are a bit nervous about it.
(watch the video at 00:08:15)
Slide 4:
- What is Axolotl and how to use it to fine-tune model
- Honeycomb example
- Convo with Wing Lian
- Parallelism & HF Accelerate w/ Zach Mueller
- Fine-tuning on Modal
- Q&A
We're going to switch gear and talk about fine-tuning.
Plan for today: we're going to talk about Axolotl and how to use it broadly, and then go into the Honeycomb example that we introduced last time, with a quick catch-up there. For those of you who didn't see the Honeycomb example last time, Hamel will walk through it.
We will have some time for a conversation with Wing, covering both our questions and yours, and then Zach will share about parallelism and Hugging Face Accelerate.
Then there will be a very quick run-through of fine-tuning on Modal, and we'll have a little bit of time at the end for Q&A.
(watch the video at 00:09:06)
Slide 5:
- Base model
- LoRA vs Full Fine Tune
The most frequent questions I get from people when they first start to fine-tune are really about model capacity: how much are we going to be able to learn?
The two parts of that are: what model should I fine-tune off of? And then the question that is simultaneously more technical and has an easier answer: should I use LoRA, or should I do a full fine-tune?
It's useful to understand LoRA because you're going to use it a lot. You should almost always in my opinion be using LoRA rather than full fine tune.
(watch the video at 00:09:54)
Slide 6:
What base model do you use?
There are 2 dimensions to this.
One is: what model size do I use? 7 billion, 13 billion, 70 billion, or some other parameter count.
The second is: what model family do I use? Do I use Llama 2, Llama 3, Mistral, Zephyr, Gemma, or whatever else.
Model size
On model size, different people will have different experiences. I've never fine-tuned a 70 billion parameter model. It's not that we can't; thanks to Axolotl and Accelerate it's not so difficult. But I have fine-tuned 7 billion and 13 billion parameter models. For most of my use cases, the breadth of what we're asking the model to do is not that wide, and the output quality of the 7 billion parameter models has been close enough to the 13 billion ones that I never felt the need to deal with the parallelism required for much larger models. So I typically ended up using 7 billion parameter models. They're a little faster, and it's a little easier to get a GPU they run on.
If you look at the download counts, that's not a perfect proxy for what others are doing, but it is some proxy, and you see that 7 billion parameter models are the most popular. (These are not instruction-tuned models; these are the models people typically fine-tune off of.)
Model family
This is one where, again, thanks to the way Axolotl abstracts things, it is extremely easy to try different models, especially if they all fit on the same GPU; even if you have to boot up a new instance, that's not so hard either. So it's extremely easy to try different models and just do a vibes check.
I tend to just use whatever is fashionable, meaning the most recently released model. Recently that's Llama 3. If I were starting something today, I would just use Llama 3, not because I've thought about it in incredible depth, but because it's the newly released model that's widely known to be reasonably good. If you want to find out what's fashionable, there are many places to look. You could go to Hugging Face and, under models, sort by hotness to see what's hot.
The "/r/LocalLLaMA" subreddit is a community of people who think about these things a lot and that's a good place to look at for running models, though it has local in the name. They spend a lot of time just thinking about different models and how they behave differently. So "/r/LocalLLaMA" is another community to look up if you want to choose a model.
I think people over-index on this. If you run a couple of the most popular models at the time, that should be good enough; you probably won't improve on that immensely by trying many more models. I'll talk in a couple of slides about why that is.
(watch the video at 00:14:00)
Slide 8:
The second problem, LoRA vs full fine tune, let me start with an image.
Imagine that we've got one layer, it goes from an input to the output.
(I'm going to simplify the transformer architecture for a second, so we don't think about the query, key, and value matrices. For the moment, imagine this is almost just a feed-forward network.)
So you've got one layer that we're going to look at and it's taking in input that is an embedding of the meaning of the text up to that point in the string. It's going to output another vector representation.
Slide 9:
In most of these models, the inputs and outputs are somewhere on the order of 4,000 dimensions. Just for that one layer, you'd have a 4,000-dimensional input and a 4,000-dimensional output, so that matrix would be 4,000 by 4,000. That would be 16 million weights.
The idea behind LoRA is that we learn something that you can add to that original matrix that is much lower dimensional and we'll still change the behavior in a similar way but have many fewer weights. As a result it can be fine-tuned on less GPU with less RAM.
I think it's safe to say that the vast majority of fine-tuning that happens is LoRA (or QLoRA, which I'll talk about; it works in a functionally similar way).
I think everyone in this course should use LoRA for a while; maybe someday you'll do a full fine-tune. As a practitioner, you may never need a full fine-tune. There are theoretical reasons that a full fine-tune, if you have a lot of data, could perform better, and Zach, Wing, or Hamel can contradict me here, but I think for most people LoRA is all you need.
Slide 10:
How LoRA works
(watch the video at 00:16:26)
We want to make some changes to a 4,000 by 4,000 matrix, which is the original weights. We do that by having 2 matrices that we're going to multiply together. Those of you who remember your linear algebra will know that if you have a 4,000 by 16 matrix times a 16 by 4,000 matrix, that is 4,000 by 4,000. If we multiply these 2 pieces together, that is going to create a new matrix that we can add to the original weights. It can change the original weights quite a bit.
Slide 11:
How many parameters are required here? One matrix is 4,000 by 16 and the other is 16 by 4,000.
Each of those two matrices has 64,000 parameters, and you have two of them. So now we have 128,000 weights that we need to fit when we're doing fine-tuning.
That's a lot less than 16 million, and as a result it requires a lot less RAM. GPU VRAM is frequently the binding constraint as we train our models, so it's nice to be able to reduce RAM usage with LoRA. You'll see there's just a configuration flag; it's quite easy to do this in Axolotl.
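To make that arithmetic concrete, here is a quick back-of-the-envelope sketch; the 4,000-dimensional layer and rank of 16 are just the example numbers from the slides.

d = 4000  # input/output dimension of the layer (example number from the slide)
r = 16    # LoRA rank (example number from the slide)

full_weights = d * d           # 16,000,000 weights in the original matrix
lora_weights = d * r + r * d   # 64,000 + 64,000 = 128,000 trainable weights

print(f"full: {full_weights:,}  lora: {lora_weights:,}  "
      f"ratio: {lora_weights / full_weights:.2%}")   # ratio: 0.80%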
(watch the video at 00:18:12)
Slide 12:
- LoRA at lower precision
- Memory savings with possible loss in quality
The other piece I think conceptually also potentially somewhat complex to understand well, but extremely easy to use is, going from LoRA to QLoRA.
So here we had each of these matrices, and those are just numbers. Each of those numbers is stored in the computer with some number of bits. If you store it with many, many bits, you get very fine variations of what that number can be: you can go from 2 to 2.001, 2.002, and so on. We tend to think of those as being almost continuous.
QLoRA divides the possible values for those numbers into a smaller set. For instance, if you start with something stored in 16 bits, you can think of it as almost continuous. Say the lowest value you want to store is -2 and the highest is (just to pick a number) 2.4; you've got lots of numbers in between. QLoRA will divide that space so it can be stored in 4 bits. The number of possible values is then 2 to the 4, so 16 values.
The exact way that we choose the 16 values is a technical topic that I think isn't worth going into in this moment. There's some details about how you do back propagation there that we don't really need to know in practice.
By storing every number in 4 bits, you cut down on the memory usage by quite a bit. A lot of people do this. You'll see again that this is not so complex to do.
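As a toy illustration of the idea (this is a uniform grid over the example range above; real QLoRA uses a non-uniform 4-bit scheme, NF4, plus other details like double quantization, so treat this purely as a sketch):

import numpy as np

# Toy uniform 4-bit quantization over the example range [-2, 2.4].
lo, hi = -2.0, 2.4
levels = np.linspace(lo, hi, 16)   # 2**4 = 16 representable values

def quantize(x):
    """Snap x to the nearest of the 16 representable values."""
    return float(levels[np.abs(levels - x).argmin()])

for x in (0.1234, 1.9876, -1.5):
    q = quantize(x)
    print(f"{x:+.4f} -> {q:+.4f}  (error {abs(x - q):.4f})")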
In practice, it saves some RAM and has some small impact on results. My intuition would have been that it has a bigger impact on results than I've actually observed, and I think most people would agree with that. So a lot of people train with QLoRA either as their default first step or at least frequently. We'll show you how to do that, and it's shockingly easy.
(watch the video at 00:20:42)
Hamel: Maybe it's a good time to just pause for a second. Wing, do you have any opinions on QLoRA, LoRA? When do you use them? Any observations, feelings? Do you agree?
Wing: I know that sometimes people see a difference between the actual losses or the evaluations that you get during fine-tuning with QLoRA. What's happening is you've quantized the weights and then you're training on those, but when you merge the LoRA back into the original model, because of quantization errors, you're not getting exactly the same model that you trained. There has been some debate over that. I don't feel it's a huge issue, otherwise people would not be using it anymore. That's really the only thing I have about that.
Wing: I think there was also something that I personally didn't understand with QLoRA with the quantization. I think there were like double quantization and there's some like nuances like that as well when you're quantizing the weights. Maybe Dan understands that better than me.
Dan: I think I don't. At Workshop 4 we're going to have Travis Addair, who is the CTO of Predibase. [Twitter] He built LoRAX (LoRA Exchange), which is a serving framework, and he has talked about some of the quantization errors as you merge the weights back. He has thought about this way more deeply than I have; I don't know much more about it than that.
(watch the video at 00:22:51)
Slide 13:
Meme:
Fiddle with hyperparameters
vs.
Improve your data
There are so many places in AI, and in ML before that, where it's tempting to get really detailed about things that seem very mathematical. Fiddling with hyperparameters sounds cool (many of us were good at math from an early age and were told so; I used to do a lot of math), but it has a much lower payoff than spending that time looking at your data.
Improving your data: you might think, "my data is what it is, how can I improve it?" From what Hamel shows about his work with Honeycomb, you'll see you actually can improve your data, and the payoff to doing so is very large. I think Hamel made a comment about this (many of you might know who Teknium (Twitter) is), but I don't know if Hamel wants to jump in here. Anyway, the payoffs to improving your data are massive. You should do more of that.
(watch the video at 00:24:03)
Slide 14:
- Wrapper for Hugging Face tools
- Easy to use. So you can focus on your data
- Best practices built-in
We're going to switch from the abstract like, "hey, here's some ideas" to how do we implement this.
Axolotl is a wrapper for lower-level Hugging Face libraries.
One of the things I most loved about switching from the lower-level Hugging Face libraries, which give you a lot of granular control, to Axolotl is that Axolotl is so easy to use that I never had to think, "oh, what's the error in my code?". I spent less time looking at code and more time looking at my data. The ease of changing things around and rerunning freed up mental space for me to focus on my data, which, as we said, is a great thing to do.
If you just use the examples and I'll show you some of the examples. There are a lot of best practices and default values that are built-in. It does a lot of smart things as defaults.
There are a couple of things that I quite like what it does that we don't have time to cover is, I'm going to make a couple of videos and then just post them either in the Discord or on the Maven platform, or both showing things like sample packing, which is a quite clever thing that it does that speeds up your training process. It has a lot of things that you could spend a lot of time figuring out for yourself or you could just use some of these examples in Axolotl and change relatively few things. A lot of best practices built-in by default.
(watch the video at 00:26:05)
Hamel: One thing that's maybe worth lingering on for a second; I'll let Wing tell the story. Have you been surprised by what kind of people are able to fine-tune really competitive models without knowing any deep mathematics or things like that?
Wing: If you think about the most popular models, Teknium's Hermes models and those sorts are generally very popular. If you actually talk to Ryan, he's very much like me: he doesn't go deep into Transformers and the math and all that, he just wants to train models and focus on good data. And all of his models are really good.
Wing: There are people like, I think, Miguel Sarah; I forget which models he releases. His background is more in deep learning, but he also uses Axolotl. They don't really need to go deep into Transformers. As Dan was saying, they're just able to spend more time focusing on procuring good data and doing data synthesis rather than thinking about everything else that goes on underneath the hood.
(watch the video at 00:27:56)
Let's get one level more practical, or concrete, about using Axolotl. Some people here have used it a bunch, but we're going to assume that most of you have either used it very little or, based on a survey we did of some students, not at all.
So this is really about how you actually get started. I think you'll be surprised that it is not so difficult to run your first job, and I highly recommend doing that. You'll feel different about yourself as someone in this space once you've run a couple of jobs; you'll feel like a practitioner. Highly recommend trying it.
Slide 16:
The way to get started is, if you go to the Axolotl GitHub repo, there is a separate documentation page, but just the README is fantastic and has most of what you'll need.
Slide 17:
I'm going to point out a couple of things that you should look for while you are in that README.
The very first is the examples. I mentioned earlier that there are a lot of examples. Axolotl takes YAML config files, and the config files are reasonably long. Maybe Wing could do it, but I don't think anyone else could open an empty file with a blinking cursor and just type one out beginning to end and get it right.
You, and almost everyone else, will go to one of these examples and copy it. (The first time, you should just run it as-is, and I'll show you how to do that.) After that, you'll change one or two parameters at a time and rerun; you might change the dataset that you use, for instance. It will always be an experience of taking something that works and changing it around a little, rather than starting from scratch. So you're going to use these examples.
Slide 18:
To show you one of them, here's an example config that fine-tunes a Mistral-7B model with QLoRA. The very top shows which model I'm fine-tuning off of. This is QLoRA, so here we are loading in 4-bit. We have a dataset (I'll show you that dataset in a moment), we're going to store the dataset after the prep phase in some location, and we're going to have some validation data.
Most of these you won't change that frequently.
- sample_packing - I'll make a separate video about this.
- lora_r - This is related to the size of those LoRA matrices (that's the matrix I was showing earlier).
- lora_alpha - a scaling parameter.
I wouldn't worry about some of the ones at the bottom. The thing you probably want to focus on up front isn't actually the easiest one to change, so at first you could change something else just to get the experience of changing a parameter.
When you really start working on your own use cases, the first thing you'll change is the dataset.
Slide 19:
The format of the dataset: there are a lot of different formats, and one of the nice things about Axolotl is that data out in the wild is stored in a variety of formats, and if you tell Axolotl which format yours is stored in, it can use most, if not all, of the common ones.
This is a format called Alpaca. Each row, each sample, has an instruction to the model and, optionally, some input (in this dataset most of those are empty). It has the output, which is what we want the model to learn to reproduce. Then it has some text which will go above these, such as "Below is an instruction that describes a task." Then you'll have a question like "who is the world's most famous painter?", and then the training output, which is what we train on so the model learns to replicate that behavior.
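For concreteness, a single row in this format might look something like the following; the content here is invented for illustration, only the field names match the format described above.

# A hypothetical Alpaca-format example (field names per the format; content made up).
example = {
    "instruction": "Who is the world's most famous painter?",
    "input": "",  # optional; often empty in this dataset
    "output": "Many people would say Leonardo da Vinci, best known for the Mona Lisa.",
    # the preamble text that goes above the instruction, as described above
    "text": "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.",
}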
(watch the video at 00:32:16)
Hamel: To talk about the config files: when I start a project, I look at the examples too. I sometimes message Wing (please don't message Wing) with questions like that. There is an Axolotl Discord; that's a good place to trade configs. Starting with a known-good config is a good idea, like "hey, I'm training this model that just came out, does anyone have a config?". Usually, by searching that Discord, looking at the examples, or elsewhere, you can find one. Nowadays you can often find Axolotl configs in Hugging Face repos as well.
Hamel: Wing do you have any other tips on where to find configs or how people should go about it?
Wing: For models from creators I know personally, and for my own releases, I try to include the model configs somewhere in the repo or in the README. Axolotl by default also stores the Axolotl config in your README. If you go through Hugging Face, there is a link to find models tagged as trained with Axolotl, and depending on whether the author modified their README, you can get configs from there as well. Other than that, a lot of the time you'll see examples people have shared in the Discord. I'm happy to help with various things too. It's generally pretty self-explanatory most of the time, I think.
Usually you're taking little bits from one config and combining them with another piece, whether it's FSDP or DeepSpeed, or LoRA versus QLoRA. Most of the various configurations are pretty composable with each other, and if they're not, I believe we do enough validation that it will tell you they're not composable.
(watch the video at 00:34:48)
Slide 20:
Dan: Then there are a lot of other parameters. I won't go through most of these, and most of them you won't change, but I will say a couple of things. One is, many of us like using wandb (Weights & Biases), and there's a very nice wandb integration in Axolotl. micro_batch_size is basically the batch size per GPU. I highly recommend starting with any of the example configs and then changing small pieces; don't get overwhelmed by all the things you aren't changing.
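One relationship worth keeping in mind when you change these (this is the usual convention for these trainer settings, stated here as an assumption rather than a definitive spec):

# Effective batch size = per-GPU batch size x gradient accumulation steps x number of GPUs.
micro_batch_size = 2              # example value: batch size per GPU
gradient_accumulation_steps = 4   # example value
num_gpus = 1                      # example value

effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 8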
(watch the video at 00:35:40)
Slide 21:
Then once you have your config, the next step is to run it. This Github README is so useful. So after you've got your example, click on the "Quickstart" section.
Slide 22:
That will bring you to a set of, depending on how you count, either 3 or 4 commands. There are 3 steps:
- One is preprocessing your data
- The second is this training step
- After that, you're going to want to test out the model that you've trained. (There is a CLI tool to do that. That's this third step. Hamel will actually show another way to do this.)
- The thing that I like to do: if you run the bottom version instead of the third command, it launches a very lightweight Gradio app, so you can type something into a form in the browser, it gets sent to the model, inference happens, and the output is shown. I quite like using this bottom step.
Hamel: I think it's worth mentioning you only want to do this to spot-check your model. This is not for production; no one is doing inference in production with this.
Dan: Yes, we'll cover inference in production in the deployment workshop.
You will not remember these commands. The thing that I hope you remember is that everything you want is in the GitHub repo.
(watch the video at 00:37:31)
Slide 23 (a video demo showing linux command-line connected remotely to Runpod)
So what does it look like if you run that?
This is a very quick view of what happens when you train the model. Here I am typing out that first preprocess command with the debug flag. When you do that, there's some output that I'll go into in more depth in a moment. After that, I run the next command shown on the last screen, which kicks off training. Depending on the amount of data you have, training can take minutes or hours, and I suppose sometimes days (I do have one project where it can take days), but for my projects it's typically an hour or so, and sometimes much less.
Slide 24:
In there, there was a section printed out by the preprocessing step with the debug flag that would be easy to overlook, but I think it is really critical for your understanding of what is happening.
Though we started with data that has multiple fields, your model is going to train on a string (actually a string and one other piece, which I'll show in a moment). So this is showing you the template, what that string looks like, which we create in the preprocessing step and later use for modeling.
(watch the video at 00:37:31)
Slide 25 (a screenshot of HF alpaca_2k_test dataset):
There's an instruction and input and output. Actually, those are for each sample. Here's the instruction. Here's the output. Here's the text.
When you use this for inference, you're going to provide everything up through this response marker, but not the output; you wouldn't know the output at inference time. This template shows you what the string looks like, and then we use that autocomplete-type logic: we provide everything before the output, and our model provides the output. At this point it looks like it's just a string.
(watch the video at 00:40:19)
Slide 27:
There is one other piece that I think is important for your understanding of fine-tuning that is shown here. It's actually a string and a mask.
When you calculate your loss function (for those of you familiar with deep learning, that's part of figuring out how to change the parameters to change the model's behavior), we don't want to train the model to write the words "Below is an instruction that describes a task." The input here is a proxy for what your app's users will type. We don't want to train the model to be the user; we want it to be good at responding to user inputs. So these pieces up front are not going to inform the loss. When we look at the output, we can look at it on a token-by-token basis.
So somewhere in there was the input; there were the words "appropriately completes the request" followed by a period. Each of these is a token. For each one we have a pair: one element is the token id, for example 2899, but because we don't want it to feed into the loss, the first piece of the tuple is -100, which is just a way of preventing it from influencing the loss and thus the behavior of our model.
If you look at the output (that's in green here), for those tokens we have the token id, and the value used for calculating the loss is the same token id. There is a flag, which I think is called train_on_inputs, that lets you change this behavior. But broadly speaking, this shows that there's a way to see very clearly which tokens are only inputs to the model and which tokens influence the loss, i.e. the ones we're training the model to output.
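A minimal sketch of that masking idea (schematic only, not Axolotl's actual implementation; the token ids are made up): prompt tokens get a label of -100 so they're ignored by the loss, while response tokens keep their real ids.

# Schematic loss masking: labels of -100 are ignored by the cross-entropy loss.
prompt_ids = [733, 16289, 28793, 2899]   # token ids for the instruction/input (made up)
response_ids = [28705, 13, 6837, 2]      # token ids for the desired output (made up)

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids  # only the response informs the loss

for tok, lab in zip(input_ids, labels):
    status = "ignored" if lab == -100 else "trained on"
    print(f"token {tok:>6} -> label {lab:>6}  ({status})")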
(watch the video at 00:42:31)
Hamel: Wing do you use that debug thing?
Wing: Yeah, all the time, mostly because I want to be sure that the tokenization is correct. Because a lot of times I'm using ChatML and because it's not a default token, I just want to make sure I didn't mess anything up and sort of setting those special tokens for ChatML and just to double check that the outputs look right.
Hamel: Just so people know, ChatML is a specific type of prompt template. If you go back to the previous slide that Dan had, that one, I believe, is an Alpaca template. So that's one specific type of template, and ChatML is a different one.
Dan: In general, chat templates tend to have a bit more complexity or nuance to them; instruction-tuning templates are arguably a little simpler.
Wing: Then there's checking the end tokens, making sure the stop tokens are in there correctly, because if they're not, you can get a model that just rambles on and on and never stops. So it's a good spot check for myself, especially for multi-turn conversations, to make sure it's masking the turns correctly. You can see that because it goes red, green, red, green, red, green. It's an easy spot check, and having the color makes it easy to glance at, because the raw output is actually really hard on the eyes to try to debug.
(watch the video at 00:44:20)
Slide 28:
Let me show this last step. We've done training. There was one more command. I'm going to show the Gradio version of it.
The last step was to kick off the app: an accelerate launch wrapped around the inference command, passing in the yaml file, the directory with the LoRA weights, and the gradio flag. This kicks off an app; you can click the link it prints, open it in the browser, and type things in to test the model.
Before other things get on your to do list, run through this so that you have hands-on experience using Axolotl.
(watch the video at 00:45:30)
Let me hand it off to Hamel to go through a case study. It's the Honeycomb case study.
(watch the video at 00:46:03)
Slide 30 (one of the same slide from workshop 1)
There's a running example through the fine-tuning workshops: the Honeycomb use case. We discussed it in the first workshop. Because we have so many new students, I'm going to go over it really quickly again.
(I'm not going to note down anything for these parts. Watch the video or refer to workshop 1 notes)
(watch the video at 00:47:34)
Slide 31:
Let's jump right into the case study. For the case study, I'm just going to walk through some slides.
Let me open this GitHub repo: https://github.com/parlance-labs/ftcourse
You don't have to open it right now. Actually just follow along with what I'm doing. It's a repo that looks like this. I'm going to go through the notebooks. They are numbered 1 through 8.
I'm going to go through some steps. These steps are not necessarily linear, but it'll give you a good idea. I'm going to be focusing a lot on what we did with Honeycomb to fine tune a model.
A lot of the steps are going to be around dataset curation, data filtering, debugging, and evaluation. As Dan mentioned, we're not really focused on the model so much.
Basically, I just want to go through the prompt, real quick. So this is the Honeycomb prompt.
Go through Jupyter notebooks in ftcourse repo in the following order:
- The Prompt
- Prompt structure:
- The original Honeycomb prompt is a system prompt (not a user prompt)
- COLUMNS - The schema.
- QUERY SPEC - The query specification, which is like a very terse programming guide to the Honeycomb query language.
- TIPS - Additional instructions.
- NLQ: ... - NLQ: Error count, NLQ: Slow requests, etc. These are few-shot examples of a user query followed by the corresponding Honeycomb query.
- NLQ: {{question}} \n EXISTING QUERY - This is a completion-style prompt: when Honeycomb launched, they used a completion API, just completing this template with the user's question filled in.
- All this stuff is fixed except for the columns and the question. That's a lot of boilerplate to be sending to the LLM.
- It's hard to specify everything you want in this prompt; no matter how hard you try, you hit a wall. That's where fine-tuning can move the needle.
- Honeycomb launched this product with just the prompt.
There's a link to the blog post.
- We built some minimal evals first. (One of the things you should think about is writing evals.)
- Write Minimal Evals
- What are evals? I have a blog post about evals; I won't go through it in too much detail.
- Level 1: Assertions - these are my unit tests.
- Assertions are not just for tests - you can use these assertions in different places. You don't only want to use them as tests; you also want to use them to filter out bad data when fine-tuning and curating data, and at inference time to automatically "heal" data (you can do self-"healing"). (A minimal sketch of an assertion follows after this list.)
- My testing logic
- I had to iterate on this for a while until I caught all edge cases. But this was probably the most critical and impactful work!
- Many people skip this. You won't and you'll have a massive advantage.
- YOUR JOB IS TO CLEAN AND LOOK AT DATA
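As promised above, here is a hypothetical flavor of what a Level 1 assertion can look like for this use case. The specific checks and the ALLOWED_KEYS set are invented for illustration (based on the keys that appear in the example queries later in these notes), not the actual Honeycomb test suite.

import json

# Hypothetical allowed top-level keys, based on the example queries shown later.
ALLOWED_KEYS = {"breakdowns", "calculations", "filters", "orders", "limit", "time_range"}

def assert_valid_query(raw_output: str, candidate_columns: list) -> bool:
    """Level-1-style assertion: output parses as JSON, uses only known keys,
    and only references columns that were actually provided."""
    try:
        q = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(q, dict) or not set(q).issubset(ALLOWED_KEYS):
        return False
    return all(b in candidate_columns for b in q.get("breakdowns", []))

# Example with a made-up model output:
print(assert_valid_query(
    '{"breakdowns": ["http.method"], "calculations": [{"op": "COUNT"}], "time_range": 7200}',
    ["http.method", "duration_ms"],
))  # True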
- Generating Synthetic Data
-
One thing that you will often have to do when you're fine-tuning is acquire data. A lot of times you don't have the data in an applied use case. Honeycomb launched this to production, but not only did they not have lots of data, a lot of that data was private and I couldn't see it. Honeycomb gave me a thousand examples, and I wanted to set aside a fair amount of those in the eval set so I could test the model, so I wasn't left with much. What do I do from here?
-
It's good to know how to generate synthetic data. Again, there's no hard and fast rule about how many examples you need. I just generate as many examples as I feasibly can based on intuition, how much it costs, and how much time it takes. I ended up generating 30,000 examples synthetically, but I kind of went overboard; you don't have to do that. Just use your intuition based on your budget and what you have.
-
You can do this with prompting. Let me give you a concrete example because if I just say, "Hey, you can use the LLM synthetically generate data", you're like "How?".
-
My prompt - Let me show you what we did for Honeycomb. The prompt is basically the same exact prompt that you've seen before except there's a second part that says:
"Your goal is to generate correct variations of the combination of NLQ, candidate columns and query to build syntetic dataset that is a valid representation of the Honeycomb Query Language. You can build synthetic data by re-wording the query and/or substituting a column name in both the query and candidate column lists. Your response should be in json with the following three keys: “nlq”, “cols”, and “query”."
... so on and so forth.
NLQ: ... \n COLUMNS: ... \n QUERY: - Giving it the inputs now, and then saying: basically, perform data augmentation. So rewrite the natural language query, substitute the columns, and substitute the query.
-
You might be wondering is that good data? Is it duplicated? Yes, you have to clean it up.
-
For example, you want to use those level 1 assertions; they are your first line of defense. A lot of the stuff that comes out of this is going to be junk, or at least some of it, and you want to get rid of it. So the level 1 assertions are already going to help you here, and they'll keep helping you throughout this whole process.
-
So you have a way of getting lots of data. I'm not going to show you the code of doing that. It's fairly straightforward. Use your favorite large model to do this. Use the most powerful model you feel comfortable with to help you generate the synthetic data.
-
(watch the video at 00:58:00)
- The Prepared data
-
The next step in this is preparing the data for Axolotl.
-
Usually what I do is run all the way through, see what's going wrong, and then come back and improve it. You don't want to try to make your data perfect the first time. You want to go all the way through, see some predictions, make sure the plumbing works, etc. Then you can come back and curate and filter the data. That's what I recommend, because otherwise you can get stuck. It's good to know where the problems are and have an idea first.
-
You want to prepare your data to look like this.
{ "conversations": [ { "from": "system", "value": "Honeycomb is an observability platform that allows you to write queries to inspect trace data. You are an assistant that takes a natural language query (NLQ) and a list of valid columns and produce a Honeycomb query." }, { "from": "human", "value": "\n\nNLQ: \"group by HTTP method\"\n\nColumns: ['query_string_num_tokens', 'query_string_length', 'data_queries', 'http.target', 'task.id', 'trace_root.http.target', 'topic', 'http.host', 'total_hits', 'db.user', 'domain_types', 'db.name', 'graphql.document', 'history', 'http.scheme', 'http.method', 'frontend.version', 'disposition_for_dBVVysC8x4Ymwg9rtjMckgw9', 'db.system', 'event_name', 'organization', 'auth.logout', 'organizations', 'name', 'net.transport', 'db.operation', 'disposition_for_UvsPPBVUn9FDuzDjsjYCqopq', 'disposition_for_1RUGSd7GdnP5tuKdgqBRZUm2', 'process.pid', 'disposition_for_6uyAoBc3PuvEcTTPFgPM3Rtk', 'exception.stacktrace', 'data_ingestion_individuals_count', 'disposition_for_qrnUBUz8YBfNX7Liekq6nKi3', 'task_type.type', 'disposition_for_JQDNbuUdaQcEbEwQNxUbV5EF', 'disposition_for_rAcWoXfbHw4eWoJFH4ZcY8ue', 'disposition_for_eShqQoC9jUi9VQBidpp2oXHP', 'parent_name', 'template', 'graphql.operation.name', 'span.num_links', 'disposition_for_kNSPtvsCWkDoEyFP2QE6VPmQ', 'disposition_for_UUqf9L1qkFxDNEvcgsVMA2yy', 'disposition_for_vwbbN76HZ7uitLubvkUjPFQE', 'disposition_for_aAto1pGrdF5RunpSX8sY5hvn', 'disposition_for_UbKCMdnkPQ6TuHrfdBo5juZu', 'disposition_for_QfrvmoHxSgLPJXPKZCrZfGo8', 'disposition_for_NoKSSruBRCX6UG28PzmkybUd', 'disposition_for_UZAqvZ5XVBZjKKWuMeRkRayS', 'organization_token', 'duration_ms', 'trace.parent_id', 'db.statement', 'exception.message', 'error', 'service.name', 'http.status_code', 'http.route']" }, { "from": "gpt", "value": "\n{\"breakdowns\": [\"http.method\"], \"calculations\": [{\"op\": \"COUNT\"}], \"time_range\": 7200}" } ] }In this case, because I'm using the Alpaca-ShareGPT format.
Basically in Axolotl, there's this config,
hc.yml

# [ ... truncated ...]
datasets:
  - path: sample_data/alpaca_synth_queries.jsonl
    type: sharegpt
    conversation: alpaca
# [ ... truncated ...]
Let me open the Axolotl docs so you can see that.
Axolotl Docs: Dataset Formats
I'm using a conversation format.
There is sharegpt. You can see you have to structure your data like this:

{"conversations": [{"from": "...", "value": "..."}]}

Axolotl expects your data in this format. It's also important because, if you remember Dan talking about training on inputs versus not training on inputs, this is considered an input.
The system role and the human question are considered inputs. The output is the "from": "gpt" message (it is the Honeycomb query). What we're doing is forcing the model to learn to produce the right query, not trying to have it predict what the question is.
-
The config
- The thing you want to pay attention to here (Dan already went over the config). In this case:
- Change the dataset
- Change train_on_inputs
- HF & WandB
- You need to change the following things in your config (because you won't be able to access my Weights & Biases account and Hugging Face account):
- wandb_project
- wandb_entity
- hub_model_id
-
Now what do you do? I don't ever jump straight into training, because I'm dumb and I make a lot of mistakes in dataset preparation; I always do something wrong, and honestly I think a lot of people do something wrong here.
-
I like to look at the data. I like to double check how Axolotl is preparing the data.
-
The way I do that is I do this Axolotl preprocess command:
python -m axolotl.cli.preprocess hc.yml

That will basically flatten the data and assemble it in the right format.
-
I like to look at the data manually so I can kind of play with it a bit more, manipulate it, inspect things.
-
Basically what happens is, when you preprocess the data, Axolotl dumps it by default into this last_run_prepared directory. That is in the Hugging Face datasets format, so you can load it and inspect it. That's what I'm doing here with this code:

import json, yaml
from transformers import AutoTokenizer
from datasets import load_from_disk

with open('hc.yml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk('last_run_prepared/22cf9f5f00f9d3b9504fbaf9b68a2f75/')
-
print(tok.decode(ds['input_ids'][0]))- You can see it has sort of flattened than JSONL into a format that looks like this.<s> Honeycomb is an observability platform that allows you to write queries to inspect trace data. You are an assistant that takes a natural language query (NLQ) and a list of valid columns and produce a Honeycomb query. ### Instruction: NLQ: "group by HTTP method" Columns: ['query_string_num_tokens', 'query_string_length', 'data_queries', 'http.target', 'task.id', 'trace_root.http.target', 'topic', 'http.host', 'total_hits', 'db.user', 'domain_types', 'db.name', 'graphql.document', 'history', 'http.scheme', 'http.method', 'frontend.version', 'disposition_for_dBVVysC8x4Ymwg9rtjMckgw9', 'db.system', 'event_name', 'organization', 'auth.logout', 'organizations', 'name', 'net.transport', 'db.operation', 'disposition_for_UvsPPBVUn9FDuzDjsjYCqopq', 'disposition_for_1RUGSd7GdnP5tuKdgqBRZUm2', 'process.pid', 'disposition_for_6uyAoBc3PuvEcTTPFgPM3Rtk', 'exception.stacktrace', 'data_ingestion_individuals_count', 'disposition_for_qrnUBUz8YBfNX7Liekq6nKi3', 'task_type.type', 'disposition_for_JQDNbuUdaQcEbEwQNxUbV5EF', 'disposition_for_rAcWoXfbHw4eWoJFH4ZcY8ue', 'disposition_for_eShqQoC9jUi9VQBidpp2oXHP', 'parent_name', 'template', 'graphql.operation.name', 'span.num_links', 'disposition_for_kNSPtvsCWkDoEyFP2QE6VPmQ', 'disposition_for_UUqf9L1qkFxDNEvcgsVMA2yy', 'disposition_for_vwbbN76HZ7uitLubvkUjPFQE', 'disposition_for_aAto1pGrdF5RunpSX8sY5hvn', 'disposition_for_UbKCMdnkPQ6TuHrfdBo5juZu', 'disposition_for_QfrvmoHxSgLPJXPKZCrZfGo8', 'disposition_for_NoKSSruBRCX6UG28PzmkybUd', 'disposition_for_UZAqvZ5XVBZjKKWuMeRkRayS', 'organization_token', 'duration_ms', 'trace.parent_id', 'db.statement', 'exception.message', 'error', 'service.name', 'http.status_code', 'http.route'] ### Response: {"breakdowns": ["http.method"], "calculations": [{"op": "COUNT"}], "time_range": 7200}</s>That is the Alpaca format.
-
What I recommend is: check multiple examples. Make sure it looks right. Make sure you didn't put the wrong thing in the wrong place or have things in your data you didn't intend; it happens all the time.
-
One thing that I'll mention is that there are these spaces right before ### Instruction:. You might be wondering, what the hell is that? It's a bit of a tricky issue, an artifact of the way Axolotl assembles tokens. I don't know if Wing wants to say something about this yet, but I've found it not to be an issue as long as you're consistent at inference time. I'll talk more about that, and I have a blog post about it as well.
-
-
Verbose debugging - Dan already covered this.
- Command: python -m axolotl.cli.preprocess hc.yml --debug
- This helps you check things like:
- ignoring inputs (train_on_inputs: False) - notice the red color, which indicates tokens that are ignored.
- token ids (example: what are those spaces right before ###?)
- The logs tell you what the special tokens are.
-
Look at special tokens
- Example: What is <0x0A>? - '\n'
- But where is the space coming from before '###'? It's pretty confusing! See this blog post
-
(watch the video at 01:05:52)
- Training
- The config file
- It's also located here (huggingface)
- Train Command:
accelerate launch -m axolotl.cli.train simple.yml

Zach is going to be talking about Accelerate; I don't want to go into that deep rabbit hole right now.
- I used Weights & Biases to track my various fine tuning experiments. You can view them here. You can log your runs and the results, look at training loss curves.
- Approach
- Basically with training I tried different parameters:
- First of all, this is Mistral-7B. I went to the examples and asked in the Discord what the best config for Mistral is, and I started with that.
- I varied the learning rate and tried different learning rate schedulers. I also tried different distributed schemes, like different DeepSpeed ZeRO stages, just to test things; not that it mattered much, since this is a small model and it fit on my GPU just fine.
- There's also sample packing, which you might want to try to save GPU memory and increase throughput. (I will upload a video about that, or talk about it in more detail later on.)
- Basically with training I tried different parameters:
- Model Artifacts
- When the training is done, it's uploaded into Hugging Face, which is here (parlance-labs/hc-mistral-alpaca).
- The config file
(watch the video at 01:08:40)
-
- There are a lot of different ways you can sanity check your model.
- I like to actually use code like Hugging Face Transformers to make this work.
model_id = 'parlance-labs/hc-mistral-alpaca'  # this will be different for you, based upon hub_model_id

- This is your model.
- Next, we have to construct a prompt template that is as close as possible to the prompt template we saw earlier.
- Another reason to sanity check things this way is that I want to make sure I understand the template and that it works.
- The prompt takes 2 inputs (nlq, cols): a natural language query and the columns.
- This is code to run the template (this assumes prompt, tokenizer, and model were defined earlier in the notebook):

def prompt_tok(nlq, cols, return_ids=False):
    _p = prompt(nlq, cols)
    input_ids = tokenizer(_p, return_tensors="pt", truncation=True).input_ids.cuda()
    out_ids = model.generate(input_ids=input_ids, max_new_tokens=5000,
                             do_sample=False)
    ids = out_ids.detach().cpu().numpy()
    if return_ids:
        return out_ids
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0][len(_p):]
- We sanity check that at least the plumbing works and some results look plausible.
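A hypothetical spot-check call, just to show the shape of the inputs (the NLQ and the column list here are made up):

# Hypothetical spot check; the nlq and cols are invented for illustration.
nlq = "count of errors grouped by service"
cols = ["error", "service.name", "duration_ms", "http.status_code"]
out = prompt_tok(nlq, cols)
print(out)  # should look like a Honeycomb query, e.g. {"breakdowns": ["service.name"], ...}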
- So the question is, is this any good? Yes, it passes these level 1 evals. You can track the different metrics of the level 1 evals, know which assertions are failing, and see what kinds of errors you're getting the most; that's all good. Then, beyond the level 1 assertions, after you conquer those: are these queries actually good or bad?
(watch the video at 01:11:06)
- Optimize Model
-
I launched this model onto Replicate for inference (we'll go through inference later) and I did more sanity checking.
-
Basically, Phillip at Honeycomb did some sanity checking (a vibe check) and had feedback. (See the notebook's "Initial Feedback" section.)
-
Analysis
- Basically, Phillip said this model is OK but it's not great; it's still making mistakes in some places. It turns out that the training data Phillip gave me wasn't great either. This will happen all the time.
- You have to do some error analysis and figure out, if a result isn't great, why that is.
- Look at the data. Look at the training data. Try to debug it. In this case, I looked at similar queries in the training data and tried to see what was happening. We found that the training data could be better: things were passing the level 1 tests just fine, and they were syntactically correct, but they weren't the greatest queries.
-
What do we do now?
- You want to try to encode Phillip's knowledge and opinions into a model; can you have Phillip as an AI in this situation?
- I started building LLM as a "judge".
-
Example Critiques
-
Basically it's the same exact original prompt but with an instruction that you are going to be a query validator. ( "You are an EXPERT query evaluator that has advanced capabilities to judge ...")
Then there's a bunch of few shot examples here.
CRITIC_PROMPT="""## Background Honeycomb is an observability platform that allows you to write queries to inspect trace data. The specification of the Honeycomb query language is as follows: QUERY SPEC: [... truncated ...] QUERY SPEC TIPS: [... truncated ...] --- ## Instructions You are an EXPERT query evaluator that has advanced capabilities to judge if a query good or not. You understand the nuances of the Honeycomb query language, including what is likely to be most useful from an analytics perspective. You are given the following three inputs: (1) NLQ, (2) A list of candidate columns (COLUMNS) that are allowed to be in the query, and (3) The query (QUERY). Your job is to evaluate and critique the QUERY relative to the provided NLQ and COLUMNS. The critiques must be provided in the same json format as provided in the examples below: --- NLQ: show me slowest trace COLUMNS: ['trace.trace_id', 'trace.span_id', 'trace.parent_id', 'duration_ms', 'name', 'faas.instance', 'faas.id', 'filter', 'telemetry.instrumentation_library', 'library.name', 'faas.name', 'span.kind', 'type', 'http.wrote_bytes', 'http.url', 'service.name', 'http.flavor', 'span.num_links', 'span.num_events', 'net.host.name', 'library.version', 'http.scheme', 'net.peer.name', 'http.method', 'meta.signal_type', 'cloud.region', 'cloud.provider', 'faas.version', 'http.read_bytes', 'http.user_agent', 'cloud.account.id', 'organization_id', 'cloud.platform', 'net.sock.peer.addr', 'page_size', 'net.sock.peer.port', 'page_token', 'status_code', 'http.client_ip', 'http.status_code', 'http.route'] QUERY: {"calculations":[{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"limit":1,"time_range":7200} {"critique": "The response is nearly correct, as it is looking for the slowest trace by using MAX(duration_ms) and ordering by duration_ms in descending order, which is appropriate for finding the 'slowest' trace. Additionally, filtering with trace.parent_id does-not-exist correctly identifies root spans. However, the query should be grouping by trace.trace_id to actually show the slowest trace. Without that grouping, the query only shows the MAX(duration_ms) measurement over time, irrespective of which trace is responsible for that measurement.", "outcome": "bad"} --- [... truncated ...] """ -
-
Human Labeling
-
How did I get this? In this case I used a very uncool low technology technique. Phillip and I did a bunch of labeling and analysis in this (private) spreadsheet.
I sent Phillip a spreadsheet every day for a few weeks and had him write critiques. Over time I aligned the model as much as possible with Phillip, so that it was agreeing with him in the critiques it was writing. I kept tweaking the few-shot examples and the instructions until we were both satisfied that this LLM-as-a-judge was doing a good job.
-
I talk about this in a little more detail in the blog post under "Level 2: Human & Model Eval". I just want to give you an idea of the general process and let you know this is a tool in your toolbox.
-
-
Critic Prompt
def critic_prompt(nlq, cols, query):
    "Construct a critic prompt."
    return CRITIC_PROMPT + f"""
For the below NLQ, QUERY and COLUMNS provide a critique as JSON in the format {{"critique": "...", "outcome": "good"|"bad"}} as shown above.

NLQ: {nlq}

COLUMNS: {cols}

QUERY: {query}
"""
When you have the result of this, you get a bunch of critiques.
You can use those critiques to actually make the data better, and you can use the same LLM-as-a-judge to filter and curate the data, i.e. filter out bad queries. For example: "Given a critique, can you make the query better?" If it still can't make the query better, then you filter it out.
-
(watch the video at 01:15:33)
- Curate Data - Make the data really good.
- Fix the bad data
- Again, using a large language model: you give the model the inputs and a critique, and it outputs the improved query. ("Output the improved query and nothing else in a json format adhereing to the QUERY SPEC.")
- Filtering data
-
There's many different ways to filter the data. When we talk about dataset curation, there's a lot of things that you can do.
- Use level one eval logic to filter invalid queries (those assertions and tests)
- Use level two eval logic to filter queries (you can also try to heal)
- Apply other kinds of filters - in this case filtered out queries that were (1) too simple or too complex (2) near duplicates.
You'll see different things in the dataset, like this part of the dataset is garbage, or the model is making a certain kind of mistake.
Then you have to decide whether or not you have to go acquire data for that mistake.
One example: I noticed there were a lot of either very low complexity queries (super simple ones) or really high complexity queries with lots of operations and lots of filters that didn't make any sense. So I had some code that filtered those out.
def complexity(q):
    "Calculate complexity score for query."
    l1_keys = len(q)
    l2_keys = 0
    l2_vals = 0
    for k in q:
        val = q[k]
        if isinstance(val, dict):
            l2_keys += len(val)
        elif isinstance(val, list):
            cnt = sum([len(l) if isinstance(l, dict) else 1 for l in val])
            if cnt == 0:
                return 0  # so we can filter out queries with empty values
            else:
                l2_vals += cnt
    return l1_keys + l2_keys + l2_vals
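For example, you could then keep only queries in a middle band of complexity; the thresholds below are arbitrary, just to show the idea.

import json

# Hypothetical filter using complexity(); thresholds 3 and 12 are arbitrary examples.
raw_queries = [
    '{"calculations": [{"op": "COUNT"}]}',  # score 2: too simple, gets dropped
    '{"breakdowns": ["http.method"], "calculations": [{"op": "COUNT"}], "time_range": 7200}',  # score 5: kept
]
kept = [q for q in map(json.loads, raw_queries) if 3 <= complexity(q) <= 12]
print(len(kept))  # 1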
-
Lilac
- A popular tool for searching, filtering, and finding duplicates is Lilac.
- Another part of data curation is to get rid of duplicates.
- If you did a lot of data augmentation and things like that, you might have lots of data that looks very similar, or too similar. That's not going to be good: what ends up happening is you overweight those examples.
-
A naive way to filter near duplicates
-
There are less sophisticated things you can do, and you should start with the dumb things if you can.
-
We want to filter duplicates where the (nlq, cols), (nlq, query), or (cols, query) pair is the same.

valid_synth_df = (valid_synth_df
                  .drop_duplicates(subset=['nlq', 'col_set'])
                  .drop_duplicates(subset=['col_set', 'str_query'])
                  .drop_duplicates(subset=['nlq', 'str_query'])
                  )

In this case, you drop any row where one of those three pairs of fields is duplicated.
-
Another thing you can do is semantic deduplication.
In Lilac, for example, you have fuzzy concept search, so you can look at the data, try to maximize diversity, and clean out things that are too duplicative.
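If you want something in between exact-match dedup and a full tool like Lilac, here is a minimal embedding-based sketch. It uses sentence-transformers and an arbitrary similarity threshold; both are assumptions for illustration, not what was used for Honeycomb.

import numpy as np
from sentence_transformers import SentenceTransformer

# Minimal near-duplicate filtering via cosine similarity of sentence embeddings.
nlqs = [
    "count of errors grouped by service",
    "error count grouped by service",          # near-duplicate of the first
    "p95 latency for the checkout endpoint",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(nlqs, normalize_embeddings=True)
sims = emb @ emb.T                              # cosine similarity (embeddings are normalized)

threshold = 0.9                                 # arbitrary example threshold
keep = []
for i in range(len(nlqs)):
    if all(sims[i, j] < threshold for j in keep):
        keep.append(i)
print([nlqs[i] for i in keep])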
-
That's an end-to-end overview of the idea. This is not a linear process.
Again this is a very simple example to give you a concrete use case, to give you the idea of the workflow.
(watch the video at 01:19:25)
Slide 32 (screenshot of Axolotl Docs, "How-to Guides - Debugging")
It's really important, if you use some software, that you know how to debug it. I just want to call your attention to these docs.
Slide 33 (screenshot of Axolotl Docs, "How-to Guides - Debugging > General Tips")
General Tips
There's these guidelines here that I think are really important.
(watch the video at 01:21:09)
Let me ask Wing, is there anything else on your mind in terms of tips you might have for people using Axolotl that you'd like to highlight?
Wing: I don't have any off the top of my head. It usually comes with people asking questions.
Dan: How do you predict how long a fine-tuning job will take before you start it. Do you have any recommendations?
Wing: That one is relatively hard to answer. It depends on the model size, LoRA versus full fine-tune, the GPUs, the number of GPUs, whether you're using DeepSpeed ZeRO-2 or ZeRO-3, and whether you're doing offload. There are so many factors that can affect the amount of time it takes to fine-tune a model. I think once you have a gauge on a specific dataset and the hyperparameters that you're going to use for a specific set of experiments, you can usually get a good estimate from that. But I don't have a formula that works for everybody.
(watch the video at 01:23:02)
Wing: Someone had asked about doing a fine-tune and then, as Hamel was saying, improving the data: should you start from scratch again, or fine-tune on top of that fine-tuned model? One thing to think about is, if your model is already getting pretty close to being overfit, fine-tuning it again for multiple more epochs is definitely going to overfit at that point. You should really consider cleaning up the original data, adding in the new improved data, and then starting from scratch again on the base model.
Hamel: Yes, I always start again from scratch when I improve my data.
Presenter: Zach Mueller, Technical Lead for Hugging Face's Accelerate project / PyTorch FSDP (Fully Sharded Data Parallel)
Presentation (slide deck): https://huggingface.co/spaces/muellerzr/llm-conf
(watch the video at 01:24:04)
Slide:
(These are small notes instead of detailed notes.)
I handle a lot of the internals when it comes to Hugging Face's Transformers trainer. I'm also a humongous API design geek!
Before we start talking about how we go about doing what we call distributed training, let's get a general understanding of model GPU usage.
Slide:
- We can somewhat estimate the memory usage in vanilla full-fine-tuning of models.
- Requires certain assumptions (that I'll be covering):
- Adam optimizer
- Batch size of 1
Slides:
General estimate (bert-base-cased, 108M params):
- Each parameter is 4 bytes
- Backward ~= 2x the model size
- The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB |
*All estimations were based off the Model Estimator Tool
This works fine for small models; we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).
But what happens as we scale?
Here's llama-3-8B (8.03B parameters)
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---|---|---|---|---|---|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB |
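As a rough sanity check, here is a minimal sketch of where these estimates come from, under the slide's assumptions (Adam, batch size 1, activations ignored); the numbers will not match the estimator tool exactly:

def estimate_fft_memory_gb(n_params, bytes_per_param=4):
    "Rough full-fine-tune memory estimate: model, gradients, backward pass, optimizer step."
    model = n_params * bytes_per_param       # weights
    gradients = model                        # one gradient per parameter
    backward = 2 * model                     # backward pass ~= 2x the model size
    optimizer_step = 4 * model               # 1x model + 1x gradients + 2x Adam states
    gib = 1024 ** 3
    return {k: round(v / gib, 2) for k, v in dict(
        model=model, gradients=gradients,
        backward=backward, optimizer_step=optimizer_step).items()}

print(estimate_fft_memory_gb(108e6))   # bert-base-cased in float32 -> ~0.4 GB model, ~1.6 GB peak
print(estimate_fft_memory_gb(8.03e9))  # llama-3-8B in float32 -> ~30 GB model, ~120 GB peak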
Well, I don't have 56GB of GPU memory in a single card, let alone 112GB.
What can we do?
(watch the video at 01:27:05)
This is where the concept of distributed training comes in or how do we make sure that we can use multiple GPUs to achieve what we want.
Slide:
(watch the video at 01:27:14)
Slide:
- Single GPU:
- No distributed techniques at play
- Distributed Data Parallelism (DDP):
- A full copy of the model exists on each device, but data is chunked between each GPU
- Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS):
- Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs
FSDP
Essentially we split chunks of the model and optimizer states across multiple GPUs. Rather than having the limit of DDP where we're stuck with, say, 2x 4090 GPUs at 24GB each (that's all I can use), in memory it acts as a single 48GB GPU when we think about the total RAM that we can play with to train models. That's the secret to how you can train these larger and larger models.
(watch the video at 01:28:29)
Slide:
The general idea here is, you take your model and we're going to create what are called shards of the model. Say, taking the model, we could imagine a shard being the model split perfectly in half: the first half of the model and the second half of the model.
Occasionally PyTorch needs to know what's happening with that other model chunk, because it's all the same model. We need to get the gradients all aligned. So these calls are called communications. Generally you want less of these because it's essentially time spent on your GPUs just talking to each other and trading information. You're not training anything. You're not processing data. It is quite literally just your 2 GPUs trading notes on how they think the model should be and then correcting themselves.
(watch the video at 01:29:31)
Slide:
- Different parameters can dictate how much memory is needed for total GPU training across multiple GPUs
- These include how model weights are sharded, gradients, and more.
- I'll cover some important ones I needed when doing a Full-Fine-Tune of Llama-3-8B without PEFT on 2x4090's
Now, I'm not going to really go too much in-depth into every single thing FSDP can do.
In my opinion, the most important ones when it comes to training in low-resource areas with FSDP are the ones that dictate how those weights, gradients, and parameters get sharded.
On top of that, I'm going to cover some of the important ones I needed when doing a full-fine-tune of Llama-3 8B without PEFT on 2x4090's. (spoiler alert: it was very slow)
(watch the video at 01:30:07)
Slide:
- Dictates the level of divvying up resources to perform
- FULL_SHARD: Includes optimizer states, gradients, and parameters
- SHARD_GRAD_OP: Includes optimizer states and gradients
- NO_SHARD: Normal DDP
- HYBRID_SHARD: Includes optimizer states, gradients, and parameters, but each node has the full model
The general idea is this is us telling FSDP how we want to split all of these different things that take up VRAM.
FULL_SHARD or SHARD_GRAD_OP - This reduces some of the memory overhead,
because we still need more than the original model, because we're still fitting
the entire model in VRAM. But it reduces that training VRAM a little bit for
us.
HYBRID_SHARD - A newer thing that PyTorch has come out with. It's kind of like FULL_SHARD, where we're fully sharding absolutely everything including the optimizer states, gradients, and parameters. However, if you're training multi-node (multiple computers training a big model at once), it keeps a copy of the entire model on each of those nodes. That's important because, remember how I said communications slow things down a lot: hybrid shard lets us reduce the communications from 3 down to 2, if not 1. Your training speed increases, honestly to some extent exponentially, depending on how long it takes for your computers to talk to each other.
(watch the video at 01:31:52)
Slide:
- How the model should be split
- Can be either TRANSFORMER_BASED_WRAP or SIZE_BASED_WRAP
- TRANSFORMER / fsdp_transformers_layer_cls_to_wrap:
  - Need to declare the layer
  - Generally transformers has good defaults
- SIZE / fsdp_min_num_param:
  - Number of total parameters in a shard
So the next part is, we know how we're going to split the memory. But how do we split the model? We need some way to tell FSDP:
I have this model. How do I want to split it between my GPUs?
With Accelerate, with Axolotl, with Transformers, we use 2 different
nomenclatures: TRANSFORMER_BASED_WRAP or SIZE_BASED_WRAP.
The SIZE_BASED_WRAP version is more manual. Basically you're telling FSDP: after X amount of parameters, go ahead and split the model. That's great because it works out of the box. That's bad because there could be speed increases that you might be missing by having, say, each head of a Mistral model on a separate GPU so that it can handle its own computations much faster than needing to wait to communicate with other GPUs.
(watch the video at 01:33:00)
Slide:
- Offloads the parameters and gradients to the CPU if they can't fit into memory
- Allows you to train much larger models locally, but will be much slower
Case: FFT of Llama-3-8B with fsdp_offload_params on 2x 4090 GPUs was 72hrs, vs
~an hour or two when using 1x H100.
What this says is, "I have 48GB of VRAM right now, if I'm assuming 2x 4090 GPUs. I can't fit that, I can't train on it. Well, I'm going to accept that. I still want to do it. I don't want to go through a Cloud provider."
FSDP will let us offload gradients and model parameters into RAM.
Now, that sounds like it's going to be extremely slow because we're taking things from the GPU to the CPU and shoving them into RAM.
So case in point: when I was doing a full fine-tune of Llama-3 8 billion to match a paper that came out, I wound up needing to use offload parameters because, as we saw earlier, 8 billion requires about 50GB of VRAM or so and I only have 48GB. It was going to take 72 hours to do 4 iterations through my data, versus an hour or 2 on an H100 GPU.
Yes, it's cool that you know how to use these tools and it can help you train things locally. Make sure to double check though:
- What your time constraint is
- What your budget is
because I can run it for free and it can take longer or I can pay $5 and go finish it in an hour. Depending on how much time you have available, each solution has different opportunities.
(watch the video at 01:34:43)
- Uses the idea behind big model inference / the meta device to load in the model to the GPU in a low-RAM scenario
- Rather than needing model_size * n_gpus RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via sync_module_states
Another critical part, in my opinion, when it comes to doing FSDP that Accelerate and Transformers have is this idea of CPU-RAM-efficient loading and also the idea of sync module states.
Basically PyTorch lets us use this thing called device=meta and that
essentially is the skeleton of your model -- the weights aren't loaded, it
can't really do computations too well. [... truncated ...]
This really helps keep your RAM usage low, so you don't suddenly sit there with crashes because "oh no, you ran out of CPU memory". Because, fun fact, you will redline this quite often; I found that at least in this particular scenario.
(watch the video at 01:35:59)
Let's take it back and just focus on Accelerate.
Slide:
- So far we've covered the theory, but how do we put it into practice
- By using a library that's at the heart of the entire open-source ecosystem
- Nearly all of 🤗
- axolotl
- fastai
- FastChat
- lucidrains
- kornia
Are you using it and you don't even know?
Accelerate, which you might not know about, is the foundation of a lot of your favorite libraries.
(watch the video at 01:36:30)
Slide:
The general idea with Accelerate is it's essentially 3 frameworks:
- Command line interface - Hamel and Wing already showed us this whenever they were doing accelerate launch.
- Training library - Under the hood, it's what does all of this distributed training fairly easily.
- Big model inference
(watch the video at 01:37:03)
Slide:
- accelerate config - Configure the environment
- accelerate estimate-memory - How to guess vRAM requirements
- accelerate launch - How to run your script
You need about 3 commands to really get everything going. [... truncated ...]
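For example, a quick sketch of checking the estimates from earlier on the command line (the exact flags can vary between accelerate versions, so check accelerate estimate-memory --help):

# guess VRAM requirements before renting a GPU
accelerate estimate-memory bert-base-cased --library_name transformers --dtypes float32 float16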
(watch the video at 01:37:42)
Slide:
python script.py
torchrun --nnodes=1 --nproc_per_node=2 script.py
deepspeed --num_gpus=2 script.py
How can we make this better?
Launching a distributed training sucks.
There's a lot of different ways you can do it. There's a lot of different commands you can run. Some of it is PyTorch, some of it is DeepSpeed, and all of them have slightly different commands. [... truncated ...]
That's a lot of different commands that you have to know and remember!
(watch the video at 01:38:24)
Slide:
accelerate launch script.py
accelerate launch is here to say, "OK, tell me what you're doing and I'll make sure that we're running it."
(watch the video at 01:38:31)
Slide:
- Rely on config.yaml files
- Choose either to run accelerate config or write your own:
# ddp_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

# fsdp_config.yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

These essentially define how we want certain things to run. [... truncated ...]
That's all you need to do from a launching perspective. If you're using Axolotl or Transformers, this is all you need to do.
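For example, launching with the FSDP config above would look like this (--config_file is the standard flag for pointing accelerate launch at a config):

accelerate launch --config_file fsdp_config.yaml script.py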
The next part I'm going to show is the internals a bit on the low level of how Accelerate works and how you can use Accelerate specifically. But do remember this isn't necessarily needed if you're using things like Axolotl or Transformers.
(watch the video at 01:39:33)
Slide:
# For alignment purposes
for batch in dataloader:
optimizer.zero_grad()
inputs, targets = batch
inputs = inputs.to(device)
targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
loss.backward()
optimizer.step()
scheduler.step()

from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer, scheduler = (
accelerator.prepare(
dataloader, model, optimizer, scheduler
)
)
for batch in dataloader:
optimizer.zero_grad()
inputs, targets = batch
# inputs = inputs.to(device)
# targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
accelerator.backward(loss) # loss.backward()
optimizer.step()
scheduler.step()

The general idea with Accelerate is, we want a low level way to make sure that this can essentially be device agnostic and compute agnostic. [... truncated ...]
(watch the video at 01:40:15)
Slide:
- Accelerate's DataLoaders and schedulers work off of a sharding mindset
- Rather than repeating the same data across n nodes, we instead split it
- Speeds up training linearly
- Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2
- This also means the scheduler will be stepped n GPUs at a time per "global step"
It winds up working similar to FSDP.
Accelerate will do the data sharding for you, taking in your data and splitting it across GPUs. [... truncated ...]
So what winds up happening is this, it lets us successfully scale our training that should have roughly the same results when training on a single GPU versus training on multiple GPUs without needing to worry about:
- do I need to step my scheduler more?
- do I need to adjust my learning rate more?
- do I need to do this? Do I need to do that?
- is the same amount of data being processed at one time?
everything else is done for you.
(watch the video at 01:41:22)
Slide:
- This may be a bit different than your "normal" idea of mixed precision.
- We do not convert the model weights to BF16/FP16
- Instead we wrap the forward pass with autocast to convert the gradients automatically
- This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on.
- If you use .bf16() weights, you are STUCK in bf16 permanently
The next part of this I want to talk about some very specific tweaks that we do to protect you from dumb decisions. The first part is mixed precision. [... truncated ...]
We do not convert the model weight to BF16/FP16. [... truncated ...]
This is very important:
If you use .bf16() weights, you are STUCK in bf16 permanently.
If you go to bf16, you are stuck in bf16. There was a whole issue a few months ago with Transformers where the quality of some fine-tuned models wasn't doing well. This was the cause.
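A minimal sketch of the difference, using a stand-in model (Accelerator(mixed_precision="bf16") is the Accelerate setting shown in the configs above; bf16 support depends on your hardware and library versions):

import torch
from torch import nn
from accelerate import Accelerator

model = nn.Linear(16, 4)  # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# What NOT to do -- permanently casting the weights (you are then stuck in bf16):
# model = model.to(torch.bfloat16)

# Mixed precision instead: the weights stay in fp32 and the forward pass runs under autocast.
accelerator = Accelerator(mixed_precision="bf16")
model, optimizer = accelerator.prepare(model, optimizer)

print(model.weight.dtype)  # torch.float32 -- the master weights were never cast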
(watch the video at 01:42:17)
- Let's tie that back up to the model estimator with neat tools like NVIDIA's
TransformerEngine
| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
|---|---|---|---|---|---|---|
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
Now going a bit beyond that, if you're familiar with or keeping up to date with memory-efficient training, you might have heard of something called NVIDIA's TransformerEngine or MS-AMP.
The idea behind this is, we make use of like 4090s, H100s and do training in 8-bit. Now, this is different than quantization. You are actually training on raw native 8-bit.
A lot of the mistakes I see people make with this, especially with the NVIDIA examples, is they do the prior thing of converting the entire model into BF16 and then train. That leads to huge instabilities during training, and generally people's performance hasn't been the best.
I've also heard rumors though that even this can go bad. So it's always worth playing around with, if you have the ability, FP16 versus non-FP16. That includes the BF16. And testing out sort of what levels can be at 8-bit.
Because with TransformerEngine, it's still using the autocast. So the computations, rather than being done in 16-bit, are done in 8-bit. Then if you're playing around with MS-AMP, that lets you experimentally go even further with this. We can get to a point where, if we do MS-AMP O3, almost everything is in 8-bit: your master weights are in 16-bit and your optimizer states are even in 8-bit. (I'm scared to play around with that. I don't know necessarily how good that is. I need to play around with it. That's sort of what I'm using the Llama-3 training for, to just toy around with these things.)
(watch the video at 01:43:56)
Slide:
- Extremely similar; however, they mostly use different naming conventions for items and have slight tweaks in the implementation
| Framework | Model Loading (torch_dtype) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
|---|---|---|---|---|---|
| FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
| FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
| DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
To learn more, check out the documentation or join my office hours
The last part I'm going to very briefly talk about (we can talk about this more in my office hours) is DeepSpeed by Microsoft and FSDP.
These two are almost exactly the same. DeepSpeed has a few tweaks and calls things a bit differently. If you've done it in FSDP, it can be done in DeepSpeed and vice versa.
A wonderful community member recently posted some documentation that directly maps which parameter in DeepSpeed corresponds to which parameter in FSDP. Generally what I've seen is a mix of whether people prefer DeepSpeed or FSDP. It's usually a matter of: do you want to go with Microsoft and do their thing, or stick with PyTorch and stay native.
(watch the video at 01:44:45)
Slide:
- You can scale out training with accelerate, FSDP, and DeepSpeed across multiple GPUs to train bigger models.
- Techniques like FP8 can help speed up training some and reduce computational overhead.
- Comes at a cost of end-precision and locking model weights for further fine-tunes if not careful.
(watch the video at 01:45:12)
Slide:
- 🤗 Accelerate documentation
- Launching distributed code
- Distributed code and Jupyter Notebooks
- Migrating to 🤗 Accelerate easily
- Big Model Inference tutorial
- DeepSpeed and 🤗 Accelerate
- Fully Sharded Data Parallelism and 🤗 Accelerate
- FSDP vs DeepSpeed In-Depth
I'll post this presentation in the Discord.
There's some handy links there that will help get you started with Accelerate, go through some concept guides to understand some of the internals and really get you going.
(watch the video at 01:45:26)
- I thought that DeepSpeed zero3.json is the same as FSDP, but the other options in DeepSpeed weren't necessarily equivalent?
- Zach: It's gotten to a point where there are some equivalencies now. The chart talks about it. zero3.json is definitely the equivalent of FSDP, but there are some tweaks that you can do because FSDP gives you options to only offload certain things.
- Hamel: I just want to mention something I didn't show you: there are DeepSpeed and FSDP configs. When you want to do multi-GPU training in Axolotl, you have to supply a config file. I'll show you some examples of those.
- Wing: I have some clarifications. One of the things, especially for the FSDP part in the Axolotl configs, is we try to move those FSDP-specific configs into Axolotl, and then it maps them into Accelerate. What we found was that a lot of people were running accelerate config, setting things there, and then going into Axolotl. That would create a mismatch in certain parameters, and it would break in a lot of situations. What we actually recommend people do (we have a warning for this) is just remove the Accelerate config, and then we map all of those configurations that normally get set by Accelerate, I think using environment variables to communicate that under the hood. Anyway, when you use accelerate launch we just mimic a lot of that to avoid the headache of running accelerate config and getting a mismatch later on, which just caused a lot of support issues.
- Zach: Makes perfect sense. That's exactly the solution I recommend. I'm even debating rewriting half of our internals for the FSDP and DeepSpeed plugins, because I don't necessarily want to rely on environment variables, and even setting that up, as I'm sure you've experienced, is problematic at best. Yes, that's a very smart way to go about it, because we've had users report issues where it's "well, it's because you set up your config wrong and you're using something else".
- Hamel: So what you heard from Zach today about Zero stage 1 to 3, BF16. That's all background that you might want to know to demystify a little about what is happening when you supply these configs. What I do honestly is I just use a config again. I just use one of these: https://github.com/OpenAccess-AI-Collective/axolotl/tree/main/deepspeed_configs Use it off the shelf and then maybe consult Zach. Zach has written a lot about this. I actually look at his presentation. I kind of fiddle with it a bit sometimes. Honestly I just use ones that work if I want to parallelize my model. Then I'll pick the right config. You have these configs in the Axolotl repo and then you supply it to the main config. I'll show you an example when we talk about Modal in a second.
- Wing: Can I add a clarification on this one, specifically with zero1.json and zero2.json for DeepSpeed. I think the bf16 and fp16 options can be set to auto, because DeepSpeed doesn't care about it until after the trainer is loaded. But zero3.json specifically (and I see Zach nodding his head) needs to know ahead of time that you're using bf16. You can't set auto in the zero3.json config if you want to use bf16. That's why there's a specific zero3_bf16.json: it needs to know that you want to load in bf16 before the trainer sees it, or something along those lines. Maybe Zach can explain it better than I can.
- Zach: No, that's a pretty good explanation of it. It's something with DeepSpeed: when it comes to setting up the actual call to DeepSpeed and initializing everything, it has to know well beforehand what we're actually doing, which makes it a little annoying whenever we're dealing with configs that way.
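For illustration, a trimmed sketch of what a ZeRO-3 config with bf16 set explicitly (rather than "auto") looks like; this is not the exact zero3_bf16.json from the Axolotl repo, so use the files in its deepspeed_configs directory for real runs:

{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}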
(watch the video at 01:50:22)
There's a lot of different ways you can train models. You can use RunPod, which Dan showed earlier. That recording was done on RunPod.
If you look at the Axolotl docs, it'll tell you a bit about RunPod: https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#cloud-gpu
Also there's a Docker container for Axolotl which is what you want to use most of the time: https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#docker
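For reference, launching that image usually looks something like this (the image tag and flags may have changed, so check the Docker section of the README linked above):

docker run --gpus '"all"' --rm -it winglian/axolotl:main-latest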
Wing do you want to say anything about that? What's your preferred way of running. How do you run it?
Wing: On my local 3090 GPUs, I don't use Docker containers, mostly because it's development and that's not amenable to using Docker containers. But for general debugging of issues that people are seeing, I will generally spin up a Docker container on RunPod and debug the issue there, so the environment doesn't have all of the mess and mismatch of various packages that might not have been updated.
(watch the video at 01:51:50)
What is Modal? Actually, a general rule about this conference: we were pretty selective about the tools that we brought in for people to talk about. I'm only going to talk about tools that I use or that I like. There are hundreds of tools out there.
One that I really like is Modal.
(watch the video at 01:52:15)
Slide 37:
- Feels local, but it's remote ("code in production")
- Massively parallel
- Python native
- Docs: https://modal.com/
Things I’ve built with modal
- Transcript Summarizer
- W&B Webhook
Modal actually gives us a really cool Cloud-native way to run Python code.
The thing that's really interesting about it is it has this one innovation -- it feels like local development but it's actually remote development. This has nothing to do with fine-tuning yet; I'm just telling you a little about Modal and some background.
It's also massively parallel. You can take things like Axolotl and easily do fine-tuning.
A lot of times I use Modal to do things like hyperparameter tuning. There are different ways to do hyperparameter tuning. It's not something you should focus on in the beginning, and it's totally fine to do it manually. I do a lot of things manually; I use bash scripts sometimes to do many different Axolotl runs.
It's very Python native.
There's these Modal docs.
If you're just getting started with Modal, to really experience this magic of Modal (what am I talking about, "it's local but it's remote", what does that even mean?), I don't even know how to explain it to you without you trying it yourself.
What I like to show people first is the web endpoints feature.
I'm not going to demo it right now because I don't have time, but basically just try it out. What you want to do is change the code and you can see it change in production in real time. You don't have to do these constant deploys to change code. It's this really interesting iterative thing.
I built lots of tools in Modal.
(watch the video at 01:54:36)
Slide 38:
https://github.com/modal-labs/llm-finetuning
Has additional defaults / some differences
- Merges LoRA back into the base model
- Use a --data flag instead of relying on the config
- DeepSpeed config comes from the Axolotl repo that is cloned
(these are small notes instead of detailed notes.)
It's something to try first. You could tweak it. You could change the code. There's the README in the llm-finetuning repo. There's a way to get started. Obviously you have to install it. Essentially what you do is you clone this repo. Then you launch this fine-tuning job using this command.
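The launch command in that repo's README looks roughly like the sketch below; the config and data file names are illustrative and the exact invocation may have changed, so copy it from the README itself:

modal run --detach src.train --config=config/llama-3.yml --data=data/sqlqa.jsonl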
(watch the video at 01:56:51)
Slide 39: quick video of what that looks like.
Let me go back to the repo. Just to point out here, to navigate yourself in the repo, hit period (.) on your keyboard to launch VSCode (GitHub Codespaces) real quick so you can view some code. [... truncated ...]
(watch the video at 01:59:41)
Slide:
- Jupyter Notebooks: https://github.com/modal-labs/llm-finetuning/blob/main/nbs/inspect_data.ipynb
- Tip: replace github.com with nbsanity.com to view notebooks
Another thing you might want to do is debug the data so you can run it end-to-end. Remember I told you: you don't want to just train stuff. If you want to have your own data inside Modal, I have this notebook, inspect_data.ipynb. Let's go to this notebook about "Inspecting Flattened Data".
I'm going to change this github.com to nbsanity.com because it's easier to read. [... truncated ...]
(For the Q&A, these are small notes instead of detailed notes.)
(watch the video at 01:20:30)
- Hamel: There were a bunch of questions in the Zoom about how you connect to the Docker container that you want to run Axolotl in.
-
Hamel: That's really related to debugging with Docker.
You can use VSCode to do that. I have some videos and tutorials in the Axolotl docs that show you how to do that either with Docker, not using Docker, and how to attach to remote host and things like that.
-
(watch the video at 02:01:12)
- Are tiny models like 5 billion, 3 billion or less suited for fine tuning?
-
Hamel: I usually don't go smaller than a 7 billion parameter model because I haven't had to go smaller than that. That's like a really sweet spot for me because the models are kind of good enough and they're small enough.
Wing or anyone else, do you have any opinions on this, or seen anything?
-
Wing: I haven't spent a lot of time with the 5 or 3 billion parameter models, mostly because I wasn't impressed by the 5 billion models and I feel they were way too small. I think with the smaller models, the reasoning is worse. Llama-3 is good enough and it works. So yes, 7 billion.
-
- How to determine the adapter rank?
- Dan: There are actually 2 parameters. This wasn't part of the question, but there are 2 parameters that go together: the adapter rank and the adapter alpha.
- Hamel: I just copy the config so I don't determine anything.
- Wing: That's one of those hyperparameters you should play with, assuming you have good evaluations, to understand whether a LoRA at that rank is sufficient to get good accuracy on your downstream use cases. 16 or 32 is typically a good starting point that you see most people use (see the config sketch after this list). Then for alpha, I believe the paper says it should be 2x the rank. If you're using something like RSLoRA, it has something to do with the square root, but I try not to get into that.
- Dan: There's a blog post I'm forgetting, I think by Sebastian Raschka, where he actually does a grid search over what works for those.
- Hamel: There's another thing that I do. This is kind of a weird answer. I actually ask my friends who are a lot smarter than me. There's this guy, Johno Whitaker. He really understands a lot of stuff. I'm like, "hey, what rank do you think I should use for this?" and he gives me some tips. Johno is actually speaking at this conference. He might not talk exactly about this, but he has a really cool talk called "Napkin Math for Fine-Tuning".
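As a concrete illustration of those two knobs in an Axolotl config (the values follow the "rank 16, alpha = 2x rank" rule of thumb above; tune them against your own evals):

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true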
- I have a custom evaluation or benchmark for my model. Is there a way I can get
it to run periodically during fine-tuning to see how the training is going so
far against that evaluation metric?
-
Dan: It is actually something that I've wanted. I don't know the answer to it, but it's something that I've wanted in the past.
-
Hamel: Can you have an evaluation function in Axolotl, some callback or something, if you want to compute some custom evaluation metrics? How do you deal with that? Do you do that?
-
Wing: There's like the tiny benchmarks that you can run sort of against the more standard benchmarks. As far as trying to get more custom evaluations, it's not really supported right now. I think you could do things by adding like callbacks on the evaluation loop maybe.
Here's something you could probably try. There is a way I think on the evaluation. If you were to specify a custom test dataset for your evaluations you can have it generate predictions for those at certain steps and then log those out to Weights & Biases. Then you could pull those from Weights & Biases and then do your own evaluations using like LM-as-a-judge or something along those lines. That would be one way you could do it but there's nothing like directly integrated right now that's streamlined for that.
-
Hamel: How would you do that dumping of predictions in Axolotl?
-
Wing: It's already built in. I think there's something called the eval_table_size setting in Axolotl. What it does is pull some number of prompts from your test dataset, run predictions on them during the evaluation step, and then log those out to Weights & Biases. It's a little bit flaky, so it's not a top-level feature. It's the number of predictions that you want to do, and then the max-tokens setting is how many tokens you would like it to generate during that eval step (see the sketch below).
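A sketch of what that looks like in an Axolotl config; eval_table_size is the setting Wing names, while eval_table_max_new_tokens is my assumption for the max-tokens knob he describes, so verify the exact name in the Axolotl docs:

eval_table_size: 5              # number of test-set prompts to run predictions on during eval
eval_table_max_new_tokens: 128  # max tokens to generate per prediction (name is an assumption)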
-
- Given Axolotl as a wrapper for some Hugging Face libraries, are there any
important edge cases of functionality that you can do in the lower level
libraries that aren't yet possible in Axolotl?
-
Wing: I'm sure there are a lot of things that you could do.
-
Hamel: You're operating at the code level. Yes.
-
Wing: It's hard for me to figure out everything else that goes on underneath.
-
Zach: I think it would especially be at the speed that Wing can implement whatever we chuck into Accelerate, and more specifically what we can then chuck into the trainer. Whatever that gap is, that's the bleeding edge you don't have access to. That could be new FSDP techniques or new DeepSpeed techniques that get added, that we need to update in Accelerate and then push to the trainer. For the most part that should be the major gap, because we try to shove everything we can in Accelerate into the trainer, which Wing then gets for free.
-
Hamel: So you might be wondering why using Axolotl is worth it; let me bring that up again. I just want to show you one example, because there's a lot of stuff that you need to glue together, especially if you don't have a lot of GPUs. One example that came out recently is QLoRA working with FSDP. For the longest time it didn't work. The Answer.AI team enabled that, and then within hours Wing glued it into Axolotl, really before anyone else. I was able to use it almost right away.
Wing keeps doing that over and over again for anything that happens. The LLM space is changing extremely fast from day to day; there's always a new technique for more efficient fine-tuning, lower GPU memory, faster training. The ones that are really important, like this one, get into Axolotl really fast. Trying to do all that yourself would take a long time.
-
- What are the practical implications of 4-bit versus higher precision?
- Dan: I think we said that some of those we will talk about more at deployment. Is there anything you guys think we missed in talking about the implications of 4-bit? Obviously it's going to lead to a smaller LoRA and require less RAM.
- Hamel: 4-bit can be aggressive. I have noticed performance degradation when going all the way to 4-bit before. I've been using this library, MLC, for example, and they have 4-bit quantization. I don't see much of a difference going from 16-bit to 8-bit. But I'm just talking about vibe checks; there are probably papers out there that do some analysis. You always have to check yourself. Generally the trade-off is, for the smaller models you'll have a more portable model that's probably faster. Maybe now it fits on one GPU, so you don't have to do distributed inference and things like that, potentially.
- Wing: One thing to keep in mind is QLoRA is definitely a trade-off for when you don't have enough GPU RAM. So if you have an H100 and you're training a 13 billion parameter model and it fits, don't decide to go down to QLoRA, because you lose a lot of performance in the quantization and dequantization steps. I experimented when QLoRA came out and was like, why is this really terrible on an A100? It should be faster, right? No, it's because of the quantization and dequantization steps that it's actually worse. If you're going for speed and performance when you don't actually need it, it might be an over-optimization in some cases.
- Hamel: It's definitely a GPU poor optimization for sure which is lots of people.
- Does Axolotl also support mac M series GPUs?
- Wing: Yes, because PyTorch is supported on Mac M series. There is an example somewhere where someone did it. But you're probably better off using MLX, I believe, which is the repository that has better fine-tuning if you want to fine-tune on your MacBook or what have you.
- Zach: It's MLX because fine-tuning on Mac is 3 different frameworks, 3 different backends, and all of them kind of work. It can work. Your mileage may vary.
- In an overarching sense, are there mental models or intuitions that we bring
to agentic LLM applications versus ones that are not agentic?
-
Hamel: I saw this question. I guess, in a sense, what does agentic mean? Agentic is some workflow where there's a function call; really, models that make function calls are called agentic. I just want to demystify the terminology -- people have terms and then it feels like rocket science. I actually have not worked on a use case where there isn't some function call involved. Even the Honeycomb example is executing a query at the end for you. That's after the query generation, but it's executing it, and it's going in some loop after that to try to correct things if something goes wrong.
It's really hard to think of use cases where there are no function calls; I feel like they all had function calls. I think you need to write evals that you think of as unit tests and integration tests. It's important to have tests that test the function calls, and to have unit tests for those as well as integration tests.
-
- Is fine-tuning an LLM to output deterministic results exactly the same?
-
Dan: This is, I think, important, because outputting deterministic results is not something about how you do training; it is instead something about how you do inference. You're going to train the model and it's going to have some weights. Then when you are predicting the next word, the last layer is a softmax, so the output of the model is actually a probability distribution over the next token. To make that deterministic, you would choose whatever token is most likely. If you don't do that, you're just sampling from this probability distribution. That's all something that happens at inference time rather than at training time. (See the decoding sketch at the end of this answer.)
-
Hamel: I'll give you a little bit more nuance there. If you want structured output from your LLMs, with the guided generation that Dan is talking about you can clamp down the model so that it only produces tokens that make sense within your constraint. If you want a JSON output with a certain schema that only has allowed values, you can have a grammar: basically rules that clamp down on what tokens the model is allowed to predict. And if you have a very specific type of structured output that you want the model to always provide, fine-tuning can make it happen more reliably.
If you're doing fine-tuning correctly, hopefully you don't trigger the guided generation framework that often. If your guided generation framework is getting triggered very often, then perhaps, if you're already doing fine-tuning anyway, that means your fine-tune is not that good.
The cost of the guided generation isn't very meaningful. The guided generation frameworks are actually really good and really fast. Things like Outlines tend to be really good.
It turns out that fine-tuning can help quite a bit in like learning syntax, learning structure and things like that with more deterministic outputs.
-
- Dan: I've seen the Predibase guys did an experiment where they actually found that Outlines can make inference faster than conventional sampling.
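To make the greedy-versus-sampling point concrete, here is a minimal sketch with the transformers generate API (the model id is an illustrative small model; do_sample=False is greedy decoding, i.e. always pick the most likely next token):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # illustrative small model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("The capital of France is", return_tensors="pt")

# Deterministic: greedy decoding always picks the argmax token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=10)

# Non-deterministic: sample from the softmax distribution instead.
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=10)

print(tok.decode(greedy[0]))
print(tok.decode(sampled[0]))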
-
All the links I collected today, not complete:
- https://x.com/abacaj/status/1782835550396850449
- https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms
- https://x.com/danielhanchen/status/1791900967472140583
- https://arxiv.org/abs/2405.09673
- https://arxiv.org/pdf/2305.11206
- https://x.com/bhutanisanyam1/status/1758159687051350189
- https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2
- "# Scaling Up “Vibe Checks” for LLMs - Shreya Shankar | Stanford MLSys #97" https://www.youtube.com/watch?v=eGVDKegRdgM
- https://lightning.ai/pages/community/lora-insights/
- axolotl-ai-cloud/axolotl#1589
- huggingface/peft#1724
- LoRA: Low-Rank Adaptation of Large Language Models - Explained visually + PyTorch code from scratch https://www.youtube.com/watch?v=PXWYUTMt-AU
- https://buttondown.email/ainews
- https://huggingface.co/docs/transformers/main/en/chat_templating
- https://poe.com/s/c0BFLNhTwiyPXOulPCnO
- https://openaccess-ai-collective.github.io/axolotl/docs/input_output.html
- https://openaccess-ai-collective.github.io/axolotl/docs/dataset-formats/pretraining.html
- https://www.guardrailsai.com/
- https://github.com/outlines-dev/outlines
- https://outlines-dev.github.io/outlines/
- https://nbsanity.com/static/d06085f1dacae8c9de9402f2d7428de2/demo.html
- https://x.com/HamelHusain/status/1784769559364608222
- "...FSDP QDoRA, a scalable and memory-efficient method to close the gap between parameter efficient finetuning and full finetuning." https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html
Source: Discord
Some of them are from the recommended reading list for today's workshop by the instructor.
Some highlights:
(WIP)
- LLM Fine-Tuning 101 blog post by Lucas van Walstijn, May 2024 - A guide on fine-tuning TinyLLama-1.1b base model using the alpaca_2k_test dataset, Axolotl, and Jarvislabs's GPU Cloud.