Show & Tell
Every Friday lunchtime, the whole of endjin get together for a "Show & Tell" session. These are essentially lightning talks on any subject; they can be based on what someone has been working on that week, a topic someone has been researching, or a useful tip or trick; they can be technical or non-technical. It just has to be something the speaker has found interesting and wants to share with the whole company.
This week Ian Griffiths, Technical Fellow, gave a brief overview of investigations into using TorchSharp to model tensors and to train neural networks with backpropagation, in order to understand the principles behind building your own Generative Pre-trained Transformer (GPT).
The Show & Tell contains the following chapters:
- 00:00 What is Torch / PyTorch / TorchSharp?
- 00:45 Representing numbers as Tensors
- 02:21 Tensors and the .NET Type System
- 04:07 Multiplying Arrays
- 05:16 Tensors and Neural Networks
- 06:23 GPUs and CUDA
- 08:35 How inputs affect output
- 12:21 What's backpropagation?
- 21:33 Building a GPT from scratch
- 24:07 Which libraries were used?
- 🔦 TorchSharp
- 🔦 PyTorch
- 📺 Let's build GPT: from scratch, in code, spelled out.
- 🤖 Tensors
- 🤖 Backpropagation
- 🤖 Generative Pre-trained Transformer
- 📖 Polyglot Notebooks
Ian Griffiths: So what is Torch? Torch is a library that gets used in building neural networks. Not the only thing it's for, but that's why we're interested in it. It's widely used through Python, via a package called PyTorch, but it's actually a C++ library under the covers, and PyTorch is just a Python wrapper for it.
And there's a thing called TorchSharp, which is a .NET wrapper for Torch. And Torch essentially lets you do two things. One thing is deal with lots of numbers in parallel, so it does vectorized calculations. So here, I've got a non-vectorized number: I've asked to represent the number 1 in Torch's preferred representation, which is something it calls a tensor, which is like the mathematical concept of a tensor, only missing some pieces.
Essentially, it's either a number, or a tensor can be an array of numbers or it can be an array of arrays of numbers. It can be any number of dimensions. Either it's a scalar or it's an N-dimensional array of numbers. So, at its simplest we can just use them as though they were numbers. So, if I've got the number 1 wrapped up as a tensor I can do this.
I can say, what is 1 + 1? And if I ask the compiler what type that is, it says that's also a tensor because the plus operator is overloaded. So, you can add them together. And if I run this, it will give us the blindingly unattainable answer of one because I printed out the wrong variable.
If I print out the sum, it will determine that 1 plus 1 is not in fact 1, but 1 plus 1 is 2. So far so ordinary. So let's start with the dealing with lots of numbers thing. Rather than doing just one number, I can say I would like a tensor which is built from a bunch of numbers.
I can say I'd like a new array of 1, 2, 3, and if I print that, that's also just the same data type as far as C# is concerned. The .NET type system makes no attempt to embed either the dimensionality or indeed the value type in the data type. So, tensors can have doubles, integers, several different data types, bytes if you like. So, you just have to know. This fits better with Python's way of thinking than C#'s, so you have to know what you're looking for when you get data out of it. So in this case, I had to know this is a multidimensional thing, and so actually item's the wrong thing, and I want data, because that's actually able to give me the data as a collection.
Although probably the simplest thing at this point is for me just to call ToString() on the thing, and I'm going to tell it to format it in the same way that NumPy formats its strings. And you can see again, I've not actually changed the variables, so let me say some numbers. So if I run that, we get 1, 2, 3, and then I can show you how the kind of vectorization works. So, what happens if I take some numbers and I multiply them by this tensor here? That's interesting because this is a one-dimensional array of numbers, whereas 1 plus 1, that's going to be a scalar, that's just the value 2. So, this is also going to give me a tensor. Let's find out what that looks like.
So, I've multiplied the array 1, 2, 3 by the number 2. And what I get is the array 2, 4, 6. So it's done what they call broadcasting. I believe NumPy can do this as well. So, this is a fairly common thing in numeric processing in Python. You can multiply things out and it has rules for what happens when your tensor sizes are mismatched.
Sometimes it's just not allowable. It will just say, okay, I can't work out anything useful to do as you've tried to multiply this thing by that thing. It doesn't make any sense. But as long as you stick to the rules it will either broadcast or just do pairwise work. So if I have two sets of numbers, like this, if I take these two equally sized arrays and multiply them by one another and run them, now it's going to go, oh, they're the same size. I'll just do pairwise multiplication. So, it's multiplied the first number in one array by the first number in the other, and so on. And if you have higher dimensions, it will just do it across all of them.
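The two multiplication cases described above (scalar times array, and two equally sized arrays) can be sketched in plain Python. This is only an illustration of the idea, not how Torch implements it: the `multiply` helper and the example values are hypothetical, it handles only scalars and flat lists, and Torch's real broadcasting rules cover more shape combinations than this.

```python
from numbers import Number


def multiply(a, b):
    """Multiply two values that are each either a scalar or a flat list.

    Hypothetical helper: scalar * list "broadcasts" the scalar across the
    list, list * list of equal length multiplies pairwise, and any other
    combination is rejected.
    """
    a_is_scalar = isinstance(a, Number)
    b_is_scalar = isinstance(b, Number)
    if a_is_scalar and b_is_scalar:
        return a * b
    if a_is_scalar:
        return [a * item for item in b]
    if b_is_scalar:
        return [item * b for item in a]
    if len(a) != len(b):
        raise ValueError(f"cannot multiply lengths {len(a)} and {len(b)}")
    return [x * y for x, y in zip(a, b)]


some_numbers = [1, 2, 3]
total = 1 + 1  # the scalar 2

print(multiply(some_numbers, total))    # [2, 4, 6]
print(multiply([1, 2, 3], [4, 5, 6]))   # [4, 10, 18]
```

Mismatched lengths, such as `multiply([1, 2, 3], [1, 2])`, raise a `ValueError` in this sketch, which mirrors the "it doesn't make any sense" case.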
And this is where the power of the thing starts to come into play because you can actually, it's very common to give it operations that work across very large numbers of values simultaneously. For example, if you have a tensor representing all of the weights in a neuron you can get it to multiply all of its inputs by all of its weights with a single multiplication operation.
Because you just say, here are my inputs. This is a tensor with the input values. Here's another tensor with my weights. Please multiply them together. And it goes, oh, you've got, a thousand items in both arrays. I'll just multiply those for you. And you can build up more complex expressions and it also has heavily optimized matrix multiplication if you use that as well.
So for a lot of the operations in neurons, you can actually calculate / process an entire layer of neurons. You might have 1000 neurons, each of which has 2000 inputs. You can express that as a single matrix multiplication, and it will parallelize that as much as is possible with the hardware you're running on, so it can do these sorts of calculations very quickly.
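As a rough illustration of treating a whole layer as one matrix multiplication, the sketch below uses plain Python lists with one row of weights per neuron. The function name and the tiny sizes (2 neurons, 3 inputs) are assumptions chosen for readability; a library such as Torch would run the same arithmetic in parallel rather than in Python loops.

```python
def layer_pre_activations(weights, inputs):
    """Matrix-vector product: one weighted sum per neuron (per row)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


# Hypothetical layer: 2 neurons, each with 3 inputs.
weights = [
    [1.0, 0.0, -1.0],  # weights for neuron 0
    [0.5, 0.5, 0.5],   # weights for neuron 1
]
inputs = [2.0, 4.0, 6.0]

print(layer_pre_activations(weights, inputs))  # [-4.0, 6.0]
```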
If you have a suitable graphics card, Torch can use the CUDA API, which is the API for using your graphics card's hardware as a computational service. It can use CUDA to run this stuff on the graphics card. And this is what a lot of AI researchers end up doing when they're training their models: they run it all on the GPU, because the GPU can multiply matrices far faster than your CPU can, basically. The nature of graphics card processing is that they're very good at enormous numbers of fairly simple calculations. That's the vectorization end of things. That's one half of what Torch does for you, but the other thing I'm now going to look at, I'm going to comment all this out so I can do a different example.
Suppose we start from a similar place. Let's have a variable I'm gonna call X, which I'm gonna set to a value of 1. And I'm gonna have another variable called Y, which I'm gonna set to a value of 0 in this case. And then I am going to define some function of those two, which is gonna be, let's do X squared plus Y. So X times X plus Y. And that is also gonna be a tensor. Actually, I probably shouldn't call it F, because that makes it sound like it's a function. Let's just call it R, because it's the result of evaluating this function. When you're working with tensors, you're always working with values. You're not defining functions that you can then invoke. I could go and change the inputs, but I'd have to then rerun this line to re-evaluate the function. So really, it's actually going to evaluate the result of that right now. If I print this out, well, with scalars it turns out that this style of printing doesn't work. That's why I did it the other way.
If you're doing scalars, you actually have to go and get the single value out of there and print it that way. So right now, what have we got? X squared is 1, so we expect 1 times 1 plus 0, so this ought to print out the number 1, if I'm still able to do basic arithmetic. Yes, okay. Who cares?
Here's the thing, though. Something that you can ask Torch to do is to find out how the input variables affect your output. And the way I do this is I ask Torch to enable a feature called RequiresGradient. If I turn this on for both of these tensors, it should also end up being enabled for this one here.
And what I can do is then call a function called backward() on it. And what this does is it starts from whichever tensor I've operated on, and it essentially works out the derivative of this value here with respect to each of the inputs, so it's going to say the rate of change of R with respect to R is one, so that's nice and easy.
It works out the rate of change there is unity because R changes at whatever rate R changes. But the more interesting question is what happens if X changes, or what happens if Y changes? How is that going to affect the final answer? And so, once you've done this step, you can ask it. Let's add a new line. So let's imagine my result represents a function I have called f. What's the rate of change of f with respect to X? I can find that out by going to the X variable, asking for its grad, which is nullable because not all things have a gradient, but I know it's going to be there because I've asked to evaluate it, and then show its value.
And I can do the same thing for Y, so how does my output vary as Y changes? So if I run this, it's now going to show me, okay, for X equals 1 and Y equals 0, the value of R is 1, and if you... that's wrong, because I've put the wrong value in. That's not the right answer. Yeah. So it's saying, okay, at position one, the rate at which your R will change if you change X is 2.
So a small change in X will give me a change of double the size in my output value. Or we could look at this, yeah, how would you differentiate this by hand? You'd say d of R with respect to X, where R is X times X plus Y, is going to be 2X, basically, isn't it? I think feeding in X we get 2. If I start with a different value, if I say X has the value 0, the gradient's different. So, it's working out the gradient at the exact point. So, it's saying, what does a change in X mean for F given that the inputs were 0 and 0? If I change the input value of Y, that shouldn't matter because that's not going to change the rate of change.
It will change the output value. So, my output value is now 1, but the rate of change of R as Y changes is always going to be unity, because I'm just adding that in, right? That doesn't change at any point in the place, that's just being added in as a linear function, whereas X is squared, the slope goes up as I go up the slope.
So, if I feed in a huge value for x, then I'm going to get a much steeper slope for x. So, there we are.
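One way to sanity-check the gradients quoted above (2 for X and 1 for Y at X = 1, and a different X gradient at X = 0) is to estimate them numerically. The sketch below uses central differences in plain Python. To be clear, this is an assumption-laden check and not what Torch does internally; the function names and step size `h` are made up for the illustration.

```python
def f(x, y):
    """The function from the demo: x squared plus y."""
    return x * x + y


def numerical_gradient(x, y, h=1e-6):
    """Estimate (df/dx, df/dy) at one specific point by nudging each input."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy


# By hand: df/dx = 2x and df/dy = 1, so the x gradient depends on the point.
for x, y in [(1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (100.0, 0.0)]:
    df_dx, df_dy = numerical_gradient(x, y)
    print(f"x={x}, y={y}: r={f(x, y)}, dr/dx~{df_dx:.3f}, dr/dy~{df_dy:.3f}")
```

Changing `y` moves the value of `r` but leaves both estimated slopes unchanged, while a large `x` gives a much steeper slope in `x`.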
Mike Evans-Larah: How does it know, so where you were on line 28, 29 when you've done x.grad, y.grad, how does it know that's with respect to R?
Ian Griffiths: Because I called it here, so this does a process called backpropagation.
So the algorithm they actually use to do this starts from wherever you ask it to start and goes and fills in this grad setting, first of all on itself. Okay, the rate of change of me with respect to me is one because it just is, and then it goes, okay, how am I built up? It will remember when you construct this thing, it will remember that it was made up with this expression here, or at least it knows that it was an addition. In fact, technically, it will know that it's an addition expression because it's going to interpret that as this.
So it says I am made up of an add operation, and my first operand is made up of a multiply operation, and my second operand is Y, and so this is recursively going to ask, where did I come from? And it's going to apply a version of the chain rule to work out what the next stage is, and it actually goes and fills in the gradient on its antecedents.
And there will actually be a gradient on this whole thing here, so if I were to extract this thing as a local variable. So here it's just implicit in here, but I could define like an x squared variable, that is x times x.
That's literally the same code, I've just given this thing a name. It ends up being the same thing at the tensor level. So it's going to go to this one here and say, OK, you're part of a sum, so the rate of change of R with respect to this input to the sum that makes up R is going to be 1. So it labels that with the value of 1, but then works back to its inputs and says, okay, how does this change with respect to that, and basically builds it up walking down the tree.
So, the reason X knows I'm asking for it to be with respect to R is because I called backward from R in the first place. It just backpropagates its way through the entire tree, and things break completely if you call backward on several different nodes that are shared across multiple trees, because it will actually cumulatively add information to the gradient as it walks the tree.
So, it's not doing a full analytical like process of working out what the actual derivative function is. It's just evaluating it at the point that the inputs are there for, because that's a much easier problem to solve. It's much easier to find out what the gradient is at that point than it is to come up with a function that tells you the gradient overall, so that's the other thing that Torch does.
So you've got the ability to vectorize calculations, and it has this thing that they call AutoGrad, Auto Gradient Determination, where you can pick some output node, and it will walk through all of the expressions that built into that, and will tell you how that node in the expression tree contributes to the output.
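The tree walk described above can be sketched with a tiny scalar-only class in plain Python. This is a hypothetical teaching sketch, not Torch's implementation or API: the `Value` class, its fields, and the decision to keep a gradient on every intermediate node are all assumptions made for illustration. Each node remembers its operands and the local derivative with respect to each one, and `backward()` applies the chain rule from the output back to the inputs.

```python
class Value:
    """A scalar that remembers which operation and operands produced it."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # operands this value was built from
        self._local_grads = local_grads  # d(self)/d(parent) for each operand

    def __add__(self, other):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a * b)/da = b and d(a * b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Order the nodes so each one is processed after everything built on it.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node._parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # rate of change of the output with respect to itself
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # Chain rule; += accumulates when a value is used more than once.
                parent.grad += local * node.grad


x = Value(1.0)
y = Value(0.0)
x_squared = x * x
r = x_squared + y
r.backward()

print(r.grad, x_squared.grad, x.grad, y.grad)  # 1.0 1.0 2.0 1.0
```

Because `x` appears twice in `x * x`, its gradient is built up from two contributions of 1.0 each. The same `+=` accumulation is why calling `backward()` from several outputs that share nodes keeps adding to the stored gradients in this sketch.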
Now, why is that useful? What you often do in neural networks is you will have, let's say, a tensor that represents your weights, initialized to random numbers. So I might say I need a hundred inputs to this neuron, so I'm going to have a hundred numbers that remember the weight for every single input. Neurons basically have some number of inputs, they multiply those inputs by some number, and they also have a bias, which is actually just a single number as a tensor, so that's going to look something like this, which is probably zero. And the way you work out the output of a neuron is if you've got some bunch of inputs, and again I'm just going to give some random numbers there, but in practice the inputs would be whatever the inputs to your system are.
The way to work out... it's not technically the output of the neuron, it's the input to the activation function for the neuron. You can take the weights, multiply them pairwise by the inputs, and then add in the bias. So that gives me another tensor, but then what I can do is write some sort of function that tells me how close that answer was to the answer I was looking for.
So, when you're training a neural network, you typically have training examples that say, here are some inputs, here are the outputs I should have seen for those inputs. So with the number recognition thing, you feed like all the pixels for an image into the thing and you expect this image to give you the output 4 because someone's written a 4 on those pixels and so on.
And so, you have typically thousands or hundreds of thousands or even millions of example inputs and for every one of them you know what output you're expecting. So, what you can then do is you write what's called a loss function which works out how close you got. So you might say loss is going to be a minus expected value, for example.
And let's say, I don't know, I was expecting a value of 42, and I'll get some totally random number because I've just initialized these all to random numbers, so who knows what's going to come out. But then the loss function tells me how far wrong I was. So, what do I want to do to improve it? I want to lower the loss function. How am I going to lower the loss function? I can call loss.backward(), and now every single tensor that fed into this loss function will have one of these gradient annotations on it. So, I can go and say, okay, for this weight on this neuron what will increasing it by a small amount do to my loss function? And it will tell you either it will do nothing, or it will slightly increase the loss function, or it will slightly decrease the loss function.
If a small change, positive change, is going to decrease my loss function, that's how you train your neuron. You nudge it a little bit in the direction that lowers the loss function, and you do that for all of the weights in the neuron and you do that for every neuron in the layer. You do that for every layer in your neural network. You just move them all ever so slightly in the direction that this thing tells you will reduce the loss function. You do that for one input, then you give it a different input and you recalculate the gradients. You do it all again and so every single time you're just pushing all of the weights and biases, all of the parameters that define the behaviour of your neural network, you're pushing them ever so slightly in a direction that gives you a better answer for some piece of test input than the answer you got with the current state.
And over time, what this does is it gradually tends to push the weights and the biases in a direction that gives you the answer that you want, as long as your neural network has enough structure in it to be able to do that. So, whether the training actually works or whether it just randomly spews things up and down is going to depend on the complexity of the problem you're trying to solve and the structure of your network.
So, in this case, I've literally got a single neuron. I've defined a hundred weights and one bias. That's one neuron. It is not going to be very sophisticated, so there's going to be a limit to what it can learn. It's not going to be able to generate the works of Shakespeare, not a single neuron.
But it might be able to tell me whether some feature is in the middle of a picture or to one side, for example. You could imagine it doing that. You could imagine something able to detect whether things are aligned or not. Because they've got a hundred inputs here.
It's not hard to imagine the weights you would need to make this light up strongly when only the middle ones are all activated but not light up strongly when it's off to one side. The structure's enough to do very simple tasks, and as long as you've got enough structure, this process of just following the gradients and nudging everything gradually closer and closer to the direction that the gradient tells you, that's how learning works in neural networks. And when they call it Deep Learning, they just mean you've got multiple layers of neural networks. That's literally all it is. And then the trick to Deep Learning is working out how to design the structure of your networks in such a way that the loss function isn't so mind-bendingly complicated that this approach of nudging in the right direction fails, because that's a big problem. If you have an incredibly complicated loss function because your neural network is incredibly complicated, it turns out that a tiny nudge in what looks like the right direction might actually be catastrophic because it's very finely balanced. The trick is not so much in the learning algorithm, the trick is in working out how to design your networks in such a way that they're good at learning. Other than that, this is the fundamental underlying technique behind it.
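The nudging process described above can be sketched for a single neuron in plain Python. Several things here are assumptions for illustration rather than anything shown in the talk: the loss is a squared error (so that lower is always better), the gradients are written out by hand instead of coming from a `backward()` call, there is only one training example, and the learning rate, step count, and random seed are arbitrary.

```python
import random

random.seed(0)  # arbitrary seed so the run is repeatable

n_inputs = 100
weights = [random.uniform(-1, 1) for _ in range(n_inputs)]  # random start
bias = 0.0
inputs = [random.uniform(-1, 1) for _ in range(n_inputs)]   # stand-in inputs
expected = 42.0
learning_rate = 0.001  # how big each nudge is (assumed value)


def forward(weights, bias, inputs):
    """Weighted sum of the inputs plus the bias (input to the activation)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


losses = []
for step in range(200):
    a = forward(weights, bias, inputs)
    error = a - expected
    loss = error * error  # assumed squared-error loss
    losses.append(loss)

    # Hand-derived gradients: d(loss)/d(w_i) = 2 * error * x_i,
    # and d(loss)/d(bias) = 2 * error.
    # Nudge every parameter slightly in the direction that lowers the loss.
    weights = [w - learning_rate * 2 * error * x
               for w, x in zip(weights, inputs)]
    bias -= learning_rate * 2 * error

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3g}")
print(f"output after training: {forward(weights, bias, inputs):.4f}")
```

With real training data the loop would cycle through many different inputs and expected outputs, recalculating the gradients each time, rather than repeating one example.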
Ed Freeman: Why are you doing this?
Ian Griffiths: Why am I doing this? Because recently somebody posted a video saying how to build a GPT from scratch, and I watched it, and then I wanted to try and build it myself! That's what led to this. I wanted to have a better understanding of what a GPT actually is. I can show you a GPT, actually. I have one. It's written in Python. I haven't converted it to C# yet, but I can show you what it looks like. So, this is using PyTorch. So, this is where I was starting. This is based on the video showing you how to build one from scratch.
It's not ChatGPT, it is just a GPT. So basically, if I can find the training loop... So, this is the basic training model. This is literally what I was talking about. We go around some number of times. We call a function that estimates the loss. Actually, it's not that one there, it's this one here. The model function is going to return the loss as part of what it does.
And then, there's actually a thing called an optimizer in Torch, which does all that following of the gradients for you. So that walks your entire model, finds all the parameters, and adjusts them a little bit. And there are several different optimizers that have slightly different strategies for exactly how they deal with trying to avoid getting stuck in local minima and things like that. But fundamentally, it's all just trying to follow the gradient and set the thing up. The difference here is that the structure of the model is altogether more complicated. This is a GPT language model, a Generative Pre-trained Transformer model, and it doesn't look that complicated, but essentially it consists of quite a lot of these things, and these are the actual transformers. You have a thing called a block, so the meat of it is actually this thing here.
And the critical part is a thing called multi-head attention, which is essentially a thing that's able to look at a few different places in your input data and look for patterns and light up in different places depending on which patterns it finds in there. That's essentially the heart of a GPT right there and the rest is just detail.
So, this is what I was looking at when I was wondering, could we do that in C#? And that led me to do this. And then Eli said, "what should we talk about?" last Thursday, and I said, "how about this?" So that's why I'm talking about this!
Ed Freeman: So are there any actual GPT-specific libraries you're using here, or is the thing built into Torch?
Ian Griffiths: This is all I'm using, Torch and Torch.NeuralNetwork. Mine's slightly buggy; it does not converge as well as the one in the video, so I've got to work out what it is I did wrong. I'm not going to run it because it will take about 20 minutes to train the model, but it does produce something resembling English as its output. It's being trained on Shakespeare as input text, and it produces something that sort of vaguely looks a bit like text rather than line noise as its output.
His one, from a distance, looks genuinely like this. It makes no sense at all if you actually read it, because the real ChatGPT is like an order of magnitude bigger than this model, but the principles are basically the same.