I do not believe that LLMs are intelligent. That being said, I have no fundamental understanding of how they work. I hear and often regurgitate things like “language prediction,” but I want a more specific grasp of what’s going on.

I’ve read great articles/posts about the environmental impact of LLMs, their dire economic situation, and their dumbing-down effects on people/companies/products. But the articles I’ve read that ask questions like “can AI think?” basically just go “well, it’s just language, and language isn’t the same as thinking, so no.” I haven’t been satisfied with this argument.

I guess I’m looking for something that dives deeper into that type of assertion, that “LLMs are just language,” with a critical lens. (I am not looking for a comprehensive lesson on the technical side of LLMs because I am not knowledgeable enough for that; some Goldilocks zone would be great.) If you have any resources you would recommend, please let me know, thanks!

  • lime!@feddit.nu
    3 days ago

    it’s a whole branch of mathematics. looking at it from a pure language perspective isn’t really useful because language models don’t really work in language. they work in text. “llms are just language” is misleading because language implies a certain structure, while language models use a completely different structure.

    i don’t have any proper sources but here’s a quick overview off of the top of my head:

    a large language model is a big pile of vectors (a vector here is basically a list of numbers). the “number of parameters” in a machine learning model is the total count of numbers stored across all of those vectors (in programming speak, the combined length of all the lists). these vectors represent coordinates on an n-dimensional “map of words”. words that are related are “closer together” on this map. when you build this map, you can then use vector math to find word associations. this is important because vector math is all hardware accelerated (because of 3D graphics).
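
    to make the “map of words” idea concrete, here’s a toy sketch in python. the words, numbers and dimensions are all made up for illustration; a real model learns these values during training and uses far more dimensions:

    ```python
    import numpy as np

    # toy 4-dimensional "word map"; real models learn these numbers
    # and use hundreds to thousands of dimensions per word
    word_vectors = {
        "king":  np.array([0.9, 0.8, 0.1, 0.2]),
        "queen": np.array([0.9, 0.2, 0.1, 0.8]),
        "man":   np.array([0.1, 0.9, 0.0, 0.1]),
        "woman": np.array([0.1, 0.2, 0.0, 0.9]),
        "apple": np.array([0.0, 0.1, 0.9, 0.1]),
    }

    def cosine_similarity(a, b):
        # close to 1.0 means "pointing the same way" on the map
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # "closer together" on the map = higher similarity
    print(cosine_similarity(word_vectors["king"], word_vectors["queen"]))  # high
    print(cosine_similarity(word_vectors["king"], word_vectors["apple"]))  # low

    # the classic vector-math trick: king - man + woman lands nearest to queen
    target = word_vectors["king"] - word_vectors["man"] + word_vectors["woman"]
    print(max(word_vectors, key=lambda w: cosine_similarity(target, word_vectors[w])))
    ```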

    the training process builds the map by looking at how words and concepts appear in the input data and adjusting the numbers in the vectors until they fit those patterns. the more data, the more general the resulting map. the inference process then uses the input text as its starting point and “walks” the map.
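
    as a heavily simplified picture of how training “adjusts the numbers until they fit”: imagine nudging the vectors of words that appear together a little closer on every pass. real training uses gradient descent over enormous amounts of text, but the shape of the loop is similar (the words and pairs below are made up):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    dims = 8  # tiny; real models use far more dimensions

    # start every word at a random spot on the "map"
    vocab = ["cat", "dog", "pet", "stock", "market"]
    vectors = {w: rng.normal(size=dims) for w in vocab}

    # word pairs that appear near each other in the (made-up) training text
    cooccurring = [("cat", "pet"), ("dog", "pet"), ("stock", "market")]

    learning_rate = 0.1
    for _ in range(200):
        for a, b in cooccurring:
            # pull the two vectors slightly toward each other,
            # so related words end up "closer together" on the map
            delta = vectors[b] - vectors[a]
            vectors[a] += learning_rate * delta
            vectors[b] -= learning_rate * delta

    def distance(a, b):
        return np.linalg.norm(vectors[a] - vectors[b])

    print(distance("cat", "dog"))     # small: both got pulled toward "pet"
    print(distance("cat", "market"))  # large: never seen together
    ```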

    the emergent behaviour that some people call intelligence stems from the fact that the training process makes “novel” connections. words that are related are close together, but so are words that sound the same, for example. the more parameters a model has, the more connections it can make, and vice versa. this can lead to the “overfitting” problem, where the amount of input data is so small that the only associations are the ones from the actual input document. using the map analogy, there may exist particular starting points where there is only one possible path. the data is not actually “in” the model, but it can be recreated exactly. the opposite can also happen, where there are so many connections for a given word that the actual topic can’t be inferred from the input and the model just goes off on a tangent.
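
    the “only one possible path” situation is easy to demonstrate with a far dumber model than an LLM: a next-word table built from a single tiny document. this isn’t how a transformer works internally, but it shows how a model can recreate its training text exactly without storing it as a file:

    ```python
    from collections import defaultdict

    # a single, tiny "training set"
    text = "language models are just giant piles of weights applied to the input".split()

    # record which word follows which (a crude stand-in for training)
    next_words = defaultdict(list)
    for current, following in zip(text, text[1:]):
        next_words[current].append(following)

    # generate: from this starting word there is only one possible path,
    # so the "model" walks straight back through the training sentence
    word = "language"
    output = [word]
    while next_words[word]:
        word = next_words[word][0]
        output.append(word)

    print(" ".join(output))  # recreates the training text exactly
    ```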

    why this is classed as intelligence i could not tell you.

    Edit: replaced some jargon that muddied the point.


    something related: you know how heavily compressed jpegs always have visible little squares in them? jpeg compression works by slicing the image into little squares, expressing each square as a sum of wave patterns (a mathematical tool called the discrete cosine transform), and then throwing away the patterns that contribute the least. the more you compress, the more of each square’s own detail gets thrown away, so neighbouring squares stop matching up and the grid becomes visible.
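
    here’s roughly what happens to one 8x8 square, as a sketch. real jpeg also converts colours and uses a quantisation table plus clever encoding of the kept values instead of the hard cutoff used here; the pixel values are just random stand-ins:

    ```python
    import numpy as np
    from scipy.fft import dctn, idctn

    rng = np.random.default_rng(0)
    # one 8x8 block of greyscale pixel values (stand-in for a real image tile)
    block = rng.integers(0, 256, size=(8, 8)).astype(float)

    # express the block as a sum of cosine wave patterns
    coefficients = dctn(block, norm="ortho")

    # "compression": drop the patterns that contribute the least
    cutoff = np.quantile(np.abs(coefficients), 0.75)
    coefficients[np.abs(coefficients) < cutoff] = 0

    # rebuild the block from what's left; the lost detail is what makes
    # neighbouring blocks stop matching up in a heavily compressed jpeg
    approximation = idctn(coefficients, norm="ortho")
    print(np.round(approximation - block))
    ```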

    you can do this with text models as well. increasing jpeg compression is like lowering the number of parameters, or storing each one with fewer bits. the fewer parameters, the worse the model. if you compress too much, the model starts to blend concepts together or mistake words for one another.
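
    the “mistaking words for one another” effect is easy to fake with the toy vectors from before: snap every number onto a coarser and coarser grid (which is roughly what quantising a model to fewer bits does) and distinct words start landing on the same point of the map. the values here are made up so the collapse is obvious:

    ```python
    import numpy as np

    def quantize(v, step):
        # snap every value to a grid, like storing weights in fewer bits
        return np.round(v / step) * step

    # two related-but-distinct "words" and one unrelated one (made-up values)
    doctor  = np.array([0.82, 0.11, 0.64, 0.16])
    surgeon = np.array([0.79, 0.14, 0.61, 0.18])
    banana  = np.array([0.05, 0.91, 0.12, 0.88])

    for step in (0.01, 0.1, 0.5):
        d, s, b = quantize(doctor, step), quantize(surgeon, step), quantize(banana, step)
        print(f"grid {step}: doctor == surgeon? {np.array_equal(d, s)},",
              f"doctor == banana? {np.array_equal(d, b)}")
    ```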

    what the ai bros are saying now is that if you go the other way, the model may become self-aware. in my mind that’s like saying that if you make a jpeg large enough, it will become real.

    • katsura@leminal.spaceOP
      3 days ago

      thank you for the lengthy response. I have some experience with vectors but not necessarily piles of them, though maybe I do have enough of a base to do a proper deep dive.

      I think I am grasping the idea of the n-dimensional “map” of words as you describe it, and I see how the input data can be taken as a starting point on this map. I am confused when it comes to walking the map. Is this walking, i.e. the “novel” connections of the LLMs, simply mathematics dictated by the respective values/locations of the input data? Or is it more complicated? I have trouble conceptualizing how just manipulating the values of the input data could lead to conversational abilities. Is it that the mapping is just absurdly complex, like the vectors having an astronomical number of dimensions?

      • lime!@feddit.nu
        3 days ago

        so for the case of inference, e.g. talking to chatgpt, the model is completely static. training can take weeks to months, so the model does not change while it’s in use.

        the novel connections appear in training. it’s just a matter of concepts being unexpectedly close together on the map.

        the mapping is absurdly complex. when i said n-dimensional, n is in the hundreds to thousands for each individual vector, and the model as a whole stores hundreds of billions of those numbers (its parameters). i don’t know the exact size of chatgpt’s models but i know they’re at least an order of magnitude or two larger than what can currently be run on consumer hardware. my computer can handle models with around 20 billion parameters before they no longer fit in RAM, and it’s pretty beefy.
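
        the back-of-the-envelope for why around 20 billion parameters is roughly where a beefy desktop tops out (the 20e9 is just the number above, the byte sizes are the standard storage formats, the rest is arithmetic):

        ```python
        # every parameter is one stored number; the RAM needed depends on
        # how many bytes you spend per number
        parameters = 20e9

        for name, bytes_per_parameter in [("32-bit float", 4), ("16-bit float", 2), ("4-bit quantized", 0.5)]:
            gigabytes = parameters * bytes_per_parameter / 1e9
            print(f"{name}: ~{gigabytes:.0f} GB just for the weights")
        ```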

        as for the conversational ability, the inference step basically works like this:

        1. convert the input into a vector
        2. run the input through the model’s weights
        3. the top three or so most probable next words appear
        4. select one of the words semi-randomly
        5. append the word to the input
        6. goto 1.

        models are just giant piles of weights that get applied over and over to the input until it morphs into the output. we don’t know exactly how the vectors correspond to the output, mostly because there are just too many parameters to analyse. but what comes out looks like intelligent conversation because that’s what went in during training. the model predicts the next word, or location on the map as it were, and most of the text it has access to is grammatically correct and intelligent, so it’s reasonable to assume that, statistically speaking, it will sound intelligent. assuming that it’s somehow self-aware is a lot harder when you actually see it do the loop-de-loop thing of farting out a few words with varying confidence levels and then selecting one at random.
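
        in (heavily simplified) code, that loop looks something like this. `model`, `tokenize` and the tiny vocabulary are fake placeholders so the sketch runs; in reality the model is the enormous static pile of weights that came out of training:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)

        def generate(model, tokenize, vocabulary, prompt, steps=10, top_k=3):
            """toy autoregressive loop: score, keep the top few, pick one, repeat."""
            text = prompt
            for _ in range(steps):
                scores = model(tokenize(text))               # run the weights on the input so far
                top = np.argsort(scores)[-top_k:]            # the few most probable next words
                p = np.exp(scores[top] - scores[top].max())  # turn their scores into probabilities
                p /= p.sum()
                choice = rng.choice(top, p=p)                # pick one semi-randomly
                text += " " + vocabulary[choice]             # append it and go again
            return text

        # fake stand-ins so this actually runs; a real model replaces all three
        vocabulary = ["the", "cat", "sat", "map", "vector"]
        def tokenize(text):
            return [vocabulary.index(w) for w in text.split() if w in vocabulary]
        def model(tokens):
            return np.random.default_rng(len(tokens)).normal(size=len(vocabulary))

        print(generate(model, tokenize, vocabulary, "the cat"))
        ```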

        my experience with this is more focused on images, which i think makes it easier to understand because images are more directly multidimensional than text.
        when training an image generation model, you take an input image and its accompanying text description. you then basically blur the image repeatedly until it’s just noise (specifically, you “diffuse” it).
        at every step you record, in the weights, what the blur operation did to the image, and you tie those weights to the text description.
        the result is two of those maps: one with words, one with images, both with identical “topography”.
        when you generate an image, you give some text as coordinates in the “word map” and an image consisting of only noise as coordinates in the “image map”, then ask the model to walk towards the word map coordinates. you then update your image to match the new coordinates and go again. basically, you’re asking “in the direction of this text, what came before this image in the training data” over and over again.
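
        a very hand-wavy sketch of that generation loop, just to show its shape. `denoiser` and the text vector are placeholders for the trained model and the encoded prompt, and the update rule is a made-up simplification of the real maths:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)

        def generate_image(denoiser, text_vector, size=(64, 64), steps=30):
            """start from pure noise and repeatedly ask: in the direction of
            this text, what slightly less noisy image came before this one?"""
            image = rng.normal(size=size)  # coordinates in the "image map": pure noise
            for step in reversed(range(steps)):
                guess = denoiser(image, text_vector, step)    # model's guess at the clean image
                image = image + (guess - image) / (step + 1)  # walk a little toward that guess
            return image

        # placeholder pieces so the sketch runs; a real diffusion model replaces these
        def fake_denoiser(image, text_vector, step):
            return image * 0.9  # pretend guess: just dampen the noise a bit

        text_vector = rng.normal(size=512)  # stand-in for coordinates in the "word map"
        print(generate_image(fake_denoiser, text_vector).std())
        ```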

    • very_well_lost@lemmy.world
      3 days ago

      in my mind that’s like saying that if you make a jpeg large enough, it will become real.

      This is such an excellent analogy. Thank you for this!