I do not believe that LLMs are intelligent. That being said, I have no fundamental understanding of how they work. I hear and often regurgitate things like “language prediction,” but I want a more specific grasp of what’s going on.
I’ve read great articles/posts about the environmental impact of LLMs, their dire economic situation, and their dumbing effects on people/companies/products. But the articles I’ve read that ask questions like “can AI think?” basically just go “well it’s just language, and language isn’t the same as thinking, so no.” I haven’t been satisfied with this argument.
I guess I’m looking for something that dives deeper into that type of assertion that “LLMs are just language” with a critical lens. (I am not looking for a comprehensive lesson on the technical side of LLMs because I am not knowledgeable enough for that; some Goldilocks zone would be great.) If you guys have any resources you would recommend, pls lmk, thanks


thank you for the lengthy response. I have some experience with vectors but not necessarily piles of them, though maybe I do have enough of a base to do a proper deep dive.
I think I am grasping the idea of the n-dimensional “map” of words as you describe it, and I see how the input data can be taken as a starting point on this map. I am confused when it comes to walking the map. Is this walking, i.e. the “novel” connections of the LLMs, simply mathematics dictated by the respective values/locations of the input data? Or is it more complicated? I have trouble conceptualizing how just manipulating the values of the input data could lead to conversational abilities. Is it that the mapping is just absurdly complex, like the vectors have an astronomical number of dimensions?
so for the case of inference, eg talking to chatgpt, the model is completely static. training can take weeks to months, so the model does not change when it’s in use.
the novel connections appear in training. it’s just a matter of concepts being unexpectedly close together on the map.
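to make “close together on the map” concrete, here’s a toy python sketch. the vectors and words are completely made up (real embeddings have thousands of dimensions and their values come out of training), so treat this as an illustration of the distance idea, not the real thing:

```python
import numpy as np

# made-up toy embeddings; a real model learns these values during training
embeddings = {
    "cat":     np.array([0.9, 0.1, 0.3, 0.0]),
    "kitten":  np.array([0.8, 0.2, 0.4, 0.1]),
    "tractor": np.array([0.1, 0.9, 0.0, 0.7]),
}

def cosine_similarity(a, b):
    # close to 1.0 means "pointing the same way on the map", near 0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["kitten"]))   # high: close on the map
print(cosine_similarity(embeddings["cat"], embeddings["tractor"]))  # low: far apart
```

the “unexpected” connections are just pairs of concepts that end up with a high similarity score like this even though nobody planned it.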
the mapping is absurdly complex. when i said n-dimensional, the individual word vectors have thousands of dimensions, and the model that moves them around has on the order of hundreds of billions of parameters (weights). i don’t know the exact size of chatgpt’s models but i know they’re at least an order of magnitude or two larger than what can currently be run on consumer hardware. my computer can handle models with around 20 billion parameters before they no longer fit in RAM, and it’s pretty beefy.
as for the conversational ability, the inference step basically works like this:
models are just giant piles of weights that get applied over and over to the input until it morphs into the output. we don’t know exactly how the vectors correspond to the output, mostly because there are just too many parameters to analyse. but what comes out looks like intelligent conversation because that’s what went in during training. the model predicts the next word, or location on the map as it were, and most text it has access to is grammatically correct and intelligent, so it’s reasonable to assume that statistically speaking it will sound intelligent. assuming that it’s somehow self-aware is a lot harder when you actually see it do the loop-de-loop thing of farting out a few candidate words with varying confidence levels and then picking one at random, weighted by those confidence levels.
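that predict → sample → append loop looks roughly like this. it’s a toy sketch: fake_model is a stand-in i made up for the real network, and the vocabulary is six words instead of tens of thousands, but the shape of the loop is the point:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]

def fake_model(tokens):
    # stand-in for the real network: returns a score ("logit") for every vocab word.
    # a real LLM computes these from hundreds of billions of weights.
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=len(vocab))

def sample_next(tokens, temperature=0.8):
    logits = fake_model(tokens)
    probs = np.exp(logits / temperature)
    probs /= probs.sum()                      # confidence level for each candidate word
    return np.random.choice(vocab, p=probs)   # "pick one at random", weighted by confidence

tokens = ["the"]
for _ in range(5):
    tokens.append(sample_next(tokens))
print(" ".join(tokens))
```

the temperature knob is the “how random” setting: lower values make it nearly always take the most confident word, higher values make it gamble more.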
my experience with this is more focused on images, which i think makes it easier to understand because images are more directly multidimensional than text.
when training an image generation model, you take an input image and an accompanying text description. you then repeatedly add a bit of noise to the image (a sort of statistical “blur”) until it’s nothing but noise; this is the “diffusion” part.
at every step, the model is trained to predict the noise that was just added, i.e. to undo that one step, conditioned on the text description. that’s what gets recorded into the weights.
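here’s a toy sketch of that noising-and-recording loop, using an 8x8 array of numbers as the “image” and no actual neural net. the variable names and the 10-step count are made up; in real training a network looks at each (noisy image, text) pair, tries to guess the noise, and its error is what updates the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((8, 8))        # toy 8x8 "image"
text = "a photo of a cat"         # paired description, just carried along here

noisy = image.copy()
training_examples = []
for step in range(10):
    noise = rng.normal(scale=0.1, size=noisy.shape)
    noisy = noisy + noise         # the diffusion step: add a little noise
    # training target: given (noisy image, text), predict the noise we just added,
    # so the model learns how to undo one step. a real model updates its weights here.
    training_examples.append((noisy.copy(), text, noise))

print(f"collected {len(training_examples)} (noisy image, text, noise) training examples")
```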
the result is two of those maps: one with words, one with images, both with identical “topography”.
when you generate an image, you give some text as coordinates in the “word map”, an image consisting of only noise as coordinates in the “image map”, then ask the model to walk towards the word map coordinates. you then update your image to match the new coordinates and go again. basically, you’re asking “in the direction of this text, what came before this image in the training data” over and over again.
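and the generation loop, again as a toy sketch without a real network: start from pure noise and repeatedly ask the denoiser “given this text, which bit of this is noise?”, then step in that direction. fake_denoiser here just returns random numbers; in the real thing it’s the trained model applying the weights learned above:

```python
import numpy as np

rng = np.random.default_rng(1)

def fake_denoiser(noisy_image, text):
    # stand-in for the trained model: its guess of the noise in the image,
    # conditioned on the text ("in the direction of this text, what came before this image").
    return rng.normal(scale=0.05, size=noisy_image.shape)

text = "a photo of a cat"
image = rng.normal(size=(8, 8))        # start from pure noise on the "image map"

for step in range(50):
    predicted_noise = fake_denoiser(image, text)
    image = image - predicted_noise     # walk one step towards the text's coordinates

print(image.shape, image.mean())        # with a real model, enough steps yield an actual picture
```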