Did Philosophers Predict LLMs?
(Warning: Includes math)
I was reading a book on metaphysics the other day and I realized that the philosopher Wilfrid Sellars postulated something that sounds suspiciously similar to word embeddings in LLMs. To see how, we’ll need a little philosophical background, so bear with me. One of the most important problems in metaphysics is the problem of universals, which asks whether abstract objects like “red-ness” or “circularity” literally exist beyond any particular red or circular object, or whether they are merely constructions of our minds. Those who assert the existence of universals are often called “realists” (Plato comes to mind), while those who deny them are referred to as “nominalists” (Occam comes to mind; yup, the same Occam from Occam’s razor).
Realists insist that universals exist for many reasons, one of which is that they make our language and propositions make sense. If I say “Socrates is courageous” then every token in my proposition maps to a real object, either a particular (Socrates) or a universal (courageous). But if universals are not real, there would be nothing to map the token “courageous” to, and the proposition would be meaningless. Realists think that for nominalism to be coherent, nominalists must give an account of how our language and propositions get their meaning. Otherwise, propositions like “Socrates is courageous” become meaningless, which threatens all propositions, the authority of reason, and the existence of truth itself. A lot is at stake here, right?
There are various responses by nominalists, but the one I’m interested in discussing is Sellars’s metalinguistic account. He argues that propositions like “Socrates is courageous” need not refer to universals. The meaning of “courageous” here is defined implicitly through a network of inferences and linguistic rules developed socially over time. In other words, “courage” is not an abstract object out there, but functions as a label for a linguistic role whose meaning is embedded within a particular language. This kind of view is also referred to as inferential role semantics.
Notice how every word in the dictionary is defined in terms of other words. Indeed, a “word” has no meaning on its own, but gains its meaning through the various contexts it’s placed in. This is the key insight that makes word embeddings work. If “meaning” is just patterns of co-occurrence in a corpus, then an algorithm can process the corpus and give meaning a numeric representation. That’s precisely what an algorithm like skip-gram does.
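To make “patterns of co-occurrence” a bit more concrete, here is a minimal sketch of how one might collect (word, context) pairs from a tokenized sentence. The function name `context_pairs` and the `window` parameter are my own illustrative choices, not part of any particular library.

```python
def context_pairs(tokens, window=2):
    """Collect (word, context) pairs for every word within `window` positions."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs


tokens = "socrates is courageous".split()
pairs = context_pairs(tokens, window=1)
print(pairs)
```

Every pair collected this way is a “real” co-occurrence, the kind of positive example the classifier described next is trained on.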
In skip-gram (from the word2vec package), we train a logistic classifier that answers the following question: given a word \(w\) and a candidate context \(c\) that may or may not have actually occurred with \(w\), what is the probability that \(c\) really did appear with it? We’ll denote it as \(p(+ \mid w, c)\). Conversely, we’ll write \(p(- \mid w, c) = 1 - p(+ \mid w, c)\) for the probability that \(c\) did not really occur with \(w\). To estimate this we use the dot product \(w \cdot c\) to measure similarity, which ranges over \((-\infty, \infty)\). To turn it into a probability, we just take its sigmoid: \(p(+ \mid w, c) = \sigma(w \cdot c) = \frac{1}{1 + e^{-w \cdot c}}\).
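Here is a tiny sketch of that classifier in plain Python. The vectors are made-up numbers chosen for illustration, and the helper names (`sigmoid`, `dot`, `p_positive`) are my own.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def p_positive(w, c):
    """p(+ | w, c): probability that context c really occurred with word w."""
    return sigmoid(dot(w, c))


w = [0.5, -1.0, 2.0]   # hypothetical embedding of the target word
c = [1.0, 0.0, 0.5]    # hypothetical embedding of the context word

p_pos = p_positive(w, c)   # the dot product here is 0.5 + 0.0 + 1.0 = 1.5
p_neg = 1.0 - p_pos        # p(- | w, c)
print(p_pos, p_neg)
```

A handy identity to keep in mind for later: \(1 - \sigma(x) = \sigma(-x)\), so `p_neg` is the same as `sigmoid(-dot(w, c))`.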
Once we’ve trained this logistic classifier to predict these probabilities, we just grab its weights and discard the classifier itself. We’re only interested in the embedding vectors for \(w\) and \(c\), which start out random.
Now, in truth we have multiple contexts \(c_{1:L}\) (the words that appear within a context window of size \(L\)), so we have multiple probabilities. We assume that these contexts are independent, so what we really have is
$$p(+ \mid w, c_{1:L}) = \prod_{i = 1}^{L}\sigma(w \cdot c_i)$$
Taking the log of both sides turns the product into a sum:
$$\log p(+ \mid w, c_{1:L}) = \sum_{i = 1}^{L} \log{\sigma(w \cdot c_i)}$$
Note that we have two embeddings here, one for the word as a target, and one for the word as a context. Thus, we have two weight matrices \(W\) and \(C\), each with \(|V|\) rows (one per vocabulary word) and the same embedding dimension \(d\), so the total number of parameters is \(2|V|d\).
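As a sanity check, here is a small sketch that builds two randomly initialized embedding tables and compares the sum of log-sigmoids against the log of the product. The toy vocabulary, the embedding dimension `dim`, and the initialization range are all assumptions I picked for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def init_table(vocab, dim, rng):
    """One random vector of length `dim` per vocabulary word."""
    return {word: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for word in vocab}


def window_log_prob(w_vec, context_vecs):
    """Sum of log sigma(w . c_i) over the contexts in the window."""
    return sum(math.log(sigmoid(dot(w_vec, c))) for c in context_vecs)


rng = random.Random(0)
vocab = ["socrates", "is", "courageous", "wise"]
dim = 3

W = init_table(vocab, dim, rng)   # embeddings of words as targets
C = init_table(vocab, dim, rng)   # embeddings of words as contexts

contexts = [C["socrates"], C["courageous"]]
log_p = window_log_prob(W["is"], contexts)
product = math.prod(sigmoid(dot(W["is"], c)) for c in contexts)

print(log_p, math.log(product))        # the two numbers should agree
n_params = 2 * len(vocab) * dim        # two tables of |V| vectors each
print(n_params)
```

Working with the sum of logs rather than the raw product is also kinder numerically, since multiplying many probabilities together quickly underflows.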
We’ve described the actual task that the classifier has to solve, but not the training methodology itself. It’s slightly different from traditional logistic regression. We’ll have a word \(w\) and a context word \(c_{pos}\) that really occurred with it. Then, we’ll sample \(k\) (a hyperparameter) negative samples, which are basically contexts that do not occur with \(w\). The key idea is that we want to teach the classifier that a negative (fake) sample \(c_{neg i}\) does not appear with \(w\). We want to maximize \(p(+ \mid w, c_{pos})\) as well as each \(p(- \mid w, c_{neg i})\). Again, we’re going to assume that they’re independent so we can just multiply them together. This is captured in the following loss function
$$L = -\log\left[ p(+ \mid w, c_{pos}) \prod_{i = 1}^{k} p(- \mid w, c_{neg i}) \right],$$
which, when simplified using \(1 - \sigma(x) = \sigma(-x)\), yields
$$L = -\left[ \log{\sigma(w \cdot c_{pos})} + \sum_{i = 1}^{k} \log{\sigma(- w \cdot c_{neg i})} \right]$$
Finding the gradient is straightforward.
$$\frac{\partial L}{\partial c_{pos}} = (\sigma(c_{pos} \cdot w) - 1) w$$
$$\frac{\partial L}{\partial c_{neg i}} = \sigma(c_{neg i} \cdot w) w$$
$$\frac{\partial L}{\partial w} = (\sigma(c_{pos} \cdot w) - 1) c_{pos} + \sum_{i = 1}^{k} \sigma(w \cdot c_{neg i}) c_{neg i}$$
Notice how we can solve a dauntingly difficult problem by using a relatively simple algorithm and a few mathematical equations. I do not think the metalinguistic nominalists or advocates of inferential role semantics directly inspired AI researchers. This looks like another case of two different fields independently converging on similar ideas! That’s fascinating.
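For anyone who wants to poke at the math, here is a minimal sketch of the negative-sampling loss and its gradients for a single training example, checked against finite differences and followed by one plain gradient-descent step. The vectors, the learning rate `lr`, and all helper names are made up for illustration. Note that the gradient for each negative context uses \(\sigma(w \cdot c_{neg i})\), the score of that same negative sample.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(w, c_pos, c_negs):
    """L = -[log sigma(w . c_pos) + sum_i log sigma(-w . c_neg_i)]."""
    total = -math.log(sigmoid(dot(w, c_pos)))
    for c_neg in c_negs:
        total -= math.log(sigmoid(-dot(w, c_neg)))
    return total


def gradients(w, c_pos, c_negs):
    """Analytic gradients of the loss w.r.t. w, c_pos and each c_neg_i."""
    pos_err = sigmoid(dot(w, c_pos)) - 1.0
    neg_errs = [sigmoid(dot(w, c_neg)) for c_neg in c_negs]
    grad_c_pos = [pos_err * x for x in w]
    grad_c_negs = [[err * x for x in w] for err in neg_errs]
    grad_w = [pos_err * x for x in c_pos]
    for err, c_neg in zip(neg_errs, c_negs):
        grad_w = [g + err * x for g, x in zip(grad_w, c_neg)]
    return grad_w, grad_c_pos, grad_c_negs


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [0.2, -0.4, 0.1]
c_pos = [0.3, 0.1, -0.2]
c_negs = [[-0.1, 0.5, 0.2], [0.4, -0.3, 0.0]]   # k = 2 negative samples

grad_w, grad_c_pos, grad_c_negs = gradients(w, c_pos, c_negs)

# Compare the analytic gradients against finite differences.
num_w = numeric_grad(lambda v: loss(v, c_pos, c_negs), w)
num_neg0 = numeric_grad(lambda v: loss(w, c_pos, [v, c_negs[1]]), c_negs[0])
diff_w = max(abs(a - b) for a, b in zip(grad_w, num_w))
diff_neg0 = max(abs(a - b) for a, b in zip(grad_c_negs[0], num_neg0))

# One gradient-descent step on all vectors involved.
lr = 0.1
before = loss(w, c_pos, c_negs)
w_new = [x - lr * g for x, g in zip(w, grad_w)]
c_pos_new = [x - lr * g for x, g in zip(c_pos, grad_c_pos)]
c_negs_new = [[x - lr * g for x, g in zip(c, gc)]
              for c, gc in zip(c_negs, grad_c_negs)]
after = loss(w_new, c_pos_new, c_negs_new)
print(diff_w, diff_neg0, before, after)
```

The step pulls \(w\) and \(c_{pos}\) toward each other and pushes \(w\) away from the negative samples, which is the whole training signal in a nutshell.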