
Understanding Three Fundamental AI Concepts: Embeddings, Vectors and Kernels

In a nutshell, a vector is a sequence of numbers, an embedding is a word translated into a vector, and a kernel is a math function that measures the similarity between two vectors. Understanding these three fundamental ideas is important for anyone who uses or buys AI services.

An analogy is driving an automobile. You don't need to be an automotive technician to drive a car. But you should know basic concepts, such as roughly how a car's electrical system works, so you can compare different brands and make an intelligent purchase, distinguish what's important from what's not, and communicate with maintenance and repair people.

Vectors
The foundation for all of AI (including natural language systems and image systems) and its sub-discipline machine learning (prediction systems) is a vector. A "real" vector is nothing more than a sequence of numbers such as (0.1234, -5.4321, 0.9753). The "real" indicates the elements of the vector are ordinary numbers and not weird "complex" numbers that deal with the square roots of negative numbers.

The terms array and vector are often used interchangeably, but in general, an array is a computer science storage object that can contain numbers, letters, or other objects, while a vector holds just numbers. A special kind of vector is one that holds integers -- numbers without a decimal point, such as (23, -765, 42, 99).

A matrix is conceptually a rectangular collection of numbers. For example, a 3-by-4 matrix can be thought of as a 2-dimensional object having 12 elements arranged into 3 rows and 4 columns. However, behind the scenes, a numeric matrix is just a vector of vectors.
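To make these ideas concrete, here is a minimal Python sketch (illustrative only, using the standard NumPy library) that creates a real vector, an integer vector, and a 3-by-4 matrix stored as a vector of vectors:

import numpy as np

# a "real" vector: an ordered sequence of ordinary numbers
v = np.array([0.1234, -5.4321, 0.9753])

# an integer vector: numbers without a decimal point
iv = np.array([23, -765, 42, 99])

# a 3-by-4 matrix: 12 elements arranged into 3 rows and 4 columns;
# behind the scenes it is just a vector of (row) vectors
m = np.array([[1.0,  2.0,  3.0,  4.0],
              [5.0,  6.0,  7.0,  8.0],
              [9.0, 10.0, 11.0, 12.0]])

print(v.shape)   # (3,)
print(iv.dtype)  # an integer type such as int64 (platform-dependent)
print(m.shape)   # (3, 4)
print(m[1])      # the second row is itself a vector: [5. 6. 7. 8.]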

Word Embeddings
Natural language processing systems, including large language models (LLMs) like GPT-x from OpenAI, Gemini from Google, LLaMA from Meta, Phi from Microsoft, R1 from DeepSeek, and Claude from Anthropic, all rely on word embeddings. A sequence of words is broken down into words or word fragments. For example, the sentence, "The Quidtrixx company provides software services" might be parsed into "The", "Quid", "tri", "xx", "company", "provides", "software", "services". Each large language model has its own set of vocabulary words and word fragments.

Each word or fragment is mapped to an integer. For example, "The" might map to 15 (smaller integers are often used for common words), "Quid" might map to 10,349 and so on. The size of a large language model's vocabulary can vary. For example, the relatively small GPT-2 model has a vocabulary of just 50,257 words and fragments.
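As a rough sketch, the mapping from token strings to integer IDs can be pictured as a Python dictionary lookup. The IDs for "The" (15) and "Quid" (10,349) come from the example above; the remaining IDs are invented purely for illustration -- each real LLM has its own tokenizer and vocabulary:

# hypothetical mini-vocabulary mapping tokens to integer IDs;
# real LLM vocabularies hold tens of thousands of entries
vocab = {"The": 15, "Quid": 10349, "tri": 871, "xx": 20533,
         "company": 482, "provides": 913, "software": 604,
         "services": 1208}

tokens = ["The", "Quid", "tri", "xx",
          "company", "provides", "software", "services"]

ids = [vocab[t] for t in tokens]
print(ids)  # [15, 10349, 871, 20533, 482, 913, 604, 1208]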

Figure 1: Word Embeddings

Each integer representation is mapped to a vector. For example, "The" = 15 might map to (0.4325, 1.9768, ..., 0.9148), where the number of elements in the vector typically ranges from 500 to 10,000. This number of elements is called the dimension of the model. The small GPT-2 LLM has 768 elements per word embedding vector.
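A sketch of the lookup step in Python, using the GPT-2-sized numbers mentioned above. The embedding matrix here is filled with random placeholder values; a real model learns these values during training:

import numpy as np

vocab_size, dim = 50257, 768  # GPT-2-sized, per the article

# placeholder embedding matrix; a real LLM learns these values
rng = np.random.default_rng(0)
embed_matrix = rng.standard_normal((vocab_size, dim))

token_id = 15  # "The" in the running example
word_embedding = embed_matrix[token_id]  # one row = one 768-value vector
print(word_embedding.shape)  # (768,)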

The idea behind word embeddings is that a word can have multiple interpretations and so a single integer can't capture a word's entire meaning. For example, the word "bank" can mean a financial institution, or an airplane maneuver, or the edge of a river. A vector with many values can capture all the nuances of a word. The values in a word embedding vector are not arbitrary -- they are constructed in a mathematically sophisticated way so that there are relationships between the underlying words which can be used by the associated LLM.

Kernel Functions
In machine learning prediction systems, a mathematical kernel function computes a measure of similarity between two vectors. Suppose vector v1 = (3.6, 1.5, 2.4) and vector v2 = (3.0, 1.0, 2.0). A simple measure of dissimilarity/distance between vectors is Euclidean distance, which is the square root of the sum of the squared differences between elements.

euc_dist(v1, v2) = sqrt( (3.6 - 3.0)^2 + (1.5 - 1.0)^2 + (2.4 - 2.0)^2 )
                 = sqrt( 0.6^2 + 0.5^2 + 0.4^2)
                 = sqrt( 0.36 + 0.25 + 0.16 )
                 = sqrt( 0.77 )
                 = 0.88
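Here is a minimal Python version of the calculation (a sketch, not production code; the function name euc_dist is just the one used above):

import math

def euc_dist(v1, v2):
    # square root of the sum of squared element differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

v1 = (3.6, 1.5, 2.4)
v2 = (3.0, 1.0, 2.0)
print(round(euc_dist(v1, v2), 2))  # 0.88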

But Euclidean distance has the undesirable characteristic that there's no limit to how large the distance between two vectors can be.

A kernel function gives a measure of similarity (not dissimilarity/distance) between two vectors. There are many kernel functions but the most common is called the radial basis function (RBF) kernel, or sometimes the Gaussian kernel. There are several slight variations of the RBF kernel but a common variation is called the gamma version. The RBF gamma version kernel function is defined as:

rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )

The exp() function raises the math constant e = 2.71828... to a power, and can be found on most calculators. The gamma value is an arbitrary constant such as 2.0. The ||v1 - v2||^2 term means squared Euclidean distance, which is the sum of the squared differences between vector elements -- in other words, Euclidean distance without the square root.

Suppose, as above, v1 = (3.6, 1.5, 2.4) and v2 = (3.0, 1.0, 2.0). Then the ||v1 - v2||^2 term is:

||v1 - v2||^2 = (3.6 - 3.0)^2 + (1.5 - 1.0)^2 + (2.4 - 2.0)^2
              = 0.6^2 + 0.5^2 + 0.4^2
              = 0.36 + 0.25 + 0.16
              = 0.77

If the arbitrary value of gamma is set to 2.0 then:

rbf(v1, v2) = exp( -1 * gamma * ||v1 - v2||^2 )
            = exp( -1 * 2.0 * 0.77 )
            = exp( -1.54 )
            = 0.21
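And the same RBF calculation as a short Python sketch, reusing the vectors and the gamma = 2.0 value from above:

import math

def rbf(v1, v2, gamma=2.0):
    # squared Euclidean distance: no square root
    sq_dist = sum((a - b) ** 2 for a, b in zip(v1, v2))
    return math.exp(-1 * gamma * sq_dist)

v1 = (3.6, 1.5, 2.4)
v2 = (3.0, 1.0, 2.0)
print(round(rbf(v1, v2), 2))  # 0.21
print(rbf(v1, v1))            # 1.0 -- identical vectors give maximum similarity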

The RBF kernel function has the nice characteristic that if two vectors have the same elements, the RBF value is 1.0 (maximum similarity). If two vectors are extremely different, the RBF value approaches, but never quite reaches, 0.0.

Wrapping Up
The Pure AI editors asked Dr. James McCaffrey from Microsoft Research to comment. McCaffrey noted, "As the field of AI and machine learning continues to explode, it seems likely that technically unqualified people will emerge and attempt to establish themselves as authority figures, especially in soft areas where lack of expertise is easy to hide.

"One area of AI that might be especially vulnerable to misinformation in an attempt for some kind of financial gain is AI fairness and equity. It doesn't seem plausible that an AI pseudo-expert who doesn't understand how the natural language system word embedding mechanism works could provide accurate and meaningful information on things such as AI bias."

McCaffrey concluded, "Understanding fundamental AI ideas such as vectors and word embeddings can provide a basis for defense against being swayed by slanted arguments that have some sort of an underlying agenda."
