AI Tokens

Text → Tokens → Token IDs → Vectors → Prediction → New Token → Repeat
An AI model doesn't see raw sentences; it sees tokens represented as numbers.

Why Tokens?

Unlike traditional data storage and retrieval, an AI model has to capture relationships between pieces of text and generate text that fits the context. To do that, it needs a special representation: vectors, which are arrays of numbers with many dimensions.

In most use cases, input text has to be converted into tokens first and then into vectors. A token is a chunk of text that the model processes as a single unit. A token can be a whole word, a subword, a single character, or a punctuation mark:

  • A word → house
  • A subword → un, believ, able
  • Punctuation → . ,

The Process

Text → Tokens → Token IDs → Vectors → Prediction → New Token → Repeat

Step-1: Text → Tokens (Tokenization)

Before anything else, a tokenizer splits the text into tokens. Tokenizers are often based on Byte Pair Encoding (BPE) or similar methods.

Example:

"unbelievable" → ["un", "believ", "able"]

Why split like this?

  • Reduces vocabulary size
  • Helps the model understand new/rare words
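
To make the splitting concrete, here is a minimal sketch of subword tokenization. It is not BPE itself: it uses a simpler greedy longest-match rule over a hand-written vocabulary. The `VOCAB` set and the `tokenize` function are assumptions made up for illustration; real tokenizers learn their vocabulary from large amounts of text.

```python
# Hypothetical toy vocabulary; real tokenizers learn theirs from data.
VOCAB = {"un", "believ", "able", "house", "s"}


def tokenize(word, vocab=VOCAB):
    """Split a word greedily into the longest known pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(tokenize("houses"))        # ['house', 's']
```

Because unknown text falls back to single characters, even a word the vocabulary has never seen still gets tokenized, which is the idea behind handling new or rare words with a small vocabulary.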

Step-2: Tokens → Numbers (Token IDs)

Each token is mapped to a unique integer ID from the tokenizer's vocabulary (the IDs below are illustrative):

"un" → 453
"believ" → 9821
"able" → 771

Now the word becomes:

[453, 9821, 771]
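
As a rough sketch, this mapping is just a lookup table that works in both directions. The table below reuses the illustrative IDs from the text; `TOKEN_TO_ID`, `encode`, and `decode` are hypothetical names, not part of any particular library.

```python
# Hypothetical token-to-ID table, reusing the illustrative IDs from the text.
TOKEN_TO_ID = {"un": 453, "believ": 9821, "able": 771}
# Reverse table, so IDs can be turned back into text.
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}


def encode(tokens):
    """Map each token to its integer ID."""
    return [TOKEN_TO_ID[token] for token in tokens]


def decode(ids):
    """Map IDs back to tokens and join them into text."""
    return "".join(ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode(["un", "believ", "able"])
print(ids)          # [453, 9821, 771]
print(decode(ids))  # unbelievable
```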

Step-3: Numbers → Vectors (Embeddings)

Each token ID is converted into a vector (a list of numbers):

"cat" → [0.12, -0.98, 0.44, ...]

These vectors capture meaning:

  • “cat” and “dog” → close together
  • “cat” and “car” → far apart
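
One common way to measure "close together" is cosine similarity. The sketch below uses hand-picked 3-dimensional vectors purely for illustration. Real embeddings are learned during training, have hundreds or thousands of dimensions, and are looked up by token ID rather than by word; the `EMBEDDINGS` table here is an assumption.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_car = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["car"])
print(cat_dog > cat_car)  # True
```

A similarity near 1 means the vectors point in almost the same direction; with these toy values, "cat" and "dog" score much higher than "cat" and "car".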

Step-4: Model predicts next token

The model:

  1. Looks at previous tokens
  2. Calculates probabilities for the next token

Example

Input: "The house is very"
Output probabilities (top candidates only; the remaining probability is spread across all other tokens):

  • "big" → 40%
  • "small" → 25%
  • "beautiful" → 20%
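
Models typically produce a raw score for every token in the vocabulary and turn those scores into probabilities with the softmax function. The sketch below shows that conversion. The scores are invented for illustration and are not meant to reproduce the percentages above.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(scores.values())
    exps = {token: math.exp(s - highest) for token, s in scores.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}


# Hypothetical raw scores for the token after "The house is very".
scores = {"big": 2.0, "small": 1.5, "beautiful": 1.3, "old": 0.3}
probs = softmax(scores)

# Print candidates from most to least likely.
for token, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{token}: {p:.0%}")
```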

Step-5: Token generation

The model chooses one token based on these probabilities and on sampling settings such as temperature, which control how much randomness is allowed:

Chosen → "big"

The chosen token is appended to the input, and the process repeats:

"The house is very big ..."

This loop continues, generating text token by token, until the model produces an end marker or reaches a length limit.
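
The loop can be sketched in a few lines. This is a toy under stated assumptions: the `NEXT_TOKEN_PROBS` table stands in for a real model and looks only at the last token, whereas a real model considers all previous tokens. The `<end>` marker, the table contents, and the function names are all hypothetical.

```python
import random

# Hypothetical stand-in for a model: last token -> {candidate: probability}.
NEXT_TOKEN_PROBS = {
    "very": {"big": 0.5, "small": 0.3, "beautiful": 0.2},
    "big": {"and": 0.6, "<end>": 0.4},
    "small": {"and": 0.5, "<end>": 0.5},
    "beautiful": {"<end>": 1.0},
    "and": {"big": 0.4, "small": 0.3, "beautiful": 0.3},
}


def sample_next(probs, temperature, rng):
    """Pick one token at random, weighted by temperature-adjusted probability."""
    tokens = list(probs)
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    weights = [probs[token] ** (1.0 / temperature) for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt_tokens, max_new_tokens=10, temperature=1.0, seed=0):
    """Append one sampled token at a time until <end> or the length limit."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS.get(tokens[-1])
        if probs is None:
            break  # the toy table has no entry for this token
        next_token = sample_next(probs, temperature, rng)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


print(" ".join(generate(["The", "house", "is", "very"])))
```

Changing `seed` or `temperature` changes which continuation is produced, which is why the same prompt can yield different outputs.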

Entire flow

Text → Tokens → Token IDs → Vectors → Prediction → New Token → Repeat
