AI Concepts

The World of Vectors in AI

Ravi Kalidindi

03 May 2025 — 10 min read

Why are Vectors a big deal for AI?

Understand what we have been doing so far vs what we can get via AI Vectors as a perfect data type and format.

What are we doing to retrieve the data?

RDBMS options
NoSQL and other database options
Search Indexing Software Options

Our Option-1: When we build any software, we historically used RDBMs to normalize/organize the structured data in two dimensions and retrieve the data with SQL queries. Build (normalize/organize the data) many types of data in two dimensions as tables.

Our Option-2: When we need specific data retrieval and optimizations, we use specific Databases such as NoSQL, time-series, graph databases, etc. Similar to RDBM databases, we normalize/organize the structured data and semi-structured data in two dimensions with some limitations and retrieve the data with better performance vs cost.

Our Option-3: When we need further specific data retrieval and optimizations on unstructured data, we use specific Search Indexing software (Solr, Elastic Search, Open Search, etc). Search Indexing software gives better retrieval capabilities and performance than the above options, where you can apply many dimensions.

💡

We have been using various data types such as integers, doubles, floats, text, and dates in the above databases. But we missed one data type and format that can help to leverage all the world's data + organization's data + 1000s of use cases with 1000s of dimensions.

What can we get via AI Vectors?

Our Option-4: Leverage Vectors and Embedding Vectors in such a way that we can store trillions of records with 1000s of dimensions and with semantic meanings/relationships.

💡

See below how we can leverage Internet-sized data and 1000s of Dimensions via Vector data type + embeddings to build new AI agents.

In today's faster innovation cycles (innovations are faster in young or new startups vs big companies) with a free data world (internet data and other public data sets), Users or Customers demand or create new use cases. For those use cases, Data (Structured and Unstructured) and Dimensions play a big role in Vectors importance.

See below how Data, Dimensions, and Vectors play a role in the AI world.

Data - the key requirement:

Until we know the value of AI for the last few years, we have been focused on limited internal organization data sets.

So far, we have focused on billions of data records or objects internally at any organization
But with recent AI-pretrained data, it is all about what 8+ billion people (trillions and trillions of data records, text, images, audio, and videos) are being generated in this world on the internet and on other sources every day.

There is a clear contrast between having focused on just what any organization has done with their internal data vs what they can do with external data generated every day by 8+ billion people.

💡

Internal organization data: Typical billions of data records, text, images, audio, and videos.

💡

External internet data: Trillions and trillions of data records, text, images, audio, and videos.

The new external data with Trillions and trillions of data records, text, images, audio, and videos makes a big difference at the foundation level. That's what you get when you adopt AI models and try to build new AI applications.

💡

Literally, you would get a new life of software via understanding of people's minds, people's actions, people's patterns, people's interests, and people's needs from those Trillions and trillions of data records, text, images, audio, and videos.

Data Dimensions - the key requirement

Once we have the new data (Trillions and trillions of data records, text, images, audio, and videos) in hand via internal and external data sources, we have to look at users' or customers' needs. The typical needs of users or customers are too many and dynamic. When we try to convert those needs or requirements as technical dimensions, we are talking about combinations of billions of dimensions.

Though there were a number of innovations over the last few decades at various storage, databases, computing, and software levels, we still have limitations on the dimensionality of data retrieval.

💡

First, Machine Learning (ML) came into the picture over a decade ago to solve those challenges, and ML indeed solved certain dimensionality data retrieval side challenges, but that was not enough to satisfy typical user needs, such as writing a document or creating a video.

That means, data dimensions play a big role in satisfying today's users' or customers' needs since the 8+ billion people's needs/demands (in terms of a combination of billions of data dimensions) are much more than what ML can do.

💡

Then, Artificial Intelligence (AI) models were created recently to satisfy typical user needs, such as writing a document or creating a video.

When a software wants to write a document without human intervention, it needs a ton of text and needs to keep the text in proper dimensions and positions.

💡

If we keep trillions of text combinations in a database, it is not going to perform even with today's GPUs. That's when Vectors come into the picture to store text properly and retrieve text properly with many dimensions.

💡

Both big data and many many dimensions play a key role on making a software such as AI software or AI agent to satisfy today's users or customers needs.

See below 3072 dimensions importance

Vectors

Since storing petabytes of text and structured data in a traditional database and trying to retrieve via many dimensions is not feasible practically, Vectors came into the picture. They have to be expressed by both magnitude and direction to scale easily.

💡

A Vector is simply an array of floating numbers, a numerical representation such as [0.12, 0.45, 0.89], used to represent data (like a word, image, or person) in a way that machines can understand and process very fast. Vectors allow AI models to perform mathematical operations on complex real-world inputs like text, images, or audio, and allow them to retrieve multidimensional data very fast.

💡

Vectors are designed so that similar things have similar vectors (e.g., “cat” and “dog”).

Vector example: The text "king" is represented as an array [0.212, 0.534, 0.333, 0.751, 0.622, ...]

💡

Finally, we have a perfect data type and format in Vectors to store all the organization's data and the world's data (structured and unstructured).

Embedding

💡

Embedding is a specific type of AI vector that maps discrete data into a continuous space. It is a learned vector that maps symbolic data (like words or images) into a continuous space where semantic relationships are preserved.

Embedding captures meaning and relationships like:
king - man + woman ≈ queen

Embeddings are generated by neural networks that are trained to map input data (like words, sentences, images, or audio) into dense numerical vectors that capture their meaning or features in context via Embedding libraries and Embedding models.

💡

It's about storing trillions of vectors with meaning/context, relationships/distance, and dimensions appropriately (semantic relationships are preserved) and retrieving them dynamically based on the request or query.

As shown in the above image, the country and capital relationships are preserved based on the vector embeddings, so when the AI Agent wants to generate text, it has all the data with relationships in place appropriately.

With Embedding Vectors, we can store relationships, patterns, and semantic meanings.

The above image shows only 3 dimensions, but we need 1000s of dimensions to satisfy various user/customer needs/requests. See the image below with the 10 dimensions.

Typical user requests just based on Rome and Italy:

What is the capital of Italy?
What is the official capital of Italy?
Is Rome the capital of Italy?
What are the two capitals of Italy?
Why is Rome not the capital of Italy?
Do you know when Rome became the capital of Italy?
What is the largest city in Italy?
What became Italy's capital in 1871?
When did Rome become Italy's capital?
What was the capital of Italy before Rome?

I only added 10 related questions from the users, but we can expect 1000s of similar requests. Since we have 1000s of similar requests and content, we need 1000s of dimensions as well. Vectors help us to store the worldwide + organization's text/images/audios/videos in a way that we can keep relationships, semantic meaning, and then retrieve the right text/images/audios/videos.

Word Embedding Examples:

(simplified to 5 dimensions for illustration — real embeddings typically have 100–3072+ dimensions)

Word	Vector (simplified example, 5 dimensions)
king	`[0.21, 0.53, 0.33, 0.75, 0.62]`
queen	`[0.20, 0.51, 0.34, 0.72, 0.60]`
man	`[0.18, 0.49, 0.35, 0.70, 0.59]`
woman	`[0.19, 0.48, 0.36, 0.69, 0.61]`
apple	`[0.12, 0.25, 0.45, 0.15, 0.50]`
fruit	`[0.11, 0.27, 0.47, 0.13, 0.52]`

These vectors allow analogies like:
king - man + woman ≈ queen

Sentence Embeddings (from BERT/Sentence-BERT)

Sentence	Vector (1st 5 values out of 384)
"I love pizza."	`[0.135, -0.245, 0.089, 0.564, -0.002, ...]`
"Pizza is my favorite food."	`[0.130, -0.240, 0.095, 0.560, -0.004, ...]`

These vectors are very close, showing the model understands their similarity.

Image Embeddings (e.g., CLIP)

Input Image	Vector (first 5 of 512)
🐶 Dog photo	`[0.412, 0.219, -0.311, 0.096, 0.520, ...]`
🐕‍🦺 Another dog photo	`[0.408, 0.222, -0.307, 0.101, 0.519, ...]`
🚗 Car photo	`[0.020, -0.105, 0.244, 0.390, -0.011, ...]`

The dog vectors will be closer together than dog vs car.

Audio Embeddings (e.g., Whisper or Wav2Vec)

Sound	Vector (first 5 of 1024)
Voice saying "hello"	`[0.023, -0.119, 0.081, 0.044, 0.331, ...]`
Voice saying "hi"	`[0.025, -0.122, 0.084, 0.048, 0.328, ...]`

Again, similar audio → similar embeddings.

Typical Embedding Models and Use Cases:

Type	Model	Use Case
Word	Word2Vec, GloVe	Semantic word similarity, analogy
Sentence	BERT, SBERT	Sentence similarity, search
Image	CLIP	Vision-language alignment
Audio	Whisper, Wav2Vec	Speech recognition, speaker ID

Vector Databases & AI Models

A list of a few Vector Databases to store organizations' data as Vectors

ChromaDB
Milvus
Pinecone
Weaviate
Cloud & Traditional RDBMS databases have vector storage plugins
Search software such as ElasticSearch, OpenSearch has vector storage plugins

AI Models internally use vectors at all layers

Conclusion

Got the right Data Type and Format as Vectors finally across the organization and across the world data: Store all the structured data and unstructured data including internet data, in the right data type and format (Vectors) so that you can do a lot of automations and magic (automatic content generation, automatic tasks, automatic metrics etc).

Internet-sized data, augmented with the organization's proprietary data, can be converted into Vectors, or Embedding Vectors, to satisfy untapped users' or customers' demands/needs via 1000s of dimensions of queries or unlimited auto-generated text/audio/image/video.

Petabytes of data --> Trillions of vectors --> AI Models --> So many new use cases that satisfy the user or customer needs.

According to some sources, the RefinedWeb data source has more than 5 trillion tokens of textual data. That's just one data source. Therefore, only the imagination is the limit now with the new AI models and technologies.

Check out relevant topics

numpy.vectorize — NumPy v2.3.dev0 Manual

NumPy v2.3.dev0 Manual - Home

Tensors — PyTorch Tutorials 1.0.0.dev20181128 documentation

The World of Vectors in AI

Ravi Kalidindi

What are we doing to retrieve the data?

What can we get via AI Vectors?

Data - the key requirement:

Data Dimensions - the key requirement

Vectors

Embedding

Word Embedding Examples:

Sentence Embeddings (from BERT/Sentence-BERT)

Image Embeddings (e.g., CLIP)

Audio Embeddings (e.g., Whisper or Wav2Vec)

Typical Embedding Models and Use Cases:

Vector Databases & AI Models

Conclusion

Check out relevant topics

Check out other internal posts

Read more

Azure AI - Part 1

AI Use Cases

IBM Watsonx.ai

IBM Watsonx.governance