
The AI Engineer’s Playbook: Master Data Sources for Retrieval Systems (Part 1)

by Paolo Perrone, April 8th, 2025

Too Long; Didn't Read

Data Mapping 101: Modality, Velocity, and Source. To build a strong retrieval stack, you need a clear understanding of the data you have. The data must match your specific use case, as different problems require different types of data.



After spending the last seven years building ML systems across various industries, I’ve learned that successful vector retrieval systems always begin with a clear understanding of your data.


Too often, I’ve seen teams jump straight to selecting the latest vector database or embedding model without first mapping their data landscape — a mistake that inevitably leads to costly architecture changes down the road. During my time as a machine learning consultant for a major e-commerce platform, we had to rebuild our recommendation system three times because we didn’t properly account for our data’s velocity and modality requirements.


What began as a simple product recommendation engine eventually needed to handle real-time user behavior, multiple languages, and image-based similarity — challenges we could have anticipated with proper data mapping. This series distills what I’ve learned through those experiences. In this article, we’ll focus on understanding your data through three critical dimensions: modality, velocity, and source. These factors will shape every subsequent decision in your vector retrieval architecture.


Whether you’re building your first embedding-based system or looking to improve an existing one, starting with a clear data strategy will save you countless hours of refactoring and help you build a system that truly serves your specific use case. Let’s dive in.


Data Mapping 101: Modality, Velocity, and Source

To build a strong retrieval stack, you need a clear understanding of the data you have, where it comes from, and how it fits into your system.


The data must match your specific use case, as different problems require different types of data. For example, a personalized movie recommendation system depends on customer preferences and watch history, while a fraud detection system depends on transactional data.


An organization’s data is shaped by three key factors:

  • Modality: Structured vs. unstructured data

  • Velocity: Real-time streams vs. batch updates

  • Source: Internal systems vs. third-party providers


How your data fits within these dimensions directly impacts how you can use it. Pinterest provides a great example of how different data types are managed across different dimensions.


Their system handles:

  • Structured data (e.g., user profiles): Consistent formats with defined attributes
  • Semi-structured data (e.g., event logs): Flexible structures with varying fields and contextual details
  • Unstructured data (e.g., images): Raw content without a fixed schema, requiring advanced processing


Pinterest Tech Stack


Here’s how this stack works:


  • Event Streaming Platform: Pinterest uses Flink to capture and process real-time data from user interactions and content uploads. These events are semi-structured, typically in JSON format but with varying content (see the sketch after this list)

  • Data Storage and Querying: For persistent storage, Pinterest uses DynamoDB, a NoSQL database. This supports both real-time analytics on semi-structured data and quick access to structured data like user profiles

  • Data Analysis: Pinterest’s engineers rely on Querybook, a collaborative big data hub that organizes queries and insights in DataDocs, to detect trends and refine recommendation algorithms
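
To make the semi-structured case concrete, here is a minimal sketch (not Pinterest’s actual code) of normalizing a JSON interaction event into fixed columns before storage. The event shape and field names are assumptions for illustration only.

```python
import json
from datetime import datetime, timezone

# Illustrative only: the event shape and field names are assumptions, not Pinterest's schema.
raw_event = '{"user_id": "u_42", "type": "pin_save", "ts": 1712534400, "meta": {"board": "recipes"}}'

def normalize_event(payload: str) -> dict:
    """Parse a semi-structured JSON event and coerce it into a fixed set of columns."""
    event = json.loads(payload)
    return {
        "user_id": event["user_id"],
        "event_type": event.get("type", "unknown"),
        "event_time": datetime.fromtimestamp(event["ts"], tz=timezone.utc).isoformat(),
        # Loosely structured context is kept as a JSON string column rather than forced into a schema.
        "context": json.dumps(event.get("meta", {})),
    }

print(normalize_event(raw_event))
```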


Now, let’s break down how data velocity and modality impact your retrieval system.

Data Velocity

Data processing speed is crucial for defining what retrieval and compute tasks you can perform. Different speeds unlock different use cases. Here are the three main categories:

1 — Batch Processing

Data is processed in large groups at scheduled intervals, typically daily or weekly.

Batch Processing: Key components and use cases


2 — Stream Processing

Data is processed immediately as it is generated, ensuring quick responses and updates.

Stream Processing: Key components and use cases


3 — Micro-Batch Processing

Data is processed in small, discrete batches at frequent intervals, typically ranging from a few seconds to a minute.

Micro-Batch Processing: Key components and use cases
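
To make the contrast concrete, here is a minimal, framework-free sketch of micro-batching: events are buffered and handed off in small groups on a timer, sitting between per-event stream processing and large scheduled batch jobs. The interval and handler are illustrative; production systems would typically use Spark Structured Streaming, Flink, or similar.

```python
import time
from typing import Callable

def micro_batch(source, handler: Callable[[list], None], interval_s: float = 5.0):
    """Buffer incoming events and hand them off in small batches every `interval_s` seconds."""
    buffer, last_flush = [], time.monotonic()
    for event in source:                      # `source` is any iterable or stream of events
        buffer.append(event)
        if time.monotonic() - last_flush >= interval_s:
            handler(buffer)                   # e.g., embed and upsert into a vector index
            buffer, last_flush = [], time.monotonic()
    if buffer:
        handler(buffer)                       # flush whatever is left when the stream ends

# Usage with a toy in-memory stream; a real source would arrive over time.
micro_batch((f"event-{i}" for i in range(100)), handler=lambda batch: print(len(batch), "events"))
```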


The Velocity vs Complexity Tradeoff

Most production systems mix stream and batch processing to balance real-time updates with the value of historical data. However, this approach comes with challenges in maintaining consistency across both systems over time. When choosing data sources for a vector retrieval system, consider the following:


(1) Streaming

Provides real-time vector computations but limits model complexity due to latency constraints. It typically requires simpler embeddings or lookups and uses architectures like Recurrent Neural Networks (RNNs), shallow neural networks, and indexing/retrieval models.


(2) Batch

Batch processing supports complex models but updates are less frequent, making it ideal for asynchronous needs. It features large pre-trained transformers, custom deep learning models, and architecture search models.


(3) Hybrid

Hybrid approaches combine streaming filters or lookups with batch retraining, balancing responsiveness with in-depth analysis. This can involve indexing with periodic retraining or deploying a two-stage filtering and analysis setup.


The Velocity Complexity Tradeoff


The architecture of your vector search system is shaped by how you balance velocity and complexity. Your decisions depend on two key factors:


  • Synchronicity Needs: Do you need real-time updates for your vector embeddings, or are periodic updates (like daily or weekly) enough?
  • Model Accuracy: Is high accuracy with real-time data essential, or is near-real-time performance acceptable? For example, fraud detection requires real-time updates to minimize risk, while recommendation systems can often work with slightly delayed data.


Comparison of Streaming, Batch, and Hybrid data approaches


The velocity-complexity tradeoff has been one of the most challenging aspects of deploying vector systems at scale. In my early projects, we built sophisticated embedding models that performed beautifully in the lab but couldn’t meet our latency requirements in production.


We eventually developed a two-tier approach: using lightweight models for real-time embedding generation while periodically refreshing a more complex model for high-value content.
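
Here is a minimal sketch of that two-tier pattern, with hypothetical `fast_model` and `heavy_model` encoders standing in for whatever lightweight and heavyweight models you use; it is not tied to any specific library.

```python
from typing import Protocol, Sequence

class Encoder(Protocol):
    def encode(self, texts: Sequence[str]) -> list:  # returns one vector per input text
        ...

class TwoTierEmbedder:
    """Serve real-time requests with a small model; refresh high-value items with a large one offline."""

    def __init__(self, fast_model: Encoder, heavy_model: Encoder):
        self.fast_model = fast_model          # lightweight, low-latency encoder
        self.heavy_model = heavy_model        # slower, higher-quality encoder
        self.cache: dict[str, list] = {}      # item_id -> vector from the heavy model

    def embed_online(self, item_id: str, text: str) -> list:
        # Low-latency path: prefer a cached heavy embedding, else compute a cheap one now.
        return self.cache.get(item_id) or self.fast_model.encode([text])[0]

    def refresh_offline(self, items: dict[str, str]) -> None:
        # Periodic batch job (e.g., nightly) that re-embeds high-value content only.
        vectors = self.heavy_model.encode(list(items.values()))
        self.cache.update(dict(zip(items.keys(), vectors)))
```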


The trade-off between velocity and complexity is just one piece. You also need to balance model size with response time.

Model Size vs. Response Time Tradeoff

Balancing model size with response time is crucial for optimizing the efficiency of a vector search system. Different types of architectures offer distinct trade-offs between size and speed:


(1) Streaming

Streaming models are typically compact, under 100MB in size. They include shallow networks (with fewer than 10 layers), distillation models, and efficient models like MiniLM or TinyML. These models are optimized for real-time processing but are limited in complexity.


(2) Batch

Batch models usually exceed 1GB in size and include large transformers (BERT, GPT-3/4, T5) and custom deep networks. These models can handle more complex tasks but are slower due to their size.


(3) Hybrid

Hybrid architectures combine smaller streaming models with larger batch models to balance speed and complexity. By using ensemble or stacked models, you can take advantage of both approaches, optimizing performance for a variety of needs.


The key is finding the right balance between model size and response time, which impacts both efficiency and accuracy in your vector search system.
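
As a rough illustration of that gap, the sketch below times a compact MiniLM encoder against a larger transformer. It assumes the `sentence-transformers` package and these public model names; treat it as a template rather than a benchmark.

```python
# Rough latency comparison between a compact streaming-friendly encoder and a larger one.
# Assumes the `sentence-transformers` package; swap in whatever models you actually use.
import time
from sentence_transformers import SentenceTransformer

texts = ["vector search needs fast embeddings"] * 32

small = SentenceTransformer("all-MiniLM-L6-v2")    # tens of MB, suited to the streaming tier
large = SentenceTransformer("all-mpnet-base-v2")   # hundreds of MB, better quality, slower

for name, model in [("small", small), ("large", large)]:
    start = time.perf_counter()
    vectors = model.encode(texts)
    print(f"{name}: {vectors.shape} in {time.perf_counter() - start:.2f}s")
```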

Kappa vs. Lambda Architecture

As we discuss how to balance speed and complexity, it’s also important to consider the architecture model you choose for handling real-time and historical data. Two popular approaches for this are Kappa and Lambda architectures.


(1) Lambda

Lambda combines both real-time stream processing and batch processing. The data is ingested in two layers: the batch layer deals with historical data, while the speed layer focuses on real-time data. This is ideal for applications that need both real-time data processing and a more comprehensive view of historical trends. Think of things like fraud detection or systems that need a lot of historical context to make decisions.


(2) Kappa Architecture

Kappa simplifies things by processing all data as a continuous stream, where historical data is processed in the same way as new data. Kappa is ideal when speed is crucial and historical depth is not as important.
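
The difference is easiest to see as code paths. Below is a schematic sketch (not a production pipeline) in which `score` stands in for whatever per-key computation your system performs: Lambda keeps separate batch and speed paths and merges them, while Kappa reprocesses everything as a single stream.

```python
# Schematic contrast only; `score` stands in for whatever per-key computation your system does.

def score(event: dict) -> float:
    return float(event["value"])               # placeholder computation

def lambda_style(historical_events: list[dict], live_events: list[dict]) -> dict:
    """Two code paths: a batch view over history plus a speed layer over recent events."""
    batch_view = {e["key"]: score(e) for e in historical_events}   # recomputed on a schedule
    speed_view = {e["key"]: score(e) for e in live_events}         # updated continuously
    return {**batch_view, **speed_view}        # the speed layer overrides stale batch values

def kappa_style(event_log: list[dict]) -> dict:
    """One code path: history and fresh data are just earlier and later positions in the same stream."""
    view = {}
    for event in event_log:                    # reprocessing history == replaying the log
        view[event["key"]] = score(event)
    return view
```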


Comparison of Lambda vs. Kappa Architecture


I’ve implemented both approaches across different projects. One financial services client initially chose Lambda for their recommendation system, maintaining separate pipelines for historical and real-time data. However, when they updated their embedding models quarterly, they faced a significant challenge: all historical embeddings needed recomputation to maintain consistency. This eventually led us to migrate toward a more Kappa-like approach where all data (historical and new) flowed through the same processing pipeline, simplifying model updates considerably.


Data Modality

This is the type of data being handled (structured, semi-structured, or unstructured). Each modality requires different processing strategies.

Unstructured Data

Unstructured data refers to any information that does not follow a predefined format or structure. It is typically raw, unordered, and can often be noisy, making it more challenging to process and analyze. This data type can come in many forms, and here are some key examples.


Text Data


Image Data


Audio Data


Video Data


Structured Data

Structured data follows predefined formats, with clearly defined categories and fields, making it much easier to handle and query. Most enterprise systems rely heavily on structured data for decision-making and operational processes.


Tabular Data

  • Example Data: Sales records, customer information, financial statements.

  • Typical Formats: CSV, Excel spreadsheets, SQL databases.

  • Datasets: Kaggle Datasets: a wide range of structured datasets covering various domains; UCI Machine Learning Repository: many structured datasets for machine learning.

  • Considerations: Data quality, missing values, and the choice of variables relevant to your analysis (feature selection). You may need to preprocess data and address issues such as normalization and encoding of categorical variables (see the sketch after this list).

  • Systems: Structured data often lives in relational database management systems (RDBMS) like MySQL, PostgreSQL, or cloud-based solutions like AWS RDS.
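
A minimal preprocessing sketch for tabular data, assuming pandas and scikit-learn and a toy DataFrame with made-up columns: numeric fields are imputed and scaled, and categoricals are one-hot encoded.

```python
# Minimal tabular preprocessing: impute and scale numerics, one-hot encode categoricals.
# Assumes pandas and scikit-learn; the column names and values are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "price": [19.9, 45.0, None, 12.5],
    "units": [3, 1, 7, 2],
    "category": ["toys", "books", "toys", "garden"],
})

preprocess = ColumnTransformer([
    ("numeric", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["price", "units"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["category"]),
])

features = preprocess.fit_transform(df)   # numeric matrix, ready for modeling or embedding
print(features.shape)
```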


Graph Data

  • Example Data: Social networks, organizational hierarchies, knowledge graphs.
  • Typical Formats: Graph databases (e.g., Neo4j), edge-list or adjacency matrix representation.
  • Datasets: Stanford Network Analysis Project (SNAP): offers a collection of real-world network datasets; KONECT: provides a variety of network datasets for research.
  • Considerations: In graph data, consider the types of nodes, edges, and their attributes. Pay attention to graph algorithms for traversing, analyzing, and extracting insights from the graph structure.
  • Systems: Graph data is often stored in graph databases like Neo4j, ArangoDB, or Apollo, but it can also be represented using traditional RDBMS with specific schemas for relations (see the sketch after this list).
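
A small illustration, assuming the `networkx` package and a made-up social graph, showing both representations mentioned above (edge list and adjacency matrix) plus simple node-level features.

```python
# Toy social graph with networkx; node names are made up.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([("alice", "bob"), ("bob", "carol"), ("carol", "alice"), ("carol", "dave")])

# Node-level attributes that often feed graph features or embeddings.
degrees = dict(graph.degree())                 # e.g., {"alice": 2, "bob": 2, "carol": 3, "dave": 1}
centrality = nx.betweenness_centrality(graph)

# The two representations mentioned above: edge list and adjacency matrix (requires numpy).
edge_list = list(graph.edges())
adjacency = nx.to_numpy_array(graph)

print(degrees, centrality, adjacency.shape)
```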


Time Series Data

  • Example Data: Stock prices, weather measurements, sensor data.
  • Typical Formats: CSV, JSON, time-series databases (e.g., InfluxDB).
  • Datasets: Federal Reserve Economic Data (FRED): covers economic and research data from various countries, including the USA, Germany, and Japan; The Google Trends Dataset.
  • Considerations: Time series data requires dealing with temporal aspects, seasonality, trends, and handling irregularities. It may involve time-based feature engineering and modeling techniques such as ARIMA or sequential models like LSTMs (see the sketch after this list).
  • Systems: Time series data can be stored in specialized time-series databases (e.g., InfluxDB, TimescaleDB, KX) or traditional databases with timestamp columns.
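
A minimal sketch with pandas on synthetic sensor readings: resample to a regular grid, then add simple lag and rolling features of the kind a sequential model would consume.

```python
# Synthetic temperature readings every 15 minutes, regularized and turned into simple features.
import numpy as np
import pandas as pd

rng = pd.date_range("2024-01-01", periods=96, freq="15min")
series = pd.Series(20 + np.random.randn(96).cumsum(), index=rng, name="temperature")

hourly = series.resample("1h").mean()                 # regularize to an hourly grid
features = pd.DataFrame({
    "temperature": hourly,
    "lag_1h": hourly.shift(1),                        # simple temporal features
    "rolling_6h_mean": hourly.rolling(window=6).mean(),
})
print(features.dropna().head())
```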


Spatial Data

  • Example Data: Geographic information, maps, GPS coordinates.
  • Typical Formats: Shapefiles (SHP), GeoJSON, GPS coordinates in CSV.
  • Datasets: Natural Earth Data: offers free vector and raster map data; OpenStreetMap (OSM) Data: provides geospatial data for mapping and navigation.
  • Considerations: Spatial data often involves geographic analysis, mapping, and visualization. Understanding coordinate systems, geospatial libraries, and map projections is important (see the loading sketch after this list).
  • Systems: Spatial data can be stored in specialized Geographic Information Systems (GIS) or in databases with spatial extensions (e.g., PostGIS for PostgreSQL).
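
A minimal sketch of reading point features out of a GeoJSON file with only the standard library; `places.geojson` is a hypothetical file, and real projects typically reach for geopandas or shapely instead.

```python
# Extract point features from a GeoJSON FeatureCollection; `places.geojson` is hypothetical.
import json

with open("places.geojson") as f:
    collection = json.load(f)

points = [
    {
        "name": feature["properties"].get("name", "unknown"),
        "lon": feature["geometry"]["coordinates"][0],   # GeoJSON stores [longitude, latitude]
        "lat": feature["geometry"]["coordinates"][1],
    }
    for feature in collection["features"]
    if feature["geometry"]["type"] == "Point"
]
print(points[:3])
```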


Logs Data

  • Example Data: System event logs that monitor traffic to an application, detect issues, and record errors causing a system to crash; user behaviour logs that track the actions a user takes on a website or when signed into a device.
  • Typical Formats: Common Log Format (CLF) or a custom text or binary file containing ordered (timestamp, action) pairs (see the parsing sketch after this list).
  • Datasets: loghub: A large collection of different system log datasets for AI-driven log analytics.
  • Considerations: How long you want to save the log interactions and what you want to use them for — i.e. understanding where errors occur, defining “typical” behaviour — are key considerations for processing this data. For further details on what to track and how, see this Tracking Plan course from Segment.
  • Systems: There are plenty of log management tools, for example Better Stack, which has a pipeline set up for ClickHouse, allowing real-time processing, or Papertrail, which can ingest syslog and text log file formats from Apache, MySQL, and Ruby.
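
A minimal parsing sketch for Common Log Format lines using only the standard library; the sample line is made up, and real pipelines would usually sit behind one of the log management tools above.

```python
# Parse a Common Log Format (CLF) line into named fields; the sample line is illustrative.
import re

CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '203.0.113.7 - - [08/Apr/2025:10:12:42 +0000] "GET /checkout HTTP/1.1" 500 1043'

match = CLF_PATTERN.match(line)
if match:
    record = match.groupdict()
    # Ordered (timestamp, action) pairs like these feed error analysis or behaviour profiling.
    print(record["timestamp"], record["request"], record["status"])
```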


Embedding Strategies Across Data Modalities

When building a vector retrieval system, we need to convert our data into numerical vectors (embeddings) that capture the essential characteristics of each item. The key aspect is that different data types require fundamentally different embedding strategies.


Text embeddings typically capture semantic meaning through contextual understanding, while image embeddings must identify visual features, patterns, and objects. Audio embeddings need to represent both temporal patterns and frequency characteristics. These differences affect not just model selection but also retrieval performance — text embeddings often benefit from approximate nearest neighbor search methods like HNSW, while visual embeddings might perform better with other indexing approaches.


The preprocessing pipeline also varies significantly — text might need tokenization and normalization, while images require resizing and augmentation before embedding.
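
As a sketch of how those pipelines diverge, the snippet below normalizes and tokenizes text while resizing images to a fixed resolution before embedding. It assumes Pillow for the image path; the stopword list and target size are illustrative.

```python
# Modality-specific preprocessing before embedding: text is normalized and tokenized,
# images are resized to the fixed resolution most vision encoders expect. Assumes Pillow.
import re
from PIL import Image

STOPWORDS = {"the", "a", "an", "of"}              # illustrative, not a complete list

def preprocess_text(text: str) -> list[str]:
    """Lowercase, strip punctuation, and drop trivial tokens before text embedding."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def preprocess_image(path: str, size: tuple[int, int] = (224, 224)) -> Image.Image:
    """Load an image and resize it to the encoder's expected input size."""
    return Image.open(path).convert("RGB").resize(size)

print(preprocess_text("The anatomy of a retrieval system"))
# image = preprocess_image("product.jpg")        # hypothetical file path
```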


Conclusion: Start with Data, Not Technology

Building a successful vector retrieval system starts with a clear understanding of your goals, data sources, and how they work together in your system.


For organizations just starting with vector retrieval, I’d recommend beginning with a clear mapping of your data landscape across these dimensions before making any technology decisions. This foundation will help avoid costly architecture changes later. Consider these practical steps:


  1. Audit your data sources — Identify what data you have access to, its quality, and completeness. Determine if it’s sufficient for your use case or if you need additional sources

  2. Assess data velocity requirements — Be honest about your true latency needs. Many applications can succeed with micro-batch processing (minutes) rather than true real-time (milliseconds), allowing for more sophisticated models

  3. Evaluate data modality complexity — Different data types require different embedding approaches. Start with simpler, homogeneous data before attempting multimodal systems.

  4. Start small and iterate — Begin with a focused use case and a subset of your data. This allows you to validate your approach before scaling your architecture.


This data-first approach will save significant time and resources compared to starting with a technology stack and trying to fit your data into it.


In this article, we’ve covered how to map out your data infrastructure and identify the key inputs to your vector retrieval stack.


Next, we’ll explore how Vector Compute connects your data to your Vector Search systems.
