I remember the day our single-process sentiment analysis pipeline finally buckled under a surge of requests. The logs were ominous: thread pools jammed, batch jobs stalled, and memory soared. That's when we decided to break free of our monolithic design and rebuild everything from scratch. In this post, I'll show how we pivoted to microservices, leaning on Kubernetes, GPU-aware autoscaling, and a streaming ETL pipeline to handle massive volumes of social data in near real time.
Originally, our sentiment analysis stack was one big codebase covering data ingestion, tokenization, model inference, logging, and storage. It worked well until traffic shot up, forcing us to over-provision every single component just to keep the busiest one afloat. Updates were even worse: redeploying the entire application just to patch the inference model felt wasteful.
By switching to microservices, we isolated each function: data ingestion, tokenization, model inference, logging, and storage now each run as their own service. We can scale each piece independently, adding capacity exactly where the bottleneck is.
Our first big step was containerization. Let’s look at a Dockerfile for the GPU-enabled inference service:
FROM nvidia/cuda:11.6.2-cudnn8-devel-ubuntu20.04
WORKDIR /app
# Install Python and system dependencies
RUN apt-get update && \
    apt-get install -y python3 python3-pip git && \
    rm -rf /var/lib/apt/lists/*
RUN python3 -m pip install --upgrade pip
# Copy requirements first for layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy project files
COPY . .
EXPOSE 5000
CMD ["python3", "sentiment_inference.py"]
This base image ships the CUDA toolkit and cuDNN libraries needed for GPU acceleration (the GPU driver itself comes from the host node). Once we build the image and push it to a registry, it's ready for orchestration.
With Kubernetes (K8s), we can deploy and scale each microservice independently. We pin inference pods to GPU-backed node types and autoscale them based on GPU utilization:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sentiment-inference-gpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sentiment-inference-gpu
  template:
    metadata:
      labels:
        app: sentiment-inference-gpu
    spec:
      nodeSelector:
        node.kubernetes.io/instance-type: "g4dn.xlarge"
      containers:
        - name: inference-container
          image: myrepo/sentiment-inference:gpu-latest
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "2"
            requests:
              nvidia.com/gpu: 1
              memory: "4Gi"
              cpu: "1"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sentiment-inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sentiment-inference-gpu
  minReplicas: 2
  maxReplicas: 15
  metrics:
    - type: Pods
      pods:
        metric:
          name: nvidia_gpu_utilization
        target:
          type: AverageValue
          averageValue: "70"
Whenever average GPU utilization across the inference pods crosses the 70% target, Kubernetes spins up additional pods (up to 15) and scales back down during quiet periods, so we aren't paying for idle GPUs. Note that nvidia_gpu_utilization is a custom pod metric: the HPA can only see it if GPU metrics are exposed through something like NVIDIA's DCGM exporter plus a custom-metrics adapter (for example, the Prometheus Adapter).
Running a separate inference call for every request throttles throughput, so we batch multiple requests together for better GPU utilization:
import asyncio
from fastapi import FastAPI, Request
from threading import Thread
from queue import Queue
import tensorrt as trt

app = FastAPI()

REQUEST_QUEUE = Queue(maxsize=10000)
BATCH_SIZE = 32

TRT_LOGGER = trt.Logger(trt.Logger.ERROR)
engine_path = "models/sentiment_model.trt"

def load_trt_engine():
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_trt_engine()

def inference_worker():
    while True:
        # Block until at least one request arrives, then drain whatever else is queued.
        batch = [REQUEST_QUEUE.get()]
        while len(batch) < BATCH_SIZE and not REQUEST_QUEUE.empty():
            batch.append(REQUEST_QUEUE.get_nowait())
        texts = [item["text"] for item in batch]
        scores = run_tensorrt_inference(engine, texts)  # helper defined elsewhere; runs the whole batch in one GPU pass
        for item, score in zip(batch, scores):
            # Futures must be resolved on their own event loop's thread.
            item["loop"].call_soon_threadsafe(item["future"].set_result, score)

Thread(target=inference_worker, daemon=True).start()

@app.post("/predict")
async def predict(req: Request):
    body = await req.json()
    text = body.get("text", "")
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    REQUEST_QUEUE.put({"text": text, "future": future, "loop": loop})
    result = await future
    return {"sentiment": "positive" if result > 0.5 else "negative"}
This strategy keeps GPU resources humming efficiently, leading to dramatic throughput gains.
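For a quick sanity check from the outside, here's a minimal client sketch; the host is a placeholder, and port 5000 matches the EXPOSE in the Dockerfile above:

import requests

# Placeholder host; in-cluster you would hit the service's DNS name instead.
resp = requests.post(
    "http://localhost:5000/predict",
    json={"text": "Loving the new release, great job team!"},
    timeout=5,
)
print(resp.json())  # e.g. {"sentiment": "positive"}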
We also needed to handle high-volume social data ingestion. Our pipeline uses Kafka for streaming, Spark for real-time transformation, and Redshift for storage.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col, udf
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("TwitterETLPipeline").getOrCreate()

schema = StructType([
    StructField("tweet_id", StringType()),
    StructField("text", StringType()),
    StructField("user", StringType())
])

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") \
    .option("subscribe", "tweets") \
    .option("startingOffsets", "latest") \
    .load()

parsed_df = df.select(from_json(col("value").cast("string"), schema).alias("tweet"))

def custom_preprocess(txt):
    return txt.replace("#", "").lower()

udf_preprocess = udf(custom_preprocess, StringType())

clean_df = parsed_df.select(
    col("tweet.tweet_id").alias("id"),
    udf_preprocess(col("tweet.text")).alias("clean_text"),
    col("tweet.user").alias("username")
)

query = clean_df \
    .writeStream \
    .outputMode("append") \
    .format("console") \
    .start()

query.awaitTermination()
Spark picks up raw tweets from Kafka, cleans them, and sends them on to be stored or scored. We can scale both Kafka and Spark to accommodate millions of tweets per hour.
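The example above writes to the console for demonstration. To actually land the cleaned records in Redshift, one option is a foreachBatch sink that writes each micro-batch over JDBC; the sketch below assumes the Redshift JDBC driver is on the Spark classpath, and the cluster endpoint, table, and credentials are placeholders:

def write_batch_to_redshift(batch_df, batch_id):
    # Placeholder connection details; swap in your cluster endpoint and credentials.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:redshift://example-cluster:5439/analytics")
        .option("dbtable", "clean_tweets")
        .option("user", "etl_user")
        .option("password", "********")
        .option("driver", "com.amazon.redshift.jdbc42.Driver")
        .mode("append")
        .save())

query = clean_df \
    .writeStream \
    .foreachBatch(write_batch_to_redshift) \
    .option("checkpointLocation", "s3://my-bucket/checkpoints/tweets") \
    .start()

query.awaitTermination()

At higher volumes, the dedicated spark-redshift connector, which stages data in S3 and loads it with COPY, is usually a better fit than row-level JDBC inserts.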
Early on, we hit puzzling memory issues because our GPU resource limits were mismatched with the physical hardware on the nodes, and pods crashed seemingly at random under load. We also learned that tuning batch sizes is a balancing act: for internal analytics jobs, bigger batches maximize throughput; for end-user requests, we keep them modest to minimize latency.
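One way to keep that trade-off explicit is to drive the batching knobs from the environment, so the analytics and user-facing deployments run the same image with different settings. A rough sketch; MAX_BATCH_WAIT_MS is a hypothetical knob, not part of the service shown earlier:

import os

# Analytics pods set larger values; user-facing pods keep them small to bound queueing delay.
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "32"))
MAX_BATCH_WAIT_MS = int(os.getenv("MAX_BATCH_WAIT_MS", "10"))  # hypothetical cap on time spent assembling a batch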
By weaving together microservices, GPU acceleration, and a streaming-first ETL architecture, we transformed our old monolith into a high-octane sentiment pipeline that laughs off 50K RPS. It's not just about speed: batched inference keeps resource waste to a minimum, and the flexible ETL pipeline lets us absorb surging data volumes in real time. Gone are the days of over-provisioning everything or redeploying the whole stack just to fix a single inference bug. With a robust containerized approach, each service scales on its own terms, keeping the entire stack lean, reliable, and ready for the next traffic spike. If you've been feeling the pinch of a bogged-down monolith, now's the time to rev the engine with microservices and real-time data flows.