Gabriel Caiana

Building an Async AI Pipeline with Bedrock + SQS on AWS


Table of contents
  1. The Real Architecture: Why Everything Is Async
  2. The Two-World Problem: LocalStack + Real Bedrock
  3. Titan Embeddings: Semantic Search Without OpenAI
  4. Streaming: When AI Needs to Feel Like a Conversation
  5. Error Handling Nobody Talks About
  6. What I’d Change Starting Today
  7. Decision Summary

In the previous article I explained why I chose Amazon Bedrock over OpenAI to build Sovereign Architect, my career platform for developers: cost, control via IAM, consistency inside AWS.

This article is about the how.

Because “calling an AI API” is the easy part. The hard part is doing it inside a real product, with tasks that take 30+ seconds, without blocking the user, with retry, with fallback, with local dev working, and with predictable costs. Here’s what I learned building it.

The Real Architecture: Why Everything Is Async

The first important decision: no AI call is synchronous.

The product receives a PDF résumé, extracts data, generates a gap analysis comparing it against job postings, and returns a personalized roadmap. Each step has different latency:

  • Extract résumé: ~30 seconds
  • Gap analysis: ~25 seconds
  • Generate roadmap: up to 45 seconds

If you try to do this inside an HTTP request, the user sees an infinite loading screen, the ALB times out, and the experience becomes unusable. The solution was async processing via SQS.

The actual flow:

  1. User uploads résumé → API saves to S3, creates an import_job in RDS, publishes ai.process_profile to SNS
  2. SNS distributes to SQS queue ai-processing-worker-queue
  3. Worker consumes, downloads PDF from S3, extracts text, calls Bedrock (Haiku), validates JSON with Zod, persists in RDS
  4. Worker publishes profile.processed to SNS
  5. API receives via WebSocket/SSE and updates the frontend in real time

The user sees a “processing…” screen and within 30 seconds the full profile appears. No HTTP timeouts, no infinite loading, no stuck requests.
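
Here’s a minimal sketch of the worker side of that flow (steps 2–4), in TypeScript with the AWS SDK v3. extractTextFromPdf, profileSchema, and saveProfile are placeholders for the real PDF extraction, Zod schema, and RDS persistence:

// Worker loop (sketch): consume → download → extract → Bedrock (Haiku) → validate → persist → notify
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { SNSClient, PublishCommand } from "@aws-sdk/client-sns";
import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";

const sqs = new SQSClient({});
const s3 = new S3Client({});
const sns = new SNSClient({});
const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

async function pollOnce(queueUrl: string, resultTopicArn: string) {
  const { Messages = [] } = await sqs.send(
    new ReceiveMessageCommand({ QueueUrl: queueUrl, MaxNumberOfMessages: 1, WaitTimeSeconds: 20 }),
  );

  for (const message of Messages) {
    // SNS → SQS delivery wraps the original event in an envelope
    const event = JSON.parse(JSON.parse(message.Body!).Message);

    // 1. Download the résumé from S3 and extract the raw text
    const pdf = await s3.send(new GetObjectCommand({ Bucket: event.bucket, Key: event.key }));
    const text = await extractTextFromPdf(await pdf.Body!.transformToByteArray());

    // 2. Structured extraction with Claude Haiku on Bedrock
    const raw = await bedrock.send(new InvokeModelCommand({
      modelId: "anthropic.claude-3-haiku-20240307-v1:0",
      contentType: "application/json",
      accept: "application/json",
      body: JSON.stringify({
        anthropic_version: "bedrock-2023-05-31",
        max_tokens: 2048,
        messages: [{ role: "user", content: `Extract this résumé as JSON:\n\n${text}` }],
      }),
    }));
    const completion = JSON.parse(new TextDecoder().decode(raw.body));

    // 3. Validate the model output with Zod before it touches RDS
    const profile = profileSchema.parse(JSON.parse(completion.content[0].text));
    await saveProfile(event.userId, profile);

    // 4. Tell the API the job is done so it can push the update over WebSocket/SSE
    await sns.send(new PublishCommand({
      TopicArn: resultTopicArn,
      Message: JSON.stringify({ type: "profile.processed", userId: event.userId }),
    }));

    await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: message.ReceiptHandle! }));
  }
}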

There’s a bonus too: SQS naturally absorbs traffic spikes and error retries. More on that below.

The Two-World Problem: LocalStack + Real Bedrock

Here’s a trap I fell into before I understood what was happening.

In local development, I use LocalStack to emulate SQS, SNS, and S3. It works great: no cost, no need for an AWS account, no risk of accidentally hitting real infrastructure. But Bedrock has no LocalStack emulation. There’s no “fake local Bedrock.” You need to call the real Bedrock, with real credentials.

The problem appeared when I set AWS_ENDPOINT_URL=http://localhost:4566 in environment variables to point to LocalStack. The AWS SDK intercepts that variable for all clients including BedrockRuntimeClient. Suddenly the worker was trying to call Bedrock at LocalStack, which doesn’t exist there, and failing silently.

The solution was to separate credentials and never use AWS_ENDPOINT_URL globally:

# Credentials for SQS/SNS/S3 → LocalStack in dev
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
LOCALSTACK_ENDPOINT=http://localhost:4566   # custom var, NOT intercepted by the SDK

# Credentials for Bedrock → real AWS (STS temporary)
BEDROCK_AWS_ACCESS_KEY_ID=ASIA...
BEDROCK_AWS_SECRET_ACCESS_KEY=...
BEDROCK_AWS_SESSION_TOKEN=...

The worker creates two sets of clients: one with the LocalStack endpoint explicitly passed in the constructor, another with STS credentials for Bedrock. Both coexist in the same process without interference.
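
In code, the two client sets look roughly like this (a sketch; the variable names match the snippet above, everything else is illustrative):

import { SQSClient } from "@aws-sdk/client-sqs";
import { SNSClient } from "@aws-sdk/client-sns";
import { S3Client } from "@aws-sdk/client-s3";
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";

// SQS/SNS/S3 → LocalStack in dev, real AWS in prod. The endpoint comes from a custom
// variable the SDK never reads on its own, so nothing else gets hijacked.
const localstack = process.env.LOCALSTACK_ENDPOINT; // e.g. http://localhost:4566; unset in production

export const sqs = new SQSClient({ region: "us-east-1", endpoint: localstack });
export const sns = new SNSClient({ region: "us-east-1", endpoint: localstack });
export const s3 = new S3Client({
  region: "us-east-1",
  endpoint: localstack,
  forcePathStyle: Boolean(localstack), // LocalStack needs path-style S3 URLs
});

// Bedrock → always real AWS. In dev, the STS temporary credentials come from the BEDROCK_* vars;
// in production `credentials` stays undefined and the ECS Task Role takes over.
export const bedrock = new BedrockRuntimeClient({
  region: "us-east-1",
  credentials: process.env.BEDROCK_AWS_ACCESS_KEY_ID
    ? {
        accessKeyId: process.env.BEDROCK_AWS_ACCESS_KEY_ID,
        secretAccessKey: process.env.BEDROCK_AWS_SECRET_ACCESS_KEY!,
        sessionToken: process.env.BEDROCK_AWS_SESSION_TOKEN,
      }
    : undefined,
});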

Every time STS credentials expire (max 12 hours), I renew:

aws sts get-session-token --duration-seconds 43200

It’s a local dev friction that disappears in production: the ECS Task uses an IAM Role automatically, so there’s no token to rotate.

Titan Embeddings: Semantic Search Without OpenAI

There’s a less obvious part of the pipeline: the initial match between profile and job posting doesn’t use Claude. It uses Titan Text Embeddings v2.

The reason is cost and speed. Before running the gap analysis (Sonnet, expensive), I do a quick score based on semantic similarity via embeddings. I convert the user profile into a 1024-dimension vector, convert the job posting into another vector, and calculate cosine similarity via pgvector in PostgreSQL.

This is cheap, fast, and gives me a baseline score from 0 to 100 before any LLM call.

SELECT 1 - (profile_embedding <=> $1::vector) AS similarity
FROM user_profiles WHERE user_id = $2

This baseline score is passed as input to the Sonnet prompt in the gap analysis; Sonnet adjusts it rather than ignoring it. Something like: “The embedding score is 72. Analyze the profile and the job posting and give me a final score considering context that embeddings don’t capture.”

Popular embeddings (React, Node.js, TypeScript) are cached in Redis for 24 hours. No reason to recalculate the embedding for “React” ten times a day.
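
A sketch of that embedding step with the Redis cache in front of it; the cache key format and the redis client setup are illustrative:

import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";
import { createClient } from "redis";

const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });
const redis = createClient({ url: process.env.REDIS_URL }); // redis.connect() happens at startup

// Returns the Titan embedding for a text, reusing the cached vector for popular inputs ("React", "Node.js"...)
async function embed(text: string): Promise<number[]> {
  const cacheKey = `emb:titan-v2:${text.toLowerCase()}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await bedrock.send(new InvokeModelCommand({
    modelId: "amazon.titan-embed-text-v2:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({ inputText: text, dimensions: 1024, normalize: true }),
  }));
  const { embedding } = JSON.parse(new TextDecoder().decode(response.body));

  await redis.set(cacheKey, JSON.stringify(embedding), { EX: 60 * 60 * 24 }); // 24h TTL
  return embedding;
}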

General rule: use embedding to triage, LLM to reason. Embedding costs pennies, LLM costs dollars.

Streaming: When AI Needs to Feel Like a Conversation

There’s a feature in the product that uses a different pattern: simulated interviews. The user enters interview mode, answers questions, and the system gives real-time feedback.

Here latency matters differently. It’s not “wait 30 seconds and get the result.” It’s “start seeing the response in under 1 second, appearing as if someone is typing.”

Bedrock has InvokeModelWithResponseStream. Instead of waiting for the model to finish generating, you receive text chunks as they’re produced. The worker streams those chunks via SSE to the frontend.
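
A sketch of that loop, assuming an Express-style res object for the SSE connection:

import { BedrockRuntimeClient, InvokeModelWithResponseStreamCommand } from "@aws-sdk/client-bedrock-runtime";
import type { Response } from "express";

const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

// Streams interview feedback to the browser as SSE while Sonnet is still generating
async function streamFeedback(prompt: string, res: Response) {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  const output = await bedrock.send(new InvokeModelWithResponseStreamCommand({
    modelId: "anthropic.claude-3-5-sonnet-20240620-v1:0",
    contentType: "application/json",
    accept: "application/json",
    body: JSON.stringify({
      anthropic_version: "bedrock-2023-05-31",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    }),
  }));

  for await (const event of output.body ?? []) {
    if (!event.chunk?.bytes) continue;
    const chunk = JSON.parse(new TextDecoder().decode(event.chunk.bytes));
    // Claude streams content_block_delta events; forward only the text deltas
    if (chunk.type === "content_block_delta" && chunk.delta?.text) {
      res.write(`data: ${JSON.stringify({ text: chunk.delta.text })}\n\n`);
    }
  }

  res.write("data: [DONE]\n\n");
  res.end();
}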

Claude 3.5 Sonnet via Bedrock has a Time-to-First-Token of ~500–800ms. For a simulated interview, where the user has just finished typing an answer, that delay is imperceptible.

If I needed even lower latency, Haiku reaches ~200–400ms. But interview feedback quality with Haiku is lower. It’s a conscious tradeoff: smoother experience or better feedback quality? I went with Sonnet.

Lesson: streaming isn’t just “appears faster.” It’s a different UX pattern that changes the perception of latency even when total latency is higher.

Error Handling Nobody Talks About

When Bedrock returns invalid JSON (and it does happen; rarer with Claude 3.5, but it happens), you need a strategy.

Mine: up to 2 retries with a stricter prompt. Something like: “You returned something that isn’t valid JSON. Try again and return ONLY JSON, no extra text, no markdown.” On the third failure, I mark the job as failed in RDS, publish a failure event to SNS, and notify the user to fill in manually.
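
Roughly what that looks like; invokeModel and markJobFailed are placeholders for the real Bedrock call and the RDS/SNS failure path:

import { z } from "zod";

// Up to 2 retries with a stricter prompt; on the third failure the job is marked as failed
async function extractWithRetry<T>(prompt: string, schema: z.ZodSchema<T>, jobId: string): Promise<T | null> {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= 3; attempt++) {
    const raw = await invokeModel(currentPrompt); // placeholder: InvokeModel against Haiku/Sonnet
    try {
      return schema.parse(JSON.parse(raw));
    } catch {
      currentPrompt = `${prompt}\n\nYou returned something that isn't valid JSON. ` +
        `Try again and return ONLY JSON, no extra text, no markdown.`;
    }
  }
  await markJobFailed(jobId); // placeholder: mark failed in RDS + publish failure event to SNS
  return null;
}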

For ThrottlingException (429 from Bedrock): exponential backoff. 1s, 4s, 16s. The SQS message goes back to the queue with an increased delay. This is one of the advantages of the async pipeline: SQS absorbs the retry naturally, without needing synchronous retry logic in the request.
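
One way to implement that on the consumer side is to leave the message on the queue and push its visibility timeout out exponentially; a sketch, assuming the consumer requested the ApproximateReceiveCount attribute when receiving:

import { SQSClient, ChangeMessageVisibilityCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

// On ThrottlingException: don't delete the message, just push the next attempt out (1s, 4s, 16s...)
async function backOff(queueUrl: string, receiptHandle: string, receiveCount: number) {
  const delaySeconds = Math.min(4 ** (receiveCount - 1), 900);
  await sqs.send(new ChangeMessageVisibilityCommand({
    QueueUrl: queueUrl,
    ReceiptHandle: receiptHandle,
    VisibilityTimeout: delaySeconds,
  }));
  // Not deleting the message means SQS redelivers it after the timeout;
  // past maxReceiveCount it lands in the DLQ automatically.
}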

Model fallback: if Sonnet is unavailable, structured extraction falls back to Haiku automatically. Gap analysis has no fallback: if Sonnet fails, the job stays pending and retries later. Degrading the quality of the product’s core isn’t worth it.
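
The fallback itself is a small wrapper; invokeClaude and isModelUnavailable are placeholders:

// Structured extraction: try Sonnet first, fall back to Haiku only when the model itself is unavailable
async function extractProfileText(text: string): Promise<string> {
  try {
    return await invokeClaude("anthropic.claude-3-5-sonnet-20240620-v1:0", text);
  } catch (err) {
    if (isModelUnavailable(err)) {
      return await invokeClaude("anthropic.claude-3-haiku-20240307-v1:0", text);
    }
    throw err; // anything else (throttling, bad input) follows the normal retry path
  }
}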

General rule: async retry > synchronous retry. SQS + DLQ handles 80% of cases without you writing a single line of retry logic.

What I’d Change Starting Today

Two things.

First: Bedrock Prompt Caching from day zero. The system prompts for roadmap and gap analysis are long. In every call, those tokens are charged on input. Bedrock Prompt Caching reduces input token cost by up to 90% for calls that reuse the same system prompt. I left it for later, but it should be default configuration from the start.
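
A sketch of what that looks like with the Converse API’s cachePoint blocks; model support for prompt caching varies, so treat the exact shape as illustrative:

import { BedrockRuntimeClient, ConverseCommand } from "@aws-sdk/client-bedrock-runtime";

const bedrock = new BedrockRuntimeClient({ region: "us-east-1" });

// The long, static system prompt gets a cachePoint so repeated calls reuse it
// instead of paying for those input tokens on every request
async function gapAnalysis(systemPrompt: string, userPayload: string) {
  const response = await bedrock.send(new ConverseCommand({
    modelId: "anthropic.claude-3-5-sonnet-20240620-v1:0",
    system: [
      { text: systemPrompt },
      { cachePoint: { type: "default" } }, // everything before this block is cacheable
    ],
    messages: [{ role: "user", content: [{ text: userPayload }] }],
    inferenceConfig: { maxTokens: 2048 },
  }));
  return response.output?.message?.content?.[0]?.text;
}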

Second: token metrics from the first deploy. I started instrumenting tokens_in, tokens_out, latency_ms in CloudWatch with the system already in production. It should have been from day one. Without it, you’re flying blind on cost. You only discover a model is consuming more tokens than expected when the bill arrives.
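
A sketch of that instrumentation; the namespace is illustrative, and the token counts come from the usage block Bedrock returns with each response:

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cloudwatch = new CloudWatchClient({});

// Push per-call token usage and latency to CloudWatch, dimensioned by model
async function recordUsage(model: string, tokensIn: number, tokensOut: number, latencyMs: number) {
  const dimensions = [{ Name: "Model", Value: model }];
  await cloudwatch.send(new PutMetricDataCommand({
    Namespace: "SovereignArchitect/AI",
    MetricData: [
      { MetricName: "tokens_in", Value: tokensIn, Unit: "Count", Dimensions: dimensions },
      { MetricName: "tokens_out", Value: tokensOut, Unit: "Count", Dimensions: dimensions },
      { MetricName: "latency_ms", Value: latencyMs, Unit: "Milliseconds", Dimensions: dimensions },
    ],
  }));
}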

If you’re starting now, do both before any feature. It’s 30 minutes of work that prevents months of blindness.

Decision Summary

The choices that had the most impact:

Decision → Reason

  • Everything async via SQS → No HTTP request holds an AI task
  • Separate Bedrock / LocalStack credentials → Custom LOCALSTACK_ENDPOINT, never global AWS_ENDPOINT_URL
  • Embedding for triage, LLM for reasoning → Titan + pgvector before Sonnet
  • Streaming for interaction, batch for processing → Different patterns for different UX
  • Retry in SQS, not in code → Exponential backoff via DLQ + visibility timeout
  • Token metrics + Prompt Caching from the start → Don’t fly blind on cost

None of these choices are obvious until you have the system running. But all of them, except the last two (which I left for later), are choices I’d make again.


This is the second in a series about building Sovereign Architect. In the first, I covered why I chose Bedrock over OpenAI. The next ones will go deeper into prompt engineering for structured extraction and the embedding-based matching system.