Product recommendations that survive contact with production
Embeddings + Shopify Functions without the research-lab demo theater.
Most merchants I talk to have tried an AI personalization pilot in the last eighteen months. Most got stuck in the same place: impressive demo, unclear revenue lift, and nobody on the team willing to own the surface area in production.
This is the playbook I ship when a Shopify Plus brand wants AI-driven product recommendations that actually go live — and stay live.
The default pattern (what I usually ship)
The key move is where the intelligence lives:
- Offline: generate embeddings for every product using OpenAI text-embedding-3-small or a sentence-transformer you host. Batch job nightly. Cheap.
- At-rest: store in Pinecone, Turbopuffer, or Postgres + pgvector. Index by product GID.
- At-request: similarity search against the current product (PDP), cart contents (checkout), or customer history (account page). Return top-N product GIDs.
- At-render: the storefront fetches those products via the Storefront API, so your existing pricing and availability rules still apply.
No LLM runs per request. Latency is a vector lookup, not a model inference. This is the difference between shipping and demoing.
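To make the at-request step concrete, here is a minimal sketch of a top-N lookup. It assumes the embeddings already exist and sit in an in-memory dict keyed by product GID; in production the vector store does this search for you. The function names and the toy 3-dimensional vectors are illustrative, not part of any library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_similar(index: dict[str, list[float]], source_gid: str, n: int = 4) -> list[str]:
    """Return the n product GIDs most similar to source_gid, best first."""
    source = index[source_gid]
    scored = [
        (cosine_similarity(source, vector), gid)
        for gid, vector in index.items()
        if gid != source_gid  # never recommend the product being viewed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [gid for _, gid in scored[:n]]


# Toy vectors for illustration; real embeddings have far more dimensions.
INDEX = {
    "gid://shopify/Product/1": [0.9, 0.1, 0.0],
    "gid://shopify/Product/2": [0.8, 0.2, 0.1],
    "gid://shopify/Product/3": [0.0, 0.1, 0.9],
    "gid://shopify/Product/4": [0.7, 0.3, 0.0],
}

print(top_n_similar(INDEX, "gid://shopify/Product/1", n=2))
```

Note that the query vector is the stored embedding of the current product, so nothing calls a model at request time.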
Where merchants usually get stuck
1. They put an LLM in the hot path
A call to GPT-4 or Claude on every PDP render will cost more than the incremental revenue unless your AOV is very high. And it will fail on latency. Keep the LLM for offline tasks — generating embeddings, enriching product descriptions, clustering categories.
2. They let the vector index drift
Products launch, descriptions change, variants come and go. Your embedding index needs to be reconciled against the catalog continuously. I usually run a Shopify webhook listener (products/create, products/update, products/delete) for incremental updates, with a 4am full reindex as a safety net.
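One way to sketch the safety-net reconcile is as a diff between the catalog and the index, keyed by a hash of the text that gets embedded. This assumes you store that hash next to each vector; `content_hash` and `plan_reconcile` are hypothetical helper names, and the example data is made up.

```python
import hashlib


def content_hash(title: str, description: str) -> str:
    """Fingerprint of the text that feeds the embedding."""
    return hashlib.sha256(f"{title}\n{description}".encode("utf-8")).hexdigest()


def plan_reconcile(catalog: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Both arguments map product GID -> content hash.

    Returns which GIDs need (re-)embedding and which vectors should be deleted.
    """
    to_embed = sorted(gid for gid, digest in catalog.items() if indexed.get(gid) != digest)
    to_delete = sorted(gid for gid in indexed if gid not in catalog)
    return {"embed": to_embed, "delete": to_delete}


catalog = {
    "gid://shopify/Product/1": content_hash("Linen shirt", "New, longer description"),
    "gid://shopify/Product/2": content_hash("Canvas tote", "Everyday bag"),
    "gid://shopify/Product/3": content_hash("Wool beanie", "Just launched"),
}
indexed = {
    "gid://shopify/Product/1": content_hash("Linen shirt", "Old description"),
    "gid://shopify/Product/2": content_hash("Canvas tote", "Everyday bag"),
    "gid://shopify/Product/9": content_hash("Retired item", "No longer sold"),
}

print(plan_reconcile(catalog, indexed))
```

Hashing the input text means the nightly run only re-embeds products whose copy actually changed, which keeps the full pass cheap.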
3. They skip the merchandising override layer
The model will confidently recommend the sold-out product, the 10%-margin product, or the SKU marketing specifically wants hidden this week. Every recommendation query needs a post-filter step: in-stock, margin threshold, brand rules. Ship this layer from day one.
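A post-filter along these lines can be very small. The sketch below assumes each candidate arrives with stock and margin data already attached; the `Candidate` fields, the threshold, and the hidden-GID set are illustrative names, not a Shopify API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gid: str
    in_stock: bool
    margin: float  # gross margin as a fraction of price, e.g. 0.35


def apply_merchandising_rules(
    candidates: list[Candidate],
    min_margin: float,
    hidden_gids: set[str],
    limit: int,
) -> list[str]:
    """Keep the similarity order, drop anything that breaks a rule, then truncate."""
    kept = [
        c.gid
        for c in candidates
        if c.in_stock and c.margin >= min_margin and c.gid not in hidden_gids
    ]
    return kept[:limit]


# Candidates in similarity order, as returned by the vector search.
ranked = [
    Candidate("gid://shopify/Product/2", in_stock=False, margin=0.40),  # sold out
    Candidate("gid://shopify/Product/4", in_stock=True, margin=0.10),   # low margin
    Candidate("gid://shopify/Product/5", in_stock=True, margin=0.45),
    Candidate("gid://shopify/Product/6", in_stock=True, margin=0.50),   # hidden this week
    Candidate("gid://shopify/Product/7", in_stock=True, margin=0.30),
]

print(apply_merchandising_rules(ranked, 0.25, {"gid://shopify/Product/6"}, limit=2))
```

Because the filter removes items, ask the vector index for more candidates than you intend to show, for example 20 to fill a 4-slot block.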
Architecture at a glance
┌───────────────┐      ┌─────────────────┐      ┌──────────────────┐
│ Product       │─────▶│ Embedding       │─────▶│ Vector Index     │
│ catalog       │      │ batch job       │      │ (Pinecone /      │
│ (Shopify)     │      │ (nightly +      │      │ pgvector)        │
│               │      │ webhook)        │      │                  │
└───────────────┘      └─────────────────┘      └──────────────────┘
                                                         │
                                                         ▼
                                             ┌───────────────────────┐
                                             │ Retrieval API         │
                                             │ + merchandising       │
                                             │ rules                 │
                                             └───────────────────────┘
                                                         │
        ┌───────────────────────┬────────────────────────┤
        ▼                       ▼                        ▼
┌────────────────┐      ┌────────────────┐      ┌─────────────────┐
│ PDP widget     │      │ Cart upsell    │      │ Checkout        │
│ (Storefront    │      │ (Theme block)  │      │ (Function)      │
│ API)           │      │                │      │                 │
└────────────────┘      └────────────────┘      └─────────────────┘
Cost math (what to tell finance)
For a catalog of 10,000 SKUs, embedded nightly with text-embedding-3-small:
- Embedding generation: ~$2–4/month.
- Vector index (Pinecone starter): $0 for the first namespace at this scale.
- Retrieval infra (one Cloudflare Worker + KV or Postgres): ~$20/month.
The cost isn't the model. The cost is the team time to operate it. Budget one engineer-day a month for monitoring, merchandising rule updates, and reconciliation.
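The embedding line item can be sanity-checked with back-of-envelope arithmetic. Both inputs below are assumptions to replace with your own: a price of $0.02 per million tokens for text-embedding-3-small (check the current price list) and an average of about 500 tokens of title plus description per product.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # assumed price; verify against current pricing
AVG_TOKENS_PER_PRODUCT = 500         # assumed average length of title + description


def monthly_embedding_cost(
    sku_count: int,
    avg_tokens: int,
    price_per_million: float,
    runs_per_month: int = 30,
) -> float:
    """Cost in USD of re-embedding the whole catalog once per run."""
    total_tokens = sku_count * avg_tokens * runs_per_month
    return total_tokens / 1_000_000 * price_per_million


cost = monthly_embedding_cost(10_000, AVG_TOKENS_PER_PRODUCT, PRICE_PER_MILLION_TOKENS_USD)
print(f"${cost:.2f} per month")
```

Under those assumptions a full nightly re-embed of 10,000 SKUs is 150 million tokens a month, which lands inside the $2–4 range quoted above. Embedding only changed products would cut it further.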
What I measure
- Click-through on recommendations — baseline versus variant, over 30-day windows.
- Attach rate — whether the recommended product shows up in the next order.
- Margin-weighted lift — because the model doesn't care about your margin unless you tell it to.
- Failure rate — percentage of requests that fell back to the default algorithm. Target < 0.5%.
If you can't measure margin-weighted lift, you can't tell whether the pilot won or lost. This is the number that gets the CFO off your back.
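There is no single standard formula for margin-weighted lift. The sketch below assumes one reasonable definition: the relative change in gross margin per session between the baseline block and the vector-powered variant. The input figures are invented purely to show the arithmetic.

```python
def margin_per_session(total_margin: float, sessions: int) -> float:
    """Gross margin earned per session; 0.0 when there were no sessions."""
    return total_margin / sessions if sessions else 0.0


def margin_weighted_lift(
    baseline_margin: float,
    baseline_sessions: int,
    variant_margin: float,
    variant_sessions: int,
) -> float:
    """Relative change in margin per session, variant versus baseline."""
    base = margin_per_session(baseline_margin, baseline_sessions)
    variant = margin_per_session(variant_margin, variant_sessions)
    if base == 0:
        raise ValueError("baseline margin per session is zero; lift is undefined")
    return (variant - base) / base


def failure_rate(fallback_requests: int, total_requests: int) -> float:
    """Share of requests that fell back to the default algorithm."""
    return fallback_requests / total_requests if total_requests else 0.0


# Invented 30-day totals, for illustration only.
lift = margin_weighted_lift(50_000.0, 20_000, 54_000.0, 20_000)
fallbacks = failure_rate(40, 10_000)
print(f"margin-weighted lift: {lift:.1%}, failure rate: {fallbacks:.2%}")
```

Normalizing by sessions matters when the A/B split is uneven; comparing raw margin totals would reward whichever arm got more traffic.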
When to use an LLM in the loop
Rarely, but sometimes:
- Query understanding for semantic search — "shoes for a beach wedding" → product filter. Keep the LLM call cached by query hash.
- Description enrichment offline — more signal in the embedding, measurably better retrieval.
- Conversational commerce surfaces — a chat widget with RAG over the catalog. Only worth it for catalogs > 50K SKUs with real taxonomy complexity.
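Caching by query hash, as mentioned for semantic search, might look like the sketch below. The LLM call is a stand-in function that returns a fixed filter; the class name, the normalization rule, and the in-memory dict are assumptions, and a real deployment would more likely back the cache with KV or Redis.

```python
import hashlib
from typing import Callable


def query_key(query: str) -> str:
    """Hash of the query after lowercasing and collapsing whitespace."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class CachedQueryParser:
    """Wraps an expensive query-understanding call with a hash-keyed cache."""

    def __init__(self, parse_with_llm: Callable[[str], dict]):
        self._parse_with_llm = parse_with_llm
        self._cache: dict[str, dict] = {}
        self.llm_calls = 0  # counted so the hit rate can be monitored

    def parse(self, query: str) -> dict:
        key = query_key(query)
        if key not in self._cache:
            self.llm_calls += 1
            self._cache[key] = self._parse_with_llm(query)
        return self._cache[key]


def fake_llm(query: str) -> dict:
    # Stand-in for a real model call; returns a hard-coded product filter.
    return {"category": "shoes", "occasion": "beach wedding"}


parser = CachedQueryParser(fake_llm)
first = parser.parse("Shoes for a beach wedding")
second = parser.parse("  shoes for a  beach wedding ")
print(first, parser.llm_calls)
```

Normalizing before hashing lets trivially different spellings of the same query share one cached answer, so the model is only called for queries it has not seen.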
What to ship first
The minimum viable version I recommend for a Shopify Plus brand:
- Nightly embedding job for the full catalog.
- PDP "you may also like" block powered by vector similarity.
- Merchandising post-filter — in-stock, margin ≥ threshold.
- A/B test against the default algorithmic block for 30 days.
Everything else — checkout Functions, cart upsell, account-level history — stacks on top of this foundation. Don't try to ship all four surfaces in month one. One surface, measured, then the next.
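For the 30-day A/B test, visitors need a stable assignment so the same person always sees the same block. One common approach, sketched here with assumed names (`assign_variant`, the experiment label, and the visitor ID format are all hypothetical), is to hash the visitor ID into a bucket.

```python
import hashlib


def assign_variant(
    visitor_id: str,
    experiment: str = "pdp-recs-v1",
    treatment_share: float = 0.5,
) -> str:
    """Deterministically map a visitor to 'vector' or 'default'."""
    digest = hashlib.sha256(f"{experiment}:{visitor_id}".encode("utf-8")).digest()
    # First 8 bytes as a number in [0, 1): the visitor's bucket position.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "vector" if bucket < treatment_share else "default"


counts = {"vector": 0, "default": 0}
for i in range(1000):
    counts[assign_variant(f"visitor-{i}")] += 1
print(counts)
```

Including the experiment name in the hash means a later test on another surface reshuffles visitors instead of reusing the same split.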
When to call me
If you've run an AI pilot that felt impressive in demo and disappointing in production, the gap is almost always one of: the model is in the hot path, the index has drifted, or the merchandising rules never got shipped. All three are fixable. Two weeks of audit usually finds which one it is.
Otherwise, if you're shipping from scratch and want a reference architecture that won't set your AWS budget on fire: book a stack call and we'll scope it.