Blog
Thoughts and insights about reliability, performance, observability and AI.
Mastering LLMs: How Strategic Prompting Transforms Technical Outputs
Learn fundamental prompt engineering techniques, including zero-shot, few-shot, chain-of-thought, and role-specific prompting, to achieve professional-grade AI outputs.
Beyond the Hype: How AI Integration Impacts DORA Metrics and Software Performance
Explore how AI adoption affects DORA metrics, the new fifth metric (Deployment Rework Rate), and the seven organizational capabilities needed to turn AI into a performance amplifier rather than a bottleneck.
The Architect's Guide to Hybrid Search, RRF, and RAG in the AI Era
Traditional search engines excel at exact matches but fail to grasp user intent. Learn how hybrid search combines lexical and vector methods with Reciprocal Rank Fusion (RRF) to build accurate, context-aware retrieval systems.
From Chatbots to Autonomous Agents: The 7 Patterns of Agentic AI Evolution
Software development is transforming as natural language becomes the primary programming interface. Learn seven AI patterns, from simple loops to autonomous agent-to-agent systems and the Model Context Protocol.
The Reality of Enterprise AI: Avoiding Common Pitfalls When Scaling GenAI
Moving from a Generative AI pilot to an enterprise-wide rollout is not a linear process. Explore the lessons technical leaders at Liberty Mutual learned while deploying GenAI at scale, including adoption strategies, cost management, and RAG complexity.
Observing the Stochastic: Tuning the LGTM Stack for AI Infrastructure
The LGTM stack (Loki, Grafana, Tempo, Mimir) has earned its place in production. This article explores how well it holds up when pushed into one of the most hostile observability environments: LLM and ML production systems.
The AWS AI Stack: Moving Beyond Proof-of-Concept to Five-Nines Reliability
AWS can absolutely support Tier-1 AI systems, but only if you treat AI workloads as first-class distributed systems, not experiments wrapped in SDKs. This is an evaluation of what actually holds up when you push beyond a demo, beyond a PoC, and toward five-nines expectations.
Azure for AI: An SRE's Guide to Provisioning for High Availability
Azure can run LLM-backed systems with uptime expectations comparable to Tier-1 services, but the path runs through quotas, networking, and observability—not prompts and SDKs. This is an evaluation of what actually holds up under load, failure, and budget scrutiny.
From PoC to Production: What Breaks When You Ship LLM-Based Systems
The gap between a Proof of Concept and a production system is not primarily about model quality. It's about everything that happens after the first successful response. LLM systems don't fail because the prompt is wrong. They fail because production is hostile to assumptions.