Ali Doğan
Senior Site Reliability Engineer with 6+ years of experience operating and stabilizing distributed, cloud, and on-prem systems. I am currently focused on AI and MLOps infrastructure, applying proven SRE principles to ensure reliability, observability, performance, and operational safety of AI-enabled and data-intensive systems in production.
With a strong background in observability, incident response, and deep system analysis, I approach AI workloads from a production-first perspective, prioritizing predictability, debuggability, and scalability over experimentation. Throughout my career, I have consistently used analytical and AI-assisted tools to understand complex system behavior; today, I apply that experience to building and operating the infrastructure that supports AI at scale.
Core Skills
AI / MLOps Infrastructure
- Preparing AI/ML-backed and data-intensive services for reliable operation in production environments.
- Designing observability for model inference, data pipelines, and AI-enabled services in production.
- Monitoring latency, throughput, error patterns, and cost characteristics of AI workloads.
- Deploying and operating ML-backed services from an infrastructure and platform perspective rather than a research one.
- Maintaining a strict separation between model experimentation and production reliability ownership.
Reliability & SRE
- Defining and operating SLIs/SLOs, managing error budgets, and using reliability signals to guide operational and architectural decisions.
- End-to-end incident management and root cause analysis across application, infrastructure, network, database, and storage layers.
- Improving detection and recovery times through observability, automation, alert quality, and disciplined operational practices.
- Building and operating observability systems using metrics, logs, and traces to enable full system visibility and correlation.
Observability & Telemetry
- Design and operation of observability platforms (metrics, logs, traces)
- High-volume telemetry analysis to identify trends, bottlenecks, and failure patterns
- Development of custom analysis agents for log, metric, and trace correlation
- Alert design, signal quality improvement, and noise reduction
Programming & Automation
- Python for automation, analysis tooling, and backend services
- REST services using FastAPI and Flask, and gRPC-based services
- Infrastructure automation and CI/CD using Terraform, Azure DevOps, and GitHub Actions
- Operational scripting with Bash and PowerShell
Cloud & Platform
- Kubernetes (Rancher), Docker, OpenStack, containerized and on-prem platforms
- Azure (primary), GCP (working knowledge), hybrid cloud and bare-metal environments
- High-availability, fault-tolerant architectures for distributed systems
Professional Experience
Senior Site Reliability Engineer
DESTEL
- Work as a Site Reliability Engineer in the Digital Performance team, improving the reliability, performance, and visibility of complex, distributed production systems across multiple client engagements.
- Perform advanced root cause analysis (RCA) by correlating telemetry data (logs, metrics, traces) with application, infrastructure, and network behavior, including deep packet-level analysis when required.
- Use APM and NPM tools that apply analytical and AI-assisted techniques to identify anomalies, performance regressions, and behavioral patterns in production systems.
- Analyze high-volume telemetry data to identify trends, bottlenecks, and failure patterns, strengthening the foundation for data-driven system analysis and automation.
- Develop custom analysis agents for logs, metrics, and traces, focusing on data aggregation, correlation, and automated insight generation to support faster RCA and operational decision-making.
- Automate deployment, monitoring, alerting, and data collection workflows to reduce operational toil, detection times, and response times while improving system stability and operational consistency.
- Support client teams in adopting SRE best practices, including SLI/SLO definition, alert design, observability standards, and operational ownership models.
Site Reliability Engineer
EHSIM
- Served as one of three Site Reliability Engineers responsible for the availability and reliability of a confidential, mission-critical on-prem system operating in a high-compliance environment.
- Managed and operated infrastructure built on OpenStack, Rancher / Kubernetes, Ceph, PostgreSQL, HAProxy, Zabbix, Grafana, and the ELK stack.
- Performed root cause analysis (RCA) across application, infrastructure, network, and storage layers to resolve production incidents and ensure continuous system availability.
- Designed and implemented monitoring and alerting strategies to improve system visibility and reduce incident detection and response times.
- Developed automation scripts to reduce operational toil and support repeatable infrastructure and maintenance tasks.
- Troubleshot complex issues involving Kubernetes clusters, etcd, networking, storage (Ceph), and databases, operating with full production responsibility as part of a small SRE team.
Software Engineer & DevOps
Logarity
- Co-founded and developed Logarity, an ELK-based mini SIEM solution for small companies that were subject to mandatory log retention requirements but could not afford enterprise SIEM platforms.
- Extended the ELK stack with custom components for log archiving, agent management, and long-term storage, focusing on operational scalability.
- Designed and implemented high-throughput log ingestion pipelines using Kafka to absorb growing data volumes under load.
- Developed gRPC-based agents to optimize log data transfer between endpoints and the central platform, improving efficiency and reliability.
- Built backend services using Python and containerized all components with Docker for reproducible deployments.
- Owned DevOps and operational responsibilities, including container orchestration, deployment workflows, and system reliability for on-prem installations.
Software Engineer & DevOps
AllConfig
- Led a small development team responsible for building an on-prem, microservice-based network configuration management system.
- Designed and developed Python-based backend services, exposing APIs for managing and auditing network device configurations.
- Containerized services using Docker and deployed them with Docker Swarm, focusing on service isolation and operational simplicity.
- Automated network device configuration and management using Netmiko, gaining hands-on experience with network protocols, device configurations, and operational networking concepts.
- Owned DevOps responsibilities, including service packaging, deployment workflows, and basic CI/CD practices for on-prem environments.
- Worked under the technical leadership and mentorship of Okan Eke, with guidance and collaboration from Erkun Altunbaş and Gürkan Atabay, gaining practical experience in system design and operational decision-making.
Machine Learning Software Engineer
VeriUs
- Built and maintained data collection and preprocessing pipelines for NLP systems, including web scraping, parsing, normalization, tokenization, stemming, and noise filtering.
- Developed and containerized Python-based REST APIs using Flask and Docker to serve NLP functionalities for internal and client-facing use.
- Conducted model training and evaluation experiments for intent detection, text summarization, and noisy text classification, focusing on data quality and performance validation.
- Collaborated with Associate Professor Dr. Murat Can Ganiz on academic NLP research and contributed to a peer-reviewed publication on Word Sense Disambiguation.
- Worked with Aydın Gerek on language model optimization.
Latest Blog Posts
Recent thoughts and tutorials
LangChain Deep Agents: A Planning and Delegation Framework for Complex Tasks
Automate autonomous research workflows with LangChain Deep Agents. Learn how to integrate capabilities such as planning, computer access, and sub-agent delegation.
From Static Knowledge to Dynamic Skills: A New Way to Train AI Agents
Discover how AI agents can learn from their past interactions and how you can build a continuous learning loop with Deepagents.
A New Era in Customer Experience: AI Communication Agents and the Benefits They Bring to Your Business
A comprehensive guide to AI communication agents for businesses, covering 1,118 different use cases, cost savings, scalability, and personalized customer experience.
Certifications & Badges
Professional certifications and learning achievements

