Ali Doğan
Senior Site Reliability Engineer with 6+ years of experience operating and stabilizing distributed, cloud, and on-prem systems. I am currently focused on AI and MLOps infrastructure, applying proven SRE principles to ensure the reliability, observability, performance, and operational safety of AI-enabled and data-intensive systems in production.
With a strong background in observability, incident response, and deep system analysis, I approach AI workloads from a production-first perspective, prioritizing predictability, debuggability, and scalability over experimentation. Throughout my career, I have consistently used analytical and AI-assisted tools to understand complex system behavior; today, I apply that experience to building and operating the infrastructure that supports AI at scale.
Core Skills
AI / MLOps Infrastructure
- Preparing AI/ML-backed and data-intensive services for reliable operation in production environments.
- Designing observability for model inference, data pipelines, and AI-enabled services in production.
- Monitoring latency, throughput, error patterns, and cost characteristics of AI workloads.
- Deploying and operating ML-backed services from an infrastructure and platform perspective rather than a research one.
- Maintaining a strict separation between model experimentation and production reliability ownership.
Reliability & SRE
- Defining and operating SLIs/SLOs, managing error budgets, and using reliability signals to guide operational and architectural decisions.
- End-to-end incident management and root cause analysis across application, infrastructure, network, database, and storage layers.
- Improving detection and recovery times through observability, automation, alert quality, and disciplined operational practices.
- Building and operating observability systems using metrics, logs, and traces to enable full system visibility and correlation.
Observability & Telemetry
- Design and operation of observability platforms (metrics, logs, traces)
- High-volume telemetry analysis to identify trends, bottlenecks, and failure patterns
- Development of custom analysis agents for log, metric, and trace correlation
- Alert design, signal quality improvement, and noise reduction
Programming & Automation
- Python for automation, analysis tooling, and backend services
- REST and gRPC-based services using FastAPI and Flask
- Infrastructure automation and CI/CD using Terraform, Azure DevOps, and GitHub Actions
- Operational scripting with Bash and PowerShell
Cloud & Platform
- Kubernetes (Rancher), Docker, OpenStack, containerized and on-prem platforms
- Azure (primary), GCP (working knowledge), hybrid cloud and bare-metal environments
- High-availability, fault-tolerant architectures for distributed systems
Professional Experience
Senior Site Reliability Engineer
DESTEL
- Work as a Site Reliability Engineer in the Digital Performance team, improving the reliability, performance, and visibility of complex, distributed production systems across multiple client engagements.
- Perform advanced root cause analysis (RCA) by correlating telemetry data (logs, metrics, traces) with application, infrastructure, and network behavior, including deep packet-level analysis when required.
- Use APM and NPM tools that apply analytical and AI-assisted techniques to identify anomalies, performance regressions, and behavioral patterns in production systems.
- Analyze high-volume telemetry data to identify trends, bottlenecks, and failure patterns, strengthening the foundation for data-driven system analysis and automation.
- Develop custom analysis agents for logs, metrics, and traces, focusing on data aggregation, correlation, and automated insight generation to support faster RCA and operational decision-making.
- Automate deployment, monitoring, alerting, and data collection workflows to reduce operational toil, detection times, and response times while improving system stability and operational consistency.
- Support client teams in adopting SRE best practices, including SLI/SLO definition, alert design, observability standards, and operational ownership models.
Site Reliability Engineer
EHSIM
- Served as one of three Site Reliability Engineers responsible for the availability and reliability of a confidential, mission-critical on-prem system operating in a high-compliance environment.
- Managed and operated infrastructure built on OpenStack, Rancher / Kubernetes, Ceph, PostgreSQL, HAProxy, Zabbix, Grafana, and the ELK stack.
- Performed root cause analysis (RCA) across application, infrastructure, network, and storage layers to resolve production incidents and ensure continuous system availability.
- Designed and implemented monitoring and alerting strategies to improve system visibility and reduce incident detection and response times.
- Developed automation scripts to reduce operational toil and support repeatable infrastructure and maintenance tasks.
- Troubleshot complex issues involving Kubernetes clusters, etcd, networking, storage (Ceph), and databases, operating with full production responsibility as part of a small SRE team.
Software Engineer & DevOps
Logarity
- Co-founded and developed Logarity, an ELK-based mini SIEM solution targeting small companies that required mandatory log retention but could not afford enterprise SIEM platforms.
- Extended the ELK stack with custom components for log archiving, agent management, and long-term storage, focusing on operational scalability.
- Designed and implemented high-throughput log ingestion pipelines using Kafka to handle increasing data volumes under load.
- Developed gRPC-based agents to optimize log data transfer between endpoints and the central platform, improving efficiency and reliability.
- Built backend services using Python and containerized all components with Docker for reproducible deployments.
- Owned DevOps and operational responsibilities, including container orchestration, deployment workflows, and system reliability for on-prem installations.
Software Engineer & DevOps
AllConfig
- Led a small development team responsible for building an on-prem, microservice-based network configuration management system.
- Designed and developed Python-based backend services, exposing APIs for managing and auditing network device configurations.
- Containerized services using Docker and deployed them with Docker Swarm, focusing on service isolation and operational simplicity.
- Automated network device configuration and management using Netmiko, gaining hands-on experience with network protocols, device configurations, and operational networking concepts.
- Owned DevOps responsibilities, including service packaging, deployment workflows, and basic CI/CD practices for on-prem environments.
- Worked under the technical leadership and mentorship of Okan Eke, with guidance and collaboration from Erkun Altunbaş and Gürkan Atabay, gaining practical experience in system design and operational decision-making.
Machine Learning Software Engineer
VeriUs
- Built and maintained data collection and preprocessing pipelines for NLP systems, including web scraping, parsing, normalization, tokenization, stemming, and noise filtering.
- Developed and containerized Python-based REST APIs using Flask and Docker to serve NLP functionalities for internal and client-facing use.
- Conducted model training and evaluation experiments for intent detection, text summarization, and noisy text classification, focusing on data quality and performance validation.
- Collaborated with Associate Professor Dr. Murat Can Ganiz on academic NLP research and contributed to a peer-reviewed publication on Word Sense Disambiguation.
- Worked with Aydın Gerek on language model optimization.
Latest Blog Posts
Recent thoughts and tutorials
Tech World 2026: 10 Critical Trends from Software Development to the Quantum Revolution (in Turkish)
10 trends shaping the technology world in 2026: Code Janitors, the AI plateau, the IPO wave, nuclear data centers, and quantum applications. A technical perspective.
Tech Trends 2026: From AI Plateaus to the Rise of "Code Janitors"
Ten critical trends shaping 2026: the code janitor role, LLM plateau, IPO wave, humanoid robots, nuclear data centers, quantum practicality, and JavaScript evolution.
Decoding ClawdBot: Is Anthropic's Web Crawler a Threat to Your Infrastructure?
Identify ClawdBot activity, distinguish it from spoofing, and implement robots.txt or WAF controls to protect bandwidth and content without hurting SEO.
Certifications & Badges
Professional certifications and learning achievements

