Ali Doğan

Senior Site Reliability Engineer with 6+ years of experience operating and stabilizing distributed, cloud, and on-prem systems. I am currently focused on AI and MLOps infrastructure, applying proven SRE principles to ensure reliability, observability, performance, and operational safety of AI-enabled and data-intensive systems in production.

With a strong background in observability, incident response, and deep system analysis, I approach AI workloads from a production-first perspective, prioritizing predictability, debuggability, and scalability over experimentation. Throughout my career, I have consistently used analytical and AI-assisted tools to understand complex system behavior; today, I apply that experience to building and operating the infrastructure that supports AI at scale.

Ankara, Turkiye

Core Skills

AI / MLOps Infrastructure

Preparing AI/ML-backed and data-intensive services for reliable operation in production environments.
Designing observability for model inference, data pipelines, and AI-enabled services in production.
Monitoring latency, throughput, error patterns, and cost characteristics of AI workloads.
Deploying and operating ML-backed services from an infrastructure and platform perspective, not research.
Maintaining a strict separation between model experimentation and production reliability ownership.

Reliability & SRE

Defining and operating SLIs/SLOs, managing error budgets, and using reliability signals to guide operational and architectural decisions.
End-to-end incident management and root cause analysis across application, infrastructure, network, database, and storage layers.
Improving detection and recovery times through observability, automation, alert quality, and disciplined operational practices.
Building and operating observability systems using metrics, logs, and traces to enable full system visibility and correlation.

Observability & Telemetry

Design and operation of observability platforms (metrics, logs, traces)
High-volume telemetry analysis to identify trends, bottlenecks, and failure patterns
Development of custom analysis agents for log, metric, and trace correlation
Alert design, signal quality improvement, and noise reduction

Programming & Automation

Python for automation, analysis tooling, and backend services
REST services using FastAPI and Flask, and gRPC-based services
Infrastructure automation and CI/CD using Terraform, Azure DevOps, and GitHub Actions
Operational scripting with Bash and PowerShell

Cloud & Platform

Kubernetes (Rancher), Docker, OpenStack, containerized and on-prem platforms
Azure (primary), GCP (working knowledge), hybrid cloud and bare-metal environments
High-availability, fault-tolerant architectures for distributed systems

Professional Experience

Senior Site Reliability Engineer

DESTEL

Ankara, Jan 2022 - Present

Site Reliability Engineer in the Digital Performance team, responsible for improving reliability, performance, and visibility of complex, distributed production systems across multiple client engagements.
Perform advanced root cause analysis (RCA) by correlating telemetry data (logs, metrics, traces) with application, infrastructure, and network behavior, including deep packet-level analysis when required.
Use APM and NPM tools that apply analytical and AI-assisted techniques to identify anomalies, performance regressions, and behavioral patterns in production systems.
Analyze high-volume telemetry data to identify trends, bottlenecks, and failure patterns, strengthening the foundation for data-driven system analysis and automation.
Develop custom analysis agents for logs, metrics, and traces, focusing on data aggregation, correlation, and automated insight generation to support faster RCA and operational decision-making.
Automate deployment, monitoring, alerting, and data collection workflows to reduce operational toil, detection times, and response times while improving system stability and operational consistency.
Support client teams in adopting SRE best practices, including SLI/SLO definition, alert design, observability standards, and operational ownership models.

Site Reliability Engineer

EHSIM

Ankara, Mar 2021 - Oct 2021

Served as one of three Site Reliability Engineers responsible for the availability and reliability of a confidential, mission-critical on-prem system operating in a high-compliance environment.
Managed and operated infrastructure built on OpenStack, Rancher / Kubernetes, Ceph, PostgreSQL, HAProxy, Zabbix, Grafana, and the ELK stack.
Performed root cause analysis (RCA) across application, infrastructure, network, and storage layers to resolve production incidents and ensure continuous system availability.
Designed and implemented monitoring and alerting strategies to improve system visibility and reduce incident detection and response times.
Developed automation scripts to reduce operational toil and support repeatable infrastructure and maintenance tasks.
Troubleshot complex issues involving Kubernetes clusters, etcd, networking, storage (Ceph), and databases, operating with full production responsibility as part of a small SRE team.

Software Engineer & DevOps

Logarity

İstanbul, Apr 2020 - Sep 2020

Co-founded and developed Logarity, an ELK-based mini SIEM solution targeting small companies that required mandatory log retention but could not afford enterprise SIEM platforms.
Extended the ELK stack with custom components for log archiving, agent management, and long-term storage, focusing on operational scalability.
Designed and implemented high-throughput log ingestion pipelines using Kafka to handle increased data volume under load.
Developed gRPC-based agents to optimize log data transfer between endpoints and the central platform, improving efficiency and reliability.
Built backend services using Python and containerized all components with Docker for reproducible deployments.
Owned DevOps and operational responsibilities, including container orchestration, deployment workflows, and system reliability for on-prem installations.

Software Engineer & DevOps

AllConfig

İstanbul, Jul 2019 - Mar 2020

Led a small development team responsible for building an on-prem, microservice-based network configuration management system.
Designed and developed Python-based backend services, exposing APIs for managing and auditing network device configurations.
Containerized services using Docker and deployed them with Docker Swarm, focusing on service isolation and operational simplicity.
Automated network device configuration and management using Netmiko, gaining hands-on experience with network protocols, device configurations, and operational networking concepts.
Owned DevOps responsibilities, including service packaging, deployment workflows, and basic CI/CD practices for on-prem environments.
Worked under the technical leadership and mentorship of Okan Eke, with guidance and collaboration from Erkun Altunbaş and Gürkan Atabay, gaining practical experience in system design and operational decision-making.

Machine Learning Software Engineer

VeriUs

İstanbul, Jun 2018 - May 2019

Built and maintained data collection and preprocessing pipelines for NLP systems, including web scraping, parsing, normalization, tokenization, stemming, and noise filtering.
Developed and containerized Python-based REST APIs using Flask and Docker to serve NLP functionalities for internal and client-facing use.
Conducted model training and evaluation experiments for intent detection, text summarization, and noisy text classification, focusing on data quality and performance validation.
Collaborated with Associate Professor Dr. Murat Can Ganiz on academic NLP research and contributed to a peer-reviewed publication on Word Sense Disambiguation.
Worked with Aydın Gerek on language model optimization.

Latest Blog Posts

Recent thoughts and tutorials

AI, Tech Trends

2026 Tech World: 10 Critical Trends from Software Development to the Quantum Revolution

Ten trends shaping the technology world in 2026: Code Janitors, the AI plateau, the IPO wave, nuclear data centers, and quantum applications. A technical perspective.

Jan 28, 2026

4 min read

AI, Tech Trends

Tech Trends 2026: From AI Plateaus to the Rise of "Code Janitors"

Ten critical trends shaping 2026: the code janitor role, LLM plateau, IPO wave, humanoid robots, nuclear data centers, quantum practicality, and JavaScript evolution.

Jan 27, 2026

5 min read

AI, Security

Decoding ClawdBot: Is Anthropic's Web Crawler a Threat to Your Infrastructure?

Identify ClawdBot activity, distinguish it from spoofing, and implement robots.txt or WAF controls to protect bandwidth and content without hurting SEO.

Jan 26, 2026

4 min read
