Ali Doğan

Senior Site Reliability Engineer with 6+ years of experience operating and stabilizing distributed, cloud, and on-prem systems. I am currently focused on AI and MLOps infrastructure, applying proven SRE principles to ensure reliability, observability, performance, and operational safety of AI-enabled and data-intensive systems in production.

With a strong background in observability, incident response, and deep system analysis, I approach AI workloads from a production-first perspective—prioritizing predictability, debuggability, and scalability over experimentation. Throughout my career, I have consistently used analytical and AI-assisted tools to understand complex system behavior; today, I apply that experience to building and operating the infrastructure that supports AI at scale.

Ankara, Turkiye

Core Skills

AI / MLOps Infrastructure

  • Preparing AI/ML-backed and data-intensive services for reliable operation in production environments.
  • Designing observability for model inference, data pipelines, and AI-enabled services in production.
  • Monitoring latency, throughput, error patterns, and cost characteristics of AI workloads.
  • Deploying and operating ML-backed services from an infrastructure and platform perspective rather than a research one.
  • Maintaining a strict separation between model experimentation and production reliability ownership.

Reliability & SRE

  • Defining and operating SLIs/SLOs, managing error budgets, and using reliability signals to guide operational and architectural decisions.
  • End-to-end incident management and root cause analysis across application, infrastructure, network, database, and storage layers.
  • Improving detection and recovery times through observability, automation, alert quality, and disciplined operational practices.
  • Building and operating observability systems using metrics, logs, and traces to enable full system visibility and correlation.

Observability & Telemetry

  • Design and operation of observability platforms (metrics, logs, traces)
  • High-volume telemetry analysis to identify trends, bottlenecks, and failure patterns
  • Development of custom analysis agents for log, metric, and trace correlation
  • Alert design, signal quality improvement, and noise reduction

Programming & Automation

  • Python for automation, analysis tooling, and backend services
  • REST services using FastAPI and Flask, and gRPC-based services
  • Infrastructure automation and CI/CD using Terraform, Azure DevOps, and GitHub Actions
  • Operational scripting with Bash and PowerShell

Cloud & Platform

  • Kubernetes (Rancher), Docker, OpenStack, containerized and on-prem platforms
  • Azure (primary), GCP (working knowledge), hybrid cloud and bare-metal environments
  • High-availability, fault-tolerant architectures for distributed systems

Professional Experience

Senior Site Reliability Engineer

DESTEL

Ankara, Jan 2022 - Present
  • Site Reliability Engineer in the Digital Performance team, responsible for improving reliability, performance, and visibility of complex, distributed production systems across multiple client engagements.
  • Perform advanced root cause analysis (RCA) by correlating telemetry data (logs, metrics, traces) with application, infrastructure, and network behavior, including deep packet-level analysis when required.
  • Use APM and NPM tools that apply analytical and AI-assisted techniques to identify anomalies, performance regressions, and behavioral patterns in production systems.
  • Analyze high-volume telemetry data to identify trends, bottlenecks, and failure patterns, strengthening the foundation for data-driven system analysis and automation.
  • Develop custom analysis agents for logs, metrics, and traces, focusing on data aggregation, correlation, and automated insight generation to support faster RCA and operational decision-making.
  • Automate deployment, monitoring, alerting, and data collection workflows to reduce operational toil, detection times, and response times while improving system stability and operational consistency.
  • Support client teams in adopting SRE best practices, including SLI/SLO definition, alert design, observability standards, and operational ownership models.

Site Reliability Engineer

EHSIM

Ankara, Mar 2021 - Oct 2021
  • Served as one of three Site Reliability Engineers responsible for the availability and reliability of a confidential, mission-critical on-prem system operating in a high-compliance environment.
  • Managed and operated infrastructure built on OpenStack, Rancher / Kubernetes, Ceph, PostgreSQL, HAProxy, Zabbix, Grafana, and the ELK stack.
  • Performed root cause analysis (RCA) across application, infrastructure, network, and storage layers to resolve production incidents and ensure continuous system availability.
  • Designed and implemented monitoring and alerting strategies to improve system visibility and reduce incident detection and response times.
  • Developed automation scripts to reduce operational toil and support repeatable infrastructure and maintenance tasks.
  • Troubleshot complex issues involving Kubernetes clusters, etcd, networking, storage (Ceph), and databases, operating with full production responsibility as part of a small SRE team.

Software Engineer & DevOps

Logarity

İstanbul, Apr 2020 - Sep 2020
  • Co-founded and developed Logarity, an ELK-based mini SIEM solution for small companies that were subject to mandatory log retention requirements but could not afford enterprise SIEM platforms.
  • Extended the ELK stack with custom components for log archiving, agent management, and long-term storage, focusing on operational scalability.
  • Designed and implemented high-throughput log ingestion pipelines using Kafka to handle increased data volume under load.
  • Developed gRPC-based agents to optimize log data transfer between endpoints and the central platform, improving efficiency and reliability.
  • Built backend services using Python and containerized all components with Docker for reproducible deployments.
  • Owned DevOps and operational responsibilities, including container orchestration, deployment workflows, and system reliability for on-prem installations.

Software Engineer & DevOps

AllConfig

İstanbul, Jul 2019 - Mar 2020
  • Led a small development team responsible for building an on-prem, microservice-based network configuration management system.
  • Designed and developed Python-based backend services, exposing APIs for managing and auditing network device configurations.
  • Containerized services using Docker and deployed them with Docker Swarm, focusing on service isolation and operational simplicity.
  • Automated network device configuration and management using Netmiko, gaining hands-on experience with network protocols, device configurations, and operational networking concepts.
  • Owned DevOps responsibilities, including service packaging, deployment workflows, and basic CI/CD practices for on-prem environments.
  • Worked under the technical leadership and mentorship of Okan Eke, with guidance and collaboration from Erkun Altunbaş and Gürkan Atabay, gaining practical experience in system design and operational decision-making.

Machine Learning Software Engineer

VeriUs

İstanbul, Jun 2018 - May 2019
  • Built and maintained data collection and preprocessing pipelines for NLP systems, including web scraping, parsing, normalization, tokenization, stemming, and noise filtering.
  • Developed and containerized Python-based REST APIs using Flask and Docker to serve NLP functionalities for internal and client-facing use.
  • Conducted model training and evaluation experiments for intent detection, text summarization, and noisy text classification, focusing on data quality and performance validation.
  • Collaborated with Associate Professor Dr. Murat Can Ganiz on academic NLP research and contributed to a peer-reviewed publication on Word Sense Disambiguation.
  • Worked with Aydın Gerek on the optimization of language models.

Certifications & Badges

Professional certifications and learning achievements

Trailblazer Technical Practitioner

Grafana Labs
Issued: Nov 2025
Grafana, Observability

Site Reliability Engineering (SRE) Foundation™ v1.1

DevOps Institute
Issued: Dec 2022
ID: 23833559
Troubleshooting, Observability, Site Reliability Engineering

End-to-End Visibility

Riverbed Technology
Troubleshooting, Observability, Network Performance, Application Performance

RCPE Associate: AppResponse

Riverbed Technology
Troubleshooting, Observability, Application Performance, APM

RCPE Associate: Introduction to NPM

Riverbed Technology
Troubleshooting, Observability, Network Performance, NPM

RCPE Certified Professional

Riverbed Technology
Troubleshooting, Observability, Network Performance, Application Performance

RCPE Foundation: Performance Foundations

Riverbed Technology
Troubleshooting, Observability, Performance Analysis, System Monitoring

RCPE Professional: AppResponse 11

Riverbed Technology
Troubleshooting, Observability, Application Performance, APM

RCPE Professional: Packet Analyzer Plus (PA+)

Riverbed Technology
Troubleshooting, Observability, Packet Analysis, Network Analysis