Hi, I'm Sibasis — Staff Site Reliability Engineer & Platform Engineer

Based in Berlin, Germany

I am a Staff Site Reliability Engineer with 12+ years of experience designing, operating, and evolving production infrastructure for distributed systems at global scale. That work spans large-scale Kubernetes platform engineering, cloud architecture, high-stakes incident command, and reliability practice in hypergrowth environments.

At Delivery Hero's Tech Foundations Domain, I own the reliability, scalability, and operational posture of a Kubernetes platform processing 5M+ API requests per hour across 100+ production clusters. I define SLOs and error budgets with engineering teams, lead production readiness reviews, and drive incident management from detection through post-mortem. I have contained some of the highest revenue-impact failures in the company's history, including AWS region outages and cluster-wide IP exhaustion.

I actively integrate agentic AI into operational workflows, building LLM-powered tooling for alert correlation, incident summarization, and autonomous runbook execution. I build platforms that survive production, not just pass a load test.

100+ K8s clusters in production
200+ avg nodes per cluster
5M+ API requests per hour
12+ years in production infra
$M+ revenue saved via incident response

Work Experience

Technical Skills

Kubernetes & Container Orchestration: EKS · kOps · Cluster Lifecycle Automation · Fleet Management (100+ clusters) · Helm · Custom Controllers · AWS ECS
Cloud Platform: AWS (EC2, VPC, IAM, EKS, RDS, S3, Lambda, Route53, Transit Gateway, Direct Connect) · Multi-region architecture · Cost Optimization
Infrastructure as Code & GitOps: Terraform · CloudFormation · ArgoCD · FluxCD · GitOps workflows at fleet scale
Site Reliability Engineering: Incident Command · Capacity Planning · SLO/SLI/Error Budget frameworks · Chaos Engineering · Runbook Automation · Post-mortem culture
Observability & Monitoring: Datadog · Prometheus (remote write, high-cardinality tuning) · Loki (log ingestion at scale) · SLO/SLI definition & burn-rate alerting · Structured log pipelines (ELK / OpenSearch) · Distributed tracing (OpenTelemetry) · Platform-wide dashboarding (Grafana) · Alert quality & noise reduction · MTTD/MTTR optimization · Sysdig · Dynatrace · AWS CloudWatch
Networking & Security: VPC Design · CIDR Planning · IPv4 Exhaustion Mitigation · Network Policies · AWS GuardDuty · CloudTrail · Cloud Custodian
CI/CD & Developer Platforms: GitHub Actions · Jenkins · Drone CI · Spinnaker · Concourse · Bamboo · AWS CodePipeline · Internal Developer Platform design
Databases & Messaging: AWS RDS · Aurora (PostgreSQL / MySQL) · DynamoDB · ElastiCache (Redis / Memcached) · Kafka
Agentic AI & LLM Tooling: AI-assisted incident triage · LLM-powered runbook automation · MCP server design for infrastructure APIs · Agentic observability workflows · AI-augmented anomaly detection
Languages & Scripting: Go · Python · Bash · JavaScript
API & Service Mesh: Kong · AWS API Gateway · MuleSoft · Istio · Service mesh observability

Open Source Projects

 pgview              │  <Esc> back  <g> top  <G> bottom  │  Data  public.routes
 admin@mydb · local  │  <d> describe  <f> row view/edit  │  42 rows  ~1.2K est · PK: id
─────────────────────────────────────────────────────────────────────────────────────────
  id    name              status    created_at           tags
▶ 1     Alice Johnson     active    2024-01-15 09:23:11  {platform,growth}
  3     Carol White       active    2024-03-19 11:02:44  {platform,api}
  7     Eve Martinez      active    2024-05-01 16:14:09  {growth}

 WHERE "status"::text ILIKE 'active'

Documentation    View on GitHub →

Hire Me

Download Resume (PDF)