Hi, I'm Sibasis — Staff Site Reliability Engineer & Platform Engineer

Based in Berlin, Germany

Staff Site Reliability Engineer with 12+ years of experience designing, operating, and evolving production infrastructure for distributed systems at global scale — spanning large-scale Kubernetes platform engineering, cloud architecture, high-stakes incident command, and reliability practice across hypergrowth environments. At Delivery Hero's Tech Foundations Domain, I own the reliability, scalability, and operational posture of a Kubernetes platform processing 5M+ API requests per hour across 100+ production clusters. I define SLOs and error budgets with engineering teams, lead production readiness reviews, and drive incident management from detection through post-mortem — having contained some of the highest revenue-impact failures in the company's history, including AWS region outages and cluster-wide IP exhaustion. I actively integrate agentic AI into operational workflows — building LLM-powered tooling for alert correlation, incident summarization, and autonomous runbook execution. I build platforms that survive production, not just pass a load test.

Work
Experience

Dec 2021 — Present
Delivery Hero SE, Berlin
Staff Systems Engineer · Tech Foundations Domain

Own and operate 100+ production Kubernetes clusters averaging 200+ nodes each, running thousands of workloads sustaining 5M+ API requests/hour — the infrastructure backbone for one of Europe's largest on-demand logistics platforms.
Drive re-architecture and redesign of critical platform systems, delivering multi-fold improvements in reliability, availability, and operational efficiency across the fleet.
Lead high-stakes incident response for the highest revenue-impact events, including AWS region outages and cluster-level IPv4 exhaustion, containing incidents that put millions in revenue at risk.
AWS Region Outage · IP Exhaustion · Multi-Cluster Failure
Managed and executed EKS version upgrades across the entire 100+ cluster fleet from EKS 1.21 through 1.32 — coordinating control plane and node group upgrades at scale while maintaining workload availability throughout each upgrade cycle.
Developed an internal upgrade automation tool that cut the end-to-end lead time for a cluster and node upgrade from 2 weeks (constrained to scheduled maintenance windows) to 2 hours: a 10× throughput improvement that decoupled upgrade velocity from the maintenance calendar and eliminated the upgrade backlog across the fleet.
Design and own the observability strategy across the cluster fleet — defining platform-level SLIs and SLOs, implementing burn-rate alerting, and maintaining dashboards that serve both on-call engineers and engineering leadership across 100+ clusters.
Worked extensively to sustain metrics and log ingestion at platform scale using Prometheus and Loki — solving high-cardinality ingestion bottlenecks, tuning remote write pipelines, managing label explosion across thousands of workloads, and keeping the observability stack itself reliable under the same traffic pressure it was built to monitor.
Drive signal quality over noise — refine alert thresholds and routing across the fleet so that pages are actionable, reduce mean time to detect (MTTD), and build runbooks backed by observability data that make incident response faster and less dependent on tribal knowledge.
Architect and implement cluster lifecycle automation, capacity planning, and reliability engineering frameworks for a fleet of 100+ production clusters.
Lead production readiness reviews and launch assessments for new services and significant platform changes — evaluating failure modes, scalability limits, observability coverage, and rollback posture before workloads reach production at scale.
Define and maintain SLOs and error budgets collaboratively with product and engineering teams, translating reliability requirements into actionable targets and burn-rate alerts that inform both on-call response and prioritization decisions.
Train and onboard every junior engineer joining the team, providing structured knowledge transfer on the infrastructure topology, operational caveats, failure patterns unique to the platform, and the institutional context that takes years to accumulate. This helps new team members become effective contributors quickly instead of learning the hard lessons through production incidents.
Actively integrating agentic AI into operational workflows — building LLM-powered tooling for automated alert correlation, incident summarization, and runbook execution; exploring MCP server interfaces that expose fleet and observability APIs to AI agents for autonomous diagnosis and remediation.
Partner with product and platform teams across the Foundations Domain to define infrastructure strategy, set reliability standards, and eliminate systemic toil through automation.
Drive GitOps adoption and standardization across the cluster fleet using Terraform and AWS-native tooling.
Drive fleet-wide cost optimization across 100+ clusters through deep cluster utilization analysis — right-sizing node groups, improving workload bin-packing, and designing fault-tolerant topologies that maximize resource efficiency without sacrificing availability guarantees.
Establish cross-cluster cost visibility and resource attribution to surface waste early, enabling engineering teams to own and optimize their own cloud spend.

Oct 2019 — Nov 2021
Cloudreach GmbH, Berlin
Cloud Systems Developer · Consulting for VW Financial Services

Architected and delivered a highly available, multi-region API platform on AWS for VW Financial Services, serving as the cloud development foundation for a major European automotive finance enterprise.
Took ownership of DevOps processes, infrastructure-as-code, and SDLC tooling — accelerating delivery velocity and reducing deployment risk across the platform.
Conducted AWS cost optimization across the platform — identifying over-provisioned compute and database instances, implementing Reserved Instance and Savings Plans coverage, and establishing tagging hygiene for per-service cost attribution across the API estate.
Designed monitoring and alerting for the API platform — establishing per-service latency, error rate, and throughput baselines on Datadog and Dynatrace, enabling the team to detect regressions before they reached end users.
Developed reusable libraries and internal tooling that shortened onboarding time for platform contributors.

Jul 2017 — Aug 2019
Tata Consultancy Services, Bergen
Cloud Consultant · Consulting for DNB (Nordic Bank)

Delivered a greenfield build of a customer-centric banking application on AWS for one of the largest Nordic banks, from infrastructure provisioning through production go-live.
Led CI/CD initiatives across multiple workstreams (frontend and backend), Well-Architected Review execution, and a cross-project cost optimization program post-launch.

Mar 2014 — Jun 2017
Tata Consultancy Services, Frankfurt
DevOps Engineer · Consulting for Deutsche Börse AG

Set up automation, test suites, and CI/CD pipelines for a new product initiative at Deutsche Börse — a high-compliance financial exchange environment with strict operational standards.
Owned operational support and patch management for production workloads throughout the engagement.

Professional
Certificates

Technical
Skills

Kubernetes & Container Orchestration: EKS · kOps · Cluster Lifecycle Automation · Fleet Management (100+ clusters) · Helm · Custom Controllers · AWS ECS

Cloud Platform: AWS (EC2, VPC, IAM, EKS, RDS, S3, Lambda, Route53, Transit Gateway, Direct Connect) · Multi-region architecture · Cost Optimization

Infrastructure as Code & GitOps: Terraform · CloudFormation · ArgoCD · FluxCD · GitOps workflows at fleet scale

Site Reliability Engineering: Incident Command · Capacity Planning · SLO/SLI/Error Budget frameworks · Chaos Engineering · Runbook Automation · Post-mortem culture

Observability & Monitoring: Datadog · Prometheus (remote write, high-cardinality tuning) · Loki (log ingestion at scale) · SLO/SLI definition & burn-rate alerting · Structured log pipelines (ELK / OpenSearch) · Distributed tracing (OpenTelemetry) · Platform-wide dashboarding (Grafana) · Alert quality & noise reduction · MTTD/MTTR optimization · Sysdig · Dynatrace · AWS CloudWatch

Networking & Security: VPC Design · CIDR Planning · IPv4 Exhaustion Mitigation · Network Policies · AWS GuardDuty · CloudTrail · Cloud Custodian

CI/CD & Developer Platforms: GitHub Actions · Jenkins · Drone CI · Spinnaker · Concourse · Bamboo · AWS CodePipeline · Internal Developer Platform design

Databases & Messaging: AWS RDS · Aurora (PostgreSQL / MySQL) · DynamoDB · ElastiCache (Redis / Memcached) · Kafka

Agentic AI & LLM Tooling: AI-assisted incident triage · LLM-powered runbook automation · MCP server design for infrastructure APIs · Agentic observability workflows · AI-augmented anomaly detection

Languages & Scripting: Go · Python · Bash · JavaScript

API & Service Mesh: Kong · AWS API Gateway · MuleSoft · Istio · Service mesh observability

Open Source
Projects

2026 — present
pgview
A lightweight, keyboard-driven PostgreSQL browser for the terminal — built in Go using the same TUI framework as k9s. Connect to any PostgreSQL-compatible endpoint and navigate tables, filter rows, inspect schemas, and run SQL queries without leaving the terminal.

Features
k9s-style TUI · Smart filter DSL (col=val, col=%sub%, array & JSONB element-wise) · SQL editor with schema-aware Tab-completion · SQL templates panel (Query / Write / DDL pre-filled with real column names) · Row viewer with inline editing and UPDATE commit · Mouse & touchpad scroll (vertical + horizontal) · Query history panel · Multi-arch binary releases (linux/darwin amd64+arm64)

Connect anywhere
Direct host:port · PgBouncer · AWS RDS Proxy · SSH tunnel · Full postgres:// DSN

Stack
Go 1.22 · tview · tcell · pgx

 pgview              │  <Esc> back  <g> top  <G> bottom  │  Data  public.routes
 admin@mydb · local  │  <d> describe  <f> row view/edit  │  42 rows  ~1.2K est · PK: id
─────────────────────────────────────────────────────────────────────────────────────────
  id    name              status    created_at           tags
▶ 1     Alice Johnson     active    2024-01-15 09:23:11  {platform,growth}
  3     Carol White       active    2024-03-19 11:02:44  {platform,api}
  7     Eve Martinez      active    2024-05-01 16:14:09  {growth}

 WHERE "status"::text ILIKE 'active'
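
The WHERE clause in the screenshot above hints at how a col=val filter could map to SQL. The following is a minimal sketch in Go, not pgview's actual implementation: buildWhere and quoteIdent are hypothetical names, and binding the value as $1 instead of inlining it into the statement is an assumption made here to keep user input out of the SQL text.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteIdent double-quotes a PostgreSQL identifier and escapes any
// embedded double quotes by doubling them. (Hypothetical helper.)
func quoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// buildWhere turns a "col=val" filter into a parameterized WHERE clause.
// The column is cast to text and matched with ILIKE, so "%" in the value
// acts as a wildcard (e.g. "name=%john%" becomes a substring match).
// The value is returned as a bind argument rather than inlined.
func buildWhere(filter string) (string, []any, error) {
	col, val, ok := strings.Cut(filter, "=")
	col, val = strings.TrimSpace(col), strings.TrimSpace(val)
	if !ok || col == "" {
		return "", nil, fmt.Errorf("invalid filter %q: want col=val", filter)
	}
	clause := fmt.Sprintf("WHERE %s::text ILIKE $1", quoteIdent(col))
	return clause, []any{val}, nil
}

func main() {
	for _, f := range []string{"status=active", "name=%john%"} {
		clause, args, err := buildWhere(f)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(clause, args)
	}
}
```

With a driver such as pgx, the returned clause and arguments could be appended to a SELECT statement and passed to a query call. Array and JSONB element-wise filters would need their own translation and are not covered by this sketch.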

