Wali Hasan — Site Reliability Engineer

About me

From Varanasi to mission-critical platforms

"You don't always need to have everything figured out. Sometimes growth comes from taking the next opportunity, embracing uncertainty, and being willing to learn along the way."

Mumbai, India · B.Tech CSE (AKTU) · CDAC DITISS Alumni · 5+ years in SRE · 13 July 1997

I grew up in Varanasi in a large family where resources were stretched thin — which, looking back, taught me to be resourceful before I ever touched a terminal. My parents' insistence on good education was the first investment in what became a career defined by ownership and reliability.

My path to engineering wasn't a straight line. After graduation, I didn't get placed through campus recruitment. I spent months preparing for government exams before realising it simply wasn't where my energy belonged. That honest self-assessment changed everything.

I enrolled at CDAC, earned a strong rank, survived a pandemic, and landed my first role. From bare-metal Kubernetes clusters to production systems serving millions of event-goers — every step has been about solving real problems, not just moving up a ladder.

Today I work at BookMyShow SEA, where reliability isn't a buzzword — it's a promise to millions of users trying to book the tickets they've been waiting months for. I own that promise end-to-end.

Career journey

April 2025 – Present

Site Reliability Engineer II

BookMyShow SEA · Mumbai

May 2023 – Mar 2025

Site Reliability Engineer I

BookMyShow SEA · Mumbai

Jul 2022 – Apr 2023

Engineer – DevOps

IndiaMART InterMESH · Noida

Feb 2021 – Jun 2022

Project Engineer

C-DAC Mumbai

2020

DITISS Program

Centre for Development of Advanced Computing

By the numbers

Impact that speaks

Not vanity metrics — real outcomes that affected real systems and real teams.

5+

Years of SRE experience

Bare metal → cloud-native

97%

Uptime SLO maintained

Consistently, across sprints and scale events

30%

Cloud cost reduction

Projected 45% with ongoing optimisation

~30%

Faster incident detection

Via observability stack improvements

10 min

Mean time to recovery

Across production incidents

"I take ownership of problems end-to-end. When something breaks at 2am and millions of users are trying to buy concert tickets — being the person teams can rely on is the whole job."

Professional experience

Where I've built things

Site Reliability Engineer II

BookMyShow SEA

May 2023 – Present

Mumbai

Owned end-to-end GCP infrastructure reliability, consistently delivering 97% uptime and hitting SLA/SLO targets across high-traffic events like live concert ticket drops.
Reduced cloud costs by 30% through cluster right-sizing and GKE workload optimisation — with a further 45% projected as improvements roll out.
Led scaling for high-traffic events with zero downtime, and ran blue-green deployments every sprint cycle.
Deployed production-grade HA Keycloak and RabbitMQ on GKE. Migrated ingress to NGINX with custom caching strategies for improved cache hit rates.
Built out observability using Prometheus, Grafana, and SigNoz — cutting incident detection time by ~30% and giving the team actionable dashboards instead of noise.
Worked across a wide surface: databases, IAM, security, compliance, container networking, and platform engineering — comfortable owning the unknown.

GCP GKE Terraform Prometheus Grafana SigNoz Keycloak RabbitMQ NGINX Ingress ArgoCD

Engineer – DevOps

IndiaMART InterMESH Ltd.

Jul 2022 – Apr 2023

Noida

Designed and deployed AWS EKS and private GKE clusters — my first experience seeing infrastructure decisions directly influence business outcomes at real scale.
Led end-to-end production deployment of a GeoIP application on EKS, managing the full lifecycle from architecture to cutover.
Built CI/CD pipelines using Devtron, and configured logging, HashiCorp Vault, and GitLab HA — contributing to Kubernetes adoption across the organisation.

AWS EKS GKE Devtron Vault GitLab HA CI/CD

Project Engineer

Centre for Development of Advanced Computing (C-DAC)

Feb 2021 – Jun 2022

Mumbai

Built and administered a multi-master bare-metal Kubernetes cluster from scratch for microservices — where I learned that infrastructure ownership means owning the hardware too.
Integrated a complete CI/CD ecosystem including GitLab, Jenkins, ArgoCD, Keycloak, and ELK stack — laying the foundation for everything that followed.
This role taught me the value of deep ownership. By consistently delivering and taking responsibility, I earned the trust to manage and support critical parts of the DevOps ecosystem.

Bare-metal K8s GitLab Jenkins ArgoCD Keycloak ELK Stack

Core expertise

What I work with

From wiring up a bare-metal cluster to tuning Prometheus alert rules at 1am — these are the tools I reach for and trust.

Cloud & Infrastructure

GCP AWS Terraform GKE EKS Cloud Cost Optimisation VPC / Networking IAM & Security

Kubernetes & Containers

Kubernetes Helm ArgoCD NGINX Ingress Docker Cluster Autoscaler Workload Identity Custom Resources

Observability & SRE

Prometheus Grafana SigNoz ELK Stack SLO/SLA Design Alerting Strategy Incident Response Runbooks

Middleware & Databases

Keycloak RabbitMQ Kafka PostgreSQL Redis Vault GitLab HA Jenkins

Featured work

Problems I've solved

01 / Featured

GKE Cloud Cost Optimisation Initiative

Challenge — Cloud spend growing faster than scale

Conducted a comprehensive audit of GKE workload resource requests vs actual usage. Identified systemic over-provisioning across namespaces — teams had set "safe" limits that no workload ever came close to hitting. Implemented right-sizing recommendations, node pool reconfiguration, and committed-use discount strategies. Result: 30% immediate reduction with a clear roadmap to 45%.

30% cloud spend reduction · 45% projected

GKE GCP Terraform Prometheus Cost Explorer

Architecture snapshot

GKE Cluster → Metrics Server → Prometheus
VPA Recommender → Resource Reports
Right-sized Requests → CUD Reservations

Data-driven optimisation loop
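
The core of that audit, comparing what workloads request against what they actually use, can be sketched in a few lines. This is a minimal illustration and not the tooling used in the initiative: the `Workload` fields, the choice of 95th-percentile usage, and the 20% headroom are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical per-workload summary; field names are illustrative."""
    name: str
    cpu_request_m: int    # CPU currently requested, in millicores
    cpu_p95_usage_m: int  # observed 95th-percentile CPU usage, in millicores


def recommend_request(w: Workload, headroom: float = 0.2) -> int:
    """Suggest p95 usage plus headroom, never more than the current request."""
    suggested = round(w.cpu_p95_usage_m * (1 + headroom))
    return min(suggested, w.cpu_request_m)


def savings_fraction(workloads: list[Workload], headroom: float = 0.2) -> float:
    """Fraction of total requested CPU freed by applying the recommendations."""
    before = sum(w.cpu_request_m for w in workloads)
    after = sum(recommend_request(w, headroom) for w in workloads)
    return (before - after) / before


# Made-up sample data: one heavily over-provisioned workload, one tight one.
fleet = [
    Workload("api", cpu_request_m=1000, cpu_p95_usage_m=250),
    Workload("worker", cpu_request_m=200, cpu_p95_usage_m=190),
]
for w in fleet:
    print(w.name, w.cpu_request_m, "->", recommend_request(w))
print(f"requested CPU freed: {savings_fraction(fleet):.0%}")
```

Capping the suggestion at the current request keeps the sketch strictly a right-sizing pass: it only ever shrinks requests, so an under-provisioned workload is left for a separate review.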

Production-grade HA Keycloak on GKE

Challenge — No identity layer for microservices

Designed and deployed a high-availability Keycloak cluster on GKE with PostgreSQL backend, Infinispan session cache, and NGINX Ingress TLS termination. Built Helm charts for reproducible deployments and documented the runbook for zero-downtime upgrades.

Zero-downtime auth layer in production

Keycloak GKE Helm PostgreSQL NGINX

Kubernetes Observability Platform

Challenge — Flying blind during incidents

Deployed a full observability stack integrating Prometheus, Grafana, and SigNoz across the GKE cluster. Designed alert rules that fire on SLO burn rate rather than raw thresholds — cutting alert noise dramatically. Built dashboards that made the on-call experience genuinely useful instead of stressful.

~30% faster incident detection

Prometheus Grafana SigNoz AlertManager GKE
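
For readers unfamiliar with burn-rate alerting, the idea is to page on how fast the error budget is being spent rather than on a raw error threshold. The sketch below shows the arithmetic only; the function names, the two-window check, and the example threshold are assumptions for illustration, not the alert rules deployed in this project.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0)
    slo: availability target as a fraction, e.g. 0.97
    """
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float) -> bool:
    """Page only if both windows burn above the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, which cuts pages for resolved blips.
    """
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


# With a 97% SLO the budget is 3% of requests, so a 6% error ratio
# spends the budget about twice as fast as the SLO allows.
print(burn_rate(0.06, 0.97))
print(should_page(0.06, 0.06, slo=0.97, threshold=2.0))
```

A burn rate of 1 means the budget lasts exactly the SLO period; higher values mean it runs out proportionally sooner, which is why a single threshold on burn rate can replace many per-metric thresholds.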

RabbitMQ Production Cluster

Challenge — Unreliable async messaging at scale

Deployed a production-grade RabbitMQ cluster on GKE using the Operator pattern, with quorum queues for durability and cluster-level monitoring integrated into Grafana. Designed the topology to survive node failures without message loss.

Reliable message delivery at event scale

RabbitMQ GKE Kubernetes Operator Prometheus
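
Quorum queues stay available as long as a majority of their replicas is reachable, which is what drives the node-count choice in a topology like this. The helper below just works through that majority arithmetic; the function names are hypothetical and the cluster sizes are examples, not the actual size of this deployment.

```python
def majority(nodes: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Replicas that can be lost while a majority still remains."""
    return nodes - majority(nodes)


for n in (2, 3, 4, 5):
    print(f"{n} nodes: majority={majority(n)}, "
          f"can lose {tolerated_failures(n)}")
```

Even-sized clusters tolerate no more failures than the odd size just below them, which is why odd replica counts are the usual choice.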

Platform Automation via Terraform

Challenge — Manual infrastructure = drift and toil

Codified GCP infrastructure end-to-end using Terraform — VPCs, GKE clusters, IAM bindings, firewall rules, and service accounts. Introduced modular repo structure enabling teams to provision environments consistently, reducing drift and eliminating the "works in staging" problem.

Full infrastructure as code coverage

Terraform GCP GKE IAM GitOps

Engineering philosophy

How I think about systems

Automate the repeatable

If you do something twice, write a script. If you do it three times, make it a service. Toil is a sign that the system needs work, not the human.

Reliability is a feature

It's not a constraint on shipping velocity. A system that works 99% of the time and ships fast is inferior to one that works 99.9% of the time and ships slightly slower.
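
To put numbers on that gap: over a 30-day window, 99% availability allows about 432 minutes of downtime, while 99.9% allows about 43 minutes. A small sketch of the arithmetic, with the 30-day window as an assumed reporting period:

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime budget, in minutes, for an availability target over a window.

    availability is a fraction, e.g. 0.999 for "three nines".
    """
    return (1.0 - availability) * window_minutes


for target in (0.99, 0.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.1%} availability: {minutes:.1f} minutes per 30 days")
```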

Observe before you act

Decisions without data are guesses. Good observability isn't about having dashboards — it's about knowing what a healthy system looks like and getting paged when it doesn't.

Simplicity scales; complexity hides

Every layer of abstraction you add is a layer someone else has to debug at 3am. The right architecture is the simplest one that actually works.

Infrastructure should empower devs

The best platform is one developers don't think about. My job is to make the hard infrastructure problems invisible so product teams can focus on the problems they were actually hired to solve.

Own it, document it, hand it over

True ownership means writing the runbook and training your replacement. Knowledge that lives only in your head is a single point of failure. Bus factor should be greater than one.

Beyond the terminal

The person behind the SRE

Engineering is what I do, but curiosity is who I am. Outside of work, I'm usually either travelling somewhere I've never been, watching a television series that has no right to be this good, or, more recently, getting embarrassingly competitive at pickleball.

I've started exploring writing — putting words together for the same reason I got into infrastructure: to make something that works. Maybe one day stand-up comedy too (the debugging skills transfer surprisingly well).

I'm from Varanasi. That city has a way of reminding you that not everything needs to be optimised. Some things just need to exist.

✈️

Travel

New places, new perspectives

🏓

Pickleball

Aggressively a beginner

📺

TV Series

Always in the middle of 3

✍️

Writing

Finding words for systems

"What started as a journey without a clear destination has evolved into a career built on curiosity, accountability, and continuous learning."

Wali Hasan

Currently exploring

Platform engineering patterns and Internal Developer Platforms

eBPF for deep Kubernetes observability

OpenTelemetry and distributed tracing at scale

Homelab experiments (because prod is where you learn)

Building reliable systems at scale


Let's build reliable systems together
