Currently: SRE-II at BookMyShow SEA · Mumbai

Building reliable systems at scale

Site Reliability Engineer specialising in Kubernetes, cloud infrastructure, observability, and platform engineering. 5+ years of ownership, zero excuses.

Find me on

🚧 GitHub is currently being refreshed.

From Varanasi to mission-critical platforms

"You don't always need to have everything figured out. Sometimes growth comes from taking the next opportunity, embracing uncertainty, and being willing to learn along the way."

Mumbai, India
B.Tech CSE · AKTU
CDAC DITISS Alumni
5+ years in SRE
13 July 1997

I grew up in Varanasi in a large family where resources were stretched thin — which, looking back, taught me to be resourceful before I ever touched a terminal. My parents' insistence on good education was the first investment in what became a career defined by ownership and reliability.

My path to engineering wasn't a straight line. After graduation, I didn't get placed through campus recruitment. I spent months preparing for government exams before realising it simply wasn't where my energy belonged. That honest self-assessment changed everything.

I enrolled at CDAC, earned a strong rank, survived a pandemic, and landed my first role. From bare-metal Kubernetes clusters to production systems serving millions of event-goers — every step has been about solving real problems, not just moving up a ladder.

Today I work at BookMyShow SEA, where reliability isn't a buzzword — it's a promise to millions of users trying to book the tickets they've been waiting months for. I own that promise end-to-end.

April 2025 – Present
Site Reliability Engineer II
BookMyShow SEA · Mumbai
May 2023 – Mar 2025
Site Reliability Engineer I
BookMyShow SEA · Mumbai
Jul 2022 – Apr 2023
Engineer – DevOps
IndiaMART InterMESH · Noida
Feb 2021 – Jun 2022
Project Engineer
C-DAC Mumbai
2020
DITISS Program
Centre for Development of Advanced Computing

Impact that speaks

Not vanity metrics — real outcomes that affected real systems and real teams.

5+
Years of SRE experience
Bare metal → cloud-native
97%
Uptime SLO maintained
Consistently, across sprints and scale events
30%
Cloud cost reduction
Projected 45% with ongoing optimisation
~30%
Faster incident detection
Via observability stack improvements
10 min
Mean time to recovery
Across production incidents
"I take ownership of problems end-to-end. When something breaks at 2am and millions of users are trying to buy concert tickets — being the person teams can rely on is the whole job."

Where I've built things

Site Reliability Engineer II
BookMyShow SEA
May 2023 – Present
Mumbai
  • Owned end-to-end GCP infrastructure reliability, consistently delivering 97% uptime and hitting SLA/SLO targets across high-traffic events like live concert ticket drops.
  • Reduced cloud costs by 30% through cluster right-sizing and GKE workload optimisation — with a further 45% projected as improvements roll out.
  • Led high-traffic event scaling initiatives achieving zero downtime, alongside implementing blue-green deployments every sprint cycle.
  • Deployed production-grade HA Keycloak and RabbitMQ on GKE. Migrated ingress to NGINX with custom caching strategies for improved cache hit rates.
  • Built out observability using Prometheus, Grafana, and SigNoz — cutting incident detection time by ~30% and giving the team actionable dashboards instead of noise.
  • Worked across a wide surface: databases, IAM, security, compliance, container networking, and platform engineering — comfortable owning the unknown.
GCP · GKE · Terraform · Prometheus · Grafana · SigNoz · Keycloak · RabbitMQ · NGINX Ingress · ArgoCD
Engineer – DevOps
IndiaMART InterMESH Ltd.
Jul 2022 – Apr 2023
Noida
  • Designed and deployed AWS EKS and private GKE clusters — my first experience seeing infrastructure decisions directly influence business outcomes at real scale.
  • Led end-to-end production deployment of a GeoIP application on EKS, managing the full lifecycle from architecture to cutover.
  • Built CI/CD pipelines using Devtron, and configured logging, HashiCorp Vault, and GitLab HA — contributing to Kubernetes adoption across the organisation.
AWS EKS · GKE · Devtron · Vault · GitLab HA · CI/CD
Project Engineer
Centre for Development of Advanced Computing (C-DAC)
Feb 2021 – Jun 2022
Mumbai
  • Built and administered a multi-master bare-metal Kubernetes cluster from scratch for microservices — where I learned that infrastructure ownership means owning the hardware too.
  • Integrated a complete CI/CD ecosystem including GitLab, Jenkins, ArgoCD, Keycloak, and ELK stack — laying the foundation for everything that followed.
  • This role taught me the value of deep ownership. By consistently delivering and taking responsibility, I earned the trust to manage and support critical parts of the DevOps ecosystem.
Bare-metal K8s · GitLab · Jenkins · ArgoCD · Keycloak · ELK Stack

What I work with

From wiring up a bare-metal cluster to tuning Prometheus alert rules at 1am — these are the tools I reach for and trust.

Cloud & Infrastructure
GCP · AWS · Terraform · GKE · EKS · Cloud Cost Optimisation · VPC / Networking · IAM & Security
Kubernetes & Containers
Kubernetes · Helm · ArgoCD · NGINX Ingress · Docker · Cluster Autoscaler · Workload Identity · Custom Resources
Observability & SRE
Prometheus · Grafana · SigNoz · ELK Stack · SLO/SLA Design · Alerting Strategy · Incident Response · Runbooks
Middleware & Databases
Keycloak · RabbitMQ · Kafka · PostgreSQL · Redis · Vault · GitLab HA · Jenkins

Problems I've solved

02

Production-grade HA Keycloak on GKE

Challenge — No identity layer for microservices

Designed and deployed a high-availability Keycloak cluster on GKE with a PostgreSQL backend, an Infinispan session cache, and TLS termination at NGINX Ingress. Built Helm charts for reproducible deployments and documented the runbook for zero-downtime upgrades.

Zero-downtime auth layer in production
Keycloak · GKE · Helm · PostgreSQL · NGINX
03

Kubernetes Observability Platform

Challenge — Flying blind during incidents

Deployed a full observability stack integrating Prometheus, Grafana, and SigNoz across the GKE cluster. Designed alert rules that fire on SLO burn rate rather than raw thresholds — cutting alert noise dramatically. Built dashboards that made the on-call experience genuinely useful instead of stressful.

~30% faster incident detection
Prometheus · Grafana · SigNoz · AlertManager · GKE
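
The difference between burn-rate alerting and raw thresholds can be shown with a minimal sketch. This is not the production rule set: the `Window` type, the function names, the 99.9% target, and the 14.4 threshold are all illustrative assumptions. The idea is to compare the observed error ratio against the error budget the SLO allows, and to page only when both a long and a short lookback window are burning budget too fast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target            # allowed error ratio
    observed = window.errors / window.total    # actual error ratio
    return observed / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float) -> bool:
    """Page only when both windows burn budget faster than the threshold."""
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# Illustrative numbers only: a sustained problem pages, a short blip does not.
SLO = 0.999
sustained = should_page(Window(100_000, 2_000), Window(10_000, 300), SLO, 14.4)
blip = should_page(Window(100_000, 500), Window(10_000, 300), SLO, 14.4)
print(sustained, blip)
```

Requiring both windows to agree is what cuts the noise: the long window proves the problem is significant, and the short window proves it is still happening.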
04

RabbitMQ Production Cluster

Challenge — Unreliable async messaging at scale

Deployed a production-grade RabbitMQ cluster on GKE using the Operator pattern, with quorum queues for durability and cluster-level monitoring integrated into Grafana. Designed the topology to survive node failures without message loss.

Reliable message delivery at event scale
RabbitMQ · GKE · Kubernetes Operator · Prometheus
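
Quorum queues replicate each queue using Raft, so a queue stays available as long as a majority of its replicas is reachable. The sketch below shows only that arithmetic. The function names are hypothetical and the replica counts are examples, since the actual cluster size is not stated here.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority still remains."""
    return replicas - majority(replicas)


# Example replica counts (assumed, not taken from the deployment).
for n in (3, 4, 5):
    print(f"{n} replicas: majority {majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Three replicas tolerate one failure and five tolerate two. Four tolerate only one, which is why odd replica counts are the usual choice.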
05

Platform Automation via Terraform

Challenge — Manual infrastructure = drift and toil

Codified GCP infrastructure end-to-end using Terraform: VPCs, GKE clusters, IAM bindings, firewall rules, and service accounts. Introduced a modular repo structure that lets teams provision environments consistently, reducing drift and eliminating the "works in staging" problem.

Full infrastructure as code coverage
Terraform · GCP · GKE · IAM · GitOps

How I think about systems

Automate the repeatable

If you do something twice, write a script. If you do it three times, make it a service. Toil is a sign that the system needs work, not the human.

Reliability is a feature

It's not a constraint on shipping velocity. A system that works 99% of the time and ships fast is inferior to one that works 99.9% of the time and ships slightly slower.
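
To put numbers on that gap, here is a small worked calculation. It assumes a 30-day window and uses a hypothetical helper name.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime an availability target permits over `days` days."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    total_minutes = days * 24 * 60
    return (1.0 - availability) * total_minutes


# Compare the two targets mentioned above over an assumed 30-day window.
for target in (0.99, 0.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.1%}: {minutes:.1f} min per 30 days")
```

Over 30 days, 99% allows about 432 minutes (7.2 hours) of downtime, while 99.9% allows about 43 minutes.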

Observe before you act

Decisions without data are guesses. Good observability isn't about having dashboards — it's about knowing what a healthy system looks like and getting paged when it doesn't.

Simplicity scales; complexity hides

Every layer of abstraction you add is a layer someone else has to debug at 3am. The right architecture is the simplest one that actually works.

Infrastructure should empower devs

The best platform is one developers don't think about. My job is to make the hard infrastructure problems invisible so product teams can focus on the problems they were actually hired to solve.

Own it, document it, hand it over

True ownership means writing the runbook and training your replacement. Knowledge that lives only in your head is a single point of failure. Bus factor should be greater than one.

The person behind the SRE

Engineering is what I do — but curiosity is who I am. Outside of work, I'm usually either travelling somewhere I've never been, watching a television series that has no right to be this good, or recently, getting embarrassingly competitive at pickleball.

I've started exploring writing — putting words together for the same reason I got into infrastructure: to make something that works. Maybe one day stand-up comedy too (the debugging skills transfer surprisingly well).

I'm from Varanasi. That city has a way of reminding you that not everything needs to be optimised. Some things just need to exist.

✈️
Travel
New places, new perspectives
🏓
Pickleball
Aggressively a beginner
📺
TV Series
Always in the middle of 3
✍️
Writing
Finding words for systems
"What started as a journey without a clear destination has evolved into a career built on curiosity, accountability, and continuous learning."
Wali Hasan

Currently exploring

Platform engineering patterns and Internal Developer Platforms
eBPF for deep Kubernetes observability
OpenTelemetry and distributed tracing at scale
Homelab experiments (because prod is where you learn)

Let's build reliable systems together

Whether it's a conversation about SRE practices, a role you think I'd be great at, or just an interesting infrastructure problem — my inbox is open.

Say hello

Mumbai, India · Open to remote and relocation