Site Reliability Engineer

Flip

Seniority

Midweight

Model

In-Office

Sector

B2B SaaS

Salary

Undisclosed

Contract

Full-Time

As a Site Reliability Engineer in the Platform Squad, you will be a key player in keeping Flip's infrastructure fast, resilient and ready to scale. You'll shape the reliability culture, tooling and practices that allow engineering teams to ship with confidence - at scale and without compromising availability.
What you'll do
Expand and optimize cloud infrastructure on Azure and Kubernetes clusters designed for high throughput and high availability to support Flip's rapid growth across the globe.
Design and implement zero-downtime deployments, rollback mechanisms and disaster-recovery strategies that keep the platform available around the clock.
Evolve the LGTM stack (Loki, Grafana, Tempo, Mimir) to give every team the visibility they need and use it to define and optimize SLOs.
Design, develop and optimize infrastructure as code with Pulumi in Go, eliminating toil and making the platform self-service for engineering teams.
Promote CI/CD best practices, incident management, post-mortems and developer experience across the entire engineering organization.
Collaborate with your squad and engineering leadership to define the platform's direction - from scalable, high-throughput systems and cost optimization to security posture and compliance.
What you'll need
1–3 years of hands-on experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
Experience operating and scaling cloud infrastructure (Azure, GCP, AWS).
Deep knowledge of Kubernetes and container orchestration in production environments.
Hands-on experience with modern observability stacks (e.g. Prometheus, Mimir, Loki, ELK), and confidence in defining and operating SLOs and error budgets.
Solid software development skills in Go (preferred), Python or Kotlin.
Hands-on experience with infrastructure as code (e.g. Pulumi, OpenTofu, Terraform) and configuration tooling (e.g. Ansible, Chef).
Collaborative mindset, strong communication skills and business-fluent English.
Willingness to participate in on-call rotations to ensure the reliability of the platform.
Nice to have
Experience building and operating high-throughput, highly available systems in production.
Experience with Azure Kubernetes Service (AKS) specifically.
Experience with Kubernetes Gateway API and Envoy Gateway.
Familiarity with GitOps workflows and CI/CD pipeline design.
Knowledge of service mesh technologies (e.g. Linkerd, Istio).
Experience operating highly available PostgreSQL.
