Site Reliability Engineer
Flip
Seniority: Midweight
Model: In-Office
Sector:
Salary: Undisclosed
Contract: Full-Time
About the role
As a Site Reliability Engineer in the Platform Squad, you will play a key part in keeping Flip's infrastructure fast, resilient and ready to scale. You'll shape the reliability culture, tooling and practices that allow engineering teams to ship with confidence, at scale and without compromising availability.
What you'll do
- Expand and optimize the Azure cloud infrastructure and Kubernetes clusters, which are designed for high throughput and high availability, to support Flip's rapid global growth.
- Design and implement zero-downtime deployments, rollback mechanisms and disaster-recovery strategies that keep the platform available around the clock.
- Evolve the LGTM stack (Loki, Grafana, Tempo, Mimir) to give every team the visibility they need, and use it to define and optimize SLOs.
- Design, develop and optimize infrastructure as code with Pulumi in Go, eliminating toil and making the platform self-service for engineering teams.
- Promote CI/CD best practices, incident management, post-mortems and developer experience across the entire engineering organization.
- Collaborate with your squad and engineering leadership to define the platform's direction, from scalable, high-throughput systems and cost optimization to security posture and compliance.
What you'll need
- 1–3 years of hands-on experience as a Site Reliability Engineer, Platform Engineer, DevOps Engineer, Infrastructure Engineer, Cloud Engineer, or Backend Engineer with a strong infrastructure focus.
- Experience operating and scaling cloud infrastructure (Azure, GCP or AWS).
- Deep knowledge of Kubernetes and container orchestration in production environments.
- Hands-on experience with modern observability stacks (e.g. Prometheus, Mimir, Loki, ELK) and confidence in defining and operating SLOs and error budgets.
- Solid software development skills in Go (preferred), Python or Kotlin.
- Hands-on experience with infrastructure as code (e.g. Pulumi, OpenTofu, Terraform) and configuration tooling (e.g. Ansible, Chef).
- Collaborative mindset, strong communication skills and business-fluent English.
- Willingness to participate in on-call rotations to ensure the reliability of the platform.
Nice to have
- Experience building and operating high-throughput, highly available systems in production.
- Experience with Azure Kubernetes Service (AKS) specifically.
- Experience with Kubernetes Gateway API and Envoy Gateway.
- Familiarity with GitOps workflows and CI/CD pipeline design.
- Knowledge of service mesh technologies (e.g. Linkerd, Istio).
- Experience operating high-availability PostgreSQL.

