Job Drop Berlin: Your Way Into Berlin Tech

Site Reliability Engineer

deepset
Seniority: Midweight
Model: Remote
Sector: Developer Tools
Salary: Undisclosed
Contract: Full-Time

About the role

We're hiring a Site Reliability Engineer to own and evolve deepset's cloud and customer infrastructure end to end. You'll work across SaaS, private cloud, and on-prem environments to make our self-hosted platform production-ready, drive CI/CD and GitOps maturity, and reduce complexity at scale. Your work will directly shape how deepset's AI platform is built, deployed, and scaled for our own cloud and for customers running it in their own environments.

What you'll do

  • Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).
  • Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks, not months.
  • Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.
  • Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.
  • Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.

What you'll need

  • 2-5 years of experience working with large-scale production infrastructure
  • Fluent German language skills
  • Experience with distributed or service-oriented architectures
  • Hands-on expertise with AWS, Kubernetes, CI/CD, and GitOps (e.g. ArgoCD)
  • Working knowledge of Infrastructure as Code (Terraform preferred)
  • Solid troubleshooting skills - you can debug across systems, not just within one layer
  • A pragmatic mindset: you balance speed, simplicity, and reliability
  • Ownership and accountability - you take responsibility for systems end-to-end

Nice to have

  • Familiarity with observability stacks (e.g. Datadog, Prometheus)
  • Experience optimizing cloud costs at scale
  • Interest or experience in Machine Learning / LLM systems
  • Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture

What they offer

  • Remote-first setup with flexible hours and tech of your choice
  • 30 days of vacation, plus extra days of family sick leave
  • Competitive salary and stock options for every team member
  • Monthly sports and mental health support allowance
  • Annual learning and development budget