
You are the infrastructure expert who enables our rapid product development and guarantees 99.9%+ stability and performance of our clinical AI platform for major health systems. Your focus on operational excellence is directly tied to a patient’s access to life-saving treatment.
As our SRE, you will be responsible for our entire production environment and will improve the development experience across both infrastructure and application reliability.
Infrastructure Responsibilities
Infrastructure Ownership: Design, implement, and maintain the production environment, drawing on your prior experience handling 500+ machine deployments.
Kubernetes Mastery: Own our containerized infrastructure, leveraging deep expertise in Kubernetes and Helm to manage deployment, scaling, and operational health.
CI/CD & Deployment Optimization: Optimize and streamline both the TypeScript and Python/ML deployment pipelines to support high-velocity feature releases while maintaining the highest reliability.
DevX Support: Support Developer Experience (DevX) work to streamline developer workflows, enhance tool proficiency, and improve CI/CD systems.
Infrastructure as Code (IaC): Manage and maintain infrastructure definitions using Terraform.
Application Reliability Responsibilities
Service Reliability Strategy (SLIs/SLOs): Define, implement, and evolve SLIs and SLOs for existing and new services; partner with engineering and product to align targets with patient and customer impact.
Observability Standards: Extend and standardize in-application observability by introducing consistent metrics, OpenTelemetry traces, and events/logs (including naming conventions, required attributes, and dashboards/alerts).
SLO-Driven Operations: Monitor reliability and performance against SLOs; respond to SLO violations by driving corrective actions and validating outcomes (not just mitigating symptoms).
Performance & Scalability Improvements: Use trace and metrics data to identify bottlenecks (e.g., slow endpoints, expensive queries, queue backlogs) and implement or drive improvements in application code and configurations.
Database & Dependency Optimization: Improve reliability and latency through targeted changes such as database indexing, query optimization, connection pooling, caching strategies, and safe dependency timeouts/retries.
Incident Learnings to Engineering Outcomes: Lead or participate in incident response and post-incident reviews with a bias toward durable fixes: shipping instrumentation, tests, guardrails, and code changes that prevent recurrence.
The ideal candidate for this role enjoys both working on infrastructure configuration and contributing performance and reliability enhancements to application code. You have likely spent several years as a full-time backend software engineer, have also contributed heavily to Terraform IaC projects, and have deep knowledge of deploying and running applications on Kubernetes.