Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)

Summary

Location: Ontario
Type: full-time

About this role

We’re not just building better tech. We’re rewriting how data moves and what the world can do with it. With Confluent, data doesn’t sit still. Our platform puts information in motion, streaming in near real-time so companies can react faster, build smarter, and deliver experiences as dynamic as the world around them.

It takes a certain kind of person to join this team. Those who ask hard questions, give honest feedback, and show up for each other. No egos, no solo acts. Just smart, curious humans pushing toward something bigger, together.

One Confluent. One Team. One Data Streaming Platform.


About the Role:

Confluent Cloud processes millions of events per second across AWS, GCP, and Azure. When incidents happen in a multi-cloud streaming platform, they happen at scale—data in motion, exactly-once semantics, and cascading failure modes that require deep systems thinking. We need an expert-level engineer who can drive proactive reliability improvements that prevent these incidents before they occur.

This role combines hands-on technical work with strategic program ownership. You'll spend roughly 75% of your time on engineering: building automation, improving tooling, analyzing systemic failure patterns, and designing reliability improvements. The remaining 25% is teaching and coordination: coaching teams through post-mortems, training incident commanders, and evolving our incident response practices.

You'll be part of a global team with follow-the-sun coverage and clean handoffs that keep everyone working sustainable hours. This role sits within Cloud Architecture and Reliability - Supportability, a horizontal team that owns reliability standards and tooling across engineering. You're the person who makes us need incident management less.

What You Will Do:

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence

  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack

  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments

  • Own standards, practices, and continuous improvement of incident response across engineering

  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity

  • Develop and deliver training programs; coach teams through post-mortems

  • Partner with engineering leaders to elevate reliability practices org-wide


What You Will Bring:

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering

  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)

  • Experience navigating reliability/incident programs at organizations with 500+ engineers

  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)

  • Strong understanding of distributed systems and failure modes at scale

  • Deep experience with observability: metrics, logging, tracing

  • Kubernetes and container orchestration experience

  • Understanding of CI/CD pipelines and release processes

  • Strong written communication (design docs, runbooks, post-mortems)

  • Experience driving org-wide process and cultural changes

  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Ready to build what's next? Let’s get in motion.

Come As You Are

Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.

We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.

Other facts

Tech stack
Incident Management, Reliability Engineering, Cloud Computing, AWS, GCP, Azure, Distributed Systems, Observability, Kubernetes, CI/CD, Kafka, Event Streaming, Automation, Tooling, Training, Coaching

About Confluent

Confluent is pioneering a fundamentally new category of data infrastructure focused on data in motion. Our cloud-native offering is the foundational platform for data in motion, designed to be the intelligent connective tissue that lets real-time data from multiple sources stream constantly across the organization. With Confluent, our customers can meet the new business imperative of delivering rich, digital customer experiences and real-time business operations. Our mission is to help every organization harness data in motion so they can compete and thrive in the modern world.

Team size: 1,001-5,000 employees
Industry: Software Development
Founding Year: 2014

This role is hosted on Confluent's careers site.

Frequently Asked Questions

What does a Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) do at Confluent?

As a Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) at Confluent, you will analyze systemic failure patterns and design reliability improvements to prevent incident recurrence. You will also own incident management standards and practices, coach teams through post-mortems, and deliver training.

Why join Confluent as a Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada)?

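Confluent is a software development company, founded in 2014, whose cloud-native platform for data in motion lets organizations stream real-time data from multiple sources.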

Is the Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) position at Confluent remote?

The Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) position at Confluent is listed as remote within Canada, with the location given as Ontario. Contact the company through Clera for specific work arrangement details.

How do I apply for the Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) position at Confluent?

You can apply for the Staff Site Reliability Engineer - Incident Management & Reliability (Remote - Canada) position at Confluent directly through Clera. Click the "Apply Now" button above to start your application. Clera's AI-powered platform will help match your profile with this opportunity and guide you through the application process. You can also learn more about Confluent on their website.