This role is for one of Weekday's clients.
Min Experience: 3 years
Location: Bengaluru
Job Type: Full-time
We are looking for a highly driven Technical Lead to work across a multi-product SaaS platform, owning system reliability, scalability, and technical execution. This is a horizontal leadership role spanning multiple products and core systems, ensuring platforms remain fast, secure, and resilient under scale and peak traffic conditions.
It is hands-on, focused on architecture, reliability, and execution rather than people management.
Key Responsibilities
1. System Reliability & Performance (Primary Ownership)
- Own and improve reliability metrics across products, including uptime, SLAs, and latency (P95).
- Monitor and reduce application errors, bug leakage, and system failures.
- Ensure correctness of distributed systems involving synchronous and asynchronous workflows.
- Optimize queue processing, worker throughput, and caching layers (e.g., Redis).
- Prepare systems for high-traffic events and peak load scenarios.
- Lead root cause analysis and drive permanent, systemic fixes.
- Act as the technical owner for incident resolution and long-term prevention.
2. Architecture & Scalability
- Collaborate with senior technical stakeholders to evolve platform architecture.
- Improve API design, data models, and system boundaries.
- Design scalable distributed system patterns such as idempotent workflows, retries, batching, and fan-out orchestration.
- Build and scale asynchronous pipelines for high-volume workloads.
- Plan capacity for traffic spikes and introduce resilience patterns like circuit breakers and fail-safes.
3. Hands-On Engineering Leadership
- Lead and review technical designs across teams and products.
- Unblock engineers on complex architectural or performance challenges.
- Own and drive cross-product refactors and technical debt reduction.
- Enforce clean code standards, testing practices, and observability-first development.
- Mentor engineers on debugging, system design, and performance optimization.
4. Observability & Monitoring
- Define and maintain SLIs and SLOs across critical systems.
- Build dashboards, alerts, and monitoring using logs, metrics, and traces.
- Ensure issues are detected proactively, before they impact users.
- Work closely with platform teams to instrument distributed workflows end-to-end.
5. Security & Compliance
- Ensure secure coding practices and adherence to compliance requirements (e.g., SOC 2).
- Enforce proper secrets management, access controls, and audit logging.
- Maintain data integrity, API security, and permission correctness across systems.
6. Cross-Functional Collaboration
- Partner with Product teams to translate requirements into technically sound solutions.
- Work with Support and Customer Success teams to deeply understand production issues.
- Collaborate with Core Systems and Infrastructure teams to improve platform stability.
- Align with QA teams to define testing strategies, including load, integration, and failure testing.
Requirements
Must Have
- At least 3 years of backend engineering experience (Python preferred).
- Strong understanding of distributed systems and backend architecture.
- Deep experience with SQL databases, data modeling, and query optimization.
- Hands-on expertise with Redis, queues, async jobs, retries, and background processing.
- Strong debugging skills across application and infrastructure layers.
- Proven ability to lead technical decisions across multiple teams.
- Experience improving system reliability and performance at scale.
- Excellent communication and collaboration skills.
Nice to Have
- Experience with observability tools such as Datadog, Sentry, or Elasticsearch.
- Exposure to CRM integrations or large enterprise systems.
- Prior ownership of reliability for multi-product SaaS platforms.
- Familiarity with secure coding practices and compliance frameworks.
What Success Looks Like
0–3 Months
- Gain a deep understanding of platform architecture and core systems.
- Deliver quick reliability and performance improvements.
- Become a go-to technical problem solver across teams.
4–6 Months
- Establish clear SLIs and SLOs for key systems.
- Introduce architectural guardrails and reduce operational noise.
- Significantly reduce error rates and the number of production issues.
7–12 Months
- Achieve high availability (99.9%+) across core platforms.
- Ensure predictable and resilient async pipelines.
- Improve performance under peak traffic conditions.
- Enable faster engineering velocity through cleaner, more stable systems.
Skills
- Backend Engineering
- Distributed Systems
- System Reliability
- Relational Databases
- Platform Scalability