About this role
<p><strong>Job Overview:</strong></p>
<p>Drive reliability and operational maturity for <strong data-start="510" data-end="541">Kubernetes workloads on GKE</strong> through safe rollout patterns, high-signal observability, resilient IaC, and effective incident response. Collaborate with developers to harden CI/CD pipelines and address infrastructure concerns within application code.</p>
<p><strong>Key responsibilities:</strong></p>
<ul>
<li>Design and maintain resilient deployment patterns (blue-green, canary, GitOps syncs) across services.</li>
<li>Instrument and optimize logs, metrics, traces, and alerts to reduce noise and improve signal.</li>
<li>Review backend code (e.g., Django, Node.js, Go, Java) with a focus on infra touchpoints like database usage, timeouts, error handling, and memory consumption.</li>
<li>Tune and troubleshoot GKE workloads, HPA configs, network policies, and node pool strategies.</li>
<li>Improve or author Terraform modules for infrastructure resources (e.g., VPC, Cloud SQL, Secrets, Pub/Sub).</li>
<li>Diagnose production issues from logs, traces, and dashboards, and lead or support incident response.</li>
<li>Reduce config drift across environments and standardize secrets, naming, and resource tagging.</li>
<li>Collaborate with developers to harden delivery pipelines, standardize rollout readiness, and clean up infra smells in code.</li>
</ul>
<p><strong>Key skills:</strong></p>
<ul>
<li>Have 4–6+ years of experience in backend or infra-focused engineering roles (e.g., SRE, platform, DevOps, or full-stack).</li>
<li>Can confidently write or review production-grade code and infra-as-code (Terraform, Helm, GitHub Actions, etc.).</li>
<li>Have deep hands-on experience with Kubernetes in production, ideally on GKE, including workload autoscaling and ingress strategies.</li>
<li>Understand cloud concepts like IAM, VPCs, secret storage, workload identity, and Cloud SQL performance characteristics.</li>
<li>Think in systems: you understand cascading failure, timeout boundaries, dependency health, and blast radius.</li>
<li>Regularly contribute to incident mitigation or long-term fixes (not just closing alerts).</li>
<li>Can influence through well-written PRs, documentation, and thoughtful design reviews.</li>
</ul>
<p><strong>Good to have:</strong></p>
<ul>
<li>Exposure to GitOps tooling such as ArgoCD or FluxCD.</li>
<li>Experience developing or integrating Kubernetes operators.</li>
<li>Familiarity with service-level indicators (SLIs), service-level objectives (SLOs), and structured alerting.</li>
</ul>
<p><strong>Tools and Expectations:</strong></p>
<ul>
<li>Datadog - Monitor infrastructure health, capture service-level metrics, and reduce alert fatigue through high-signal thresholds.</li>
<li>PagerDuty - Own the incident management pipeline. Route alerts by severity and align routing with business SLAs.</li>
<li>GKE / Kubernetes - Improve cluster stability and workload isolation. Define auto-scaling configurations and tune for efficiency.</li>
<li>Helm / GitOps (ArgoCD/Flux) - Validate release consistency across clusters. Monitor sync status and rollout safety.</li>
<li>Terraform Cloud - Support DR planning and detect infrastructure changes through state comparisons.</li>
<li>Cloud SQL / Cloudflare - Diagnose database and networking issues. Monitor latency, enforce access patterns, and validate WAF usage.</li>
<li>Secret Management - Audit access to secrets, apply short-lived credentials, and define alerts for abnormal usage.</li>
</ul>
About Orion Innovation
Orion Innovation delivers next-generation solutions in Data, AI, Cloud, and Digital Experience, empowering organizations to innovate, scale, and embrace future technologies.
With deep software engineering expertise and a strong understanding of industry-specific challenges, we build data-driven products and solutions that enhance customer experiences, accelerate growth, and drive long-term value.
Envision what's next. Build what matters.
For more information, visit www.orioninc.com.