About this role
<p><strong><span data-contrast="none"><span data-ccp-parastyle="No Spacing">ROLE</span></span></strong></p>
<p><span data-contrast="none"><span data-ccp-parastyle="No Spacing">Firmus Technologies is seeking a Senior Platform Engineer to join our Engineering and Technology team. You will drive the design and implementation of our MLOps capability. You will also collaborate with other engineers and make technical decisions on scaling Firmus AI factory platform engineering capabilities to planet scale, spanning IaC, container orchestration, observability, the self-service portal, and platform security. This role is ideal for a self-starter with a passion for building things from first principles. You naturally break down complex problems into their fundamental truths to uncover novel and elegant solutions rather than relying on conventional patterns.</span></span> <br> </p>
<p><strong><span data-contrast="none"><span data-ccp-parastyle="No Spacing">KEY RESPONSIBILITIES</span></span><span data-ccp-props='{"335551550":6,"335551620":6}'> </span></strong></p>
<ul>
<li><span data-contrast="auto">Build MLOps capabilities from the ground up, enabling reproducible, scalable, and secure ML workflows across internal and customer-facing environments. </span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Continuously improve our DevOps platform to ensure reliability, scalability, security, and seamless integration with CI/CD pipelines and infrastructure services.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Design, implement, operate and secure Kubernetes-based production infrastructure for high reliability, performance and security, including clusters supporting NVIDIA GB300 NVL72 systems with NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet. </span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Develop world-class observability platforms for internal and external customers.</span></li>
<li><span data-contrast="auto">Integrate Firmus central services with NVIDIA’s software stack, including Mission Control, NetQ, UFM, and NMX.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Lead the enhancement and evangelism of internal platform products that provide cohesive, composable, secure-by-default, and low-friction self-service experiences that accelerate time to market and reduce engineers' cognitive load.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Drive incident response efforts, participate actively in the on-call rotation, and lead detailed Root Cause Analysis (RCA) to continuously improve system reliability, operational maturity, and incident handling processes.</span><span data-ccp-props="{}"> </span></li>
</ul>
<p> </p>
<p><span data-contrast="none"><span data-ccp-parastyle="No Spacing"><strong>SKILLS AND EXPERIENCE</strong> </span></span><span data-ccp-props='{"335551550":6,"335551620":6}'> </span></p>
<ul>
<li><span data-contrast="auto">Bachelor's degree in computer science or a related technical field.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">7+ years of experience as a Platform Engineer, Site Reliability Engineer, DevOps Engineer, MLOps Engineer, or Observability Engineer.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Demonstrated strong proficiency in the following areas: </span><span data-ccp-props="{}"> </span>
<ul>
<li><span data-contrast="auto">Infrastructure-as-Code, configuration management and CI/CD (e.g., Terraform, Ansible, GitHub Actions, Jenkins, ArgoCD).</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Containerization technologies (e.g., Docker), Kubernetes networking and cluster management, including upgrades and troubleshooting.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Observability stack design and scaling (e.g., Loki, Grafana, Tempo, Prometheus, Thanos, ClickHouse).</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Telemetry solutions using various technologies (e.g., Redfish, gNMI, SNMP, eBPF, streaming analytics).</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Unified telemetry collection with OpenTelemetry.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Compliance automation (e.g., OPA, Kyverno).</span><span data-ccp-props="{}"> </span></li>
</ul>
</li>
<li><span data-contrast="auto">Competence in scripting and programming (e.g., Bash, Python, Go).</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Systems knowledge of Linux internals, networking stacks, and distributed storage.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Clear and effective English communication, written and spoken.</span><span data-ccp-props="{}"> </span></li>
<li><span data-contrast="auto">Bonus: Experience in high-growth startups or regulated industries with robust security and data privacy requirements, including SOC 2 Type 2 and ISO 27001.</span><span data-ccp-props="{}"> </span></li>
</ul>