<p><strong>About This Role</strong></p>
<p>At Robots & Pencils, we build meaningful, scalable digital products by blending strategy, design, and engineering. We are seeking a Level 4 AI Engineer to build production LLM applications for an enterprise client as part of a long-term, delivery-focused engagement.</p>
<p>You will own the AI stack end-to-end, including RAG pipelines, prompt engineering, and evaluation frameworks. This is a hands-on role: you will write production code, tune prompts, build evaluation and observability systems, and iterate based on real user feedback.</p>
<p>There is a working proof of concept in place. Your responsibility is to make it production-ready and extend it with intelligent, reliable features that operate at enterprise scale.</p>
<p> </p>
<p><strong>What You’ll Do</strong></p>
<p>AI & LLM Application Delivery</p>
<p>· Build, optimize, and evolve RAG pipelines, including retrieval strategies, chunking, and re-ranking</p>
<p>· Develop prompts and guardrails for domain-specific LLM applications</p>
<p>· Implement hallucination detection, mitigation, and fact-checking mechanisms</p>
<p>· Build embeddings-based search and recommendation features</p>
<p>· Validate AI features with real users and iterate based on qualitative and quantitative feedback</p>
<p>Evaluation, Monitoring & Reliability</p>
<p>· Set up and maintain LLM evaluation frameworks to measure quality, relevance, and reliability</p>
<p>· Implement observability and monitoring for production AI systems</p>
<p>· Monitor live AI systems and resolve quality, accuracy, and performance issues</p>
<p>· Continuously improve AI outputs based on evaluation data and user behavior</p>
<p>Platform & System Integration</p>
<p>· Work closely with product and engineering teams to integrate AI into user-facing features</p>
<p>· Build and maintain backend services in Python</p>
<p>· Integrate with vector databases to support retrieval and semantic search workflows</p>
<p>· Ensure AI solutions meet enterprise requirements for security, scalability, and maintainability</p>
<p>Delivery & Collaboration</p>
<p>· Collaborate with cross-functional partners across product, engineering, and design</p>
<p>· Operate effectively in environments with evolving requirements and ambiguity</p>
<p>· Communicate clearly with technical and non-technical stakeholders</p>
<p>· Take ownership of delivery outcomes from experimentation through production</p>
<p> </p>
<p><strong>Required Skills & Experience</strong></p>
<p>· 8+ years of professional software engineering experience, with 4+ years focused on applied AI/ML or data-driven systems in production environments</p>
<p>· 3+ years building and operating production AI systems</p>
<p>· Strong hands-on experience with LLM applications, including RAG, prompt engineering, and evaluation</p>
<p>· Experience implementing hallucination detection and mitigation techniques</p>
<p>· Proficiency in Python</p>
<p>· Experience working with vector databases (Weaviate, Pinecone, or similar)</p>
<p>· Experience with LLM evaluation frameworks (Langfuse, Weights & Biases, or custom solutions)</p>
<p>· Production experience using Claude and/or GPT APIs</p>
<p>· Strong understanding of embeddings and semantic search</p>
<p>· Comfortable working with ambiguity and iterating on unclear problems</p>
<p>· Bachelor’s degree in Computer Science, Engineering, Data Science, or a related technical field, or equivalent practical experience</p>
<p>· Advanced degree (Master’s or PhD) in a relevant field is a plus</p>
<p> </p>
<p><strong>Nice to Have</strong></p>
<p>· Experience with Azure AI services, including Azure OpenAI and Cognitive Services</p>
<p>· Experience with document processing (PDF extraction, OCR)</p>
<p>· Exposure to audio or speech processing (e.g., Whisper or similar tools)</p>
<p>· Experience building enterprise B2B software</p>
<p>· Experience with ML classification and model training</p>
<p> </p>
<p><strong>Tech Stack</strong></p>
<p>· LLMs: Claude (Anthropic), Azure OpenAI</p>
<p>· Vector Database: Weaviate</p>
<p>· Backend: Python</p>
<p>· Infrastructure: Azure</p>
<p>· Evaluation & Observability: Langfuse or similar</p>
<p> </p>
<p><strong>How You Work</strong></p>
<p>· You are hands-on and delivery-focused, writing code and owning outcomes</p>
<p>· You balance speed with quality in production environments</p>
<p>· You communicate clearly and collaborate effectively across disciplines</p>
<p>· You take ownership of ambiguous problems and drive them to resolution</p>
<p>· You prioritize reliability, maintainability, and real-world impact</p>
<p> </p>
<p><strong>Why Robots & Pencils</strong></p>
<p>· Real production impact, not a POC that sits on a shelf</p>
<p>· Exposure to the full AI lifecycle: RAG, LLM applications, evaluation, classification, and monitoring</p>
<p>· End-to-end ownership of the AI stack and technical decision-making</p>
<p>· A small, senior team with direct access to enterprise clients</p>