About this role
About AfterQuery

AfterQuery builds the training data and evaluation infrastructure that frontier AI labs use to improve their models. The team works with the world's leading labs to design high-signal datasets and run rigorous evaluations that go beyond static benchmarks. AfterQuery is post Series A, with $30M raised at a $300M valuation, backed by Y Combinator and BoxGroup, with investors from Lightspeed, Warburg Pincus, Silver Lake, and Index Ventures, and a founding team from Jane Street, Meta, Citadel Securities, Google, Goldman Sachs, and Stanford AI Lab. It is a small, early team where individual contributors have a direct impact on how the next generation of models learns and improves.

About the Role

AfterQuery is hiring Research Scientists to design the datasets and evaluation frameworks that shape how frontier models are trained and measured. You will work directly with research teams at top AI labs, experiment with data collection strategies, diagnose model failure modes, and develop the metrics that determine whether a model is actually getting better.

This is hands-on, high-leverage work. You go from hypothesis to live experiment quickly, and your output directly influences model training runs at scale. The ideal candidate is an undergraduate or master's researcher who has not yet done a PhD, is obsessed with how data structure and quality drive model behavior, and favors building over theorizing.

Compensation is performance-driven. If you design the next HLE-level benchmark, you can make over $1.5M in your first year. Average performance on this work generates around $400K in cash per year.

Key Responsibilities

- Design data shapes that expose meaningful model failure modes across domains like finance, code, and enterprise workflows
- Build and refine evaluation rubrics and reward signals for RLHF and RLVR training pipelines
- Model annotator behavior and run experiments to improve different model capabilities
- Develop quantitative frameworks for measuring dataset quality, diversity, and downstream impact on model alignment and capability
- Partner with lab research teams to translate their training objectives into concrete data and evaluation specifications

Requirements

Must-Have

- Undergrad or master's research background; strong preference for candidates who have not yet done a PhD
- Prior experience with evals or benchmarking, whether at a competing data company, through academic research on model evaluation, or via an internship at an RL environment company
- Genuine obsession with how data structure, selection, and quality drive model behavior
- Ability to design lightweight experiments, move fast, and extract actionable insights from messy results
- Strong quantitative instincts and familiarity with LLM training pipelines (RLHF, RLVR) or evaluation methodology
- Comfort working across domains, including finance, software engineering, and policy
- SWE experience is a significant plus; technical depth matters here
- Bias toward building over theorizing

Nice-to-Have

- Prior internship or work experience at an RL environment company, AI safety org, or benchmarking organization (METR, Artificial Analysis, Hellclimb, Idler AI, or equivalent)
- Academic research on model evaluation or benchmarking
- Experience at AI-for-data or spreadsheet automation companies

Compensation

$110,000 to $150,000 base + up to 100% bonus + $100,000 to $150,000 equity over four years