Plato

Research Engineer

San Francisco · on-site • $200K – $600K • Offers Equity

Introduction

Plato is an applied research lab building the foundational infrastructure to train specialized AI agents.

We turn real-world data streams into high-fidelity simulated environments that generate the training signal needed to make capable models. Our work supports frontier labs, hyperscalers, and enterprises building AI systems for complex, high-stakes work.

Today, only a handful of players can train models for capable work. Compute and algorithms are rapidly commoditizing, but reinforcement learning data remains the bottleneck. Plato is changing that by automatically scaling training environments from proprietary real-world data.

Why This Role Matters

Research engineering is central to Plato's product and research loop.

The hard part of training specialized agents is not producing tasks that look plausible. It is finding tasks that are grounded in real workflows, difficult for current models, resistant to reward hacking, and useful as training signal. To do that, we need research engineers who can turn messy traces, model failures, and researcher hypotheses into environments, verifiers, rewards, evaluations, and curricula that improve continuously.

As a Member of Technical Staff, Research Engineer, you will own the loop that discovers high-signal training targets for frontier models.

Role Description

You will design experiments, build task generation systems, run evaluations, inspect model failures, and develop methods for mining tasks that are just out of reach of today's agents.

The work is empirical, systems-heavy, and close to the frontier. You will consume real-world trajectories or researcher hypotheses, materialize realistic data, propose candidate tasks, benchmark those tasks against frontier computer-use and agent models, and hill-climb until you find the failures that produce useful learning signal.

This is not a role for someone who only wants to run experiments or only wants to write research code. You will own the full loop: hypothesis, implementation, evaluation, analysis, iteration, and productionalization.

You Will Work On

Discover model failure modes from real-world traces, agent telemetry, targeted researcher hypotheses, and customer workflows.
Generate realistic curricula grounded in actual workflows rather than toy synthetic benchmarks.
Benchmark candidate tasks against frontier computer-use agent (CUA) and agent models using pass rates, rollouts, and behavioral traces as difficulty signals.
Build hill-climbing loops that mutate, filter, and rescore tasks until they surface high-signal targets.
Study reward hackability, distribution mismatch, task realism, long-horizon failures, and transfer from simulation to deployed agents.
Turn research prototypes into reliable internal systems for continuous curriculum generation.

What We're Looking For

We're looking for someone who is excited to work at the intersection of empirical AI research, systems engineering, and model evaluation.

You may be a strong fit if you:

Have strong implementation ability and can turn ambiguous research ideas into working systems.
Have experience with RL, LLM agents, computer-use agents, evals, post-training, synthetic data, simulation, or model behavior analysis.
Care deeply about whether a task is grounded, difficult, reward-hack-resistant, and capable of producing actual learning signal.
Are comfortable interpreting ambiguous model behavior and negative results.
Enjoy building continuous research loops rather than static benchmark artifacts.

How We Work

Being an engineer at an early-stage AI startup is not easy. These are the values we care about.

Ownership

We value teammates who bring novel ideas to the table, experiment, and see results through end to end. You'll have access to massive compute budgets to run large-scale experiments.

Move Fast, Build Durable

Demand is growing faster than our team. We move quickly, prioritize ruthlessly, and ship systems that keep working under load.

Reality Over Narratives

Training data is fragile and prone to reward hacking. We dig deep into the data, manually if we have to, to build the intuition needed to sustain high-quality throughput.

Stay Close to the Frontier

New AI capabilities rapidly change how we think about problems and which doors open. We stay close to the frontier of model capability and encourage teammates to constantly share new findings and update their world models of what's possible.