Plato

Infrastructure / DevOps

San Francisco · on-site

Introduction

Plato is an applied research lab building the foundational infrastructure to train specialized AI agents.

We turn real-world data streams into high-fidelity simulated environments that generate the training signal needed to make capable models.

Today, only a handful of players can train highly capable models. Compute and algorithms are rapidly commoditizing, but reinforcement learning data remains the bottleneck. Plato is changing that by automatically scaling training environments from proprietary real-world data. Our work supports frontier labs, hyperscalers, and enterprises building AI systems for complex, high-stakes work.

Why This Role Matters

Infrastructure is central to Plato's product and research loop.

Generic cloud systems are not designed for long-running RL environments, persistent agent workspaces, replayable rollouts, storage-efficient forks, or recursive debugging loops. To train useful agents, we need infrastructure that makes environment construction, experimentation, evaluation, and iteration feel like one seamless system.

As a Member of Technical Staff, Infrastructure / DevOps, you will own the systems that make Plato's research and training loops reliable at scale.

Role Description

You will build and operate the infrastructure behind long-horizon agent experiments, including environment VMs, storage-efficient snapshots and forks, orchestration for parallel agent fleets, shared workspaces, verifier workers, telemetry pipelines, deployment systems, and the operational tooling that lets researchers run thousands of experiments without thinking about the machinery underneath.

This is not conventional cloud plumbing. You will be building infrastructure that directly shapes the quality, speed, and reliability of Plato's research.

You Will Work On

Build and operate purpose-built infrastructure for RL rollouts, long-running agent tasks, and environment synthesis jobs.
Scale environment VMs, snapshots, checkpointing, persistent sandboxes, and storage-efficient forks.
Design orchestration systems for fleets of agents that crawl, synthesize, evaluate, debug, and rerun experiments.
Build telemetry, logging, tracing, replay, and observability systems for thousands of concurrent agent sessions.
Improve reliability, cold starts, uptime, cost efficiency, isolation, and developer experience across the infrastructure stack.
Partner with research engineers to turn experimental workflows into repeatable, production-grade systems.

What We're Looking For

We're looking for someone who is excited to work close to the metal of AI infrastructure and enjoys turning ambiguous research workflows into reliable systems.

You may be a strong fit if you:

Have experience building or operating distributed systems, cloud infrastructure, orchestration platforms, or developer tooling.
Are comfortable debugging across infrastructure, application, and research workflows.
Care deeply about reliability, observability, isolation, and cost efficiency.
Enjoy working with researchers and engineers to turn messy, fast-moving workflows into durable systems.
Want to build infrastructure that is part of the core product, not just internal support tooling.

How We Work

Being an engineer at an early-stage AI startup is not easy. These are the values we care about.

Ownership

We value teammates who bring novel ideas to the table, experiment, and see results through end to end. You'll have access to massive compute budgets to run large-scale experiments.

Move Fast, Build Durable

Demand is growing faster than our team. We move quickly, prioritize ruthlessly, and ship systems that keep working under load.

Reality Over Narratives

Training data is incredibly fragile and prone to reward hacking. We prioritize digging deep into the data, manually if we have to, to build the intuition needed to maintain high-quality throughput.

Stay Close to the Frontier

New AI capabilities rapidly change how we think about problems and which doors open. We stay close to the frontier of model capability and encourage teammates to constantly share new findings and update their world model of what's possible.