Infrastructure Engineer
Cambridge, MA · San Francisco, CA · Full-time
About the role
Build the distributed systems that run thousands of concurrent ML experiments: job orchestration, GPU scheduling, artifact storage, and observability. This is the substrate everything else at Autolab runs on, and you will own it end to end.
What you'll do
- Design and operate the orchestration layer that schedules thousands of concurrent training experiments across heterogeneous GPU clusters.
- Build GPU scheduling and placement that keeps utilization high across reserved, on-demand, and preemptible capacity.
- Own artifact storage for checkpoints, logs, metrics, and datasets, versioned and queryable across every experiment we run.
- Build observability that catches failing or stalled runs early, for both the platform and the experiments running on it.
- Make agent-launched workloads safe to run unattended: quotas, isolation, and cleanup that hold up without a human watching.
- Keep the platform reliable with a small team: pragmatic on-call, fast incident response, simple designs that fail loudly.
What we're looking for
- Experience building and operating distributed systems or ML training infrastructure in production.
- Hands-on experience with cluster schedulers (Kubernetes, Slurm, Ray, or similar) and the judgment to build custom tooling only where off-the-shelf options fall short.
- Strong Python plus a systems language (Go, Rust, or C++).
- Working knowledge of GPU training workloads: interconnects, storage throughput, and the failure modes of long-running jobs.
- A bias toward simple, debuggable systems over clever ones.
- Comfort owning large surface area at an early-stage company.
Harness & Agents Engineer
Cambridge, MA · San Francisco, CA · Full-time
About the role
Design the agent harness: tool interfaces, sandboxing, evaluation loops, and the control logic that lets agents plan and execute research autonomously. The harness is the product; its quality determines how good our agents can be.
What you'll do
- Design the tool interfaces agents use to write code, launch experiments, query results, and read logs and metrics.
- Build sandboxing and isolation so agents can safely execute arbitrary experiment code on real infrastructure.
- Build evaluation loops that measure whether agents make good research decisions, not just whether their code runs.
- Write the control logic for long-horizon work: planning, budgets, retries, and escalation to a human when the agent is stuck.
- Design how agents accumulate and retrieve context from past experiment trajectories.
- Work with the post-training team to turn harness traces into training data for our own models.
What we're looking for
- You have built LLM agent systems or harnesses that ran in production, beyond demos and prototypes.
- Strong software engineering fundamentals and good taste in API and interface design.
- Practical experience with sandboxing and isolation (containers, microVMs, or similar).
- You understand how evals drive agent quality and have designed evals yourself.
- Enough ML experimentation background to know what a training run actually involves.
- You ship, measure, and iterate rather than architecting in the abstract.
Post-training ML Engineer
Cambridge, MA · San Francisco, CA · Full-time
About the role
Own post-training for our agent models: RL and SFT pipelines, reward design, evals, and data engines built from experiment trajectories. You will turn the data our harness produces into models that make better research decisions.
What you'll do
- Build and own the RL and SFT pipelines that post-train our agent models.
- Design rewards for research tasks where the signal is noisy and delayed, and catch reward hacking before it ships.
- Build evals that track agent capability on real research workflows and gate what we deploy.
- Build data engines that turn experiment trajectories, including failed runs, into training data.
- Run ablations to decide where the next unit of compute goes, and manage the post-training compute budget.
- Track the post-training literature and decide what is worth adopting versus ignoring.
What we're looking for
- Hands-on post-training experience (RL from human or verifiable feedback, SFT, or preference optimization) on real models.
- Strong PyTorch and experience with distributed training.
- You have designed rewards or evals for agentic tasks and seen what goes wrong.
- Strong data instincts: filtering, deduplication, and curriculum matter as much as the algorithm.
- You can read a paper and have a working reproduction within days.
- Research taste combined with the engineering rigor to keep pipelines reliable.
Forward Deployed Engineer
Cambridge, MA · San Francisco, CA · Remote (US) · Full-time
About the role
Embed with design partners across self-driving, robotics, and computer vision, integrate Autolab into their training stacks, and feed what you learn back into the product. You are the shortest path between a partner's problem and a fix.
What you'll do
- Embed with design partner teams and integrate Autolab into their schedulers, data pipelines, and training code.
- Debug across the boundary: their pipelines on one side, our agents and infrastructure on the other.
- Own the technical success of your accounts from first pilot to production use.
- Turn partner pain points into concrete product specs and push them through with the engineering team.
- Build the integration patterns and tooling that make the next deployment faster than the last.
- Travel to partner sites when being in the room is the fastest way to unblock them.
What we're looking for
- A strong generalist engineer who is productive in an unfamiliar codebase within days.
- Experience with production ML training pipelines; perception or CV background is a plus.
- You communicate clearly with customers, including when the news is bad.
- You work independently while embedded with a team that does not report to you.
- A bias toward fixing the problem now and writing it up after.
- Comfortable with travel and with remote collaboration across time zones.
Chief of Staff
Cambridge, MA · San Francisco, CA · Full-time
About the role
Work directly with the founders on operations, hiring, partner relationships, and everything that keeps a fast-moving research company running. The job is to make the company faster than it would be without you.
What you'll do
- Run hiring end to end: sourcing, scheduling, candidate experience, and closing.
- Own day-to-day operations: finance, legal, and vendor management, including coordination with outside counsel and accountants.
- Manage partner and investor relationships, and make sure every commitment gets followed through.
- Build lightweight internal processes that keep the team fast instead of slowing it down.
- Prepare materials for fundraising, board updates, and partner conversations.
- Notice what gets dropped before anyone else does, and pick it up.
What we're looking for
- Experience in an operations, chief of staff, or founder-adjacent role at a fast-moving startup.
- Excellent writing: most of this job is turning ambiguity into a clear document.
- High ownership and comfort with broad, loosely defined scope.
- Enough technical fluency to follow ML research conversations; you do not need to write code.
- Discretion with sensitive company and personnel information.
- You finish things without being asked twice.