Machine Learning Engineer — Foundation Models & Systems
Forecareer
Full-time
San Francisco, CA
Job description
Overview
We are hiring a Machine Learning Engineer / Research Engineer for a well-funded AI infrastructure company building next-generation foundation models and high-performance model serving systems.
This role sits at the intersection of research and production, spanning large-scale pretraining, post-training (including RL), evaluation environments, and deployment/inference optimization. The team works hands-on across the full model lifecycle and operates at large scale.
About the Role
As a research-oriented ML engineer, you’ll work across the full foundation-model stack:
- Large-scale pretraining and scaling
- Post-training and reinforcement learning
- Sandbox environments for evaluation and agent learning
- Deployment and inference optimization for production systems
You’ll move quickly from ideas to working systems, contribute production-grade infrastructure, and help deliver models that power real-world applications at scale.
What You’ll Work On
This role spans multiple tracks. Candidates may focus on one area or contribute across several.
Pretraining & Scaling
- Train large foundation models across massive, heterogeneous datasets
- Design stable training recipes and scaling strategies for new architectures
- Improve throughput, memory efficiency, and utilization on large GPU clusters
- Build and maintain distributed, fault-tolerant training pipelines
Post-Training & Reinforcement Learning
- Develop post-training pipelines (SFT, preference optimization, RLHF / RLAIF, RL)
- Curate and generate targeted datasets to improve specific model capabilities
- Build reward models and evaluation loops for iterative improvement
- Explore inference-time learning and compute-aware techniques
Sandbox Environments & Evaluation
- Build scalable sandbox environments for agent learning and evaluation
- Design automated evaluations for reasoning, tool use, and safety
- Create offline and online environments that support RL-style training
- Instrument systems for observability, reproducibility, and fast iteration
Deployment & Inference Optimization
- Optimize inference latency and throughput for large models
- Build high-performance serving pipelines (batching, KV caching, quantization)
- Improve end-to-end efficiency, cost, and reliability in production
- Profile and optimize runtime bottlenecks, GPU kernels, and memory behavior
Ideal Candidate Profile
Technical Strength
- Strong software engineering fundamentals (robust, performant systems)
- Experience training or serving large neural networks (LLMs or similar)
- Solid understanding of modern deep learning methods and literature
- Comfort working in high-performance, GPU-based, distributed environments
Relevant Experience (one or more)
- Large-scale distributed training (FSDP, ZeRO, Megatron-style systems)
- Post-training pipelines (SFT, RLHF / RLAIF, eval loops)
- Building RL environments, simulators, or agent frameworks
- Inference optimization, model compression, quantization, profiling
- Large-scale data pipelines for internet-scale ingestion and cleaning
- Owning production ML systems end-to-end (monitoring, reliability)
Research Orientation
- Ability to propose, test, and iterate on research ideas quickly
- Strong experimental discipline: metrics, ablations, reproducibility
- Builder mindset — turning ideas into working code and measurable results
Education
- MS or PhD in Computer Science, Machine Learning, AI, Mathematics, or related field
Benefits
- Competitive salary and meaningful equity
- Medical, dental, and vision coverage
- 401(k)
- Flexible time off
- Daily meals and snacks (on-site)
Equal Opportunity
This employer is committed to building a diverse and inclusive team and is an equal opportunity employer.
Job Type: Full-time
Pay: $200,000.00 - $275,000.00 per year
Benefits:
- 401(k)
- Dental insurance
- Flexible schedule
- Health insurance
- Paid time off
- Vision insurance
Work Location: In person