Machine Learning Engineer — Foundation Models & Systems

Forecareer

Full-time

San Francisco, CA

Job description

Overview

We are hiring a Machine Learning Engineer / Research Engineer for a well-funded AI infrastructure company building next-generation foundation models and high-performance model serving systems.

This role sits at the intersection of research and production, spanning large-scale pretraining, post-training (RL), evaluation environments, and deployment/inference optimization. The team works hands-on across the full model lifecycle and operates at serious scale.

About the Role

As a Research-Oriented ML Engineer, you’ll work across the full foundation-model stack:

  • Large-scale pretraining and scaling
  • Post-training and reinforcement learning
  • Sandbox environments for evaluation and agent learning
  • Deployment and inference optimization for production systems

You’ll move quickly from ideas to working systems, contribute production-grade infrastructure, and help deliver models that power real-world applications at scale.

What You’ll Work On

This role spans multiple tracks. Candidates may focus on one area or contribute across several.

Pretraining & Scaling

  • Train large foundation models on massive, heterogeneous datasets
  • Design stable training recipes and scaling strategies for new architectures
  • Improve throughput, memory efficiency, and utilization on large GPU clusters
  • Build and maintain distributed, fault-tolerant training pipelines

Post-Training & Reinforcement Learning

  • Develop post-training pipelines (SFT, preference optimization, RLHF / RLAIF, RL)
  • Curate and generate targeted datasets to improve specific model capabilities
  • Build reward models and evaluation loops for iterative improvement
  • Explore inference-time learning and compute-aware techniques

Sandbox Environments & Evaluation

  • Build scalable sandbox environments for agent learning and evaluation
  • Design automated evaluations for reasoning, tool use, and safety
  • Create offline and online environments that support RL-style training
  • Instrument systems for observability, reproducibility, and fast iteration

Deployment & Inference Optimization

  • Optimize inference latency and throughput for large models
  • Build high-performance serving pipelines (batching, KV caching, quantization)
  • Improve end-to-end efficiency, cost, and reliability in production
  • Profile and optimize runtime bottlenecks, GPU kernels, and memory behavior

Ideal Candidate Profile

Technical Strength

  • Strong software engineering fundamentals (robust, performant systems)
  • Experience training or serving large neural networks (LLMs or similar)
  • Solid understanding of modern deep learning methods and literature
  • Comfort working in high-performance, GPU-based, distributed environments

Relevant Experience (one or more)

  • Large-scale distributed training (FSDP, ZeRO, Megatron-style systems)
  • Post-training pipelines (SFT, RLHF / RLAIF, eval loops)
  • Building RL environments, simulators, or agent frameworks
  • Inference optimization, model compression, quantization, profiling
  • Large-scale data pipelines for internet-scale ingestion and cleaning
  • Owning production ML systems end-to-end (monitoring, reliability)

Research Orientation

  • Ability to propose, test, and iterate on research ideas quickly
  • Strong experimental discipline: metrics, ablations, reproducibility
  • Builder mindset: turning ideas into working code and measurable results

Education

  • MS or PhD in Computer Science, Machine Learning, AI, Mathematics, or a related field

Benefits

  • Competitive salary and meaningful equity
  • Medical, dental, and vision coverage
  • 401(k)
  • Flexible time off
  • Daily meals and snacks (on-site)

Equal Opportunity

This employer is committed to building a diverse and inclusive team and is an equal opportunity employer.

Job Type: Full-time

Pay: $200,000.00 - $275,000.00 per year

Benefits:

  • 401(k)
  • Dental insurance
  • Flexible schedule
  • Health insurance
  • Paid time off
  • Vision insurance

Work Location: In person