Full-time
Redmond, WA
Job description
We are excited to announce an opening for a Cloud Solution Architect at NVIDIA and are seeking a passionate individual with a strong interest in large-scale GPU infrastructure and AI Factory deployments! If you are enthusiastic about contributing to projects that push the boundaries of cloud-based AI and resilience in large-scale environments, we invite you to read on. NVIDIA is renowned as one of the most sought-after employers in the technology world, offering highly competitive benefits. We are home to some of the most innovative and forward-thinking individuals globally. If you are creative, autonomous, and eager to apply your skills and knowledge in a dynamic environment, we want to hear from you!
What you'll be doing:
Working as a key member of our cloud solutions team, you will be the go-to technical expert on NVIDIA AI Factory solutions and large-scale GPU infrastructure, helping clients architect and deploy resilient, telemetry-driven AI compute environments at unprecedented scale.
Collaborating directly with engineering teams to secure design wins, address challenges, and deploy solutions into production, with a focus on developing robust tooling for observability, failure recovery, and infrastructure-level performance optimization.
Acting as a trusted advisor to our clients, understanding their cloud environment, translating requirements into technical solutions, and providing guidance on optimizing NVIDIA AI Factories for scalable, reliable, and high-performance workloads.
What we need to see:
2+ years of experience in large-scale cloud infrastructure engineering, distributed AI/ML systems, or GPU cluster deployment and management.
A BS in Computer Science, Electrical Engineering, Mathematics, or Physics, or equivalent experience.
Proven understanding of large-scale computing systems architecture, including multi-node GPU clusters, high-performance networking, and distributed storage.
Experience with infrastructure-as-code, automation, and configuration management for large-scale deployments.
A passion for machine learning and AI, and the drive to continually learn and apply new technologies.
Excellent communication and interpersonal skills, including the ability to explain complex technical topics to non-expert audiences.
Ways to stand out from the crowd:
Expertise with orchestration and workload management tools like Slurm, Kubernetes, Run:ai, or similar platforms for GPU resource scheduling.
Knowledge of AI training and inference performance optimization at scale, including distributed training frameworks and multi-node communication patterns.
Hands-on experience designing telemetry systems and failure recovery mechanisms for large-scale cloud infrastructure, including familiarity with observability tools such as Grafana, Prometheus, and OpenTelemetry.
Proficiency in deploying and managing cloud-native solutions using platforms such as AWS, Azure, or Google Cloud, with a focus on GPU-accelerated workloads.
Deep expertise with high-performance networking technologies, particularly NVIDIA InfiniBand, NCCL, and GPUDirect RDMA for large-scale AI workloads.
You will also be eligible for equity and benefits.