Cedana (YC S23) Is Hiring
Recorded: May 29, 2026, 12:02 p.m.
| Original | Summarized |
Forward Deployed Engineer: AI + HPC at Cedana (Y Combinator). Cedana: Fast, reliable, reproducible AI with GPU live migration. Compensation: $140K - $180K, 0.10% - 0.25% equity. Location: US / Remote (US). Job type: Full-time. Role: Engineering, Backend. Experience: 3+ years. Visa: US citizen/visa only. Founder: Neel Master. About the role: Introducing Cedana. Engineer solutions at client sites: lead customer integrations; install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments. What we are looking for: 3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments. Bonus if you have: experience at an HPC integrator field team. Logistics: Remote, US-based. ~25% travel for customer installs. Benefits: 100% covered medical, dental, and vision insurance for employees and families. Equal Opportunity Employer. About Cedana: Cedana is pause/migrate/resume for compute workloads. We're working on building a global, real-time system for compute. This means a paradigm shift in how we allocate resources to things like high performance computing, numerical simulation, and training and running machine learning models. We do so by taking a systems-level and deep-tech approach to these problems, working at the Linux Kernel layer and with hardware. |
Cedana targets the scarcity and high cost of AI and High Performance Computing (HPC) infrastructure, which lead to costly failures and limit research output. The core problem is that high utilization and throughput are hard to achieve given the complexity of managing workloads, hardware, and operations. Cedana's solution is to maximize AI+HPC cluster utilization and reliability through automated GPU checkpointing infrastructure that migrates GPU workloads across compute instances transparently, quickly, and without data loss. The system operates at the kernel and operating system level, so workloads can migrate automatically to improve reliability and throughput and shorten time to results, and it integrates with existing systems such as Kubernetes, SLURM, and NVIDIA Dynamo.
The founding team has extensive experience in making computation fast, productive, and reliable for AI, with prior research published in venues such as NeurIPS and CVPR and experience deploying complex systems at Shopify, including warehouse automation and fleet control planes. The team has also built and exited a healthcare AI company, and it grounds its approach in deep-tech and systems-level thinking.
The Forward Deployed Engineer leads and owns the technical engagement with customers, acting as the bridge between Cedana's infrastructure and real-world deployments. The engineer works with clients (ranging from university research environments running production SLURM to enterprise clusters and neoclouds) to diagnose pain points and deploy Cedana across the entire stack. Ownership is end to end, from low-level system components such as SLURM plugins and node configuration up through Kubernetes operators, networking, and observability. 
Key responsibilities include engineering solutions at client sites by installing and configuring Cedana in SLURM, Kubernetes, and Dynamo environments. The role also drives product innovation: the engineer identifies technical gaps during customer engagements and relays them to the development team to shape new product features. Performance optimization is critical, and involves measuring reliability, throughput, and performance with internal tools, then designing and implementing policy-based migration automations to improve those metrics. The engineer must keep the platform running reliably for critical client operations by debugging issues across the entire stack, including complex installation problems on unfamiliar customer infrastructure. Scalability is also emphasized: the engineer builds internal installation playbooks so that subsequent customers can be onboarded rapidly.
To succeed in this role, candidates need a solid foundation in systems engineering. This includes three to ten years of software engineering experience with demonstrated expertise in configuring and managing SLURM deployments. A critical requirement is production experience setting up SLURM environments, including configuring slurmctld, slurmdbd, accounting, and cgroup integration. Strong Linux fundamentals are essential: systemd, cgroups v2, namespaces, networking, filesystems, and kernel module loading, along with the ability to read strace and dmesg output and form hypotheses from it. Proficiency in Kubernetes operations is also mandatory, including operators, Custom Resource Definitions, device plugins, and node-level debugging for controllers. 
Bonus experience is highly valued, including a background on an HPC integrator field team, in client-facing technical work, or in national lab user services or university research computing. Familiarity with related technologies such as CRIU, container runtimes, GPU driver internals, distributed training stacks, or AI orchestration platforms such as NVIDIA Dynamo, Determined, Ray, Kueue, or KServe is a further advantage. A strong passion for debugging complex low-level issues is also noted as beneficial. The position is remote within the US and comes with a competitive compensation range and equity. |