LmCast :: Stay tuned in

Cedana (YC S23) Is Hiring

Recorded: May 29, 2026, 12:02 p.m.

Forward Deployed Engineer: AI + HPC at Cedana | Y Combinator

Cedana
Fast, reliable, reproducible AI with GPU live migration
Forward Deployed Engineer: AI + HPC
$140K - $180K • 0.10% - 0.25% • US / Remote (US)
Job type: Full-time
Role: Engineering, Backend
Experience: 3+ years
Visa: US citizen/visa only
Founder: Neel Master
About the role
Introducing Cedana
The Problem
AI and HPC infrastructure is scarce and expensive, so failures are costly in both time and money. Cluster productivity directly determines research output and revenue. Achieving high utilization and throughput is increasingly challenging due to the complexity of workloads, hardware, and operations.
Cedana’s Solution
Cedana maximizes AI+HPC cluster utilization and reliability with automated GPU checkpointing infrastructure. We enable transparent and fast migration of GPU workloads across instances, without losing work. Workloads automatically migrate to achieve new levels of reliability and throughput while accelerating time to results. Our system is at the kernel/OS level, requiring no code or config changes, and works seamlessly with Kubernetes, SLURM, and NVIDIA Dynamo. Today, we're deploying into leading inference platforms, neoclouds, enterprise, and research clusters.
The Team
Cedana's founding team has spent over a decade making computation run fast, productively, and reliably for AI. Our research appears in NeurIPS and CVPR. We published some of the earliest formal methods for guaranteeing convergence in distributed training. At Shopify, we deployed warehouse automation and robot fleets, building behavior trees, fleet control planes, and OTA infrastructure that performs reliably over constrained networks. We bring repeat founder experience, having built and exited a healthcare AI company.
The Role
What you’ll own
As a Forward Deployed Engineer at Cedana, you’ll lead and own technical engagement from end to end. You’ll engage with customers to understand and deploy in their environments: from production SLURM at a university, to bare-metal Kubernetes at an inference provider, to a hybrid setup at a Fortune 100 pharma enterprise. You’ll rapidly understand their key pain points and use Cedana to solve their problems. For each customer you own everything from the OS up: SLURM plugins, Kubernetes operators, node configuration, networking, and observability.
This role will expose you to the cutting edge of AI and HPC infrastructure, working with the world’s leading research and commercial customers to deliver a breakthrough solution.
What You'll Do

Engineer solutions at client sites: Lead customer integrations. Install, configure, and deploy Cedana into SLURM, Kubernetes, and Dynamo environments.
Drive product innovation from the field: Identify technical gaps while embedded with clients, then provide product feedback for new capabilities that become core product features.
Measure and optimize platform performance: Measure reliability, throughput, and performance using our internal tools. Design and implement policy-based migration automations to optimize reliability, throughput, and performance.
Own critical deployments: Ensure our platform performs reliably for clients' critical operations, debugging issues across the full stack. Debug install issues against unfamiliar customer infrastructure, and escalate to engineering when necessary.
Improve scalability: Build the internal install playbook so the second customer in each segment is faster than the first.
Respect our customers: Understand ways to make their lives easier and minimize their time and overhead.

What we are looking for

3-10 years of software engineering experience with a track record of configuring and managing SLURM deployments.
A multi-month enterprise or research deployment you led end-to-end, from scoping through signoff. You write effective status updates to keep your team informed and on schedule.
Production experience standing up SLURM in a customer or research environment. You've configured slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection.
Strong Linux fundamentals: systemd, cgroups v2, namespaces, networking, filesystems, kernel module loading, and PAM session modules. You read strace and dmesg output and form a hypothesis.
Working knowledge of Kubernetes operations, including operators, CRDs, device plugins, and node-level debugging. You've debugged a controller in production even if you haven't written one from scratch.

Bonus if you have

Experience on an HPC integrator field team
Client-facing technical experience working directly with customers.
Background in national lab user services or university research computing
You’ve developed SLURM plug-ins, and understand their architecture and how they fit into the overall platform.
Familiarity with CRIU, container runtimes, GPU driver internals, and distributed training stacks
Hands-on with NVIDIA Dynamo, Determined, Ray, Kueue, KServe, or comparable AI orchestration.
Contributed to open-source schedulers or job systems (SLURM, Flux, Torque, PBS).
A passion for debugging a weird cgroup issue at 11pm just as much as writing a clean install playbook the next morning.

Logistics

Remote, US-based. ~25% travel for customer installs.
Base $140,000–$180,000 + meaningful early-stage equity.

Benefits

100% covered medical, dental, and vision insurance for employees and families
Unlimited PTO policy
401(k) plan

Equal Opportunity Employer
Cedana is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.
About the interview
Initial interview for fit
Written component to understand background and motivation. Not a coding test.
Interviews with engineering team.
References

About Cedana
Cedana is pause/migrate/resume for compute workloads. We're working on building a global, real-time system for compute. This means a paradigm shift in how we allocate resources to things like high performance computing, numerical simulation, and training and running machine learning models. We do so by taking a systems-level and deep-tech approach to these problems, working at the Linux kernel layer and with hardware.
Cedana
Founded: 2023
Batch: S23
Team Size: 5
Status: Active
Location: New York
Founders: Neel Master, Niranjan Ravichandra

Cedana addresses the scarcity and high cost of AI and High Performance Computing (HPC) infrastructure, which make failures expensive in both time and money and tie research output and revenue directly to cluster productivity. The core problem is the difficulty of achieving high utilization and throughput given the complexity of workloads, hardware, and operations. Cedana’s solution maximizes AI+HPC cluster utilization and reliability through automated GPU checkpointing infrastructure that migrates GPU workloads across compute instances transparently and quickly, without losing work. The system operates at the kernel and operating system level and requires no code or config changes. Workloads migrate automatically to improve reliability and throughput while accelerating time to results, and the system integrates with Kubernetes, SLURM, and NVIDIA Dynamo.

The founding team has spent over a decade making computation fast, productive, and reliable for AI, with research published in venues like NeurIPS and CVPR and experience deploying complex systems at Shopify, such as warehouse automation and fleet control planes. The founders have also built and exited a healthcare AI company, and they ground their approach in deep-tech and systems-level thinking.

The Forward Deployed Engineer role centers on leading and owning the technical engagement for customers, acting as the bridge between the advanced infrastructure and real-world deployments. This involves engaging with clients—ranging from university research environments running production SLURM to enterprise clusters and neoclouds—to diagnose pain points and deploy the Cedana solution across the entire stack. The engineer is responsible for end-to-end operational ownership, spanning from low-level system components, such as SLURM plugins and node configuration, up through Kubernetes operators, networking, and observability.

Key responsibilities include engineering solutions at client sites by installing and configuring Cedana within environments utilizing SLURM, Kubernetes, and Dynamo. Furthermore, the role requires driving product innovation by identifying technical gaps during customer engagements and feeding that feedback back to the development team to shape new product features. Performance optimization is critical, involving the measurement of reliability, throughput, and performance using internal tools, alongside the design and implementation of policy-based migration automations to optimize these metrics. The engineer must ensure the reliable operation of the platform for critical client operations by debugging issues across the entire stack and handling complex installation debugging on unfamiliar customer infrastructure. A focus on scalability is also emphasized through building internal installation playbooks to rapidly onboard subsequent customers.

To succeed in this role, candidates must possess a solid foundation in systems engineering. This includes three to ten years of software engineering experience with demonstrated expertise in configuring and managing SLURM deployments. A critical requirement is production experience setting up SLURM environments, including configuring slurmctld, slurmdbd, accounting, cgroup integration, and GPU resource selection. Strong Linux fundamentals are essential: systemd, cgroups v2, namespaces, networking, filesystems, kernel module loading, and PAM session modules, along with the ability to read strace and dmesg output and form a hypothesis. Additionally, candidates need working experience with Kubernetes operations, including operators, Custom Resource Definitions, device plugins, and node-level debugging; they should have debugged a controller in production even if they have not written one from scratch.

Bonus experience includes backgrounds in HPC integrator field teams, client-facing technical work, national lab user services, or university research computing. Experience developing SLURM plug-ins, contributions to open-source schedulers or job systems (SLURM, Flux, Torque, PBS), and familiarity with CRIU, container runtimes, GPU driver internals, distributed training stacks, or AI orchestration platforms like NVIDIA Dynamo, Determined, Ray, Kueue, or KServe are further advantages. A passion for debugging complex low-level issues is also noted as beneficial. The position is remote within the US with about 25% travel for customer installs, a base salary of $140,000–$180,000, and early-stage equity.