A disappearing Service Processor (2025)

Recorded: May 30, 2026, 9 p.m.


11 Dec 2025
Laura Abbott, Engineer

One of the considerations in designing our Oxide rack is asking which parts we
expect to be accessible and by what means. The Oxide rack is designed to live
in a data center with exclusive access via the network. The only reason an
engineer should ever need to physically visit a rack is to replace a failing
part, such as a disk. Our Service Processor (SP) is accessible via the management network.

During some of our first attempts at putting our next generation Cosmo sled
into an Oxide rack, we would see the Service Processor drop off the network.
This is a tricky situation to debug, as without network access we have limited
insight into the state of the SP itself. Debugging started based on the state
of the rest of the system (the original Hubris bug may contain spoilers for the blog post!):

* The AMD host CPU was still alive, meaning the full system itself still had power
* The SP itself was not broadcasting over the management network that it was alive
* There were no increases in network data counters coming from the SP
* The fans were spinning at a constant elevated rate. The service processor is
  responsible for fan control, so this was an indication the fan controller may
  have fallen back to emergency full power mode.
* This was not reproducible on a sled outside a rack

The Service Processor runs our custom operating system, Hubris. Each portion of the system (networking, thermal
control, update etc.) is written as a separate task. Hubris is not a true Real
Time Operating System with deadline guarantees, but it does have the notion of
task priorities. One of our working theories was that we had a software bug
that was causing task starvation. If the networking task was unable to run due
to some other task eating up all the CPU time, it would not be able to respond
over the network. A likely culprit of task starvation could be a task that had
gotten into an infinite crash loop, with all CPU time being spent restarting the
task. We adjusted the task restart time to have a longer delay to catch this
case. We also wanted to be able to observe if the SP was still making progress
even if we lacked networking access, and so switched our chassis LED from
"always on" to blinking.

We were fortunate to be able to reproduce the issue with these debug changes, but the
results were still confusing: in some cases we would see the LED stuck on, and
in other cases the LED was stuck off. The task responsible for LED blinking was
near the top of priorities, which limited the number of places we could have a
stuck task.

One of the many advantages of writing Hubris in Rust is eliminating
bug classes such as buffer overflows. A category of issues Hubris is
still particularly prone to is stack overflows. This is because Hubris
requires manual sizing of stacks for tasks and calculating maximum stack
size has proven tricky. Our ability to detect undersized stacks has
improved with the addition of the emit-stack-sizes feature,
but we can still hit some edge cases.
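
To make the sizing problem concrete, here is a minimal sketch of how a worst-case stack depth can be derived once per-function frame sizes are known. It is not Hubris's actual tooling: the `Func` type, the `max_stack` function, and the shape of the input are assumptions made for illustration. The idea is that the deepest stack is the largest sum of frame sizes along any call path, and that recursion (or a callee with no known frame size) makes a static bound impossible, which is one kind of edge case such an analysis can hit.

```rust
use std::collections::HashMap;

/// Frame size in bytes and direct callees of one function.
/// (Hypothetical input shape; real data would come from the compiler.)
struct Func {
    frame: u32,
    calls: Vec<&'static str>,
}

/// Worst-case stack depth in bytes when execution starts at `name`.
/// Returns None if the call graph is recursive or a callee is unknown.
fn max_stack(
    graph: &HashMap<&'static str, Func>,
    name: &'static str,
    on_path: &mut Vec<&'static str>,
) -> Option<u32> {
    if on_path.contains(&name) {
        return None; // recursion: depth cannot be bounded statically
    }
    let f = graph.get(name)?; // unknown function: no frame size available
    on_path.push(name);
    let mut deepest = 0u32;
    let mut bounded = true;
    for &callee in &f.calls {
        match max_stack(graph, callee, on_path) {
            Some(depth) => deepest = deepest.max(depth),
            None => {
                bounded = false;
                break;
            }
        }
    }
    on_path.pop();
    if bounded {
        Some(f.frame + deepest)
    } else {
        None
    }
}
```

A task's stack would then need at least `max_stack` bytes for its entry point, plus whatever an interrupt or fault handler might push on top, which this sketch does not model.
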
When a stack overflow occurs, the task safely restarts. A stack overflow in
the kernel would potentially produce similar behavior of a system that looks
like it isn’t making progress. Unfortunately for us the stack margins on the
kernel were relatively large (512 bytes!) so this was an unlikely case.

At this point, we really needed to get more debugging information out of the
system. For manufacturing purposes, we have SWD debug headers. These are not
expected to be used on a production system and especially not a system in a
running rack. We had to do some creative cable pulling to get them attached
with the assistance of coworkers in the Oxide office.

Fortunately, our cable attachment paid dividends: we reproduced the issue with
the probe attached! This was not immediately fruitful: the debug probe was
unable to actually halt the CPU via debug halt, which limited our ability to
extract diagnostic information. Our Service Processor uses a Cortex-M7 STM32H7,
and the number of ways to put the system in such a state is limited.

This put our focus on identifying what parts of the system could cause such
behavior. A major
change from our first generation Gimlet system was the addition of an FPGA to
control more parts of our system such as host flash.
This FPGA is connected using a simple, old-school parallel bus, like the sort
you might use for RAM, and accessed via the STM32H7 Flexible Memory Controller.
As stated in the manual (Section 22.1 RM0433):

Its main purposes are:
* to translate AXI transactions into the appropriate external device protocol
* to meet the access time requirements of the external memory devices

One way a CPU can potentially get stuck is if it never receives a bus
acknowledgement from an external device. A bug in the FPGA timing, for example,
could result in the CPU hanging forever when attempting to read a register.
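
The failure mode can be illustrated with a toy model of a handshaking bus. This is a sketch under stated assumptions, not a model of the real FMC or FPGA: `Device`, `Read`, and `bus_read` are hypothetical names, and the `observe_for` limit exists only so the simulation terminates. A real core has no such limit, and the instruction that issued the access simply never completes.

```rust
/// Toy model of a device on a handshaking bus (hypothetical, not the FMC itself).
struct Device {
    /// Cycles until the device acknowledges a read; None means it never does.
    ack_after: Option<u32>,
    value: u32,
}

#[derive(Debug, PartialEq)]
enum Read {
    Done { value: u32, wait_cycles: u32 },
    /// The model stopped observing; a real core would keep waiting.
    Stalled,
}

/// Issues one read and waits for the acknowledgement, for at most
/// `observe_for` simulated cycles.
fn bus_read(dev: &Device, observe_for: u32) -> Read {
    for cycle in 0..observe_for {
        if dev.ack_after == Some(cycle) {
            return Read::Done { value: dev.value, wait_cycles: cycle };
        }
        // No acknowledgement yet: the access stays outstanding for another cycle.
    }
    Read::Stalled
}
```

A device with `ack_after: None` returns `Read::Stalled` no matter how large `observe_for` is, which is the behavior an intentionally hanging test register would reproduce.
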
To test this theory, we created an FPGA test image with a register that when
read would intentionally hang the FMC bus. This produced very similar behavior
to what we observed and was a strong indicator we were looking at the right
part of the system to find the issue.

We typically rely on full system dumps to debug Hubris problems. This is not
possible unless we can halt the CPU. ARM CPUs do support vector catch though:
it’s possible to configure the CPU so that on reset, it halts before executing the first instruction. Our
hope was that a vector catch reset would unstick the CPU sufficiently without
trampling over our existing state. This did work. We lost the running register
state with the program counter but the rest of the Hubris state in RAM was
preserved across reset and looked reasonably consistent. We could see what
Hubris task was running, but nothing there looked like it was accessing the FMC.

Our hardware engineers did a review of FPGA timings and did find that we might
not have been meeting timing constraints required by the memory interface.
We merged the fix and figured that the vector
catch dumps were just inconsistent, most likely due to the cache. When we
ran experiments to turn off the cache the dumps were consistent but we never
reproduced the actual issue.

We continued Hubris development as usual over the next several weeks. One
of the changes we worked on during this period was related to our
measured boot work. Our Root of Trust (RoT) is responsible for taking
a hash of the SP flash at bootup which eventually gets used by higher
level software. To achieve the security properties we need, the SP may
reset itself multiple times in a row at first bootup. While testing
this change, we saw the same symptoms come back: the Cosmo SP would disappear
from the network and appear dead. This change turned out to be incredibly good
at reproducing the issue, turning a potentially 24+ hour reproduction time
into approximately 10-20 minutes. The initial dumps still didn’t show a significant smoking gun,
but we were still highly suspicious of the FMC bus since there were still
limited cases that could produce such symptoms.

The high reproduction rate gave us a chance to try many experiments, none of
which were fruitful:

* Adjusting the rate at which we reset and the number of resets before normally booting
* Clearing the FPGA bit stream an extra time
* Restricting tasks from accessing the FMC bus
* Removing whole tasks that seemed to be unrelated

Finally, staring at the STM32H7 manual provided an insight: maybe the processor
itself was performing accesses on the FMC bus that we weren’t expecting!
Modern processors hold a large amount of internal state that isn’t directly
visible to the programmer. It is not possible for a programmer to know when
a CPU will pull data into or out of the cache outside of certain
synchronization points or cache instructions. A CPU writing data from the
cache to memory is considered a memory access so it’s possible for the CPU
to be making memory accesses to addresses unrelated to the current program
counter.

Hubris utilizes the Memory Protection Unit (MPU) to provide isolation between
tasks and enforce privilege levels. Our configuration uses the MPU for the
unprivileged tasks but uses the default memory map for the (privileged) kernel.
In the tasks, the FMC is mapped as Uncached Device Memory. Based on our reading
of the STM32H7 manual, it turned out our chosen base address for the FMC bus
had a default memory type of Normal Cached. This means the FMC has different
attributes depending on whether it’s being accessed from a task or the kernel.

Section A3.5.7 of the ARMv7-M reference manual has an entire section about
mismatched memory attributes and what properties are lost in this situation.
Based on discussion with our hardware engineers, the line "Preservation of the
size of accesses" was the most suspicious. Our FPGA interface was designed for
32-bit accesses, and 16-bit or 8-bit accesses could potentially cause problems.

It’s important to note that the kernel was never intentionally accessing the
FMC through the Normal Cached mapping. The most likely scenario was:

* The CPU running an unprivileged task accessing the FMC issues a store that makes it to the processor’s store buffer
* An interrupt occurs, switching us into privileged mode which uses the default memory map
* The store hits the cache because the default memory map said that address is cached
* The cache attempted to write to memory in ways outside the expected Device Memory attributes

One of the last lines of section A3.5.7 is "Arm strongly recommends that
software does not use mismatched attributes for aliases of the same location."
The default ARM memory map (which the kernel relies on) assigns different
attributes to different sections of the address space, and one of the
sections is set up the way we want: device memory, no caching. It turns
out the STM32H7 FMC supports changing its base address to appear in this
section of address space, likely to avoid the specific problem we were facing.
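
The relationship between a base address and its default attributes can be sketched in a few lines. The region boundaries below follow the ARMv7-M default memory map, which divides the 4 GiB address space into eight 512 MiB regions; the `MemType` enum and both function names are hypothetical, and the addresses used in the usage note are illustrative only, since the post does not state the old or new FMC base address.

```rust
/// Coarse memory type of a region in the ARMv7-M default memory map.
#[derive(Debug, PartialEq, Clone, Copy)]
enum MemType {
    Normal, // cacheable by default
    Device, // not cached
    System, // private peripheral bus and vendor-specific space
}

/// Default memory type for `addr`, selected by the top three address bits.
fn default_type(addr: u32) -> MemType {
    match addr >> 29 {
        0 => MemType::Normal,     // 0x0000_0000: Code
        1 => MemType::Normal,     // 0x2000_0000: SRAM
        2 => MemType::Device,     // 0x4000_0000: Peripheral
        3 | 4 => MemType::Normal, // 0x6000_0000, 0x8000_0000: external RAM
        5 | 6 => MemType::Device, // 0xA000_0000, 0xC000_0000: external device
        _ => MemType::System,     // 0xE000_0000: System
    }
}

/// True when a task's MPU mapping disagrees with the default map that
/// privileged code falls back to for the same address.
fn is_mismatched(task_view: MemType, addr: u32) -> bool {
    task_view != default_type(addr)
}
```

With these definitions, `is_mismatched(MemType::Device, 0x6000_0000)` is true, while `is_mismatched(MemType::Device, 0xC000_0000)` is false: a Device mapping in a task only agrees with the default map if the base address sits in one of the Device regions.
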
The final fix was changing the base address to the section with
matching attributes. We’ve seen no instances of this issue since that fix was merged.

Transparency continues to be an Oxide value. Debugging modern CPUs often
involves diving into areas with little transparency. "Under what circumstances
will you be unable to access your memory bus" is a tricky question to answer.
Our debugging efforts this time were aided by documentation from ARM and STM
that eventually explained our problem. Given the difficulty in debugging this
issue, highlighting this potential problem in vendor documentation would be
beneficial to all customers. Oxide hopes all hardware vendors continue to
document as much of their part as possible for the benefit of their customers.

The investigation into the disappearing Service Processor (SP) in the Oxide rack was complicated by limited diagnostic access: with the SP unreachable over the management network, the team initially had to infer its state from the rest of the system. The AMD host CPU was still alive, so the sled still had power, but the SP was not announcing itself on the management network and its network data counters were not increasing. The fans were spinning at a constant elevated rate; because the SP is responsible for fan control, this suggested the fan controller had fallen back to emergency full power mode. The failure also did not reproduce on a sled outside a rack.

The Service Processor runs a custom operating system called Hubris, in which each portion of the system (networking, thermal control, update, and so on) is a separate task with a priority. One working theory was task starvation: a task stuck in an infinite crash loop could consume all CPU time and keep the networking task from running. To catch this case, the team lengthened the task restart delay and switched the chassis LED from always on to blinking so that progress could be observed without network access. The results were confusing, since the LED was found stuck on in some cases and stuck off in others. The team also considered stack overflows, which Hubris is prone to because task stacks are sized manually, but a kernel stack overflow looked unlikely given the kernel's 512-byte stack margin.
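
The starvation theory can be illustrated with a toy strict-priority scheduler. This is a sketch, not Hubris code: the `Task` fields, the `simulate` function, the tick-based timing, and the convention that a lower `priority` value is more important are all assumptions of the model. It shows why a crash-looping task that restarts immediately can take every scheduling slot from a less important task, and why a restart delay lets the rest of the system make progress.

```rust
/// A toy task in a strict-priority scheduler (hypothetical model).
struct Task {
    priority: u8,      // lower value = more important (assumption of this model)
    crash_loops: bool, // the task faults every time it is scheduled
    ready_at: u32,     // first tick at which the task may run again
}

/// Runs `ticks` scheduler ticks and returns how many ticks each task received.
fn simulate(tasks: &mut [Task], ticks: u32, restart_delay: u32) -> Vec<u32> {
    let mut ran = vec![0u32; tasks.len()];
    for now in 0..ticks {
        // Pick the most important task that is ready at this tick.
        let next = tasks
            .iter()
            .enumerate()
            .filter(|(_, t)| t.ready_at <= now)
            .min_by_key(|(_, t)| t.priority)
            .map(|(i, _)| i);
        if let Some(i) = next {
            ran[i] += 1;
            if tasks[i].crash_loops {
                // The task faulted; it is not ready again until the delay passes.
                tasks[i].ready_at = now + 1 + restart_delay;
            }
        }
    }
    ran
}
```

With a crash-looping task at priority 1 and a well-behaved task at priority 2, 100 ticks with `restart_delay = 0` give the second task no time at all, while `restart_delay = 9` leaves it 90 of the 100 ticks.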

To get more information, the team attached a debug probe to the SWD debug headers, which exist for manufacturing and are not expected to be used on a sled in a running rack. The issue reproduced with the probe attached, but the probe was unable to halt the Cortex-M7 STM32H7 via debug halt, which limited the diagnostic information that could be extracted.

A major change from the first generation Gimlet system was the addition of an FPGA that controls more parts of the system, such as host flash. The FPGA is connected over a parallel bus and accessed through the STM32H7 Flexible Memory Controller (FMC), which translates AXI transactions into the external device protocol and meets the access time requirements of the external device. A CPU can hang if it never receives a bus acknowledgement from an external device, so a timing bug in the FPGA was a plausible cause. To test this theory, the engineers built an FPGA test image with a register that intentionally hangs the FMC bus when read. The resulting behavior closely matched the observed failure, a strong indicator that this was the right part of the system to investigate.

The breakthrough came from considering accesses the processor itself might make on the FMC bus. Hubris uses the Memory Protection Unit (MPU) to isolate unprivileged tasks, and tasks map the FMC as Uncached Device Memory, while the privileged kernel uses the default memory map, in which the chosen FMC base address is Normal Cached. The most likely scenario was that a store issued by an unprivileged task reached the store buffer, an interrupt then switched the CPU into privileged mode, and the store hit the cache because the default memory map marks that address as cached. The cache could then write to memory in ways that Device Memory attributes would not allow. Of the properties lost with mismatched attributes, preservation of access size was the most suspicious, because the FPGA interface was designed for 32-bit accesses.

The final fix removed the mismatch. The STM32H7 FMC supports changing its base address, so the team moved it to a section of the default memory map that is already device memory with no caching, matching the attributes used by tasks. No instances of the issue have been seen since the fix was merged. The post closes by noting that documentation from ARM and STM eventually explained the problem, and that highlighting this hazard in vendor documentation would benefit all customers.