Tales From a Core File - CPU and PCH Temperature Sensors in illumos

A while back, I did a bit of work that I’ve been meaning to come back to and write about. The first of these are all about making it easier to see the temperature that different parts of the system are working with. In particular, I wanted to make sure that I could understand the temperature of the following different things:

Intel CPU Cores
Intel CPU Sockets
AMD CPUs
Intel Chipsets

While on some servers this data is available via IPMI, that doesn’t help you if you’re running a desktop or a laptop. Also, if the OS can know about this as a first class item, why bother going through IPMI to get at it? This is especially true as IPMI sometimes summarizes all of the different readings into a single one.

Seeing is Believing

Now, with these present, you can ask fmtopo to see the sensor values. While fmtopo itself isn’t a great user interface, it’s a great building block to centralize all of the different sensor information in the system. From here, we can build tooling on top of the fault management architecture (FMA) to better see and understand the different sensors. FMA will abstract all of the different sources. Some of them may be delivered by the kernel while others may be delivered by user land. With that in mind, let’s look at what this looks like on a small Kaby Lake NUC:

[root@estel ~]# /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME                 UUID
Aug 12 20:44:08 88c3752d-53c2-ea3a-c787-cbeff0578cd0

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    43.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    49.000000

hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: protocol                       version: 1   stability: Private/Private
    resource          fmri      hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
  group: authority                      version: 1   stability: Private/Private
    product-id        string    NUC7i3BNH
    chassis-id        string    G6BN735007J5
    server-id         string    estel
  group: facility                       version: 1   stability: Private/Private
    sensor-class      string    threshold
    type              uint32    0x1 (TEMP)
    units             uint32    0x1 (DEGREES_C)
    reading           double    46.000000

While it’s a little bit opaque, you might be able to see that we have four different temperature sensors here:

Core 0’s temperature sensor
Core 1’s temperature sensor
Package 0’s temperature sensor
The chipset’s temperature sensor (Intel calls this the Platform Controller Hub)

On an AMD system, the output is similar, but the sensor exists on a slightly different granularity than a per-core basis. We’ll come back to that a bit later.

Pieces of the Puzzle

To make all of this work, there are a few different pieces that we put together:

Kernel device drivers to cover the Intel and AMD CPU sensors, and the Intel PCH
A new evolving, standardized way to export simple temperature sensors
Support in FMA’s topology code to look for such sensors and attach them

We’ll work through each of these different pieces in turn. The first part of this was to write three new device drivers one to cover each of the different cases that we cared about.

coretemp driver

The first part of the drivers is the coretemp driver. This uses the temperature interface that was introduced on Intel Core family processors. It allows an operating system to read a MSR (Model Specific Register) to determine what the temperature is. The support for this is indicated by a bit in one of the CPUID registers and exists on almost every Intel CPU that has come out since the Intel Core Duo.

Around the time of Intel’s Haswell processor (approximately), Intel added another CPUID bit and MSR that indicates what the temperature is on each socket.

The coretemp driver has two interesting dynamics and problems:

Because of the use of the rdmsr instruction, which reads a model-specific register, one can only read the temperature for the CPU that you’re currently executing on. This isn’t too hard to arrange in the kernel, but it means that when we read the temperature we’ll need to organize what’s called a 'cross-call'. You can think of a cross-call as a remote procedure call, except that the target is a specific CPU in the system and not a remote host.
Intel doesn’t actually directly encode the temperature in the MSRs. Technically, the value we read represents an offset from the processor’s maximum junction temperature, often abbreviated Tj Max. Modern Intel processors provide a means for us to read this directly via an MSR. However, older ones, unfortunately, do not. On such older processors, the Tj Max actually varies based on not just the processor family, but also the brand, so different processors running at different frequencies have different values. Some of this information can be found in various datasheets, but for the moment, we’ve only enabled this driver for CPUs that we can guarantee the Tj Max value. If you have an older CPU and you’d like to see if we could manually enable it, please reach out.

pchtemp driver

The pchtemp driver is a temperature sensor driver for the Intel platform controller hub (PCH). The driver supports most Intel CPUs since the Haswell generation, as the format of the sensor changed starting with the Haswell-era chipsets.

This driver is much simpler than the other ones. The PCH exposes a dedicated pseudo-PCI device for this purpose. The pchtemp driver simply attaches to that and reads the temperature when required. Unlike the coretemp driver, the offset in temperature is the same across all of the currently supported platforms so we don’t need to encode anything special there like we do for the coretemp driver.

amdf17nbdf driver

The amdf17nbdf driver is a bit of a mouthful. It stands for the AMD Family 17h North Bridge and Data Fabric driver. Let’s take that apart for a moment. On x86, CPUs are broken into different families to represent different generations. Currently all of AMD’s Ryzen/EPYC processors that are based on the Zen microarchitecture are all grouped under Family 17h. The North Bridge is a part of the CPU that is used to interface with various I/O components on the system. The Data Fabric is a part of AMD CPUs which connects CPUs, I/O devices, and DRAM.

On AMD Zen family processors, the temperature sensor exists on a per-die basis. Each die is a group of cores. The physical chip has a number of such units, each of which in the original AMD Naples/Zen 1 processor has up to 8 cores. See the illumos cpuid.c big theory statement for the details on how the CPU is constructed and this terminology. Effectively, you can think of it as there are a number of different temperature sensors, one for each discrete group of cores.

To talk to the temperature sensor, we need to send a message on the what AMD calls the 'system management network' (SMN). The SMN is used to connect different management devices together. The SMN can be used for a number of different purposes beyond just temperature sensors. The operating system has a dedicated means of communicating and making requests over the SMN by using the corresponding north bridge, which is exposed as a PCI device.

The same way that with the coretemp driver you needed to issue a rdmsr instruction for the core that you wanted the temperature from, you need to do the same thing here. Each die has its own north bridge and therefore we need to use the right instance to talk to the right group of CPUs.

The wrinkle with all of this is that the north bridge on its own doesn’t give us enough information to map it back to a group of cores that an operator sees. This is critical, since if you can’t tell which cores you’re getting the temperature reading for, it immediately becomes much less useful.

This is where the data fabric devices come into play. The data fabric devices exist at a rather specific PCI bus, device, and function. They all are always defined to be on PCI bus 0. The data fabric device for the first die is always defined to be at device 18h. The next one is at 19h, and so on. This means that we have a deterministic way to map between a data fabric device and a group of cores. Now, that’s not enough on its own. While we know the data fabric, we don’t know how to map that to the north bridge.

Each north bridge in the system is always defined to be on its own PCI bus. The device we care about is always going to be device and function 0. The data fabric happens to have a register which defines for us the starting PCI bus for its corresponding north bridge. This means that we have a path now to get to the temperature sensor. For a given die, we can find the corresponding data fabric. From the data fabric, we can find the corresponding north bridge. And finally, from the north bridge, we can find the corresponding SMN (system management network) that we can communicate with.

With all that, there’s one more wrinkle. On some processors, most notably the Ryzen and ThreadRipper families, the temperature that we read has an offset encoded with it. Unfortunately, these offsets have only been documented in blog posts by AMD and not in the formal documentation. Still, it’s not too hard to take this into account once official documentation becomes available.

While our original focus was on adding support for AMD’s most recent processors, if you have an older AMD processor and are interested in wiring up the temperature sensors on it, please get in touch and we can work together to implement something.

sys/sensors.h

Now that we have drivers that know how to read this information, the next problem we need to tackle is how do we expose this information to user land. In illumos, the most common way is often some kind of structured data that can be retrieved by an ioctl on a character device, or some other mechanism like a kernel statistic.

After talking with some folks, we put together a starting point for a way for a kernel to exposes sets of statistics and created a new header file in illumos called sys/sensors.h. This header file isn’t currently shipped and is only used by software in illumos itself. This makes it easy for us to experiment with the API and change it without worrying about breaking user software. Right now, each of the above drivers implements a specific type of character device that implements the same, generic interface.

The current interface supports two different types of commands. The first, SENSOR_IOCTL_TYPE, answers the question of what kind of sensor is this. Right now, the only valid answer is SENSOR_KIND_TEMPERATURE. The idea is that if we have other sensors, say voltage, current, or something else, we could return a different kind. Each kind, in turn, promises to implement the same interface and information.

For temperature devices, we need to fill in a singular structure which is used to retrieve the temperature. This structure currently looks something like:

typedef struct sensor_ioctl_temperature {
        uint32_t        sit_unit;
        int32_t         sit_gran;
        int64_t         sit_temp;
} sensor_ioctl_temperature_t;

This is kind of interesting and incorporates some ideas from Joshua Clulow and Alex Wilson. The sit_unit member is used to describe what unit the temperature is in. For example, it may be in Celsius, Kelvin, or Fahrenheit.

The next two members are a little more interesting. The sit_temp member contains a temperature reading, the sit_gran member is whats important in how we interpret that temperature. While many sensors end up communicating to digital consumers using a power of 2 based reading, that’s not always the case. Some sensors often may report a reading in units such as 1/10th of a degree. Others may actually report something in a granularity of 2 degrees!

To try and deal with this menagerie, the sit_gran member indicates the number of increments per degree in the sit_temp member. If this was set to 10, then that would mean that sit_temp was in 10ths of a degree and to get the actual value in degrees, one would need to divide by 10. On the other hand, a negative value instead means that we would need to multiply. So, a value of -2 would mean that sit_temp was in units of 2 degrees. To get the actual temperature, you would need to multiply sit_temp by 2.

Now, you may ask why not just have the kernel compute this and have a ones digit and a decimal portion. The main goal is to avoid floating point math in the kernel. For various reasons, this historically has been avoided and we’d rather keep it that way. While this may seem a little weird, it does allow for the driver to do something simpler and lets user land figure out how to transform this into a value that makes semantic sense for it. This gets the kernel out of trying to play the how many digits after the decimal point would you like game.

Exposing Devices

The second part of this bit of kernel support is to try and provide a uniform and easy way to see these different things under /dev in the file system. In illumos, when someone creates a minor node in a device driver, you have to specify a type of device and a name for the minor node. While most of the devices in the system use a standard type, we added a new class of types for sensors that translate roughly into where you’ll find them.

So, for example, the CPU drivers use the class that has the string "ddi_sensor:temperature:cpu" (usually as the macro DDI_NT_SENSOR_TEMP_CPU). This is used to indicate that it is a temperature sensor for CPUs. The coretemp driver then creates different nodes for each core and socket. For example, on a system with an Intel E3-1270v3 (Haswell), we see the following devices:

[root@haswell ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/chip0
/dev/sensors/temperature/cpu/chip0.core0
/dev/sensors/temperature/cpu/chip0.core1
/dev/sensors/temperature/cpu/chip0.core2
/dev/sensors/temperature/cpu/chip0.core3
/dev/sensors/temperature/pch
/dev/sensors/temperature/pch/ts.0

On the other hand on an AMD EPYC system with two AMD EPYC 7601 processors, we see:

[root@odyssey ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/procnode.0
/dev/sensors/temperature/cpu/procnode.1
/dev/sensors/temperature/cpu/procnode.2
/dev/sensors/temperature/cpu/procnode.3
/dev/sensors/temperature/cpu/procnode.4
/dev/sensors/temperature/cpu/procnode.5
/dev/sensors/temperature/cpu/procnode.6
/dev/sensors/temperature/cpu/procnode.7

The nice thing about the current scheme is that anything of type ddi_sensor will have a directory hierarchy created for it based on the different interspersed : characters. This makes it very easy for us to experiment with different kinds of sensors without having to go through too much effort. That said, this is all still evolving, so there’s no promise that this won’t change. Please don’t write code that relies on this. If you do, it’ll likely break!

FMA Topology

The last piece of this was to wire it up in FMA’s topology. To do that, I did a few different pieces. The first was to make it easy to add a node to the topology that represents a sensor backed by this kernel interface. There’s one generic implementation of that which is parametrized by the path.

With that, I first modified the CPU enumerator. The logic will use a core sensor if available, but can also fall back to a processor-node sensor if it exists. Next, I added a new piece of enumeration, which was to represent the chipset. If we have a temperature sensor, then we’ll enumerate the chipset under the motherboard. While this is just the first item there, I suspect we’ll add more over time as we try to properly capture more information about what it’s doing, the firmware revisions that are a part of it, and more.

This piece is, in some ways, the simplest of them all. It just builds on everything else that was already built up. FMA already had a notion of a sensor (which is used for everything from disk temperature to the fan RPM), so this was just a simple matter of wiring things up.

Now, we have all of the different pieces that made the original example of the CPU and PCH temperature sensor work.

Looking Ahead

While this is focused on a few useful temperature sensors, there are more that we’d like to add. If you have a favorite sensor that you’d like to see, please reach out to me or the broader illumos community and we’d be happy to take a look at it.

Another avenue that we’re looking to explore is having a standardized sensor driver such that a device driver doesn’t necessarily have to have its own character device or implementation of the ioctl handling.

Finally, I’d like to thank all those who helped me as we discussed different aspects of the API, reviewed the work, and tested it. None of this happens in a vacuum.

Previous Entry: None| All Entries | Next Entry: A Tale of Two LEDs