A while back, I did a bit of work that I’ve been meaning to come back to and write about. The first piece of it is all about making it easier to see the temperature of different parts of the system. In particular, I wanted to make sure that I could understand the temperature of the following:
- Intel CPU Cores
- Intel CPU Sockets
- AMD CPUs
- Intel Chipsets
While on some servers this data is available via IPMI, that doesn’t help you if you’re running a desktop or a laptop. Also, if the OS can know about this as a first class item, why bother going through IPMI to get at it? This is especially true as IPMI sometimes summarizes all of the different readings into a single one.
Seeing is Believing
Now, with these present, you can ask fmtopo to see the sensor values. While fmtopo itself isn’t a great user interface, it’s a great building block to centralize all of the different sensor information in the system. From here, we can build tooling on top of the fault management architecture (FMA) to better see and understand the different sensors. FMA will abstract all of the different sources. Some of them may be delivered by the kernel while others may be delivered by user land. With that in mind, let’s look at what this looks like on a small Kaby Lake NUC:
[root@estel ~]# /usr/lib/fm/fmd/fmtopo -V *sensor=temp
TIME UUID
Aug 12 20:44:08 88c3752d-53c2-ea3a-c787-cbeff0578cd0
hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=0?sensor=temp
group: authority version: 1 stability: Private/Private
product-id string NUC7i3BNH
chassis-id string G6BN735007J5
server-id string estel
group: facility version: 1 stability: Private/Private
sensor-class string threshold
type uint32 0x1 (TEMP)
units uint32 0x1 (DEGREES_C)
reading double 43.000000
hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0/core=1?sensor=temp
group: authority version: 1 stability: Private/Private
product-id string NUC7i3BNH
chassis-id string G6BN735007J5
server-id string estel
group: facility version: 1 stability: Private/Private
sensor-class string threshold
type uint32 0x1 (TEMP)
units uint32 0x1 (DEGREES_C)
reading double 49.000000
hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chip=0?sensor=temp
group: authority version: 1 stability: Private/Private
product-id string NUC7i3BNH
chassis-id string G6BN735007J5
server-id string estel
group: facility version: 1 stability: Private/Private
sensor-class string threshold
type uint32 0x1 (TEMP)
units uint32 0x1 (DEGREES_C)
reading double 49.000000
hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=NUC7i3BNH:server-id=estel:chassis-id=G6BN735007J5/motherboard=0/chipset=0?sensor=temp
group: authority version: 1 stability: Private/Private
product-id string NUC7i3BNH
chassis-id string G6BN735007J5
server-id string estel
group: facility version: 1 stability: Private/Private
sensor-class string threshold
type uint32 0x1 (TEMP)
units uint32 0x1 (DEGREES_C)
reading double 46.000000
While it’s a little bit opaque, you might be able to see that we have four different temperature sensors here:
- Core 0’s temperature sensor
- Core 1’s temperature sensor
- Package 0’s temperature sensor
- The chipset’s temperature sensor (Intel calls this the Platform Controller Hub)
On an AMD system, the output is similar, but the sensor exists at a different granularity than per-core. We’ll come back to that a bit later.
Pieces of the Puzzle
To make all of this work, there are a few different pieces that we put together:
- Kernel device drivers to cover the Intel and AMD CPU sensors, and the Intel PCH
- A new, evolving, standardized way to export simple temperature sensors
- Support in FMA’s topology code to look for such sensors and attach them
We’ll work through each of these different pieces in turn. The first part of this was to write three new device drivers, one to cover each of the different cases that we cared about.
coretemp driver
The first of the drivers is the coretemp driver. This uses the temperature interface that was introduced on Intel Core family processors. It allows an operating system to read an MSR (Model Specific Register) to determine what the temperature is. Support for this is indicated by a bit in one of the CPUID registers and exists on almost every Intel CPU that has come out since the Intel Core Duo.
Around the time of Intel’s Haswell processors, Intel added another CPUID bit and MSR that indicate what the temperature is on each socket.
The coretemp driver has two interesting dynamics and problems:
- Because of the use of the rdmsr instruction, which reads a model-specific register, we can only read the temperature for the CPU that we’re currently executing on. This isn’t too hard to arrange in the kernel, but it means that when we read the temperature we’ll need to organize what’s called a 'cross-call'. You can think of a cross-call as a remote procedure call, except that the target is a specific CPU in the system and not a remote host.
- Intel doesn’t actually directly encode the temperature in the MSRs. Technically, the value we read represents an offset from the processor’s maximum junction temperature, often abbreviated Tj Max. Modern Intel processors provide a means for us to read this directly via an MSR. However, older ones, unfortunately, do not. On such older processors, the Tj Max varies based on not just the processor family, but also the brand, so different processors running at different frequencies have different values. Some of this information can be found in various datasheets, but for the moment, we’ve only enabled this driver for CPUs for which we can guarantee the Tj Max value. If you have an older CPU and you’d like to see if we could manually enable it, please reach out.
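To make the Tj Max arithmetic concrete, here is a minimal sketch of the decoding step. It is not the driver's actual code. The helper name and macros are made up for illustration, and the bit positions reflect my reading of Intel's documentation for IA32_THERM_STATUS and MSR_TEMPERATURE_TARGET, so check them against the manual before relying on them. It assumes the two raw MSR values have already been read on the right CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 31 of IA32_THERM_STATUS marks the readout as valid. */
#define	THERM_STATUS_VALID(v)	(((v) >> 31) & 0x1)
/* Assumed layout: bits 22:16 hold the number of degrees C below Tj Max. */
#define	THERM_STATUS_READOUT(v)	(((v) >> 16) & 0x7f)
/* Assumed layout: bits 23:16 of MSR_TEMPERATURE_TARGET hold Tj Max. */
#define	TEMP_TARGET_TJMAX(v)	(((v) >> 16) & 0xff)

/*
 * Hypothetical helper: turn two raw MSR values into degrees C.
 * Returns false if the hardware reports that the reading is not valid.
 */
bool
coretemp_decode(uint64_t therm_status, uint64_t temp_target, int *degc)
{
	if (!THERM_STATUS_VALID(therm_status))
		return (false);

	/* The readout is an offset below Tj Max, so subtract it. */
	*degc = (int)TEMP_TARGET_TJMAX(temp_target) -
	    (int)THERM_STATUS_READOUT(therm_status);
	return (true);
}
```

For example, a Tj Max of 100 and a readout of 57 would correspond to a core temperature of 43 degrees C.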
pchtemp driver
The pchtemp driver is a temperature sensor driver for the Intel
platform controller hub (PCH). The driver supports most Intel CPUs since
the Haswell generation, as the format of the sensor changed starting
with the Haswell-era chipsets.
This driver is much simpler than the other ones. The PCH exposes a
dedicated pseudo-PCI device for this purpose. The pchtemp driver
simply attaches to that and reads the temperature when required. Unlike
the coretemp driver, the offset in temperature is the same across all
of the currently supported platforms so we don’t need to encode anything
special there like we do for the coretemp driver.
amdf17nbdf driver
The amdf17nbdf driver is a bit of a mouthful. It stands for the AMD
Family 17h North Bridge and Data Fabric driver. Let’s take that apart
for a moment. On x86, CPUs are broken into different families to
represent different generations. Currently, all of AMD’s Ryzen/EPYC
processors that are based on the Zen microarchitecture are grouped
under Family 17h. The North Bridge is a part of the CPU that is used to
interface with various I/O components on the system. The Data Fabric is
a part of AMD CPUs which connects CPUs, I/O devices, and DRAM.
On AMD Zen family processors, the temperature sensor exists on a per-die basis. Each die is a group of cores. The physical chip has a number of such units, each of which in the original AMD Naples/Zen 1 processor has up to 8 cores. See the illumos cpuid.c big theory statement for the details on how the CPU is constructed and this terminology. Effectively, you can think of it as a number of different temperature sensors, one for each discrete group of cores.
To talk to the temperature sensor, we need to send a message on what AMD calls the 'system management network' (SMN). The SMN is used to connect different management devices together and serves a number of purposes beyond just temperature sensors. The operating system has a dedicated means of communicating and making requests over the SMN by using the corresponding north bridge, which is exposed as a PCI device.
Just as the coretemp driver needs to issue the rdmsr instruction on
the core whose temperature it wants, we need to do something similar
here. Each die has its own north bridge and therefore we need to use
the right instance to talk to the right group of CPUs.
The wrinkle with all of this is that the north bridge on its own doesn’t give us enough information to map it back to a group of cores that an operator sees. This is critical, since if you can’t tell which cores you’re getting the temperature reading for, it immediately becomes much less useful.
This is where the data fabric devices come into play. The data fabric devices exist at a rather specific PCI bus, device, and function. They are always defined to be on PCI bus 0. The data fabric device for the first die is always defined to be at device 18h. The next one is at 19h, and so on. This means that we have a deterministic way to map between a data fabric device and a group of cores. Now, that’s not enough on its own. While we know the data fabric, we don’t know how to map that to the north bridge.
Each north bridge in the system is always defined to be on its own PCI bus. The device we care about is always going to be device and function 0. The data fabric happens to have a register which defines for us the starting PCI bus for its corresponding north bridge. This means that we have a path now to get to the temperature sensor. For a given die, we can find the corresponding data fabric. From the data fabric, we can find the corresponding north bridge. And finally, from the north bridge, we can find the corresponding SMN (system management network) that we can communicate with.
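The path from die to north bridge can be sketched in a few lines. This is only an illustration of the mapping described above, not the driver's code. The type and function names are hypothetical, and the starting bus number is assumed to have already been read from the data fabric register; how that read happens is left out.

```c
#include <stdint.h>

/* A PCI bus/device/function triple. */
typedef struct pci_bdf {
	uint8_t	bus;
	uint8_t	dev;
	uint8_t	func;
} pci_bdf_t;

/* Data fabric devices sit on bus 0: die 0 is device 18h, die 1 is 19h... */
#define	DF_FIRST_DEV	0x18

/* Hypothetical helper: the PCI device number of a die's data fabric. */
uint8_t
df_device_for_die(uint8_t die)
{
	return ((uint8_t)(DF_FIRST_DEV + die));
}

/*
 * Hypothetical helper: locate a die's north bridge. 'nb_start_bus' is the
 * starting bus number that was read out of that die's data fabric register.
 * The north bridge is always device 0, function 0 on that bus.
 */
pci_bdf_t
nb_for_start_bus(uint8_t nb_start_bus)
{
	pci_bdf_t bdf = { nb_start_bus, 0, 0 };

	return (bdf);
}
```

So for die 3, one would look at device 1Bh on bus 0, read the starting bus from it, and then talk to device 0, function 0 on that bus to reach the SMN.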
With all that, there’s one more wrinkle. On some processors, most notably the Ryzen and ThreadRipper families, the temperature that we read has an offset encoded with it. Unfortunately, these offsets have only been documented in blog posts by AMD and not in the formal documentation. Still, it’s not too hard to take this into account once official documentation becomes available.
While our original focus was on adding support for AMD’s most recent processors, if you have an older AMD processor and are interested in wiring up the temperature sensors on it, please get in touch and we can work together to implement something.
sys/sensors.h
Now that we have drivers that know how to read this information, the next problem we need to tackle is how do we expose this information to user land. In illumos, the most common way is often some kind of structured data that can be retrieved by an ioctl on a character device, or some other mechanism like a kernel statistic.
After talking with some folks, we put together a starting point for a
way for the kernel to expose sets of statistics and created a new header
file in illumos called sys/sensors.h. This header file isn’t currently
shipped and is only used by software in illumos itself. This makes it
easy for us to experiment with the API and change it without worrying
about breaking user software. Right now, each of the above drivers
implements a specific type of character device that implements the same,
generic interface.
The current interface supports two different types of commands. The
first, SENSOR_IOCTL_TYPE, answers the question of what kind of sensor
this is. Right now, the only valid answer is SENSOR_KIND_TEMPERATURE.
The idea is that if we have other sensors, say voltage, current, or
something else, we could return a different kind. Each kind, in turn,
promises to implement the same interface and information.
For temperature devices, we need to fill in a singular structure which is used to retrieve the temperature. This structure currently looks something like:
typedef struct sensor_ioctl_temperature {
	uint32_t sit_unit;
	int32_t sit_gran;
	int64_t sit_temp;
} sensor_ioctl_temperature_t;
This is kind of interesting and incorporates some ideas from Joshua
Clulow and Alex Wilson. The sit_unit member is used to describe what
unit the temperature is in. For example, it may be in Celsius, Kelvin,
or Fahrenheit.
The next two members are a little more interesting. The sit_temp
member contains a temperature reading, while the sit_gran member is
what’s important in how we interpret that temperature. While many
sensors end up communicating to digital consumers using a power of 2
based reading, that’s not always the case. Some sensors may report a
reading in units such as 1/10th of a degree. Others may actually report
something in a granularity of 2 degrees!
To try and deal with this menagerie, the sit_gran member indicates the
number of increments per degree in the sit_temp member. If this were
set to 10, then that would mean that sit_temp was in 10ths of a degree
and to get the actual value in degrees, one would need to divide by 10.
On the other hand, a negative value instead means that we would need to
multiply. So, a value of -2 would mean that sit_temp was in units of 2
degrees. To get the actual temperature, you would need to multiply
sit_temp by 2.
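The conversion that user land has to perform can be sketched as follows. This is a minimal illustration of the rule described above, not the code that FMA uses. The function name is made up, and treating a granularity of 0 as an error is an assumption of this sketch rather than something the interface promises.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a raw reading (sit_temp) and its
 * granularity (sit_gran) into degrees in whatever unit sit_unit names.
 */
bool
sensor_temp_to_degrees(int64_t temp, int32_t gran, double *degrees)
{
	/* Assumption: a granularity of 0 is meaningless, so reject it. */
	if (gran == 0)
		return (false);

	if (gran > 0) {
		/* 'gran' increments per degree: divide. */
		*degrees = (double)temp / (double)gran;
	} else {
		/* Each increment is '-gran' degrees: multiply. */
		*degrees = (double)temp * -(double)gran;
	}
	return (true);
}
```

With this, a reading of 435 at a granularity of 10 comes out as 43.5 degrees, while a reading of 23 at a granularity of -2 comes out as 46 degrees. All of the floating point math stays in user land.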
Now, you may ask why not just have the kernel compute this and return a ones digit and a decimal portion. The main goal is to avoid floating point math in the kernel. For various reasons, this has historically been avoided and we’d rather keep it that way. While this may seem a little weird, it does allow the driver to do something simpler and lets user land figure out how to transform this into a value that makes semantic sense for it. This gets the kernel out of trying to play the 'how many digits after the decimal point would you like' game.
Exposing Devices
The second part of this bit of kernel support is to try and provide a
uniform and easy way to see these different things under /dev in the
file system. In illumos, when a device driver creates a minor node, it
has to specify a type of device and a name for the minor node. While
most of the devices in the system use a standard type, we added a new
class of types for sensors that translate roughly into where you’ll
find them.
So, for example, the CPU drivers use the class that has the string
"ddi_sensor:temperature:cpu" (usually as the macro
DDI_NT_SENSOR_TEMP_CPU). This is used to indicate that it is a
temperature sensor for CPUs. The coretemp driver then creates
different nodes for each core and socket. For example, on a system with
an Intel E3-1270v3 (Haswell), we see the following devices:
[root@haswell ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/chip0
/dev/sensors/temperature/cpu/chip0.core0
/dev/sensors/temperature/cpu/chip0.core1
/dev/sensors/temperature/cpu/chip0.core2
/dev/sensors/temperature/cpu/chip0.core3
/dev/sensors/temperature/pch
/dev/sensors/temperature/pch/ts.0
On the other hand, on a system with two AMD EPYC 7601 processors, we see:
[root@odyssey ~]# find /dev/sensors/
/dev/sensors/
/dev/sensors/temperature
/dev/sensors/temperature/cpu
/dev/sensors/temperature/cpu/procnode.0
/dev/sensors/temperature/cpu/procnode.1
/dev/sensors/temperature/cpu/procnode.2
/dev/sensors/temperature/cpu/procnode.3
/dev/sensors/temperature/cpu/procnode.4
/dev/sensors/temperature/cpu/procnode.5
/dev/sensors/temperature/cpu/procnode.6
/dev/sensors/temperature/cpu/procnode.7
The nice thing about the current scheme is that anything of type
ddi_sensor will have a directory hierarchy created for it based on the
different interspersed : characters. This makes it very easy for us to
experiment with different kinds of sensors without having to go through
too much effort. That said, this is all still evolving, so there’s no
promise that this won’t change. Please don’t write code that relies on
this. If you do, it’ll likely break!
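To illustrate how the node type turns into a path, here is a small sketch of the mapping. It is not how the system actually creates the links. The function name is hypothetical, and it only mimics the observable result: the ddi_sensor prefix becomes /dev/sensors and each remaining : becomes a directory separator.

```c
#include <stdio.h>
#include <string.h>

#define	SENSOR_PREFIX	"ddi_sensor:"
#define	SENSOR_ROOT	"/dev/sensors/"

/*
 * Hypothetical helper: build the expected /dev path for a sensor minor
 * node. Returns 0 on success, -1 if the type is not a sensor type or the
 * buffer is too small.
 */
int
sensor_dev_path(const char *node_type, const char *minor, char *buf,
    size_t len)
{
	size_t plen = strlen(SENSOR_PREFIX);
	size_t start = strlen(SENSOR_ROOT);
	size_t end;
	int n;

	if (strncmp(node_type, SENSOR_PREFIX, plen) != 0)
		return (-1);

	n = snprintf(buf, len, "%s%s/%s", SENSOR_ROOT, node_type + plen,
	    minor);
	if (n < 0 || (size_t)n >= len)
		return (-1);

	/* Only rewrite ':' inside the type portion, not the minor name. */
	end = start + strlen(node_type + plen);
	for (size_t i = start; i < end; i++) {
		if (buf[i] == ':')
			buf[i] = '/';
	}
	return (0);
}
```

Given the type string "ddi_sensor:temperature:cpu" and a minor name of chip0.core0, this produces /dev/sensors/temperature/cpu/chip0.core0, matching the Haswell listing above.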
FMA Topology
The last piece of this was to wire it up in FMA’s topology. To do that, I did a few different pieces. The first was to make it easy to add a node to the topology that represents a sensor backed by this kernel interface. There’s one generic implementation of that which is parametrized by the path.
With that, I first modified the CPU enumerator. The logic will use a core sensor if available, but can also fall back to a processor-node sensor if it exists. Next, I added a new piece of enumeration, which was to represent the chipset. If we have a temperature sensor, then we’ll enumerate the chipset under the motherboard. While this is just the first item there, I suspect we’ll add more over time as we try to properly capture more information about what it’s doing, the firmware revisions that are a part of it, and more.
This piece is, in some ways, the simplest of them all. It just builds on everything else that was already built up. FMA already had a notion of a sensor (which is used for everything from disk temperature to the fan RPM), so this was just a simple matter of wiring things up.
Now, we have all of the different pieces that made the original example of the CPU and PCH temperature sensor work.
Further Reading
If you’re interested in learning more about this, you can find more information in the following resources:
- The Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3 (3A, 3B, 3C & 3D): System Programming Guide discusses the Intel CPU temperature and socket sensors.
- The Intel 300 Series Chipset Volume 1 and Intel 300 Series Chipset Volume 2 discuss the Intel platform controller hub interface.
- The Open Source Register Reference for AMD Family 17h Processors Models 00h-2Fh provides a bit more information about the AMD sensors.
In addition, you can find theory statements that describe the purpose of the different drivers and other pieces that were discussed earlier and their manual pages:
Finally, if you want to see the actual original commits that integrated these changes, then you can find the following from illumos-gate:
commit dc90e12310982077796c5117ebfe92ee04b370a3
Author: Robert Mustacchi <rm@joyent.com>
Date: Wed Apr 24 03:05:13 2019 +0000
11273 Want Intel PCH temperature sensor
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Mike Zeller <mike.zeller@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Gergő Doma <domag02@gmail.com>
Reviewed by: Paul Winder <Paul.Winder@wdc.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
commit f2dbfd322ec9cd157a6e2cd8a53569e718a4b0af
Author: Robert Mustacchi <rm@joyent.com>
Date: Sun Jun 2 14:55:56 2019 +0000
11184 Want CPU Temperature Sensors
11185 i86pc chip module should be smatch clean
Reviewed by: Hans Rosenfeld <hans.rosenfeld@joyent.com>
Reviewed by: Jordan Hendricks <jordan.hendricks@joyent.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Looking Ahead
While this is focused on a few useful temperature sensors, there are more that we’d like to add. If you have a favorite sensor that you’d like to see, please reach out to me or the broader illumos community and we’d be happy to take a look at it.
Another avenue that we’re looking to explore is having a standardized sensor driver such that a device driver doesn’t necessarily have to have its own character device or implementation of the ioctl handling.
Finally, I’d like to thank all those who helped me as we discussed different aspects of the API, reviewed the work, and tested it. None of this happens in a vacuum.