When debugging, it usually makes sense to start at a high level in the system and work your way down, isolating problems and getting more specific along the way. However, sometimes you find that you’re spending all your time on CPU and the OS is doing its best to get out of the way. To make it easier to understand what the CPU is doing, many CPU vendors have added support for what they call performance monitoring counters.

These counters cover lots of different things on the CPU. For example:

  • The number of instructions executed

  • Information related to the TLBs (translation lookaside buffers) and caches

  • Information about branches

  • Information about floating point units

There is a true wealth of information here. Of course, making it easily accessible to software can be a bit more of a challenge. First we’ll look briefly at how these work from a software perspective and then we’ll talk about what we’ve done to make them a bit easier. This article is focused on x86 Intel and AMD CPUs. If you’re using ARM, RISC-V, SPARC, MIPS, or other CPUs, then while the principles may be the same, the actual implementation is quite different.

Using CPU Performance Counters

Each CPU supports a different set of counters. A counter represents a measurable thing on the CPU. For example, one can measure the total number of cycles, uops, or retired conditional branch instructions. These counters are broken into two groups:

  1. Architectural Counters

  2. Micro-architectural Counters

Anything that’s in the first group is part of the standard instruction set architecture (ISA) that the CPU supports. This means that the counters will be the same between CPU generations. However, the vast majority of counters fall into the second group: micro-architectural. These deal with the actual design and implementation of the processor and therefore change from processor to processor. Practically this means that the counters and their meaning change from generation to generation. What existed and was used on a Haswell processor may not resemble a Cascade Lake processor at all. And the only guarantee one can make is that most everything is different between AMD and Intel CPUs.

While there are hundreds of counters, only a few of them can be activated at any given time. On x86 CPUs you need to associate a counter with a specific unit. Each CPU core has a limited number of units, generally called performance monitoring units. On Intel systems, you generally only get four of them per thread and on AMD Ryzen/EPYC CPUs you get six! So we’ve gone from hundreds of counters to really only having 4-6 active ones at any given time.

Because these are a finite resource, the operating system usually virtualizes them to some degree and creates an abstraction for enabling and controlling them. The exact way that this looks can vary depending on the operating system, but generally there are tools that are part of the OS. For example, on illumos you can use cpustat(1M) or even DTrace’s CPC provider. On Linux, tools such as perf can be used to access the counters. There are also various tools on Windows, as well as tools that Intel itself writes.

On both Intel and AMD, the performance monitor counters are managed with MSRs (model-specific registers). In essence, you write the ID number of a counter you care about and then read back the values at some time later. If you write an invalid counter ID to a register, that generally results in a #GP, the x86 general protection fault, which is often used as a catch-all exception. There are different strategies for figuring out how to handle this fact.
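As a rough illustration of what “writing the ID” means on Intel, the sketch below packs an event code and unit mask into the bit layout of the architectural IA32_PERFEVTSELx registers. The helper names `intel_evtsel` and `intel_msrs` and their defaults are invented for this example and are not taken from illumos. Actually writing the value requires kernel privileges, so this only computes it.

```python
# Bit positions in Intel's architectural IA32_PERFEVTSELx MSRs.
EVTSEL_USR = 1 << 16   # count while in user mode
EVTSEL_OS = 1 << 17    # count while in kernel mode
EVTSEL_EDGE = 1 << 18  # edge detect
EVTSEL_EN = 1 << 22    # enable the counter
EVTSEL_INV = 1 << 23   # invert the counter-mask comparison

IA32_PERFEVTSEL0 = 0x186  # first event-select MSR
IA32_PMC0 = 0xC1          # first general-purpose counter MSR


def intel_evtsel(event, umask, cmask=0, user=True, kernel=True,
                 edge=False, invert=False):
    """Return the value to write to an event-select MSR (hypothetical helper)."""
    for field in (event, umask, cmask):
        if not 0 <= field <= 0xFF:
            raise ValueError("event, umask and cmask are 8-bit fields")
    value = event | (umask << 8) | (cmask << 24) | EVTSEL_EN
    if user:
        value |= EVTSEL_USR
    if kernel:
        value |= EVTSEL_OS
    if edge:
        value |= EVTSEL_EDGE
    if invert:
        value |= EVTSEL_INV
    return value


def intel_msrs(counter):
    """Return (event-select MSR, counter MSR) for a general-purpose counter."""
    return IA32_PERFEVTSEL0 + counter, IA32_PMC0 + counter


# Event 0xC0 with unit mask 0x00 counts retired instructions.
print(hex(intel_evtsel(0xC0, 0x00)))  # 0x4300c0
```

The kernel would write that value to the event-select MSR for the chosen counter and later read the paired counter MSR to see how many events occurred.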

In illumos, the historical approach to dealing with this was to encode the list of performance counters that users can use and deal with mapping from that name to a counter ID itself. This list was encoded in the drivers for the various systems. While you can easily imagine other approaches to take here, at the end of the day you need some way to map between a given CPU generation, the list of counters it supports, and information for users as to what these counters actually do.

Structured Data

Originally, and still to this day, the data about what counters a given CPU generation supports have been encoded in the various architecture manuals for the CPUs. Intel has whole chapters of the canonical Software Developer Manual dedicated to listing out the counters. The same is true at AMD’s end as well. As you might imagine, this then leads to folks copying these tables over, introducing errors in the process, and as updates to the manuals get issued, it being difficult to determine what has changed and update it accordingly.

When I was first looking at this, we hadn’t updated the performance counters in illumos in quite some time. I debated trying to go through the manuals and copy and paste all of the different tables, noting what was architectural, and what wasn’t, but it was a rather tedious problem. This almost stymied the entire effort, but while looking around, I found that there was actually a better path and one that would cover a lot more processors than those I cared about.

As part of various efforts, Intel has started publishing machine-parseable descriptions of this data as a part of their perfmon suite. This means that there are tables that describe, on a per-CPU family and model basis, the set of supported performance counters. Each entry includes a description and programming information. The nice thing about this data is that it covers all Intel CPUs that have come out since Intel’s Nehalem processor, including the Atom and Knights (Xeon Phi) family processors.

This data, provided as either JSON blobs or as tab-separated values, is used not only by various Intel tools but is also included as part of the Linux kernel’s perf program. Seeing it used there gave me the idea of using this same data from Intel to autogenerate not only the kernel pieces needed, but also manual pages with all of Intel’s descriptions of the events. Here’s an example of one of the JSON blobs for an event (I’ve manually wrapped the 'PublicDescription' field):

  {
    "EventCode": "0xC0",
    "UMask": "0x00",
    "EventName": "INST_RETIRED.ANY_P",
    "BriefDescription": "Number of instructions retired. General Counter   - architectural event",
    "PublicDescription": "This event counts the number of instructions (EOMs)
	retired. Counting covers macro-fused instructions individually (that is,
	increments by two).",
    "Counter": "0,1,2,3",
    "CounterHTOff": "0,1,2,3,4,5,6,7",
    "SampleAfterValue": "2000003",
    "MSRIndex": "0",
    "MSRValue": "0",
    "TakenAlone": "0",
    "CounterMask": "0",
    "Invert": "0",
    "AnyThread": "0",
    "EdgeDetect": "0",
    "PEBS": "0",
    "Data_LA": "0",
    "L1_Hit_Indication": "0",
    "Errata": "BDM61",
    "Offcore": "0"
  }

There’s a bunch of detail here. The things that we care about include the 'EventCode' and 'UMask', which start to tell us how to program this counter. The name and description fields are helpful too. The Intel name becomes the name that the OS exposes, and the description is used to generate the manual pages. Some of the information, such as the counter fields, tells us which performance counter units we can program a particular event into. Unfortunately, some events can only be programmed into particular counters, and these fields tell us which ones.
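To make those fields concrete, here is a minimal sketch that parses a trimmed copy of the entry above and places events onto counters while respecting the 'Counter' constraint. It is not how cpcgen or the illumos kernel does it. The function names and the second event (`HYPOTHETICAL.ONLY_CTR0`) are made up for illustration. The placement is a simple greedy pass, which can fail in cases where a full matching would succeed.

```python
import json

# Trimmed copy of the entry shown above; only the fields used here.
EVENT_JSON = """
{
  "EventCode": "0xC0",
  "UMask": "0x00",
  "EventName": "INST_RETIRED.ANY_P",
  "Counter": "0,1,2,3"
}
"""


def parse_event(blob):
    """Turn the string-typed JSON fields into integers and a counter set."""
    raw = json.loads(blob)
    return {
        "name": raw["EventName"],
        "code": int(raw["EventCode"], 16),
        "umask": int(raw["UMask"], 16),
        "counters": {int(c) for c in raw["Counter"].split(",")},
    }


def assign_counters(events, ncounters=4):
    """Greedily place events, most constrained first (hypothetical helper)."""
    free = set(range(ncounters))
    placement = {}
    for ev in sorted(events, key=lambda e: len(e["counters"])):
        usable = sorted(ev["counters"] & free)
        if not usable:
            raise ValueError("no free counter for " + ev["name"])
        placement[ev["name"]] = usable[0]
        free.remove(usable[0])
    return placement


inst_retired = parse_event(EVENT_JSON)
# An invented event that may only live in counter 0.
picky = {"name": "HYPOTHETICAL.ONLY_CTR0", "code": 0x00, "umask": 0x00,
         "counters": {0}}
print(assign_counters([inst_retired, picky]))
```

Placing the most constrained event first means the flexible one (INST_RETIRED.ANY_P) gets bumped to counter 1 rather than blocking counter 0.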

cpcgen

To put this all together, I wrote a tool called cpcgen which takes as input the CPU performance counter information that we get from Intel and outputs the following:

  1. Per-processor model lists of supported counters

  2. A mapping from a given CPU family, model, and stepping to one of those lists

  3. Manual pages for each of the counter lists, generated from the description information
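The shape of those three outputs can be sketched in a few lines. This is only an illustration under invented assumptions: the struct layout, the list name "example", the (family, model) key, and the function names are all placeholders and do not reflect what cpcgen actually emits.

```python
# Hypothetical input: events already parsed out of a perfmon-style JSON file.
EVENTS = [
    {"name": "INST_RETIRED.ANY_P", "code": 0xC0, "umask": 0x00,
     "desc": "Number of instructions retired."},
]

# Placeholder mapping from (family, model) to a list name.
MODEL_MAP = {(6, 0x3D): "example"}


def c_table(list_name, events):
    """Emit a C array of events using an invented struct layout."""
    lines = ["static const struct event_entry %s_events[] = {" % list_name]
    for ev in events:
        lines.append('\t{ "%s", 0x%02x, 0x%02x },'
                     % (ev["name"], ev["code"], ev["umask"]))
    lines.append("\t{ NULL, 0, 0 }")
    lines.append("};")
    return "\n".join(lines)


def man_entries(events):
    """Emit one mdoc-style list item per event, using its description."""
    out = []
    for ev in events:
        out.append(".It Sy %s" % ev["name"])
        out.append(ev["desc"])
    return "\n".join(out)


def list_for_cpu(family, model):
    """Return the list name for a CPU, or None if it is unknown."""
    return MODEL_MAP.get((family, model))


print(c_table(list_for_cpu(6, 0x3D), EVENTS))
print(man_entries(EVENTS))
```

The point of generating both from the same input is that the kernel tables and the documentation cannot drift apart.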

I put this together and integrated it into illumos. We went from having CPU performance counter information for practically none of Intel’s recent platforms to having it for all of them. When I later updated it for Cascade Lake, having the tool sped up the process dramatically. It also means that as information about Cooper Lake and Ice Lake filters out, it will be relatively easy to keep the system up to date: all we have to do is update the perfmon data that we get from Intel. While it doesn’t get us out of needing to update information, it does take the bulk of the tedious parts out of it.

AMD Support

While cpcgen started off as mainly a way to consume Intel’s perfmon data, when I was looking at doing updates for AMD’s Zen 1 processors I asked myself, why not put this together for them? While AMD doesn’t have anything like this, I decided I would go ahead and take the information from the manuals and start assembling it into JSON files that have a similar look and feel. The data looks a little different from Intel’s, as the actual format is different. Here’s what one of the data entries looks like for measuring retired instructions:

{
        "mnemonic": "Core::X86::Pmc::Core::ExRetInstr",
        "name": "ExRetInstr",
        "code": "0x0C0",
        "summary": "Retired Instructions"
}

Here we have a short name, the mnemonic that AMD uses for the event, a summary, and the code that we need to program in order to use it. Other examples have more detailed descriptions and some have variants as well.
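The three hex digits in the code field reflect that AMD event selects are 12 bits wide. As a hedged sketch, based on the PERF_CTL layout in AMD’s public processor documentation as best understood here, the low eight bits go in bits 7:0 of the control register and the upper four go in bits 35:32. The function name `amd_perf_ctl` is invented for this example, and the second code below (0x1C0) is made up purely to show the split.

```python
PERF_CTL_USR = 1 << 16  # count in user mode
PERF_CTL_OS = 1 << 17   # count in kernel mode
PERF_CTL_EN = 1 << 22   # enable the counter


def amd_perf_ctl(code, unit_mask=0, user=True, kernel=True):
    """Return a PERF_CTL value for a 12-bit event code (hypothetical helper)."""
    if not 0 <= code <= 0xFFF:
        raise ValueError("event code is a 12-bit field")
    if not 0 <= unit_mask <= 0xFF:
        raise ValueError("unit mask is an 8-bit field")
    value = code & 0xFF            # EventSelect[7:0] in bits 7:0
    value |= (code >> 8) << 32     # EventSelect[11:8] in bits 35:32
    value |= unit_mask << 8        # unit mask in bits 15:8
    value |= PERF_CTL_EN
    if user:
        value |= PERF_CTL_USR
    if kernel:
        value |= PERF_CTL_OS
    return value


# The ExRetInstr entry above: "code": "0x0C0".
print(hex(amd_perf_ctl(int("0x0C0", 16))))  # 0x4300c0
# A made-up code with a non-zero upper nibble, to show the split.
print(hex(amd_perf_ctl(0x1C0)))             # 0x1004300c0
```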

Ultimately, this all gets wired up in an implementation that looks like the Intel one, with a few minor variations. The same tool, cpcgen, handles both Intel and AMD data files. As AMD comes out with more public documentation on the performance counters for Zen 2 chips, we’ll go ahead and update things at our end. My hope is that over time, AMD will adopt a process of having structured data. If this can be used as a starting point for it, all the better. Because we’re manually maintaining this, information for older AMD CPUs didn’t come along for free. I only did the work for those that I had access to at the time.

One part of updating the AMD side for Zen (family 17h) was that I needed to update the kernel implementation as well. As part of AMD’s Bulldozer (Family 15h) processor line, they added new features which extended the number of performance monitoring counters on the system and added new MSRs for them. If we find that the CPU supports the extended counter set (indicated by a CPUID bit), then we’ll go ahead and tell the system that we support the 6 counters (rather than the standard 4) and use the new MSR addresses for the counters. Thankfully, aside from the new counter MSR addresses and the increased number of them, using them is pretty much the same on Zen family processors.
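The selection logic described here can be sketched as follows. The MSR addresses and the CPUID bit (Fn8000_0001 ECX bit 23, PerfCtrExtCore) come from AMD’s public documentation as best recalled here and should be checked against the manual for the processor in question. The function name is invented, and this is not the illumos implementation.

```python
PERFCTR_EXT_CORE = 1 << 23    # CPUID Fn8000_0001 ECX: extended core counters

LEGACY_CTL_BASE = 0xC0010000  # four control MSRs, contiguous
LEGACY_CTR_BASE = 0xC0010004  # four counter MSRs, contiguous
EXT_CTL_BASE = 0xC0010200     # six control MSRs, interleaved with counters
EXT_CTR_BASE = 0xC0010201     # six counter MSRs, interleaved with controls


def amd_counter_msrs(cpuid_8000_0001_ecx):
    """Return a list of (control MSR, counter MSR) pairs for this CPU."""
    if cpuid_8000_0001_ecx & PERFCTR_EXT_CORE:
        # Extended set: control and counter MSRs alternate, so stride by 2.
        return [(EXT_CTL_BASE + 2 * i, EXT_CTR_BASE + 2 * i)
                for i in range(6)]
    # Legacy set: four controls followed by four counters.
    return [(LEGACY_CTL_BASE + i, LEGACY_CTR_BASE + i) for i in range(4)]


for ctl, ctr in amd_counter_msrs(PERFCTR_EXT_CORE):
    print(hex(ctl), hex(ctr))
```

The value written to each control MSR has the same layout in both cases; only the addresses and the number of counters differ.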

What’s Next?

The cpcgen tool makes it easier for us to manage and update our CPU performance counter backends in illumos on x86. There is plenty more we can do from here:

  • Updates for the most recent generations of Intel and AMD processors.

  • Adding support for additional types of counters from the Intel perfmon JSON payloads.

  • Improving the tooling that consumes performance counters.

Here are some additional pointers to the code, in case you’d like to read more:

If you’d like to get involved, get in touch with the illumos community on IRC in #illumos on Freenode or a mailing list and I or someone else will help you out and see what we can do. As long as you’re willing to learn, receive feedback, and keep going despite difficulties, then it doesn’t matter what your experience is.

