One of the stories that has stuck with me over the years came from a support case that a former colleague, Ryan Nelson, had point on. At Joyent, we had third parties running our cloud orchestration software in their own data centers with hardware that they had acquired and assembled themselves. In this particular episode, Ryan was diagnosing a case where a customer was complaining about the fact that networking wasn’t working for them. The operating system saw the link as down, but the customer insisted it was plugged into the switch and that a transceiver was plugged in. Eventually, Ryan asked them to take a picture of the back of the server, which is where the NIC (Network Interface Card) would be visible. It turned out that the transceiver looked like it had been run over by a truck and had been jammed in — it didn’t matter what NIC it was plugged into, it was never going to work.
As part of a broader push on datacenter management, I was thinking about this story and some questions that had often come up in the field regarding why the NIC said the link was down. These were:
-
Was there actually a transceiver plugged into the NIC?
-
If so, did the NIC actually support using this transceiver?
Now, the second question is a bit of a funny one. The NIC obviously knows whether or not it can use what’s plugged in, but almost every time, the system doesn’t actually make it easy to find out. A lot of NIC drivers will emit a message that goes to a system log when the transceiver is plugged in or the NIC driver first attaches, but if you’re not looking for that message or just don’t happen to be on the system’s console when that happens, suddenly you’re out of luck. You might also ask why are there transceivers that aren’t supported by a NIC, but that’s a real can of worms.
Anyways, with that all in mind, I set out on a bit of a journey and put together some more concrete proposals for what to do here in terms of RFD 89: Project Tiresias. We’ll spend the rest of this entry going into a bit of background on transceivers and then discuss how we go from knowing whether or not they’re plugged in to actually determining who made them and where they are in the system.
What is a transceiver?
We’ve been using the term 'transceiver' quite a bit so far, but that’s a pretty generic term. Let’s spend a bit of time talking through that. First, we’re really focused on transceivers as used in the context of networking. When people think of wired networking, the most common thing that comes to mind is Ethernet cables. Ethernet isn’t the only type of cable that’s been used. Before Ethernet was common, BNC coaxial cables were used on some NICs as well.
However, in the data center, Ethernet didn’t end up keeping up with the speeds and distances that connections were being used for (though 10 Gigabit Ethernet, 10GBASE-T, has started becoming more common). In this space, fiber-optic cables and copper twinaxial cables (twinax) are much more prominent. Note, twinaxial cables are rather different from their BNC coaxial relatives. Twinaxial cables are used more often when there are shorter distances to cover, such as between a top of rack switch and a server. Fiber-optic cables often cover longer distances or have higher throughputs.
Because different types of cables are used in different situations, several vendors got together and agreed upon a set of standards to use when manufacturing these cables. This allowed NIC manufacturers to design a single physical and electrical interface, but still support different types of transceivers. These standards (technically a multi-source agreement) are maintained by the Small Form Factor (SFF) Committee. The committee manages standards not only for networking, but also for SAS cables and other devices.
If you’ve worked in this space, you may have heard of what are called SFP and SFP+ cables. These cables generally support 1 and 10 Gigabit networking respectively. The transceiver is controlled over an i2c bus by the NIC. The addresses and their meanings are standardized. They were originally standardized in a standard called INF-8074, but the current active standard for these devices is called SFF-8472.
With faster networking speeds, there have been additional revisions and standards put out. Devices that support 40 Gigabit networking are called QSFP+ because they combine 4 SFP+ devices. To support 25 Gigabit networking, a variant of SFP+ was created called SFP28. Finally, to support 100 Gigabit networking, they combined 4 SFP28 devices together. The 40 Gigabit devices are standardized in SFF-8436 and the 100 Gigabit devices have their management interface described in SFF-8636.
The standards for the various devices have somewhat similar layouts. They break the data into a series of pages, and a specific offset within a page can then be accessed via the NIC’s i2c bus. These pages contain some of the following information:
-
Control over the device and its configuration
-
Static manufacturing information such as the manufacturer’s name, the device’s name, the serial number, and more
-
Optional information about the health of the device such as the temperature, voltage, and more
The pages and addresses change from specification to specification, though a large amount of the data overlaps between them. The health information of the device is required when the connector is considered active (generally fiber-optic cables with lasers) and is optional when you have a passive device (such as a copper twinax cable).
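To make the page-and-offset idea concrete, here is a minimal Python sketch that decodes a few static fields from the first SFP page (0xa0). The offsets and unit scalings reflect one reading of SFF-8472 and should be checked against the specification. The function and key names are made up for illustration, and the sample bytes are synthetic rather than a dump from real hardware.

```python
import struct

# Offsets into SFP page 0xa0, per one reading of SFF-8472 (an assumption).
IDENTIFIER = 0
BR_NOMINAL = 12              # units of 100 MBd
LEN_OM3 = 19                 # units of 10 m
VENDOR = slice(20, 36)       # ASCII, space padded
PART = slice(40, 56)         # ASCII, space padded
WAVELENGTH = slice(60, 62)   # big-endian, nm


def decode_a0(page: bytes) -> dict:
    """Decode a handful of static fields from a 256-byte 0xa0 page."""
    if len(page) != 256:
        raise ValueError("expected a full 256-byte page")
    (wavelength,) = struct.unpack(">H", page[WAVELENGTH])
    return {
        "identifier": page[IDENTIFIER],
        "br_nominal_mbd": page[BR_NOMINAL] * 100,
        "length_om3_m": page[LEN_OM3] * 10,
        "vendor": page[VENDOR].decode("ascii").rstrip(),
        "part_number": page[PART].decode("ascii").rstrip(),
        "wavelength_nm": wavelength,
    }


def sample_page() -> bytes:
    """Build a synthetic page resembling an 850 nm 10G short-range optic."""
    page = bytearray(256)
    page[IDENTIFIER] = 0x03                      # SFP/SFP+/SFP28
    page[BR_NOMINAL] = 103
    page[LEN_OM3] = 30
    page[VENDOR] = b"Intel Corp".ljust(16)
    page[PART] = b"FTLX8571D3BCV-IT".ljust(16)
    page[WAVELENGTH] = struct.pack(">H", 850)
    return bytes(page)


if __name__ == "__main__":
    for key, value in decode_a0(sample_page()).items():
        print(f"{key}: {value}")
```

Keeping the offsets in named constants makes it easy to compare the table against the specification, which matters because the same field lives at a different address in each standard.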
The MAC Transceiver Capability
The first part of our adventure with getting to this data begins in the operating system kernel. Similar to the case of managing NIC LEDs, the networking device driver framework has an optional capability that a driver can implement to expose this information called MAC_CAPAB_TRANSCEIVER.
The transceiver capability has a structure that the device driver has to fill out which describes some basic information for dealing with transceivers. This includes the following fields:
-
The number of transceivers present on the device.
-
A function, mct_info(), that allows one to get basic information about the transceiver.
-
A function, mct_read(), that allows one to read i2c data from the device.
The driver first indicates the number of transceivers that are present for it. In general, this is one. However, some devices actually support combining multiple transceivers and ports into one logical device, though this isn’t commonly used. The next item, the mct_info() function, is used to answer the two questions that were posed at the beginning of this entry: does the NIC think a transceiver is present, and can the NIC use the transceiver? Finally, the mct_read() function allows us to go through and read specific regions of the memory map of the transceiver. Generally, user land reads an entire 256-byte page at a time.
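The division of labor described here can be modeled in a few lines of Python. This is a toy, not the illumos interface: `FakeTransceiver`, `info()`, `read()`, and `read_page()` are hypothetical stand-ins for what a driver and its user land consumer do, and the 128-byte cap on a single read is an assumption chosen only to show why a caller might loop.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 256


@dataclass
class FakeTransceiver:
    """Hypothetical stand-in for one transceiver slot behind a NIC driver."""
    present: bool = True
    usable: bool = True
    pages: dict = field(default_factory=dict)   # page number -> 256 bytes
    max_chunk: int = 128                        # assumed per-read limit

    def info(self):
        """Answer the two questions: is it there, and can the NIC use it?"""
        return self.present, self.present and self.usable

    def read(self, page: int, offset: int, nbytes: int) -> bytes:
        """Return up to nbytes from the page; may return fewer."""
        if not self.present:
            raise OSError("no transceiver present")
        data = self.pages[page]
        end = min(offset + min(nbytes, self.max_chunk), PAGE_SIZE)
        return data[offset:end]


def read_page(xcvr: FakeTransceiver, page: int) -> bytes:
    """Read a whole page, looping because a single read may be short."""
    buf = bytearray()
    while len(buf) < PAGE_SIZE:
        chunk = xcvr.read(page, len(buf), PAGE_SIZE - len(buf))
        if not chunk:
            raise OSError("short read from transceiver")
        buf += chunk
    return bytes(buf)


if __name__ == "__main__":
    xcvr = FakeTransceiver(pages={0xA0: bytes(range(256))})
    print(xcvr.info(), len(read_page(xcvr, 0xA0)))
```

Note that the model only moves bytes around. Nothing in it interprets the contents of a page, which mirrors the split between kernel and user land.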
The kernel only facilitates reading information. It generally doesn’t try and make semantic sense of any of the data. That is purposefully left to user land — unless there’s a good reason to parse binary data in the kernel, you’re usually better off not doing that.
The following device families and drivers support the MAC_CAPAB_TRANSCEIVER capability. Some drivers that you may be more familiar with such as 1 Gigabit Ethernet devices aren’t on this list because they don’t support transceivers. Supported devices include:
-
Broadcom NetXtreme II 10 Gb devices based on the bnxe driver.
-
Chelsio T4, T5, and T6 10, 25, 40, and 100 Gbit devices based on the cxgbe driver.
-
Intel X520 10 Gbit SFP+ devices based on the ixgbe driver.
-
Intel 10, 25, and 40 Gbit SFP+, SFP28, and QSFP+ devices based on the i40e driver.
-
QLogic FastLinQ QL45xxx devices based on the qede driver.
The way that each driver gets access to this information varies from device to device. Some, like i40e and cxgbe, issue a firmware command to read information from the transceiver. Others have dedicated registers that can be programmed to read arbitrary data from the i2c device.
libsff and dltraninfo
Once we have the ability to read the actual data from the transceiver, we have to make logical sense of it. To handle that, the first thing I did was write a small library called libsff. The goal of libsff is to parse the various SFF binary data payloads and return structured data as a set of name-value pairs.
If you look at the header file, libsff.h, you’ll see a list of different keys that can be looked up. Some of these are rather straightforward, such as the string "vendor", which has the name of the manufacturer of the transceiver. Others are a bit more opaque and require referencing the actual SFF documents. Another useful feature of the library is that it tries to abstract out the differences between different versions of the specifications. The goal is that when there is similar data, it should always be found under the same key, even if it is found in a wildly different part of the memory map or the way we have to parse it is different. The goal of libraries (or really any interface and abstraction) should be to take something grotty and transform it into something more usable, as though reality were as simple as it presents.
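One way to picture "same key, different place in the memory map" is a table-driven parser. The Python sketch below is not libsff. Apart from "vendor", the key names are illustrative, the offsets reflect one reading of SFF-8472 and SFF-8636, and it assumes the 256-byte QSFP buffer is the lower memory followed by upper page 00h. The `build()` helper exists only to make synthetic test data.

```python
# Identifier byte values, as understood from the SFF specifications.
SFP_IDS = {0x03}                # SFP/SFP+/SFP28
QSFP_IDS = {0x0C, 0x0D, 0x11}   # QSFP, QSFP+, QSFP28

# key -> (start, end) byte range within the 256-byte buffer.
# The key names are illustrative; they are not libsff's actual keys.
LAYOUTS = {
    "sff-8472": {"vendor": (20, 36), "part-number": (40, 56),
                 "revision": (56, 60), "serial-number": (68, 84)},
    "sff-8636": {"vendor": (148, 164), "part-number": (168, 184),
                 "revision": (184, 186), "serial-number": (196, 212)},
}


def pick_layout(buf: bytes) -> str:
    """Choose a layout from the identifier in byte 0."""
    if buf[0] in SFP_IDS:
        return "sff-8472"
    if buf[0] in QSFP_IDS:
        return "sff-8636"
    raise ValueError(f"unknown identifier 0x{buf[0]:02x}")


def parse(buf: bytes) -> dict:
    """Return the same keys no matter which layout the data uses."""
    layout = LAYOUTS[pick_layout(buf)]
    return {key: buf[start:end].decode("ascii", "replace").rstrip()
            for key, (start, end) in layout.items()}


def build(spec: str, ident: int, fields: dict) -> bytes:
    """Build a synthetic buffer for the demo; not real module data."""
    buf = bytearray(256)
    buf[0] = ident
    for key, value in fields.items():
        start, end = LAYOUTS[spec][key]
        width = end - start
        buf[start:end] = value.encode("ascii").ljust(width)[:width]
    return bytes(buf)


if __name__ == "__main__":
    fields = {"vendor": "Example Corp", "part-number": "PN-1234",
              "revision": "A", "serial-number": "SN0001"}
    print(parse(build("sff-8472", 0x03, fields)))
    print(parse(build("sff-8636", 0x0D, fields)))
```

Both calls print the same dictionary even though the bytes sit at different offsets, which is the property a consumer of such a library cares about.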
The one thing that the library doesn’t generally do today is parse all of the sensor data that may be available on the transceiver. The main reason for this is that the vast majority of transceivers that I had access to did not implement it. On SFP, SFP+, and SFP28 devices, sensor information is optional for twinax-based devices. With a few devices to test with, it would be pretty straightforward to add support for it though.
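For a sense of what sensor support might involve, here is a rough Python sketch that decodes two readings from an SFP's diagnostic page (0xa2). It assumes an internally calibrated device and uses offsets and scalings from one reading of SFF-8472 (temperature in units of 1/256 degree C, supply voltage in units of 100 microvolts). Externally calibrated devices need extra math that is not shown. The function name is hypothetical and none of this is part of libsff.

```python
import struct

# Offsets into page 0xa2, per one reading of SFF-8472 (an assumption).
TEMP_OFFSET = 96   # signed 16-bit, big-endian, units of 1/256 degree C
# The supply voltage follows immediately at offset 98:
# unsigned 16-bit, big-endian, units of 100 microvolts.


def decode_sensors(page_a2: bytes) -> dict:
    """Decode temperature and supply voltage from a 256-byte 0xa2 page."""
    if len(page_a2) != 256:
        raise ValueError("expected a full 256-byte page")
    raw_temp, raw_vcc = struct.unpack_from(">hH", page_a2, TEMP_OFFSET)
    return {
        "temperature_c": raw_temp / 256.0,
        "voltage_v": raw_vcc / 10000.0,
    }


if __name__ == "__main__":
    # Synthetic readings: 30.5 degrees C and 3.3 V.
    page = bytearray(256)
    struct.pack_into(">hH", page, TEMP_OFFSET, 0x1E80, 33000)
    print(decode_sensors(bytes(page)))
```

The temperature is unpacked as a signed value so that readings below zero come out negative rather than as very large positive numbers.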
On its own, a library isn’t useful unless it has a consumer. The first consumer that I’ll discuss is dltraninfo. This is an unstable development program that I wrote to exercise this functionality and to try and get a sense of what interfaces might be useful. There are two forms of the dltraninfo command. The first answers the questions that we laid out in the beginning about whether the transceiver is present or usable. When run this way, you see something like:
# /usr/lib/dl/dltraninfo
ixgbe0: discovered 1 transceiver
transceiver 0 present: yes
transceiver 0 usable: yes
ixgbe1: discovered 1 transceiver
transceiver 0 present: yes
transceiver 0 usable: yes
ixgbe2: discovered 1 transceiver
transceiver 0 present: yes
transceiver 0 usable: yes
ixgbe3: discovered 1 transceiver
transceiver 0 present: yes
transceiver 0 usable: yes
The next option is to read the information from the transceiver. Here’s an example of reading this on an Intel 10 Gbit fiber-optic transceiver:
# /usr/lib/dl/dltraninfo -v ixgbe1
ixgbe1: discovered 1 transceivers
transceiver 0 present: yes
transceiver 0 usable: yes
Identifier: 'SFP/SFP+/SFP28'
Extended Identifier: 4
Connector: 'LC (Lucent Connector)'
10G+ Ethernet Compliance Codes[0]: '10G Base-SR'
Ethernet Compliance Codes[0]: '1000BASE-SX'
Encoding: '64B/66B'
BR, nominal: '10300 MBd'
Length 50um OM2: '80 m'
Length 62.5um OM1: '30 m'
Length OM3: '300 m'
Vendor: 'Intel Corp'
OUI[0]: 0
OUI[1]: 27
OUI[2]: 33
Part Number: 'FTLX8571D3BCV-IT'
Revision: 'A'
Laser Wavelength: '850 nm'
Options[0]: 'Rx_LOS implemented'
Options[1]: 'TX_FAULT implemented'
Options[2]: 'TX_DISABLE implemented'
Options[3]: 'RATE_SELECT implemented'
Serial Number: 'AKR0EQ0'
Date Code: '110618'
Extended Options[0]: 'Soft Rate Select Control Implemented'
Extended Options[1]: 'Soft RATE_SELECT implemented'
Extended Options[2]: 'Soft RX_LOS implemented'
Extended Options[3]: 'Soft TX_FAULT implemented'
Extended Options[4]: 'Soft TX_DISABLE implemented'
Extended Options[5]: 'Alarm/Warning flags implemented'
8472 Compliance: 'Rev 10.2'
This allows us to interact with the information in a readable way. Effectively, this dumps out the entire name-value pair set that we construct when parsing data with libsff. There are two additional ways to print this data. The first one, -x, dumps out the data as hex data (kind of like if you run the program xxd). The second option, -w, writes out the first page, 0xa0, to a file. This allows you to take the raw data with you.
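Two of the fields in the verbose output are easier to read after a small transformation. The three OUI bytes are conventionally written as colon-separated hex, and the date code is, on one reading of SFF-8472, a two-digit year counted from 2000, a month, and a day, optionally followed by a two-character lot code. A minimal Python sketch with hypothetical helper names:

```python
from datetime import date


def format_oui(octets) -> str:
    """Render three OUI bytes, e.g. [0, 27, 33], as colon-separated hex."""
    octets = list(octets)
    if len(octets) != 3 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError("an OUI is exactly three bytes")
    return ":".join(f"{o:02x}" for o in octets)


def parse_date_code(code: str) -> date:
    """Turn a YYMMDD date code (lot code ignored) into a date."""
    code = code.strip()
    if len(code) < 6 or not code[:6].isdigit():
        raise ValueError(f"unexpected date code {code!r}")
    # Assumption: the two-digit year is an offset from 2000.
    return date(2000 + int(code[0:2]), int(code[2:4]), int(code[4:6]))


if __name__ == "__main__":
    print(format_oui([0, 27, 33]))
    print(parse_date_code("110618").isoformat())
```

Using the values from the output above, this prints the OUI as 00:1b:21 and the manufacturing date as 2011-06-18.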
Seeing Transceivers in Topo
The next step with all this work is to expose the transceivers as part of the system topology in the fault management architecture (FMA). This is useful for a few reasons:
-
It allows us to see what devices are present in the same snapshot as other devices like disks, CPUs, DIMMs, etc.
-
FMA’s topology is a natural place for us to expose sensors.
-
If a device is in topology, then we can generate error reports and faults against those devices.
Basically, being visible in the topology allows us to integrate it more fully into the system and makes it easy for various monitoring and inventory tools in the system to see these devices without having to make them aware of the underlying ways of getting data.
The topology information is organized as a tree. When we encounter hardware that we believe is a networking device (because its PCI class indicates it is), then we ask the kernel about how many transceivers it supports. For each transceiver, we create a port node under the NIC whose type indicates that it is intended for SFF devices.
When a transceiver is present, then we will place a transceiver node under the port. This node has two different groups of properties. The first is generic to all transceivers, which is where we indicate whether or not the hardware can use the transceiver. The second group are properties that we derive from the SFF specifications about the transceiver’s manufacturing data. This includes the vendor, part number, serial number, etc. The following block of text shows three different nodes: the NIC, the port, and the transceiver:
hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
iexdev=0/pciexfn=1
label string MB
FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0
ASRU fmri dev:////pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
group: authority version: 1 stability: Private/Private
product-id string X9SCL-X9SCM
chassis-id string 0123456789
server-id string ivy
group: io version: 1 stability: Private/Private
dev string /pci@0,0/pci8086,155@1,1/pci103c,17d3@0,1
driver string ixgbe
module fmri mod:///mod-name=ixgbe/mod-id=242
group: pci version: 1 stability: Private/Private
device-id string 10fb
extended-capabilities string pciexdev
class-code string 20000
vendor-id string 8086
assigned-addresses uint32[] [ 2197946640 0 3750756352 0 1048576 2164392216 0 57344 0 32 2197946656 0 3753902080 0 16384 ]
hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
iexdev=0/pciexfn=1/port=0
FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pc
iexdev=0/pciexfn=1
group: authority version: 1 stability: Private/Private
product-id string X9SCL-X9SCM
chassis-id string 0123456789
server-id string ivy
group: port version: 1 stability: Private/Private
type string sff
hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
FRU fmri hc://:product-id=X9SCL-X9SCM:server-id=ivy:chassis-id=0123456789:serial=AKR0EQ0:part=FTLX8571D3BCV-IT:revision=A/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2/pciexdev=0/pciexfn=1/port=0/transceiver=0
group: authority version: 1 stability: Private/Private
product-id string X9SCL-X9SCM
chassis-id string 0123456789
server-id string ivy
group: transceiver version: 1 stability: Private/Private
type string sff
usable string true
group: sff-transceiver version: 1 stability: Private/Private
vendor string Intel Corp
part-number string FTLX8571D3BCV-IT
revision string A
serial-number string AKR0EQ0
If you plug in a transceiver dedicated to fibre channel, then we’ll properly note that we can’t use the transceiver by setting the usable property to false. The following is an example of the transceiver node in that case:
hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
group: protocol version: 1 stability: Private/Private
resource fmri hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
FRU fmri hc://:product-id=X10SLM+-LN4F:server-id=haswell:chassis-id=0123456789/motherboard=0/hostbridge=0/pciexrc=0/pciexbus=1/pciexdev=0/pciexfn=1/port=0/transceiver=0
group: authority version: 1 stability: Private/Private
product-id string X10SLM+-LN4F
chassis-id string 0123456789
server-id string haswell
group: transceiver version: 1 stability: Private/Private
type string sff
usable string false
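The hc-scheme FMRIs in these listings have a regular shape: an authority section of colon-separated key=value pairs, followed by a path of name=instance components. The Python sketch below splits one apart. It is based purely on the examples shown here rather than on the formal FMRI grammar, so it does not handle escaping or values that contain ':' or '/'. `parse_hc_fmri` is a hypothetical helper, not part of the FMA libraries.

```python
def parse_hc_fmri(fmri: str):
    """Split an hc-scheme FMRI into (authority dict, list of path nodes)."""
    prefix = "hc://"
    if not fmri.startswith(prefix):
        raise ValueError("not an hc-scheme FMRI")
    # Everything up to the first '/' is the authority; the rest is the path.
    authority, _, path = fmri[len(prefix):].partition("/")
    auth = dict(item.split("=", 1) for item in authority.split(":") if item)
    nodes = []
    for comp in path.split("/"):
        if not comp:
            continue
        name, _, instance = comp.partition("=")
        nodes.append((name, int(instance)))
    return auth, nodes


if __name__ == "__main__":
    fmri = ("hc://:product-id=X9SCL-X9SCM:server-id=ivy"
            ":chassis-id=0123456789:serial=AKR0EQ0"
            ":part=FTLX8571D3BCV-IT:revision=A"
            "/motherboard=0/hostbridge=1/pciexrc=1/pciexbus=2"
            "/pciexdev=0/pciexfn=1/port=0/transceiver=0")
    auth, nodes = parse_hc_fmri(fmri)
    print(auth["serial"], nodes[-2:])
```

The last two path nodes are the port and the transceiver, which matches the tree described earlier: the transceiver node hangs off a port node, which hangs off the NIC's PCI function.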
Further Reading
If you’d like to read more on this, there are a couple of different places that I can send you.
For more on the SFF standards, there are the specifications discussed above: INF-8074, SFF-8472, SFF-8436, and SFF-8636.
If you’re interested in the illumos implementation, there’s:
-
The mac_capab_transceiver(9E) manual page, which describes how device drivers are supposed to implement the interface.
-
The bnxe, cxgbe, i40e, ixgbe, and qede driver implementations of MAC_CAPAB_TRANSCEIVER.
-
libsff’s implementation and its header file.
-
The dltraninfo command.
-
The pieces of the FMA topology libraries that implement support for dealing with NICs, ports, and transceivers.
Looking Ahead
If you have a favorite NIC that uses SFP-based transceivers and it isn’t supported, reach out and we’ll see what we can do. If you’d find it interesting to work on exposing more of the sensor information present in the SFPs, then we’d be happy to further mentor someone there. Once these pieces are exposed in topology, it could also make sense to wire up the FRU monitor to watch for temperature thresholds, voltage drops, or device faults.
Up next, we’ll talk about understanding the topology of USB devices.