zabbix / wrong nvme temperature readings
After installing some new nodes, Zabbix began reporting NVMe drive temperatures that were... alarming. Think "above 60 °C" alarming. But when I checked the same drives manually, the values were sane. So, either the drives were gaslighting me, or Zabbix was.
The search
Stripped down, our monitoring scripts were really just calling nvme-cli:
# nvme smart-log /dev/nvme0n1 | grep temper
temperature : 33 °C (306 K)
Nothing wrong there.
Next step was to peek under the hood. I disabled TLS on the zabbix-agent so I could sniff what the proxy was sending and what the agent was replying with.
The config sent by the zabbix-proxy looked fine:
{
"response": "success",
"config_revision": 1,
"data": [
// ...
{
"key": "nvme.device.smart[/dev/nvme1n1,temperature]",
"itemid": 3436251,
"delay": "1h",
"lastlogsize": 0,
"mtime": 0,
"timeout": "4s"
},
// ...
The data sent by the zabbix-agent also looked fine, except for the extreme temperature:
{
"request": "agent data",
"session": "cf1f77a2de59226b9edac3a4d68af46f",
"version": "7.0.18",
"variant": 1,
"host": "pve.example.com",
"data": [
{
"itemid": 3436250,
"value": "91",
"id": 201,
"clock": 1758189607,
"ns": 650994625
},
// ...
Again, I double-checked the configuration and the script:
# grep -F =nvme.device.smart /etc/zabbix/zabbix_agentd.d/nvme.conf
UserParameter=nvme.device.smart[*], sudo /etc/zabbix/scripts/nvme.device --smart "$1" "$2"
# sudo -Huzabbix sudo /etc/zabbix/scripts/nvme.device --smart /dev/nvme0n1 temperature
33
I then modified the script to report the constant value 66.
{
"request": "agent data",
"session": "cf1f77a2de59226b9edac3a4d68af46f",
"version": "7.0.18",
"variant": 1,
"host": "pve.example.com",
"data": [
{
"itemid": 3436250,
"value": "66",
"id": 201,
"clock": 1758189788,
"ns": 110298321
},
// ...
The zabbix-agent dutifully reported the 66 back to the proxy, so the
agent was relaying the script's output faithfully. That meant nvme-cli
itself had to be telling a different story when called by the agent.
At that point I suspected the environment. I modified the script to
dump the env to a temporary file:
env >/tmp/zabbix.debug
After the next zabbix-agent run, I found the /tmp/zabbix.debug file with the exact environment with which nvme-cli would be called.
An alternative approach is to run strace, which produced the following output:
# strace -fp $(pgrep -f 'zabbix_agentd: active checks') -v -s 4096 -es=none -eq=exit -et=execve
...
[pid 3483848] execve("/bin/sh",
["sh", "-c", " sudo /etc/zabbix/scripts/nvme.device --smart \"/dev/nvme0n1\" \"temperature\""],
["LANG=en_US.UTF-8",
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin",
"PIDFILE=/run/zabbix/zabbix_agentd.pid",
"USER=zabbix",
"LOGNAME=zabbix",
"HOME=/var/lib/zabbix",
"INVOCATION_ID=1c74972630de492aa275f3d8c19f9f65",
"JOURNAL_STREAM=9:69398932",
"SYSTEMD_EXEC_PID=3483398",
"MEMORY_PRESSURE_WATCH=/sys/fs/cgroup/system.slice/zabbix-agent.service/memory.pressure",
"MEMORY_PRESSURE_WRITE=c29tZSAyMDAwMDAgMjAwMDAwMAA=",
"CONFFILE=/etc/zabbix/zabbix_agentd.conf"]) = 0
The third parameter is the list of environment variables passed to the custom script. They don't look strange at first glance — but they must be the cause.
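To bisect which variable matters, the environment from the trace can be replayed by hand. The following is only a sketch: it starts from an empty environment with `env -i` and passes a subset of the values copied from the strace output above. The helper name `run_like_agent` is hypothetical and not part of our scripts.

```shell
#!/bin/sh
# Hypothetical helper: run a command with an environment similar to the one
# the agent hands to the script. Only a subset of the traced variables is
# replayed here; add or drop lines to test one variable at a time.
run_like_agent() {
    env -i \
        LANG=en_US.UTF-8 \
        PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
        USER=zabbix LOGNAME=zabbix HOME=/var/lib/zabbix \
        "$@"
}

# Example (needs root and an NVMe device, so it is left commented out):
# run_like_agent nvme smart-log /dev/nvme0n1 | grep '^temperature'
```

Nothing from the calling shell leaks through, so a difference in output can be pinned on the listed variables.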
At this point, the astute reader will notice LANG=en_US.UTF-8. That
is passed to the script, and consequently to the nvme-cli application.
LANG=en_US.UTF-8
Indeed, it turned out that the LANG variable was the culprit. Setting
LC_ALL=en_US.UTF-8 (forcefully overriding any language settings)
reproduced the problem:
# nvme smart-log /dev/nvme0n1 | grep ^tempera
temperature : 33 °C (306 K)
# LC_ALL=en_US.UTF-8 nvme smart-log /dev/nvme0n1 | grep ^tempera
temperature : 91 °F (306 K)
91 Fahrenheit instead of 33 Celsius from nvme-cli.
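Note that the Kelvin value in parentheses is the same in both outputs. A script could therefore derive Celsius from it and not care about the first unit at all. This is a minimal sketch, assuming the line layout shown above and assuming that Celsius is simply Kelvin minus 273 (which matches 306 K and 33 °C here). The function name `smart_log_celsius` is hypothetical; it is not what our script does.

```shell
#!/bin/sh
# Hypothetical helper: read `nvme smart-log` output on stdin and print the
# temperature in whole degrees Celsius, computed from the Kelvin value.
smart_log_celsius() {
    # Split on parentheses so that field 2 is e.g. "306 K".
    awk -F'[()]' '
        /^temperature/ {
            n = split($2, parts, " ")
            if (n == 2 && parts[2] == "K") {
                print parts[1] - 273
                exit
            }
        }'
}

# Example (needs root and an NVMe device, so it is left commented out):
# nvme smart-log /dev/nvme0n1 | smart_log_celsius
```

Fed either of the two lines above, this prints 33.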
Here in the Netherlands we don't measure temperatures in glazed donuts
per bald eagle units, so I had my system locales set to a sane mix of
nl_NL and en_US:
# locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=en_US.UTF-8
LC_TIME=nl_NL.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=nl_NL.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=nl_NL.UTF-8
LC_NAME=nl_NL.UTF-8
LC_ADDRESS=nl_NL.UTF-8
LC_TELEPHONE=nl_NL.UTF-8
LC_MEASUREMENT=nl_NL.UTF-8
LC_IDENTIFICATION=nl_NL.UTF-8
LC_ALL=
That's my personal config passed along over the ssh session.
On the server the zabbix-agent simply inherits LANG=en_US.UTF-8 from
systemd:
# systemctl show-environment
LANG=en_US.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
This is not a new config. On this Debian/Trixie machine, the defaults
are configured in /etc/locale.conf. In older systems, it would be in
/etc/default/locale. Systemd picks up either config and uses it when
spawning daemons.
And because we like to keep our servers close to vanilla config and because we prefer English for our systems, there is never a need to change the global default locale.
Sidenote: if you're updating /etc/default/locale or /etc/locale.conf, run systemctl daemon-reexec to update the environment in systemd.
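To confirm what a running daemon actually inherited, its environment can be read from /proc. A small sketch, assuming a Linux /proc filesystem and sufficient privileges to read the file; `environ_lines` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn the NUL-separated contents of /proc/PID/environ
# (read from stdin) into one VAR=value per line.
environ_lines() {
    tr '\0' '\n'
}

# Example, reusing the pgrep pattern from the strace call earlier
# (commented out because it needs a running agent and root):
# pid=$(pgrep -f 'zabbix_agentd: active checks' | head -n 1)
# environ_lines < "/proc/$pid/environ" | grep '^LANG='
```

The file reflects the environment the process was started with, so it shows what systemd passed along at that time.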
nvme-cli updated
To summarize, the newer nvme-cli application started outputting temperatures in Fahrenheit instead of Celsius when your locale country is "US" (or one of a handful of others).
That change was made in version v2.10:
$ git tag --contains c95f96e0d | sort -V | head -n1
v2.10
Debian/Trixie has 2.13-2 while Ubuntu/Noble is still at 2.8-1ubuntu0.1, which explains why this was the first time we were bitten by this.
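A quick way to sort hosts into "before" and "after" is to compare the installed version against 2.10. The sketch below reuses the `sort -V` ordering from the git tag lookup; `version_ge` is a hypothetical helper, and version sort is only an approximation of real package version comparison (on Debian, `dpkg --compare-versions` is the authoritative tool).

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 is greater than or equal to $2,
# according to `sort -V` ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# The two package versions quoted above, compared against v2.10:
for v in 2.13-2 2.8-1ubuntu0.1; do
    if version_ge "$v" 2.10; then
        echo "$v: includes the v2.10 change"
    else
        echo "$v: predates the v2.10 change"
    fi
done
```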
There were apparently already others affected, as seen on the GitHub
issue tracker, and there was also some confusion about setting
LC_MEASUREMENT=metric.
That's wrong: "metric" is not a locale name. Set it to a valid locale, like
LC_MEASUREMENT=nl_NL.UTF-8.
To programmers: if you really must, use the following C code when deciding to use Fahrenheit, instead of rolling your own is_fahrenheit_country():
/* glibc-specific; needs <langinfo.h> and a prior setlocale(LC_ALL, "") */
const char *m = nl_langinfo(_NL_MEASUREMENT_MEASUREMENT);
if (m && m[0] == 2) {
printf("Fahrenheit\n");
} else { /* m[0] == 1 or anything else */
printf("Celsius\n");
}
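The same value can be inspected from the shell, which is handy for checking what a given locale would select. This is a sketch assuming glibc's locale(1), where the `measurement` keyword of LC_MEASUREMENT is 1 for metric and 2 for US customary; the function name `temperature_unit` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map the LC_MEASUREMENT "measurement" value to a
# temperature unit, mirroring the C snippet above.
temperature_unit() {
    case "$1" in
        2) echo Fahrenheit ;;
        *) echo Celsius ;;   # 1, empty, or anything else
    esac
}

# On a glibc system with the locale installed, the value can be queried
# like this (commented out because installed locales vary per host):
# temperature_unit "$(LC_ALL=nl_NL.UTF-8 locale measurement)"
```

Taking the value as an argument keeps the mapping separate from the lookup, so it behaves the same on hosts where a locale is missing.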
But, I'd rather you didn't. The default of en_US is fine as long as it's unobtrusive. If you're going to change your code, expect that people use the default and don't expect (or want) your change.
Proper fix
In the long run, it might be better to force zabbix-agent to lose the
locale environment altogether. It would make the most sense when custom
zabbix-agent scripts get the C locale for consistent sorting (and
consistent temperature output). That requires some external influence
(for instance a systemd Environment= override). So for now I'll stick
to setting the C locale in the scripts where needed.
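Per script, that can be as small as a wrapper around the call. A minimal sketch, assuming (as suggested above) that nvme-cli gives consistent output in the C locale; `in_c_locale` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a command with LC_ALL=C, which takes precedence
# over LANG and the individual LC_* variables for that one process.
in_c_locale() {
    env LC_ALL=C "$@"
}

# In a check script this could wrap the nvme-cli call (commented out
# because it needs root and an NVMe device):
# in_c_locale nvme smart-log /dev/nvme0n1 | grep '^temperature'
```

Scoping the override to a single command leaves the rest of the script, and the agent itself, untouched.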
Update 2025-11-03
In the end, the dynamic detection was removed from nvme-cli. Instead, three temperature units are now always shown, regardless of locale settings.
So, we're back to the situation where plenty of applications read
LC_PAPER, but none use LC_MEASUREMENT.