Prometheus Node Exporter: Supplementary Notes

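Prometheus is monitoring software specialized for watching large numbers of components such as containers and cloud resources. Pages such as "Prometheus Installation (tarball)" show how to install Node Exporter, but from a practical standpoint they leave some gaps. This page covers how to change the Node Exporter configuration and adds supplementary explanations.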


Prerequisites
1. References
2. Tested environment
Collectors
1. Collector list
2. Enabling and disabling collectors
Configuration changes
Supplementary notes

Prerequisites

References

Tested environment

Rocky Linux 8.6
Prometheus 2.36.2
node_exporter 1.3.1
docker-ce 20.10.17
prom/prometheus v2.37.0

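Collectors

Collector list

Node exporter has settings called collectors, each of which represents something to monitor. If you read the node exporter startup log carefully, you can see that the collectors enabled at startup are printed to the console.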

[root@linux010 node_exporter-1.3.1.linux-amd64]# ./node_exporter

 <omitted>

ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=arp
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=bcache
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=bonding
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=btrfs
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=conntrack
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=cpu
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=cpufreq
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=diskstats
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=dmi
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=edac
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=entropy
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=fibrechannel
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=filefd
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=filesystem
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=hwmon
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=infiniband
ts=2022-05-13T08:18:32.826Z caller=node_exporter.go:115 level=info collector=ipvs

 <omitted>

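If you need to find out whether a collector is enabled or disabled by default, the README.md of "GitHub node_exporter" is easier to read than the help output and is the recommended place to look.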
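As a complement to reading the startup log, the collectors that are actually running can also be listed from the metrics endpoint. The sketch below assumes that node_exporter exports one `node_scrape_collector_success` sample per enabled collector (this metric does not appear in the output shown in this article), and the function name `list_collectors` is hypothetical.

```shell
#!/bin/bash
# list_collectors (hypothetical helper): read Prometheus text-format metrics
# on stdin and print the collector label of every
# node_scrape_collector_success sample, sorted by name.
# Assumption: one such sample exists per enabled collector.
list_collectors() {
  sed -n 's/^node_scrape_collector_success{collector="\([^"]*\)"}.*/\1/p' | sort
}

# Usage against a running exporter (not executed here):
#   curl -s http://localhost:9100/metrics | list_collectors
```

Comparing this list before and after changing the startup arguments is a quick way to confirm which collectors were switched.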

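Enabling and disabling collectors

Depending on the situation, you may want to change which collectors are enabled. For example, on a virtual machine you may feel that hardware monitoring is unnecessary and that hwmon is not needed. Or you may need to monitor ntp, which is disabled by default.

In such cases, you can switch collectors on and off by passing arguments to node_exporter. The node_exporter help output is as follows.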

[root@linux010 node_exporter-1.3.1.linux-amd64]# ./node_exporter -h
usage: node_exporter [<flags>]

Flags:
  -h, --help                     Show context-sensitive help (also try --help-long and --help-man).

        <omitted>

      --collector.arp            Enable the arp collector (default: enabled).
      --collector.bcache         Enable the bcache collector (default: enabled).
      --collector.bonding        Enable the bonding collector (default: enabled).
      --collector.btrfs          Enable the btrfs collector (default: enabled).
      --collector.buddyinfo      Enable the buddyinfo collector (default: disabled).
      --collector.conntrack      Enable the conntrack collector (default: enabled).
      --collector.cpu            Enable the cpu collector (default: enabled).
      --collector.cpufreq        Enable the cpufreq collector (default: enabled).
      --collector.diskstats      Enable the diskstats collector (default: enabled).
      --collector.dmi            Enable the dmi collector (default: enabled).
      --collector.drbd           Enable the drbd collector (default: disabled).
      --collector.drm            Enable the drm collector (default: disabled).
      --collector.edac           Enable the edac collector (default: enabled).
      --collector.entropy        Enable the entropy collector (default: enabled).
      --collector.ethtool        Enable the ethtool collector (default: disable

To disable a collector, add "--no-collector.<collector-name>". To enable a collector, add "--collector.<collector-name>".

An example follows. It disables hwmon and enables monitoring of systemd, which is disabled by default.

./node_exporter \
  --no-collector.hwmon \
  --collector.systemd

Now check that the collectors were actually switched. Grepping the metrics page (http://<ip address>:9100/metrics) for "node_systemd_unit_state" shows that information about systemd, which is disabled by default, is indeed being exported.

[root@linux010 node_exporter-1.3.1.linux-amd64]# curl -s http://localhost:9100/metrics | grep -A 30 "HELP node_systemd_unit_state"
# HELP node_systemd_unit_state Systemd unit
# TYPE node_systemd_unit_state gauge
node_systemd_unit_state{name="NetworkManager-wait-online.service",state="activating",type="oneshot"} 0
node_systemd_unit_state{name="NetworkManager-wait-online.service",state="active",type="oneshot"} 1
node_systemd_unit_state{name="NetworkManager-wait-online.service",state="deactivating",type="oneshot"} 0
node_systemd_unit_state{name="NetworkManager-wait-online.service",state="failed",type="oneshot"} 0
node_systemd_unit_state{name="NetworkManager-wait-online.service",state="inactive",type="oneshot"} 0
node_systemd_unit_state{name="NetworkManager.service",state="activating",type="dbus"} 0
node_systemd_unit_state{name="NetworkManager.service",state="active",type="dbus"} 1
node_systemd_unit_state{name="NetworkManager.service",state="deactivating",type="dbus"} 0
node_systemd_unit_state{name="NetworkManager.service",state="failed",type="dbus"} 0
node_systemd_unit_state{name="NetworkManager.service",state="inactive",type="dbus"} 0
node_systemd_unit_state{name="auditd.service",state="activating",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="active",type="forking"} 1
node_systemd_unit_state{name="auditd.service",state="deactivating",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="failed",type="forking"} 0
node_systemd_unit_state{name="auditd.service",state="inactive",type="forking"} 0
node_systemd_unit_state{name="basic.target",state="activating",type=""} 0
node_systemd_unit_state{name="basic.target",state="active",type=""} 1
node_systemd_unit_state{name="basic.target",state="deactivating",type=""} 0
node_systemd_unit_state{name="basic.target",state="failed",type=""} 0
node_systemd_unit_state{name="basic.target",state="inactive",type=""} 0
node_systemd_unit_state{name="cpupower.service",state="activating",type="oneshot"} 0
node_systemd_unit_state{name="cpupower.service",state="active",type="oneshot"} 0
node_systemd_unit_state{name="cpupower.service",state="deactivating",type="oneshot"} 0
node_systemd_unit_state{name="cpupower.service",state="failed",type="oneshot"} 0
node_systemd_unit_state{name="cpupower.service",state="inactive",type="oneshot"} 1
node_systemd_unit_state{name="crond.service",state="activating",type="simple"} 0
node_systemd_unit_state{name="crond.service",state="active",type="simple"} 1
node_systemd_unit_state{name="crond.service",state="deactivating",type="simple"} 0
node_systemd_unit_state{name="crond.service",state="failed",type="simple"} 0
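Each unit is exported once per state, and the sample whose value is 1 marks the current state. The following is a minimal sketch of filtering that output; the function name `units_in_state` is hypothetical, and it assumes the label layout shown in the output above.

```shell
#!/bin/bash
# units_in_state (hypothetical helper): read text-format metrics on stdin and
# print the unit names whose node_systemd_unit_state sample for the given
# state has the value 1.
units_in_state() {
  local state="$1"
  awk -v state="$state" '
    index($0, "node_systemd_unit_state{") == 1 && $NF == 1 &&
    index($0, "state=\"" state "\"") {
      if (match($0, /name="[^"]*"/)) {
        # Skip the leading name=" (6 characters) and drop the closing quote.
        print substr($0, RSTART + 6, RLENGTH - 7)
      }
    }'
}

# Usage against a running exporter (not executed here):
#   curl -s http://localhost:9100/metrics | units_in_state failed
```

In practice this kind of filtering is usually done with a query on the Prometheus side; the shell version is only meant for a quick check on the host.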

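Configuration changes

textfile: checking how it works

The textfile collector is enabled by default, but the directory it watches is not set. To monitor text files, you must specify the directory explicitly.

Let's check how textfile behaves. As a simple example, the following commands count the lines in /etc/shadow to find the number of users.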

mkdir /var/prom/
echo 'shadow_entries' $(grep -c . /etc/shadow) > /var/prom/shadow.prom

To be safe, confirm that /var/prom/shadow.prom contains the expected output.

[root@linux010 ~]# cat /var/prom/shadow.prom 
shadow_entries 24
[root@linux010 ~]#

Start node_exporter with the startup argument collector.textfile.directory, as follows.

./node_exporter \
--collector.textfile.directory=/var/prom/

Confirm that the /etc/shadow count is exported by node_exporter.

[root@linux010 ~]# curl -s http://localhost:9100/metrics | grep -A 10 "HELP shadow_entries"
# HELP shadow_entries Metric read from /var/prom/shadow.prom
# TYPE shadow_entries untyped
shadow_entries 24
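The output above reports the type as untyped and a generated HELP text because the .prom file contains only the sample line. A minimal sketch of writing the HELP and TYPE lines yourself follows; the function name `render_gauge` and the help text are hypothetical, and the sketch assumes the value should be treated as a gauge.

```shell
#!/bin/bash
# render_gauge (hypothetical helper): print one gauge metric in the
# Prometheus text exposition format, with explicit HELP and TYPE lines.
render_gauge() {
  local name="$1" help="$2" value="$3"
  printf '# HELP %s %s\n' "$name" "$help"
  printf '# TYPE %s gauge\n' "$name"
  printf '%s %s\n' "$name" "$value"
}

# Usage (not executed here; the help text is only an example):
#   render_gauge shadow_entries "Non-empty lines in /etc/shadow" \
#     "$(grep -c . /etc/shadow)" > /var/prom/shadow.prom
```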

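textfile: periodic execution

Running the /etc/shadow check by hand, as above, is not realistic. It needs to be turned into some automated mechanism. With crontab, for example, the following entry runs it every 5 minutes.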

[root@linux010 ~]# crontab -l
*/5 * * * * echo 'shadow_entries' $(grep -c . /etc/shadow) > /var/prom/shadow.prom
[root@linux010 ~]#

With some ingenuity, textfile can implement many kinds of monitoring. For other examples, see "GitHub Text collector example scripts".

timestamp

When a textfile is monitored, the timestamp of the monitored file is exported as well. In the output example below, the timestamp of /var/prom/shadow.prom is 1.652438101e+09.

[root@linux010 ~]# curl -s http://localhost:9100/metrics | grep -A 2 "HELP node_textfile_mtime_seconds"
# HELP node_textfile_mtime_seconds Unixtime mtime of textfiles successfully read.
# TYPE node_textfile_mtime_seconds gauge
node_textfile_mtime_seconds{file="/var/prom/shadow.prom"} 1.652438101e+09

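The output is Unix time, so convert it to Japan time or another time zone as needed. An example follows. If the command line is too much trouble, a web tool such as "UNIX時間⇒日付変換" (Unix time to date conversion) also works.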

[root@linux010 ~]# TZ=Asia/Tokyo date --date "@1652438101"
Fri May 13 19:35:01 JST 2022
[root@linux010 ~]#
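The same timestamp can be used to notice that the cron job has stopped updating the file. The following sketch computes the age of a textfile from its node_textfile_mtime_seconds value; the function name `textfile_age_seconds` and the 600-second threshold in the usage comment are assumptions made for illustration.

```shell
#!/bin/bash
# textfile_age_seconds (hypothetical helper): print how many seconds have
# passed since the given mtime. The mtime may be in the exponent notation
# that node_exporter prints (for example 1.652438101e+09).
# The second argument is the current Unix time; it defaults to now.
textfile_age_seconds() {
  local mtime="$1" now="${2:-$(date +%s)}"
  awk -v m="$mtime" -v n="$now" 'BEGIN { printf "%.0f\n", n - m }'
}

# Usage (not executed here): with a 5-minute cron job, an age well above
# 300 seconds suggests the job is no longer running.
#   age=$(textfile_age_seconds 1.652438101e+09)
#   [ "$age" -gt 600 ] && echo "shadow.prom is stale (${age}s old)"
```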

Specifying mount points

The node_exporter help output is shown again below.

[root@linux010 ~]# node_exporter -h
usage: node_exporter [<flags>]

Flags:

      <omitted>

      --path.procfs="/proc"      procfs mountpoint.
      --path.sysfs="/sys"        sysfs mountpoint.
      --path.rootfs="/"          rootfs mountpoint.

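Some startup arguments specify mount points. node_exporter returns monitoring results read from under /proc, /sys, and similar locations, and these arguments are used when you want to change those paths. For example, they help when reading /proc/cpuinfo is inconvenient and you want to read /host/proc/cpuinfo instead.

These arguments are useful when node_exporter runs in docker. Start the container with path.rootfs changed from "/" to "/host", as follows.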

docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

To understand the behavior, let's operate inside the node_exporter container with BusyBox (/bin/sh).

[root@linux010 ~]# docker run -d \
>   --name node-exporter \
>   --net="host" \
>   --pid="host" \
>   -v "/:/host:ro,rslave" \
>   quay.io/prometheus/node-exporter:latest \
>   --path.rootfs=/host
Unable to find image 'quay.io/prometheus/node-exporter:latest' locally
latest: Pulling from prometheus/node-exporter
aa2a8d90b84c: Pull complete 
b45d31ee2d7f: Pull complete 
b5db1e299295: Pull complete 
Digest: sha256:f2269e73124dd0f60a7d19a2ce1264d33d08a985aed0ee6b0b89d0be470592cd
Status: Downloaded newer image for quay.io/prometheus/node-exporter:latest
7156993716a026d38ce9180c0f2e6dd314c8570a9df7dda464c52f257987d8c5
[root@linux010 ~]#
[root@linux010 ~]#
[root@linux010 ~]# docker exec -it node-exporter /bin/sh
/ $

Confirm that /host is mounted.

/ $ df -h
Filesystem                Size      Used Available Use% Mounted on
overlay                  30.0G      2.1G     27.8G   7% /
tmpfs                    64.0M         0     64.0M   0% /dev
tmpfs                   372.4M         0    372.4M   0% /sys/fs/cgroup
shm                      64.0M         0     64.0M   0% /dev/shm
/dev/sda2                30.0G      2.1G     27.8G   7% /host
tmpfs                   372.4M         0    372.4M   0% /host/sys/fs/cgroup
devtmpfs                340.0M         0    340.0M   0% /host/dev
tmpfs                   372.4M         0    372.4M   0% /host/dev/shm
tmpfs                   372.4M      9.9M    362.5M   3% /host/run
tmpfs                    74.5M         0     74.5M   0% /host/run/user/1000

  <omitted>

Listing the files under /host shows that the files of the host (virtual machine) are visible.

/ $ ls -l /host
total 20
lrwxrwxrwx    1 root     root             7 Oct  9  2021 bin -> usr/bin
dr-xr-xr-x    5 root     root          4096 Feb 23 17:36 boot
drwxr-xr-x    2 root     root             6 Feb 23 17:37 data
drwxr-xr-x   18 root     root          2960 May 13 06:00 dev
drwxr-xr-x   81 root     root          8192 May 13 10:45 etc
drwxr-xr-x    3 root     root            23 May 13 02:29 home
lrwxrwxrwx    1 root     root             7 Oct  9  2021 lib -> usr/lib
lrwxrwxrwx    1 root     root             9 Oct  9  2021 lib64 -> usr/lib64
drwxr-xr-x    2 root     root             6 Oct  9  2021 media
drwxr-xr-x    3 root     root            22 May 13 02:29 mnt
drwxr-xr-x    3 root     root            24 May 13 10:45 opt
dr-xr-xr-x  133 root     root             0 May 13 06:00 proc
dr-xr-x---    3 root     root           158 May 13 10:35 root
drwxr-xr-x   27 root     root           800 May 13 10:45 run
lrwxrwxrwx    1 root     root             8 Oct  9  2021 sbin -> usr/sbin
drwxr-xr-x    2 root     root             6 Oct  9  2021 srv
dr-xr-xr-x   13 root     root             0 May 13 06:00 sys
drwxrwxrwt    8 root     root           172 May 13 10:51 tmp
drwxr-xr-x   12 root     root           144 Feb 23 17:35 usr
drwxr-xr-x   21 root     root          4096 May 13 09:12 var
/ $

For example, /host/proc/cpuinfo inside the container corresponds to /proc/cpuinfo on the host (virtual machine).

/ $ cat /host/proc/cpuinfo 
processor : 0
vendor_id : GenuineIntel
cpu family  : 6
model   : 85
model name  : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping  : 7
microcode : 0xffffffff
cpu MHz   : 2593.906
cache size  : 36608 KB
physical id : 0

  <omitted>

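Supplementary notes

Specifying startup arguments for the RPM version of Node Exporter

Looking at /etc/systemd/system/multi-user.target.wants/node_exporter.service shows that startup arguments can be specified through the variable NODE_EXPORTER_OPTS.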

# -*- mode: conf -*-

[Unit]
Description=Prometheus exporter for machine metrics, written in Go with pluggable metric collectors.
Documentation=https://github.com/prometheus/node_exporter
After=network.target

[Service]
EnvironmentFile=-/etc/default/node_exporter
User=prometheus
ExecStart=/usr/bin/node_exporter $NODE_EXPORTER_OPTS
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure
RestartSec=5s

The variable NODE_EXPORTER_OPTS is defined in /etc/default/node_exporter. For example, editing /etc/default/node_exporter as follows excludes hwmon from monitoring.

NODE_EXPORTER_OPTS="--no-collector.hwmon"

After editing /etc/default/node_exporter, restart node_exporter.

systemctl restart node_exporter.service

Confirm that the expected arguments were applied.

[root@linux010 ~]# ps aux | grep node_exporter
prometh+    3509  0.0  1.7 715716 13588 ?        Ssl  09:02   0:00 /usr/bin/node_exporter --no-collector.hwmon
root        3513  0.0  0.1 221928  1192 pts/0    S+   09:02   0:00 grep --color=auto node_exporter
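Several arguments can be placed in the same variable. The fragment below is a sketch that combines flags used earlier in this article; it assumes the unit file shown above, where $NODE_EXPORTER_OPTS is written without braces so that systemd splits the value on whitespace, and it assumes the prometheus user can read /var/prom/.

```shell
# /etc/default/node_exporter
# Sketch: disable hwmon, enable systemd, and set the textfile directory.
# The value is split on whitespace when the unit file expands it, so each
# flag becomes a separate argument.
NODE_EXPORTER_OPTS="--no-collector.hwmon --collector.systemd --collector.textfile.directory=/var/prom/"
```

After a restart, the ps check shown above should list all three flags on the node_exporter command line.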

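Specifying startup arguments for the Docker version of Node Exporter

With the Docker version of Node Exporter, arguments appended after the image name are passed to Node Exporter. A startup example follows.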

docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host \
  --collector.arp

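After starting the container above, check that the expected collector is running. Note that the help output shown earlier lists arp as enabled by default, so this check confirms that the container accepts the flag and that the arp metrics are exported, as shown below.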

[root@linux010 ~]# curl -s http://localhost:9100/metrics | grep -A 2 "HELP node_arp_entries ARP entries by device"
# HELP node_arp_entries ARP entries by device
# TYPE node_arp_entries gauge
node_arp_entries{device="eth0"} 2
[root@linux010 ~]#

Atomic updates

The /etc/shadow counting method introduced earlier is a crude implementation that does not suit large-scale or high-reliability environments. The method is shown again below.

echo 'shadow_entries' $(grep -c . /etc/shadow) > /var/prom/shadow.prom

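With this method, the file is momentarily empty at the instant it is written. On some systems, this brief false reading becomes a serious problem that cannot be ignored. Some readers may think this is exaggerated, but the O'Reilly book "入門 Prometheus" warns about it, and the author has run into this trouble in practice.

To update the file atomically, with no moment in which it is empty, write a script like the one below. It creates a temporary file named "/var/prom/shadow.prom.$$" and renames it to "/var/prom/shadow.prom".

"$$" expands to the process ID of the shell.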

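When the source and the destination are on the same filesystem, the mv command moves a file by renaming it, and the rename replaces the destination in a single step. If the two paths are on different filesystems, mv falls back to copying and deleting, which is not atomic, so create the temporary file in the same directory as the final file. The util-linux rename command takes an expression, a replacement, and file names, so it cannot be called with only a source and a destination.

#!/bin/bash
# Write to a temporary file first; its name does not end in .prom, so the textfile collector ignores it
echo 'shadow_entries' $(grep -c . /etc/shadow) > /var/prom/shadow.prom.$$
# Rename it over the final file in one step (same directory, same filesystem)
mv /var/prom/shadow.prom.$$ /var/prom/shadow.prom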
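The write-then-rename pattern can be wrapped in a reusable function. The following is a minimal sketch, not the author's implementation: the function name `write_prom_atomically` is hypothetical, and it uses mktemp and mv so that the temporary file is always created in the same directory as the target.

```shell
#!/bin/bash
# write_prom_atomically (hypothetical helper): read metric text on stdin and
# publish it as the file given in $1.
# The temporary file lives in the target's directory, so mv renames it in
# place instead of copying across filesystems. Its name does not end in
# .prom, so the textfile collector does not pick it up half-written.
write_prom_atomically() {
  local target="$1" dir tmp
  dir=$(dirname "$target")
  tmp=$(mktemp "$dir/.$(basename "$target").XXXXXX") || return 1
  # Write everything first; on failure remove the temp file and give up.
  if ! cat > "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  chmod 644 "$tmp"          # mktemp creates the file with mode 600
  mv -f "$tmp" "$target"    # rename over the previous version
}

# Usage (not executed here):
#   echo "shadow_entries $(grep -c . /etc/shadow)" \
#     | write_prom_atomically /var/prom/shadow.prom
```

Because the function reads from stdin, the same helper can publish the output of any script that prints metrics, including the crontab entry shown earlier.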
