Category Archives: Collectd Graph Panel

Server stats with CollectD, InfluxDB and Grafana (with downsampling)

Almost 10 years ago I started developing a web frontend for CollectD, Collectd Graph Panel (CGP): a PHP frontend that displays graphs in PNG format, rendered by rrdtool from the RRD files created by CollectD.

A lot has happened since then. Because of the IoT hype, time series databases like Graphite, InfluxDB and TimescaleDB became more popular. Visualization tools also gained traction, of which Grafana is the most popular one.

In this blog post I'm going to replace the RRD files and CGP with InfluxDB and Grafana, while keeping CollectD as the metrics collector. I will:

  1. Hook up CollectD to InfluxDB to store the metrics
  2. Configure InfluxDB to aggregate data over time (unlike RRD, it doesn't do this automatically)
  3. Use a Grafana dashboard to display the graphs with the same colors and styling I was used to in CGP

Hooking up CollectD to InfluxDB

This is pretty simple. First of all, follow the installation guide to install the InfluxDB service.

InfluxDB supports the CollectD protocol. It can be configured to listen on UDP port 25826, which CollectD clients can send metrics to.

I more or less used the default values that were already provided in /etc/influxdb/influxdb.conf:

[[collectd]]   
  enabled = true
  bind-address = ":25826"
  database = "collectd"
  retention-policy = ""
  typesdb = "/usr/share/collectd/types.db"
  security-level = "none"
  batch-size = 5000
  batch-pending = 10  
  batch-timeout = "10s"
  read-buffer = 0

In the configuration of the CollectD clients, InfluxDB can be configured as a server in the network plugin:

LoadPlugin network
<Plugin network> 
  Server "<InfluxDB-IP-address>" "25826"
</Plugin>

The metrics the CollectD clients collect are now sent to InfluxDB.
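To verify that metrics are arriving, you can query InfluxDB directly. A quick check (the exact measurement names, like cpu_value or load_shortterm, depend on which CollectD plugins are enabled):

$ influx -database collectd
> SHOW MEASUREMENTS
> SELECT * FROM "cpu_value" ORDER BY time DESC LIMIT 3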

Downsampling data in InfluxDB

Unlike the RRD files created by CollectD, InfluxDB doesn't come with a default downsampling policy. Metrics are simply sent by the CollectD clients every 10 seconds, stored in InfluxDB and kept indefinitely. You will have super detailed graphs when you, for example, zoom in on some hourly statistics from 5 months ago, but your InfluxDB data set will keep growing, resulting in gigabytes of data per CollectD client.

In my experience, for server statistics you want detailed graphs for the most recent metrics. This is useful when you want to debug an issue. Older metrics are nice for weekly, monthly, quarterly or yearly graphs to spot trends. For graphs with those timeframes, 10-second metrics are not required; the metrics can be aggregated.

In InfluxDB the combination of “Retention Policies” (RPs) and “Continuous Queries” (CQs) can be used to downsample the metrics. One of the things you can define with an RP is for how long InfluxDB keeps the data. CQs automatically and periodically execute pre-defined queries. This can be used to aggregate the metrics to a different RP.

I've been fairly happy with the aggregation policy in the RRD files used by CollectD. Let's try to set up the same data aggregation scheme in InfluxDB.

Information about the aggregation policy can be extracted from an RRD file using the rrdinfo command. Let's take the cpu-idle.rrd file as an example. The step shows that this RRD file contains 1 metric per 10 seconds:

$ rrdinfo cpu-idle.rrd | grep step
step = 10

And this shows the different aggregation policies for the average value of the metrics:

$ rrdinfo cpu-idle.rrd | grep AVERAGE -A6 | egrep '(rows|pdp_per_row)'
rra[0].rows = 1200
rra[0].pdp_per_row = 1
rra[3].rows = 1235
rra[3].pdp_per_row = 7
rra[6].rows = 1210   
rra[6].pdp_per_row = 50
rra[9].rows = 1202
rra[9].pdp_per_row = 223
rra[12].rows = 1201
rra[12].pdp_per_row = 2635

There are 5 different aggregations. Each one has a number of Primary Data Points per row (pdp_per_row), meaning that, for example, 1 row (metric) is an aggregation of 7 Primary Data Points. The rows value shows how many of those rows are kept.

Summarized, this RRD file contains:

  • 1200 metrics of a 10 second interval (12000s of data == 3.33 hours)
  • 1235 metrics of a (7*10) 70 second interval (86450s of data =~ 1 day)
  • 1210 metrics of a (50*10) 500 second interval (605000s of data == 1 week)
  • 1202 metrics of a (223*10) 2230 second interval (2680460s of data == 31 days)
  • 1201 metrics of a (2635*10) 26350 second interval (31646350s of data == 366 days)

Let's connect to our InfluxDB instance and configure the same using RPs and CQs.

$ influx
Connected to http://localhost:8086 version 1.7.6
InfluxDB shell version: 1.7.6
Enter an InfluxQL query
> show databases
name: databases
name
----
_internal
collectd
> use collectd
Using database collectd
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 0s        168h0m0s           1        true

By default the database contains the "autogen" RP, with a duration of 0s: no data will ever be thrown away. First modify the duration of the autogen retention policy to 200 minutes:

> alter retention policy "autogen" on "collectd" duration 200m shard duration 1h
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 3h20m0s   1h0m0s             1        true  

Now add the additional RPs:

> CREATE RETENTION POLICY "day" ON collectd DURATION 1d REPLICATION 1
> CREATE RETENTION POLICY "week" ON collectd DURATION 7d REPLICATION 1
> CREATE RETENTION POLICY "month" ON collectd DURATION 31d REPLICATION 1
> CREATE RETENTION POLICY "year" ON collectd DURATION 366d REPLICATION 1
> show retention policies
name    duration  shardGroupDuration replicaN default
----    --------  ------------------ -------- -------
autogen 3h20m0s   1h0m0s             1        true  
day     24h0m0s   1h0m0s             1        false
week    168h0m0s  24h0m0s            1        false
month   744h0m0s  24h0m0s            1        false
year    8784h0m0s 168h0m0s           1        false

For downsampling in InfluxDB I want to use more logical aggregation intervals than the ones in the RRD file:

  • 70s -> 60 seconds
  • 500s -> 300 seconds (5 minutes)
  • 2230s -> 1800 seconds (30 minutes)
  • 26350s -> 21600 seconds (6 hours)

These CQs will downsample the data accordingly:

> CREATE CONTINUOUS QUERY "cq_day" ON "collectd" BEGIN SELECT mean(value) as value INTO "collectd"."day".:MEASUREMENT FROM /.*/ GROUP BY time(60s),* END
> CREATE CONTINUOUS QUERY "cq_week" ON "collectd" BEGIN SELECT mean(value) as value INTO "collectd"."week".:MEASUREMENT FROM /.*/ GROUP BY time(300s),* END
> CREATE CONTINUOUS QUERY "cq_month" ON "collectd" BEGIN SELECT mean(value) as value INTO "collectd"."month".:MEASUREMENT FROM /.*/ GROUP BY time(1800s),* END
> CREATE CONTINUOUS QUERY "cq_year" ON "collectd" BEGIN SELECT mean(value) as value INTO "collectd"."year".:MEASUREMENT FROM /.*/ GROUP BY time(21600s),* END

With these CQs and RPs configured you get 5 data streams: autogen (the default), day, week, month and year. To retrieve the aggregated metrics from a specific RP, you have to prefix the measurement in your SELECT query with it. For example, to get the CPU idle metrics in the 10s resolution:

> select * from "cpu_value"
# or
> select * from "autogen"."cpu_value"

To get it in 60s resolution (RP “day”):

> select * from "day"."cpu_value"

This is important to know when creating graphs in Grafana. When you want to show a "month" or "year" graph, you cannot simply do select value from "cpu_value" where type_instance='idle', because you will only get the metrics from the "autogen" RP. You have to specify the RP explicitly.
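For example, a query behind a monthly CPU graph could look like the sketch below, where $timeFilter is the placeholder Grafana substitutes with the selected time range in InfluxDB queries:

SELECT mean("value") FROM "month"."cpu_value" WHERE "type_instance" = 'idle' AND $timeFilter GROUP BY time(1800s)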

Collectd graphs in Grafana

To install Grafana follow the installation guide.

Create a user in InfluxDB that can be used in Grafana to read data from InfluxDB:

> create user grafana with password '<PASSWORD>'
> grant read on collectd to grafana

To get access to the CollectD data in InfluxDB you need to configure a data source in Grafana:

Configure CollectD data source.
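Instead of clicking through the UI, newer Grafana versions can also provision the data source from a YAML file. A minimal sketch (the file location and exact fields may differ per Grafana version), using the grafana user created above:

apiVersion: 1
datasources:
  - name: CollectD
    type: influxdb
    url: http://localhost:8086
    database: collectd
    user: grafana
    secureJsonData:
      password: <PASSWORD>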

Now let's create a graph for the load average, for example.

Select Retention Policy in query

As you can see, you have to explicitly select the RP for the metrics you want to display in the graph. There is no easy way to get the metrics from all RPs at once. This is of course not really convenient: once the graph on your dashboard is configured, you want to be able to change the time range and simply see the data from whichever RP has the metrics at the most detailed resolution. So ideally the RP is selected automatically based on the selected time range.

Luckily more people are having this issue, and Talek found a nice workaround for it.

We can create a variable that executes a query based on the current "From" and "To" time range values in Grafana to find out what the correct RP is. This variable can be refreshed every time the time range changes. The query to find the correct RP runs against the measurement "rp_config", which has a separate RP ("forever") without a duration, so this data never gets deleted.

Configure the extra RP and insert the RP data:

CREATE RETENTION POLICY "forever" ON "collectd" DURATION INF REPLICATION 1
INSERT INTO forever rp_config,idx=1 rp="autogen",start=0i,end=12000000i,interval="10s" -9223372036854775806
INSERT INTO forever rp_config,idx=2 rp="day",start=12000000i,end=86401000i,interval="60s" -9223372036854775806
INSERT INTO forever rp_config,idx=3 rp="week",start=86401000i,end=604801000i,interval="300s" -9223372036854775806
INSERT INTO forever rp_config,idx=4 rp="month",start=604801000i,end=2678401000i,interval="1800s" -9223372036854775806
INSERT INTO forever rp_config,idx=5 rp="year",start=2678401000i,end=31622401000i,interval="21600s" -9223372036854775806

In the start and end times I added one extra second (86400000i -> 86401000i), because I noticed that when selecting, for example, the "Last 24 hours" range in Grafana, $__to - $__from was never exactly 86400000 milliseconds.

Create the variable in Grafana:

Create $rp variable in Grafana
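The query behind the $rp variable is roughly the following sketch: it compares the length of the selected time range ($__to - $__from, in milliseconds) against the start and end fields in rp_config and returns the matching rp value. The exact expression may need tweaking for your InfluxDB version:

SELECT rp FROM "forever"."rp_config" WHERE "start" <= $__to - $__from AND "end" > $__to - $__from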

And use the $rp variable as RP in the queries to create the graph:

Configure $rp in query

There is one caveat with this solution: it only works when the end of the time range is now (the current time), for example by selecting a "Quick range" that starts with "Last ...". The query only looks at how long the time range is, not at whether the RP contains the full time range. I've not been able to achieve this using the available Grafana variables like $__from, $__to and $__timeFilter and the possibilities InfluxQL offers. I tried to adjust the query to do something like select rp from rp_config where $__from > now() - "end", but that is not supported by InfluxDB and returns an empty result.

The effect of this caveat is that when you zoom in on older metrics, the $rp variable will select an RP that no longer contains the data. By changing the $rp variable manually you can see that less detailed metrics are still available in the other RPs. For example:

GIF of different retention policies

Result: Less storage required

I monitor 6 systems with CollectD in my small home setup. After configuring the CollectD clients to send their metrics to InfluxDB and running this setup without RPs and CQs for a couple of weeks, it already required 6 gigabytes of storage. After configuring the RPs and CQs, the CollectD database in InfluxDB uses just 72 MB. The RRD files in my previous setup used ~186 MB for these 6 systems.

Free space (var-lib-influxdb)

Grafana Dashboard available

To make things easy I’ve already created a dashboard that uses the same colors and styling as Collectd Graph Panel. It can be downloaded here: https://grafana.com/dashboards/10179

Grafana: CollectD Graph Panel

Collectd Graph Panel v1

v1 is here. CGP is finished 😆

Joking aside: it has been requested multiple times, so let's get it over with. The last version was released more than 3.5 years ago. This will be the last tagged version of CGP. Every commit in the master branch after this release can be considered a new release. 😉

Use git and “git pull” to keep up-to-date or download the latest version here.

Notable Changes since v0.4.1:

  • mobile support (responsive design)
  • automatic support for all plugins (markup/styling in json)
  • hybrid graph type (canvas graph on detail page, png on the others)
  • svg graph support
  • support for newer PHP versions
  • deprecate support for collectd 4

Special thanks for this version go to Peter Wu for improving security, Manuel Luis for maintaining jsrrdgraph and Vincent Brillault for his many contributions.

GitHub: https://github.com/pommi/CGP
Download: https://github.com/pommi/CGP/archive/master.zip
Git: git clone https://github.com/pommi/CGP.git

Linux bcache SSD caching statistics using collectd

In October 2012 I started using bcache as an SSD caching solution for my Debian Linux server. I’ve been very happy about it so far. Back then I used a manually compiled 3.2 Linux kernel based on the bcache-3.2 git branch provided by Kent (which has been removed). This patch needed to be applied to make bcache work with grsecurity. I also created a Debian package of the bcache-tools userspace tools to be able to create the bcache setup.

At the start of this year I moved to a 3.12 kernel, also manually compiled. It's quite a relief that bcache has been included in mainline since the 3.10 kernel. 🙂

This is my setup:

  1. 500GB backing device – 20GB caching device (qcow2 images)
  2. 1.3TB backing device – 36GB caching device (file storage)

The past year I've definitely noticed the performance difference using bcache. But I was still curious about when and how bcache uses the attached SSD. Is it using the write-back cache a lot? How often can bcache read its data from the SSD cache instead of accessing the HDD?

I created a Python script to collect all kinds of bcache statistics (parts of the code in this script are copied from bcache-status). The script outputs the statistics to STDOUT in a format compatible with the collectd exec plugin. The exec plugin can be configured in collectd.conf this way:

LoadPlugin exec
<Plugin exec>
  Exec "user:group" "/path/to/collectd-bcache"
</Plugin>
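The script prints one PUTVAL line per metric to STDOUT, which is the format the exec plugin consumes. A minimal sketch of the idea in Python, with a simplified identifier and a single statistic (the sysfs path and type name are illustrative; see the full script for the real ones):

#!/usr/bin/env python
import socket
import sys
import time

HOSTNAME = socket.getfqdn()
INTERVAL = 10
# One example bcache statistic exposed via sysfs (assumes a bcache0 device).
STAT = '/sys/block/bcache0/bcache/stats_five_minute/cache_hit_ratio'

while True:
    with open(STAT) as f:
        ratio = f.read().strip()
    # PUTVAL "<host>/<plugin>-<instance>/<type>-<instance>" interval=<s> N:<value>
    # "N" tells collectd to use the current timestamp.
    print('PUTVAL "%s/bcache-bcache0/percent-cache_hit_ratio" interval=%d N:%s'
          % (HOSTNAME, INTERVAL, ratio))
    sys.stdout.flush()  # the exec plugin reads lines as they come; don't buffer
    time.sleep(INTERVAL)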

To visualize the collected data I created a bcache plugin for CGP. This is the result:

bcache-cache-hit-ratio bcache-access bcache-usage bcache-bypassed

Write-back to HDD throttled

At some point I noticed that, in my case, flushing data from the write-back cache to the HDD was somehow rate-limited to ~3 MB/s. You can see this nicely in these graphs:

bcache-dirty-data bcache-throughput

These threads on the bcache mailing list mention the same thing:

Kent explained that this is managed by the PD controller in bcache. The PD controller has been rewritten in the 3.13 Linux kernel, so I'm very interested to see whether this behavior has changed. I haven't upgraded my kernel to 3.13 yet, because I'm very cautious about it; a lot of development is still going on in the bcache project. But I'm looking forward to upgrading to 3.13, 3.14 or probably 3.15.

collectd-bcache: github.com/pommi/collectd-bcache
CGP bcache plugin: github.com/pommi/CGP/…/bcache.json

Collectd Graph Panel v0.4

After 2.5 years and about 100 commits I've tagged version 0.4 of Collectd Graph Panel.

This version includes a new interface with a sidebar for plugin selection.

The javascript library jsrrdgraph has been integrated. Graphs will be rendered in the browser using javascript and HTML5 canvas when the "graph_type" configuration option is set to "canvas". This saves a lot of CPU power on the server. Jsrrdgraph has some nice features: when a graph is rendered, you can move through time by dragging it left and right, and zoom in and out by scrolling on it.

Demo:

The Collectd compatibility setting has been changed to Collectd 5. If you're still using Collectd 4, please set the "version" configuration setting to "4", otherwise the graphs of a couple of plugins (like interface, df and users) won't be shown correctly.
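Both settings live in CGP's configuration; a sketch, assuming you override the defaults in conf/config.local.php:

<?php
$CONF['graph_type'] = 'canvas'; // render graphs in the browser
$CONF['version'] = 4;           // only when still running Collectd 4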

In this version of CGP, total values have been added to the legend of I/O graphs, and colors are generated using a rainbow palette instead of 9 predefined colors. Please read the changelog or git log for more information about the changes.

New plugins:

Special thanks for this version go to Manuel Luis, who developed jsrrdgraph, xian310 for the new interface, Manuel CISSÉ, Rohit Bhute, Matthias Viehweger, Erik Grinaker, Peter Chiochetti, Karol Nowacki, Aurélien Rougemont, Benjamin Dupuis, yur, Philipp Hellmich, Jonathan Huot, Neptune Ning and Nikoli for their contributions.

I've been using GitHub for a while now. You can download or check out this version of CGP from the GitHub URLs below. If you have improvements or fixes for CGP, don't hesitate to send in a Pull Request on GitHub!

v0.4.1 update

I just removed the dependency on mod_rewrite when using jsrrdgraph to draw the graphs. This may solve the javascript error: Invalid RRD: "Wrong magic id." undefined.

GitHub: https://github.com/pommi/CGP
Download: https://github.com/pommi/CGP/archive/v0.4.1.zip
Git: git clone https://github.com/pommi/CGP.git

Collectd Graph Panel v0.3

It has been a while and many people were already using one of the development versions of CGP. Time to release a new version of CGP: v0.3.

Special thanks for this version go to Manuel CISSÉ, Jakob Haufe, Edmondo Tommasina, Michael Stapelberg, Julien Rottenberg and Tom Gallacher for sending me patches to improve CGP. Also thanks to everyone that replied after the previous release.

In this version there are a couple of minor improvements and some new settings to configure. But most importantly: support has been added for 13 more Collectd plugins, bringing the total to 26. A couple of existing plugins have also been updated. Please read the changelog or git log for more information about the changes.

New plugins:

Download the .tgz package, the patch file to upgrade from v0.2 or checkout the latest version from the git repository.

Download: http://pommi.nethuis.nl/…/cgp-0.3.tgz
Patch: http://pommi.nethuis.nl/…/cgp-0.2-0.3.patch.gz
Git: http://git.nethuis.nl/pub/cgp.git

Collectd Graph Panel v0.2

Version 0.2 of Collectd Graph Panel (CGP) has been released. This version has some interesting new features and changes since version 0.1.

A new interface is introduced, based on Daniel Von Fange's styling. A little bit of web 2.0 ajax is used to expand and collapse plugin information on the host page. The width and height of a graph are configurable, including the bigger detailed one. UnixSock flush support has been added, so collectd can be told to write cached data to the RRD file before an up-to-date graph is generated. CPU support for the Linux 2.4 kernel and Swap I/O support have been added. And some changes under the hood.

Download the .tgz package, the patch file to upgrade from v0.1 or checkout the latest version from my public git repository.

CGP Overview Page CGP Server Page CGP Detailed Page

Download: http://pommi.nethuis.nl/…/cgp-0.2.tgz
Patch: http://pommi.nethuis.nl/…/cgp-0.1-0.2.patch.gz
Git: http://git.nethuis.nl/pub/cgp.git

Collectd Graph Panel v0.1

Collectd Graph Panel (CGP) is a graphical web front-end for Collectd written in PHP. I developed CGP because I wasn't really happy with the existing web front-ends. What I like to see in a front-end for Collectd is a clear overview of all the hosts you are managing and an easy way to get detailed information about a plugin.

This first release is a very basic version of what I have in mind. Not all plugins are supported yet and the user interface can be done better. Still a lot to work on!

Please give it a try! It should work out of the box. Download the tgz package or check out the latest version from my public git repository.

Download: http://pommi.nethuis.nl/…
Git: http://git.nethuis.nl/pub/cgp.git