Hi everyone!
TL;DR
We (Martin Gauk and I) would like to integrate something like a generalized monitoring subsystem into Moodle and specifically build an endpoint for Prometheus to collect various Moodle specific metrics.
Background
At the TU Berlin we run a relatively large Moodle instance (~35k students). At this scale, some degree of proper monitoring is almost a requirement. We use Prometheus + Grafana to monitor various systems with e.g. dashboards for our Matrix Synapse instance or general system resource monitoring (see Node Exporter).
But we quickly noticed that there are Moodle specific metrics that we would like to monitor more closely. So I added a crude implementation of very basic monitoring capabilities with a simple Prometheus endpoint into our fork of Moodle.
Problem
As far as we know, Moodle currently does not offer an API to define metrics and expose them to monitoring tools like Prometheus. With so many moving parts and the customizability (e.g. via plugins) that Moodle provides, admins may be missing out on a lot of potential insights into how their Moodle instance is operating. A simple, custom Prometheus endpoint like the one we have now is not too difficult to write, but it is a band-aid fix.
Solution
We envision a general API for monitoring Moodle.
The following features are not all necessary or equally important in our opinion. Some are definitely worth discussing.
- Monitoring API that is backend-agnostic, i.e. not just for Prometheus.
- Admins can easily define and expose their own metrics and their labels.
- Some sensible default metrics are pre-configured.
- Metrics can be classified into at least two categories: 1) those expected to change frequently and thus need to be collected frequently, and 2) those that do not. This tiered approach might reduce the load caused by the metric collection system.
- Endpoints for pull-based systems like Prometheus utilize the new Routes function of Moodle and are secured with a simple token.
- Reference implementation for the monitoring API for Prometheus.
- Optionally a simple built-in dashboard available in Moodle itself (using the included chart.js) so that smaller instances do not need to install their own visualization software like Grafana.
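To make the tiered collection idea from the list above a bit more concrete, here is a minimal sketch. It is written in Python purely for illustration (a real implementation would of course be PHP inside Moodle), and every name in it (Metric, Registry, ttl_seconds, snapshot) is hypothetical and not an existing Moodle API. The assumption is that each metric declares how long its value may be reused, so expensive metrics are not recomputed on every scrape.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Metric:
    """Hypothetical metric definition; not an existing Moodle class."""
    name: str
    help_text: str
    collect: Callable[[], float]  # callback that computes the current value
    ttl_seconds: float            # 0 means "recompute on every scrape"


class Registry:
    """Holds metric definitions and caches values according to their tier."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._metrics: Dict[str, Metric] = {}
        self._cache: Dict[str, Tuple[float, float]] = {}  # name -> (time, value)
        self._clock = clock

    def register(self, metric: Metric) -> None:
        if metric.name in self._metrics:
            raise ValueError(f"duplicate metric: {metric.name}")
        self._metrics[metric.name] = metric

    def snapshot(self) -> Dict[str, float]:
        """Return all values, recomputing only those whose TTL has expired."""
        now = self._clock()
        values: Dict[str, float] = {}
        for name, metric in self._metrics.items():
            cached = self._cache.get(name)
            if cached is None or now - cached[0] >= metric.ttl_seconds:
                cached = (now, metric.collect())
                self._cache[name] = cached
            values[name] = cached[1]
        return values
```

A frequently changing metric (e.g. users online) would be registered with ttl_seconds=0, while a slow-moving one (e.g. total number of courses) could use a TTL of several minutes. Whether the cache lives in memory, in MUC, or is filled by a scheduled task is one of the things that would need discussing.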
Approach(es)
How exactly this should be integrated into Moodle is up for debate and something we have not even decided amongst ourselves yet.
We identified three fundamentally different approaches that all seem to make sense and have their own pros and cons.
A) Proper Monitoring subsystem
The monitoring concept itself should be generalizable enough to allow us to create an API that is agnostic to the software used and can therefore be extended with plugins that implement it for a specific tool like Prometheus. This seems cleanest and champions extensibility, but may also prove most challenging as general solutions tend to be, especially if I am wrong and monitoring is conceptually so different among different systems that a general API loses its meaning.
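As a rough illustration of what "backend-agnostic" could mean in approach A, the following sketch separates collected samples from the way they are rendered. It is Python for illustration only, and all names (Sample, Backend, PrometheusBackend, JsonBackend, the moodle_users_online metric) are made up for this example. The Prometheus output follows the standard text exposition format (HELP and TYPE comment lines followed by samples), and for simplicity every metric is treated as a gauge.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Sample:
    """One measured value; hypothetical structure for this sketch."""
    name: str
    value: float
    help_text: str = ""
    labels: Dict[str, str] = field(default_factory=dict)


class Backend(ABC):
    """What a monitoring plugin would implement (assumed interface)."""

    @abstractmethod
    def render(self, samples: List[Sample]) -> str:
        ...


def _escape_label(value: str) -> str:
    # Label values escape backslash, double quote and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


class PrometheusBackend(Backend):
    """Renders samples in the Prometheus text exposition format."""

    def render(self, samples: List[Sample]) -> str:
        # Assumes samples with the same name are passed in one contiguous group.
        lines: List[str] = []
        seen = set()
        for s in samples:
            if s.name not in seen:
                seen.add(s.name)
                help_text = s.help_text.replace("\\", "\\\\").replace("\n", "\\n")
                lines.append(f"# HELP {s.name} {help_text}")
                lines.append(f"# TYPE {s.name} gauge")
            labels = ",".join(
                f'{k}="{_escape_label(v)}"' for k, v in sorted(s.labels.items())
            )
            suffix = f"{{{labels}}}" if labels else ""
            lines.append(f"{s.name}{suffix} {s.value}")
        return "\n".join(lines) + "\n"


class JsonBackend(Backend):
    """A second, trivial backend to show that the samples are tool-neutral."""

    def render(self, samples: List[Sample]) -> str:
        return json.dumps(
            [{"name": s.name, "labels": s.labels, "value": s.value} for s in samples]
        )
```

If monitoring systems turn out to differ more than this (push vs. pull, histograms, counters with reset semantics), the shared Sample structure is exactly where a general API would start to strain, which is the risk mentioned above.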
B) Monitoring is just lib/ code
Neither a plugin nor a subsystem: a more specific monitoring approach targeted at selected systems lives in core, provides functionality for the most common use cases, and offers a limited degree of customizability. The upside is a (probably?) more straightforward implementation.
C) Just some monitoring classes, but endpoints live under admin/tool/ with a default implementation shipped with Moodle
Default could be Prometheus for example. Other plugins could add theirs as admin tools.
Just a plugin?
There is of course also the option of just having a plugin (e.g. a local_ plugin) for a concrete monitoring use case. But we exclude this here because this obviously does not affect the Moodle core and is therefore not something we need to discuss at this event.
Again, we welcome a discussion about these and potentially other ways to implement monitoring for Moodle at the event. (Provided of course that anybody else is even interested in the topic of monitoring Moodle.)
Practical use cases/example metrics
- Users online in the last 1 minute, 5 minutes, 1 hour...
- Various course statistics (total number, hidden, enrollments...)
- Quiz attempts currently in progress.
- Page load times.
- Number of running tasks total (ad-hoc and scheduled).
- Number of overdue tasks.
- Cache-specific metrics, e.g. the ratio of hits to misses for the Redis cache or the total number of cache keys.
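Two of the metrics above are simple enough to sketch as plain calculations. The snippet below is illustrative Python, not Moodle code; the function names, the window labels and the idea of feeding in a list of last-access timestamps are assumptions made for the example. In Moodle the input would come from the database or the cache store rather than from in-memory lists.

```python
from typing import Dict, Iterable

# Hypothetical window labels mapped to their length in seconds.
WINDOWS: Dict[str, int] = {"1m": 60, "5m": 300, "1h": 3600}


def users_online(
    last_access: Iterable[int], now: int, windows: Dict[str, int] = WINDOWS
) -> Dict[str, int]:
    """Count users whose last access falls within each time window."""
    stamps = list(last_access)
    return {
        label: sum(1 for t in stamps if 0 <= now - t <= seconds)
        for label, seconds in windows.items()
    }


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Share of cache lookups that were hits; 0.0 if there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0
```

Each window would then map naturally to one label value on a single metric, rather than to three separately named metrics.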
Assuming a decision on which route to take can be made in a reasonable amount of time, two days' time should be enough to have at least a proof of concept and a working demo.
Hoping that this resonates with some of you devs/sysadmins, looking forward to discussing this with you.
Sincerely,
Daniel