MongoDB slow operations profiler and visualizer

This java web application collects and stores slow operations from one or more mongoDB system(s) in order to visualize and analyze them. Since v2.0.0 it may be easily extended to an administration tool by implementing commands to be executed against the configured database system(s). The initial version of the software has been presented during the MongoDB User Group Berlin on 4th of June 2013. Slides of the presentation can be found here. Version 2.4 has been presented in November 2018 at the Percona Live Europe Conference.

The following first screenshot demonstrates how slow operations are visualized in the diagram: The higher a point or circle on the y-axis, the slower was the execution time of this operation. The greater the diameter of the circle, the more slow operations of this type were executed at this time.

You can zoom-in by drawing a rectangle with the mouse around the area you are interested in. I suggest to zoom-in first horizontally, and then zoom-in vertically. Press Shift + drag mouse in order to move the visible area around. Double click returns to the initial viewport.

While the mouse hovers over the diagram, the corresponding slow operations are shown in bold and details of them are displayed on the right-hand side legend. The different colors in the legend are just to better distinguish the different entries and have no further meaning.

The more "Group by" checkboxes are checked, the more details constitute slow operation types, thus the more details you'll see for each slow operation type.

Example

For example, the following screenshot shows that only one database has been filtered, defined in the search form by its label, server addresses, replica set and database name for a specific time period. Since the slow operations are defined in the search form to be grouped by their collections, operations, queried and sorted fields, the legend below on the right shows these 4 details grouped together for the time period which is being hovered over by the mouse, here at 00:03 o'clock. As the the resolution is set to Minute, all slow operations occurred during within the minute hovered over by the mouse, here from 00:03 until 00:04 o'clock, are shown on the right-hand side in the legend.

In this example, from 23:58 to 00:06 o'clock occurred 5 different slow operation types which are printed in different colors in both the diagram and the legend. As the legend is sorted by y-axis, thus execution time in milliseconds, the slowest operation type is shown first. Instead, you may select "sort legend by:" count-value to see in the legend the most profiled slow operation types first (which correspond to the biggest circles).

In this screenshot, the slowest operation type at 00:03 o'clock happened on collection apiOfferlist by executing a query on both fields _id.productId and _id.siteId. This slow operation type occurred 112 times at this precise minute, its minimum duration was 2 ms, maximum 47 ms, average 10 ms and the sum of all these queries was 1.115 ms.

Below you see how many documents (min, max, avg, sum, stdDev) were returned by this operation. And last but not least you have some metrics about how many index keys were read (rKeys), how many documents were read (rDocs) and written (wDocs) and also that no in-memory sort (memSort) had to be done (no sort at all in this case since no sort field was defined).

The second slowest operation type in this screenshot is a getmore operation. Since v2.9.0 the application shows also the originating query of the getmoreoperation. In this example, the query logically combined both fields _id.productIdand parentProductId by OR and its result was logically combined by AND with the field _id.siteId.

In general, multiple fields belonging to an operator expression are enclosed in square brackets [] and the last field is suffixed by .$operator e.g. [a, b.$or]. If the operator applies to only one field, square brackets are omitted, e.g. a.$gt.

The metrics about execution times, returned documents, read and written documents and/or index keys are to read in the same manner as above.

Example screenshot of the analysis page

Screenshot

Summarized table

Since v1.0.3 the user analysis page, besides the diagram, has also a table of all selected slow operation types. The table is filterable. Columns are sortable and selectable to be hidden or shown. Wherever it makes sense, the sum of the values of a column is displayed at the bottom of the column.

For example, to see the most expensive slow operations first, just sort descending by column Sum ms.

Here is a reduced screenshot of the table without any filter on rows because the Search textbox is empty. However not all columns are shown. You can toggle a column to be shown or hidden by clicking its name in the table header. The table here is sorted by column Sum ms, so we see in the first row the most expensive slow operation type: either there were many of this type or they were generally slow.

Let's interpret the first row to get you familiar with. In the column Group you see the attributes you've grouped by (selected in the search form above). Here again, the slow ops occurred in the collection (col) "apiOfferlist". The slow operations (op) were "updates" and the field (fields) to select the documents to be modified was an _id field having a sub-document querying the 3 fields productId, siteId and segments. Sub-documents being queried on more than one field are enclosed by curly brackets {}. However, if only one sub-document field was queried, dot-notation is preferably used.

Since v2.13.0, documents of profiled slow operations are stored in the collector database in order to show them as an example for any given query shape which may help to understand the output of more complex queries better. So, the column Group in the data table adds for each slow operation type, an example link to see an original document from the system.profile collection as an example of a profiled slow operation. The example matches the query shape of the given slow operation type as long as the checkboxes Operation, Queried fields, Sorted fields and Projected fields (which are defining a query shape) have been selected in the Group by section of the search form. At least Operation needs to be selected in the Group by section to show the link to an example document. However, if only Operation is selected, an example only exists if it had the same Operation and no Queried fields, Sorted fieldsand Projected fields. So, besides selecting the checkbox Operation, you should also tick checkboxes such as Queried fields, Sorted fields or Projected fields in the Group by section to get an existing slow operation example document which matches the query shape.

The rest of the columns in the data table should be self-explanatory. High values in the column ms/ret (and vice versa low values in column ret/ms) may indicate a performance problem for operations which return documents, because these columns show the time in ms needed to return 1 document respectively the number of documents returned within 1 ms.

Screenshot

Since version 2.9.0 the analysis page shows additional metrics about:

If rKeys is 0 then no index has been used, resulting in a collection scan. In this case you should consider adding an index. If rKeysis much higher than the number of returned documents (column Sum ret), the database is scanning many index keys to find the result documents. Consider creating or adjusting indexes to improve query performance.

If rDocs is 0 then the query was "covered" which means that all information needed could be retrieved from the index, thus no document had to be fetched into memory. For write operations, if rDocsis higher than wDocsthen some documents didn't need to be (re)written because they were already up-to-date.

The value of wDocs is the sum of how many documents have been deleted, inserted or modified by the corresponding slow operation type. This being said, you may tick Operation in your Group by settings to know the number of written docs for the corresponding write-operation type (i.e. remove, insert, update).

The value of memSortis true if no index could be used to sort the documents. In this case you should consider adding or adjusting indexes so that no in-memory sort is needed anymore.

Application status page

Since v1.0.3 there is also a page to show the application status. Besides showing the status of the collector, means where and how many slow operations have been collected (read and written) since application restart, it shows also every registered database in a table. Since profiling works per database, each database to be profiled is in one row.

The table is filterable. Columns are sortable and selectable to be hidden or shown. Wherever it makes sense, the sum of the values of a column is displayed at the bottom of the column. The table is by default sorted by the columns DBS Label, ReplSet and Status which gives a very good overview over a whole bunch of clusters. Hint: Hold shift key pressed while clicking the column headers in order to sort multiple columns.

Here is a reduced screenshot of some first rows of the table, ordered by columns DBS Label, Status and Database, with a filter "datastore" applied on rows:

Screenshot

At its right side, the table has a bunch of time slot columns (10 sec, 1 min, 10 min, 30 min, 1 hour, 12 hours, 1 day). These show the number of slow operations collected during these last time periods, so you can see already here which databases may behave abnormally. In such case, you may either analyse those databases or switching off collecting or lower their slowMs threshold in order to profile less slow operations.

Actions

Since v2.0.2, an Actions panel is shown always on top in order to choose actions that can be executed against the databases of selected row(s). Both refresh and analyse actions were implemented already before v2.0.0. refresh gets and shows the latest data of the selected database(s). analyse opens the above mentionned analysis page to show the slow operation types of the last 24 hours of the selected node(s) respectively database(s). Both collecting start/stop and set slowMs were also already implemented before but since v2.0.0 they are only shown to authorized users. "Authorized users" are users who used the url parameter adminToken set to the right value (see below under "configuration" for more details).

Since v2.0.0. you may execute commands against the selected database system(s). Since v2.0.3 you can choose whether the command has to run against the corresponding database system (i.e. mongoS-router) or against the individually selected nodes (i.e. mongoD). The difference is that the command will run either against the entry point of the database system (i.e. router or primary) or against all selected nodes wich may be secondaries as well. Current implemented commands are:

The command result is shown in a new page in a filterable table. Columns are sortable as well, so you can detect immediately outliers. Hint: Hold shift key pressed while clicking the column headers in order to sort multiple columns.

Here is a cutout of a screenshot showing the result of the current-operation command. The table is sorted by column "secs running" in order to see slow operations first.

Screenshot

Implementing new commands is quite easy: just create a new java class which implements the interface de.idealo.mongodb.slowops.command.ICommand. The interface has only 2 methods in order to execute the database command and to transform the result to a corresponding table structure.

This being said, from v2.0.0 on, the webapp may be extended from a pure monitoring and analyzing tool to an administration tool.

Dynamic configurations

Since v1.2.0, authorized users may dynamically upload new configurations in order to add, remove or change databases to be registered respectively to be profiled. The configuration of the collector writer may also be changed. "Authorized users" are users, who used the url parameter adminToken set to the right value (see Configuration below for more details). The uploaded config is not persisted server side and will be lost upon webapp restart. All servers of changed "profiled"-entries are (re)started. Also the collector needs to be restarted if its config changed. Even though stops and starts are executed simultaneously, it may take some time depending on how many changes need to be applied, thus how many readers, respectively the writer, are involved by the config change.

Setup

Preconditions

Either:

  1. java 1.8 or newer
  2. maven 2.0 or newer
  3. mongoDB 2.0 or newer

Or:

  1. Docker

Starting up

Starting up by using Docker

  1. Download both files Dockerfile and docker-compose.yaml from github:
    • curl -O https://raw.githubusercontent.com/idealo/mongodb-slow-operations-profiler/master/Dockerfile
    • curl -O https://raw.githubusercontent.com/idealo/mongodb-slow-operations-profiler/master/docker-compose.yaml
  2. Being in the folder of both downloaded files, spin up the docker containers by issuing the command:
    • docker-compose up -d
  3. The application can be accessed through a web browser by the URL:

  4. To visualize and analyze slow operations either select one or more entries and click "analyse" or use the following URL:
Docker does automatically the following:
Some helpful Docker commands:

Be aware that neither both mongoD instances (collector-db and test-db) nor the web server (profiler-webapp) are secured. This means that mongodb can be accessed without authentication from within their containers. Also SSL/TLS is not enabled.

Starting up by having already installed git, java, maven and mongodb

  1. Clone the project:
    • git clone https://github.com/idealo/mongodb-slow-operations-profiler.git
  2. Enter the server addresses, database and collection names in file "mongodb-slow-operations-profiler/src/main/resources/config.json" (see Configuration below)
  3. While being in the in the project folder "mongodb-slow-operations-profiler/", build a war file by executing in a shell:
    • mvn package
  4. Deploy the resulted war file mongodb-slow-operations-profiler.war on a java webserver (e.g. tomcat). Dependent on the above mentionned config.json, it may automatically start collecting slow operations. If no slow operations exist yet on the mongoD's, the collector(s) will sleep 1 hour before retrying.
  5. The application can be accessed through a web browser by the URL:
  6. To visualize and analyze slow operations either select one or more entries and click "analyse" or use the following URL:

Configuration

The application is configured by the file "mongodb-slow-operations-profiler/src/main/resources/config.json". It's a json formatted file and looks like this:

{
  "collector":{
    "hosts":["myCollectorHost_member1:27017",
             "myCollectorHost_member2:27017",
             "myCollectorHost_member3:27017"],
    "db":"profiling",
    "collection":"slowops",
    "adminUser":"",
    "adminPw":"",
    "ssl":false
  },
  "profiled":[
    { "enabled":false,
      "label":"dbs foo",
      "hosts":["someHost1:27017",
               "someHost2:27017",
               "someHost3:27017"],
      "ns":["someDatabase.someCollection", "anotherDatabase.anotherCollection"],
      "adminUser":"",
      "adminPw":"",
      "ssl":false,
      "slowMS":250,
      "responseTimeoutInMs":2000
    },
    { "enabled": false,
      "label":"dbs bar",
      "hosts":["someMongoRouter:27017"],
      "ns":["someDatabase.someCollection", "anotherDatabase.*", "*.myCollection", "!excludedDb"],
      "adminUser":"",
      "adminPw":"",
      "ssl":false,
      "slowMS":250,
      "responseTimeoutInMs":2000
    }
  ],
  "yAxisScale":"milliseconds",
  "adminToken":"mySecureAdminToken",
  "defaultSlowMS":100,
  "defaultResponseTimeoutInMs":2000,
  "defaultExcludedDBs": ["admin", "local", "config"],
  "maxWeblogEntries":100
}

This example configuration defines first the collector running as a replica set consisting of 3 members on hosts "myCollectorHost_member[1|2|3]" on port 27017, using the collection "slowops" of database "profiling". Both adminUser and adminPw are empty because the mongodb instance runs without authentication. If mongoD runs with authentication, the user must exist for the admin database with role "root".

After the definition of the collector follow the databases to be profiled. In this example, there are only two entries. However, keep in mind that the application will resolve all members of a replica set (even if only 1 member has been defined) respectively all shards and its replica set members of a whole mongodb cluster.

Fields of profiled entries explained:

The field yAxisScale is to be set either to the value "milliseconds" or "seconds". It defines the scale of the y-axis in the diagram of the analysis page.

In v2.0.0 the field adminToken has been introduced to restrict access to administrative functionalities i.e. stop/start of collecting slow operations, setting the threshold slowMs, seeing the currently used configuration or uploading a new configuration. To grant access to these functionalities, add the parameter adminToken= followed by your configured value, i.e. mySecureAdminToken, to the URL of the application status page, i.e. http://your-server:your-port/mongodb-slow-operations-profiler/app?adminToken=mySecureAdminToken.

In v2.4.0 some new options have been introduced:

In v2.11.0 a new option has been introduced:

Version history

Third party libraries

License

This software is licensed under AGPL 3.0. For details about the license, please see file "LICENSE", located in the same folder as this "README.md" file.