Manticore May Start Before MongoDB

We’ve had reports that in Kolab 16, when the system restarts, the Manticore service may be started before the MongoDB service has become functionally available. When that happens, Manticore fails to start and remains unavailable after the restart.

This article is about how we are going to resolve this properly:

Make Manticore Depend on MongoDB

This would instruct systemd to hold off starting Manticore until the MongoDB service has been started. Requires= only pulls in the MongoDB unit; the ordering itself comes from listing it in After=. This is a valid resolution, but it locks Manticore and MongoDB into running on the same node, which we would like to avoid where we can. The corresponding snippet of the systemd unit file would look as follows:

[Unit]
Description=Collaborative Editing for ODF Documents
After=network.target mongod.service
Requires=mongod.service

Make Manticore Not Fail

Currently, Manticore fails (fatally) when MongoDB is not available during its startup. This seems improper, and leaves us wondering what happens when the connection to MongoDB is broken while Manticore is running (and perhaps restored a few moments thereafter).

Avoiding the fatal error is a development effort encapsulated in T981.

Delay the Startup of Manticore

In order to give MongoDB a chance to become functionally available (as opposed to merely having had its process launched), we could choose to delay the start-up by some number of seconds:

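[Service]
# $number is a placeholder for the delay in seconds; older systemd versions need an absolute path here
ExecStartPre=/bin/sleep $number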

This is definitely an ugly workaround and literally just guesses whether or not MongoDB has had a chance to start up yet.

A better approach would be to test the functionality, for example with a mongo command-line invocation that returns an appropriate exit code. However, the setting for where the mongod servers live (likely not on localhost) is encapsulated in a JavaScript file. It could be set, retrieved and used via an environment variable, but we’d like to avoid having consumers edit (or better: copy off and edit the copy of) systemd unit files.
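To illustrate what such a test could look like, here is a minimal Python sketch that checks whether anything accepts TCP connections at the MongoDB address and maps the result to an exit code. It is not part of Manticore or Kolab; the function names are hypothetical, and a successful TCP connection is a weaker signal than a real query, so treat it as an approximation of "functionally available".

```python
import socket


def mongodb_reachable(host, port=27017, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe_exit_code(host, port=27017):
    """Map the probe result to a conventional exit code (0 = success)."""
    return 0 if mongodb_reachable(host, port) else 1
```

A small wrapper script could pass the result of probe_exit_code() to sys.exit(), so that the calling process sees a non-zero exit code for as long as MongoDB cannot be reached. Where the host name comes from is left open here, for the reasons given above.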

Delay the Restart of a Failed Manticore

The systemd unit file can use a setting that delays the restart of a failed Manticore. This only has an effect if the unit also enables restarts through Restart=. With the delay in place, a restart attempt is still likely to be made (within the limit on the maximum number of restarts) after the MongoDB service has finally become available:

[Service]
RestartSec=60

This would imply, however, that should the Manticore service fail at any point, it will automatically be unavailable for at least a minute.
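The interaction between the restart delay and the restart limit can be made concrete with a small Python model. This is a simplified sketch, not systemd's actual implementation: it assumes a limit of 5 starts per 10 seconds (believed to be the usual systemd defaults, though distributions may differ) and a service that fails a fixed time after each start. The function and parameter names are hypothetical.

```python
def refused_attempt(restart_sec, fail_after_sec, burst=5, interval_sec=10.0,
                    max_attempts=100):
    """Return the number of the first start attempt that the rate limit
    would refuse, or None if the limit is never hit within max_attempts."""
    start_times = []
    now = 0.0
    for attempt in range(1, max_attempts + 1):
        # Count the starts that still fall inside the rate-limit window
        recent = [t for t in start_times if now - t < interval_sec]
        if len(recent) >= burst:
            return attempt
        start_times.append(now)
        # The service runs briefly, fails, and is restarted after the delay
        now += fail_after_sec + restart_sec
    return None
```

In this model, a service that fails half a second after starting and is restarted after 0.1 seconds exhausts the limit on its sixth attempt, whereas with a delay of 60 seconds the attempts are spaced so far apart that the limit is never reached.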

Add a Timeout to the Start-up

Should the aforementioned T981 be implemented and resolved successfully, we will likely add a time-out to the start-up routine, so that an appropriate number of error messages is logged should Manticore fail to start up while polling for MongoDB to become available.

This would ensure systems give administrators the necessary verbosity about what’s going on:

[Service]
TimeoutStartSec=60

Should Manticore fail to start up completely within a time window of 60 seconds, errors would be logged and the restart timer would kick in to try again.
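The polling behaviour described here can be sketched in a few lines of Python. This is an illustration of the general pattern only, not Manticore's code (which T981 would have to provide): the function name, the logger name and the check callable are all hypothetical, and the in-application deadline would presumably need to stay below TimeoutStartSec so that the service gets to log its own failure first.

```python
import logging
import time

log = logging.getLogger("startup")


def wait_until_available(check, timeout=60.0, interval=2.0):
    """Poll check() until it returns True or timeout seconds have passed.

    Returns True on success and False once the deadline has expired.
    """
    deadline = time.monotonic() + timeout
    attempt = 0
    while True:
        attempt += 1
        if check():
            log.info("MongoDB available after %d attempt(s)", attempt)
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            log.error("MongoDB unavailable after %d attempt(s), giving up",
                      attempt)
            return False
        log.warning("MongoDB not available (attempt %d), retrying", attempt)
        # Never sleep past the deadline
        time.sleep(min(interval, remaining))
```

The check argument is any callable that returns True once MongoDB answers, so the routine logs one warning per failed attempt and a single error when it gives up, at which point the service can exit and leave the retry to systemd.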
