ARM auto recover mode
In ARM auto recover mode, if a physical storage (say A1) becomes temporarily unavailable, archiving continues as long as another physical storage is configured in the corresponding realm. As soon as A1 becomes available again, archiving into A1 resumes as well. The documents that were archived during the outage, and are therefore missing in A1, are automatically resynchronized into A1. This usually happens at night, when the system load is low.
The auto recover process is triggered and controlled by a process on the application server. In a clustered ImageMaster application server environment with multiple active ImageMaster domains it is crucial to enable the auto recover feature on one application server at most (see “Auto recover settings” in the installation manual for details [IM ImageMaster]).
Example auto recover mode
A realm “realmB” has two archives B1 and B2. The minCopy parameter of realmB is set to 1. When archiving a document with attachment into realmB, the attachment is written to B1 and B2.
If, for instance, writing the attachment to B2 fails while writing it to B1 succeeds, archiving the document with attachment still succeeds because minCopy is 1. In addition, each realm has a parameter “Enable Auto Recovery”, which can be enabled or disabled, for instance via the AdminClient. If writing an attachment to B2 fails and the auto recover parameter is set to true, ImageMaster writes an extra recovery entry to its database. These recovery entries are later used to restore the missing attachments in B2 automatically from another archive (B1 in this case). To set up realmB for auto recover mode, it is therefore necessary to set the auto recover parameter of realmB to true.
In summary, realmB has two archives B1 and B2, minCopy is set to 1 for realmB and the auto recover flag is set to true for realmB.
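The interplay of minCopy and the auto recover flag can be illustrated with a minimal Python sketch. The class Realm, the function archive_attachment and the in-memory list of recovery entries are hypothetical and are not part of ImageMaster. The sketch also assumes that recovery entries are only written when the archive request as a whole succeeds.

```python
from dataclasses import dataclass, field


@dataclass
class Realm:
    """Hypothetical model of a realm; not an ImageMaster class."""
    name: str
    archives: list
    min_copy: int = 1
    auto_recover: bool = False
    recovery_entries: list = field(default_factory=list)


def archive_attachment(realm, attachment_id, reachable):
    """Return True if at least min_copy archives accepted the attachment.

    `reachable` is the set of archive names that currently accept writes.
    """
    written = [a for a in realm.archives if a in reachable]
    failed = [a for a in realm.archives if a not in reachable]
    success = len(written) >= realm.min_copy
    if success and realm.auto_recover:
        # Assumption: one recovery entry per archive that missed the attachment.
        realm.recovery_entries.extend((attachment_id, a) for a in failed)
    return success


realm_b = Realm("realmB", ["B1", "B2"], min_copy=1, auto_recover=True)
archive_attachment(realm_b, "doc-1", reachable={"B1"})  # B2 is offline
print(realm_b.recovery_entries)
```

With B2 offline, the request still succeeds and a single entry for B2 is recorded. With minCopy set to 2, the same request would fail in this sketch.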
Example continued: normal operation of the system and automatic handling of a temporary outage
In normal operations, an attachment of a document being archived is written into B1 and B2 successfully.
If archive B2 is temporarily offline due to a network outage, archiving an attachment to realmB still succeeds: the attachment is written to B1, while the write to B2 fails. As the auto recover flag is enabled for realmB, a recovery entry for this attachment and archive B2 is written to the ImageMaster database. In addition, the archive unavailability mechanism sets B2 to “temporarily unavailable” immediately (assuming that the unavailableTimeout is set to 0). Furthermore, assuming that the OnlineCheckingInterval parameter for B2 is set to 10, B2 stays temporarily unavailable for 10 minutes, even if it physically becomes available earlier.
Now assume that after 5 minutes the network route to B2 is reestablished. ImageMaster will continue to archive attachments for realmB only into archive B1 as archive B2 is still set to temporarily unavailable for another 5 minutes (because so far only 5 minutes of the total of 10 minutes have elapsed). During these 5 minutes the system continues writing recovery entries into the database for archive B2.
After these additional 5 minutes, when the total interval of 10 minutes has elapsed, the archive B2 is reset to “available” automatically by ImageMaster. Subsequent archive requests will write their attachments both in archive B1 and in archive B2; no further auto recover entries are written. The normal operation of the ImageMaster system has resumed after 10 minutes in this example.
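The timeline of this example can be reproduced with a small Python sketch. The function names and the dictionary of marked archives are hypothetical. The date is arbitrary, and treating the archive as available again exactly when the interval has elapsed is an assumption.

```python
from datetime import datetime, timedelta


def is_temporarily_unavailable(now, marked_at, online_checking_interval_min=10):
    """True while the archive is still flagged after being marked at `marked_at`."""
    return now < marked_at + timedelta(minutes=online_checking_interval_min)


def write_targets(archives, unavailable_since, now, interval_min=10):
    """Archives that receive the attachment directly.

    Archives left out would get a recovery entry instead.
    """
    return [
        a for a in archives
        if a not in unavailable_since
        or not is_temporarily_unavailable(now, unavailable_since[a], interval_min)
    ]


outage_start = datetime(2024, 1, 8, 14, 0)  # moment B2 is marked unavailable
marked = {"B2": outage_start}
print(write_targets(["B1", "B2"], marked, outage_start + timedelta(minutes=5)))
print(write_targets(["B1", "B2"], marked, outage_start + timedelta(minutes=10)))
```

Five minutes after the outage only B1 is written, although the network route is already back. After ten minutes both archives are written again.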
Example continued: automatic resynchronization of archives after a temporary outage
To activate the resynchronization of archives after temporary outages, the auto recover flag in the auto recover general configuration tab must be enabled in the AdminClient (Archive Configuration). Furthermore, at least one schedule has to be created to define when the resynchronization of archives is performed.
An auto recover schedule is identified by its name. It is weekday based, i.e. it can be specified that the schedule runs on a selected set of weekdays or on all weekdays. A schedule has a start time and an end time, which define when it starts and ends on the specified weekday or weekdays.
In this example, the auto recover flag of the auto recover general configuration is enabled and a schedule “Nightly” is defined for the weekdays Monday to Friday, with a start time of 22:00 and an end time of 03:00.
This means that archive resynchronization takes place in this system. More specifically, on Monday the resynchronization starts at 22:00 and stops at 03:00 on Tuesday morning; on Tuesday it starts at 22:00 and stops at 03:00 on Wednesday morning. These nightly cycles continue until the resynchronization finally starts at 22:00 on Friday and runs until 03:00 on Saturday morning. The resynchronization resumes on the next Monday evening at 22:00 and the weekly cycle restarts.
The general auto recover configuration also has a setting that defines how many threads perform the resynchronization in parallel. In this example the parameter is set to 5 threads.
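Because the end time lies before the start time, the window of the “Nightly” schedule crosses midnight and belongs to the weekday on which it starts. A minimal Python sketch of such a check follows. The function in_schedule and its defaults are hypothetical and only mirror the example values; they do not describe how ImageMaster evaluates schedules internally.

```python
from datetime import datetime, time, timedelta

WEEKDAYS = {0, 1, 2, 3, 4}  # Monday..Friday, as numbered by datetime.weekday()
START = time(22, 0)
END = time(3, 0)


def in_schedule(now, weekdays=WEEKDAYS, start=START, end=END):
    """Return True if `now` falls into a window that started on a scheduled weekday."""
    if start <= end:
        # Window lies within a single day.
        return now.weekday() in weekdays and start <= now.time() < end
    if now.time() >= start:
        # Evening part: the window started today.
        return now.weekday() in weekdays
    if now.time() < end:
        # Morning part: the window started on the previous day.
        return (now - timedelta(days=1)).weekday() in weekdays
    return False


print(in_schedule(datetime(2024, 1, 8, 22, 30)))  # Monday evening
print(in_schedule(datetime(2024, 1, 13, 2, 0)))   # Saturday morning, Friday's window
print(in_schedule(datetime(2024, 1, 13, 22, 30))) # Saturday evening
```

The first two calls fall inside a window; the third does not, since Saturday is not a scheduled weekday.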
Assuming that 1,000 recovery entries have been written for B2 during its temporary outage, the ImageMaster system starts to resynchronize B2 on Monday at 22:00, i.e. it copies the missing attachments from B1 into the archive B2 (with 5 threads). Each time the recovery of an attachment copy from B1 to B2 finishes successfully, the corresponding recovery entry for B2 is deleted from the database. After several minutes no recovery entries remain in the database and the resynchronization process stops, as there is nothing further to resynchronize.
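The resynchronization pass described here can be sketched in Python with a thread pool of 5 workers. The function resynchronize and the callback copy_attachment are hypothetical placeholders for the actual copy from B1 to B2. Returning the list of remaining entries stands in for deleting the successful entries from the database.

```python
from concurrent.futures import ThreadPoolExecutor


def resynchronize(entries, copy_attachment, threads=5):
    """Run one pass over the recovery entries and return those that remain."""
    entries = list(entries)

    def attempt(entry):
        try:
            copy_attachment(entry)  # hypothetical copy from the healthy archive
            return True
        except Exception:
            return False            # entry is kept for a later retry

    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(attempt, entries))
    # Successful entries are dropped, failed ones stay.
    return [e for e, ok in zip(entries, results) if not ok]


entries = [(f"doc-{i}", "B2") for i in range(1000)]
remaining = resynchronize(entries, copy_attachment=lambda entry: None)
print(len(remaining))
```

With a copy function that never fails, no entries remain after the pass.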
If a recovery attempt for B2 fails, for instance due to a random error, the recovery entry is kept in the database and retried after a certain number of minutes. This duration is configured by the parameter “retryTimeout” in the auto recover general tab and defaults to 60 minutes. Note that retryTimeout also defines the delay before the first recovery attempt: when a recovery entry is created for a given archive, the first recovery for this entry is attempted after 60 minutes by default.
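The effect of retryTimeout on a single recovery entry can be sketched as follows. The function names are hypothetical. The sketch assumes that the delay is counted from the creation of the entry or from its last failed attempt, whichever is later, and it ignores the schedule window.

```python
from datetime import datetime, timedelta

RETRY_TIMEOUT_MIN = 60  # default value of retryTimeout, in minutes


def next_attempt(last_event, retry_timeout_min=RETRY_TIMEOUT_MIN):
    """Earliest time of the next recovery attempt.

    `last_event` is the creation time of the entry or the time of its
    last failed attempt.
    """
    return last_event + timedelta(minutes=retry_timeout_min)


def is_due(last_event, now, retry_timeout_min=RETRY_TIMEOUT_MIN):
    """True once the entry may be picked up for recovery."""
    return now >= next_attempt(last_event, retry_timeout_min)


created = datetime(2024, 1, 8, 21, 30)  # arbitrary example time
print(next_attempt(created))
print(is_due(created, datetime(2024, 1, 8, 22, 0)))
```

In this sketch an entry created at 21:30 is not yet due when the schedule opens at 22:00. It becomes due at 22:30.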
The auto recover general configuration and the auto recover schedules are managed per ImageMaster tenant.
Example continued: temporary outage handling and monitoring
An administrator usually does not need to take any manual action to handle a temporary outage of an archive. It is vital, however, to monitor the auto recover entries in the database via the ImageMaster monitoring subsystem. The absolute minimum for a production system using ARM auto recover mode is to monitor the number of auto recover entries in the system. It is good practice to have the monitoring system issue a WARNING at 1,000 auto recover entries in the database and a critical WARNING at 10,000 entries.
For setting up and using the ImageMaster monitoring subsystem in the context of auto recover monitoring, refer to the system manual [SM Monitoring].
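The recommended thresholds can be expressed as a simple classification. The following Python sketch is not part of the ImageMaster monitoring subsystem. The function name, the status labels and the choice that a threshold is reached at exactly 1,000 or 10,000 entries are assumptions.

```python
def classify_recovery_backlog(entry_count, warning=1000, critical=10000):
    """Map the number of auto recover entries to a monitoring status."""
    if entry_count >= critical:
        return "CRITICAL"
    if entry_count >= warning:
        return "WARNING"
    return "OK"


for count in (12, 1000, 25000):
    print(count, classify_recovery_backlog(count))
```

A steadily growing count indicates that an archive stays unavailable or that the resynchronization does not keep up with newly written entries.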