Providing an SFX failover system using MySQL replication
Anne Highsmith
Texas A&M University
SFX is a critical system at TAMU, especially because the e-journal records in their catalog are all title-level records that link back to SFX.
They wanted a failover method for planned and unplanned outages, similar to their Voyager failover system.
Cost? Double servers, double fun, double costs?
TAMU is not paying an additional fee for another SFX license beyond their current two. According to their agreement with Ex Libris, they continue to pay for two licenses, and the third SFX installation is always dark. One could do this via virtualization or via separate servers.
At TAMU they have a service name for SFX in addition to the individual server names. Using the service name, they can switch from the regular to the failover SFX installation without waiting for DNS propagation: they just change the service in Apache.
Failover setup:
Begin by installing a vanilla SFX; take various steps in between [notes missed these]; wind up with a cold backup of production in the newly-installed SFX.
Anne notes that, in the case of separate servers, you should NOT try this across operating systems.
MySQL replication:
This step keeps the failover and production SFX installations in sync with respect to KB changes.
Anne recommends that an experienced MySQL DBA be involved since there are some special steps for replication. In her words, "magic happens here."
After replication was set up, their testing showed data being copied from production to failover in a second or less.
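As a rough illustration of how one might confirm that replication is keeping up, here is a minimal sketch, not TAMU's actual procedure. It assumes you have already fetched one row of MySQL's `SHOW SLAVE STATUS` (or `SHOW REPLICA STATUS` on newer MySQL versions) as a dictionary; the function name `replica_is_healthy` and the one-second threshold are hypothetical. Note that MySQL reports this lag in whole seconds, so it cannot confirm sub-second copying by itself.

```python
from typing import Mapping


def replica_is_healthy(status: Mapping[str, object], max_lag_seconds: int = 1) -> bool:
    """Hypothetical helper: judge replica health from one status row.

    `status` is assumed to be a dict built from SHOW SLAVE STATUS
    (older field names) or SHOW REPLICA STATUS (newer field names).
    """
    # Accept either the older or the newer MySQL field names.
    io_running = status.get("Slave_IO_Running") or status.get("Replica_IO_Running")
    sql_running = status.get("Slave_SQL_Running") or status.get("Replica_SQL_Running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        lag = status.get("Seconds_Behind_Source")

    # Both replication threads must be running.
    if io_running != "Yes" or sql_running != "Yes":
        return False
    # A NULL lag means the replica cannot currently measure its delay.
    if lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

A monitoring job could call this on the failover server at intervals and alert staff when it returns `False`.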
Now that it's up and running:
TAMU's setup is configured to be completely hidden when not in use (per their agreement with Ex Libris): the reverse proxy Apache is down and SFX Admin is disabled.
Failover SFX does not need KB updates because replication takes care of these by copying the KB from production. Software updates still need to be applied; they use a special rev-up option that updates the software only.
In the event of an outage:
Switching over to failover SFX takes about 10 minutes with 3 people coordinating their actions. Once active, it is available for limited use only: public use is supported, but no staff updates can be done until production is back up (see above regarding keeping SFX Admin disabled).
Anne discusses the implications of running a failover this way. The failover KB is, in effect, stuck at that point in time until production is brought back. Staff can fall far behind in their work when SFX is up for the public but down for staff.
In theory, it would be possible to reverse the direction of the production-to-failover replication once failover is activated for public use. This would mean that failover becomes the new production (including full staff access, etc.) and the old production, once available, would become the new failover.
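The difference between the two approaches described above (public-only failover versus reversing replication) can be sketched as a toy state model. This is purely illustrative: the `Install` class, its fields, and both functions are hypothetical names, and nothing here talks to SFX, Apache, or MySQL.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Install:
    """Hypothetical model of one SFX installation's role."""
    name: str
    public_enabled: bool            # reverse proxy up, public can resolve links
    admin_enabled: bool             # SFX Admin available to staff
    replicates_from: Optional[str]  # name of the installation it copies from


def activate_failover(failover: Install) -> Install:
    """Public-only failover: public access on, staff access still off."""
    return replace(failover, public_enabled=True, admin_enabled=False)


def reverse_replication(old_prod: Install, failover: Install) -> Tuple[Install, Install]:
    """Theoretical role swap: failover becomes production and vice versa."""
    new_prod = replace(failover, public_enabled=True, admin_enabled=True,
                       replicates_from=None)
    new_failover = replace(old_prod, public_enabled=False, admin_enabled=False,
                           replicates_from=failover.name)
    return new_prod, new_failover


prod = Install("sfx-prod", True, True, None)
dark = Install("sfx-failover", False, False, "sfx-prod")
```

In the first case the failover keeps pointing at a production server that is down, which is why its KB stays frozen; in the second, staff regain Admin access because the roles are exchanged.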