Just the other day we had to DR some of the systems, all in anticipation of the move to our European data center. Now I’m not unfamiliar with DR tests, as I’ve participated in quite a number of them in the past. In actual fact I’ve invoked a DR solution for real on two occasions in my 30-year career in IT.
To keep this in perspective, what I will say is that there should always be an attempt to make the rehearsal as realistic as possible, and that’s where things become a little difficult. It’s not always possible in the real IT world to make things that realistic, as you are in fact just building problems for yourself. As anyone with any significant amount of experience in the IT world will tell you, it’s difficult to DR most systems in isolation, as there are many interdependencies to be taken into consideration. However, when the system you are rehearsing the DR for has the added overhead of governmental regulation, the magnitude of the task becomes much greater. Well, this is the situation I found myself in the other day.
I had done a significant amount of chasing the goalposts around the park and had prepared as well as I could for the event. However, as usual, just when you think you’ve made the thing Idiot Proof, you find out that the Idiots have been upgraded. And so it was with the DR the other day: to simulate an actual crash on the server, the mirrors were broken at the SAN level. This would make the test realistic, I was told, so the actions were to be carried out during the online day.
When the go-ahead was given, the SAN team presented the disk to another server while leaving it on the original live server – not much of a problem then, I thought. The original server was still seeing the disks; however, the virtual one they had been presented to had indeed been shut down – I know, ’cause it was me wot dun it! What I didn’t realise was that the hypervisor on the original server still had control of the disk while the mirror disk was presented to the new server. The original server threw a fakey, but as by now I was working on the DR server, I didn’t notice.
As most software has a degree of inbuilt intelligence (although if you speak to a developer you’d never know), the disk management software tried to compensate for our stupidity. Although I eventually managed to outsmart the Volume Manager (or so I thought), it didn’t give up trying. So off we went into the recovery phase. The system came up, although the disk barfed when it came round to starting up the server and its dependencies.
After a significant amount of time spent pratting around, and the realization that the system was having strange problems with the disk, I decided on the sledgehammer-to-crack-a-nut approach. This, in the fullness of time, proved absolutely not to be the answer to the problem, with bloody knobs on! The problem turned out to be that the disks being used for the DR were mirrored at the host level as well as at the SAN level, which in reality is still not that much of a problem.
What did prove to be a problem was that the disks the SAN team chose to use for the DR were in fact the primary mirrors as far as the SAN was concerned. This meant that any work carried out on the disks presented to the DR server was also reflected on the disks that were meant to be used for the recovery of the live server. In order to repair the server I had to make several changes to the disks; however, they did eventually come in as requested, leaving a great feeling of self-satisfaction, which I might add was very short-lived. All too soon it was time to revert to the original server, and this is where things got more interesting. Suffice to say that the disaster rehearsal was OK – especially the disaster bit, hence the title of the post.