Archive for January, 2008

Bug fixes, wget functionality + unpack on-the-fly, WYSIWYG editor

Now that we have a usable file management module in esprit things are coming together quite well.  Note that I’ll be temporarily postponing new features after “Log Rotation” and edit-in-place for e-mail addresses are implemented in the next update to concentrate on adding content to the wiki.  My goal is to have the “Resource Center” retired by mid-February; the sooner the better.

  • Added: wget-like facility to File Manager, download and decompress archives on the fly
  • Added: domain_fs_path/domain_info_path methods to HTML controller
  • Fixed: recursive chown didn’t recursively chown
  • Fixed: abort mailbox creation if uid isn’t present in uid table
  • Fixed: rewrote file upload handling, which should handle larger files without puking unexpectedly due to memory limits (bug #38)
  • Fixed: shifting rows in Quick Menu on IE6
  • Changed: color layout of control panel
  • Changed: replaced save as icon in File Manager
  • Changed: increased single file upload limit from 4M to 64M
  • Changed: Manage Users & Mailboxes -> Manage Users; Mailbox Routes -> Manage Mailboxes

One late addition to the party that I’m sure most people will appreciate.

  •  Added: WYSIWYG FCKeditor to File Manager


esprit update, missing subdomain facility, File Manager enhancements

Sorry about the last commit, which conveniently left the subdomain changes out ;)

Changelog from Sunday’s work, which includes a couple of enhancements and miscellaneous fixes:

  • Added: octal permission view to File Manager
  • Fixed: parent directory was left with root:root ownership following extraction (bug #67)
  • Fixed: file permission change results in “Invalid Directory” message
  • Fixed: MySQL Manager, PostgreSQL Manager, phpMyAdmin, and phpPgAdmin password field types converted from text to password
  • Fixed: SQL_Module::edit_mysql_user() always returned false
  • Changed: Increased Urchin profile allowance to 2 free profiles per account


Kernel upgrade February 2, 2008

A maintenance window has been set for all servers on February 2, 2008 at 2 AM.  During this time the latest kernel, 2.6.24, will be brought up.  Please expect a 3 - 5 minute window of inaccessibility as the servers are brought down and restarted.

Several NIC fixes for the sky2 driver/NAPI API (“API” is quite redundant) have been introduced in this kernel and unfortunately the new network cards are affected by the bugs fixed in 2.6.24.


New esprit update, subdomain facility is here

Friday is a big day of changes.  New esprit update scheduled for 3 AM EST (-0500 GMT).  Changelog follows:

  • Added: new subdomain facility, replaces Ensim version.  Includes global/local subdomains (see http://guide.apisnetworks.com/index.php/Apache)
  • Added: SPF Wizard support for secondary domains
  • Fixed: SPF records weren’t being saved on postback
  • Fixed: is_int() check on File_Module::quota incorrectly handles octal values
  • Fixed: kludge, force sticky bit on Majordomo configuration; calling chmod from another module drops the 0 which is significant if no special bits are set, e.g. 2755 is preserved while 0755 is converted to octal.  To fix at a later date once regression testing is implemented.
  • Fixed: mysql_escape_string() warning in MySQL/PostgreSQL database backups
  • Changed: renamed “Quota Tracker” to less ambiguous “Disk Usage Watcher”
  • Changed: upgraded jQuery to 1.2.2
  • Changed: increase color transition from 2.5 seconds to 3 seconds for cosmetic reasons
  • Removed: clean-up of old JS no longer in use, prototype and scriptaculous
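
The leading-zero problem in the chmod item above can be pictured with a small sketch. This is Python rather than esprit's PHP, the helper name parse_mode is hypothetical, and the post does not show the actual code path, so treat it only as an illustration of why a dropped 0 matters: the same digits mean different permission bits depending on whether they are read as octal or as a plain decimal integer.

```python
# Hypothetical illustration only; parse_mode is not part of esprit.

def parse_mode(text: str) -> int:
    """Interpret a permission string such as "0755" or "2755" as octal."""
    return int(text, 8)

intended = parse_mode("0755")  # the mode that was meant: rwxr-xr-x
misread = 755                  # the same digits read as a decimal integer

print(oct(intended))  # 0o755
print(oct(misread))   # 0o1363, a different set of permission bits
```

Keeping modes as strings until the last moment, and converting with an explicit base, is one way to avoid the ambiguity.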


Service Upgrades on Saturday (PHP, Dovecot, vim, and more)

A slew of upgrades is due this weekend, including:

PHP 5.2.5, PHP 4.4.8, Apache 2.2.8, Dovecot 1.0.10, Postfix 2.4.6, SpamAssassin 3.2.4, vim 7.1 and sqlite 3.4.2

Given the number of upgrades for each server the maintenance window will be between the hours of 12 AM - 4 AM EST (-0500 GMT) on Saturday, January 26th.

If there are any other services you would like to see updated, then you have until Friday night to make a note of it or forever hold your peace until the next batch update.


RubyGems upgrade, File Manager upload revert, misc bug fixes

A few fixes have trickled in over the past two days in addition to the two listed below:

  • Changed: Upgraded RubyGems to 1.0.1
  • Fixed: restored File Manager “Upload” button that had gone missing (oversight, honest)

Older fixes from the last few days bundled into the blog post:

  • Fixed: editing Majordomo configuration resulted in incorrect permissions/ownership combination (660, admin user/group)
  • Fixed: next SQL backup date should use today as base instead of the next_date field
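
To see why the base date matters in the SQL backup fix, here is a rough sketch in Python. The next_date field is named in the changelog; the function names and the interval_days parameter are assumptions made up for this example and do not reflect esprit's actual code.

```python
from datetime import date, timedelta

def next_backup_from_today(today: date, interval_days: int) -> date:
    """Schedule the next backup relative to the current date."""
    return today + timedelta(days=interval_days)

def next_backup_from_field(next_date: date, interval_days: int) -> date:
    """Old behaviour (assumed): schedule relative to the stored next_date."""
    return next_date + timedelta(days=interval_days)

today = date(2008, 1, 20)
stale_next_date = date(2008, 1, 1)  # stored field that was never advanced

print(next_backup_from_field(stale_next_date, 7))  # 2008-01-08
print(next_backup_from_today(today, 7))            # 2008-01-27
```

With a stale next_date the computed date can land in the past, while basing the calculation on today always yields a future date.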


Drive failure on Augend

Augend has experienced a failure in one of its drives, and the drive is in the process of being replaced. During this time the /tmp/ directory’s contents will be unavailable.

Update: 12:59 AM EST (-0500 GMT)
The failed drive has been swapped out and /tmp rebuilt (RAID 0 slice).  Remaining partitions should be resync’d in a couple of hours.  Everything appears to be in proper working order, apart from an 8 minute window when services writing to /tmp were temporarily taken down.


esprit Update, symlink support is here!

Folks, good news: I’m pushing the latest esprit update to the servers at this time.  In addition to the much wanted symlink support, there are several compression interface fixes, which means you should be able to extract an archive from your site.  Having said that, adding a wget component would likely be another beneficial addition in the near future.  After temporarily canning the one-click interface (it’s not coming back for a while) I can go back to concentrating on adding new features to the control panel and there’s a thread opened for user requests.  If you have a request, then post there or just reply to this news post.  Changelog follows.

  • Added: postback status color fade
  • Added: rename/symlink support in the File Manager
  • Added: directory drop-down for extraction destination (still needs some work on responsiveness)
  • Fixed: file names with spaces were not properly extracted via the Zip interface in the File Manager
  • Fixed: code clean-up in generic AJAX handler library, File Manager
  • Fixed: encode filenames in File Manager when used in URL
  • Fixed: interface mismatch in File Manager’s compression handler
  • Fixed: billing service connection errors were displayed
  • Fixed: miscellaneous code clean-ups
  • Removed: old debugging information from Module_Skeleton::synchronize_changes
  • Changed: upgraded jQuery to 1.2.1
  • Changed: upgraded esprit backend to PHP 5.2.5
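
The filename-encoding fix in the list above comes down to percent-encoding names before they are placed in a URL. A minimal sketch in Python (not esprit's PHP; the encode_filename helper is hypothetical) shows what happens to spaces and other reserved characters:

```python
from urllib.parse import quote

def encode_filename(name: str) -> str:
    """Percent-encode a file name so it is safe inside a URL query value."""
    # safe="" also encodes "/", so a path separator cannot leak through.
    return quote(name, safe="")

print(encode_filename("my file #1.txt"))    # my%20file%20%231.txt
print(encode_filename("docs/my file.txt"))  # docs%2Fmy%20file.txt
```

Without this step a name containing a space or a "#" would cut the URL short or be misread by the browser.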


Minor maintenance window tonight - MySQL upgrade, SSL certificate refresh

MySQL will be upgraded to 5.0.51 tonight between the hours of 1 - 2 AM EST (-0500 GMT).  One fix looks similar to the problem we have seen occur rarely in the past with deadlocking, such that a table has a mutually exclusive write lock on it and that lock trickles out to all other processes.  Ultimately, this bug kills all read/write requests on all databases until our internal monitor catches and restarts the MySQL server, but that’s anywhere between 1 - 15 minutes depending upon traffic.  I’ve personally handled this situation roughly 4 times in the past year, and the bug has existed since the early days of 4.0 mind you, so it is extremely rare.  Whether my suspicions are right or not… we’ll have to wait and see.

SSL certificates will also be updated tonight since we’re rapidly approaching one year since leaving EV1 and moving to Gnax.  Given the previous post regarding the power outage I would say Gnax knows how to treat us well on our birthday.


Power Outage at Gnax

Overnight, around 4:40 AM EST, I received a call from a co-worker explaining that the damnedest and strangest thing had happened: we lost connectivity to the entire data center.  Not just a particular segment on the network being unresponsive, not just high latency from a switch upgrade, no, a total outage.  Thankfully before I settled on driving down to the data center at 5 AM [and consequently getting stuck in Atlanta "rush hour" traffic for 1 hour coming back 2 miles] our pings began receiving replies.  Everything’s up, right? Wrong.

Recall yesterday that all of the servers were taken down to replace the big, bulky, integrated network card with little dinky, albeit still expensive, D-Link PCIe cards purchased over the weekend.  My policy with kernel upgrades is to run the new kernel once for a week.  If there is a system lockup the server is automatically rebooted and we are back to the old kernel.  Usually this works unless the older kernel doesn’t have the driver for the new network card built-in — oops.  Good timing nevertheless.

We spent the remaining 30 minutes logging into each server through the on-board VNC to manually reboot and bring up the new kernel again.  I always knew the DRACs would come in handy.  Beyond the reboot there were tiny individual server fix-ups we had to do on each one, chiefly bringing the primary and secondary DNS servers back up.  I can’t say when things returned to normal, because at that hour things were a blur with me mostly fuming at the mouth.

What the hell happened, those are your exact thoughts, right?  Mine too…  In addition to the thousands of others at Gnax…  I can say for sure it was a power outage.  Apparently it was something internal after the backup generators in the power distribution scheme.  Everyone lost power.  Every single server.  All 5,000.  Yes, that is what we call a major screw-up.  I don’t know when we’ll get a clear story on the cause, because all of the staff (owner included) are scrambling to reboot servers, fsck filesystems, replace access switches (yes, the power outage knocked one of them out), and answer the torrent of tickets/phone calls streaming in… and we’re all wondering: what the hell just happened, Gnax?

To say I’m disappointed isn’t scratching the surface of how I feel about the entire situation.  Between this event and the inbound GigE links that perpetually suck (PCCWBTN and Global Crossing specifically) I am beginning to question the competence in design of this data center.  Here’s hoping this isn’t a sign of things to come.  Reboots are a pain and cleaning up after an outage is a nightmare.

Once I know for certain what happened I’ll update the entry.  No idea when that will happen though since the people who have an answer are busy regretting their jobs right now.

* Gnax thread (needs registration) - “What the hell just happened?”
* WHT thread (publicly accessible) - “GNAX DOWN!?”

Update: 1:00 PM EST (-0500 GMT)
Straight from the horse’s mouth:

At approximately 4:45 am EST the NAP suffered a power outage lasting approximately 10 seconds from Georgia Power.

The generators fired and came online 15 seconds after the initial outage and the load was transferred to generators which ran for 30 minutes while monitoring the incoming power quality from GA Power at which time the load was transferred back to utility.

One of the UPS’s that serves part of the facility suffered a battery outage on 2 different redundant strings which caused it to drop the load.
We installed a second redundant string approximately 9 months ago to minimize the possibility of this type of situation. The batteries in the 2 strings are setup in parallel meaning each is capable of carrying the full load for up to 5 minutes.

All it takes is 1 battery in a string to fail for the entire string to fail. this is the same in all ups systems and is the reason we installed the second string from advice from the manufacturer.

The original string batteries are 1.5 years old and were installed new. The second string is 9 months old and was installed new.

A single battery in the second string failed after 3 batteries in the first string failed.

We turned the generators back on to avoid an interruption during troubleshooting and maintenance and MGE sent a tech onsite within an hour to troubleshoot at which time we discovered the battery issue. we replaced the batteries within an hour of diagnosis and brought the system back online and out of maintenance bypass.

The load is currently protected and all batteries have been tested again.

Both sets of batteries have been maintained and tested by MGE direct service every 6 months under a pm plan that they recommended for proper maintenance and operation.

This was extremely rare and unforeseen to have something like this happen.

We are purchasing our own battery tester and will set up a monthly pm on the batteries that we will conduct ourselves in addition to the 6 month pm that MGE does on the UPS as well as the batteries. We are also researching a real time battery monitoring system that can predict battery failure.

Batteries are the weakest link in the system and we feel like we properly followed recommended engineering and maintenance on these systems. - however that will not assure 100% as we found out today in a very rare incident.

Extemporaneous events that continued to affect service during the outage:
one of the main metro e switches that runs the links of our backbone went offline during the outage and during that power-induced reboot we lost connectivity to half our backbones. we have our backbones split in half - with half going out the east and half out the west side of the building taking diverse paths across redundant switches to the final interconnect points.
the switch was unstable when it came back online due to a gbic that died and for some odd reason rebooted itself several times about every 10 minutes. we replaced the gbic with a spare we keep onsite.

This caused half the backbones to go up and down and placed a large cpu load on the different core routers we have due to bgp table loads going on - this is very cpu intensive and when you have a lot of up and down it can appear that the network is completely down (it is if you are on a link that is flapping) but the fact is that the entire network was not down but was impacted. this settled down when the switch was stabilized.

We split our backbones up over several different redundant backbone routers.

once this switch was brought back online and stabilized the network stabilized as well.

an access switch that serves 16 servers also died and we replaced it with a spare once we found the issue. we keep spares on site for every piece of network gear we have.

an apc that was only 6 months old and is a dual fed apc from 2 different power sources (including the newer ups) failed and did not come back - we replaced it with an onsite spare. it was bizarre to say the least and of course it powered one of our 3 main dns clusters so we lost dns capacity for an hour.

Most of the issues currently going on are related to server hardware that did not do well in a power reboot situation or need a fsck. we are actively working on them and will not rest until all is well.

Many customers in the facility do have A and B feeds from our power. we offer this through different ups systems / different power panels and different transformers. Some very early customers that purchased a and b feeds when we only had one ups system at the NAP are on the same ups and as such lost power. those customers will be offered a free move on their b feed to the newer ups to increase their power diversity - they simply need to open a ticket.

What are we doing on power in the future?

We have another UPS from MGE on order as of 4 weeks ago that is due to deliver in mid Feb that will increase the diversity of the power in the facility. We plan on having 2 battery strings on it as well.

We are in the process of installing another set of 5 cummins generators and another 3000 amp transformer which will further diversify our generator and transformer plant - this will be completed in mid february - construction of this is going on currently we took delivery of the switchgear and generators 2 weeks ago. 4 ups will be moved to the new power feed and generators to diversify the power source to the UPS. this will give us 100% redundancy on the A / B feeds at that point.

We installed a redundant b feed to our metro e gear and 2 dual fed apcs at our TELX cabinet after TELX suffered a complete UPS failure at 56 marietta 4 months ago. This turned out to be good because there was another complete failure of the B ups 4 weeks ago - but we were not affected since we had a redundant feed from them. the outage affected all customers on the second floor. we would have lost more than 50% of our network had we not been on dual fed apcs and dual power feeds at the building which would have been bad.

we are increasing the battery pm schedule to monthly from biannual.

we are researching a battery monitoring system for the strings.

we will be taking a fuel delivery this week to restock our main fuel supply

we are examining in depth one of our 4 core metro switch abnormalities this morning and if we do not find a rfo from the manufacturer will be examining replacing it or upgrading to a different more robust solution - which has been in our long term plan but may get moved up.

we will be doing another power examination of our core switching routers ( currently 6 of them all with dual fed power ) and our core metro e switches (currently 4 of them) to make sure that our power feeds are truly redundant and no legacy circuits are there to affect them.

we will be examining our on site spares inventory to make sure we are still at correct levels since we used some items this morning.

We apologize for the outage caused by the failure of the primary and backup batteries and will continue to provide the best service at an excellent price.
The MGE tech that has all the major accounts in Atlanta including coke and several others told us that this was a very freak occurrence with negligible odds of happening and in his opinion we have done everything right on our maintenance and pm and redundancy of the batteries and he would have done the same thing and that there was really nothing he would have recommended different at that point.

we are still going to make the changes above that I mentioned though.

As indicated on their forum there are still hundreds of servers down waiting on direct intervention from a tech on-site 8 hours later.  Thank goodness for the DRACs.  We had all of the servers back up within the first 30 minutes.

