Archive for January, 2008

Emergency Maintenance - NIC Swap (1/7)

A maintenance window has been scheduled for 2 - 3 AM on January 7th to replace the network cards in all servers. There is no exact window for each server, but expect 10 minutes of downtime per server as it is brought down, the new NIC is installed, and the new drivers are built. This effectively resolves the erratic packet loss we’ve been seeing over the past few months.


NIC Debugging (bnx2 driver)

(Originally posted December 28 at 4:33 PM EST):

Aleph will be temporarily taken down during a 1 - 3 AM EST (GMT -0500) window on Saturday, December 29th to remove the TCP offload engine jumper from the server. Aleph will be inaccessible for 3-minute stints as the server is rebooted. During this window the network driver (bnx2) will be upgraded in an ongoing attempt to resolve a rare situation that results in dropped packets. Aleph’s kernel will be temporarily upgraded to the 2.6.24 release candidate, which includes an updated bnx2 driver. Possibly this will resolve the packet rot; at this time the exact cause is unknown.

I have isolated the problem to the BCM5708 chipset present in all of the Dell PowerEdge 1950 servers. Whether the cause is a defect in the kernel drivers, the proprietary TOE support (which should be inactive on Linux…), Intel’s I/OAT TCP offload feature, message signaled interrupts (MSI) on the network card, or something else is still unknown.

1/3/2008: Still no resolution on the issue. I’ve escalated it to the mailing lists to see if anyone else has similar issues.

1/5/2008 at 8:49 PM EST: New NIC is in Aleph. Provided there are no further packet drops between now and tomorrow, the remaining servers will have new NICs installed. A reboot will be required.
