Just time for a quick rant about the work machines. On Tuesday night, while out at dinner, I got a call from Marco telling me that the site appeared to be down. After a quick talk with him (it really was quick: I spoke too quickly and Mike had to take over), we worked out that one of our web application nodes had locked up due to an APC bug (tracking that down is on my todo list).
All was fine once freddie managed to do a graceful restart of Apache on the machine, and I continued eating.
Before we got home I got another call telling me the servers were completely gone now: our main load balancing machine couldn't be pinged, but the web application and database servers were all up. I got someone to open a ticket with our hosting company and went to finish eating.
Once home, I logged in to our main database server and went around checking to see if I could identify what was up. By this point our hosts were planning to reboot the server, because that always fixes everything. It took me about 30 seconds to work out that the external NIC on the load balancer was down: ethtool showed me everything was up except that there was no active link.
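For anyone who wants to script that check, here is a minimal sketch that looks for the "Link detected" line in ethtool's output. The interface name eth0 and the function names are placeholders of mine, not what we actually run, and it assumes ethtool is installed and prints its usual "Link detected: yes/no" line.

```python
import re
import subprocess


def link_detected(ethtool_output):
    """Return True or False from the 'Link detected:' line, or None if it is absent."""
    match = re.search(
        r"^\s*Link detected:\s*(yes|no)\s*$", ethtool_output, re.MULTILINE
    )
    if match is None:
        return None
    return match.group(1) == "yes"


def check_interface(iface="eth0"):
    """Run `ethtool <iface>` and report link state (eth0 is a placeholder name)."""
    result = subprocess.run(
        ["ethtool", iface], capture_output=True, text=True, check=True
    )
    return link_detected(result.stdout)
```

Calling check_interface("eth0") on the load balancer would be expected to return False in a situation like ours, where the interface is configured but the switch port is dead. It may need root on some systems.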
I responded in the ticket asking them not to touch the machine and to look at the cause instead, to see if a retainer clip had come loose or if someone had tripped over a cable. One hour later we were told the port on the switch had been deactivated by a security mechanism, apparently in response to some form of DoS.
I found out this morning that we'd apparently just had more than 15,000 packets a second go through the port and that this is what triggered it, so no DoS. It was probably release traffic, but it still took them 2 hours to work this out.
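To keep an eye on how close we get to that limit from our side, a rough sketch like the one below could sample the kernel's packet counters. The 15,000 figure is the one the host quoted; the interface name, the function names, and the idea that the switch counts the same packets the host sees are all assumptions on my part.

```python
import time


def read_packet_count(iface="eth0", direction="rx"):
    """Read a cumulative packet counter from sysfs (Linux only; eth0 is a placeholder)."""
    path = f"/sys/class/net/{iface}/statistics/{direction}_packets"
    with open(path) as counter_file:
        return int(counter_file.read().strip())


def packets_per_second(count_start, count_end, seconds):
    """Average packet rate between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (count_end - count_start) / seconds


def exceeds_threshold(rate, threshold=15000):
    """True if the rate is above the limit the host quoted (more than 15,000/s)."""
    return rate > threshold


def sample_rate(iface="eth0", interval=1.0):
    """Measure the receive packet rate on iface over `interval` seconds."""
    start = read_packet_count(iface)
    time.sleep(interval)
    end = read_packet_count(iface)
    return packets_per_second(start, end, interval)
```

This only measures one direction on one host, so treat the result as a hint rather than as what the switch port actually sees.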
In case anyone is wondering who the hosting company is, it's The Planet. The servers may be cheap, but sometimes I think their support can be troublesome.