We've been having an issue with our firewall cluster (completely separate units, primary and secondary, for redundancy) holding onto connections long after they're dead. Imagine you have three phone lines at your business. You get a call on one line, and when the call is over, someone doesn't hang up the phone - the call is over, but the line remains open and you can't use it. Now this happens on all three lines and you cannot take any more calls in or out. That's essentially what's going on, except we can support 25,000 calls (connections), and something that is not easy to identify is holding them open.
The issue initially surfaced on Thursday, when we had three hours of intermittent connectivity while we attempted to trace the exact source. We had another 15-minute outage yesterday after a troubleshooting step that should not have affected our services did so anyway (and I beat my senior engineer over the head for performing ANY steps during business hours).
We took the site down at 11 PM yesterday, are doing so again today, and will probably do it tomorrow as well, to perform various troubleshooting steps. Last night, I took down the server CR lives on so I could make a few changes to an application that shares that server with CR and that I suspect is the culprit.
Tonight we're taking down our services to force traffic back over to our primary firewall for further testing.
Did I explain that clearly enough, at least mostly?