Two days ago, I got a call from a client's customer service team: they were getting an ASP.NET error page (the custom error page) whenever they tried to use the account administration service to manage user account data through a secure ASP.NET application. With ASP.NET on a Windows box, the actual error message is written to the Application section of the Event Log, so I took a look there.
The error was an InvalidOperationException thrown by the Open method of System.Data.SqlClient.SqlConnection: it couldn't get a pooled connection from the .NET SQL connection pool. After I recycled the application pool for this application in IIS Manager, the page would load up to the point where it needed SQL connections and then hang. I waited a minute or two and it was still trying to load. When I refreshed the page, the ASP.NET error page appeared again. It was clear that the server was unable to make a connection to the database and that the connection pool was being exhausted. The application monitoring systems didn't pick this up immediately because they're on the public side of the server cluster.
What struck me as odd was that the application is designed to handle this kind of InvalidOperationException gracefully by creating a new connection. The idea is that, while it's still important to make sure all database resources are properly closed and disposed, it should be impossible to crash the application by exhausting the SQL connection pool.
By default the application uses a pooled connection. If no pooled connection is available and .NET throws an InvalidOperationException from SqlConnection.Open, it modifies the connection string to tell .NET to create a new, unpooled connection:
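    private SqlConnection GetSQLConnection(Database db, bool pooled)
    {
        SqlConnection dbconn = null;
        lock (openConnections)
        {
            // Reuse the cached connection only for pooled requests; a non-pooled
            // request is the fallback path and needs a fresh connection.
            if (pooled && openConnections.ContainsKey(db))
            {
                dbconn = openConnections[db];
            }
            else
            {
                string connectionstring = ConnectionStrings[(int)db];
                // "Connect Timeout" and "Connection Timeout" are synonyms in SqlClient,
                // so only one of the two values on each line below takes effect.
                if (pooled)
                    connectionstring += ";Min Pool Size=5;Max Pool Size=60;Connect Timeout=2;Connection Timeout=45";
                else
                    connectionstring += ";Pooling=false;Connect Timeout=45;Connection Timeout=45";
                // Use the modified string; the unmodified one would ignore the settings above.
                dbconn = new SqlConnection(connectionstring);
                // The indexer replaces any cached pooled connection that failed to open.
                openConnections[db] = dbconn;
            }
        }
        return dbconn;
    }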
The important bits of that code are:
- If the connection is pooled, append the pool size and pool timeout settings to the connection string.
- If the connection is not to be pooled, make sure to tell the .NET library to skip the pool.
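The two cases above can also be expressed without string concatenation. What follows is only a sketch, assuming System.Data.SqlClient is available (built into .NET Framework, a separate package on newer .NET); the class name PoolSettings, the method Apply, and its timeout parameter are hypothetical and not part of the original application.

```csharp
using System.Data.SqlClient;

public static class PoolSettings
{
    // Hypothetical helper: returns a copy of baseConnectionString with the
    // pooling options from the two bullets above applied.
    public static string Apply(string baseConnectionString, bool pooled, int connectTimeoutSeconds)
    {
        var builder = new SqlConnectionStringBuilder(baseConnectionString);
        if (pooled)
        {
            // Pooled case: bound the pool size.
            builder.Pooling = true;
            builder.MinPoolSize = 5;
            builder.MaxPoolSize = 60;
        }
        else
        {
            // Non-pooled case: tell SqlClient to skip the pool entirely.
            builder.Pooling = false;
        }
        // A single timeout property avoids having two synonymous keywords
        // ("Connect Timeout" and "Connection Timeout") in the same string.
        builder.ConnectTimeout = connectTimeoutSeconds;
        return builder.ConnectionString;
    }
}
```

The builder rejects misspelled keywords when it parses the base string, which a plain string append would not catch until the connection is opened.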
Then the connection is opened inside a try/catch for InvalidOperationException. If Open throws that exception, the code falls back to a non-pooled connection.
    try
    {
        conn.Open();
    }
    catch (InvalidOperationException)
    {
        // Try to start a non pooled database connection.
        conn = GetSQLConnection(db, false);
        if (conn.State == ConnectionState.Closed)
        {
            conn.Open();
        }
    }
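The same fallback can be written as a standalone helper that does not depend on a connection cache. This is a rough sketch and not the application's actual code: the class ConnectionFallback, the method OpenWithFallback, and the createConnection factory (called with true for a pooled connection and false for a non-pooled one) are all hypothetical names.

```csharp
using System;
using System.Data;

public static class ConnectionFallback
{
    // Hypothetical helper: opens a pooled connection, and if Open throws
    // InvalidOperationException, retries once with a non-pooled connection.
    public static IDbConnection OpenWithFallback(Func<bool, IDbConnection> createConnection)
    {
        IDbConnection pooled = createConnection(true);
        try
        {
            pooled.Open();
            return pooled;
        }
        catch (InvalidOperationException)
        {
            // The pooled attempt failed; release it before falling back.
            pooled.Dispose();
        }

        IDbConnection unpooled = createConnection(false);
        try
        {
            unpooled.Open();
            return unpooled;
        }
        catch
        {
            // Nothing left to fall back to: clean up and let the caller see the error.
            unpooled.Dispose();
            throw;
        }
    }
}
```

A caller would pass something like `pooled => new SqlConnection(BuildConnectionString(pooled))`, where BuildConnectionString is again a hypothetical function that adds either the pool settings or Pooling=false. Taking a factory keeps the helper independent of SqlClient, so it can be exercised with a fake connection.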
When everyone is doing their job, this extra protection isn't needed and rarely, if ever, executes. Having 40 database connections open simultaneously is Slashdot Effect territory. It doesn't happen often.
This bit of code was acting exactly as designed. It just so happened, however, that a Microsoft Update had been applied a few hours earlier and the server was waiting for the next planned outage window to reboot. After a reboot, the server resumed normal operations.
Shortly after, I started getting complaints that a separate application wasn't allowing users to log in. After some debugging, it turned out that user state was no longer being kept across the cluster. This issue was on a different server in the cluster. When a user's requests switched to a different behind-the-scenes server in the cluster, that server didn't know what the user had done last and therefore reported to the application that the user wasn't logged in. Once again, a Microsoft Update had been applied several hours earlier and the server was waiting for the next planned outage window to reboot. After I restarted the second server, the complaints stopped.
So far, there's been no recurrence. All servers in the cluster have been updated and rebooted. I can't wait to see what sort of system-specific, untestable situation will happen on the next round of 'Microsoft Updates'.