Institutional Knowledge

Wherein we write down some stuff that we know.


Dealing with Downtime

August 28th, 2007

Yesterday we were cruising right along in the portal. Things were looking quite good for the first day of the fall term: we were sustaining over 1,500 concurrent users. Let me stop here for just a second and explain what that means. Before uPortal we ran another “monolithic” portal product, and it suffered from severe performance problems, especially under high load at peak times. I could argue about how “monolithic” it really was, since many of the components ran on different pieces of hardware, but I’ll let it slide because no effort was made to distinguish the different applications from “the portal.”

We currently run two production app servers (Tomcat, Java 5, Dell 2850s) and yes, we keep the portal as simple as possible. People log in, and then they go to e-mail, WebCT, or PeopleSoft. So what makes the portal different from a static HTML page with links? Two things. First, we know who you are and, to some extent, what you do on campus, so we show you things that are appropriate for your group. Second, and this is the real killer app, there is single sign-on, which in our case is CAS. You log in once, and then we can get you to all these other applications, safely, without making you log in again.
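For readers who have not seen CAS up close, here is a rough sketch of the two URLs that make single sign-on work under the CAS protocol: the login redirect an application sends your browser to, and the server-side call the application makes to validate the service ticket it gets back. The host names and the ticket value are made up for illustration; they are not our real servers.

```python
from urllib.parse import urlencode

# Hypothetical hosts, for illustration only.
CAS_BASE = "https://cas.example.edu/cas"
WEBMAIL = "https://mail.example.edu/"


def login_url(service: str) -> str:
    """Where an application sends a browser that has no local session yet."""
    return f"{CAS_BASE}/login?{urlencode({'service': service})}"


def validate_url(service: str, ticket: str) -> str:
    """What the application requests server-side to check the ticket CAS issued."""
    query = urlencode({"service": service, "ticket": ticket})
    return f"{CAS_BASE}/serviceValidate?{query}"


print(login_url(WEBMAIL))
print(validate_url(WEBMAIL, "ST-1-abc"))  # "ST-1-abc" is a placeholder ticket
```

If you already have a CAS session, the login redirect bounces straight back to the application with a ticket, which is why you only type your password once.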

Anyway, back to our Monday debacle. A big requirement of the initial uPortal deployment was that it had to stand up to peak demand, and by all measurements the portal meets and exceeds this requirement. So, where did it all go wrong? It all started deep in the core of the network…

The day the core network died

Around 2:30pm we noticed that network connectivity was gone. I think it’s time to interject a small quote from Rands.

The fact that a computer without an Internet connection is essentially a very expensive DVD player is a recent development, but the fact is, when I sit down at my MacBook and there is no wireless I think, “Well, I could play Bejeweled, right?”

When the network goes funky, CAS can lose touch with LDAP, and without LDAP it can’t authenticate people. Our killer app is useless at that point, and that’s when you know things aren’t going well. You also know your day is about to get worse, because the only solution is to bounce Tomcat. This, of course, means (in our current configuration) that we lose any existing user sessions. So, as soon as we are back up, we are inundated with a wave of people needing to re-authenticate. CAS was a “little” slow, but it eventually slogged through the onslaught of requests.

So, what’s with the odd bump in portal sessions after the network recovered? Scott surmised that it was phantom sessions. When the network went down many people probably closed their browser windows, killing their current session on the browser side but not on the portal side. After the server-side timeout was hit, the sessions (on the server) started to drop as Scott predicted.
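Scott’s phantom-session theory is easy to see in a toy model. The sketch below is not portal code; the 30-minute idle timeout and the two users are assumptions picked purely for illustration, since the actual timeout isn’t stated here.

```python
from dataclasses import dataclass

IDLE_TIMEOUT_MIN = 30  # assumed value, for illustration only


@dataclass
class Session:
    user: str
    last_request_min: int  # minute of the last request the server saw


def live_sessions(sessions, now_min, timeout_min=IDLE_TIMEOUT_MIN):
    """Sessions the server still counts: idle for less than the timeout."""
    return [s for s in sessions if now_min - s.last_request_min < timeout_min]


# Minute 0: the network drops. alice closes her browser, bob waits it out.
sessions = [Session("alice", 0), Session("bob", 0)]

# Minute 10: the network is back. bob carries on in his old session;
# alice logs in again and gets a brand new one.
sessions[1].last_request_min = 10
sessions.append(Session("alice", 10))

print(len(live_sessions(sessions, now_min=10)))  # alice's old session still counts
print(len(live_sessions(sessions, now_min=30)))  # her old session has timed out
```

The server has no way to know that alice’s first browser window is gone, so it keeps counting that session until the idle timeout expires. Multiply that by everyone who closed a window during the outage and you get a bump.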

Luckily we were in communication with the help desks and they were able to explain to people what was going on. It looks like a CAS 3.1 cluster could be useful sooner rather than later.

Tags: Portal