What isn't entirely obvious in the above graphs? These spikes happen inside 60 seconds. Provisioning more servers (virtual or not) in that window is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope, and that would assume a zero-second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced sudden traffic increases of 100% to 1000% across many of our clients. That's something to think about.
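To put rough numbers on why the 70% rule unravels, here is a minimal sketch of the headroom arithmetic. The function names are mine, and it assumes load scales linearly with traffic, which real systems only approximate.

```python
def load_after_spike(utilization, increase_pct):
    """Load as a fraction of total capacity after traffic grows by increase_pct percent."""
    return utilization * (1 + increase_pct / 100)


def max_absorbable_increase(utilization):
    """Largest percentage increase that still fits within 100% of capacity."""
    return (1 / utilization - 1) * 100


# Compare the old 40% planning figure with the spikes described above.
for pct in (40, 100, 1000):
    load = load_after_spike(0.70, pct)
    status = "fits" if load <= 1.0 else "overloaded"
    print(f"+{pct}% traffic -> {load:.2f}x capacity ({status})")
```

At 70% utilization a 40% increase lands just under full capacity, while a 100% increase already asks for 1.4 times what the system has, and a 1000% increase asks for 7.7 times.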
One likes to think that cloud computing is the silver bullet for scalability - use their stuff, build a system that assumes scale, and no worries, mate. But these spikes make it clear that there are basic operational issues that need to be kept in mind. Responding to them manually may not cut it.
Amazon, for example, provides no tools for handling traffic increases automatically. It's your job to monitor and kick off new instances. But if this is happening in a matter of 60 seconds, you'd better be very sure you know what to do, and quickly.
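As a rough illustration of the "monitor and kick off new instances" decision, here is a sketch of the sizing step only. Everything in it is hypothetical: the function names, the per-instance capacity figure, and the 70% target are assumptions of mine, and it calls no real cloud API. Actually launching instances, and the time that takes, is the hard part it leaves out.

```python
import math


def instances_needed(request_rate, per_instance_capacity, target_utilization=0.70):
    """Instances required to serve request_rate while staying at the target utilization."""
    usable_per_instance = per_instance_capacity * target_utilization
    return math.ceil(request_rate / usable_per_instance)


def instances_to_launch(request_rate, running, per_instance_capacity,
                        target_utilization=0.70):
    """How many extra instances the monitor should request (never negative)."""
    needed = instances_needed(request_rate, per_instance_capacity, target_utilization)
    return max(0, needed - running)


# Hypothetical numbers: 3 instances, each assumed good for 1000 requests/second.
normal = instances_to_launch(1500, running=3, per_instance_capacity=1000)
spike = instances_to_launch(15000, running=3, per_instance_capacity=1000)
print(normal, spike)
```

With these made-up figures, normal traffic needs nothing new, but a tenfold spike asks for 19 more instances at once, all of which would have to be serving traffic within the same 60 seconds.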
Personally, I've never had to deal with this, so I have no tips for you. If I were to start up a new site, I'd go find people who have and have a long talk with them.