From Larry Osterman again, I just can't help but comment on his link from today, Dare Obasanjo aka Carnage4Life - Amazon Developer on Replacing Operations with Developers.
I think this is a bit of a simplistic view. The idea that you hand your notes to the operations team in the morning and they find the root cause is okay if, and only if, your operations team has the time and is capable of that. In many places, as soon as they see a stack trace in the logs, that automatically means it's for a developer to look at, and they throw it right back. Or they have their own new "most important thing" and they don't look at it either.
For operations to be able to follow up on issues, they have to have a few developer-types on their team.
For development to not build sites and services that suck total ass to maintain, they have to have a few operations-minded people on their team.
I don't think you can replace operations with developers, but you have to have a system where the developers feel the pain. When schedules get cramped, it's all too easy to let stuff out that will cause production problems. Unless that pain is reflected onto the individuals responsible, at least a little bit, you're going to have dysfunction.
There should be division of labor, but it should be much blurrier than it usually is.
I find a good method is to have the developers get paged (or at least emailed at first) for every single error message out of the logs (scrubbed if necessary). Then they know what's going on in prod, and it causes them enough grief that they are motivated to fix it (although the best are motivated by personal pride also).
The one piece often left out of this equation is giving your developers enough access (read-only, please) to prod, or to prod log data, that they can actually investigate the problems that come down the pipe.
On the flip side, at least a few people in ops should absolutely understand the code enough to dig in a bit, assign problems to the right developers, and fix the occasional problem. If it's just a big black box, it's almost impossible to troubleshoot properly.
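As a rough sketch of the "scrubbed error mail" idea, something like the following could pull error lines out of a log and mask sensitive values before anything is sent to developers. The log format, the level names, and the two masking patterns are assumptions for illustration; a real site would need its own rules, and the actual paging or email step is left out.

```python
import re

# Hypothetical patterns for data that should not leave prod unmasked.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_NUMBER_RE = re.compile(r"\b\d{9,}\b")  # e.g. account or card numbers


def scrub(line):
    """Mask email addresses and long digit runs in a single log line."""
    line = EMAIL_RE.sub("<email>", line)
    return LONG_NUMBER_RE.sub("<number>", line)


def extract_errors(lines, levels=("ERROR", "FATAL")):
    """Return scrubbed copies of the lines logged at one of the given levels.

    Assumes the level appears as a space-delimited word in each line.
    """
    return [
        scrub(line.rstrip("\n"))
        for line in lines
        if any(f" {level} " in line for level in levels)
    ]


def build_digest(lines):
    """Build the body of a notification, or None if there is nothing to report."""
    errors = extract_errors(lines)
    if not errors:
        return None
    return f"{len(errors)} error(s) in production logs\n" + "\n".join(errors)
```

The digest text could then be handed to whatever mail or paging system is already in place.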
No pain, no gain. :-)
How about a world in which everyone can get full visibility into the metrics and raw data associated with a given application or site? Better yet, they get it without having to log in directly to any servers.
Perhaps this is a standard framework that gives any newly developed application this for free, just by adhering to a few standard ways of logging. Perhaps management might want a nice roll-up of this data across applications for a high-level view, with drill-down capabilities to get down to the transaction and raw event levels.
Apply these tools consistently through all of your environments so you have the same level of visibility everywhere. Provide a subscribe and unsubscribe option so that individuals or DLs can choose to get this data. Even better, since the data is centralized and normalized, build any number of tools against it to do any number of analyses. Develop it once, and anyone can use it, extend it, etc.
Want a SharePoint widget to show the number of originations? Easy, just call the service. Want to change that to show the number of alerts sent yesterday? Just modify the XML being sent to the service.
Now, this certainly does not address dumps, traces, etc., but it is a start. Take it a step further and build "administration" considerations into your application, such as adjusting logging levels, and you really start to be able to do a lot as an individual developer, while empowering your supporting operations teams with the same tools.
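To make the "few standard ways of logging" a little more concrete, here is a minimal sketch using Python's standard logging module, where every application gets the same JSON-per-line format from one shared helper. The field names and the get_app_logger helper are hypothetical, not part of any framework described here.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    def __init__(self, app_name):
        super().__init__()
        self.app_name = app_name

    def format(self, record):
        payload = {
            "app": self.app_name,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "time": self.formatTime(record),
        }
        return json.dumps(payload)


def get_app_logger(app_name, stream=None):
    """Return a logger that writes the shared JSON format (stderr by default)."""
    logger = logging.getLogger(app_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(JsonFormatter(app_name))
        logger.addHandler(handler)
    return logger
```

Because every line has the same shape, a central collector can parse output from any application without per-application rules.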
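Once events are centralized and normalized, the roll-up and drill-down views are small tools over the same data. The sketch below assumes each event is a dict with "app" and "event" keys, which is an illustrative shape only.

```python
from collections import Counter


def roll_up(events):
    """Count events per (app, event type) for a high-level view."""
    return Counter((e["app"], e["event"]) for e in events)


def drill_down(events, app, event):
    """Return the raw events behind one roll-up cell."""
    return [e for e in events if e["app"] == app and e["event"] == event]
```

Any number of further tools (dashboards, subscriptions, reports) could be built on the same two calls without touching the applications themselves.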
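Adjusting logging levels at runtime is one of the simpler administration hooks to build in. A minimal sketch with Python's standard logging module might look like this; the set_log_level function is hypothetical, and how it gets exposed (admin page, service call, config reload) is left open.

```python
import logging

# Level names an administrator is allowed to pass in.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def set_log_level(logger_name, level_name):
    """Change a named logger's level without restarting the application."""
    try:
        level = LEVELS[level_name.upper()]
    except KeyError:
        raise ValueError(f"unknown log level: {level_name!r}") from None
    logging.getLogger(logger_name).setLevel(level)
    return level
```

With a hook like this, ops can turn on debug output for one component while chasing a problem and turn it back down afterwards.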
This is Rob Meyer's weblog, a weblog focused on software development and system administration based on 10 years of experience. Want to explore further? You can find out more about me or see the rest of my website.