IPS team members should consult this playbook every time they participate in issue-resolution activity on behalf of ISDA.
Issue Definition
- Our primary point of contact for support issues is the isda-ops mailing list. Everyone on isda-ops@mit.edu should be monitoring this mailing list and the ISDA::Admin RT queue on help.mit.edu (https://help-mit-edu.ezproxy.canberra.edu.au/Search/Results.html?Order=DESC&Query=(%20Queue%20%3D%20'ISDA%3A%3AAdmin'%20)%20and%20(%20Status%20%3D%20'new'%20or%20Status%20%3D%20'open'%20or%20Status%20%3D%20'stalled'%20)&Rows=50&OrderBy=id&Page=1&Format=%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__id__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3A%23'%2C%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__Subject__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3ASubject'%2C%0A%20%20%20Status%2C%0A%20%20%20QueueName%2C%20%0A%20%20%20OwnerName%2C%20%0A%20%20%20Priority%2C%20%0A%20%20%20'__NEWLINE__'%2C%0A%20%20%20''%2C%20%0A%20%20%20'%3Csmall%3E__Requestors__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__CreatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__ToldRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__LastUpdatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__TimeLeft__%3C%2Fsmall%3E').
- Email representing immediate operational issues should be forwarded to RT (via the isda-admin-rt list), while email representing bug reports should be filed in Jira.
- If someone makes a support request through another channel, we email the issue to isda-ops (and file it appropriately in RT or Jira) to make sure the rest of the team is notified.
- Automated responses from monitoring software come through to isda-ops@mit.edu, where they can be monitored and forwarded to the RT queue if necessary.
- All team members in an Operations role who are on the floor must meet to discuss the issue. A person is selected to lead the resolution. This is the Tech Lead.
- If possible, the discussion must include non-operational staff who are assigned to, or conversant with, the system in question.
- The Tech Lead is not necessarily operations personnel. This assignment is up to staff available at the time of the issue report.
- The Tech Lead responds to the initiators of the message, alerting those parties that resolution is underway.
- The Tech Lead must define the type of issue and then proceed accordingly. Be sure to file the issue appropriately (Jira or RT) depending on type:
- Urgent Response: A system, whether that is a whole server or a particular application, is unresponsive.
- Bug Report: An application is not doing the right thing in some particular case, but it is not generally broken; a system is not down.
- N.B.: for Operations staff, both kinds of issues are priority #1. It is acceptable for all other responsibilities to be on hold until resolution (system down) or handoff (bug report).
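The classification and filing rule above can be sketched in Python. This is a minimal illustration, not part of any ISDA tooling: `IssueType`, `classify`, and `tracker_for` are hypothetical names, the boolean input stands in for the Tech Lead's judgment, and the sketch assumes that an Urgent Response corresponds to the "immediate operational issues" that are filed in RT.

```python
from enum import Enum


class IssueType(Enum):
    URGENT_RESPONSE = "Urgent Response"
    BUG_REPORT = "Bug Report"


def classify(system_unresponsive: bool) -> IssueType:
    """Hypothetical helper: an unresponsive server or application is an
    Urgent Response; anything else is treated as a Bug Report."""
    if system_unresponsive:
        return IssueType.URGENT_RESPONSE
    return IssueType.BUG_REPORT


def tracker_for(issue_type: IssueType) -> str:
    """Assumed mapping: operational issues are filed in RT, bug reports in Jira."""
    if issue_type is IssueType.URGENT_RESPONSE:
        return "RT"
    return "Jira"
```

For example, `tracker_for(classify(system_unresponsive=True))` yields `"RT"`, while a misbehaving but responsive application would be routed to `"Jira"`.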
Urgent Response (Resolution)
- For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
- Someone familiar with the system must check to see if restart procedures occurred and if that temporarily resolved the problem.
- If not, Tech Lead or designee restarts component manually, determines if this resolves issue or if a more persistent problem exists.
- Notification: preliminary problem description (and resolution, if applicable) sent to Recipient List:
initiator of problem ticket, isda-leaders@mit.edu, isda-integrators@mit.edu, isda-ops@mit.edu, and if any end-user applications could have been affected, computing-help@mit.edu
- In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
- It is IPS's expectation that Emergency Response takes precedence over other project work.
- SCRUM: No resolution work should proceed until SCRUM is performed with available resources to discuss process and possible resolutions.
- Tech Lead is project manager for duration of issue resolution. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
- Notification: If resolution is lengthy, Tech Lead will update the Recipient List on the status of the resolution at least once per day.
- Post Mortem: Tech Lead reviews response with IPS team lead. If emergency response offers the opportunity for improvement of process, Tech Lead calls a post-mortem with parties who participated in the resolution.
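The two notification rules in this section (who is on the Recipient List, and when a status update is due) can be sketched as follows. This is an illustrative sketch only: `recipient_list` and `update_due` are hypothetical helpers, and treating "at least once per day" as "no more than 24 hours since the last update" is an assumption rather than a stated policy.

```python
from datetime import datetime, timedelta

# Standing recipients named in the playbook.
BASE_RECIPIENTS = [
    "isda-leaders@mit.edu",
    "isda-integrators@mit.edu",
    "isda-ops@mit.edu",
]
# Added only when end-user applications could have been affected.
END_USER_HELP = "computing-help@mit.edu"


def recipient_list(initiator: str, end_user_apps_affected: bool) -> list[str]:
    """Build the Recipient List for an Urgent Response notification."""
    recipients = [initiator] + BASE_RECIPIENTS
    if end_user_apps_affected:
        recipients.append(END_USER_HELP)
    # Drop duplicates (e.g. the initiator is one of the lists) but keep order.
    return list(dict.fromkeys(recipients))


def update_due(last_update: datetime, now: datetime) -> bool:
    """Assumed reading of 'at least once per day': an update is due once
    24 hours have passed since the previous one."""
    return now - last_update >= timedelta(days=1)
```

A Tech Lead's tooling could call `update_due` on a schedule and send the status message to `recipient_list(...)` whenever it returns `True`.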
Bug Report (Handoff)
- Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
- Tech Lead collects information from managers on which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
- Tech Lead and managers determine staff members responsible for issue. This is the Team.
- Tech Lead sends message to Recipient List notifying them of the issue.
- Contact the Team and do preliminary troubleshooting to determine the nature of the issue and, therefore, the staff responsible for remedying it.
- Transfer ticket information to system of record for the issue resolvers.
- Notify the Recipient List of the transfer and the managers now responsible for the issue.
5 Comments
Paul B Hill
There must be a document that explains which systems and services isda-ops is responsible for. It should cross-index both the hostnames and the services running on those hosts.
Paul B Hill
Why doesn't isda-ops@mit.edu feed directly into an ISDA request tracker queue? It seems like a waste of effort to require the tech lead to later enter the data into RT.
If isda-ops has responsibility for the queues ISDA::WS-Support and ISDA::THALIA-SUPPORT, and additional queues to come, then the question becomes, is it faster to re-enter the data after receiving the mail, or is it faster to transfer the case from one queue to another. I believe that it should be faster to transfer the case to a different queue.
Andrew M Boardman
Some of the email to isda-ops is informational in nature, and when there's a problem report, it's not an individual message but an entire slew of them, many automatically generated via nagios and unthreaded. I don't see a way to organize these rationally without human intervention.
Also, it's not necessarily a question of transferring between queues; some of it needs to feed into Jira instead.
Paul B Hill
I have a concern about the statement, "All issue reports must come through isda-ops@mit.edu." During the last post mortem meeting I heard someone say that SAIS (the customer in this case) prefers to have a number to call instead of sending email.
What number should a customer call, and what procedures does that trigger?
Andrew M Boardman
The phone number is (to my knowledge) undetermined, but is planned to feed into voicemail which will be stored and forwarded as email.