IPS team members should consult this playbook every time they participate in issue-resolution activity on behalf of ISDA.

Issue Definition

  1. Our primary point of contact for support issues is the isda-ops mailing list. Everyone on isda-ops@mit.edu should be monitoring this mailing list and the ISDA:Admin RT queue on help.mit.edu (https://help-mit-edu.ezproxy.canberra.edu.au/Search/Results.html?Order=DESC&Query=(%20Queue%20%3D%20'ISDA%3A%3AAdmin'%20)%20and%20(%20Status%20%3D%20'new'%20or%20Status%20%3D%20'open'%20or%20Status%20%3D%20'stalled'%20)&Rows=50&OrderBy=id&Page=1&Format=%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__id__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3A%23'%2C%0A%20%20%20'%3Cb%3E%3Ca%20href%3D%22%2FTicket%2FDisplay.html%3Fid%3D__id__%22%3E__Subject__%3C%2Fa%3E%3C%2Fb%3E%2FTITLE%3ASubject'%2C%0A%20%20%20Status%2C%0A%20%20%20QueueName%2C%20%0A%20%20%20OwnerName%2C%20%0A%20%20%20Priority%2C%20%0A%20%20%20'__NEWLINE__'%2C%0A%20%20%20''%2C%20%0A%20%20%20'%3Csmall%3E__Requestors__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__CreatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__ToldRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__LastUpdatedRelative__%3C%2Fsmall%3E'%2C%0A%20%20%20'%3Csmall%3E__TimeLeft__%3C%2Fsmall%3E').
    1. Email representing immediate operational issues should be forwarded to RT (via the isda-admin-rt list), while email representing bug reports should be filed in Jira.
    2. If someone makes a support request through another channel, we email the issue to isda-ops (and file it appropriately in RT or Jira) to make sure the rest of the team is notified.
    3. Automated responses from monitoring software come through to isda-ops@mit.edu, where they can be monitored and forwarded to the RT queue if necessary.
  2. All team members in an Operations role who are on the floor must meet to discuss the issue. A person is selected to lead the resolution. This is the Tech Lead.
    1. If possible, the discussion must include non-operational staff who are assigned to, or conversant with, the system in question.
    2. The Tech Lead need not be Operations personnel. The assignment is up to the staff available when the issue is reported.
  3. The Tech Lead responds to the initiators of the message, alerting those parties that resolution is underway.
  4. The Tech Lead must define the type of issue and then proceed accordingly. Be sure to file the issue appropriately (Jira or RT) depending on type:
    1. Urgent Response: A system, whether that is a whole server or a particular application, is unresponsive.
    2. Bug Report: An application is not doing the right thing in some particular case, but it is not generally broken; a system is not down.
    3. N.B.: for Operations staff, both kinds of issues are priority #1. All other responsibilities may be put on hold until resolution (Urgent Response) or handoff (Bug Report).
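
The issue-type decision above, together with the filing rule (immediate operational issues go to RT, bug reports go to Jira), can be sketched as a small lookup. This is only an illustration: it assumes the decision reduces to a single yes/no signal (is a system unresponsive?), and the names `IssueType`, `Tracker`, `classify`, and `tracker_for` are hypothetical, not part of any existing ISDA tooling.

```python
from enum import Enum


class IssueType(Enum):
    URGENT_RESPONSE = "Urgent Response"
    BUG_REPORT = "Bug Report"


class Tracker(Enum):
    RT = "RT"
    JIRA = "Jira"


def classify(system_unresponsive: bool) -> IssueType:
    """Assumption: an unresponsive server or application is the only trigger
    for Urgent Response; everything else is treated as a Bug Report."""
    if system_unresponsive:
        return IssueType.URGENT_RESPONSE
    return IssueType.BUG_REPORT


def tracker_for(issue_type: IssueType) -> Tracker:
    """Operational issues are filed in RT; bug reports are filed in Jira."""
    if issue_type is IssueType.URGENT_RESPONSE:
        return Tracker.RT
    return Tracker.JIRA


if __name__ == "__main__":
    issue = classify(system_unresponsive=True)
    print(issue.value, "->", tracker_for(issue).value)
```

In practice the Tech Lead makes this call by judgment; the sketch only records where each type of issue ends up.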

Urgent Response (Resolution)

  1. For an unresponsive component, either the Server Operations team or an automated process should have attempted to restart the component.
    1. Someone familiar with the system must check whether the restart procedures ran and whether they resolved the problem, even temporarily.
    2. If not, the Tech Lead or a designee restarts the component manually and determines whether this resolves the issue or whether a more persistent problem exists.
  2. Notification: a preliminary problem description (and resolution, if applicable) is sent to the Recipient List:
    the initiator of the problem ticket, isda-leaders@mit.edu, isda-integrators@mit.edu, isda-ops@mit.edu, and, if any end-user applications could have been affected, computing-help@mit.edu.
  3. In conjunction with managers currently present, Tech Lead forms Team to troubleshoot issue.
    1. It is IPS's expectation that Urgent Response takes precedence over other project work.
  4. SCRUM: No resolution work should proceed until SCRUM is performed with available resources to discuss process and possible resolutions.
    1. Tech Lead is project manager for duration of issue resolution. Tech Lead is final arbiter for delegation of tasks, priorities, and timing.
  5. Notification: If resolution is lengthy, Tech Lead will update the Recipient List on the status of the resolution at least once per day.
  6. Post Mortem: Tech Lead reviews the response with the IPS team lead. If the response reveals an opportunity to improve the process, Tech Lead calls a post-mortem with the parties who participated in the resolution.
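
The two notification steps above can be sketched in code. This is a minimal illustration, not an existing tool: the function names `build_recipient_list` and `status_update_due` are hypothetical, and "at least once per day" is assumed to mean no more than 24 hours between updates.

```python
from datetime import datetime, timedelta

# Lists that always receive Urgent Response notifications.
STANDING_RECIPIENTS = (
    "isda-leaders@mit.edu",
    "isda-integrators@mit.edu",
    "isda-ops@mit.edu",
)
# Added only when end-user applications could have been affected.
END_USER_SUPPORT = "computing-help@mit.edu"
# Assumption: "once per day" is read as a 24-hour maximum gap.
UPDATE_INTERVAL = timedelta(days=1)


def build_recipient_list(initiator: str, end_user_apps_affected: bool) -> list[str]:
    """Return the Recipient List: initiator first, then the standing lists."""
    recipients = [initiator, *STANDING_RECIPIENTS]
    if end_user_apps_affected:
        recipients.append(END_USER_SUPPORT)
    # Drop duplicates (e.g. the initiator is one of the lists), keeping order.
    return list(dict.fromkeys(recipients))


def status_update_due(last_update: datetime, now: datetime) -> bool:
    """True when a day or more has passed since the last status update."""
    return now - last_update >= UPDATE_INTERVAL


if __name__ == "__main__":
    for address in build_recipient_list("reporter@example.edu", True):
        print(address)
```

The address `reporter@example.edu` is a placeholder for whoever initiated the ticket.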

Bug Report (Handoff)

  1. Tech Lead notifies a team leader or manager responsible for each tier of the system affected.
    1. Tech Lead collects information from managers on which mailing lists should receive notification of the issue. This is the Recipient List for this issue. Note it in the ticket.
    2. Tech Lead and managers determine staff members responsible for issue. This is the Team.
  2. Tech Lead sends message to Recipient List notifying them of the issue.
  3. Contact the Team and do preliminary troubleshooting to determine the nature of the issue and, therefore, which staff are responsible for remedying it.
  4. Transfer ticket information to system of record for the issue resolvers.
  5. Notify the Recipient List of the transfer and the managers now responsible for the issue.

  1. There must be a document that explains which systems and services isda-ops is responsible for. It should cross-index both the hostnames and the services running on those hosts.

  2. Why doesn't isda-ops@mit.edu feed directly into an ISDA request tracker queue? It seems like a waste of effort to require the tech lead to later enter the data into RT.

    If isda-ops has responsibility for the queues ISDA::WS-Support and ISDA::THALIA-SUPPORT, and additional queues to come, then the question becomes whether it is faster to re-enter the data after receiving the mail or to transfer the case from one queue to another. I believe that it should be faster to transfer the case to a different queue.

    1. Some of the email to isda-ops is informational in nature, and when there's a problem report, it's not an individual message but an entire slew of them, many automatically generated via nagios and unthreaded.  I don't see a way to organize these rationally without human intervention.

      Also, it's not necessarily a question of transferring between queues; some of it needs to feed into Jira instead.

  3. I have a concern about the statement, "All issue reports must come through isda-ops@mit.edu." During the last post mortem meeting I heard someone say that SAIS (the customer in this case) prefers to have a number to call instead of sending email.

    What number should a customer call, and what procedures does that trigger?

    1. The phone number is (to my knowledge) undetermined, but is planned to feed into voicemail which will be stored and forwarded as email.
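
For the cross-index of hostnames and services requested in the first comment above, one possible shape is a host-to-services mapping from which the reverse (service-to-hosts) view is derived, so the two never drift apart. This is a sketch only: the hostnames and service names below are placeholders, not real ISDA systems, and `index_by_service` is a hypothetical helper.

```python
from collections import defaultdict

# Placeholder inventory: hostname -> services running on that host.
HOST_SERVICES = {
    "host-a.example.edu": ["web-service", "database"],
    "host-b.example.edu": ["web-service"],
}


def index_by_service(host_services: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the host -> services mapping into service -> sorted hosts."""
    index = defaultdict(list)
    for host, services in host_services.items():
        for service in services:
            index[service].append(host)
    return {service: sorted(hosts) for service, hosts in index.items()}


if __name__ == "__main__":
    for service, hosts in sorted(index_by_service(HOST_SERVICES).items()):
        print(f"{service}: {', '.join(hosts)}")
```

Keeping a single source mapping and generating the other direction means a responder can look up either "what runs on this host?" or "which hosts run this service?" from the same data.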