@PhillyCDO
Created August 19, 2013 14:46
Postmortem of Crime Data Service Outage

Overview

On Monday, August 5th, 2013, we were notified via Twitter of some suspected issues with our crime incident data service. Upon investigating, we learned that updates to the crime data service and the static download file that accompanies it had not processed correctly over the previous weekend and were continuing to fail at regular update intervals.

Over the following week and a half, staff in the Office of Innovation and Technology (OIT) and the Philadelphia Police Department (PPD) worked to identify and correct the cause of the failure, so that data consumers could again rely on this data in third-party applications and tools.

Status

As of the morning of August 15th, all issues seem to have been resolved and updates to the data service appear to be working normally.

Root Cause

There were several issues that led to this service interruption.

First, and most importantly, a data processing script that runs nightly and updates both the database behind the public data service and the static download file failed to execute properly. This prevented data updates from occurring. IT staff did not receive timely notification of the incident because notices were triggered only when the script started but failed to complete successfully, not when the script failed to run at all. This approach has now been modified: notices are sent to IT staff both when script execution fails to complete properly and when the script itself is not run.
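The difference between the two alert conditions can be illustrated with a minimal sketch. This is not the City's actual monitoring code; the function name `check_update_job`, the 26-hour threshold, and the idea of recording "last started" and "last succeeded" timestamps are all assumptions made for illustration. The check is meant to run on a separate schedule from the update script, after the script's expected finish time, so that it still fires when the script never starts.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold: a nightly job should have started within the last 26 hours.
MAX_AGE = timedelta(hours=26)


def check_update_job(
    last_started: Optional[datetime],
    last_succeeded: Optional[datetime],
    now: datetime,
    max_age: timedelta = MAX_AGE,
) -> Optional[str]:
    """Return an alert message, or None if the job looks healthy.

    Assumes the update script records a timestamp when it starts and
    another when it finishes successfully (both hypothetical).
    """
    # Case 1: the script never started, or has not started recently.
    # A notice tied only to failed completion would miss this case.
    if last_started is None or now - last_started > max_age:
        return "update script did not run"
    # Case 2: the script started, but its latest run has no matching success.
    if last_succeeded is None or last_succeeded < last_started:
        return "update script ran but did not complete"
    return None
```

Because the check depends only on recorded timestamps, it can raise a notice even if the update script, or the scheduler that launches it, is silently broken.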

When the issues preventing the update script from running were resolved, we observed several instances of script execution failure caused by database changes and user permissions related to data used by this service. After discussions between OIT and PPD staff, all of these changes were accounted for and fixed, allowing the data update script to complete execution successfully.

Finally, intermittent network issues caused data updates to fail on a couple of occasions during the outage period. This created some confusion and made it harder to quickly diagnose and correct the other issues affecting updates.

Outcome

We deeply regret any inconvenience this service interruption may have caused to users of our data service. We are taking steps to enhance our monitoring of this and other public-facing data services from the City of Philadelphia to help ensure greater stability and reliability.

The City of Philadelphia is developing a comprehensive API management strategy that we hope will allow us to identify issues with our public services more quickly and to implement mitigation strategies more efficiently.

Any user of a City of Philadelphia data service that notices issues or has questions can report them in one of the following ways:
