Insight Partners is a VC firm whose portfolio includes companies such as Twitter, Docker, BlaBlaCar, HelloFresh and BMC. It developed the Periodic Table of Software Development KPIs to steer its portfolio.
The KPIs used by Insight Partners are a good starting point for companies that want to transform their business, as they are
a) high-level enough not to get lost in the weeds of a particular technology
b) tied to specific economic benefits and hence hard to game
c) broadly agreed upon in terms of their usefulness
While there are many useful KPIs among the ones mentioned here, we selected some of them as an initial proposal. Our personal preference for the top 10 KPIs would be ramp time (RT), deployment frequency (DF), developer happiness (DH), drag factor (DRF), PM and quality metrics, number of discrete outages (NDO), mean time to recover (MTTR), Apdex score (APD) and API/service performance (ASP).
However, which of them are chosen as the top 10 KPIs depends very much on a company's individual situation, and we are happy to have a deeper conversation with you about which may be most appropriate for you.
RT: Ramp Time - Length of time required for a new developer to push code to production for the first time.
Normal onboarding time varies from company to company but typically ranges between 2 and 8 weeks. With previous familiarity with GitHub and its workflows, it can typically be cut in half, saving 2 to 4 weeks of onboarding time per developer. Our evidence from previous customers even suggests that only 30 hours are needed on average to get a developer fully up to speed with GitHub's workflows and tooling.
<number of developers onboarded / switching platforms> / 52 weeks * <number of weeks saved> * <average annual developer salary>
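For illustration only (all figures hypothetical): with 50 developers onboarded or switching platforms per year, 3 weeks saved each and an average annual developer salary of $120,000, this works out to 50 * 3 / 52 * $120,000 ≈ $346,000 in savings per year.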
Training on proprietary version control, collaboration and issue tracking systems typically requires a dedicated trainer (or team). With GitHub, only minimal training is required, which is freely available and can be done in a self-paced, self-service manner:
<number of developers needing training> / <class size> * <average cost of training>
More than 40 million developers access GitHub each month, 4 million on a day-to-day basis. It is fair to say that the majority of all Open Source projects and personal Git repositories worldwide are hosted on GitHub. GitHub is used to teach software development in universities, at hackathons and in virtually any hands-on tutorial or workshop describing how to approach topics like mobile development, continuous integration, DevOps, etc.
Most developers, especially those in the 20 to 35 year old generation, have most likely already been exposed to GitHub and use it day to day for personal, university and professional purposes. GitHub's universal contribution workflow - the GitHub Flow - with its ability to experiment, collaborate, incorporate feedback and deploy will already be familiar, reducing onboarding time and the need for training significantly.
GitHub provides APIs and reports that show when a user account was initially created (and onboarded).
The first commit pushed to production can be identified by scanning the commit history on the production branches or by looking at the contributor graphs per repository.
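As a sketch of how both data points could be collected automatically, the snippet below uses the GitHub REST API (user lookup and the commits listing); the instance URL, access token, repository and the `production` branch name are placeholders to adapt to the actual environment:

```python
import requests

# Illustrative values; adjust for the actual GitHub Enterprise instance.
API = "https://github.example.com/api/v3"   # for GitHub.com this would be https://api.github.com
HEADERS = {"Authorization": "token <personal access token>"}

def onboarding_date(username):
    """Date the user account was created (start of the ramp)."""
    user = requests.get(f"{API}/users/{username}", headers=HEADERS)
    user.raise_for_status()
    return user.json()["created_at"]

def first_production_commit(owner, repo, username, branch="production"):
    """Earliest commit authored by the user on the production branch (end of the ramp)."""
    response = requests.get(
        f"{API}/repos/{owner}/{repo}/commits",
        headers=HEADERS,
        params={"sha": branch, "author": username, "per_page": 100},
    )
    response.raise_for_status()
    commits = response.json()
    # Commits are returned newest first; for brevity this sketch only looks at
    # the first page, so the last entry is the earliest of those commits.
    return commits[-1]["commit"]["author"]["date"] if commits else None
```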
DF: Deployment Frequency - Number of times per day/week/month that a team deploys.
HF: Hot Fixes - Count of HotFixes over a given interval
The number of times a team can deploy (either to production or to an internal system) correlates with how fast a team can
- fix an issue in production (reducing risk, HF)
- change course based on customer feedback (validated learning, create the right product)
- deliver new features to the customer (time to market)
- execute a deployment fearlessly
The less time it takes to go through the cycle from the initial scheduling of work to actual deployment, the higher the deployment frequency. To shorten the cycle time without compromising quality or audit compliance:
- the quality of the newly developed feature or bug fix has to be guaranteed by a mix of peer code review and automation
- all required stakeholders need a well-understood mechanism to get drawn into the conversation and give their comments / approval
- waiting times, where one group has finished its deliverables and is waiting for feedback from another, have to be reduced
GitHub's pull-request-based collaboration workflow, GitHub Flow, makes it possible to
- draw in interested stakeholders to give timely feedback
- get early feedback from other developers, detecting problems early in the process
- trigger continuous integration systems automatically and display their results in the conversation, with the ability to enforce quality gates
- trigger deployments based on CI and peer review results, inform all connected deployment systems and display their deployment results (see the sketch below)
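As an illustration of the deployment trigger mentioned in the last point, a deployment event can be created through the GitHub Deployments API once CI and peer review are green; the instance URL, access token and repository below are placeholders:

```python
import requests

# Illustrative values; instance URL, token and repository are placeholders.
API = "https://github.example.com/api/v3"
HEADERS = {"Authorization": "token <personal access token>"}

def deploy(owner, repo, ref, environment="production"):
    """Create a deployment for `ref` once CI and peer review are green.

    GitHub records the deployment event, notifies connected deployment systems
    via webhooks, and shows their status updates back in the pull request.
    """
    response = requests.post(
        f"{API}/repos/{owner}/{repo}/deployments",
        headers=HEADERS,
        json={
            "ref": ref,                 # branch, tag or SHA to deploy
            "environment": environment,
            "description": "Deployment triggered after successful CI and review",
        },
    )
    response.raise_for_status()
    return response.json()
```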
GitHub invented Hubot, the first ChatOps bot. ChatOps
- bridges the communication gap between developers and operations by making all system interaction visible from a common chat room
- enables new developers and operations people to learn from the very first day what the entire deployment cycle looks like; because everyone is effectively pairing all the time, it allows teaching by doing by making things visible
- places all related tools in the middle of the conversation, reducing communication and manual approval friction between Dev and Ops significantly
- allows hotfix and feature deployments to be triggered directly from chat, with immediate feedback on the impact on production (exceptions, response times; see the performance-related KPIs for details) and the ability to discuss them and change course if needed
An independent study from GitPrime that analyzed more than 9,400 commercial software projects showed that developer throughput (and hence deployment frequency) is 28 percent higher for projects run on GitHub compared to projects managed with Bitbucket or GitLab.
- Average number of pull requests opened / merged / closed per week (the pulse of a repository)
- Number of deployment events per week (both can be pulled from the API, as sketched below)
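A minimal sketch of how both numbers could be collected via the REST API (pull request and deployment listings), with the instance URL, access token and repository as placeholders:

```python
import requests
from datetime import datetime, timedelta, timezone

# Illustrative values; instance URL, token and repository are placeholders.
API = "https://github.example.com/api/v3"
HEADERS = {"Authorization": "token <personal access token>"}

def weekly_pulse(owner, repo):
    """Rough pulse of a repository: PRs opened and deployments created in the last 7 days."""
    since = datetime.now(timezone.utc) - timedelta(days=7)

    prs = requests.get(
        f"{API}/repos/{owner}/{repo}/pulls",
        headers=HEADERS,
        params={"state": "all", "sort": "created", "direction": "desc", "per_page": 100},
    ).json()
    opened_last_week = sum(
        1 for pr in prs
        if datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00")) >= since
    )

    deployments = requests.get(
        f"{API}/repos/{owner}/{repo}/deployments",
        headers=HEADERS,
        params={"per_page": 100},
    ).json()
    deployed_last_week = sum(
        1 for d in deployments
        if datetime.fromisoformat(d["created_at"].replace("Z", "+00:00")) >= since
    )

    return opened_last_week, deployed_last_week
```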
DH - Developer Happiness: Employee Net Promoter Score - Based on a scale of zero to ten, how likely is it you would recommend this company as a place to work?
Attracting and retaining top talent is probably the most important objective of any engineering organization, as more and more companies see their main ability to compete in the quality and speed of their IT-based innovation. The costs of rehiring, retraining and lost business opportunities due to the missing capability to execute often add up to 3 to 6 months of salary for the position.
attrition rate * <average monthly salary of a developer> * <number of developers> * 6 = costs related to low developer happiness
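For illustration only (all figures hypothetical): with 10 percent annual attrition, an average monthly developer salary of $10,000 and 200 developers, this works out to 0.10 * $10,000 * 200 * 6 = $1,200,000 per year.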
GitHub is the chosen and trusted community of more than 15 million developers and the majority of all Open Source projects. Being able to use GitHub within their companies as well is a huge attractor, as
- it allows developers to use the same appealing tools they use in personal projects and Open Source work within the enterprise
- it shows that the company is open to modern collaboration concepts like social coding and inner source
- it proves that the company takes developer productivity and job satisfaction seriously by providing the best tools money can buy
- it proves the company knows about GitHub and will most likely have a look at candidates' personal development portfolios on GitHub.com during the interview process
- it demonstrates the company is open to new, appealing technologies and methodologies like agile development and CI/CD, which lead to high-performing organizations (developers prefer to work with technologies and methodologies whose mastery will improve their market value and employability)
The State of DevOps 2016 Report from Forest Technologies found that employees in high-performing organizations were 2.2 times more likely to recommend their organization to a friend as a great place to work, and 1.8 times more likely to recommend their team to a friend as a great working environment.
We recommend designing a survey with just one question: "On a scale of zero (not likely at all) to ten (extremely likely), how likely is it you would recommend this company as a place to work?". Calculate the employee net promoter score of the results and use it for benchmarking.
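The employee net promoter score is the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6). A minimal calculation sketch with hypothetical survey data:

```python
def employee_net_promoter_score(responses):
    """Employee NPS: % promoters (scores 9-10) minus % detractors (scores 0-6).

    `responses` is a list of integers from 0 to 10, one per survey answer.
    Returns a score between -100 and +100.
    """
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return round(100 * (promoters - detractors) / len(responses))

# Example: 12 hypothetical survey answers -> eNPS of 33
print(employee_net_promoter_score([10, 9, 9, 8, 8, 7, 7, 10, 6, 5, 9, 10]))
```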
AT - Allocation of Time: % of time the dev team spends 'fire-fighting' vs. spending on planned work.
DRF - Drag Factor: % of capacity not spent towards sprint goals.
Developer time is some of the most precious time in a company, as it directly relates to the time spent delivering new features and products to customers. Unfortunately, much of developers' time is typically not spent developing but attending all-hands meetings or fire-fighting problems that arose because bugs were only found in production or during sprint review. The goal of development managers should be to free up as much developer time as possible to actually get things done and "ship them". Frequent shipping and early feedback on the features being developed are crucial to course-correct if the output does not match customer expectations (in both feature direction and feature quality).
AT * <sum of all developer salaries> = money spent on activities not contributing to customer value
DRF * <sum of all developer salaries> = money spent because of meeting/communication overhead
GitHub Flow is GitHub's recommended, asynchronous development workflow. It allows teams to develop new features and fix bugs while involving stakeholders very early in the process, preventing long development stretches without feedback on whether customer value is being added. Pull requests, a GitHub innovation and the heart of GitHub Flow, foster asynchronous communication, so the need for all-hands coordination meetings is much lower, allowing developers to focus on actual development. As communication within pull requests is text-based and searchable by everyone, any stakeholder interested in the particular work in progress is informed automatically, which prevents many of the misunderstandings that arise when somebody was not invited to a meeting or could not participate because of different time zones.
Whenever a pull request is updated with a new commit, connected continuous integration servers automatically provide their feedback on all code-quality-related KPIs and report whether any feature regression took place. Having this information at hand immediately after a change, as opposed to at deploy or sprint review time, prevents the majority of fire-fighting situations. This approach is known as shifting left in the DevOps and agile testing world.
As long as a pull request does not get merged (which can be prevented if feature regressions are spotted or quality KPIs are not met), it does not have any impact on production, enabling developers to experiment without risk and innovate to provide as much customer value as possible.
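One way to enforce such a gate is branch protection with required status checks and reviews via the REST API. A sketch, assuming a hypothetical `ci/quality-gate` status context; the instance URL, access token and repository are placeholders:

```python
import requests

# Illustrative values; instance URL, token and repository are placeholders.
API = "https://github.example.com/api/v3"
HEADERS = {"Authorization": "token <personal access token>"}

def require_quality_gate(owner, repo, branch="production"):
    """Protect the branch so pull requests cannot be merged until review and CI pass."""
    response = requests.put(
        f"{API}/repos/{owner}/{repo}/branches/{branch}/protection",
        headers=HEADERS,
        json={
            "required_status_checks": {
                "strict": True,                   # branch must be up to date before merging
                "contexts": ["ci/quality-gate"],  # hypothetical CI status context
            },
            "enforce_admins": True,
            "required_pull_request_reviews": {"required_approving_review_count": 1},
            "restrictions": None,
        },
    )
    response.raise_for_status()
    return response.json()
```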
An independent study from GitPrime that analyzed more than 9,400 commercial software projects showed a 30 percent higher (asynchronous) collaboration rate between developers on projects run on GitHub compared to projects managed with Bitbucket or GitLab.
- have a look at developers' calendars to see how many meetings take place per week
- https://hbr.org/2016/01/estimate-the-cost-of-a-meeting-with-this-calculator
- look at project-related KPIs to see how much unplanned work got scheduled during an iteration
NDO: Number of Discrete Outages - A primary leading indicator of application stability (high # of very short outages indicates instability).
MTTR: Mean time to recover - Measure of the time required for system/component to come back online (repair and recovery) after an outage.
HA: Host Alerts - Frequent checks of load, memory, CPU, IO, services, processes, etc.
SR: Server Resiliency - Number of servers/instances before a process/system component outage is suffered.
SS: Service Status - Frequent, end-to-end tests of high level site components along with individual services
U: Uptime - Break down the distribution of total site uptime across each component of the software/infrastructure (including 3rd party pieces).
Whenever a component of the software development platform goes down, software development comes to a halt for all developers. Connected systems like CI will not be able to report their results back into the platform, causing the entire deployment pipeline to fail.
NDO * MTTR * <number of developers and ops> * <average engineer salary per minute> = cost of outages in lost engineering time
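For illustration only (all figures hypothetical): 4 outages in a month, an MTTR of 30 minutes, 200 affected engineers and an average engineer salary of $1 per minute work out to 4 * 30 * 200 * $1 = $24,000 per month in lost engineering time.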
According to an IDC study, the average cost of an infrastructure failure for a Fortune 1000 company is $100,000 per hour.
GitHub Enterprise is based on the same code base as GitHub.com, which is the 65th most visited website on the planet. Outages are not acceptable, so the code base has been designed for resiliency and stability. GitHub Enterprise provides a high-availability setup out of the box: not a generic blueprint but a productized solution that can be configured within minutes. Backup and disaster recovery are also included as part of the virtual appliance and allow a completely new environment to be bootstrapped within minutes without any custom scripting or knowledge of the different data stores and their locations. GitHub Enterprise also provides clustering with a redundant component setup for customers with significant load above 5,000 users.
GitHub Enterprise ships with built-in monitoring tools, the ability to forward logs to a central place, SNMP connectivity and actionable recommendations on when to alert. All functionality is accessible and configurable from the web interface.
Service status for all services is shown directly in the site admin toolbar:
APD: Apdex Score - # of satisfied/tolerating requests as % of total requests.
ASP: API/Service Performance - Average time between a request arriving at an API/service and the response to that request (PLT for API calls)
PLT: Page Load Time - Time between client request to rendering the delivered result
PET: Page Execution Time - Time spent on server and browser to process and render result (PLT - network transport time)
PV: Page Visits - Number of requests, great indicator for developer activity
SC: Server Costs (Newly introduced) - Server costs to handle user requests with an acceptable response time
A performant software development platform is crucial for developer productivity as well as for the reliability and efficiency of connected tools like CI systems. Performance problems are one of the primary reasons for friction in the development cycle, significantly raising the threshold to contribute: studies have shown that whenever pages take longer than 700 ms to load (PLT > 700 ms), developers get frustrated and disengage (PV goes down). Long clone times from CI systems affect how fast features and bug fixes can be delivered to the customer. If API requests time out, long-running build chains have to be restarted.
<optimal developer velocity> - APD * <optimal developer velocity> = gap in developer velocity caused by bad performance
ASP * <number of API calls per day> = daily time spent waiting for the developer platform in all build and release cycles
Performance can usually be optimized by vertical and horizontal scaling. Vertical scaling adds more resources (CPU, memory, hard disks) to a single machine, whereas horizontal scaling adds more machines. Depending on the size of each server (vertical scaling) and the number of servers (horizontal scaling), server costs (SC) can have a significant impact on the total cost of ownership.
SC = <number of servers required to return requests within acceptable response time> * <server cost>
GitHub Enterprise shares the same code base as GitHub.com, which has to serve the requests of millions of users per day. Our virtual appliance is heavily performance-tuned to handle the requests of up to 10,000 users without creating frustrating experiences caused by slow interactions with the platform.
The requirements for GitHub Enterprise at up to 3,000 users are modest: 4 vCPUs, 32 GB of memory, 250 GB of attached storage, and 80 GB of root storage. The license allows for a highly available configuration and disaster recovery - and configuration of each of those is quite simple. GitHub Enterprise may scale up to 10,000 users in production on a single instance. For customers who push the limits of a single appliance, a clustering option is available at no additional charge.
GitHub publishes hardware recommendations for up to 5,000 users in our public-facing documentation, which are included in the table below. Many of our customers use a single virtual appliance to serve over 10,000 users. When they reach this size of deployment, we prefer to work together to configure the most performant single-instance environment.
Users | vCPUs | Memory | Attached Storage | Root Storage |
---|---|---|---|---|
500 - 3,000 | 4 | 32 GB | 250 GB | 80 GB |
3,000 - 5,000 | 8 | 64 GB | 500 GB | 80 GB |
3 servers (HA Pair, DR) x $6,000/server = $18,000 (pa)
- PET, PLT and the status of services are displayed directly in the header of every GitHub Enterprise site admin page.
- The built-in monitoring tools show the response times per user and API as well as the number of requests in real time, aggregated over 3 hours, 1 day, 1 week or 1 month.
- count the number of servers needed to run the entire development platform, including backup servers, identity management servers, high availability nodes, databases, wiki nodes, issue tracker nodes, search nodes and app server nodes
SAD: Seasonally-adjusted downtime - A measure of slow/impacted components in a site's infrastructure that are impacting customers based on time of the day, week, month or year.
Downtime has a very similar effect to an outage, with the exception that it can be better scheduled. With developers working around the globe and automated CI systems accessing business-critical development platforms all the time, it is often very challenging to find a time when systems can be disconnected. Maintenance is typically required whenever a) a software component has to be updated because of a feature or security update, b) a software configuration update has to be performed that cannot be done without restarting components, or c) an infrastructure component such as a database, data store, CPU or memory has to be upgraded.
SAD = time(a) + time(b) + time(c)
SAD * <number of developers and ops> * <average engineer salary per minute> = impact of system downtime on developer productivity
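For illustration only (all figures hypothetical): 480 minutes of planned downtime per year, 200 affected engineers and an average engineer salary of $1 per minute work out to 480 * 200 * $1 = $96,000 per year in lost productivity.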
According to an IDC study, the average total cost of unplanned application downtime per year is $1.25 billion to $2.5 billion for a Fortune 1000 company.
Furthermore, upgrades require careful pre-planning and simulation on a staging system. The more time is needed to
- set up a staging system
- apply the main system's configuration to the staging system
- learn about all procedures to upgrade and smoke test related components
- configure fail over and disaster recovery
- purchase and deploy necessary hardware
the more full-time administrators will be needed and the longer it will take before any security fix or new feature can be applied. There have been cases where upgrade planning consumed multiple months, which leaves the business at risk from unapplied security fixes and prevents developers from using the latest features to improve their productivity.
<time needed for upgrade planning, upgrades and escalations per year> / <one person-year> = number of FTE administrators
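For illustration only (all figures hypothetical): if upgrade planning, upgrades and escalations consume 300 hours per year and a person-year is roughly 2,000 working hours, that corresponds to 300 / 2,000 = 0.15 FTE administrators.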
GitHub Enterprise is shipped as a virtual appliance, so all components comprising the system are already included. Backups and most configuration changes can be done without any downtime. Upgrades to new features or security fixes automatically update all components and storage systems at once. Updating requires just a few minutes of downtime. Update steps can be tested first on a staging system, which is always covered by the GitHub Enterprise license. Staging systems can be filled with production data with a single command. Updates do not require any manual steps (other than initiating the update procedure, which can also be done via our API). No expert knowledge of any contained component or their relationships is required; there are no "it depends" answers or "general guidelines", just out-of-the-box procedures.
It is possible to automatically check and download updates from the web interface.
Maintaining GitHub Enterprise is often a task given to a DevOps Tools Administrator responsible for maintaining many tools in the development toolchain. Many of our large customers inform us it takes about 15 percent of one FTE’s time to maintain GitHub Enterprise.
- time-track activities around upgrade planning, administration, upgrades, staging and backups to determine administration costs
- track downtime per server via SNMP or collectd
- automatically check for updates from the web interface
DBC: Database Connection Count - # of live connection objects in memory. Correlating to a unit of time or event. Also useful to analyze time spent per table.
SL: Server/Load Performance - Plot of server load vs. memory usage, where each point is a server, color-coded by functional area (e.g., front-end, back-end, checkout, database).
AB: Aggregated Bandwidth - Measurement of bandwidth at key points across the network environment. Specific touch points help to understand the amount of traffic moving internally on the network. Too much chatter often signals that services are making more external calls than necessary.
Outages and slow performance of business-critical development and deployment platforms significantly impact developer productivity, production stability, customer satisfaction and time to market. Most outages and performance issues do not appear out of the blue but are the result of gradually increasing load on system resources like databases, CPUs, network bandwidth and disk space. Proper monitoring and alerting allow performance bottlenecks to be detected and addressed before they turn into real problems.
NDO * <cost of an outage> * <probability of proactively detecting an outage> = improvement potential of a better monitoring strategy
GitHub Enterprise provides built-in monitoring to display DBC, SL, AB and many other KPIs either in real time or aggregated over 1 day, 3 days, 1 week or 1 month. Our alerting policies are based on the KPIs mentioned above and have been developed from our experience of providing our virtual appliance to the more than 2 million developers working with it. Furthermore, it is possible to forward logs and configure external monitoring via SNMP.
Screenshots from the monitoring tools:
At GitHub we believe in providing developers with the best tools for the job. GitHub is widely regarded as the best-of-breed source control management and collaboration tool. While a suite approach can be quite attractive, it is a compromise on the quality of developer tooling, with potential impact on developer productivity, efficiency and happiness.
We recognize that other companies are best of breed in areas like continuous integration, project management and static code analysis, to mention just a few. As such, our main focus is to provide state-of-the-art REST and webhook APIs and to be the easiest tool to integrate with. With 19M+ developers using GitHub.com, we're happy to see the best software tools in the ecosystem integrated with GitHub.
Feedback from tools like continuous integration servers and external issue trackers, as well as deployment-related events, shows up directly in GitHub's collaboration workflow (pull requests). If quality or project management KPIs are not met, tools can block a pull request from being merged, creating customer-specific quality gates based on individual KPIs.
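Such a gate can be implemented by having the external tool post commit statuses; combined with a required status check on the target branch, a failing status blocks the merge. A minimal sketch, with the instance URL, access token, repository and the `kpi/quality-gate` context name as placeholders:

```python
import requests

# Illustrative values; instance URL, token and repository are placeholders.
API = "https://github.example.com/api/v3"
HEADERS = {"Authorization": "token <personal access token>"}

def report_quality_gate(owner, repo, sha, kpi_met, details_url=None):
    """Post a commit status that external tools can use to allow or block a merge.

    When the corresponding status context is configured as a required check,
    a 'failure' state prevents the pull request from being merged.
    """
    response = requests.post(
        f"{API}/repos/{owner}/{repo}/statuses/{sha}",
        headers=HEADERS,
        json={
            "state": "success" if kpi_met else "failure",
            "context": "kpi/quality-gate",  # hypothetical status context name
            "description": "Quality KPIs met" if kpi_met else "Quality KPIs not met",
            "target_url": details_url,
        },
    )
    response.raise_for_status()
    return response.json()
```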
Whether you connect custom tools your team has already built or choose from hundreds of pre-built integrations like Travis CI, JIRA, Slack, Atom, and Heroku, developers can get work done while tools melt into the background.
An independent study from GitPrime that analyzed more than 9,400 commercial software projects showed 12 percent lower code churn on projects run on GitHub compared to projects managed with Bitbucket or GitLab. The same study also attests GitHub-based engineers a lower operational risk when looking at three metrics: the number of files affected, the number of unique edit locations, and the number of lines affected per unit of change.
A wealth of third-party tools integrate with GitHub. Our Integrations Directory features some of the most popular and well-documented integrations offered, but there are many more available through third parties, as well as open source integrations on GitHub.com.