The service dispatcher within the App Engine app needs to log accounting info
for the current request, e.g. `log.Infof("T:%s:queries:%d", user, n)`. This
data will later be parsed out and used for quota and billing purposes.
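As a sketch against the classic App Engine Go runtime, where logging hangs off
the request context (the helper name and parameters are illustrative, not part
of the spec above):

```go
package dispatch

import "appengine"

// logAccounting emits one machine-parseable accounting line per request.
// The "T:" prefix marks the entry for the offline log parser.
func logAccounting(c appengine.Context, user string, queries int) {
    c.Infof("T:%s:queries:%d", user, queries)
}
```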
-------------------------------------------------------------------------------
In order to enable offline log processing, we need to implement a `/_logs`
handler on the App Engine app. This should validate against a shared secret
`key` and expose the [Log Query API] to callers.
[Log Query API]: https://developers.google.com/appengine/docs/go/log/reference
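A minimal sketch, assuming the secret arrives as a `key` form parameter and a
`sharedSecret` variable loaded from config (both assumptions); it streams a
simplified slice of the records exposed by the classic `appengine/log` package:

```go
package logs

import (
    "crypto/subtle"
    "fmt"
    "net/http"

    "appengine"
    "appengine/log"
)

var sharedSecret = "..." // loaded from config in practice

func logsHandler(w http.ResponseWriter, r *http.Request) {
    if subtle.ConstantTimeCompare([]byte(r.FormValue("key")), []byte(sharedSecret)) != 1 {
        http.Error(w, "forbidden", http.StatusForbidden)
        return
    }
    c := appengine.NewContext(r)
    // A real handler would also accept query parameters for time ranges,
    // offsets, &c. and relay them into the log.Query.
    res := (&log.Query{AppLogs: true}).Run(c)
    for {
        rec, err := res.Next()
        if err != nil {
            break // log.Done when exhausted, or a real error
        }
        fmt.Fprintf(w, "%d %s %s\n", rec.Status, rec.Method, rec.Resource)
    }
}
```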
-------------------------------------------------------------------------------
For error tracing on GAE we will be relying upon App Engine's builtin logging
support. To facilitate this, we need to enable logs retention up to a large
limit. A petabyte should be plenty :)
-------------------------------------------------------------------------------
An `aws2stat` daemon needs to be written which enables [Amazon CloudWatch] for
all of our EC2 instances, Elastic Load Balancers and DynamoDB tables. The
daemon should also download all of the CloudWatch metrics and upload them to
TempoDB at regular intervals.
[Amazon CloudWatch]: http://aws.amazon.com/cloudwatch/
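A rough skeleton for the daemon's main loop; the three helpers are hypothetical
stand-ins for the actual CloudWatch and TempoDB calls, and the interval is an
assumption:

```go
package main

import (
    "log"
    "time"
)

// Hypothetical stand-ins for the real CloudWatch/TempoDB plumbing.
func enableMonitoring() error          { return nil } // turn CloudWatch on everywhere
func fetchMetrics() ([]float64, error) { return nil, nil } // pull latest datapoints
func upload(points []float64) error    { return nil } // push them to TempoDB

func main() {
    if err := enableMonitoring(); err != nil {
        log.Fatal(err)
    }
    for range time.Tick(5 * time.Minute) { // interval is an assumption
        points, err := fetchMetrics()
        if err != nil {
            log.Printf("aws2stat: fetch: %v", err)
            continue
        }
        if err := upload(points); err != nil {
            log.Printf("aws2stat: upload: %v", err)
        }
    }
}
```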
-------------------------------------------------------------------------------
A generic mechanism should be added to `amp/runtime` that will allow our
daemons to reload their config files when sent a `SIGUSR1` signal.
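A minimal sketch of how `amp/runtime` might wire this up, assuming each daemon
registers a reload callback (the `OnReload` name is illustrative):

```go
package runtime

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

// OnReload invokes the daemon-supplied reload callback whenever the
// process receives SIGUSR1.
func OnReload(reload func() error) {
    ch := make(chan os.Signal, 1)
    signal.Notify(ch, syscall.SIGUSR1)
    go func() {
        for range ch {
            if err := reload(); err != nil {
                log.Printf("config reload failed: %v", err)
            }
        }
    }()
}
```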
-------------------------------------------------------------------------------
A `tweet2stat` daemon needs to be written which will take a config file
specifying queries, e.g.
```yaml
espra: "#espra OR espra.com"
all: espra OR espians OR https://alpha.espra.com
```
The daemon should then regularly query Twitter for tweets matching the
specified queries and update TempoDB with counts of any new tweets.
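Since the config is a flat mapping of series name to search query, it loads
straight into a string map; a minimal sketch using the third-party
gopkg.in/yaml.v3 package, with the polling and upload elided and the filename
an assumption:

```go
package main

import (
    "fmt"
    "os"

    "gopkg.in/yaml.v3"
)

func main() {
    data, err := os.ReadFile("tweet2stat.yaml") // filename is an assumption
    if err != nil {
        panic(err)
    }
    queries := map[string]string{} // series name -> Twitter search query
    if err := yaml.Unmarshal(data, &queries); err != nil {
        panic(err)
    }
    for name, q := range queries {
        fmt.Printf("%s: %q\n", name, q) // poll Twitter with q, upload count under name
    }
}
```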
-------------------------------------------------------------------------------
Daemons like `doozerd` and `statsy` should have fixed addresses within
clusters. Unfortunately, communicating with Elastic IPs from within EC2 incurs
charges. But it seems that we might be able to [get the internal ip address]
from the Public DNS name of an Elastic IP.
If this works, write a `get-host-for-elastic-ip` script which, when given an
Elastic IP address:
* Assigns it to a temporary EC2 instance
* Uses that to discover the Public DNS name for the IP
[get the internal ip address]: http://alestic.com/2009/06/ec2-elastic-ip-internal
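The trick relies on EC2's split-horizon DNS: queried from inside EC2, a Public
DNS name resolves to the instance's private address rather than the public
one. A quick way to check, with a made-up hostname standing in for the real
Public DNS name:

```go
package main

import (
    "fmt"
    "net"
)

func main() {
    // Hypothetical Public DNS name; from inside EC2 this should resolve
    // to the private 10.x.x.x address, avoiding Elastic IP charges.
    addrs, err := net.LookupHost("ec2-203-0-113-10.compute-1.amazonaws.com")
    if err != nil {
        panic(err)
    }
    fmt.Println(addrs)
}
```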
-------------------------------------------------------------------------------
An `amp/tempodb` package needs to be written to support the [TempoDB API].
[TempoDB API]: http://tempo-db.com/docs/api/
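A sketch of the write path, assuming the v1 REST endpoint for posting
datapoints to a series by key and HTTP Basic auth with the API key/secret
(verify both against the [TempoDB API] docs):

```go
package tempodb

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// DataPoint mirrors TempoDB's {"t": ..., "v": ...} JSON datapoint.
type DataPoint struct {
    T time.Time `json:"t"`
    V float64   `json:"v"`
}

type Client struct {
    Key, Secret string // API credentials, sent via HTTP Basic auth
}

// WriteKey posts datapoints to the series identified by key. The
// endpoint path is an assumption based on the v1 REST API.
func (c *Client) WriteKey(series string, points []DataPoint) error {
    body, err := json.Marshal(points)
    if err != nil {
        return err
    }
    url := fmt.Sprintf("https://api.tempo-db.com/v1/series/key/%s/data/", series)
    req, err := http.NewRequest("POST", url, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.SetBasicAuth(c.Key, c.Secret)
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("tempodb: %s", resp.Status)
    }
    return nil
}
```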
-------------------------------------------------------------------------------
Write an `amp/dynamodb` package that supports reading and writing data to [DynamoDB].
[DynamoDB]: http://aws.amazon.com/dynamodb/
-------------------------------------------------------------------------------
Implement an `amp/statsy` package that provides a clean API for sending
metrics to a `statsy` daemon over UDP. In addition, a `statsy.ProcInfo()`
function should be provided which sends info about the current process's
resource usage, e.g. CPU, resident memory, &c.
Since sending a message for everything could get overwhelming, an API should
be provided to sample the data, e.g.
`NewTimer("upload").Sample(100).Every(5 * time.Second)`
The sampling rate should then adapt in real-time to reflect changes in load.
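A minimal sketch of the sampled send path with a fixed rate (the adaptive
`Sample`/`Every` chaining is elided); the statsd-style wire format
(`name:value|@rate`) is an assumption, not part of the spec above:

```go
package statsy

import (
    "fmt"
    "math/rand"
    "net"
)

// Client sends sampled metrics to a statsy daemon over UDP.
type Client struct {
    conn net.Conn
    rate float64 // fraction of events actually sent, in (0, 1]
}

func Dial(addr string, rate float64) (*Client, error) {
    conn, err := net.Dial("udp", addr)
    if err != nil {
        return nil, err
    }
    return &Client{conn, rate}, nil
}

// Count records n occurrences of name, subject to sampling; the daemon
// can divide by rate to recover an estimate of the true count.
func (c *Client) Count(name string, n int) {
    if rand.Float64() > c.rate {
        return
    }
    fmt.Fprintf(c.conn, "%s:%d|@%g", name, n, c.rate)
}
```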
-------------------------------------------------------------------------------
Implement a `statsy` daemon which receives metrics data over UDP within a
cluster and uploads the data to TempoDB. When sending the data, it should
aggregate certain classes of metrics and account for sampling rates. And it
should signal its own resource usage by sending `statsy.RawProcInfo()`.
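The receive side, sketched; the port and packet size are assumptions, and the
parsing, aggregation and TempoDB flush are elided:

```go
package main

import (
    "log"
    "net"
)

func main() {
    conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 8125}) // port is an assumption
    if err != nil {
        log.Fatal(err)
    }
    buf := make([]byte, 1500) // one metric per datagram
    for {
        n, _, err := conn.ReadFromUDP(buf)
        if err != nil {
            log.Printf("statsy: read: %v", err)
            continue
        }
        handle(buf[:n]) // parse, aggregate, and periodically flush to TempoDB
    }
}

func handle(msg []byte) { /* elided */ }
```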
-------------------------------------------------------------------------------
Add a `DynaLog` network logging option to the `amp/log` package so that it can
persist log entries to [DynamoDB]. It should automatically buffer unsent log
items to disk, so that they can be resent once DynamoDB is responsive again.
Also, add an option to automatically nuke standard file logs within
`amp/runtime`.
[DynamoDB]: http://aws.amazon.com/dynamodb/
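One possible shape for the disk fallback, as a sketch; the `DynaLog` fields
are assumptions and the background resend loop is elided:

```go
package log

import (
    "os"
    "sync"
)

// DynaLog writes entries to DynamoDB via send, spooling to disk when
// DynamoDB is unresponsive; a background loop (elided) replays the
// spool once writes start succeeding again.
type DynaLog struct {
    mu    sync.Mutex
    spool *os.File           // on-disk buffer of unsent entries
    send  func([]byte) error // writes one entry to DynamoDB
}

func (d *DynaLog) Emit(entry []byte) {
    if err := d.send(entry); err != nil {
        d.mu.Lock()
        defer d.mu.Unlock()
        d.spool.Write(append(entry, '\n'))
    }
}
```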
-------------------------------------------------------------------------------
Set up `statsy` within our EC2 clusters. Each instance should be given a
perma-hostname corresponding to its cluster name, e.g. `st-us1.espra.com`.
-------------------------------------------------------------------------------
Set up `doozerd` within our EC2 clusters. Each cluster should run 3 instances
with distinct Elastic IPs attached and be given a perma-hostname corresponding
to its cluster name, e.g. `dz-us1.espra.com`.
-------------------------------------------------------------------------------
We need 4 Elastic IPs in each cluster:
* doozerd (3)
* statsy (1)
We therefore need to ask Amazon to [increase the address limit] to 8 for us.
[increase the address limit]: http://aws.amazon.com/contact-us/eip_limit_request/
-------------------------------------------------------------------------------
Write a `remonit` daemon that can be deployed at multiple locations outside of
our core infrastructure in order to monitor uptime, latency and response
times. It should support:
* DNS lookups
* HTTPS requests
The results should be collated and uploaded to TempoDB. Any network failure
should result in a timestamped error file being written with additional info
like traceroutes, name servers and their IPs, host IP, certificates, times
taken, partial contents, &c.
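A sketch of a single probe, timing a DNS lookup and an HTTPS GET; the error
file and TempoDB upload are elided:

```go
package main

import (
    "fmt"
    "net"
    "net/http"
    "time"
)

// probe times a DNS lookup and an HTTPS GET for one host.
func probe(host string) error {
    start := time.Now()
    if _, err := net.LookupHost(host); err != nil {
        return fmt.Errorf("dns: %w", err)
    }
    dns := time.Since(start)

    start = time.Now()
    resp, err := http.Get("https://" + host + "/")
    if err != nil {
        return fmt.Errorf("https: %w", err)
    }
    resp.Body.Close()
    fmt.Printf("%s dns=%s https=%s status=%d\n", host, dns, time.Since(start), resp.StatusCode)
    return nil
}

func main() {
    if err := probe("alpha.espra.com"); err != nil {
        fmt.Println("error:", err) // a real run would write the error file here
    }
}
```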
-------------------------------------------------------------------------------
We need to decide between [harvestd] and [Diamond] for monitoring core server
metrics on our EC2 instances. In either case, a standard configuration needs
to be put together and a handler needs to be written that sends the metrics to
our `statsy` daemons.
[diamond]: https://github.com/BrightcoveOS/Diamond
[harvestd]: https://github.com/mk-fg/graphite-metrics
-------------------------------------------------------------------------------
An `espra/statsy` package needs to be written which exposes something similar
to the `amp/statsy` interface for capturing metrics. But since this will be
running inside of App Engine, instead of sending the metrics to a `statsy`
daemon over UDP, we need to capture the info in `memcache` which then gets
aggregated and uploaded to TempoDB using a `/_statsy` task queue handler.
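A sketch of the capture side using the atomic counters in the classic
`appengine/memcache` API; the key prefix is an assumption, and the `/_statsy`
task would read and reset these keys before uploading to TempoDB:

```go
package statsy

import (
    "appengine"
    "appengine/memcache"
)

// Incr bumps a named counter atomically; Increment initialises the
// key to 0 first if it is missing.
func Incr(c appengine.Context, name string, n int64) {
    if _, err := memcache.Increment(c, "statsy:"+name, n, 0); err != nil {
        c.Warningf("statsy: increment %s: %v", name, err)
    }
}
```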
-------------------------------------------------------------------------------
Key business metrics like the following must be captured within the app:
* Sign ups
* Upgrades/Downgrades/Cancellations
* Logins
* Successful Payments and Overdues
Metrics must support additional flags specifying any content optimisation
factors, e.g. "blue-button", "orange-button.upgrade-now-text", &c.
-------------------------------------------------------------------------------
There needs to be a `/_timings` handler on an App Engine backend instance
which receives info from browsers relating to Navigation Timing. This should
aggregate the data elements and send the data to TempoDB.
-------------------------------------------------------------------------------
We can implement `timings.coffee` which uses the Navigation Timing API in
browsers like Firefox and Chrome to send back network/page load times to the
App Engine app. Whilst this data can't be fully trusted, it gives us enough
data to work with for now.
-------------------------------------------------------------------------------
Write an `espra-mission-control` app that aggregates all of our devops
metrics, logging and analytics in one central place.
Care should be taken to host this on infrastructure independent of the rest of
our stack, i.e. not on Route53, DNSMadeEasy, EC2 or GAE. Even the domain used
should be different, i.e. not `espra.com`. Perhaps Hetzner and Linode DNS
would be viable hosts.
Mission Control will grab the time-series data stored in TempoDB and display
it as sexy graphs and aggregated counts in the browser. It should be possible
to transform the displayed data with custom aggregate and map functions as
well as correlate metrics against one another. It might be useful to have an
option that kills outliers above the 90th percentile for timing-related
metrics.
Since we will be looking at this all day, Mission Control must look pretty. At
least as pretty as [Librato Metrics] and [Geckoboard]. For all of the time-
series graphing, the sexy [Cubism.js] library along with [Rickshaw] should
help in this regard.
Metrics could be associated with additional metadata for display purposes,
including custom icons for triggers (e.g. flurry of tweets, performance-
killing deploy, &c.). It should be possible to save custom views under a given
name. All of this info should be persisted to config for use on reload.
The info/error logs stored to both DynamoDB and within App Engine should be
viewable directly from Mission Control. Though, given that it's possible for
App Engine frontends to be down whilst their admin dashboard is still up, a
direct link to the dashboard for error logs wouldn't hurt as a backup.
For request logs, Mission Control should cache and serve from BigQuery, with
the ability to drill down into standard request analytics around user, IP,
service, resource, geo-location, &c. through both batch queries and ad-hoc
interactive queries.
The relative number of requests to our origin and CDN servers can be used to
spit out a CDN hit/miss metric. And metrics from `aws2stat` could be used to
suggest adding extra instances within a given cluster.
The `/_get_all_states` handler on the App Engine app can be used to provide
specific info on the recent-ness and the level of backlogged-ness of services
like `aws2stat`, `logs2stat`, `remonit`, `tweet2stat`, &c.
It should be possible to configure alerts via e-mail (Mandrill), SMS (Nexmo)
or web hooks when certain conditions are met within a certain time period:
* No metrics for a given service.
* No metrics for a given service from at least N subs.
* Metrics above or below a given threshold.
The Mandrill and Nexmo accounts used must be independent of the accounts used
on our main App Engine app. And, finally, Mission Control should also ensure
that all of our SSL certificates are checked for expiry and send out an alert
every day for the 15 days before expiry.
[Cubism.js]: http://square.github.com/cubism/
[Geckoboard]: http://www.geckoboard.com/
[Librato Metrics]: https://metrics.librato.com/
[Rickshaw]: http://code.shutterstock.com/rickshaw/
-------------------------------------------------------------------------------
There needs to be an `/_accounting` handler on the App Engine app which our
services outside of GAE can call to report resource usage by users. This
needs to be accompanied by a taskqueue handler which then updates the billing
and invoices for those users.
-------------------------------------------------------------------------------
Implement `logs2stat` which:
* Routinely grabs the App Engine logs exposed via the `/_logs` handler, parses
  out the requests and accounting info, then aggregates them before uploading
  metrics to TempoDB (e.g. req/s, browser, &c.) and structured data for
  analytics to BigQuery.
* Does the same by grabbing request logs from DynamoDB in our various
  clusters and then uploading to TempoDB and BigQuery.
* Does the same again by grabbing request logs for our CDN.
* Aggregates accounting info and calls the `/_accounting` handler on our App
  Engine app with data relating to user accounts.
It should be possible to provide sharding factors to `logs2stat` so that if
our log data becomes too much for a single server to sync, the work can be
split across multiple machines at the same time.
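One simple way to express a sharding factor, as a sketch: hash each log key
and let instance `i` of `n` claim only its own bucket (all names are
illustrative):

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// mine reports whether this instance (index i of n) should process the
// given log key; a stable hash keeps the split consistent across runs.
func mine(key string, i, n uint32) bool {
    h := fnv.New32a()
    h.Write([]byte(key))
    return h.Sum32()%n == i
}

func main() {
    fmt.Println(mine("user:tav", 0, 4)) // instance 0 of 4
}
```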
-------------------------------------------------------------------------------
Implement a set of handlers on our App Engine app for storing and retrieving
state/config data for our external daemons/apps like `logs2stat` and Mission
Control, in case they go down and have to resume from a certain point:
* `/_init_state` should take a `key` and `secret` and initialise a given
  state. An HTML form should be presented if no parameters are set and the
  handler should only be callable by admins, i.e. `user.IsAdmin()`.
* `/_set_state` should take a `key`, `secret` and `value` and store the
  key/value with a timestamp (sketched below).
* `/_get_state` should take a `key` and `secret` and return the stored value
  and timestamp.
* `/_get_all_states` should return a list of the keys, values and timestamps
  for all stored states using a special master secret.
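A sketch of `/_set_state` against the classic App Engine Go APIs; the `State`
entity shape and the `validSecret` check (presumably against whatever
`/_init_state` stored) are assumptions:

```go
package state

import (
    "net/http"
    "time"

    "appengine"
    "appengine/datastore"
)

// State holds one stored value plus the time it was written.
type State struct {
    Value     string
    Timestamp time.Time
}

func setState(w http.ResponseWriter, r *http.Request) {
    c := appengine.NewContext(r)
    key, secret, value := r.FormValue("key"), r.FormValue("secret"), r.FormValue("value")
    if !validSecret(c, key, secret) { // hypothetical check against /_init_state
        http.Error(w, "forbidden", http.StatusForbidden)
        return
    }
    k := datastore.NewKey(c, "State", key, 0, nil)
    if _, err := datastore.Put(c, k, &State{value, time.Now()}); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
    }
}

func validSecret(c appengine.Context, key, secret string) bool {
    return false // real check elided
}
```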
-------------------------------------------------------------------------------
Write a minimal `dme-route53` tool that lets us CRUD records on both DNS Made
Easy and Route53 simultaneously.
-------------------------------------------------------------------------------
Our production `bolt` deployment script on the deployment server should
automatically add deployment metrics to TempoDB so that overall system
performance can be correlated to:
* App Engine Deploys
* DNS Updates
* EC2 Cluster Deploys
* Provisioning of EC2 Instances
-------------------------------------------------------------------------------
Auditing should be enabled on the deployment server.
-------------------------------------------------------------------------------
Both the deployment server and the Mission Control app server should be
firewalled off from the wider internet and only be available over secure
channels.