Beloved users, and whoever else may find it of interest,
We recently made the switch from Apache to nginx+uwsgi. Well, I say recently: as far as I can tell from the commit logs we started work on it around July 10th, so that's fully six weeks ago. We just deployed it this week, and after a bumpy first few days it seems to be settling in well. We thought we'd share why we switched, and how it's going.
We switched because:
- we need to dynamically configure new virtualhosts for our users
- Apache won't dynamically load config, it needs a restart graceful, which is still too disruptive
- a single, generic Apache config that handles all users is getting unwieldy
- nginx+uwsgi will make it much easier for us to do things like static files and per-user config tweaks
- we're also hoping for some performance improvements
Our experience was:
- sure enough, nginx+uwsgi is much simpler to configure and much more flexible. but:-
- Apache's model is to only start workers when they're needed. uwsgi starts them all up-front. That took some hacking, since we actually wanted the on-demand behaviour.
- no perf. improvements out-of-the-box, but we think there's lots of potential for tuning
- we did plenty of functionality testing, but not enough load testing, so our first deploy was a little rough.
Onto the detail!
For those who don't know, PythonAnywhere allows users to develop and also host Python apps, either on our domain via a username.pythonanywhere.com subdomain, or from their own domain names. So we need to be able to dynamically add, remove and update virtualhost configs. One way or another, our web servers need to look at an incoming request and, based on the domain name, route it to the appropriate Python WSGI application.
With Apache, there are two ways of doing this - you can either write a config file for each user and domain, or you can try to write a single global config file that can handle any domain and point it to the right WSGI app. The problem is that the first approach requires you to bounce Apache for each config change - not nice: even if restart graceful doesn't disconnect workers until they've finished existing requests, it still means that new requests have to wait for the server to spin up again - we don't want to be taking down one of our web servers, even for just a few seconds, every time a new user signs up!
Meanwhile the second approach is horrible to configure, and won't let us do everything we want anyway.
Let's take a little peek at some of the contortions we were putting ourselves through in the old days:
LogFormat "%{cased_username}e|%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined-vhost
LoadModule vhost_dbd_module /usr/lib/apache2/modules/mod_vhost_dbd.so
DBDExptime 5
<VirtualHost *:80>
    ServerAlias *
    FcgidMaxRequestLen 2097152
    DBDriver mysql
    DBDParams host=mysql.server,user=django_anywhere,pass=MYSQL_PASSWORD,dbname=anywhere
    DBDocRoot "SELECT concat('/mnt/user_storage/webapps/', lower(username), '/escalate_privileges.py'), username as cased_username from auth_user, webapps_userdomain where auth_user.id = webapps_userdomain.user_id and webapps_userdomain.domain_name = %s" HOSTNAME
    CustomLog "|/usr/bin/python -u /home/anywhere/user_logs/splitter.py" combined-vhost
    <DirectoryMatch /mnt/user_storage/webapps/(.+)>
        Order allow,deny
        Allow from all
        Options +ExecCGI
        AddHandler fcgid-script .py
    </DirectoryMatch>
</VirtualHost>
Can you see what's going on in there?
Well, you might spot that we foolishly allowed mixed-case usernames, and that it's causing us no end of trouble. Sadly we're just going to have to suck it up - nginx+uwsgi isn't going to make that any easier for us.
Next you might see we've had to hand-roll a script called splitter.py that pipes user access logs to their own files instead of the default Apache ones.
Not ideal perhaps, but it works.
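For the curious, here's roughly what a splitter along those lines could look like. This is a minimal sketch, not our real splitter.py: it assumes each line arrives prefixed with the username and a pipe character (as in the LogFormat above), and the function names and the "<username>.access.log" file naming are made up for illustration.

```python
import os


def split_line(line):
    """Split 'username|rest of log line' into (username, rest).

    Returns None when there is no usable username prefix.
    """
    username, sep, rest = line.partition("|")
    # Apache logs "-" when the variable is unset; also refuse anything
    # that could escape the log directory.
    if not sep or username in ("", "-") or "/" in username or username.startswith("."):
        return None
    return username, rest


def append_to_user_log(log_dir, line):
    """Append one access-log line to the matching per-user file."""
    parsed = split_line(line)
    if parsed is None:
        return None
    username, rest = parsed
    path = os.path.join(log_dir, username + ".access.log")  # hypothetical naming
    with open(path, "a") as log_file:
        log_file.write(rest if rest.endswith("\n") else rest + "\n")
    return path


def main(stream, log_dir):
    # Apache keeps a piped logger running and feeds it one line per request.
    for line in stream:
        append_to_user_log(log_dir, line)
```

Hooked up as a piped logger, you'd call something like main(sys.stdin, "/some/log/dir") at the bottom of the script.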
The real pain starts in those DBDocRoot params: we're using mod_vhost_dbd to get Apache to check with the database for each request in order to figure out which Python WSGI file to send it to. Ouch.
But that's not the end of it - escalate_privileges.py hints at a new world of ugly.
#!/usr/bin/python
# Copyright (c) 2012 Resolver Systems Ltd.
# All Rights Reserved
#
import os
import pwd
import signal
import subprocess
import sys
from anywhere.jails.spawn import spawn_unpiped_chrooted_process

def get_username_from_filename():
    lowercase_username = os.path.basename(os.path.dirname(__file__))
    for pwd_rec in pwd.getpwall():
        if pwd_rec.pw_name.lower() == lowercase_username:
            return pwd_rec.pw_name
    raise Exception("No such user %r" % (lowercase_username,))

def get_signal_handler(pid):
    def handler(signum, _):
        subprocess.call(['/usr/bin/sudo', '/bin/kill', str(pid)])
    return handler

def main():
    username = get_username_from_filename()
    process = spawn_unpiped_chrooted_process(
        '/usr/bin/python /bin/serve_wsgi.py %s %s' % (username, " ".join(sys.argv[1:])),
        username
    )
    handler = get_signal_handler(process.pid)
    signal.signal(signal.SIGUSR1, handler)
    signal.signal(signal.SIGTERM, handler)
    process.wait()

if __name__ == '__main__':
    main()
Yep, that script's purpose is to spawn a process that chroots to the user's home directory (and note that the only way we can get the username is to read the current directory name - Apache point blank refuses to do anything helpful like set an environment variable for us). I'll spare you the additional fun of serve_wsgi.py, which is just for logging (the counterpart to splitter.py from earlier). We then hand off to the user's WSGI app.
Our Apache setup works (if it ain't broke...) for now, even if it does involve a database hit for every request, two hacks for access and error logs, and a strange dance involving passing parameters in the form of directory names to a custom chrooting script... But even discounting all that, it still limits us in terms of what we can do for the future.
For static file handling, we'd have to implement something a little like this:
RewriteCond /mnt/user_storage/homedirs/%{ENV:subdomain}/var/www/static%{REQUEST_URI} -f
RewriteRule ^/(.*) /mnt/user_storage/homedirs/%{ENV:subdomain}/var/www/static/$1 [L]
<DirectoryMatch /mnt/user_storage/homedirs/(.*)/var/www/static/>
    Order Allow,Deny
    Allow from All
    Options SymLinksIfOwnerMatch
</DirectoryMatch>
<Directory /mnt/user_storage/homedirs>
    Options SymLinksIfOwnerMatch
</Directory>
That tells Apache to check every request against a folder in the user's personal space, and uses mod_rewrite to return static files from that folder if they exist. We can even get symlinks working, at a pinch.
But, we can't support a custom location for static files for each user.
Moreover, mod_vhost_dbd severely constrains what we can do. You might have spotted these two lines, which appear to show us using some arbitrary info retrieved from the database to process requests:
LogFormat "%{cased_username}e|%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined-vhost
DBDocRoot "SELECT concat('/mnt/user_storage/webapps/', lower(username), '/escalate_privileges.py'), username as cased_username from auth_user, webapps_userdomain where auth_user.id = webapps_userdomain.user_id and webapps_userdomain.domain_name = %s" HOSTNAME
Except that the %{cased_username} variable isn't actually available until about half-way through serving the request... by which time it's too late to use it to, say, specify a custom directory for the WSGI script, or static files...
So Apache makes it hard for us to do things like per-user configuration for where to serve static files from.
Here's another bit of fun. Apache would often leave zombie processes lying around, which would start to eat up resources on the server. So, we had a daily "cleanup" job that looked like this:
# Kill all user-owned processes that are children of init (i.e. not a child of apache or its workers)
python -c "import psutil; [p.kill() for p in psutil.process_iter() if p.gids.real == 60000 and p.parent.pid==1]"
# Kill serve_wsgis that are owned by root and are not children of apache
python -c "import psutil; [p.kill() for p in psutil.process_iter() if 'serve_wsgi' in ' '.join(p.cmdline) and p.parent.pid==1]"
Lovely.
So our Apache configuration is torturous and hard to understand. It needs a database hit for every request. It leaves lots of zombies lying around. And it's preventing us from doing some things we want to do. So, what is the new shiny that will solve ALL OUR PROBLEMS?
Remember the Apache config file from earlier? Here's the equivalent for nginx+uwsgi:
Nginx config:
server {
    listen 80;
    server_name ~(?<domain>.+)$;
    location / {
        root /var/log/nginx/user_logs;
        access_log /var/log/nginx/user_logs/$domain/$domain.access.log;
        include uwsgi_params;
        uwsgi_pass unix:/var/www/socket/$domain/socket;
    }
}
uWSGI config:
[uwsgi]
uid = {{ uid }}
gid = {{ gid }}
chroot = /mnt/chroots/{{ user.username }}
touch-reload = /var/www/wsgi.py
socket = /var/www/socket/{{ host }}/socket
chdir = /var/www
file = /bin/user_wsgi_wrapper.py
Just a tad simpler, I'm sure you'll agree?
uWSGI's emperor+vassals setup means that any config files in a specified directory are dynamically loaded with no need for a restart. Adding a new site is as simple as dropping in a new file, and modifying an existing config takes effect as soon as the file is updated.
And, since it's no longer a single one-size-fits-all config file, each user / domain has their own config file so it's much easier to customise different settings for different users.
And all this with no database hit.
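To make "dropping in a new file" concrete, here's a sketch of how a provisioning script might render and write a vassal config. The template mirrors the uWSGI config above, but everything else is an assumption for illustration: the function names, the one-file-per-host ".ini" naming, and the write-then-rename step (which assumes the emperor only picks up files ending in .ini, so it never sees a half-written config).

```python
import os
import tempfile

# Same settings as the vassal config shown above, with placeholders.
VASSAL_TEMPLATE = """[uwsgi]
uid = {uid}
gid = {gid}
chroot = /mnt/chroots/{username}
touch-reload = /var/www/wsgi.py
socket = /var/www/socket/{host}/socket
chdir = /var/www
file = /bin/user_wsgi_wrapper.py
"""


def render_vassal_config(username, host, uid, gid):
    """Fill in the template for one user and one domain."""
    return VASSAL_TEMPLATE.format(username=username, host=host, uid=uid, gid=gid)


def write_vassal_config(vassals_dir, username, host, uid, gid):
    """Write the config for `host` into the vassals directory.

    The file is written under a temporary name in the same directory and
    then renamed, so the final path only ever holds a complete config.
    """
    config = render_vassal_config(username, host, uid, gid)
    final_path = os.path.join(vassals_dir, host + ".ini")  # hypothetical naming
    fd, tmp_path = tempfile.mkstemp(dir=vassals_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        tmp_file.write(config)
    os.replace(tmp_path, final_path)
    return final_path
```

Calling write_vassal_config again for the same host overwrites the file in one step, which is the "modifying an existing config" case.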
In Apache, if we'd wanted static files, we'd have to look at something like this:
RewriteCond /mnt/user_storage/homedirs/%{ENV:subdomain}/var/www/static%{REQUEST_URI} -f
RewriteRule ^/(.*) /mnt/user_storage/homedirs/%{ENV:subdomain}/var/www/static/$1 [L]
<DirectoryMatch /mnt/user_storage/homedirs/(.*)/var/www/static/>
    Order Allow,Deny
    Allow from All
    Options SymLinksIfOwnerMatch
</DirectoryMatch>
<Directory /mnt/user_storage/homedirs>
    Options SymLinksIfOwnerMatch
</Directory>
In our new setup, it's a one-line addition to the uwsgi vassal config:
check-static=/var/www/static
If that all seemed too good to be true, it's because it is. Remember what I said about Apache loading workers dynamically, and uwsgi wanting to load them all up-front? That was a big no-no for us, since each web server may have hundreds of users on it, and loading a worker or multiple workers for each one would have made the uwsgi spin-up time way too slow. So, here's the hack:
server {
    listen 80;
    server_name ~(?<domain>.+)$;
    location / {
        root /var/log/nginx/user_logs;
        access_log /var/log/nginx/user_logs/$domain/$domain.access.log;
        error_page 502 = @fallback;
        include uwsgi_params;
        uwsgi_pass unix:/var/www/socket/$domain/socket;
    }
    location @fallback {
        proxy_pass https://www.pythonanywhere.com/initialize_webapp/$scheme/$domain/$uri?$query_string;
    }
}
Essentially, if there's no uwsgi worker for a particular domain, nginx will consider that a "502 bad gateway" error, but our fallback tells it to go and call an initialize API on our web server.
The handler for the initialize_webapp call looks a bit like this (in pseudocode):
find user from domain
if domain not recognised: return 404
build chroot for user, including personal file storage
write user vassal file to /etc/uwsgi/vassals
wait for vassal worker to spin up
redirect back to user's webapp
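Here's the same flow as a framework-free Python sketch. It's not our real handler: the four callables (find_user, build_chroot, write_vassal, socket_path_for) are hypothetical stand-ins for the steps in the pseudocode, and the 503 response when the vassal doesn't come up in time is an assumption, since the pseudocode doesn't cover that case. It also assumes uri starts with a slash.

```python
import os
import time


def wait_for_socket(socket_path, timeout=10.0, poll_interval=0.1):
    """Poll until the vassal's socket file appears; return True if it did."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(socket_path):
            return True
        time.sleep(poll_interval)
    return os.path.exists(socket_path)


def initialize_webapp(scheme, domain, uri, query_string,
                      find_user, build_chroot, write_vassal, socket_path_for,
                      timeout=10.0):
    """Return (status_code, headers) for an initialize request."""
    # find user from domain; unknown domains get a 404
    user = find_user(domain)
    if user is None:
        return 404, {}

    # build chroot for user, then drop the vassal file for the emperor
    build_chroot(user)
    write_vassal(user, domain)

    # wait for the vassal worker to spin up (assumed: socket file appears)
    if not wait_for_socket(socket_path_for(domain), timeout=timeout):
        return 503, {"Retry-After": "5"}  # assumed behaviour on timeout

    # redirect back to the user's webapp at the URL originally requested
    location = "%s://%s%s" % (scheme, domain, uri)
    if query_string:
        location += "?" + query_string
    return 302, {"Location": location}
```

Passing the steps in as arguments keeps the sketch testable without a database, a chroot or a running emperor.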
Aside from that, logging is still a bit of a pain, but about half as much of a pain as it was under Apache. Nginx gives us access logs as a one-line config, which is definitely a win (we get rid of splitter.py), but error logs still need a custom WSGI wrapper - but that's no more complicated than the old serve_wsgi.py. For the curious, you can find it in your PythonAnywhere sandbox at /bin/user_wsgi_wrapper.py
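If you'd rather not go digging, the general idea of an error-logging WSGI wrapper can be sketched in a few lines. This is an illustration of the technique only, not the contents of user_wsgi_wrapper.py; the function name and the response text are made up.

```python
import sys
import traceback


def log_errors(app, error_log):
    """Wrap a WSGI app so uncaught exceptions are written to error_log.

    `error_log` is any file-like object opened for writing text.
    """
    def wrapped(environ, start_response):
        try:
            return app(environ, start_response)
        except Exception:
            # Record the full traceback where the user can find it...
            error_log.write(traceback.format_exc())
            error_log.flush()
            # ...and send the visitor a plain 500 rather than nothing.
            start_response(
                "500 Internal Server Error",
                [("Content-Type", "text/plain")],
                sys.exc_info(),
            )
            return [b"Something went wrong. Check your error log.\n"]
    return wrapped
```

Note that this only catches exceptions raised while the app is being called; errors raised later, while the server iterates over the response body, would need the returned iterable wrapping too.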
Preliminary benchmarking results
On Apache, we were looking at median times (in milliseconds) of:
pythonanywhere.com Connect: 391 Processing: 139 Waiting: 136 Total: 542
user webapp: Connect: 365 Processing: 532 Waiting: 427 Total: 921
For a standard static site hosted in the same Amazon zone, you'd expect about 100/100/100 for a total time of 200.
With nginx+uwsgi, we got
pythonanywhere.com Connect: 346 Processing: 227 Waiting: 226 Total: 580
user webapp: Connect: 366 Processing: 809 Waiting: 684 Total: 1175
So no wonderful news there - in fact, going by the totals, it's roughly a 7% deterioration on the main site, and roughly a 28% deterioration for users' web apps. BUT - that's for the "median" request. What we did find was that it really has improved things for the slower requests - in other words, it's really improved the consistency of performance:
Apache - user webapp tests:
Time taken for tests: 191.538 seconds
Percentage of the requests served within a certain time (ms)
50% 921
66% 1799
75% 1918
80% 2746
90% 3853
95% 5950
98% 9989
99% 11979
100% 17177 (longest request)
Nginx/uWSGI -
Time taken for tests: 61.882 seconds
Percentage of the requests served within a certain time (ms)
50% 580
66% 602
75% 618
80% 630
90% 712
95% 908
98% 1067
99% 1151
100% 2473 (longest request)
So, under Apache, the slowest 20% of requests were taking between 3 and 10 times as long as the median request. Under nginx, the slowest 20% of requests are not that much slower than the median. So that's definitely a win - the site feels more reliable and consistent to our users.
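If you want to check the arithmetic, here's a small sketch that works it out from the numbers quoted above, taking the two percentile tables at face value. The helper names are ours for this example; the figures are copied straight from the benchmark output.

```python
# Median total times (ms) from the summary lines above.
MEDIAN_TOTALS_MS = {
    "pythonanywhere.com": {"apache": 542, "nginx": 580},
    "user webapp": {"apache": 921, "nginx": 1175},
}

# Selected rows from the two percentile tables above (ms).
PERCENTILES_MS = {
    "apache": {50: 921, 80: 2746, 90: 3853, 95: 5950, 98: 9989, 99: 11979},
    "nginx": {50: 580, 80: 630, 90: 712, 95: 908, 98: 1067, 99: 1151},
}


def percent_change(before, after):
    """Change from `before` to `after` as a percentage of `before`."""
    return 100.0 * (after - before) / before


def tail_ratios(percentiles):
    """How many times slower than the median each higher percentile is."""
    median = percentiles[50]
    return {p: round(ms / median, 2) for p, ms in percentiles.items() if p > 50}


for site, totals in MEDIAN_TOTALS_MS.items():
    change = percent_change(totals["apache"], totals["nginx"])
    print("%s: median total changed by %+.1f%%" % (site, change))

for server, percentiles in PERCENTILES_MS.items():
    print(server, tail_ratios(percentiles))
```

On these figures, Apache's 80th percentile is about 3 times its median and its 99th about 13 times, while under nginx+uwsgi even the 99th percentile stays at about twice the median.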