Friday, April 11

DNS

Lynn Root
roguelynn.com
roguelynn-spy.herokuapp.com

Use the scapy Python library for sniffing network traffic. Chrome does one DNS request for each autocomplete guess. Interesting.

DNS names end with a dot?

example.com vs. example.com.

Relative vs. absolute (FQDN) if your local resolution has something funky going on.

dig is your friend.

dig +trace python.org

"." is the root of the DNS hierarchy. Queries resolve down the hierarchy: . -> org -> python -> www.
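
A rough sketch of that walk, to make the order concrete. The helper name zone_chain is made up for illustration, and it does no network lookups; it only lists the zones a resolver would pass through for an absolute name.

```python
def zone_chain(name):
    """Return the zones from the root down to the full name."""
    # Treat the name as absolute: drop the trailing dot before splitting.
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]  # the root zone
    # Build up from the rightmost label (the TLD) to the leftmost.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(" -> ".join(zone_chain("www.python.org")))
```

Each entry ends with a dot, which is the absolute (FQDN) form of the name.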

Show all records for the name:

$ dig +nocmd +noqr +nostats pyladies.com -t ANY

DNS relies on caching so that the root servers aren't hammered. So we start at our local DNS and go out from there until we get a result, which is then cached for a TTL.

Query -> Local Cache -> "closer" name server -> authoritative name server.

TTL is a balancing act. Too long, and caching takes forever to propagate. Too short, and the authoritative server gets hammered.
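
A minimal sketch of a TTL cache, assuming a simple in-process dict rather than a real resolver. The class name TTLCache and the injectable clock are illustration-only choices; the clock parameter just makes expiry easy to test.

```python
import time


class TTLCache:
    """Toy cache where every entry expires ttl seconds after it is set."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def set(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None  # never cached: the caller has to ask upstream
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # stale: forget it and ask upstream
            return None
        return value
```

A long TTL keeps answers in this cache longer (fewer upstream queries, slower propagation of changes); a short TTL does the opposite.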

You can't get the entire zone file usually. dnsmap brute-forces subdomain lookup to retrieve extant subdomains.

You can run a DNS server from Twisted. Cool.

Unicast, multicast, etc. Anycast: one-to-nearest association. Google uses this. Someone in Australia querying 8.8.8.8 gets the same response, but from a nearer server.

DANE (DNS Authentication of Named Entities) uses DNSSEC. Apparently, firewalls can intercept HTTPS traffic and fake your secure connection.

Service Discovery. SRV records. Spotify clients do SRV lookups to get a service access point to the web API.

DHT: distributed hash table. Key-value store in DNS, distributed through the network. Spotify does this using TXT records.

Cache me if you can

Guillaume Ardaud
@gardaud
gardaud.fr

Memcached:

Key/object store.
O(1) everything.
Primary data in relational database
data you can lose, regenerates slowly: persistent storage (mongo, redis)
data you can lose, regenerates quickly: RAM store (memcached)

Expiries given in seconds. USE CONSTANTS.

Use set_many/get_many/delete_many. Slowest part of operation is network latency.

incr/decr good for counters.

add good for not clobbering existing keys.

Facebook has a fork that lets you dump memcached to the disk. Huh.

Key Naming

ASCII based
Not crazy long (they have to be hashed--few dozens of characters)
Explicit! e.g. json.users.<user_id>

Bad naming: md5(sql_query) Don't use user input for cache names!

Memcached is not queryable. There is a debug interface, but don't use it!

Memcached cluster

With multiple nodes, the client decides which node to go to using hashing.
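
A sketch of client-side node selection, assuming the simplest possible scheme (hash the key, take it modulo the node count). The function name and node addresses are hypothetical. Real clients often use consistent hashing instead, so that adding or removing a node doesn't remap most keys.

```python
import hashlib


def pick_node(key, nodes):
    """Map a key to one node, the same way on every client."""
    # Use a stable hash (not Python's hash(), which varies per process).
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(nodes)
    return nodes[index]


nodes = ["cache1:11211", "cache2:11211", "cache3:11211"]
print(pick_node("json.users.42", nodes))
```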

Memcached subtleties

Key stored at 8am w/ 2 hour expiration. What happens at 10am? Nothing. Expiry is lazy: the key only gets removed when a client tries to fetch it after 10am.

Memcached has a fixed amount of memory using an LRU cache. Objects can be evicted before expiration if memory fills up. Objects w/ oldest timestamp get evicted.

Sometimes memcached returns None even though the expiration hasn't been reached AND memory isn't full. Memcached divides its memory into pages, and splits each page into fixed-size chunks determined by the page's slab class. If all pages of the needed class are full, it takes a free page and assigns it the needed class. If there are no more free pages, the LRU kicks in and evicts data. Each slab class has its own LRU.

memcached -v  # verbose output
memcached -M  # doesn't evict when out of memory, but errors
memcached -I1k  # change slab page size
memcached -f1.5  # change growth factor
man memcached  # is your friend

Common practices

Add a cache_name property to Django models. Use model versioning to invalidate cache names automatically.

get_many returns a dict, which may not have all the keys you requested. You'll have to fill those in yourself.
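
A sketch of filling in the missing keys. FakeCache is a dict-backed stand-in written for this example, not a real client; only the get_many/set_many names come from the notes, and the rebuild callback and timeout default are assumptions.

```python
class FakeCache:
    """Dict-backed stand-in for a cache client (illustration only)."""

    def __init__(self):
        self._data = {}

    def get_many(self, keys):
        # Like a real cache: silently omit keys that are not present.
        return {k: self._data[k] for k in keys if k in self._data}

    def set_many(self, mapping, timeout=None):
        self._data.update(mapping)


def fetch_all(cache, keys, rebuild, timeout=300):
    """Return a value for every key, rebuilding the ones the cache lost."""
    found = cache.get_many(keys)
    missing = [k for k in keys if k not in found]
    if missing:
        rebuilt = {k: rebuild(k) for k in missing}
        cache.set_many(rebuilt, timeout)  # one round trip for all misses
        found.update(rebuilt)
    return found
```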

Common problems

Thundering Herd problem: on a cache miss, if it's expensive to rebuild the object, a flurry of simultaneous requests will bury the application server in simultaneous builds. Solve the thundering herd with a lock object.
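
A sketch of the lock idea, assuming the lock is itself a cache key taken with a memcached-style add() (which only succeeds if the key is absent). LockingCache is a thread-safe dict stand-in for illustration, and get_or_build, the "lock." key prefix, and the timing parameters are all made up. A real deployment would also give the lock key an expiry so a crashed builder can't hold it forever.

```python
import threading
import time


class LockingCache:
    """Dict-backed stand-in with memcached-like get/set/add/delete."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def set(self, key, value):
        with self._mutex:
            self._data[key] = value

    def add(self, key, value):
        with self._mutex:
            if key in self._data:
                return False  # someone else already holds it
            self._data[key] = value
            return True

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


def get_or_build(cache, key, build, poll=0.01, max_wait=5.0):
    """Return the cached value, letting only one caller rebuild on a miss."""
    lock_key = "lock." + key
    deadline = time.monotonic() + max_wait
    while True:
        value = cache.get(key)
        if value is not None:
            return value
        if cache.add(lock_key, 1):
            try:
                # Re-check: another builder may have finished just before us.
                value = cache.get(key)
                if value is None:
                    value = build()
                    cache.set(key, value)
                return value
            finally:
                cache.delete(lock_key)
        if time.monotonic() >= deadline:
            return build()  # waited too long: build without caching
        time.sleep(poll)  # someone else is building; wait and retry
```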

Caching large values: say a large number of objects. Instead of caching them as one object, do a 2-phase fetch. Store the list of IDs, then store each object w/ set_many.

Paginated cache:

Break big objects into smaller slices.
Store each slice as a separate object
Store the list of slices.

If some of the chunks get evicted, well, there you are.
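
A sketch of the paginated layout, using a plain dict as a stand-in for the cache. The key format ("<key>.page.<n>" plus "<key>.pages") and both function names are assumptions. Here a missing slice is treated as a miss for the whole object.

```python
def store_paginated(cache, key, items, page_size):
    """Store items as fixed-size slices plus an index of the slice keys."""
    page_keys = []
    for number, start in enumerate(range(0, len(items), page_size)):
        page_key = "%s.page.%d" % (key, number)
        cache[page_key] = items[start:start + page_size]
        page_keys.append(page_key)
    cache[key + ".pages"] = page_keys


def load_paginated(cache, key):
    """Reassemble the items, or return None if any part is gone."""
    page_keys = cache.get(key + ".pages")
    if page_keys is None:
        return None
    items = []
    for page_key in page_keys:
        page = cache.get(page_key)
        if page is None:
            return None  # a slice was evicted: rebuild everything
        items.extend(page)
    return items
```

With a real client the per-slice writes and reads would go through set_many/get_many to save round trips.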

Bootstrapping: Porting to Python 3

Tres Seaver
@tresseaver

Objectives

Straddling Python 2/3 in a single codebase
Choosing target Python versions
Porting as an iterative process
- Ordering components by dependencies
Adding test coverage to reduce risk (if you don't have good tests, you will lose)
Covering C extensions

Background

Ported Zope3, ZODB, WebOb, Pyramid, other dependencies to Python3. 180kLOC Python, ~25 kLOC C.

Porting strategies

-Port once, abandon Python2-

Not the subject here
Customers / users still need Python 2
More feasible for applications than libraries
2to3 may be useful starting point.

-"Fix up" at installation using 2to3-

Python2 users unaffected
Python3 source "drifts" from canonical version. Bug reports don't match.
2to3 painfully slow on large codebases.

"Straddling" in a single codebase. (Thought to be impossible, initially.)

Use compatible subset of Python syntax
Conditional imports mask stdlib changes
six module can help (but you might not need it)
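
A small sketch of what a hand-rolled compatibility shim might look like without six. The names PY2, text_type, binary_type, and to_text are illustrative choices, not something from the talk.

```python
import sys

PY2 = sys.version_info[0] == 2

# Conditional import masks a stdlib module that moved between versions.
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse  # Python 2 location

if PY2:
    text_type = unicode  # noqa: F821 (name only exists on Python 2)
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def to_text(value, encoding="utf-8"):
    """Decode bytes to text; pass text through unchanged."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


print(urlparse("http://example.com/path").netloc)
```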

Targeting Python Versions

Syntax changes make Python2 < 2.6 hard

No b'' literals
No except Exception as e:
Much more cruft/pain

Python 2.6 is the bare reasonable minimum. 2.4/2.5 are long past EOL. But some folks need system Python in "enterprisey" systems.

Incompatibilities make Python 3 < 3.2 hard.

PEP 3333 fixes WSGI in Py3k
callable() restored in 3.2
3.3 restores u'' literals.
3.2 is "system Python3" on some LTS systems.

Summary: support 2.6, 2.7, 3.2+

Managing Risks

Ports are great opportunities for bug injection!
Fear of breaking working software is the barrier, even more than the effort required.
Some mitigations also improve your software.
- Improved testing
- Modernized idioms in Python2
- Clarity in text vs. bytes.

Bottom-up Porting

Port packages with no dependencies first. Then port packages with already-ported dependencies. Note the Python versions supported by dependencies. Lather, rinse, repeat. Finally, port the application.

Common subset idioms

Read Lennart's book!
"python2.7 -3" can point out problem areas.
Modernize idioms in Python2 code. e.g. exception syntax, with open() as fi.
Distinguish bytes vs. text. Use b''/u'' for all literals. Quit letting Python promote things to unicode for you.
Adopt new syntax.
- E.g. except ... as ...
- print()
Use new stdlib facilities, e.g. io.BytesIO vs. StringIO.StringIO

Testing Avoids Hair-Loss

Untested code is where the bugs go to hide.
100% coverage is ideal before porting.
- Unit testing preferable for libraries
- Functional testing best for applications
- Subtle bugs hide in the libraries
Measure test coverage: pypi/coverage
Work to improve assertions as well as coverage
Assert contracts, not implementation details
- Don't assert against exception types/formats, things that change between versions!
- If at all feasible, convert doctests to Sphinx examples. Sphinx can run examples to make sure they don't break.
Automate running tests
tox helps ensure that tests pass under all supported versions. (Also test pypy!)
Don't run coverage on all your tests. Coverage is really slow. Just use it on one separate tox target on one version of python.

Considerations for C extensions

Testing C is harder!
http://python3porting.com/cextensions.html (Lennart's book)
Maintain a Python reference implementation
- Easier to test
- Supports PyPy
Design for same API as C
100% coverage for Python
Ensure C version passes same tests.

Hygiene

Signal supported versions using Trove classifiers in PyPI
Consider bumping the major version. Allow users to stick with "safe" versions as you iterate.
Apply continuous integration
- Travis CI
- Jenkins
- Shining Panda for Windows

Resources

python3porting.com
testrun.org/tox/latest
pypi.python.org/pypi/six
python.org/wiki has common idioms for Python 2/3

Enough Machine Learning to Make Hacker News Readable Again

Ned Jackson Lovely
@nedjl
slides at http://www.njl.us

An achievable goal: a personalized filter for Hacker News.

Machine learning is just applying statistics to big piles of data, using it to understand the data better or make predictions.

Get data
Engineer the data
Train and tune models (SCIENCE!)
Apply model to new data

Use scikit-learn. The documentation is fantastic. The hard part is installing SciPy.

The terminology is daunting. When you don't understand the math, go "blah blah blah" and keep on reading.

Supervised learning is when you have input data and output data. Unsupervised learning is about understanding your data; visualization, grouping, etc.

We'll focus on supervised learning.

Good books:

NLP with Python
Building machine learning systems w/ Python
Learning scikit-learn with Python
Programming Collective Intelligence

Parallel arrays: (x, y): x is article, y is category.

Set aside a validation set: labeled data your model hasn't seen during training. Take 25% of your data. Use it at the end to validate your learning.

Hyper-parameters are magical for tuning. GridSearch is a great tool.

You'll see a lot of these functions:

transform()
fit()
predict() # SCIENCE!

transform(X[, y])

fit(X, y): X, what it gets; y, what the result should be.

predict(X): predict based on the fit() training.

Get the Data: the Hard Part

requests & lxml

Classifying Dreck and Non-Dreck: he wrote a web app to classify 5,000 articles. 20% were Non-Dreck.

Data: Title, URL, Submitter, Content of Link, Rank, Votes, Comments, Time of Day, Dreck or Not.

Turning that messy data to normalized Numpy arrays: "Time flies like an arrow, fruit flies like bananas"

Bag of Words : count occurrences: "flies": 2, "arrow": 1, "bitcoin": 0
n-grams: time flies, flies like, like an, like bananas
Normalization: stemming
Stop words: cut out the useless words (articles, etc.)
TF-IDF: Term-Frequency, Inverse Document Frequency. (e.g. an article about bitcoin has more refs to bitcoin than an article not about it)
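
The first two items can be sketched in plain Python on the example sentence. The function names and the tiny tokenizer are made up for illustration; in practice scikit-learn's vectorizers do this work (plus stop words and TF-IDF weighting).

```python
from collections import Counter

sentence = "Time flies like an arrow, fruit flies like bananas"


def tokenize(text):
    """Lowercase, split on whitespace, strip simple punctuation."""
    return [word.strip(",.").lower() for word in text.split()]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in vocabulary}


def bigrams(text):
    """Return adjacent word pairs (2-grams) in order."""
    tokens = tokenize(text)
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(bag_of_words(sentence, ["flies", "arrow", "bitcoin"]))
print(bigrams(sentence))
```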

Engineering Features

Pull out the relevant text (readability package)
Roll your own features (e.g. bump up long-form content)
Combine features (pipeline w/ TF-IDF with long-form feature)
Hostname pipeline: extract hostnames into numpy array. Pickle your built classifier (save the data to recreate it) then use the classifier to predict.

What's possible?

Use unsupervised learning.
Predict numerical scores.
Watch an RSS feed.
Auto-submit it!

How to get started w/ Machine Learning

Melanie Warrick nyghtowl.io @nyghtowl

Hackbright Academy
Zipfian Academy

Covering:

Machine Learning Overview
AI, data science, big data relationships
Example code, linear regression
Algorithms & tools
Skills and resources

"Computers... ability to learn without... explicit programming." Arthur Samuel (1959)

Build a model that finds patterns and/or predicts results
Apply algorithms
Pick best result for pattern match or prediction

Ex: spam detection, weather prediction

What is a model?

Linear regression (line fitting)

y = mx + b

Find the best-fit m and b, then use the line to predict/pattern match (e.g. plotting high school GPA vs. university GPA).
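
A sketch of what "best fit" means here, assuming ordinary least squares (the usual choice for line fitting). The function name fit_line and the sample numbers are made up; scikit-learn's LinearRegression would do the same job on real data.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the squared error of y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)
```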

Handwritten address recognition
Search engines (Google, Bing)
Twitter and Facebook friend recommendations, Netflix
Fraud detection
Weather prediction
Face detection
AI, helping machines make better decisions. Intelligence exhibited by machines or software
Data Science, helping people make better decisions. Get knowledge from data & create products
Big Data challenges both AI and Data Science. Data volumes beyond ability of common tech to capture and curate. (2 GB = 20 yards of books, 50 PB = entire written works of humankind)

Project flow

Define goal and metrics
Gather and clean data
Explore and analyze
Identify algorithm or method (ML)
Build model (ML)
Evaluate results (ML)
Iterate
Create data product, visualization.
Make decisions.

Ex: Linear Regression

Using pandas for data frames and scikit-learn.

Predict brain weight from head size. Head size is x, brain weight is y.

Cross-validation: hold out a certain percentage of the training data for testing and evaluating the model.

Metrics for evaluating a model: R-squared, where 1 is a perfect prediction.
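
For reference, a sketch of the standard R-squared definition (1 minus residual sum of squares over total sum of squares). The function name is illustrative; scikit-learn provides its own scoring for this.

```python
def r_squared(actual, predicted):
    """1.0 is a perfect fit; 0.0 is no better than predicting the mean."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot
```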

Use matplotlib and seaborn for visualization (seaborn for prettification). Visualization helps you understand how a model is working.

Machine Learning Algorithms

Unsupervised, continuous: clustering and dimensionality reduction (SVD, PCA, K-means)
Unsupervised, categorical: association analysis (Apriori, FP-Growth), Hidden Markov model
Supervised, continuous: regression (linear, polynomial), decision trees, random forests.
Supervised, categorical: Classification (KNN, trees, logistic regression, Naive-Bayes, SVM)

Machine Learning Key Tools

Test Model: Scikit Learn, Matplotlib
Explore Data: Pandas, StatsModels, Matplotlib, numPy, Unix
Build Model: Scikit, NumPy, Pandas, scipy
Visualize: D3, matplotlib, vincent, vega, ggplot

Machine Learning Skills to Build

Algorithms
Statistics (probability, inferential, descriptive)
Linear Algebra (vectors & matrices)
Data analysis (intuition)
SQL, Python, R, Java, Scala (programming)
Databases & APIs (get data)

Machine Learning Resources

Andrew Ng's Machine Learning on Coursera
Khan Academy (linear algebra, stats)
"Think Stats" - Allen Downey
Zipfian's practical intro to data science
Metacademy
Open Source data science masters
StackOverflow, Data Tau, Kaggle
Mentors!

Getting Started w/ Salt

Peter Baumgartner, Founder of Lincoln Loop
lincolnloop.com
@ipmb

SaltStack is: configuration management. Version control your servers, self-documenting, repeatable, reusable.

SaltStack is: remote execution. Deploy your code, run one-off scripts, critical package updates, system monitoring.

Why SaltStack?

Familiar tools: Python/YAML/Jinja2.
Community: Great documentation, insanely responsive (IRC, GitHub), backed by for-profit org.

Why Not SaltStack?

Young project
Moves fast
Not SSH (new SSH support is "alpha")

Learning Salt

Vocabulary lesson

Chef: knife, recipe, cookbook
Puppet: terminus, metaparameters
Ansible: playbook, inventory
Master: server that manages the whole stack
Minion: a server controlled by the master
State: a declarative rep. of system state
Grain: static information about a minion (RAM, CPUs, OS, etc.)
Pillar: variables for one or more minions (ports, file paths, config parameters)
Top file: matches states or pillars to minions
Highstate: all the state data for a minion

Getting started

Binaries for most distros
Pip install (bleeding edge)
http://bootstrap.saltstack.org (probably what you want)

Master server: apt-get install salt-master ... or run masterless

Minion: apt-get install salt-minion; echo "10.10.1.1 salt" >> /etc/hosts

Accept the minion key on the master.

Advanced topics

Salt-cloud
Custom modules
Scheduler
Renderers
Returners (return to email, sentry, syslog)
Reactor

Tips & tricks:

In minion conf, output_mode: mixed
Jinja2 is powerful. Don't go nuts.
Update often, and review the change log.
Test before you deploy. Make friends w/ Vagrant or Docker.

Castle Anthrax: Dungeon Generation Techniques

James King
@agentultra

Designing procedurally generated content for games.

How do we represent tiles?

List of lists?
Single list? (Row Major Order) offset = (row * num_cols) + column
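
A quick sketch of the single-list layout. The function names and the 3x4 grid are illustrative; the offset formula is the one above, and divmod inverts it.

```python
def offset(row, column, num_cols):
    """Index into the flat list for a (row, column) tile."""
    return row * num_cols + column


def position(index, num_cols):
    """Inverse of offset: flat index back to (row, column)."""
    return divmod(index, num_cols)


num_rows, num_cols = 3, 4
tiles = ["."] * (num_rows * num_cols)  # one flat list, row after row
tiles[offset(1, 2, num_cols)] = "#"    # wall at row 1, column 2

for row in range(num_rows):
    print("".join(tiles[offset(row, 0, num_cols):offset(row + 1, 0, num_cols)]))
```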

Mazes

Depth-first search (backtrack): long, twisty corridors
Prim's algorithm (A*-style search): short, blocky corridors

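A sketch of the depth-first (recursive backtracker) approach, written with an explicit stack. The function name, the grid-of-cells representation, and returning the set of opened walls are all choices made for this example, not necessarily how the talk's code worked.

```python
import random


def carve_maze(width, height, seed=None):
    """Return the set of passages, each a frozenset of two adjacent cells."""
    rng = random.Random(seed)
    visited = {(0, 0)}
    passages = set()
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        neighbors = [
            (nx, ny)
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in visited
        ]
        if not neighbors:
            stack.pop()  # dead end: backtrack to the previous cell
            continue
        nxt = rng.choice(neighbors)
        passages.add(frozenset(((x, y), nxt)))  # knock down the wall
        visited.add(nxt)
        stack.append(nxt)  # keep going deeper from the new cell
    return passages
```

Because the walk always pushes forward until it is stuck, corridors tend to be long and winding. Every cell ends up connected by exactly one path.
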
Placing rooms: binary tree

Space-partitioning algorithm

Start with an open grid. Split it. Recurse on the sub-parts. Stop at minimum size or maximum depth. From the lowest, widest portion of the tree, ascend and connect nodes.
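
A sketch of the splitting half of this, assuming rectangles as (x, y, width, height) tuples and a rule of splitting the longer side. The function name and parameters are made up, and the final "ascend and connect" step is left out.

```python
import random


def partition(x, y, width, height, min_size, depth, rng):
    """Recursively split a rectangle; return the leaf rectangles."""
    can_split_w = width >= 2 * min_size
    can_split_h = height >= 2 * min_size
    if depth == 0 or not (can_split_w or can_split_h):
        return [(x, y, width, height)]  # stop: max depth or too small
    # Prefer cutting across the longer side when both cuts are possible.
    split_width = can_split_w and (not can_split_h or width >= height)
    if split_width:
        cut = rng.randint(min_size, width - min_size)
        return (partition(x, y, cut, height, min_size, depth - 1, rng)
                + partition(x + cut, y, width - cut, height, min_size, depth - 1, rng))
    cut = rng.randint(min_size, height - min_size)
    return (partition(x, y, width, cut, min_size, depth - 1, rng)
            + partition(x, y + cut, width, height - cut, min_size, depth - 1, rng))


leaves = partition(0, 0, 40, 30, min_size=5, depth=4, rng=random.Random(7))
print(len(leaves), "rooms-to-be")
```

Each leaf is a region where a room can be placed; siblings in the tree are natural candidates for corridors.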

Techniques for placing things or generating terrain:

Poisson Disks: distributing equidistant points across a space e.g. item placement
Cellular Automata: e.g. caves!
Perlin noise (simplex): e.g. pits!

Using constraint solvers

Take a bunch of variables with continuous or discrete ranges (finite). Define a constraint on a few variables, e.g. 8 queens problem.

Ex: I've got all these rooms, here's the exit. I don't care where the boss is, but he must be at least one room away from the exit. These enemies need a place. And there should be a health potion near the beginning.

Optimize!

Represent sets as bitmasks
Undo-stack
Use as few variables as possible, and stay in the discrete/finite domain as much as possible.
Rogue Basin
How to build a constraint Propagator in a Weekend
goo.gl/sdrbkJ
Horton goo.gl/xpLTFB
PyGame, PyAngband

Lightning Talks

Structlog

Certificate-based SSH

Provides controlled, audited access to servers. Not a key-based solution!

Launch instances w/ cert authority
Users that need access request a cert
Security officer uses the ssh-ca tool
ssh-ca generates audit trail in S3: who, when, why, how long
Certificates include restrictions on use (time-based!)
OpenSSH logs the key id (email address)

Instead of host certificates, sign the host key.

github.com/cloudtools/ssh-ca
CERTIFICATES section of man ssh-keygen(1)

DIY Stuffed animals

github.com/caretdashcaret/Patternfy
Make magazine vol 38

Saturday, April 12

Lightning Talks

Erik Rose: pip install peep https://pypi.python.org/pypi/peep/1.1

Docker.io

Amjith Ramanujam

"Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere*."

(* Anywhere meaning any reasonably modern Linux machine.)

What?

Like Chroot, BSD Jails, etc.
Uses Linux Containers
AUFS: Union Filesystem
Git-like versioning
REST API
Don't need a full Guest OS like in virtualization
Multiple containers share the same underlying libraries read-only. Any changes trigger copy-on-write.

Why?

Lightweight
Isolated instances
Faster than VMs (often under a second startup time)

Setup

docs.docker.io
OSX: boot2docker (minimal linux VM) + docker client

Terminology

Image: Read-only snapshot
Container: an instance of the image
Registry: PyPI for docker images
Repository: Projects in the Registry

Automation

Dockerfile: a series of commands

Network

Django port forwarding: docker run -d -p host:container django-docker

Misc

Volumes: mount folders host/container: docker run -v host_path:container_path django
Links: service discovery through env vars: docker run --link mysql:db --name webapp django
REST API
- docker daemon is also a server

Solution

Postgres in a container
Django/nginx in a container
celery in a container

Makes testing very easy: Jenkins can run things in parallel that once had to be separated. Containers can be run on local machines.

Apparently, Docker.io recommends against running in production until v1.0.

Developing Django's Migrations

Andrew Godwin
Author of South, Django's new migrations
@andrewgodwin
http://aeracode.org

South was good for its time, but it had some bad initial design decisions and core underlying problems (see south.hacks).

The Initial Plan

Django: schema backend, ORM hooks
South 2: Migration handling, user interface

Revised plan

Django: Schema backend, ORM Hooks, Migration handling, User interface
South 2: Backport for 1.4-1.6

Logical separation:

SchemaEditor: schema backend, ORM hooks
Migrations: migration handling, UI

Not moving south into Django, instead, adding migrations to Django. Complete rewrite of South. New file format, many other things.

SchemaEditor

Abstracts schema operations across DBs.
Works in terms of Django fields/models.
Contains per-database workarounds.

django.db.migrations:

Migration file reader/writer
Dependency resolver
Autodetector
Applied/unapplied tracking

A new format

More concise
Declarative
Introspectable

In-memory running

Creates models from migration sets
Autodetector diffs the models created from migrations against the on-disk models
Used to feed SchemaEditor / ORM

DB peculiarities

Postgres: it's great

MySQL:

No transactional DDL
No CHECK constraints
Conflates UNIQUE and INDEX

Oracle:

Different SQL syntax
Picky about names
Can't convert to/from TextField (LOB)

SQLite:

AAAAAAAAAAAHHHHHHHHH
Altering tables? Schema introspection? What?

Backwards Compatibility

Django generally very good at this
Auto-applies first migration if tables exist
Ignores South-style migrations

Lessons Learnt

Explicit is better than implicit.
Abstracting DBs is hard. Wouldn't do it from scratch.
Composability rocks. It's simplified the code so much.
Feedback is vital. I'm just not mad enough to do nasty things to my code. Users always find your edge cases.

Designing Poetic APIs

Auden: "A -poet- programmer is, before anything else, a person who is passionately in love with language."

Programming is inventing new language. "Go learn Lisp or Haskell, it will change how you think about programming."

Sapir-Whorf hypothesis. Yes, it's fallen out of favor, esp. the strong form. But language still has flavor. Language influences how we think.

Wittgenstein: "The limits of my language are the limits of my world."

Having a symbol for something makes it mentally lighter-weight. Mental abstractions. Extracting symbols is the root of all human language. And software engineering!

Intellectual Intelligibility

Fowler: "Any fool can write code that a computer can understand. Good programmers write code that humans can understand."

Capture existing symbols, and use them in your API design. Take requests vs. urllib2 as a good example.

Principle 1: Don't be an Architecture Astronaut

Robert Storm Petersen: "It's hard to make predictions, especially about the future."

The first step of designing a new library is: don't design a new library. The best libraries are extracted, not invented.

Example: blessings extracted from nose plugin.

Identify the tasks
Identify language constructs
Identify patterns, protocols, interfaces, conventions

Principle 2: Consistency

Yeats: "Think like a wise man, but communicate in the language of the people."

The culture you are in has spent a lot of time building up conventions. Use them. Don't be weird or clever. This shows respect to your users.

Ex: Macintosh Human Interface Guidelines. When you've learned one program, you've learned them all.

Principle of Least Astonishment: Try not to surprise the user.

get(key, default) vs. fetch(default, key)

Warning Signs

Frequent references to your own docs or source
Feeling syntactically clever (novel syntax)

Brevity

George Eliot: "The finest language..."

Warning Signs

Copying and pasting when writing against your own API
Typing something irrelevant while grumbling "Why can't it just assume the obvious thing?"
Long arg list, suggesting a lack of sane defaults

Composability

Saint-Exupéry: "Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away."

Two ways to go about this, one of them wrong.

print_formatted(...)
print_formatted(..., out=some_file)  # WRONG!!!

print formatted(...)  # CORRECT!

Warning Signs

Classes w/ lots of state. Lots of little classes struggling to get out. Ex. ElasticSearch PenaltyBox. Didn't add to Connection, new class.
Deep inheritance hierarchies. Inheritance inherits invariant baggage from above, and must tiptoe around them.
Violations of the Law of Demeter. "One dot rule." A.b is OK. A.b.c is not. A.b.c.d is right out.
Mocking in tests. Your code may have too many dependencies! Testable code is decoupled code. Some mocking may be necessary, if your framework requires it. Mocking not intrinsically evil, but a code smell.
Bolt-on options.

Plain Data

Churchill: "All the great things are simple, and many can be expressed in a single word..."

Reduce barriers to re-use. Ex: ConfigParser. Not idiomatic python. Dictionaries would be the expected result, but it forces you to use its own API for anything. Can't substitute anything else.

MyClass.read(filename)  # NO!
MyClass.parse(string)   # YES!

Warning Signs

Users immediately transform output to another format
Instantiating one object just to pass it to another
Rewriting language-provided things

Grooviness

Talmud, Ta'anith 7b: "The bad teacher's words fall on his pupils like harsh rain; the good teacher's, as gently as dew."

Sloping sides that nudge you to the center. Cut grooves in your APIs.

Avoid nonsense representations, e.g. optional kwargs that are actually required, one or the other.

Fail shallowly!

Resource acquisition is initialization

Don't have invariants that aren't invariant. Ex: Designing a PoppableBalloon class. Require filling in initialization.
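
A guess at what the balloon example looks like in code. Only the class name comes from the notes; the gas argument and the exception choices are assumptions. The point is that there is no way to construct an unfilled balloon, so pop() never has to ask "was this ever filled?".

```python
class PoppableBalloon:
    """A balloon that is always filled from the moment it exists."""

    def __init__(self, gas):
        # Fail shallowly: reject nonsense here, not deep inside pop().
        if not gas:
            raise ValueError("a balloon must be filled with something")
        self.gas = gas
        self.popped = False

    def pop(self):
        if self.popped:
            raise RuntimeError("balloon is already popped")
        self.popped = True


balloon = PoppableBalloon("helium")
balloon.pop()
```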

Compelling examples: MacPaint. Nintendo platformers. Set a good example, and people will follow it forever. Users are docile: they will do what you tell them to do.

Warning Signs

Representable nonsense. You shouldn't even be able to say nonsense.
Invariants that aren't.
Lack of a clear starting point.
Long, complicated documentation.

Safety

More safety than grooves. More danger, higher walls, mean guard dogs in front.

rm *.pyc
rm *
rm -f *

How to report errors: Exceptions > Return values

Warning Signs

Docs that say "remember to..." or "make sure you...". Docs that say "before" or "after", add a context manager.
Surprisingly few will report safety errors. People will blame themselves. Don't electrify the door knob.
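
A sketch of turning "remember to close it afterwards" into a context manager. Connection and connected() are hypothetical names invented for this example.

```python
from contextlib import contextmanager


class Connection:
    """Hypothetical resource that must be opened before and closed after use."""

    def __init__(self):
        self.is_open = False
        self.sent = []

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def send(self, message):
        if not self.is_open:
            raise RuntimeError("connection is closed")
        self.sent.append(message)


@contextmanager
def connected():
    """Do the 'before' and 'after' steps so callers can't forget them."""
    connection = Connection()
    connection.open()
    try:
        yield connection
    finally:
        connection.close()  # runs even if the body raises


with connected() as conn:
    conn.send("hello")
```

The docs then only need to say "use with connected()", and the "make sure you..." sentence disappears.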

Orderability

With orthogonality at the center, the flowchart divides into lingual and mathematical halves of a Venn Diagram, with the left hand helping humans to read and use them, and the right to better computability.

Q&A

Book: Making Software: chapters on API usability and linguistic influence.
Book: RESTful web APIs (by author of BeautifulSoup)
I like my code to read like English. Ex: not using verbs as function names, but nouns describing what's returned. sorted() as an example.
Decoupling has its tendrils in many places
How does change management fit into all this? Use semantic versioning! Compatibility is a place to bolt on an argument. Composability is one way to do it, via decoupling. Compatibility puts us in 4-dimensional space, and we get into time-based coupling.
Another principle: Fractalness, an API can be used at any level of abstraction.

Getting Started With Testing

Ned Batchelder
@nedbat http://bit.ly/pytest0

Goals:

Show you a way to test
Remove mystery

Why test?

Know if your code works
Save time
Better code (more modularity, separation of concerns)
Remove fear, turn it into boredom
"Debugging is hard, testing is easy."

Yes, testing is hard.

A lot of work.
People (you) won't want to
But: it pays off
Fight chaos!

Roadmap

Growing tests
unittest
Mocks

First principles: Growing tests

First attempt: interactive.

Good: testing the code
Bad: not repeatable
Bad: labor intensive
Bad: is it right?

Second attempt: standalone python module exercising the code.

Good: testing the code
Better: repeatable
Better: low effort
Bad: are the results right?

Third attempt: print expected results

Good: repeatable w/ low effort
Better: explicit expected results
Bad: Have to check manually

Fourth attempt: check results automatically, print and assert

Good: repeatable with low effort
Good: explicit expected results
Good: results checked automatically
Bad: failure stops tests

Getting complicated!

Tests will grow
Real programs
real engineering

Good tests are:

Automated
Fast
Reliable
Informative
Focused

unittest

python stdlib
infrastructure for well-structured tests
patterned on xUnit

Test isolation

every test gets a new test object
tests can't affect each other
failure doesn't stop the next test

setUp and tearDown

Establish context
Common pre- or post- work

Test engineering

Treat your test code like real code. Engineer it.
Pro tip: use your own base TestCase subclass.
TestCase.assertRaises() works as a context manager!
Make your tests expressive. Refactor.
Extract repetitive boilerplate to setUp().
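
A small sketch pulling these tips together. The Stack class is a made-up subject under test (not the example from the talk); the suite is run programmatically here rather than through unittest.main() so the snippet can be pasted anywhere.

```python
import unittest


class Stack:
    """Tiny class under test, invented for this example."""

    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()


class StackTest(unittest.TestCase):
    def setUp(self):
        # Shared context: every test method gets its own fresh Stack.
        self.stack = Stack()

    def test_pop_returns_last_pushed(self):
        self.stack.push(1)
        self.stack.push(2)
        self.assertEqual(self.stack.pop(), 2)

    def test_pop_empty_raises(self):
        # assertRaises as a context manager.
        with self.assertRaises(IndexError):
            self.stack.pop()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(StackTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```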

Tests are real code!

Helper functions, classes, etc.
Can become significant!
Might need their own tests!

Mocks

Testing small amounts of code

Systems are built on layers

Dependencies are bad

More suspect code in each test
Slow components
Unpredictable components

Enter test doubles:

replace a component's dependencies
Focus on one component

The question should be:

assuming this outside service is working,
do my tests work?

Be careful not to skip code that needs to be tested when mocking!

Instead of stubbing our method, we fake urllib.urlopen instead.

Stdlib is stubbed
All our code is run
No web access during test

Don't do all this yourself: use a mock object library, like mock or mox.

Mock objects:

automatic chameleons
act like any object
record what happened
patch context manager! with mock.patch('urllib.urlopen') as urlopen: urlopen.return_value = fake_yahoo
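
A sketch of the same idea on Python 3, where the mock library lives in the stdlib as unittest.mock and urlopen has moved to urllib.request. The fetch_text function, the URL, and the fake payload are made up for illustration.

```python
import urllib.request
from unittest import mock


def fetch_text(url):
    """Code under test: all of it runs, only urlopen is faked."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def test_fetch_text():
    fake_response = mock.MagicMock()
    # urlopen() is used as a context manager, so configure __enter__.
    fake_response.__enter__.return_value.read.return_value = b"AAPL,500.00"
    with mock.patch("urllib.request.urlopen", return_value=fake_response) as urlopen:
        assert fetch_text("http://example.invalid/quote") == "AAPL,500.00"
        # The mock recorded how it was called.
        urlopen.assert_called_once_with("http://example.invalid/quote")


test_fetch_text()
```

No web access happens during the test, and the patch is undone when the with block exits.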

Test Doubles:

powerful: isolates code
focuses tests
removes speed bumps and randomness
BUT: fragile tests!
Another, better way to do this: dependency injection

Tools

addCleanup: nicer than tearDown
doctest: only for testing docs
nose, py.test: test runners
ddt: data-driven tests
coverage
Selenium: browser tests
jenkins, Travis: ci

TDD: tests before code?
BDD: describe external behavior
Integration tests: bigger chunks
Load tests: how much traffic is OK?

Summing up

Complicated
Important
Worthy
Rewarding

Q&A

python-unittest-skeleton on github
TESTING IS ENGINEERING.

Unit-testing makes your code better

Greg Ward
@gergdotca

Assumption #1:

You've at least started to drink the Kool-Aid.

either you're already writing unit tests
or you're ready to start, with or w/o this talk

Assumption #2:

Corollary of #1: You already get that unit testing helps make code more correct.

I'm talking about better on a higher plane: aesthetics, elegance, beauty.

Beautiful code is better code:

easier to understand
easier to extend
easier to reuse

Plan

Real life case study:

examine some untested code
work through adding tests
understand how imperfect design -> hairy tests
modify the design for simpler tests -> better code

Background

what is this code?
why does it exist?
where does it come from?
what requirements does it meet?

What is the code?

we measure the internet
we ping all your public IPs every couple of months
we traceroute everything
result: ~200m traces/day
throw it all in plain text!

Staying sane w/ plain text:

keep it simple, stupid
restrict the data tightly to avoid escaping
stay consistent even as data and requirements evolve

T3 files contain one record for each trace. Variable number of fields, just to keep things interesting.

TIP1 files contain one record summarizing all traces sent to a single target.

Lots of similarities. Common format, common library.

dozens of similar formats
writing new parser for each would be nuts
hence, GenericLineParser
with many subclasses: T3Parser, TIP1Parser, etc.

Requirements

structured
fast
flexible

Good news: when we start testing, the code meets all requirements.

Where to start?

You can't test an object if you can't construct it. So, start w/ the constructor. This goes double in cases like this, with a non-trivial constructor (complex internal logic, sometimes does I/O).

512 code paths through the constructor, based on args! Required only 6 test cases for one method, but definitely a code smell.

Constructors should be dead simple. Take arguments, store them. Be done.

Line parsers parse lines. Something else should open files. Convenience functions to the rescue!

Refactor the constructor, 6 tests to 3 tests.
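
A guess at the shape of that refactoring. The name zopen appears in the notes, but its signature here is an assumption, and LineParser is a simplified stand-in for the real GenericLineParser. The constructor just stores its arguments; opening files (plain or gzipped) is the convenience function's job.

```python
import gzip


def zopen(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


class LineParser:
    """Parses lines from any iterable of strings; does no I/O itself."""

    def __init__(self, lines, separator="\t"):
        # Dead simple constructor: take arguments, store them, done.
        self.lines = lines
        self.separator = separator

    def __iter__(self):
        for line in self.lines:
            yield line.rstrip("\n").split(self.separator)


# Tests can now feed a plain list instead of creating files.
print(list(LineParser(["a\tb\n", "1\t2\n"])))
```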

Progress so far

constructor simpler and shorter
other code can use zopen(), uzopen()
now supports gz files for free
less test code to maintain
fewer code paths to worry about, fewer code paths == fewer, simpler tests == better code

So I refactored some messy code. So what?

writing tests made me look deeper
made me read the code very carefully
made me see both the good side and the bad side

The courage to refactor

This is what unit-testing zealots like to boast about:

sounds hokey
sounds like something from a self-help book
but it's true!
absolutely no fear about tearing the line parser to pieces and putting it back together again, even though I didn't write it.

No happy ending... yet.

The applications that use this code are completely untested. I'm afraid to refactor.

easy to adapt existing clients of line parser to use uzopen()
...

Costs of not testing

incorrect code (bugs caught late in the cycle)
fear of refactoring
code duplication (-> bug duplication!)
insufficient code reuse

Don't let this get you down

1000 tests are better than 999 tests
1 test vastly better than 0 tests
unit tests will never cover everything (don't try!). cover almost everything.
you'll be surprised how much you can cover w/ effort.

Trojan horse time

Extreme programming!
Test-driven development!
Agile manifesto!
etc.

Conclusions

duh. Water is wet.
less obvious: writing unit tests makes code more beautiful
beautiful code is better, more reusable, more maintainable, more pleasant

Pushing Python: Lessons Learned Building a High Throughput Service in Python

Kevin Ballard
@misterkgb
[email protected]
github.com/tellapart/taba

Taba

Distributed event aggregation service
built w/ python, gevent, cython
10,000,000 events/sec, 50,000 metrics, 1000 clients, 100 processors

Lesson #1: Get the data model right

Once you've committed to a model, it's very difficult to change it.

The model, the way you flow data through the system, makes a big difference in the performance of the system.

Lesson #2: State is hard

Don't reinvent the wheel. Offload state into db systems designed to handle it, or offload to client.

Centralize your state. Make request handlers stateless. Handlers are now resistant to failure and scalable up and down. Also makes deployments easier.

Lesson #3: Generators + Greenlets = Awesome

Asynchronous iterator!!!!! Fan out, fan in!

In iterator -> in queue -> worker greenlets -> out queue -> out iterator

JIT processing
Automatically switches through I/O
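
A rough sketch of the fan-out/fan-in pipeline above. To keep it runnable with only the standard library, it uses threads and `queue.Queue` in place of gevent greenlets and queues; this is not Taba's code, and the name `fan_out_in` and its parameters are hypothetical:

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def fan_out_in(items, func, workers=4):
    """Yield func(item) for each item, computed by a pool of workers.

    Results arrive in completion order, not input order. There is no
    error handling: an exception in func would stall the pipeline.
    """
    in_q = queue.Queue(maxsize=workers * 2)  # bounded: limits in-flight data
    out_q = queue.Queue()

    def feed():
        # In iterator -> in queue, then one sentinel per worker.
        for item in items:
            in_q.put(item)
        for _ in range(workers):
            in_q.put(_DONE)

    def work():
        # In queue -> worker -> out queue.
        while True:
            item = in_q.get()
            if item is _DONE:
                out_q.put(_DONE)
                return
            out_q.put(func(item))

    threading.Thread(target=feed, daemon=True).start()
    for _ in range(workers):
        threading.Thread(target=work, daemon=True).start()

    # Out queue -> out iterator: stop once every worker has signed off.
    finished = 0
    while finished < workers:
        result = out_q.get()
        if result is _DONE:
            finished += 1
        else:
            yield result
```

The caller just iterates over `fan_out_in(source, handler)`; the bounded input queue means only a few items are pulled from the source ahead of the workers.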

Lesson #4: CPython suffers from memory fragmentation

Fragmentation is when a process's heap is inefficiently used
The GC may report a low memory footprint, but the OS reports a much larger RSS.

Ways to fight fragmentation:

Avoid large numbers of small objects (esp. combo of many small objects and a few large objects)
Minimize in-flight data (less used, less fragmented). Generators are great for this.
Reference, don't copy
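
One way to combine the last two points, as a small sketch: a generator that yields `memoryview` slices into one large incoming blob, so records are referenced rather than copied and only one is in flight at a time. The function name and the fixed-size record layout are assumptions made for the example:

```python
def iter_records(blob, record_size):
    """Yield fixed-size records as views into blob, without copying."""
    view = memoryview(blob)
    for start in range(0, len(view), record_size):
        # Slicing a memoryview creates a new view, not a new bytes object.
        yield view[start:start + record_size]
```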

Hybrid memory management:

Use Cython to allocate page-sized blocks of pointers into incoming chunk
Hand-off the whole thing to the GC to handle normally
For JSON, the resulting deserialized object points back to the large blob.

Ratcheting

Ratcheting is a pathological case of fragmentation, caused by the fact that the heap must be contiguous (a limitation of CPython that it cannot compact memory)
Large object at end of heap, small object added after, large object freed, but heap can't be shrunk until small object is freed.

Fighting this:

Avoid persistent objects (sockets common offenders)
Anything that has to be persistent should be created as soon as possible at app startup, before processing data
Avoid letting the heap grow in the first place

Slow Python, Fast Python

Alex Gaynor
https://speakerdeck.com/alex

What is performance? How fast things go. Fast websites sell more widgets on amazon.com.

Benchmarks are full of lies and nonsense.

Performance is specialization. We can achieve performance in our own apps by specializing.

Systems performance: what is the difference between micro and macro benchmarks? We understand unit and functional tests.

What is Python? It's the language we all get when we type python. Python is abstract now, with dozens of different machines. CPython is a specific machine.

Python isn't

Cython
C
Numba
RPython

These can make our apps faster, but they can't make our Python faster.

Untrue: Python is slow. Dynamic languages are slow.

Optimizing dynamic languages is simply different from optimizing typed languages.

You can monkey patch anything. How can you optimize that? Solved problem. Make assumptions, and make cheap checks.

Slow vs. harder to optimize. True, Python programs may run slow, but they can be optimized.

PyPy is an implementation of Python. It often runs your code faster than CPython.

Here's the deal: performance is about specialization. You choose good algorithms, I'll make them run fast. We have excellent strategies for optimizing dynamic code.

Use objects for objects, not dictionaries. Classes are more specialized than dictionaries.
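
A small illustration of the distinction, with hypothetical names: a fixed-shape record is a class, while a genuinely dynamic mapping (keys not known in advance) stays a dict.

```python
class Request:
    """A record with a fixed set of fields: use a class."""

    def __init__(self, method, path, status):
        self.method = method
        self.path = path
        self.status = status


def count_by_status(requests):
    """Keys are only known at runtime: a dict is the right tool here."""
    counts = {}
    for request in requests:
        counts[request.status] = counts.get(request.status, 0) + 1
    return counts
```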

Specialize your code for the use case. Python makes using general tools easier.

Strings: don't copy when you don't have to.

Zero Buffer. Work with strings in a sized buffer, manipulate w/o copying.

More myths:

Function calls are really expensive
Use builtins because they're fast
Don't write python in C or Java style

These get you to a local maximum. Might have worked on CPython of yore. Try PyPy first.

One Python: conventions are key. Use care with the conventions you use. Use fast conventions.

Q&A

cProfile
pypi/line_profiler
optimize the algorithm first.
wiki page has algorithm time/complexity annotations of most python builtins
dicts are not always dicts in PyPy!

Performance Testing and Profiling: A Virtuous Cycle

Dan Crosta
http://late.am
@lazlofruvous

Works at Magnetic, an online advertising provider. Many thousands of requests/second.

Overview

Performance testing web apps
Profiling w/ the standard library
Instrumentation
The Virtuous Cycle

Performance testing basics

Generate requests against your app (record and replay production)
Measure response time and error rate
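
A bare-bones sketch of those two steps, assuming the recorded requests are already available as a list and that `request_fn` (a hypothetical callable) sends one of them to the app and raises on failure:

```python
import time


def run_load(request_fn, requests):
    """Replay requests in order; report response times and error rate."""
    latencies = []
    errors = 0
    for request in requests:
        start = time.perf_counter()
        try:
            request_fn(request)
        except Exception:
            errors += 1
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    total = len(latencies)
    return {
        "count": total,
        "error_rate": errors / total if total else 0.0,
        "median_s": latencies[total // 2] if total else None,
        "max_s": latencies[-1] if total else None,
    }
```

Replaying a fixed list in a fixed order keeps runs comparable with each other.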

Types of testing:

stress test
load test
(not talking about: spike test, soak test)

Stress testing

Generate excessive load
- lots of requests
- slow/difficult requests
- adversarial testing
"How much can it take?"
Identify breaking point (esp. if you control synthetic load)

Not very good for identifying problems

Load testing

Generate specific, constant load
- Expected conditions
- Exaggerated conditions
"What if?"
Capacity planning

Best practices

Isolate testing from external influences
- Use dedicated load testing environments
- "scaled down" copies of all components
- results are extrapolatable
Generate load consistently
- Random considered harmful
- Automate, automate, automate! One click!

Profiling

Batteries included:

cProfile, pstats
documentation not really included

Goofy, horrible API. Avoid run(), runctx()

	import cProfile
	profiler = cProfile.Profile()
	profiler.enable()
	# ... do stuff
	profiler.disable()
	profiler.dump_stats("myprogram.prof")

Then:

	import pstats
	stats = pstats.Stats("myprogram.prof")
	stats.sort_stats("calls").print_stats()
	stats.sort_stats("calls").print_stats("webapp.py")
	stats.sort_stats("calls").print_callees('webapp.py:8(login)')
	stats.sort_stats("calls").print_callers('hashpw')

Use filters!
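
A self-contained sketch that ties the two snippets together without a stats file: `pstats.Stats` also accepts a `Profile` object directly, and the arguments to `print_stats()` act as filters (a string is a regex on the function entry, an integer limits the number of lines). `slow_sum` and `profile_report` are made-up names for the example:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    return sum(i * i for i in range(n))


def profile_report(func, *args, pattern=None, limit=10):
    """Profile one call and return a filtered pstats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative")
    # Filters are applied in order: regex first, then the line limit.
    filters = [f for f in (pattern, limit) if f is not None]
    stats.print_stats(*filters)
    return out.getvalue()


report = profile_report(slow_sum, 100000, pattern="slow_sum")
print(report)
```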

Profiling in practice

"Why is it slow?"
Good for identifying un-optimized code
- tight loops, recursion, lots of function calls
- these are candidates for optimizations
Good for identifying bottlenecks
- distinguish between slow external resource and slow app code

Other profilers

line_profiler: function decorator. prints at you.
yappi: profiles code across multiple threads; measure wall clock or CPU time. outputs profile data for pstats.

Instrumentation

Use statsd to collect time-series metrics (lightweight, low-overhead, always-on profiling)
Two key instruments:
- counters let you know how many things happened
- timers let you know how long they each took
Learn what's normal for your app (bonus: alert when things are not normal)
"Does the real world match expectations?"

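To make the two instruments concrete, here is a toy client that emits the statsd line format (`name:value|c` for counters, `name:value|ms` for timers) over UDP. It is a sketch, not the API of any particular statsd library; the class name, the `send` hook, and the default address are assumptions:

```python
import socket
import time
from contextlib import contextmanager


class TinyStatsd:
    """Toy statsd client: counters and timers over UDP."""

    def __init__(self, host="127.0.0.1", port=8125, send=None):
        self._addr = (host, port)
        self._sock = None
        # `send` is a hypothetical hook so tests can capture payloads.
        self._send = send or self._send_udp

    def _send_udp(self, payload):
        if self._sock is None:
            self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._sock.sendto(payload.encode("ascii"), self._addr)

    def incr(self, name, count=1):
        # Counter: how many things happened.
        self._send("%s:%d|c" % (name, count))

    def timing(self, name, millis):
        # Timer: how long one thing took, in whole milliseconds.
        self._send("%s:%d|ms" % (name, millis))

    @contextmanager
    def timer(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            self.timing(name, (time.monotonic() - start) * 1000)
```
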
Virtuous Cycle

Instrumentation & Alerting -> Performance Testing -> Profiling -> Performance optimization -> I&A

Lightning Talks

Writing Good commit messages

Why?

memory (short and long time scale)
collaboration

Two fundamental purposes of VC

reminds your future self why you made that change
tells your colleagues why you made that change
tells your future colleagues and successors why you made that change!

Assume the person who will be maintaining your code in two years is an axe-wielding maniac who knows where you live.

Tell me WHY (and WHAT) you changed.

Rules:

Tell me what you did and why. (The "what" is obvious from the diff; the "why" is not.)
Brevity is the soul of wit. Keep the novels out of the commit log.
Pick a style and stick w/ it (real sentences or telegraph english)
Pick a grammatical mode and stick with it: present tense imperative is best and most fun.
Spelling counts. As do grammar and punctuation. Pick a style and be consistent (and comprehensible).
Teamwork counts. You are not working alone. Everybody should follow the same rules.
TELL ME WHY. WHYYYYY?

Things You Didn't Know

Larry Hastings

Command line quoting

	subprocess.Popen(string or list, ...)
	string = " ".join(shlex.quote(arg) for arg in list)  # shlex.quote (3.3+) quotes one argument at a time
	list = shlex.split(string)
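
A runnable sketch of the round trip between an argument list and a shell-style string. Note that `shlex.quote` quotes a single argument, so the list has to be joined by hand (Python 3.8+ also offers `shlex.join`). The command and file name are made up:

```python
import shlex
import subprocess
import sys

# An argument list with a space in one argument.
args = [sys.executable, "-c", "import sys; print(sys.argv[1])", "my file.txt"]

# list -> string: quote each argument, then join.
command = " ".join(shlex.quote(arg) for arg in args)

# string -> list: split with shell-like rules.
round_trip = shlex.split(command)

# Passing the list avoids shell quoting problems entirely.
output = subprocess.check_output(args).decode().strip()
```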

Liskov Violation

Violation of the Liskov Substitution Principle.

Type T -> property P
Subtype s(T) -> property P
(Rect -> Square)? Liskov Violation!
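
The Rect -> Square case can be sketched as follows. The property P here is "setting the width leaves the height unchanged"; the class and function names are illustrative only:

```python
class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def set_width(self, width):
        self.width = width

    def set_height(self, height):
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Rect):
    """Keeps both sides equal, which breaks property P of Rect."""

    def __init__(self, side):
        super().__init__(side, side)

    def set_width(self, width):
        self.width = self.height = width

    def set_height(self, height):
        self.width = self.height = height


def stretch(rect):
    # Written against Rect: assumes width and height are independent.
    rect.set_width(4)
    rect.set_height(5)
    return rect.area()
```

`stretch(Rect(1, 1))` returns 20, but `stretch(Square(1))` returns 25: code that is correct for the base type gives a different answer for the subtype.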

Distributing your Python Game

Why write a game in Python?

Game jams
books
Raspberry Pi
school
why not?

Libraries and frameworks: http://wiki.python.org/moin/PythonGameLibraries

Build a MS Windows exe:

py2exe
pyinstaller
cx_freeze
cython

Server Security 101

Kevin Veroneau
@kveroneau
Pythondiary.com
Debiandiary.com
iamkevin.ca

Basics

Install fail2ban
Use IPTables to block IPs
Disable password auth in sshd_config
Always use priv/pub keys to connect
Disable SSHv1, only use SSHv2
Minimal packages
Configure and customize PAM

NEVER ALLOW ROOT TO LOGIN VIA SSH

Cannot stress enough
Always have a personal account and su to root.
Never have admins share the same accounts
Only give out root when absolutely needed
Configure sudo with commands the user may need to run
Have ITIL system in place to grant access

Simpler solutions

Use modularity when possible
Web server on separate user/process than other components
If one service exploited, it limits the damage.

Configure pam_limits

configure /etc/security/limits.conf
Protect against fork bombs by limiting resources
The speaker's personal website (an ncurses Python app that renders the page in a vt102 terminal) uses this to limit forked Python processes to 20

Partition the hard disc

If possible (usually not in the cloud, though there you can mount a loop file system instead)
Make sure the following is set up in fstab (/home, /var, /tmp noexec, nosuid, nodev)

Extreme

Mount the rootfs as RO
Build a live system in RAM!

Google Crisis Map

googlecrisismap.googlecode.com
Ka-Ping Yee
[email protected]
@zestyping

Publishing maps for disaster and humanitarian aid
