- Lynn Root
- roguelynn.com
- roguelynn-spy.herokuapp.com
Use scapy, a Python library for sniffing network traffic. Chrome does one DNS request for each autocomplete guess. Interesting.
DNS names end with a dot?
example.com vs. example.com.
The trailing dot makes the name absolute (an FQDN); without it the name is relative, which matters if your local resolution has something funky going on.
dig is your friend.
dig +trace python.org
. is the DNS root. Queries resolve down the hierarchy: . -> org -> python -> www.
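The walk from the root down to a name can be sketched with plain string handling. This only illustrates the hierarchy that dig +trace follows; the function name delegation_chain is made up for this example and no DNS queries are sent.

```python
def delegation_chain(name):
    """Return the zones visited from the root down to `name`.

    Pure string manipulation: no network traffic is generated.
    """
    # Treat the name as absolute whether or not it has the trailing dot.
    labels = [label for label in name.rstrip(".").split(".") if label]
    chain = ["."]
    # Build the name right to left: org. -> python.org. -> www.python.org.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("www.python.org"))
# ['.', 'org.', 'python.org.', 'www.python.org.']
```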
Show all records for the name:
$ dig +nocmd +noqr +nostats pyladies.com -t ANY
DNS relies on caching so that the root servers aren't hammered. So we start at our local DNS and go out from there until we get a result, which is then cached for a TTL.
Query -> Local Cache -> "closer" name server -> authoritative name server.
TTL is a balancing act. Too long, and changes take forever to propagate. Too short, and the authoritative server gets hammered.
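A toy version of TTL-based caching, to make the trade-off concrete. This is a sketch, not how any real resolver is implemented; the TTLCache class and the injectable clock are assumptions made so the example can be run without waiting, and 192.0.2.10 is a documentation-only address.

```python
import time


class TTLCache:
    """Toy cache that forgets entries once their TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so the example is testable
        self._entries = {}       # name -> (expires_at, value)

    def set(self, name, value, ttl):
        self._entries[name] = (self._clock() + ttl, value)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop it so the caller asks upstream again.
            del self._entries[name]
            return None
        return value


now = [0.0]                                  # fake clock, in seconds
cache = TTLCache(clock=lambda: now[0])
cache.set("python.org.", "192.0.2.10", ttl=300)
print(cache.get("python.org."))              # still cached
now[0] = 300.0
print(cache.get("python.org."))              # None: TTL elapsed
```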
You usually can't get the entire zone file. dnsmap brute-forces subdomain lookups to discover existing subdomains.
You can run a DNS server from Twisted. Cool.
Unicast, multicast, etc. Anycast: a one-to-nearest association. Google uses this. Someone in Australia querying 8.8.8.8 gets the same response, but from a nearer server.
DANE (DNS Authentication of Named Entities) uses DNSSEC. Apparently, firewalls can intercept HTTPS traffic and fake your secure connection.
Service Discovery. SRV records. Spotify clients do SRV lookups to get a service access point to the web API.
DHT: distributed hash table. Key-value store in DNS, distributed through the network. Spotify does this using TXT records.
- Guillaume Ardaud
- @gardaud
- gardaud.fr
Memcached:
- Key/object store.
- O(1) everything.
- Primary data in relational database
- data you can lose, regenerates slowly: persistent storage (mongo, redis)
- data you can lose, regenerates quickly: RAM store (memcached)
Expiries given in seconds. USE CONSTANTS.
Use set_many/get_many/delete_many. Slowest part of operation is network latency.
incr/decr good for counters.
add good for not clobbering existing keys.
Facebook has a fork that lets you dump memcached to the disk. Huh.
- ASCII based
- Not crazy long (they have to be hashed; a few dozen characters)
- Explicit! e.g. json.users.<user_id>
Bad naming: md5(sql_query)
Don't use user input for cache names!
Memcached is not queryable. There is a debug interface, but don't use it!
With multiple nodes, the client decides which node to go to using hashing.
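A minimal sketch of client-side node selection. It uses the simplest possible scheme (hash the key, take it modulo the number of nodes); real clients often use consistent hashing instead so that adding a node does not remap most keys. The node addresses and the pick_node name are made up for this example.

```python
import hashlib


def pick_node(key, nodes):
    """Map a cache key to one of the configured nodes.

    Every client that uses the same node list and the same hash
    picks the same node for a given key; the servers never talk
    to each other.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Hypothetical node list
NODES = ["cache1:11211", "cache2:11211", "cache3:11211"]
print(pick_node("json.users.42", NODES))
```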
Key stored at 8am w/ 2 hour expiration. What happens at 10am? Nothing. It gets removed if the client tries to fetch after 10am.
Memcached has a fixed amount of memory using an LRU cache. Objects can be evicted before expiration if memory fills up. Objects w/ oldest timestamp get evicted.
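The eviction behaviour can be sketched with an OrderedDict. This is a simplification: it counts items rather than bytes, and memcached keeps a separate LRU per slab class. The LRUCache class is made up for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used item."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)          # mark as most recently used
        return self._items[key]

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # evict the oldest entry


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")           # touch "a", so "b" is now the oldest
cache.set("c", 3)        # evicts "b" even though it never expired
print(cache.get("b"))    # None
```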
Sometimes memcached returns None even though the expiration hasn't been reached AND memory isn't full. Memcached pages its memory, and splits its pages into chunks of various slab classes. If all pages of the needed class are full, it takes a free page and assigns it the needed class. If there are no more free pages, the LRU kicks in and evicts data. Each slab class has its own LRU.
memcached -v # verbose output
memcached -M # doesn't evict when out of memory, but errors
memcached -I1k # change slab page size
memcached -f1.5 # change growth factor
man memcached # is your friend
Add a cache_name property to Django models. Use model versioning to invalidate cache names automatically.
get_many returns a dict, which may not have all the keys you requested. You'll have to fill those in yourself.
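A sketch of filling in the keys that get_many did not return. DictCache is an in-memory stand-in written for this example, not a real memcached client; only the get_many/set_many names come from the notes, and rebuild is a hypothetical callback that regenerates one value.

```python
USER_TIMEOUT = 60 * 5   # seconds; a named constant, not a magic number


class DictCache:
    """In-memory stand-in for a cache client (assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def get_many(self, keys):
        # Like the real thing: missing keys are simply absent.
        return {k: self._data[k] for k in keys if k in self._data}

    def set_many(self, mapping, timeout=None):
        self._data.update(mapping)


def fetch_all(cache, keys, rebuild):
    """Return a value for every key, rebuilding the ones that missed."""
    found = cache.get_many(keys)
    missing = [k for k in keys if k not in found]
    if missing:
        rebuilt = {k: rebuild(k) for k in missing}
        cache.set_many(rebuilt, timeout=USER_TIMEOUT)   # one round trip
        found.update(rebuilt)
    return found
```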
Thundering Herd problem: on a cache miss, if it's expensive to rebuild the object, a flurry of simultaneous requests will bury the application server in simultaneous builds. Solve the thundering herd with a lock object.
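One way to sketch the lock is with add, which only succeeds if the key does not exist yet. DictCache below is an in-memory stand-in written for this example; the lock key naming, retry count, and get_or_build helper are all assumptions, not a prescribed recipe.

```python
import time

LOCK_TIMEOUT = 30   # seconds the lock may live if the builder dies


class DictCache:
    """In-memory stand-in for a cache client (assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, timeout=None):
        self._data[key] = value

    def add(self, key, value, timeout=None):
        # Succeeds only when the key is not already present.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)


def get_or_build(cache, key, build, retries=50, wait=0.1):
    """Let one caller rebuild the value while the others wait for it."""
    lock_key = "lock." + key
    for _ in range(retries):
        value = cache.get(key)
        if value is not None:
            return value
        if cache.add(lock_key, 1, timeout=LOCK_TIMEOUT):
            try:
                value = build()              # only the lock holder builds
                cache.set(key, value)
                return value
            finally:
                cache.delete(lock_key)
        time.sleep(wait)                     # someone else is building
    return build()                           # gave up waiting
```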
Caching large values: say a large number of objects. Instead of caching them as one object, do a 2-phase fetch. Store the list of IDs, then store each object w/ set_many.
Paginated cache:
- Break big objects into smaller slices.
- Store each slice as a separate object
- Store the list of slices.
If some of the chunks get evicted, well, there you are.
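A sketch of the paginated cache, assuming a client with get_many/set_many. DictCache, the key naming scheme, and the slice size are made up for this example. If any slice has been evicted, the whole object is treated as a miss.

```python
SLICE_SIZE = 100


class DictCache:
    """In-memory stand-in for a cache client (assumption for this sketch)."""

    def __init__(self):
        self._data = {}

    def get_many(self, keys):
        return {k: self._data[k] for k in keys if k in self._data}

    def set_many(self, mapping, timeout=None):
        self._data.update(mapping)

    def delete(self, key):
        self._data.pop(key, None)


def store_paginated(cache, name, items, slice_size=SLICE_SIZE):
    """Store a big list as separate slices plus an index of slice keys."""
    slice_keys = []
    payload = {}
    for number, start in enumerate(range(0, len(items), slice_size)):
        key = "%s.slice.%d" % (name, number)
        slice_keys.append(key)
        payload[key] = items[start:start + slice_size]
    payload[name + ".index"] = slice_keys
    cache.set_many(payload)


def load_paginated(cache, name):
    """Reassemble the list, or return None if any part is gone."""
    index_key = name + ".index"
    index = cache.get_many([index_key]).get(index_key)
    if index is None:
        return None
    found = cache.get_many(index)
    if len(found) != len(index):
        return None                 # a slice was evicted
    items = []
    for key in index:
        items.extend(found[key])
    return items
```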
- Tres Seaver
- @tresseaver
- Straddling Python 2/3 in a single codebase
- Choosing target Python versions
- Porting as an iterative process
- Ordering components by dependencies
- Adding test coverage to reduce risk (if you don't have good tests, you will lose)
- Covering C extensions
Ported Zope3, ZODB, WebOb, Pyramid, other dependencies to Python3. 180kLOC Python, ~25 kLOC C.
-Port once, abandon Python2-
- Not the subject here
- Customers / users still need Python 2
- More feasible for applications than libraries
- 2to3 may be useful starting point.
-"Fix up" at installation using 2to3-
- Python2 users unaffected
- Python3 source "drifts" from canonical version. Bug reports don't match.
- 2to3 painfully slow on large codebases.
"Straddling" in a single codebase. (Thought to be impossible, initially.)
- Use compatible subset of Python syntax
- Conditional imports mask stdlib changes
- The six module can help (but you might not need it)
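A small sketch of what straddling looks like without six: a conditional import that hides a stdlib move, plus explicit text and bytes types. The PY3, text_type, binary_type, and to_text names are conventions chosen for this example, not something the talk prescribed.

```python
import sys

PY3 = sys.version_info[0] >= 3

# Conditional import: the module moved between Python 2 and 3.
try:
    from urllib.parse import urlparse   # Python 3 location
except ImportError:
    from urlparse import urlparse       # Python 2 location

if PY3:
    text_type = str
    binary_type = bytes
else:
    text_type = unicode                 # noqa: F821 (Python 2 only)
    binary_type = str


def to_text(value, encoding="utf-8"):
    """Decode bytes explicitly instead of relying on implicit promotion."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


print(urlparse("http://python.org/about").netloc)
print(to_text(b"bytes in, text out"))
```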
Syntax changes make Python2 < 2.5 hard
- No b'' literals
- No except Exception as e:
- Much more cruft/pain
Python 2.6 is the bare reasonable minimum. 2.4/2.5 are long past EOL. But some folks need system Python in "enterprisey" systems.
Incompatibilities make Python 3 < 3.2 hard.
- PEP 3333 fixes WSGI in Py3k
- callable() restored in 3.2
- 3.3 restores u'' literals.
- 3.2 is "system Python3" on some LTS systems.
Summary: support 2.6, 2.7, 3.2+
- Ports are great opportunities for bug injection!
- Fear of breaking working software is the barrier, even more than the effort required.
- Some mitigations also improve your software.
- Improved testing
- Modernized idioms in Python2
- Clarity in text vs. bytes.
Port packages with no dependencies first. Then port packages with already-ported dependencies. Note the Python versions supported by dependencies. Lather, rinse, repeat. Finally, port the application.
- Read Lennart's book!
- python2.7 -3 can point out problem areas.
- Modernize idioms in Python2 code, e.g. exception syntax, with open() as fi.
- Distinguish bytes vs. text. Use b''/u'' for all literals. Quit letting Python promote things to unicode for you.
- Adopt new syntax.
- E.g. except ... as ...
- print()
- Use new stdlib facilities, e.g. io.BytesIO vs. StringIO.StringIO
- Untested code is where the bugs go to hide.
- 100% coverage is ideal before porting.
- Unit testing preferable for libraries
- Functional testing best for applications
- Subtle bugs hide in the libraries
- Measure test coverage: pypi/coverage
- Work to improve assertions as well as coverage
- Assert contracts, not implementation details
- Don't assert against exception types/formats, things that change between versions!
- If at all feasible, convert doctests to Sphinx examples. Sphinx can run examples to make sure they don't break.
- Automate running tests. tox helps ensure that tests pass under all supported versions. (Also test pypy!)
- Don't run coverage on all your tests. Coverage is really slow. Just use it on one separate tox target on one version of python.
- Testing C is harder!
- http://python3porting.com/cextensions.html (Lennart's book)
- Maintain a Python reference implementation
- Easier to test
- Supports PyPy
- Design for same API as C
- 100% coverage for Python
- Ensure C version passes same tests.
- Signal supported versions using Trove classifiers in PyPI
- Consider bumping the major version. Allow users to stick with "safe" versions as you iterate.
- Apply continuous integration
- Travis CI
- Jenkins
- Shining Panda for Windows
- python3porting.com
- testrun.org/tox/latest
- pypi.python.org/pypi/six
- python.org/wiki has common idioms for Python 2/3
- Ned Jackson Lovely
- @nedjl
- slides at http://www.njl.us
An achievable goal: a personalized filter for Hacker News.
Machine learning is just applying statistics to big piles of data, using it to understand the data better or make predictions.
- Get data
- Engineer the data
- Train and tune models (SCIENCE!)
- Apply model to new data
Use scikit-learn. The documentation is fantastic. The hard part is installing SciPy.
The terminology is daunting. When you don't understand the math, go "blah blah blah" and keep on reading.
Supervised learning is when you have input data and output data. Unsupervised learning is about understanding your data; visualization, grouping, etc.
We'll focus on supervised learning.
Good books:
- NLP with Python
- Building machine learning systems w/ Python
- Learning scikit-learn with Python
- Programming Collective Intelligence
Parallel arrays: (x, y): x is article, y is category.
Set aside a validation set: data your model hasn't seen during training. Take 25% of your data and use it at the end to validate your learning.
Hyper-parameters are magical for tuning. GridSearch is a great tool.
You'll see a lot of these functions:
- transform()
- fit()
- predict() # SCIENCE!
transform(X[, y])
fit(X, y): X, what it gets; y, what the result should be.
predict(X): predict based on the fit() training.
requests & lxml
Classifying Dreck and Non-Dreck: he wrote a web app to classify 5,000 articles. 20% were Non-Dreck.
Data: Title, URL, Submitter, Content of Link, Rank, Votes, Comments, Time of Day, Dreck or Not.
Turning that messy data to normalized Numpy arrays: "Time flies like an arrow, fruit flies like bananas"
- Bag of Words : count occurrences: "flies": 2, "arrow": 1, "bitcoin": 0
- n-grams: time flies, flies like, like an, like bananas
- Normalization: stemming
- Stop words: cut out the useless words (articles, etc.)
- TF-IDF: Term-Frequency, Inverse Document Frequency. (e.g. an article about bitcoin has more refs to bitcoin than an article not about it)
- Pull out the relevant text (readability package)
- Roll your own features (e.g. bump up long-form content)
- Combine features (pipeline w/ TF-IDF with long-form feature)
- Hostname pipeline: extract hostnames into numpy array. Pickle your built classifier (save the data to recreate it) then use the classifier to predict.
- Use unsupervised learning.
- Predict numerical scores.
- Watch an RSS feed.
- Auto-submit it!
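The bag-of-words and n-gram features described above can be sketched in pure Python on the example sentence. scikit-learn ships vectorizers that do this properly; the tokenizer and function names here are made up for illustration.

```python
import re
from collections import Counter

SENTENCE = "Time flies like an arrow, fruit flies like bananas"


def tokenize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count occurrences of each word; unseen words count as 0."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Return the overlapping runs of n consecutive words."""
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bag = bag_of_words(SENTENCE)
print(bag["flies"], bag["arrow"], bag["bitcoin"])   # 2 1 0
print(ngrams(SENTENCE)[:3])   # ['time flies', 'flies like', 'like an']
```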
Melanie Warrick nyghtowl.io @nyghtowl
- Hackbright Academy
- Zipfian Academy
Covering:
- Machine Learning Overview
- AI, data science, big data relationships
- Example code, linear regression
- Algorithms & tools
- Skills and resources
"Computers... ability to learn without... explicit programming." Arthur Samuel (1959)
- Build a model that finds patterns and/or predicts results
- Apply algorithms
- Pick best result for pattern match or prediction
Ex: spam detection, weather prediction
Linear regression (line fitting)
y = mx + b
Find best fit m & b algorithm to predict/pattern match. (e.g. plotting High School GPA vs. University GPA)
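Finding the best-fit m and b can be sketched with ordinary least squares in pure Python. The data points below are invented for the example and are not the GPA data from the talk; in practice scikit-learn's LinearRegression does this for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b with a single input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(x, m, b):
    return m * x + b


# Made-up points that lie exactly on y = 2x + 1
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)                 # 2.0 1.0
print(predict(10, m, b))    # 21.0
```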
- Handwritten address recognition
- Search engines (Google, Bing)
- Twitter and Facebook friend recommendations, Netflix
- Fraud detection
- Weather prediction
- Face detection
- AI, helping machines make better decisions. Intelligence exhibited by machines or software
- Data Science, helping people make better decisions. Get knowledge from data & create products
- Big Data challenges both AI and Data Science. Data volumes beyond the ability of common tech to capture and curate. (2 GB == 20 yards of books, 50 PB == entire written works of humankind)
- Define goal and metrics
- Gather and clean data
- Explore and analyze
- Id Algorithm or method (ML)
- Build model (ML)
- Evaluate results (ML)
- Iterate
- Create data product, visualization.
- Make decisions.
Using pandas for data frames and scikit-learn.
Predict brain weight from head size. Head size is x, brain weight is y.
Cross-validation: hold out a certain percentage of the training data for testing and evaluating the model.
Metrics for evaluating a model: R-squared, where 1 is a perfect prediction.
Use matplotlib and seaborn for visualization (seaborn for prettification). Visualization helps you understand how a model is working.
- Unsupervised, continuous: clustering and dimensionality reduction (SVD, PCA, K-means)
- Unsupervised, categorical: association analysis (Apriori, FP-Growth), Hidden Markov model
- Supervised, continuous: regression (linear, polynomial), decision trees, random forests.
- Supervised, categorical: Classification (KNN, trees, logistic regression, Naive-Bayes, SVM)
- Test Model: Scikit Learn, Matplotlib
- Explore Data: Pandas, StatsModels, Matplotlib, numPy, Unix
- Build Model: Scikit, NumPy, Pandas, scipy
- Visualize: D3, matplotlib, vincent, vega, ggplot
- Algorithms
- Statistics (probability, inferential, descriptive)
- Linear Algebra (vectors & matrices)
- Data analysis (intuition)
- SQL, Python, R, Java, Scala (programming)
- Databases & APIs (get data)
- Andrew Ng's Machine Learning on Coursera
- Khan Academy (linear algebra, stats)
- "Think Stats" - Allen Downey
- Zipfian's practical intro to data science
- Metacademy
- Open Source data science masters
- StackOverflow, Data Tau, Kaggle
- Mentors!
- Peter Baumgartner, Founder of Lincoln Loop
- lincolnloop.com
- @ipmb
SaltStack is: configuration management. Version control your servers, self-documenting, repeatable, reusable.
Saltstack is: remote execution. Deploy your code, run one-off scripts, critical package updates, system monitoring.
- Familiar tools: Python/YAML/Jinja2.
- Community: Great documentation, insanely responsive (IRC, GitHub), backed by for-profit org.
- Young project
- Moves fast
- Not SSH (new SSH support is "alpha")
- Chef: knife, recipe, cookbook
- Puppet: terminus, metaparameters
- Ansible: playbook, inventory
- Master: server that manages the whole stack
- Minion: a server controlled by the master
- State: a declarative rep. of system state
- Grain: static information about a minion (RAM, CPUs, OS, etc.)
- Pillar: variables for one or more minions (ports, file paths, config parameters)
- Top file: matches states or pillars to minions
- Highstate: all the state data for a minion
- Binaries for most distros
- Pip install (bleeding edge)
- http://bootstrap.saltstack.org (probably what you want)
Master server: apt-get install salt-master
... or run masterless
Minion: apt-get install salt-minion; echo "10.10.1.1 salt" >> /etc/hosts
Accept the minion key on the master.
- Salt-cloud
- Custom modules
- Scheduler
- Renderers
- Returners (return to email, sentry, syslog)
- Reactor
- In minion conf, output_mode: mixed
- Jinja2 is powerful. Don't go nuts.
- Update often, and review the change log.
- Test before you deploy. Make friends w/ Vagrant or Docker.
- James King
- @agentultra
Designing procedurally generated content for games.
How do we represent tiles?
- List of lists?
- Single list? (Row Major Order) offset = (row * num_cols) + column
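The row-major formula and its inverse, as a quick sketch. The grid dimensions and tile characters are arbitrary choices for the example.

```python
def offset(row, column, num_cols):
    """Index into a flat list for the tile at (row, column)."""
    return row * num_cols + column


def position(index, num_cols):
    """Inverse of offset(): recover (row, column) from a flat index."""
    return divmod(index, num_cols)


NUM_ROWS, NUM_COLS = 3, 4
tiles = ["."] * (NUM_ROWS * NUM_COLS)      # single flat list
tiles[offset(1, 2, NUM_COLS)] = "#"        # wall at row 1, column 2

for row in range(NUM_ROWS):
    start = offset(row, 0, NUM_COLS)
    print("".join(tiles[start:start + NUM_COLS]))
```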
- Depth-first search (backtrack): long, twisty corridors
- Prim's algorithm, A*-style search: short, blocky corridors
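A sketch of the depth-first (backtracking) generator. It works on a grid of cells and returns the set of carved passages; the representation and the generate_maze name are choices made for this example, and an explicit stack is used to stay clear of the recursion limit.

```python
import random


def generate_maze(width, height, seed=None):
    """Carve a maze with randomized depth-first search.

    Returns a set of passages, each a frozenset of two adjacent cells.
    """
    rng = random.Random(seed)
    visited = {(0, 0)}
    passages = set()
    stack = [(0, 0)]
    while stack:
        x, y = stack[-1]
        neighbours = [
            (nx, ny)
            for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
            if 0 <= nx < width and 0 <= ny < height
            and (nx, ny) not in visited
        ]
        if not neighbours:
            stack.pop()            # dead end: backtrack
            continue
        nxt = rng.choice(neighbours)
        passages.add(frozenset(((x, y), nxt)))
        visited.add(nxt)
        stack.append(nxt)          # keep digging from the new cell
    return passages


maze = generate_maze(5, 4, seed=1)
print(len(maze))   # 19: every cell reached, so cells - 1 passages
```

Because the search always keeps digging from the newest cell, corridors run long before they branch, which is where the twisty feel comes from.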
Space-partitioning algorithm
Start with an open grid. Split it. Recurse on the sub-parts. Stop at minimum size or maximum depth. From the lowest, widest portion of the tree, ascend and connect nodes.
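A sketch of the splitting half of space partitioning. Rectangles are (x, y, width, height) tuples; the split function and its parameters are made up for this example, and the final step of connecting sibling nodes with corridors is left out.

```python
import random


def split(rect, depth, min_size, rng):
    """Recursively partition rect and return the leaf rectangles."""
    x, y, w, h = rect
    can_split_w = w >= 2 * min_size
    can_split_h = h >= 2 * min_size
    # Stop at maximum depth or when neither side can be cut.
    if depth == 0 or not (can_split_w or can_split_h):
        return [rect]
    # Prefer cutting the longer side so rooms stay roughly square.
    if can_split_w and (w >= h or not can_split_h):
        cut = rng.randint(min_size, w - min_size)
        first, second = (x, y, cut, h), (x + cut, y, w - cut, h)
    else:
        cut = rng.randint(min_size, h - min_size)
        first, second = (x, y, w, cut), (x, y + cut, w, h - cut)
    return (split(first, depth - 1, min_size, rng)
            + split(second, depth - 1, min_size, rng))


leaves = split((0, 0, 40, 20), depth=4, min_size=4, rng=random.Random(7))
for leaf in leaves:
    print(leaf)
```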
Techniques for placing things or generating terrain:
- Poisson Disks: distributing equidistant points across a space e.g. item placement
- Cellular Automata: e.g. caves!
- Perlin noise (simplex): e.g. pits!
Take a bunch of variables with continuous or discrete ranges (finite). Define a constraint on a few variables, e.g. 8 queens problem.
Ex: I've got all these rooms, here's the exit. I don't care where the boss is, but he must be at least one room away from the exit. These enemies need a place. And there should be a health potion near the beginning.
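The 8 queens problem makes a compact example of constraint solving by backtracking: one variable per row, a finite domain of columns, and a constraint that no two queens attack each other. This is a plain backtracking sketch, not a full constraint propagator, and the function names are made up for the example.

```python
def solve_queens(n=8):
    """Return every placement of n queens with no two attacking."""
    solutions = []
    columns = []   # columns[row] = column of the queen in that row

    def safe(row, col):
        # The constraint: no shared column and no shared diagonal.
        for r, c in enumerate(columns):
            if c == col or abs(c - col) == row - r:
                return False
        return True

    def place(row):
        if row == n:
            solutions.append(tuple(columns))
            return
        for col in range(n):
            if safe(row, col):
                columns.append(col)
                place(row + 1)
                columns.pop()      # undo, then try the next value

    place(0)
    return solutions


print(len(solve_queens(8)))   # 92
```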
Optimize!
- Represent sets as bitmasks
- Undo-stack
- Use as few variables as possible to stay in the discrete/finite domain as much as possible.
- Rogue Basin
- How to Build a Constraint Propagator in a Weekend
- goo.gl/sdrbkJ
- Horton goo.gl/xpLTFB
- PyGame, PyAngband
Provides controlled, audited access to servers. Not a key-based solution!
- Launch instances w/ cert authority
- Users that need access request a cert
- Security officer uses the ssh-ca tool
- ssh-ca generates audit trail in S3: who, when, why, how long
- Certificates include restrictions on use (time-based!)
- OpenSSH logs the key id (email address)
Instead of host certificates, sign the host key.
- github.com/cloudtools/ssh-ca
- CERTIFICATES section of man ssh-keygen(1)
- github.com/caretdashcaret/Patternfy
- Make magazine vol 38
Erik Rose: pip install peep https://pypi.python.org/pypi/peep/1.1
Amjith Ramanujam
"Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere*."
(* Anywhere meaning any reasonably modern Linux machine.)
- Like Chroot, BSD Jails, etc.
- Uses Linux Containers
- AUFS: Union Filesystem
- Git-like versioning
- REST API
- Don't need a full Guest OS like in virtualization
- Multiple containers share the same underlying libraries read-only. Any changes trigger copy-on-write.
- Lightweight
- Isolated instances
- Faster than VMs (often under a second startup time)
- docs.docker.io
- OSX: boot2docker (minimal linux VM) + docker client
- Image: Read-only snapshot
- Container: an instance of the image
- Registry: PyPI for docker images
- Repository: Projects in the Registry
Dockerfile: a series of commands
Django port forwarding: docker run -d -p host:container django-docker
- Volumes: mount folders host/container: docker run -v host_path:container_path django
- Links: Service discovery through env vars docker run --link mysql:db --name webapp django
- REST API
- docker daemon is also a server
- Postgres in a container
- Django/nginx in a container
- celery in a container
Makes testing very easy: Jenkins can run things in parallel that once had to be separated. Containers can be run on local machines.
Apparently, Docker.io recommends against running in production until v1.0.
- Andrew Godwin
- Author of South, Django's new migrations
- @andrewgodwin
- http://aeracode.org
South was good for its time, but it had some bad initial design decisions and core underlying problems: south.hacks.
- Django: schema backend, ORM hooks
- South 2: Migration handling, user interface
- Django: Schema backend, ORM Hooks, Migration handling, User interface
- South 2: Backport for 1.4-1.6
Logical separation:
- SchemaEditor: schema backend, ORM hooks
- Migrations: migration handling, UI
Not moving south into Django, instead, adding migrations to Django. Complete rewrite of South. New file format, many other things.
- Abstracts schema operations across DBs.
- Works in terms of Django fields/models.
- Contains per-database workarounds.
django.db.migrations:
- Migration file reader/writer
- Dependency resolver
- Autodetector
- Applied/unapplied tracking
- More concise
- Declarative
- Introspectable
- Creates models from migration sets
- Autodetector diffs created from on-disk
- Used to feed SchemaEditor / ORM
Postgres: it's great
MySQL:
- No transactional DDL
- No CHECK constraints
- Conflates UNIQUE and INDEX
Oracle:
- Different SQL syntax
- Picky about names
- Can't convert to/from TextField (LOB)
SQLite:
- AAAAAAAAAAAHHHHHHHHH
- Altering tables? Schema introspection? What?
- Django generally very good at this
- Auto-applies first migration if tables exist
- Ignores South-style migrations
- Explicit is better than implicit.
- Abstracting DBs is hard. Wouldn't do it from scratch.
- Composability rocks. It's simplified the code so much.
- Feedback is vital. I'm just not mad enough to do nasty things to my code. Users always find your edge cases.
- Erik Rose
- @ErikRose
- www.grinchcentral.com
Auden: "A -poet- programmer is, before anything else, a person who is passionately in love with language."
Programming is inventing new language. "Go learn Lisp or Haskell, it will change how you think about programming."
Sapir-Whorf hypothesis. Yes, it's fallen out of favor, esp. the strong form. But language still has flavor. Language influences how we think.
Wittgenstein: "The limits of my language are the limits of my world."
Having a symbol for something makes it mentally lighter-weight. Mental abstractions. Extracting symbols is the root of all human language. And software engineering!
Fowler: "Any fool can write code that a computer can understand. Good programmers write code that humans can understand."
Capture existing symbols, and use them in your API design. Take requests vs. urllib2 as a good example.
Robert Storm Petersen: "It's hard to make predictions, especially about the future."
The first step of designing a new library is: don't design a new library. The best libraries are extracted, not invented.
Example: blessings, extracted from a nose plugin.
- Identify the tasks
- Identify language constructs
- Identify patterns, protocols, interfaces, conventions
Yeats: "Think like a wise man, but communicate in the language of the people."
The culture you are in has spent a lot of time building up conventions. Use them. Don't be weird or clever. This shows respect to your users.
Ex: Macintosh Human Interface Guidelines. When you've learned one program, you've learned them all.
Principle of Least Astonishment: Try not to surprise the user.
get(key, default) vs. fetch(default, key)
- Frequent references to your own docs or source
- Feeling syntactically clever (novel syntax)
George Eliot: "The finest language..."
- Copying and pasting when writing against your own API
- Typing something irrelevant while grumbling "Why can't it just assume the obvious thing?"
- Long arg list, suggesting a lack of sane defaults
Saint-Exupéry: "Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away."
Two ways to go about this, one of them wrong.
print_formatted(...)
print_formatted(..., out=some_file) # WRONG!!!
print formatted(...) # CORRECT!
- Classes w/ lots of state. Lots of little classes struggling to get out. Ex. ElasticSearch PenaltyBox. Didn't add to Connection, new class.
- Deep inheritance hierarchies. Inheritance inherits invariant baggage from above, and must tiptoe around them.
- Violations of the Law of Demeter. "One dot rule." A.b is OK. A.b.c is not. A.b.c.d is right out.
- Mocking in tests. Your code may have too many dependencies! Testable code is decoupled code. Some mocking may be necessary, if your framework requires it. Mocking is not intrinsically evil, but it is a code smell.
- Bolt-on options.
Churchill: "All the great things are simple, and many can be expressed in a single word..."
Reduce barriers to re-use. Ex: ConfigParser. Not idiomatic python. Dictionaries would be the expected result, but it forces you to use its own API for anything. Can't substitute anything else.
MyClass.read(filename) # NO!
MyClass.parse(string) # YES!
- Users immediately transform output to another format
- Instantiating one object just to pass it to another
- Rewriting language-provided things
Talmud, Ta'anith 7b: "The bad teacher's words fall on his pupils like harsh rain; the good teacher's, as gently as dew."
Sloping sides that nudge you to the center. Cut grooves in your APIs.
Avoid nonsense representations, e.g. optional kwargs that are actually required, one or the other.
Fail shallowly!
Resource acquisition is initialization
Don't have invariants that aren't invariant. Ex: Designing a PoppableBalloon class. Require filling in initialization.
Compelling examples: MacPaint. Nintendo platformers. Set a good example, and people will follow it forever. Users are docile: they will do what you tell them to do.
- Representable nonsense. You shouldn't even be able to say nonsense.
- Invariants that aren't.
- Lack of a clear starting point.
- Long, complicated documentation.
More safety than grooves. More danger, higher walls, mean guard dogs in front.
rm *.pyc
rm *
rm -f *
How to report errors: Exceptions > Return values
- Docs that say "remember to..." or "make sure you...". Docs that say "before" or "after", add a context manager.
- Surprisingly few will report safety errors. People will blame themselves. Don't electrify the door knob.
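When the docs would otherwise say "remember to close it afterwards", a context manager can do the remembering. A minimal sketch with contextlib follows; the Connection class and the connected() helper are hypothetical and exist only for this example.

```python
from contextlib import contextmanager


class Connection:
    """Hypothetical resource that must be opened before use and closed after."""

    def __init__(self):
        self.is_open = False
        self.sent = []

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def send(self, message):
        # Report misuse with an exception, not a return value.
        if not self.is_open:
            raise RuntimeError("connection is not open")
        self.sent.append(message)


@contextmanager
def connected():
    """Handle the 'before' and 'after' steps so callers cannot forget them."""
    conn = Connection()
    conn.open()
    try:
        yield conn
    finally:
        conn.close()       # runs even if the body raises


with connected() as conn:
    conn.send("hello")
print(conn.is_open)        # False
```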
With orthogonality at the center, the flowchart divides into lingual and mathematical halves of a Venn Diagram, with the left hand helping humans to read and use them, and the right to better computability.
- Book: Making Software: chapters on API usability and linguistic influence.
- Book: RESTful web APIs (by author of BeautifulSoup)
- I like my code to read like English. Ex: not using verbs as function names, but nouns describing what's returned. sorted() as an example.
- Decoupling has its tendrils in many places
- How does change management fit into all this? Use semantic versioning! Compatibility is a place to bolt on an argument. Composability is one way to do it, via decoupling. Compatibility puts us in 4-dimensional space, and we get into time-based coupling.
- Another principle: Fractalness, an API can be used at any level of abstraction.
- Ned Batchelder
- @nedbat http://bit.ly/pytest0
- Show you a way to test
- Remove mystery
- Know if your code works
- Save time
- Better code (more modularity, separation of concerns)
- Remove fear, turn it into boredom
- "Debugging is hard, testing is easy."
Yes, testing is hard.
- A lot of work.
- People (you) won't want to
- But: it pays off
- Fight chaos!
- Growing tests
- unittest
- Mocks
First attempt: interactive.
- Good: testing the code
- Bad: not repeatable
- Bad: labor intensive
- Bad: is it right?
Second attempt: standalone python module exercising the code.
- Good: testing the code
- Better: repeatable
- Better: low effort
- Bad: are the results right?
Third attempt: print expected results
- Good: repeatable w/ low effort
- Better: explicit expected results
- Bad: Have to check manually
Fourth attempt: check results automatically, print and assert
- Good: repeatable with low effort
- Good: explicit expected results
- Good: results checked automatically
- Bad: failure stops tests
Getting complicated!
- Tests will grow
- Real programs
- real engineering
- Automated
- Fast
- Reliable
- Informative
- Focused
- python stdlib
- infrastructure for well-structured tests
- patterned on xUnit
- every test gets a new test object
- tests can't affect each other
- failure doesn't stop the next test
setUp and tearDown
- Establish context
- Common pre- or post- work
- Treat your test code like real code. Engineer it.
- Pro tip: use your own base TestCase subclass.
- TestCase.assertRaises() works as a context manager!
- Make your tests expressive. Refactor.
- Extract repetitive boilerplate to setUp().
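A short sketch pulling these tips together: setUp for shared context and assertRaises used as a context manager. The Portfolio class is a hypothetical piece of code under test, invented for this example.

```python
import unittest


class Portfolio:
    """Hypothetical class under test."""

    def __init__(self):
        self.stocks = []

    def buy(self, name, shares, price):
        if shares <= 0:
            raise ValueError("shares must be positive")
        self.stocks.append((name, shares, price))

    def cost(self):
        return sum(shares * price for _, shares, price in self.stocks)


class PortfolioTest(unittest.TestCase):
    def setUp(self):
        # Runs before every test: each test gets a fresh object.
        self.portfolio = Portfolio()
        self.portfolio.buy("IBM", 100, 176.48)

    def test_cost(self):
        self.assertAlmostEqual(self.portfolio.cost(), 17648.0)

    def test_bad_shares(self):
        with self.assertRaises(ValueError):
            self.portfolio.buy("IBM", 0, 176.48)


def run_tests():
    """Run the tests programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PortfolioTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_tests()
```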
Tests are real code!
- Helper functions, classes, etc.
- Can become significant!
- Might need their own tests!
Testing small amounts of code
- Systems are built on layers
Dependencies are bad
- More suspect code in each test
- Slow components
- Unpredictable components
Enter test doubles:
- replace a component's dependencies
- Focus on one component
Question should be:
- assuming this outside service is working,
- do my tests work?
Be careful not to skip code that needs to be tested when mocking!
Instead of stubbing our method, we fake urllib.urlopen instead.
- Stdlib is stubbed
- All our code is run
- No web access during test
Don't do all this yourself: use a mock object library, like mock or mox.
Mock objects:
- automatic chameleons
- act like any object
- record what happened
- patch context manager!
with mock.patch('urllib.urlopen') as urlopen:
    urlopen.return_value = fake_yahoo
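A fuller sketch of the patch context manager, written for Python 3, where mock lives in unittest.mock and urlopen lives in urllib.request (the notes use the Python 2 names). The fetch_quote function, the URL, and the fake response are all invented for the example.

```python
import io
import urllib.request
from unittest import mock


def fetch_quote(symbol):
    """Hypothetical code under test: fetch a price over HTTP."""
    url = "http://example.invalid/quote?s=" + symbol
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8").strip()


def demo():
    fake_response = io.BytesIO(b"42.0\n")
    # Only the stdlib call is replaced; all of our own code still runs.
    with mock.patch("urllib.request.urlopen") as urlopen:
        urlopen.return_value = fake_response
        result = fetch_quote("YHOO")
        # The mock recorded how it was called.
        urlopen.assert_called_once_with(
            "http://example.invalid/quote?s=YHOO")
    return result


print(demo())   # 42.0, with no web access
```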
Test Doubles:
- powerful: isolates code
- focuses tests
- removes speed bumps and randomness
- BUT: fragile tests!
- Another, better way to do this: dependency injection
- addCleanup: nicer than tearDown
- doctest: only for testing docs
- nose, py.test
- ddt: data-driven tests
- coverage
- Selenium: browser tests
- jenkins, Travis: ci
TDD: tests before code?
BDD: describe external behavior
Integration tests: bigger chunks
Load tests: how much traffic is OK?
- Complicated
- Important
- Worthy
- Rewarding
- python-unittest-skeleton on github
- TESTING IS ENGINEERING.
- Greg Ward
- @gergdotca
You've at least started to drink the Kool-Aid.
- either you're already writing unit tests
- or you're ready to start, with or w/o this talk
Corollary of #1: You already get that unit testing helps make code more correct.
I'm talking about better on a higher plane: aesthetics, elegance, beauty.
Beautiful code is better code:
- easier to understand
- easier to extend
- easier to reuse
Real life case study:
- examine some untested code
- work through adding tests
- understand how imperfect design -> hairy tests
- modify the design for simpler tests -> better code
- what is this code?
- why does it exist?
- where does it come from?
- what requirements does it meet?
- we measure the internet
- we ping all your public IPs every couple of months
- we traceroute everything
- result: ~200m traces/day
- throw it all in plain text!
Staying sane w/ plain text:
- keep it simple, stupid
- restrict the data tightly to avoid escaping
- stay consistent even as data and requirements evolve
T3 files contain one record for each trace. Variable number of fields, just to keep things interesting.
TIP1 files contain one record summarizing all traces sent to a single target.
Lots of similarities. Common format, common library.
- dozens of similar formats
- writing new parser for each would be nuts
- hence, GenericLineParser
- with many subclasses: T3Parser, TIP1Parser, etc.
Requirements
- structured
- fast
- flexible
Good news: when we start testing, the code meets all requirements.
You can't test an object if you can't construct it. So, start w/ the constructor. This goes double in cases like this, with a non-trivial constructor (complex internal logic, sometimes does I/O).
512 code paths through the constructor, based on args! Required only 6 test cases for one method, but definitely a code smell.
Constructors should be dead simple. Take arguments, store them. Be done.
Line parsers parse lines. Something else should open files. Convenience functions to the rescue!
Refactor the constructor, 6 tests to 3 tests.
- constructor simpler and shorter
- other code can use zopen(), uzopen()
- now supports gz files for free
- less test code to maintain
- fewer code paths to worry about, fewer code paths == fewer, simpler tests == better code
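A sketch of the shape of the refactoring: a constructor that only stores its arguments, and a separate convenience function that opens files. Only the zopen name comes from the notes; its behaviour here, the LineParser class, and the tab delimiter are guesses made for illustration and are not the speaker's code.

```python
import gzip
import os
import tempfile


def zopen(path, mode="rt"):
    """Open a plain or gzip-compressed text file (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


class LineParser:
    """Parses lines. Opening files is somebody else's job."""

    def __init__(self, delimiter="\t"):
        self.delimiter = delimiter       # take arguments, store them, done

    def parse_line(self, line):
        return line.rstrip("\n").split(self.delimiter)

    def parse(self, lines):
        # Accepts any iterable of lines: a file, a list, a generator.
        for line in lines:
            yield self.parse_line(line)


def demo():
    """Round-trip two records through a temporary .gz file."""
    path = os.path.join(tempfile.mkdtemp(), "traces.txt.gz")
    with gzip.open(path, "wt") as out:
        out.write("a\tb\nc\td\n")
    with zopen(path) as lines:
        return list(LineParser().parse(lines))


print(demo())   # [['a', 'b'], ['c', 'd']]
```

Because parse() takes any iterable, tests can pass a plain list of strings and never touch the filesystem.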
So I refactored some messy code. So what?
- writing tests made me look deeper
- made me read the code very carefully
- made me see both the good side and the bad side
This is what unit-testing zealots like to boast about:
- sounds hokey
- sounds like something from a self-help book
- but it's true!
- absolutely no fear about tearing the line parser to pieces and putting it back together again, even though I didn't write it.
The applications that use this code are completely untested. I'm afraid to refactor.
- easy to adapt existing clients of line parser to use uzopen()
- ...
- incorrect code (bugs caught late in the cycle)
- fear of refactoring
- code duplication (-> bug duplication!)
- insufficient code reuse
- 1000 tests are better than 999 tests
- 1 test vastly better than 0 tests
- unit tests will never cover everything (don't try!). cover almost everything.
- you'll be surprised how much you can cover w/ effort.
- Extreme programming!
- Test-driven development!
- Agile manifesto!
- etc.
- duh. Water is wet.
- less obvious: writing unit tests makes code more beautiful
- beautiful code is better, more reusable, more maintainable, more pleasant
- Kevin Ballard
- @misterkgb
- github.com/tellapart/taba
- Distributed event aggregation service
- built w/ python, gevent, cython
- 10,000,000 events/sec, 50,000 metrics, 1000 clients, 100 processors
Once you've committed to a model, it's very difficult to change it.
The model, the way you flow data through the system, makes a big difference in the performance of the system.
Don't reinvent the wheel. Offload state into db systems designed to handle it, or offload to client.
Centralize your state. Make request handlers stateless. Handlers are now resistant to failure and scalable up and down. Also makes deployments easier.
Asynchronous iterator!!!!! Fan out, fan in!
In iterator -> in queue -> worker greenlets -> out queue -> out iterator
- JIT processing
- Automatically switches through I/O
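A minimal sketch of the in iterator -> in queue -> workers -> out queue -> out iterator pipeline. The talk used gevent greenlets; this version swaps in standard-library threads and queues so it runs without extra dependencies, and the name async_map is made up for illustration. It has no error handling: a worker that raises would stall the output.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def async_map(func, items, workers=4):
    """Yield func(item) for each item, computed by a pool of workers.

    Results arrive in completion order, not input order.
    """
    # Bounded queues keep the amount of in-flight data small.
    in_q = queue.Queue(maxsize=workers * 2)
    out_q = queue.Queue(maxsize=workers * 2)

    def feed():
        # Fan out: push inputs, then one sentinel per worker.
        for item in items:
            in_q.put(item)
        for _ in range(workers):
            in_q.put(_DONE)

    def work():
        while True:
            item = in_q.get()
            if item is _DONE:
                out_q.put(_DONE)
                return
            out_q.put(func(item))

    threading.Thread(target=feed, daemon=True).start()
    for _ in range(workers):
        threading.Thread(target=work, daemon=True).start()

    # Fan in: stop once every worker has reported that it is done.
    finished = 0
    while finished < workers:
        result = out_q.get()
        if result is _DONE:
            finished += 1
        else:
            yield result
```

Because the result is itself an iterator, stages can be chained, and nothing is computed until the consumer starts pulling.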
- Fragmentation is when a process's heap is inefficiently used
- The GC may report a low memory footprint, but the OS reports a much larger RSS.
Ways to fight fragmentation:
- Avoid large numbers of small objects (esp. combo of many small objects and a few large objects)
- Minimize in-flight data (less used, less fragmented). Generators are great for this.
- Reference, don't copy
- Use Cython to allocate page-sized blocks of pointers into incoming chunk
- Hand-off the whole thing to the GC to handle normally
- For JSON, the resulting deserialized object points back to the large blob.
- Ratcheting is a pathological case of fragmentation, caused by the fact that the heap must be contiguous (a limitation of CPython that it cannot compact memory)
- Large object at end of heap, small object added after, large object freed, but heap can't be shrunk until small object is freed.
Fighting this:
- Avoid persistent objects (sockets common offenders)
- Anything that has to be persistent should be created as soon as possible at app startup, before processing data
- Avoid letting the heap grow in the first place
- Alex Gaynor
- https://speakerdeck.com/alex
What is performance? How fast things go. Fast websites sell more widgets on amazon.com.
Benchmarks are full of lies and nonsense.
Performance is specialization. We can achieve performance in our own apps by specializing.
Systems performance: what is the difference between micro and macro benchmarks? We understand unit and functional tests.
What is Python? It's the language we all get when we type python. Python is abstract now, with dozens of different machines. CPython is a specific machine.
Python isn't
- Cython
- C
- Numba
- RPython
These can make our apps faster, but they can't make our Python faster.
Untrue: Python is slow. Dynamic languages are slow.
Optimizing dynamic languages is simply different from optimizing typed languages.
You can monkey patch anything. How can you optimize that? Solved problem. Make assumptions, and make cheap checks.
Slow vs. harder to optimize. True, Python programs may run slow, but they can be optimized.
PyPy is an implementation of Python. It often runs your code faster than CPython.
Here's the deal: performance is about specialization. You choose good algorithms, I'll make them run fast. We have excellent strategies for optimizing dynamic code.
Use objects for objects, not dictionaries. Classes are more specialized than dictionaries.
Specialize your code for the use case. Python makes using general tools easier.
Strings: don't copy when you don't have to.
Zero Buffer. Work with strings in a sized buffer, manipulate w/o copying.
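To illustrate the no-copy idea with the standard library only, here is a small sketch using memoryview. This is a stand-in for the technique, not the API of the Zero Buffer library mentioned above; the request line is made-up sample data.

```python
data = bytearray(b"GET /index.html HTTP/1.1\r\n")
view = memoryview(data)

# Slicing a memoryview creates a new view onto the same buffer, not a copy.
method = view[:3]
path = view[4:15]

# Bytes are only materialized when explicitly requested.
print(bytes(method), bytes(path))

# Writes through the view are visible in the underlying buffer,
# and in every other view that overlaps the same region.
view[0:3] = b"PUT"
print(bytes(data[:3]), bytes(method))
```

Slicing a plain bytes or str object would have copied the data each time; the views above share one allocation.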
More myths:
- Function calls are really expensive
- Use builtins because they're fast
- Don't write python in C or Java style
These get you to a local maximum. Might have worked on CPython of yore. Try PyPy first.
One Python: conventions are key. Use care with the conventions you use. Use fast conventions.
- cProfile
- pypi/line_profiler
- optimize the algorithm first.
- wiki page has algorithm time/complexity annotations of most python builtins
- dicts are not always dicts in PyPy!
- Dan Crosta
- http://late.am
- @lazlofruvous
Works at Magnetic, an online advertising provider. Many thousands of requests/second.
- Performance testing web apps
- Profiling w/ the standard library
- Instrumentation
- The Virtuous Cycle
- Generate requests against your app (record and replay production)
- Measure response time and error rate
Types of testing:
- stress test
- load test
- (not talking about: spike test, soak test)
- Generate excessive load
- lots of requests
- slow/difficult requests
- adversarial testing
- "How much can it take?"
- Identify breaking point (esp. if you control synthetic load)
Not very good for identifying problems
- Generate specific, constant load
- Expected conditions
- Exaggerated conditions
- "What if?"
- Capacity planning
- Isolate testing from external influences
- Use dedicated load testing environments
- "scaled down" copies of all components
- results are extrapolatable
- Generate load consistently
- Random considered harmful
- Automate, automate, automate! One click!
Batteries included:
cProfile, pstats
- documentation not really included
Goofy, horrible API. Avoid run(), runctx().
import cProfile
profiler = cProfile.Profile()
profiler.enable()
# ... do stuff (the code you want to profile)
profiler.disable()
profiler.dump_stats("myprogram.prof")
Then:
import pstats
stats = pstats.Stats("myprogram.prof")
stats.sort_stats("calls").print_stats()
stats.sort_stats("calls").print_stats("webapp.py")
stats.sort_stats("calls").print_callees(r"webapp\.py:8\(login\)")  # filters are regexes, so escape the parentheses
stats.sort_stats("calls").print_callers('hashpw')
Use filters!
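For a version that runs end to end without a web app, here is a small self-contained sketch. The functions slow_sum() and handler() are made-up workloads; the profiler is passed straight to pstats.Stats instead of going through a dump file, and the report is filtered down to one function name.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive hot spot for the profiler to find.
    return sum(i * i for i in range(n))


def handler():
    return [slow_sum(10_000) for _ in range(20)]


profiler = cProfile.Profile()
profiler.enable()
handler()
profiler.disable()

# Render the report into a string instead of stdout.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# The string argument to print_stats() is a regex filter on function names.
stats.strip_dirs().sort_stats("cumulative").print_stats("slow_sum")
report = buf.getvalue()
print(report)
```

Capturing the report in a StringIO makes it easy to log or attach to a bug report rather than scraping the terminal.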
- "Why is it slow?"
- Good for identifying un-optimized code
- tight loops, recursion, lots of function calls
- these are candidates for optimizations
- Good for identifying bottlenecks
- distinguish between slow external resource and slow app code
- line_profiler: function decorator. prints at you.
- yappi: profiles code across multiple threads; measure wall clock or CPU time. outputs profile data for pstats.
- Use statsd to collect time-series metrics (lightweight, low-overhead, always-on profiling)
- Two key instruments:
- counters let you know how many things happened
- timers let you know how long they each took
- Learn what's normal for your app (bonus: alert when things are not normal)
- "Does the real world match expectations?"
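As a rough sketch of the two instruments, the snippet below emits counters and timers in the statsd line format (name:value|c and name:value|ms) over UDP. The address, the metric names, and the helper names incr() and timed() are assumptions for illustration; a real project would more likely use an existing statsd client library.

```python
import socket
import time
from contextlib import contextmanager

# Assumption: a statsd daemon on localhost, default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def _send(payload):
    # UDP is fire-and-forget: with no daemon listening the packet is dropped.
    try:
        _sock.sendto(payload.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass  # metrics must never break the app


def counter_line(name, value=1):
    """Counters say how many things happened."""
    return f"{name}:{value}|c"


def timer_line(name, millis):
    """Timers say how long each thing took, in milliseconds."""
    return f"{name}:{millis}|ms"


def incr(name, value=1):
    _send(counter_line(name, value))


@contextmanager
def timed(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        _send(timer_line(name, elapsed_ms))
```

Usage would look like incr("login.attempts") at the top of a handler and "with timed("login.time"):" around the work being measured.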
Instrumentation & Alerting -> Performance Testing -> Profiling -> Performance optimization -> I&A
Why?
- memory (short and long time scale)
- collaboration
Two fundamental purposes of VC
- remind your future self why you made that change
- tells your colleagues why you made that change
- tells your future colleagues and successors why you made that change!
Assume the person who will be maintaining your code in two years is an axe-wielding maniac who knows where you live.
Tell me WHY (and WHAT) you changed.
Rules:
- Tell me what you did and why. The what is obvious from the diff; the why is not.
- Brevity is the soul of wit. Keep the novels out of the commit log.
- Pick a style and stick w/ it (real sentences or telegraph english)
- Pick a grammatical mode and stick with it: present tense imperative is best and most fun.
- Spelling counts. As do grammar and punctuation. Pick a style and be consistent (and comprehensible).
- Teamwork counts. You are not working alone. Everybody should follow the same rules.
- TELL ME WHY. WHYYYYY?
- Larry Hastings
Command line quoting
subprocess.Popen(string or list, ...)
string = " ".join(shlex.quote(arg) for arg in list) # shlex.quote() is 3.3+ and quotes one argument at a time
list = shlex.split(string)
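A short sketch of the round trip between the list form and the string form. The argument values are made up; the child process is the running Python interpreter so the example does not depend on any external command.

```python
import shlex
import subprocess
import sys

args = [sys.executable, "-c", "import sys; print(sys.argv[1])", "hello world; it's me"]

# list -> string: quote each argument, then join with spaces.
command = " ".join(shlex.quote(a) for a in args)

# string -> list: split() undoes the quoting.
assert shlex.split(command) == args

# Passing a list avoids the shell entirely, so no quoting is needed.
out = subprocess.run(args, capture_output=True, text=True, check=True).stdout
print(out.strip())
```

The quoted string form is mainly useful for logging a command or handing it to a shell; for launching a process directly, the list form is the safer default.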
Liskov Violation
Violation of the Liskov Substitution Principle.
Type T -> property P
Subtype s(T) -> property P
(Rect -> Square)? Liskov Violation!
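The Rect -> Square case can be made concrete with a small sketch. The class and method names are illustrative, not from the talk. The property P here is "changing the width leaves the height alone", which holds for Rect but not for its subtype Square.

```python
class Rect:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def set_width(self, width):
        self.width = width

    def area(self):
        return self.width * self.height


class Square(Rect):
    def __init__(self, side):
        super().__init__(side, side)

    def set_width(self, width):
        # Keeping the square square silently changes the height too.
        self.width = width
        self.height = width


def widen(rect):
    """Check property P: changing the width leaves the height alone."""
    old_height = rect.height
    rect.set_width(rect.width + 1)
    return rect.height == old_height
```

Code written against Rect can rely on widen() returning True; passing a Square breaks that expectation, which is the violation.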
Why write a game in Python?
- Game jams
- books
- Raspberry Pi
- school
- why not?
Libraries and frameworks: http://wiki.python.org/moin/PythonGameLibraries
Build a MS Windows exe:
- py2exe
- pyinstaller
- cx_freeze
- cython
- Kevin Veroneau
- @kveroneau
- Pythondiary.com
- Debiandiary.com
- iamkevin.ca
- Install fail2ban
- Use IPTables to block IPs
- Disable password auth in sshd_config
- Always use priv/pub keys to connect
- Disable SSHv1, only use SSHv2
- Minimal packages
- Configure and customize PAM
- Cannot stress enough
- Always have a personal account and su to root.
- Never have admins share the same accounts
- Only give out root when absolutely needed
- Configure sudo with commands the user may need to run
- Have ITIL system in place to grant access
- Use modularity when possible
- Web server on separate user/process than other components
- If one service exploited, it limits the damage.
- configure /etc/security/limits.conf
- Protect against fork bombs by limiting resources
- Personal website uses an ncurses Python app to render the page in a vt102 terminal; it uses limits.conf to cap the forked Python processes at 20
- Use separate partitions if possible (not in the cloud); otherwise, you can mount a loop file system
- Make sure the following is set up in fstab (/home, /var, /tmp noexec, nosuid, nodev)
- Mount the rootfs as RO
- Build a live system in RAM!
- googlecrisismap.googlecode.com
- Ka-Ping Yee
- [email protected]
- @zestyping
Publishing maps for disaster and humanitarian aid