9/13/11 --- I have finished the research
PHP:
New Visitors:
Time elapsed: 0.56040978431702
Ads per second: 178.44085310158
Time per ad: 0.0056040978431702
100 x Returning Visitors:
Time elapsed: 5.7932071685791
Ads per second: 1726.159570857
Time per ad: 0.00057932071685791
Python:
New Visitors:
Time Elapsed: 1.23879694939
Ads per second: 80.7234793801
Time per ad: 0.0123879694939
100 x Returning Visitors:
Time Elapsed: 9.27730107307
Ads per second: 1077.89969531
Time per ad: 0.000927730107307
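The three figures in each block are related by simple arithmetic. The sketch below reproduces them, assuming 100 requests for the new-visitor runs and 100 x 100 = 10,000 requests for the returning-visitor runs. Those counts are inferred from the numbers above, not recorded in the log, and the helper name `throughput` is made up for illustration.

```python
def throughput(ad_count, elapsed_seconds):
    """Return (ads per second, seconds per ad) for one benchmark run."""
    return ad_count / elapsed_seconds, elapsed_seconds / ad_count

# Request counts are an assumption inferred from the reported figures.
NEW_REQUESTS = 100
RETURNING_REQUESTS = 100 * 100

php_new = throughput(NEW_REQUESTS, 0.56040978431702)
php_returning = throughput(RETURNING_REQUESTS, 5.7932071685791)
py_new = throughput(NEW_REQUESTS, 1.23879694939)
py_returning = throughput(RETURNING_REQUESTS, 9.27730107307)

# Ratio of Python to PHP elapsed time in each scenario.
slowdown_new = 1.23879694939 / 0.56040978431702
slowdown_returning = 9.27730107307 / 5.7932071685791
```

On these runs Python took roughly 2.2 times as long as PHP for new visitors and roughly 1.6 times as long for returning visitors.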
9/10/11 --- Initial Testing
I need to test the performance of PHP vs Python for a basic adserving scenario. I will create a basic Visitor-Demographic relationship model, and a simple HTTP (Python) server to act as a dummy datastore. I will also build an HTTP server to represent the ads running on the system. Both datastores will be designed with configurable delays to emulate a slow component in real-world scenarios. Then I will build an adserver in PHP and in Python, using the nearest possible equivalent implementations, and measure the performance of each against the other. I will set up unit tests to verify the correctness of the results returned by each system.
A visitor is identified by a GUID. For a given GUID, our datastore is expected to return a map of properties pertaining to:
- String properties: including gender, city, state, country, browser
- Metric properties (float): including age, page views, unknown?
- Tagging properties (list of strings): including interests, conditions and location data
A publisher will be able to target visitors by the above properties. Only visitors who match all of the targeted properties (exact matches for strings, ranges for metrics, and any of the listed tags) will be served ads. Each targeting configuration will be hashed to an ID that uniquely identifies the target.
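A minimal sketch of the targeting rules, in Python. The layout of the targeting dict (`strings`, `metrics`, `tags`), the function names, and the choice of SHA-1 over canonical JSON are all assumptions for illustration; the notes only say that the configuration is hashed to an ID.

```python
import hashlib
import json

def target_id(targeting):
    """Hash a targeting configuration to a stable hex ID.

    Keys are sorted so that dict ordering does not change the ID.
    The order of items inside a tag list is still significant.
    """
    canonical = json.dumps(targeting, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

def matches(visitor, targeting):
    """Return True only if the visitor satisfies every targeted property."""
    # String properties must match exactly.
    for key, wanted in targeting.get("strings", {}).items():
        if visitor.get(key) != wanted:
            return False
    # Metric properties must fall inside an inclusive [low, high] range.
    for key, (low, high) in targeting.get("metrics", {}).items():
        value = visitor.get(key)
        if value is None or not (low <= value <= high):
            return False
    # Tag properties match if the visitor has any one of the wanted tags.
    for key, wanted_tags in targeting.get("tags", {}).items():
        if not set(wanted_tags) & set(visitor.get(key, [])):
            return False
    return True

visitor = {"gender": "f", "country": "US", "age": 31.0,
           "interests": ["golf", "travel"]}
targeting = {"strings": {"country": "US"},
             "metrics": {"age": [25, 40]},
             "tags": {"interests": ["travel", "cooking"]}}
is_eligible = matches(visitor, targeting)
```

Here `is_eligible` is true because the country matches, the age is within range, and the visitor shares the `travel` tag.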
Caching will be leveraged heavily to avoid long lookups. 99th-percentile results should be served in <100ms for all tested scenarios.
Most cache lifetimes will be configurable, or based on common values as follows:
VISITOR_TIME: 15m, info about each specific visitor
ACTIVE_ADS_TIME: 1m, the systemwide set of active ads
AD_PERF_TIME: 5m, info about each ad's performance
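One way these lifetimes could be applied is a small in-process cache with a per-entry time to live. This is only a sketch: the `TTLCache` class and its `get_or_load` method are hypothetical names, and the real system might equally use an external cache.

```python
import time

# Lifetimes from the notes, in seconds.
VISITOR_TIME = 15 * 60
ACTIVE_ADS_TIME = 1 * 60
AD_PERF_TIME = 5 * 60

class TTLCache:
    """In-memory cache whose entries expire after a per-call TTL."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, ttl, loader):
        """Return the cached value, calling loader() only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + ttl, value)
        return value
```

A slow lookup is then wrapped at the call site, for example `cache.get_or_load("active_ads", ACTIVE_ADS_TIME, fetch_active_ads)`, where `fetch_active_ads` stands in for whatever function queries the ad store.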
The mAdserver will:
Receive a request from a client for an ad to serve: Clients will connect via HTTP, passing a visitor ID to the mAdserver.
Identify eligible ads for the visitor: Check the cache for the visitor's set of eligible ads [very fast]. If none is cached:
Connect to the demographic store to retrieve visitor parameters [potentially slow, once per visitor per VISITOR_TIME]
Connect to the ad store to retrieve a set of all running ads [potentially slow, once per ACTIVE_ADS_TIME]
Filter by targeting options, then cache the results for VISITOR_TIME.
Determine ad priorities: Get the (cached) ad performance metrics for each ad (impressions, clicks, bid, ctr, budget_left, impressions_since_last_flush, freq_cap, impressions_for_VISITOR_ID) [potentially slow, once per ad per AD_PERF_TIME]
Filter out frequency-capped ads and ads exceeding their daily partial budget [fast]
Weight each remaining ad based on performance decisions: [fast]
- Find eCPM by either (cpm_bid) or (cpc_bid * ctr * 1000)
- Get confidence score: http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
- Ads with fewer than 10,000 impressions get a +20% score
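The weighting rules could be sketched as follows. The eCPM formula and the +20% rule come straight from the list; the confidence score is the lower bound of the Wilson score interval described in the linked article. The notes do not say how the confidence score feeds into the final weight, so using it in place of the raw CTR for CPC ads is an assumption, as are the function names and the ad dict fields.

```python
import math

NEW_AD_IMPRESSIONS = 10000   # ads below this count get the boost
NEW_AD_BOOST = 1.20          # +20% score

def ecpm(ad, ctr):
    """eCPM is the CPM bid if the ad has one, else cpc_bid * ctr * 1000."""
    if ad.get("cpm_bid") is not None:
        return ad["cpm_bid"]
    return ad["cpc_bid"] * ctr * 1000

def wilson_lower_bound(clicks, impressions, z=1.96):
    """Lower bound of the Wilson score interval for the CTR (z=1.96 is ~95%)."""
    if impressions == 0:
        return 0.0
    n = float(impressions)
    p = clicks / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)

def score(ad):
    """Weight for one ad. How confidence is combined is an assumption here."""
    confident_ctr = wilson_lower_bound(ad["clicks"], ad["impressions"])
    value = ecpm(ad, confident_ctr)
    if ad["impressions"] < NEW_AD_IMPRESSIONS:
        value *= NEW_AD_BOOST
    return value
```

Using the lower bound rather than the observed CTR keeps an ad with a handful of lucky clicks from outranking an ad with a long, steady record, while the boost still gives new ads a chance to gather impressions.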
Memoize new ad performance metrics for chosen ad [fast]
Return the list of prioritized ads: As computed above. If and only if the "display_count" parameter has been set, we will assume all of these ads were displayed and increment their impression counts in the cached hash (otherwise we're presumably only generating a list of possible ads).
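A sketch of this final step, assuming the scores are held in a dict keyed by ad ID and the cached hash is a plain dict of per-ad counters. The function name and both data layouts are hypothetical.

```python
def prioritized_ads(scores, perf_cache, display_count=None):
    """Return ad IDs ordered by descending score.

    Impressions are counted only when display_count is set; otherwise
    the call is treated as a dry run that just lists candidate ads.
    """
    ranked = sorted(scores, key=lambda ad_id: scores[ad_id], reverse=True)
    if display_count is not None:
        for ad_id in ranked:
            perf = perf_cache.setdefault(ad_id, {})
            perf["impressions"] = perf.get("impressions", 0) + 1
    return ranked
```

Calling it without `display_count` leaves `perf_cache` untouched, which matches the "only generating a list of possible ads" case.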