@sleroux
Last active October 14, 2015 19:57
Frecency/Top Sites Optimization

The frecency query [1] we're making for our search recommendations/top sites is painfully slow: it can take up to a few seconds to complete. This is noticeable in places such as Top Sites, where we run the query every time the user visits the panel, and the items don't render until it finishes. Taking a closer look at the query we run to determine frecency, a number of factors are making it slow. Unlike desktop, frecency on mobile is calculated in real time instead of being pre-calculated. This results in a complex query that touches various tables with inner joins and computationally expensive ORDER/COALESCE operations. The query then runs over a potentially large data set, since it touches the history table; for users who have attached their FxA account, their desktop history is part of that set as well.
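
For context, the general shape of the query looks something like the sketch below. This is a heavily simplified illustration, not the actual SQL in SQLiteHistory.swift, and the table/column names are approximations. The cost drivers it shows are real, though: a join against every visit row, and an ORDER BY on a per-row computed expression that no index can satisfy.

```swift
// Simplified sketch of the real-time frecency query shape (illustrative;
// not the actual SQL from SQLiteHistory.swift). The expensive parts are
// joining every visit row and sorting on a computed expression, which
// forces a table walk plus an in-memory sort of the full result set.
let frecencySketchSQL =
    "SELECT h.id, h.url, h.title, " +
    "       COUNT(v.id) AS visitCount, " +
    "       MAX(v.date) AS latestVisit " +
    "FROM history h " +
    "INNER JOIN visits v ON v.siteID = h.id " +
    "GROUP BY h.id " +
    // Stand-in for the real frecency expression: computed per row,
    // so the ordering cannot come from an index.
    "ORDER BY (visitCount * latestVisit) DESC " +
    "LIMIT ?"
```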

After some discussion, there are two approaches we can take to optimize the way we handle frecency:

  1. Tweak/modify the existing query. There may be opportunities in the current query that we can take advantage of to make it a bit quicker. For example, instead of doing a full table scan of history items, we can ignore items that have no local visits right off the bat in our first query, reducing the data set we perform the joins/frecency calculations against (see the first sketch after this list). Modifying the existing query requires less coding work but could have larger effects on the UX of some areas of the app such as Top Sites. For example, ignoring remote visits in frecency means Top Sites would no longer be affected by syncing with your FxA account.

  2. Split the calculation and retrieval of frecency data into two different operations. This approach follows desktop more closely, in that frecency values would be calculated not in real time but according to some sort of heuristic. For example, desktop recalculates frecency on a daily basis, whereas on mobile we would need to determine a schedule that fits our needs. The advantage of this approach is that it moves the long operation of calculating frecency into a non-blocking task and removes the need to spend time calculating it when we need to present data to the user. On the other hand, because frecency would be calculated ‘lazily’, the data now has a ‘stale’ state, which would require us to add an invalidation heuristic. The other downside is that this would require changes to the database schema and altered queries (see the second sketch after this list).
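
To make option 1 concrete, here is one way the prefilter could look. This is a hypothetical sketch, not the actual query; in particular, the `is_local` flag on visits is an assumption, and the real schema may track visit locality differently.

```swift
// Hypothetical prefilter for option 1: shrink the working set before the
// expensive joins and frecency math happen. The `is_local` column is an
// assumption made for this sketch.
let localOnlyPrefilterSQL =
    "SELECT h.id, h.url, h.title " +
    "FROM history h " +
    "WHERE EXISTS (SELECT 1 FROM visits v " +
    "              WHERE v.siteID = h.id AND v.is_local = 1)"
```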
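
And here is a sketch of what option 2 might involve at the schema level. All names are invented for illustration. With frecency materialized into a column and indexed, the Top Sites read path becomes a simple indexed ORDER BY instead of a computed sort.

```swift
// Hypothetical schema change for option 2 (column/index names invented).
// Once frecency is stored and indexed, the hot read is an index walk.
let addFrecencyColumnSQL =
    "ALTER TABLE history ADD COLUMN frecency INTEGER NOT NULL DEFAULT -1"
let frecencyIndexSQL =
    "CREATE INDEX IF NOT EXISTS idx_history_frecency ON history (frecency)"
let topSitesReadSQL =
    "SELECT id, url, title FROM history ORDER BY frecency DESC LIMIT ?"
```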

[1] Frecency Query: https://github.com/mozilla/firefox-ios/blob/master/Storage/SQL/SQLiteHistory.swift#L306

@rnewman

rnewman commented Oct 14, 2015

The expense of the current approach really lies in two related things: having to consider the entire set of history URLs in order to select a very small set of domains, and having to sort the full set of rows in memory.

Our query time is approximately linear in the number of history items; because the inner query reduces down to a table walk with some simple indexed joins, it's approximately c × N × o, where o is the time spent to compute the order value in memory.

(Obviously we join against visits, and outside we join against domains and against icons… but the real cost is that we need to join everything and scoop it all into RAM to do the work, so this approximation is mostly true.)

Most approaches to making this faster address these two things:

  • We can exclude items from the set before we have to sort them — e.g., exclude URLs with no local visits.
  • We can pre-compute some parts of the joins or the ordering function: e.g., eliminate the join against visits by keeping min/max/counts denormalized into history (see the sketch after this list).
  • We can turn the full table scan + in-memory sort into an index walk by storing a time-insensitive frecency value, recomputing it on change or periodically. This is what desktop does.
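
A hypothetical illustration of the second bullet: keep per-site visit aggregates directly on the history row, maintained as visits arrive, so the hot query never has to touch visits at all. Column and trigger names below are invented, and the `is_local` flag is again an assumption.

```swift
// Hypothetical denormalization: visit aggregates live on the history row
// and are updated by a trigger on insert, so the read path no longer
// joins against visits. Names are invented for this sketch.
let denormalizeSQL =
    "ALTER TABLE history ADD COLUMN local_visit_count INTEGER NOT NULL DEFAULT 0; " +
    "ALTER TABLE history ADD COLUMN latest_local_visit INTEGER NOT NULL DEFAULT 0; " +
    "CREATE TRIGGER IF NOT EXISTS visits_denormalize " +
    "AFTER INSERT ON visits WHEN NEW.is_local = 1 " +
    "BEGIN " +
    "  UPDATE history SET " +
    "    local_visit_count = local_visit_count + 1, " +
    "    latest_local_visit = MAX(latest_local_visit, NEW.date) " +
    "  WHERE id = NEW.siteID; " +
    "END;"
```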

Notably we're in this situation because frecency is a function of stored data and the current time — it drops off over time. Making this a stored value is an approximation.
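
To see why a stored value can only approximate this, consider a desktop-style bucketed scoring function. The bucket boundaries and weights below are made up for the sketch, not the actual constants; the point is that the score depends on `now`, so any stored snapshot drifts away from the true value as time passes.

```swift
import Foundation

// Illustrative time-decayed score in the spirit of desktop's bucketed
// frecency. Boundaries and weights are invented for this sketch; the
// takeaway is that the result changes as `now` advances.
func sketchFrecency(visitDates: [Date], now: Date = Date()) -> Double {
    return visitDates.reduce(0.0) { score, visit in
        let ageDays = now.timeIntervalSince(visit) / 86_400
        let weight: Double
        switch ageDays {
        case ..<4.0:  weight = 100   // visited in the last few days
        case ..<14.0: weight = 70
        case ..<31.0: weight = 50
        case ..<90.0: weight = 30
        default:      weight = 10    // old visits still count a little
        }
        return score + weight
    }
}
```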

Probably the right solution for us is to persist a frecency value, and to be clever about invalidating/recomputing it.
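
One cheap invalidation strategy, as a hypothetical sketch: mark a row stale when a new visit lands, and let a background task recompute only the stale rows. This reuses the `frecency` column and its -1 default from the earlier sketch; the trigger name is invented.

```swift
// Hypothetical lazy invalidation: a new visit flags its history row as
// stale (frecency = -1, matching the column default sketched above), a
// background task recomputes only flagged rows, and reads stay an
// indexed walk.
let invalidateOnVisitSQL =
    "CREATE TRIGGER IF NOT EXISTS frecency_invalidate " +
    "AFTER INSERT ON visits " +
    "BEGIN " +
    "  UPDATE history SET frecency = -1 WHERE id = NEW.siteID; " +
    "END;"
let staleRowsSQL =
    "SELECT id FROM history WHERE frecency = -1" // recompute off the main thread
```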

That requires a lot of work, so in addition to the first two above we can also explore some alternative/simpler solutions:

  • Can we increase the SQLite page cache or tune other parameters so that most users' data fits in RAM? (See the sketch after this list.)
  • Are there any indices we're missing?
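
On the first bullet: SQLite's page cache is tunable per connection with PRAGMA cache_size (a real pragma; a negative value means KiB rather than a page count). The figure below is an arbitrary example, not a recommendation.

```swift
import SQLite3

// Bump the per-connection page cache. A negative cache_size value means
// "use this many KiB" rather than a page count; 8 MiB here is arbitrary.
func bumpPageCache(db: OpaquePointer?) {
    sqlite3_exec(db, "PRAGMA cache_size = -8192", nil, nil, nil)
}
```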
