SELECT *, metadataDB.page_metadata.title AS metadata_title
FROM
(
    SELECT *, id as historyID, max(length(url)),
        (SELECT count(1) FROM visits WHERE visits.siteID = history.id) as visitCount
    FROM history WHERE id IN (
        SELECT DISTINCT siteID AS historyID
        FROM visits WHERE date > 1491243277100158
        AND NOT EXISTS (SELECT * FROM visits WHERE siteID = historyID AND date > 1493055877100158)
    )
    AND domain_id NOT IN (SELECT cached_top_sites.domain_id FROM cached_top_sites)
    AND title NOT NULL AND title != ''
    AND domain_id NOT IN (SELECT domains.id FROM domains WHERE domains.domain IN (?,?))
    group by domain_id
    order by visitCount
    limit 100
)
LEFT JOIN view_history_id_favicon ON view_history_id_favicon.id = historyID
LEFT OUTER JOIN metadataDB.page_metadata ON metadataDB.page_metadata.site_url = url
ORDER BY COALESCE(iconURL, media_url) NOT NULL DESC /* Favour sites with some metadata */
LIMIT 8
See what happens if you leave out the limit 100. It might not be necessary, or you might need to keep it because you're working against your database.
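One cheap way to check, in the sqlite3 shell against a realistically sized history database (just a suggestion for how to measure, none of this is part of the change itself):

.timer on   -- print wall-clock time for each statement
.eqp on     -- print EXPLAIN QUERY PLAN output before each statement
-- Paste the query above, once with the inner "limit 100" and once without,
-- and compare both the plans and the timings.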
There's also a little bit of a flaw in the high-level approach here.

Say you have history IDs 3, 4, and 5, all with domain ID 99. 3 has the most visits. 4 has the longest URL, but only one visit. 5 has the best metadata.

You pick the one with the longest URL, returning 4 as historyID. You sort by visits, so this is in position 85 out of 100, but it makes it. You join it against page_metadata, where it gets sorted to the back for lacking metadata, and is pruned by the LIMIT 8.
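You can see the mechanism that does the picking with a throwaway table (a sketch only; the real tables obviously have more columns):

CREATE TEMP TABLE h (id INTEGER, domain_id INTEGER, url TEXT);
INSERT INTO h VALUES
  (3, 99, 'http://example.org/popular'),
  (4, 99, 'http://example.org/a/very/long/tracking/url?utm_source=email'),
  (5, 99, 'http://example.org/with-metadata');
-- SQLite takes the bare columns from the row that supplied max(length(url)),
-- so this returns id 4 for domain 99, regardless of visits or metadata.
SELECT id, max(length(url)) FROM h GROUP BY domain_id;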
I think this will be very common: ever clicked on a tracking link in email marketing, which dumps one row in your history with a long URL and no icon or thumbnail, then redirects you to a short URL full of content?
This whole setup is bad for two reasons. Firstly, if you'd arbitrarily picked 5, maybe it would have sorted right to the front and been the result the user wanted to see. Secondly, you're doing all that grouping for nothing.
Try to work in the metadata join further inside the query, because apparently that's totally key.
Just as you're only fetching sites within a certain timeframe, you should try to yield sites from the inner query biasing towards those with metadata. That'll reduce the need to generate 100 inner results then filter down to (you hope) 8 — just generate the right 8 up front! — and it'll perhaps also give you a better set of results than grabbing the longest URL with the most visits.
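Roughly this shape, as a sketch only (same filters and placeholders as the query above; the favicon join is left out for brevity, and the weighting inside max() is just an arbitrary way to prefer a row that has metadata over one that merely has a long URL):

SELECT history.id AS historyID, history.url, history.title,
       metadataDB.page_metadata.title AS metadata_title,
       metadataDB.page_metadata.media_url,
       (SELECT count(1) FROM visits WHERE visits.siteID = history.id) AS visitCount,
       max((metadataDB.page_metadata.site_url IS NOT NULL) * 100000 + length(history.url))
FROM history
LEFT OUTER JOIN metadataDB.page_metadata
  ON metadataDB.page_metadata.site_url = history.url
WHERE history.id IN (
  SELECT DISTINCT siteID FROM visits WHERE date > 1491243277100158
  AND NOT EXISTS (SELECT 1 FROM visits v2 WHERE v2.siteID = visits.siteID AND v2.date > 1493055877100158)
)
AND history.domain_id NOT IN (SELECT cached_top_sites.domain_id FROM cached_top_sites)
AND history.title NOT NULL AND history.title != ''
AND history.domain_id NOT IN (SELECT domains.id FROM domains WHERE domains.domain IN (?,?))
GROUP BY history.domain_id
ORDER BY (metadataDB.page_metadata.site_url IS NOT NULL) DESC, visitCount DESC
LIMIT 8

Whether the string join on site_url inside the grouping is affordable is exactly the schema question further down.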
You could even query the metadata table by freshness — when was the metadata recorded? — and walk it in order to find candidate pages. Might be quicker than chewing on visits then doing a string join.
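For instance (a sketch only; I'm assuming page_metadata carries a creation timestamp, called created_at here, which is a guess at the column name):

-- Freshest metadata first; the recency, top-sites, and domain filters from
-- the main query, plus de-duplication by domain, would still be needed.
SELECT history.id AS historyID, history.url, history.title,
       metadataDB.page_metadata.title AS metadata_title,
       metadataDB.page_metadata.media_url
FROM metadataDB.page_metadata
JOIN history ON history.url = metadataDB.page_metadata.site_url
ORDER BY metadataDB.page_metadata.created_at DESC
LIMIT 8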
In order to do any of these things you might need to improve the schema: page_metadata.site_url is the worst possible way to index a database, yet the page metadata is one of your key inputs! You should think carefully about the kinds of queries you need to run, and decide whether you would be better off with denormalized data (can you store the domain_id in page_metadata, so SQLite can join more easily, perhaps?) or a materialized view.
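The denormalization half of that might look something like this (a sketch; the column, index, and backfill are all hypothetical, and assume the metadata database is attached as metadataDB on the same connection as the history tables):

ALTER TABLE metadataDB.page_metadata ADD COLUMN domain_id INTEGER;
CREATE INDEX IF NOT EXISTS metadataDB.idx_page_metadata_domain_id
  ON page_metadata (domain_id);
-- One-off backfill; new rows would set domain_id when the metadata is recorded.
UPDATE metadataDB.page_metadata
SET domain_id = (SELECT history.domain_id FROM history
                 WHERE history.url = page_metadata.site_url);

Then the join becomes an integer comparison on an indexed column instead of a string match on URLs.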
Totally agree. I think trying to grab as much data as possible before making a decision on what to drop is better. This is how desktop does it. It fetches 500 history items and then ranks them all based on metadata/url/title as one step before pruning the list.
But (and I know this is a cop-out) I wanted this PR to just make things slightly better than before. The role of highlights and its use is still being tested, so for the MVP I don't want to spend too much more time on this.
At this point it gives us a baseline, which will be helpful to compare against whatever we come up with next.
I'm actually suggesting doing the opposite of what desktop does. Desktop grabs everything into memory and does the work there. We aren't well-placed to do that — the query is more expensive for us, and we have less memory.
But yes, figure out what you can do that's better than the current query, and measure against large real-world data sets. If it's quicker and has lower peak memory use, and the query is clearer, there's no reason not to ship it.
So if this is too slow: Steph has been working on https://bugzilla.mozilla.org/show_bug.cgi?id=1326564, which should allow us to hide the loading time.
We also have UI that we can show while highlights is loading, so the user doesn't need to wait for it to finish.