We are a non-profit organization that helps newsrooms, researchers, and public interest organizations host and share primary source documents. On the backend, we currently have 7.8 million documents, 131 million pages of searchable text, and 600 thousand individual shared notes in our database on production.
Each document has a somewhat complex permission system, which allows a document be set to private (only the owner can see it), shared (other members of their organization can see it), or public (publicly viewable and searchable), as well as a secondary set of permissions that allow documents to be added to project collections that also can grant access.
Users can put in a search on a centralized search bar that will parse through all of our documents and return matches that the user has permission to (if a person is not logged in, they can search through all public documents and get relevant results).
Currently, while the full text of a document is searchable, the next of individual notes a user leaves on documents is not. Similar to documents, notes can be public, private or shared only with those with edit access to the document. In addition, you must have view access to the note's parent document to be able to view it, even if the note is set to public. Querying for notes must only return and search through note content which the current user has access to.
A recent implementation of this feature worked in our staging server but failed in production, and we're looking for an experienced Solr Database expert's assistance to help think through the design and implementation, as well as troubleshoot specific issues that have come up.
If there's interest, we would also be potentially interested in help identifying and implementing additional optimizations to increase the speed of queries, support more complex queries, and ensure the service's resiliency. We would also be open to ongoing work to help us monitor and manage the database.
When indexing documents, we store the IDs for the organization and projects it belongs to, which allows us to query the appropriate documents based on the organizations and projects the user belongs to. We index notes as nested child documents to their documents.
To query on notes, we add two subqueries to each query. One searches through all public notes and notes you own for notes on documents you have view access to for the given query text. The other searches through all “edit access” level notes on only the notes you have edit access to for the given query text.
To retrieve the associated notes for a document, I use the subquery document transformer. I use this instead of the child document transformer in order to be able to do advanced filtering needed to filter out the notes based on the permissions. I issue two subqueries, one for public notes and notes you own, and one for notes with “edit access” permissions. The edit access permission notes are merged with the other notes in the application code if you have edit access. (I tried to keep most of the processing in Solr if possible for efficiency, but couldn’t figure out how to do this in Solr).
This approach seemed to work fine in development and staging, but had significant performance issues in production. Indexing documents with notes seemed about an order of magnitude slower than indexing documents had been previously. Also, our current database administrator (a contractor which manages issues that we currently work with) said the subqueries were using up a large amount of Heap memory and CPU processing power, causing responsiveness issues with Solr.
Our Solr server is hosted on Heroku along with the services primary backend, which is built on Django.