First of all, I am not very familiar with Elasticsearch settings,
but fairly familiar with what stock Lucene does.
I don't have any experience with NRTDirectory...
Lucene simply writes immutable segment files.
The NRTDirectory does not sync, in order to minimize the cost of a commit.
The OS will eventually flush these pages to disk.
On the read side, the file is mapped into memory.
On first access, the OS will experience a page fault.
The page is in the page cache, so the OS does not need to read anything
from the disk: it will just map the virtual memory of the process to
the page frame in the page cache. This is not a horrible event.
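To make the read side concrete, here is a minimal sketch using Lucene's MMapDirectory (the index path, field and term are placeholders; this is just an illustration of the mmap read path, not what NRTDirectory does internally):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.MMapDirectory;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            // Opening the directory/reader maps the segment files into the
            // process's virtual memory; nothing is read from disk at this point.
            try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // The first time the query touches a given page of a segment file,
                // the process takes a page fault. If the page is still in the page
                // cache (freshly written, never evicted), the kernel only wires the
                // mapping: a minor fault, no disk read. If the page was evicted,
                // it is a major fault and a disk read.
                int hits = searcher.count(new TermQuery(new Term("body", "hello")));
                System.out.println(hits);
            }
        }
    }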
Now, in what kind of event does one end up accessing a page that the
kernel evicted from the page cache?
In the Indeed case, I talked about two cases:
1) There wasn't enough memory on the server to begin with.
That's a no-brainer. If you don't have enough memory, you will experience
page cache misses. Still, this never happened on the term dictionary (which was in
anonymous memory, I believe) nor on the posting lists. This only happened on positions...
and the pages evicted were on positions too.
2) Indeed had a process where they would take all segments, and merge everything.
The index would go from
- [seg1-A ] [ seg2-A ] [ seg3-A]..[seg 10-A] : ~40GB
to
- [seg1-B] : ~40GB
Instantly, as the `Searcher` would use the new data, the service would experience a
massive number of page faults.
Warming up the new segment is not a solution, as the real problem is that the
system requires twice the amount of RAM for ~30s when the switch to the single
big merged segment occurs.
A more reasonable merge policy puts a cap on the size of a segment.
For instance, if the merge policy aims at producing segments of up to 5GB, the
required extra margin is 5GB.
In our example above, a segment merge would replace only a subset of our segments:
- [seg1-A ] [ seg2-A ] [ seg3-A]..[seg 10-A]
- [seg1-A ] [ seg2-A ] [ seg3-A] .. [seg6-A][seg1-B]
That's something that Lucene enforces in its merge policy.
While having 1000 segments hurts performance,
merging all segments down to one segment is not really useful.
If an index is large, having one segment or a dozen segments gives about the same performance.
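For reference, with stock Lucene this cap is expressed through TieredMergePolicy; a rough sketch (the 5GB value mirrors the example above; the analyzer, path and tier setting are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.MMapDirectory;

    public class MergePolicySketch {
        public static void main(String[] args) throws Exception {
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            // Never let a merge produce a segment bigger than ~5GB.
            mergePolicy.setMaxMergedSegmentMB(5 * 1024);
            // Keep roughly 8 similarly-sized segments per tier before merging them.
            mergePolicy.setSegmentsPerTier(8);

            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(mergePolicy);
            try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
                 IndexWriter writer = new IndexWriter(dir, config)) {
                // ... add documents here ...
            }
        }
    }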
As you noted, NRT induces write amplification.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*aside*
The phenomenon is a bit counterintuitive, because it is a problem that gets a little
bit smaller as indexing throughput increases.
If I soft commit every second, and I have one incoming document every second,
I will produce segments with 1 document, as you described.
With a merge factor of 8 (8 segments get merged together), I will end up with a write amplification of around 8.
If however I receive 1k documents per second, I will end up with a write amplification of 4 or so.
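Back-of-the-envelope, under the rough assumption that a document gets rewritten once per merge level, and taking a 10M-document index for the sake of the arithmetic: starting from 1-document segments with a merge factor of 8, there are about log8(10,000,000) ≈ 8 levels of merging, so each document is written ~8 times. Starting from 1,000-document segments instead, only log8(10,000,000 / 1,000) ≈ 4.4 levels remain, hence a write amplification of 4 or so.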
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tiny segments induced by NRT will hit the page cache.
They have a very short life expectancy: they get merged rapidly, and get deleted.
Like the large segments I discussed, they DO induce an extra margin on the amount
of RAM required... But I don't see how this margin could be big.
Shouldn't 30s worth of segment files be sufficient?
The Linux kernel's page reclamation is not a true LRU, but LRU is a good
first approximation of it.
Assuming we access each page of our running index at least once every 30s,
what could be the reason for our kernel to evict a page of the live/hot index before
evicting a page of one of these short-lived segments that were deleted 30s ago?
So what are we exactly talking about here?
- Is the phenomenon you are describing about pages that are seldom accessed
inducing major page faults on some phrase queries?
- Is it about the necessary extra RAM margin being larger than what I think it is?
- Is it some weird side effect of the JVM not munmapping files? That used to be
a thing... Amplified because users would typically increase their heap size when experiencing it,
only making the problem worse.
Is it in combination with the kernel not being able to reclaim those pages, especially
considering they might still be dirty?
I think most JVMs support unmap nowadays, and Lucene uses those weird unsafe methods.
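For context, the "weird unsafe methods" refer to forcibly unmapping a MappedByteBuffer instead of waiting for it to be garbage collected. On JDK 9+ the hack looks roughly like the sketch below (simplified, not Lucene's actual code, which also handles older JVMs; the file path is a placeholder):

    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;
    import java.lang.reflect.Method;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class UnmapSketch {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("/path/to/segment/file", "r");
                 FileChannel channel = file.getChannel()) {
                MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // ... read from the buffer ...

                // There is no public API to unmap; without this, the mapping
                // stays alive until the buffer object is garbage collected.
                Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
                Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
                theUnsafe.setAccessible(true);
                Object unsafe = theUnsafe.get(null);
                Method invokeCleaner =
                    unsafeClass.getMethod("invokeCleaner", java.nio.ByteBuffer.class);
                invokeCleaner.invoke(unsafe, buffer);
            }
        }
    }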