First of all, I am not very familiar with Elasticsearch settings,
but fairly familiar with what stock Lucene does.
I don't have any experience with NRTDirectory...
Lucene simply writes immutable segment files.
The NRTDirectory does not sync, in order to minimize the cost of a commit.
The OS will eventually flush these pages to disk.
On the read side, the file is mapped into memory.
On first access, the OS will experience a page fault.
The page is in the page cache, so the OS does not need to read anything
from the disk: it will just map the virtual memory of the process to
the page frame in the page cache. This is not a horrible event.
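To make the read side concrete, here is a minimal sketch using Lucene's MMapDirectory (the index path, field and term are placeholders; this is just an illustration of the mmap read path, not what NRTDirectory does internally):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.MMapDirectory;

    public class MmapReadSketch {
        public static void main(String[] args) throws Exception {
            // Opening the directory/reader maps the segment files into the
            // process's virtual memory; nothing is read from disk at this point.
            try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
                 DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // The first time the query touches a given page of a segment file,
                // the process takes a page fault. If the page is still in the page
                // cache (freshly written, never evicted), the kernel only wires the
                // mapping: a minor fault, no disk read. If the page was evicted,
                // it is a major fault and a disk read.
                int hits = searcher.count(new TermQuery(new Term("body", "hello")));
                System.out.println(hits);
            }
        }
    }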
Now, in what kind of event does one end up accessing a page that the
kernel evicted from the page cache?
In the Indeed case, I talked about two cases:
1) There wasn't enough memory on the server to begin with.
That's a no-brainer. If you don't have enough memory, you will experience
page cache misses. Still, this never happened on the term dictionary (which was in
anonymous memory, I believe) nor on the posting lists. This only happened on positions...
and the pages evicted were on positions too.
2) Indeed had a process where they would take all segments, and merge everything.
The index would go from
- [seg1-A ] [ seg2-A ] [ seg3-A]..[seg 10-A] : ~40GB
to
- [seg1-B] : ~40GB
Instantly, as the `Searcher` would use the new data, the service would experience a
massive number of page faults.
Warming up the new segment is not a solution, as the real problem is that the
system requires twice the amount of RAM for ~30s when the switch to the single
big merged segment occurs.
A more reasonable merge policy puts a cap on the size of a segment.
For instance, if the merge policy aims at producing segments of up to 5GB, the
required extra margin is 5GB.
In our example above, a segment merge would replace only a subset of our segments:
- [seg1-A ] [ seg2-A ] [ seg3-A]..[seg 10-A]
- [seg1-A ] [ seg2-A ] [ seg3-A] .. [seg6-A][seg1-B]
That's something that Lucene enforces in its merge policy.
While having 1000 segments hurts performance,
merging all segments down to one segment is not really useful.
If an index is large, having one segment or a dozen segments gives about the same performance.
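For reference, with stock Lucene this cap is expressed through TieredMergePolicy; a rough sketch (the 5GB value mirrors the example above; the analyzer, path and tier setting are placeholders):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.store.MMapDirectory;

    public class MergePolicySketch {
        public static void main(String[] args) throws Exception {
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            // Never let a merge produce a segment bigger than ~5GB.
            mergePolicy.setMaxMergedSegmentMB(5 * 1024);
            // Keep roughly 8 similarly-sized segments per tier before merging them.
            mergePolicy.setSegmentsPerTier(8);

            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            config.setMergePolicy(mergePolicy);
            try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
                 IndexWriter writer = new IndexWriter(dir, config)) {
                // ... add documents here ...
            }
        }
    }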
As you noted, NRT induces write amplification.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*aside*
The phenomenon is a bit counterintuitive, because it is a problem that gets a little
bit smaller as indexing throughput increases.
If I soft commit every second, and I have one incoming document every second,
I will produce segments with 1 document, as you described.
With a merge factor of 8 (8 segments get merged together), I will end up with a write amplification of around 8.
If however I receive 1k documents per second, I will end up with a write amplification of 4 or so.
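Back-of-the-envelope, under the rough assumption that a document gets rewritten once per merge level, and taking a 10M-document index for the sake of the arithmetic: starting from 1-document segments with a merge factor of 8, there are about log8(10,000,000) ≈ 8 levels of merging, so each document is written ~8 times. Starting from 1,000-document segments instead, only log8(10,000,000 / 1,000) ≈ 4.4 levels remain, hence a write amplification of 4 or so.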
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tiny segments induced by NRT will hit the page cache.
They have a very short life expectancy: they get merged rapidly, and get deleted.
Like the large segments I discussed, they DO induce an extra margin on the amount
of RAM required... But I don't see how this margin could be big.
Shouldn't 30s worth of segment files be sufficient?
The Linux kernel's page reclamation is not a true LRU, but LRU is a good
first approximation of it.
Assuming we access each page of our running index at least once every 30s,
what could be the reason for our kernel to evict a page of the live/hot index before
evicting a page of one of these short-lived segments that were deleted 30s ago?
So what are we exactly talking about here?
- Is the phenomenon you are describing about pages that are seldom accessed
inducing major page faults on some phrase queries?
- Is it about the necessary extra RAM margin being larger than what I think it is?
- Is it some weird side effect of the JVM not munmapping files? That used to be
a thing... Amplified because users would typically increase their heap size when experiencing it,
only making the problem worse.
Is it in combination with the kernel not being able to reclaim those pages, especially
considering they might still be dirty?
I think most JVMs support unmap nowadays, and Lucene uses those weird unsafe methods.
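For context, the "weird unsafe methods" refer to forcibly unmapping a MappedByteBuffer instead of waiting for it to be garbage collected. On JDK 9+ the hack looks roughly like the sketch below (simplified, not Lucene's actual code, which also handles older JVMs; the file path is a placeholder):

    import java.io.RandomAccessFile;
    import java.lang.reflect.Field;
    import java.lang.reflect.Method;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class UnmapSketch {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("/path/to/segment/file", "r");
                 FileChannel channel = file.getChannel()) {
                MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                // ... read from the buffer ...

                // There is no public API to unmap; without this, the mapping
                // stays alive until the buffer object is garbage collected.
                Class<?> unsafeClass = Class.forName("sun.misc.Unsafe");
                Field theUnsafe = unsafeClass.getDeclaredField("theUnsafe");
                theUnsafe.setAccessible(true);
                Object unsafe = theUnsafe.get(null);
                Method invokeCleaner =
                    unsafeClass.getMethod("invokeCleaner", java.nio.ByteBuffer.class);
                invokeCleaner.invoke(unsafe, buffer);
            }
        }
    }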