more notes for oracle/opengrok#4317
Even with the code fixed to perform correct Path comparisons in both places (on the set of files from FileCollector as well as in IndexDatabase#processFileIncremental()), script number 3 (oracle/opengrok#4317 (comment)) still leads to cases with duplicate documents. This does not happen every time; it might be timing related.
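For illustration, "correct Path comparisons" here means ordering paths component by component rather than as raw strings. A minimal sketch of that idea (my assumption of what FILEPATH_COMPARATOR is meant to do, not necessarily its actual code):

```java
import java.nio.file.Path;
import java.util.Comparator;

public class PathOrderSketch {
    // Compare paths name-by-name so that ordering follows directory
    // boundaries, instead of comparing the raw strings (where the ordering
    // of '/' against other characters decides).
    static final Comparator<Path> FILEPATH_COMPARATOR = (p1, p2) -> {
        int n = Math.min(p1.getNameCount(), p2.getNameCount());
        for (int i = 0; i < n; i++) {
            int cmp = p1.getName(i).toString().compareTo(p2.getName(i).toString());
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(p1.getNameCount(), p2.getNameCount());
    };

    public static void main(String[] args) {
        // "curl/test" sorts before "curl-extra" component-wise, while a plain
        // String comparison would put it after ('/' > '-' in ASCII).
        System.out.println(FILEPATH_COMPARATOR.compare(
                Path.of("curl/test"), Path.of("curl-extra"))); // negative
        System.out.println("curl/test".compareTo("curl-extra")); // positive
    }
}
```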
The script goes through the solaris-userland Git repository using a particular sequence of changesets that was previously known to reproduce the duplicate documents (before the Path comparator fixes): 3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8
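The script itself lives in the linked comment; schematically it does something like the following (a hypothetical sketch: the repository location and indexer options are assumptions, and the per-step index check is omitted):

```java
import java.io.File;
import java.util.List;

// Hypothetical sketch of the reproduction loop: check out each changeset in
// order and reindex the same source/data roots. The source root path is made
// up; the data root is the one seen in the logs below.
public class ReproSketch {
    public static void main(String[] args) throws Exception {
        List<String> changesets = List.of(
                "3b039b16f", "170ebb43b", "80cc6ae18", "52c217090", "a609ca278",
                "69a8daead", "07e01a4e6", "154009177", "794af3182", "c073248f7",
                "823f2c28e", "4a5e3cb85", "341f9beb2", "653378bce", "4f8fe9ee8");
        File srcRoot = new File("/var/tmp/src.opengrok-issue-4317"); // assumed
        File repo = new File(srcRoot, "solaris-userland");
        String dataRoot = "/var/tmp/data.opengrok-issue-4317";       // from the logs
        for (String changeset : changesets) {
            run(repo, "git", "checkout", changeset);
            run(repo, "java", "-jar", "opengrok.jar",
                    "-s", srcRoot.getPath(), "-d", dataRoot, "-H", "-P");
        }
    }

    private static void run(File dir, String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).directory(dir).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("failed: " + String.join(" ", cmd));
        }
    }
}
```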
Changes to the code:
```diff
diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
index aea4f3e9f..f7cbea525 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
@@ -835,19 +835,25 @@ public class IndexDatabase {
         Statistics stat = new Statistics();
         LOGGER.log(Level.FINEST, "traversing the documents in {0} to collect uids of deleted documents",
                 indexDirectory);
+        StoredFields storedFields = reader.storedFields();
         for (int i = 0; i < reader.maxDoc(); i++) {
+            Document doc = storedFields.document(i, LIVE_CHECK_FIELDS); // use limited-field version
+            IndexableField field = doc.getField(QueryBuilder.U);
             if (!liveDocs.get(i)) {
-                StoredFields storedFields = reader.storedFields();
-                Document doc = storedFields.document(i, LIVE_CHECK_FIELDS); // use limited-field version
-                IndexableField field = doc.getField(QueryBuilder.U);
                 if (field != null) {
                     if (LOGGER.isLoggable(Level.FINEST)) {
                         String uidString = field.stringValue();
-                        LOGGER.log(Level.FINEST, "adding ''{0}'' at {1} to deleted uid set",
-                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+                        LOGGER.log(Level.FINEST, "adding ''{0}'' ({2}) at {1} to deleted uid set",
+                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
                     }
                     deletedUids.add(field.stringValue());
                 }
+            } else {
+                if (field != null) {
+                    String uidString = field.stringValue();
+                    LOGGER.log(Level.FINEST, "live doc: ''{0}'' ({2}) at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
+                }
             }
         }
         stat.report(LOGGER, Level.FINEST, String.format("found %s deleted documents in %s",
@@ -1675,17 +1681,27 @@ public class IndexDatabase {
     void processFileIncremental(IndexDownArgs args, File file, String path) throws IOException {
         final boolean fileExists = file.exists();
 
+        LOGGER.log(Level.FINEST, "processing ''{0}''", path);
+
         // Traverse terms until reaching document beyond path of given file.
         while (uidIter != null && uidIter.term() != null
                 && uidIter.term().compareTo(emptyBR) != 0
                 && FILEPATH_COMPARATOR.compare(
                         Path.of(Util.uid2url(uidIter.term().utf8ToString())), Path.of(path)) <= 0) {
+            String uidString = uidIter.term().utf8ToString();
+            LOGGER.log(Level.FINEST, "XXX: ''{0}'' at {1}",
+                    new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+
             if (deletedUids.contains(uidIter.term().utf8ToString())) {
                 logIgnoredUid(uidIter.term().utf8ToString());
                 BytesRef next = uidIter.next();
                 if (next == null) {
                     uidIter = null;
+                } else {
+                    uidString = uidIter.term().utf8ToString();
+                    LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
                 }
                 continue;
             }
@@ -1712,6 +1728,10 @@ public class IndexDatabase {
             BytesRef next = uidIter.next();
             if (next == null) {
                 uidIter = null;
+            } else {
+                uidString = uidIter.term().utf8ToString();
+                LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                        new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
             }
         }
```
I will be tracking changes for /solaris-userland/components/curl/test/results-64.master, a file that was known to appear in the list of duplicate documents. Going through the changesets backwards:
- the file was changed (the log contains `writing history entries` for the file)
- there was a deleted document for the file in the index and also a live document:
  ```
  FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473
  ```
Due to the same timestamp, the uid of these 2 documents was identical, i.e. both documents were represented by a single uid term; the traversal skipped that term as deleted (and with it the live document) and merely added a new document, leading to the duplicate paths found by the index check (newlines added):
```
index '/var/tmp/data.opengrok-issue-4317/index/solaris-userland' failed document check:
duplicate paths = {/solaris-userland/components/curl/test/results-64.master=2,
/solaris-userland/components/curl/test/results-32.master=2,
/solaris-userland/components/curl/curl.p5m=2,
/solaris-userland/components/curl/Makefile=2}
```
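For reference, the essence of this check can be reproduced with plain Lucene. A sketch (not OpenGrok's actual IndexCheck code) that counts the stored path field across live documents:

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiBits;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class DuplicatePathCheck {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get(args[0])))) {
            Bits liveDocs = MultiBits.getLiveDocs(reader); // null when no deletions
            StoredFields storedFields = reader.storedFields();
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) {
                    continue; // skip deleted documents
                }
                String path = storedFields.document(i).get("path"); // QueryBuilder.PATH
                if (path != null) {
                    counts.merge(path, 1, Integer::sum);
                }
            }
            counts.forEach((path, n) -> {
                if (n > 1) {
                    System.out.println(path + "=" + n);
                }
            });
        }
    }
}
```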
Looking at the indexer logs with the added debug prints in IndexDatabase#processFileIncremental():
```
19:07:13 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:07:13 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: ignoring deleted document for '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:07:13 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'
```
Then later on in the log, in the 2nd phase of indexing, the file was indeed added:
```
19:07:14 FINEST: '/solaris-userland/components/curl/test/results-64.master': chosen by suffix: XMLAnalyzerFactory
19:07:14 FINE: Add: '/solaris-userland/components/curl/test/results-64.master' (XMLAnalyzer)
```
which led to the duplicate document.
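To make the mechanism concrete: the uid encodes the path together with the file's last-modified timestamp, so two documents for the same path and the same timestamp carry one and the same uid. A small sketch, assuming the usual Util.path2uid(path, date) helper that builds the uid stored in the U field:

```java
import java.util.HashSet;
import java.util.Set;

import org.opengrok.indexer.web.Util;

public class UidClashSketch {
    public static void main(String[] args) {
        String path = "/solaris-userland/components/curl/test/results-64.master";
        String date = "20240131180432473"; // DateTools string, millisecond resolution

        // The deleted and the live document carry the same path and timestamp,
        // hence the very same uid value ...
        String deletedUid = Util.path2uid(path, date);
        String liveUid = Util.path2uid(path, date);

        Set<String> deletedUids = new HashSet<>();
        deletedUids.add(deletedUid);

        // ... so the uid term of the live document is found in the deleted
        // set, processFileIncremental() skips it, and the file is added again.
        System.out.println(deletedUids.contains(liveUid)); // true
    }
}
```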
In order to see why there was a deleted document and a live document with the same uid, it is necessary to look into the logs of the previous indexer run. Notably, the duplicate document check passed for that run.
- the file had a corresponding deleted document:
  ```
  FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
  ```
- and also a live doc:
  ```
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473
  ```
There was no change for the file in this run; I need to move further backwards.
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- the file was changed:
  ```
  FINEST: writing history entries to '/var/tmp/data.opengrok-issue-4317/historycache/solaris-userland/components/curl/test/results-64.master': [ceaa5b4086fdfbfdaf05a5fdce656329d1fc127a, 1797f715e4ec60309db858a7376bd31ff4831fda]
  ```
- there was a live doc:
  ```
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473
  ```
The file was then updated:
```
19:05:29 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:05:29 FINE: Remove: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINER: Removed: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:05:29 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'
```
This is where the deleted document came into existence. The new/added document presumably had the same timestamp; checking the indexer logs of the "previous" (next in time) indexer run confirms that this is indeed the case:
```
$ grep 'curl/test/results-64.master' /var/tmp/opengrok-issue-4317-52c217090.log
19:05:36 FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
19:05:36 FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,722) at 20240131180432473
...
$ ls -1tr opengrok-issue-4317-*.log | while read f; do echo "########### $f"; grep 20240131180432473 $f; done | less
```
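Side note: the 20240131180432473 strings are Lucene DateTools timestamps with millisecond resolution, so they can be decoded back to wall-clock time when correlating the runs, e.g.:

```java
import java.time.Instant;

import org.apache.lucene.document.DateTools;

public class Uid2Time {
    public static void main(String[] args) throws Exception {
        // the timestamp part of the uid, as printed by Util.uid2date()
        long millis = DateTools.stringToTime("20240131180432473");
        System.out.println(Instant.ofEpochMilli(millis)); // 2024-01-31T18:04:32.473Z
    }
}
```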