more notes for oracle/opengrok#4317
Even with the code fixed to perform correct Path comparisons in both places (on the set of files from FileCollector as well as in IndexDatabase#processFileIncremental()), script number 3 (oracle/opengrok#4317 (comment)) still leads to cases with duplicate documents. This does not happen every time; it might be timing related.
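For illustration, "correct Path comparisons" here means ordering paths component by component rather than as raw strings. A minimal sketch of that idea (my assumption of what FILEPATH_COMPARATOR is meant to do, not necessarily its actual code):

```java
import java.nio.file.Path;
import java.util.Comparator;

public class PathOrderSketch {
    // Compare paths name-by-name so that ordering follows directory
    // boundaries, instead of comparing the raw strings (where the ordering
    // of '/' against other characters decides).
    static final Comparator<Path> FILEPATH_COMPARATOR = (p1, p2) -> {
        int n = Math.min(p1.getNameCount(), p2.getNameCount());
        for (int i = 0; i < n; i++) {
            int cmp = p1.getName(i).toString().compareTo(p2.getName(i).toString());
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(p1.getNameCount(), p2.getNameCount());
    };

    public static void main(String[] args) {
        // "curl/test" sorts before "curl-extra" component-wise, while a plain
        // String comparison would put it after ('/' > '-' in ASCII).
        System.out.println(FILEPATH_COMPARATOR.compare(
                Path.of("curl/test"), Path.of("curl-extra"))); // negative
        System.out.println("curl/test".compareTo("curl-extra")); // positive
    }
}
```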
The script goes through the solaris-userland Git repository using a particular sequence of changesets that was previously known to reproduce the duplicate documents (before the Path comparator fixes): 3b039b16f 170ebb43b 80cc6ae18 52c217090 a609ca278 69a8daead 07e01a4e6 154009177 794af3182 c073248f7 823f2c28e 4a5e3cb85 341f9beb2 653378bce 4f8fe9ee8
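The script itself lives in the linked comment; schematically it does something like the following (a hypothetical sketch: the repository location and indexer options are assumptions, and the per-step index check is omitted):

```java
import java.io.File;
import java.util.List;

// Hypothetical sketch of the reproduction loop: check out each changeset in
// order and reindex the same source/data roots. The source root path is made
// up; the data root is the one seen in the logs below.
public class ReproSketch {
    public static void main(String[] args) throws Exception {
        List<String> changesets = List.of(
                "3b039b16f", "170ebb43b", "80cc6ae18", "52c217090", "a609ca278",
                "69a8daead", "07e01a4e6", "154009177", "794af3182", "c073248f7",
                "823f2c28e", "4a5e3cb85", "341f9beb2", "653378bce", "4f8fe9ee8");
        File srcRoot = new File("/var/tmp/src.opengrok-issue-4317"); // assumed
        File repo = new File(srcRoot, "solaris-userland");
        String dataRoot = "/var/tmp/data.opengrok-issue-4317";       // from the logs
        for (String changeset : changesets) {
            run(repo, "git", "checkout", changeset);
            run(repo, "java", "-jar", "opengrok.jar",
                    "-s", srcRoot.getPath(), "-d", dataRoot, "-H", "-P");
        }
    }

    private static void run(File dir, String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).directory(dir).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("failed: " + String.join(" ", cmd));
        }
    }
}
```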
Changes to the code:
```diff
diff --git a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
index aea4f3e9f..f7cbea525 100644
--- a/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
+++ b/opengrok-indexer/src/main/java/org/opengrok/indexer/index/IndexDatabase.java
@@ -835,19 +835,25 @@ public class IndexDatabase {
         Statistics stat = new Statistics();
         LOGGER.log(Level.FINEST, "traversing the documents in {0} to collect uids of deleted documents",
                 indexDirectory);
+        StoredFields storedFields = reader.storedFields();
         for (int i = 0; i < reader.maxDoc(); i++) {
+            Document doc = storedFields.document(i, LIVE_CHECK_FIELDS); // use limited-field version
+            IndexableField field = doc.getField(QueryBuilder.U);
             if (!liveDocs.get(i)) {
-                StoredFields storedFields = reader.storedFields();
-                Document doc = storedFields.document(i, LIVE_CHECK_FIELDS); // use limited-field version
-                IndexableField field = doc.getField(QueryBuilder.U);
                 if (field != null) {
                     if (LOGGER.isLoggable(Level.FINEST)) {
                         String uidString = field.stringValue();
-                        LOGGER.log(Level.FINEST, "adding ''{0}'' at {1} to deleted uid set",
-                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+                        LOGGER.log(Level.FINEST, "adding ''{0}'' ({2}) at {1} to deleted uid set",
+                                new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
                     }
                     deletedUids.add(field.stringValue());
                 }
+            } else {
+                if (field != null) {
+                    String uidString = field.stringValue();
+                    LOGGER.log(Level.FINEST, "live doc: ''{0}'' ({2}) at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString), i});
+                }
             }
         }
         stat.report(LOGGER, Level.FINEST, String.format("found %s deleted documents in %s",
@@ -1675,17 +1681,27 @@ public class IndexDatabase {
     void processFileIncremental(IndexDownArgs args, File file, String path) throws IOException {
         final boolean fileExists = file.exists();
 
+        LOGGER.log(Level.FINEST, "processing ''{0}''", path);
+
         // Traverse terms until reaching document beyond path of given file.
         while (uidIter != null && uidIter.term() != null
                 && uidIter.term().compareTo(emptyBR) != 0
                 && FILEPATH_COMPARATOR.compare(
                         Path.of(Util.uid2url(uidIter.term().utf8ToString())), Path.of(path)) <= 0) {
+            String uidString = uidIter.term().utf8ToString();
+            LOGGER.log(Level.FINEST, "XXX: ''{0}'' at {1}",
+                    new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
+
             if (deletedUids.contains(uidIter.term().utf8ToString())) {
                 logIgnoredUid(uidIter.term().utf8ToString());
                 BytesRef next = uidIter.next();
                 if (next == null) {
                     uidIter = null;
+                } else {
+                    uidString = uidIter.term().utf8ToString();
+                    LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                            new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
                 }
                 continue;
             }
@@ -1712,6 +1728,10 @@ public class IndexDatabase {
             BytesRef next = uidIter.next();
             if (next == null) {
                 uidIter = null;
+            } else {
+                uidString = uidIter.term().utf8ToString();
+                LOGGER.log(Level.FINEST, "XXX next: ''{0}'' at {1}",
+                        new Object[]{Util.uid2url(uidString), Util.uid2date(uidString)});
             }
         }
```
I will be tracking changes for /solaris-userland/components/curl/test/results-64.master, a file that was known to appear in the list of duplicate documents. Going through the changesets backwards:
- the file was changed (the log contains `writing history entries` for the file)
- there was a deleted document for the file in the index and also a live document:
  ```
  FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473
  ```
Due to the same timestamp, the uid of these 2 documents was identical, i.e. both documents were represented by a single uid term; the traversal skipped that term as deleted (and with it the live document) and merely added a new document, leading to the duplicate paths found by the index check (newlines added):
```
index '/var/tmp/data.opengrok-issue-4317/index/solaris-userland' failed document check:
duplicate paths = {/solaris-userland/components/curl/test/results-64.master=2,
/solaris-userland/components/curl/test/results-32.master=2,
/solaris-userland/components/curl/curl.p5m=2,
/solaris-userland/components/curl/Makefile=2}
```
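For reference, the essence of this check can be reproduced with plain Lucene. A sketch (not OpenGrok's actual IndexCheck code) that counts the stored path field across live documents:

```java
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiBits;
import org.apache.lucene.index.StoredFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;

public class DuplicatePathCheck {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(
                FSDirectory.open(Paths.get(args[0])))) {
            Bits liveDocs = MultiBits.getLiveDocs(reader); // null when no deletions
            StoredFields storedFields = reader.storedFields();
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (liveDocs != null && !liveDocs.get(i)) {
                    continue; // skip deleted documents
                }
                String path = storedFields.document(i).get("path"); // QueryBuilder.PATH
                if (path != null) {
                    counts.merge(path, 1, Integer::sum);
                }
            }
            counts.forEach((path, n) -> {
                if (n > 1) {
                    System.out.println(path + "=" + n);
                }
            });
        }
    }
}
```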
Looking at the indexer logs with the added debug prints in IndexDatabase#processFileIncremental():
```
19:07:13 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:07:13 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: ignoring deleted document for '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:07:13 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:07:13 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'
```
Then later on in the log, in the 2nd phase of indexing, the file was indeed added:
```
19:07:14 FINEST: '/solaris-userland/components/curl/test/results-64.master': chosen by suffix: XMLAnalyzerFactory
19:07:14 FINE: Add: '/solaris-userland/components/curl/test/results-64.master' (XMLAnalyzer)
```
which led to the duplicate document.
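To make the mechanism concrete: the uid encodes the path together with the file's last-modified timestamp, so two documents for the same path and the same timestamp carry one and the same uid. A small sketch, assuming the usual Util.path2uid(path, date) helper that builds the uid stored in the U field:

```java
import java.util.HashSet;
import java.util.Set;

import org.opengrok.indexer.web.Util;

public class UidClashSketch {
    public static void main(String[] args) {
        String path = "/solaris-userland/components/curl/test/results-64.master";
        String date = "20240131180432473"; // DateTools string, millisecond resolution

        // The deleted and the live document carry the same path and timestamp,
        // hence the very same uid value ...
        String deletedUid = Util.path2uid(path, date);
        String liveUid = Util.path2uid(path, date);

        Set<String> deletedUids = new HashSet<>();
        deletedUids.add(deletedUid);

        // ... so the uid term of the live document is found in the deleted
        // set, processFileIncremental() skips it, and the file is added again.
        System.out.println(deletedUids.contains(liveUid)); // true
    }
}
```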
In order to see why there was a deleted document and a live document with the same uid, it is necessary to look into the logs of the previous indexer run. Notably, the duplicate document check passed for that run.
- the file had a corresponding deleted document:
  ```
  FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
  ```
- and also a live doc:
  ```
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,531) at 20240131180432473
  ```
There was no change for the file in this run; I need to move further backwards.
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- ditto
- the file was changed:
  ```
  FINEST: writing history entries to '/var/tmp/data.opengrok-issue-4317/historycache/solaris-userland/components/curl/test/results-64.master': [ceaa5b4086fdfbfdaf05a5fdce656329d1fc127a, 1797f715e4ec60309db858a7376bd31ff4831fda]
  ```
- there was a live doc:
  ```
  FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473
  ```
The file was then updated:
```
19:05:29 FINEST: processing '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX: '/solaris-userland/components/curl/test/results-64.master' at 20240131180432473
19:05:29 FINE: Remove: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINER: Removed: '/solaris-userland/components/curl/test/results-64.master'
19:05:29 FINEST: XXX next: '/solaris-userland/components/cvs/Makefile' at 20240131180432473
19:05:29 FINEST: processing '/solaris-userland/components/desktop/firefox/Makefile'
```
This is where the deleted document came into existence. The new/added document presumably had the same timestamp; checking the indexer logs of the "previous" (next in time) indexer run confirms that this is indeed the case:
```
$ grep 'curl/test/results-64.master' /var/tmp/opengrok-issue-4317-52c217090.log
19:05:36 FINEST: adding '/solaris-userland/components/curl/test/results-64.master' (5,971) at 20240131180432473 to deleted uid set
19:05:36 FINEST: live doc: '/solaris-userland/components/curl/test/results-64.master' (9,722) at 20240131180432473
...
$ ls -1tr opengrok-issue-4317-*.log | while read f; do echo "########### $f"; grep 20240131180432473 $f; done | less
```
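Side note: the 20240131180432473 strings are Lucene DateTools timestamps with millisecond resolution, so they can be decoded back to wall-clock time when correlating the runs, e.g.:

```java
import java.time.Instant;

import org.apache.lucene.document.DateTools;

public class Uid2Time {
    public static void main(String[] args) throws Exception {
        // the timestamp part of the uid, as printed by Util.uid2date()
        long millis = DateTools.stringToTime("20240131180432473");
        System.out.println(Instant.ofEpochMilli(millis)); // 2024-01-31T18:04:32.473Z
    }
}
```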