What this does:
- Insert rows into a datasource from Postgres
- Calculate duplicated rows with a pipe
- Write those to a secondary datasource
- Delete the duplicates in the primary datasource that are also present in the secondary datasource
- Truncate the secondary datasource (the full sequence is sketched in SQL below)
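A minimal sketch of that sequence in plain ClickHouse SQL, assuming hypothetical `events` (primary) and `events_dupes` (secondary) datasources, and assuming duplicates arrive as stale versions of a row with strictly older `updated_at` values. In Tinybird, the duplicate-detection SELECT would live in a pipe:

```sql
-- Hypothetical schemas for the primary and secondary datasources.
CREATE TABLE events
(
    id         UInt64,
    updated_at DateTime,
    payload    String
)
ENGINE = MergeTree
ORDER BY id;

CREATE TABLE events_dupes
(
    id              UInt64,
    keep_updated_at DateTime
)
ENGINE = MergeTree
ORDER BY id;

-- 1. For every id that appears more than once, record the latest
--    version (the row we want to keep) in the secondary datasource.
INSERT INTO events_dupes
SELECT id, max(updated_at) AS keep_updated_at
FROM events
GROUP BY id
HAVING count() > 1;

-- 2. Delete from the primary datasource every copy of a duplicated id
--    except the version recorded in the secondary datasource.
--    Note: ALTER ... DELETE is an asynchronous mutation in ClickHouse.
ALTER TABLE events DELETE
WHERE id IN (SELECT id FROM events_dupes)
  AND (id, updated_at) NOT IN (SELECT id, keep_updated_at FROM events_dupes);

-- 3. Empty the secondary datasource, ready for the next run.
TRUNCATE TABLE events_dupes;
```

Since the delete runs as a mutation, you may want to wait for it to finish (for example with `SETTINGS mutations_sync = 1`) before truncating the secondary datasource.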
With this approach, if you run the delete scripts right after each insert operation, you'll have duplicates only for a couple of seconds.
With ReplacingMergeTree you'll always have some duplicated data, because merges happen in the background at unpredictable times (often only once a day or so). If you need no duplicates at all, you'll need either a materialized view that writes to an AggregatingMergeTree using argMaxState(...) functions, or an approach like this one. With this approach queries stay simpler, since you don't need argMaxMerge functions and so on. Other approaches, such as ReplacingMergeTree plus a FINAL modifier on every query, won't work with big data, as FINAL forces the full datasource to be merged in RAM at query time.
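For contrast, here is a minimal sketch of that AggregatingMergeTree alternative (table and column names are again hypothetical). Note how every read has to collapse the partial states with argMaxMerge, which is exactly the query complexity the delete-based approach avoids:

```sql
-- Target table holding one argMax state per id.
CREATE TABLE events_latest
(
    id      UInt64,
    payload AggregateFunction(argMax, String, DateTime)
)
ENGINE = AggregatingMergeTree
ORDER BY id;

-- Materialized view that tracks the latest payload per id as rows land.
CREATE MATERIALIZED VIEW events_latest_mv TO events_latest AS
SELECT id, argMaxState(payload, updated_at) AS payload
FROM events
GROUP BY id;

-- Reads must merge the partially aggregated states with argMaxMerge.
SELECT id, argMaxMerge(payload) AS latest_payload
FROM events_latest
GROUP BY id;
```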