Created August 11, 2022 07:56
Scratch notes + code for a csvdiff tool
"""
How to compare 2 csv files such that I can identify:
* Added/Modified/Removed rows, and for modified rows, specifically which headers changed
There are two ways to look at diffs:
1. Line-by-line diffs
   Line-by-line diffs are dumb and can't tell what a "Modified" item is: a changed row just shows up as one removed line plus one added line.
2. Diffs with primary keys
   If we analyze the two csv files and take into account which headers they include, we can identify the "Modified" set based
   on primary keys, which could be a single header or the joining of multiple headers.
High level logic (csv1, csv2 where csv2 is the newer one):
1. Have 2 arrays of dicts (one dict per row) and assume the headers match. A primary key is created for each row by simply joining the values of the key headers together.
2. Process both csv files and fill them into the data structure from #1.
3. Iterate through csv1 and check for items that don't exist in csv2. These represent "Deletions".
4. Iterate through csv2 and check for items that were not in csv1. These represent "Additions".
5. During #4, we can also check for primary key matches and, for each match, check which headers changed. These represent "Modifications".
Questions:
1. What is the processing time? Does this work for up to a million rows for csv1 and csv2?
   O(n * m), where n is the number of rows and m is the number of headers, assuming both csv files are relatively similar in size
   and rows are looked up by primary key in a dict (scanning an array for every lookup would make it O(n^2) instead).
2. How to optimize?
"""
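The high level logic in the notes above could be sketched roughly as follows. This is a minimal sketch and not a finished csvdiff tool: the function names `load_rows` and `diff_rows`, the `key_headers` parameter, and the `|` separator used to join key values are all assumptions made for illustration. It also swaps the "arrays of dicts" for dicts keyed by primary key, so that each lookup is O(1) on average.

```python
import csv
import io


def load_rows(csv_file, key_headers, sep="|"):
    """Read an open csv file into a dict mapping primary key -> row dict.

    The primary key is the key header values joined with `sep` (assumed
    not to occur inside the key values themselves). If two rows share a
    key, the later row silently replaces the earlier one.
    """
    rows = {}
    for row in csv.DictReader(csv_file):
        key = sep.join(row[h] for h in key_headers)
        rows[key] = row
    return rows


def diff_rows(old_rows, new_rows):
    """Return (additions, deletions, modifications) between two keyed dicts.

    modifications maps primary key -> {header: (old_value, new_value)}.
    """
    # Step 3: keys in csv1 that are missing from csv2 are deletions.
    deletions = [key for key in old_rows if key not in new_rows]

    additions = []
    modifications = {}
    # Steps 4 and 5: one pass over csv2 finds additions and modified headers.
    for key, new_row in new_rows.items():
        old_row = old_rows.get(key)
        if old_row is None:
            additions.append(key)
            continue
        changed = {
            h: (old_row.get(h), new_row[h])
            for h in new_row
            if old_row.get(h) != new_row[h]
        }
        if changed:
            modifications[key] = changed
    return additions, deletions, modifications


# Tiny in-memory example; real use would pass files opened with newline="".
old_csv = io.StringIO("id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n3,Cy,Lima\n")
new_csv = io.StringIO("id,name,city\n1,Ann,Oslo\n2,Bob,Paris\n4,Di,Kyiv\n")

old_rows = load_rows(old_csv, ["id"])
new_rows = load_rows(new_csv, ["id"])
additions, deletions, modifications = diff_rows(old_rows, new_rows)

print(additions)      # ['4']
print(deletions)      # ['3']
print(modifications)  # {'2': {'city': ('Rome', 'Paris')}}
```

Each file is read once and each row is compared header by header at most once, which matches the O(n * m) estimate in the notes. Both files are held fully in memory, so whether a million rows fits depends on row width and available RAM.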