deeTEEcee · August 11, 2022 07:56
diff --git a/csv_diff.py b/csv_diff.py
 """
 How to compare 2 csv files such that I can identify:
 * Added/Modified/Removed Rows but know specifically which header changed

 There are two ways to look at diffs:
 1. Line-by-line diffs

 Line-by-line diffs are dumb and can't tell what a "Modified" item is.

 2. Diffs with primary keys.

 If we analyze the two csv files and take into account the headers they include, we can identify the "Modified" set based
 on primary keys, which could be a single header or several headers joined together. 

 High level logic (csv1, csv2 where csv2 is the newer one):
 1. Have 2 arrays of dicts and we assume the headers match. A primary key will be created where we simply join multiple header values together.
 2. Process both csv files and fill them into the data structure for #1
 3. Iterate through csv1 and find items that don't exist in csv2. These represent "Deletions" 
 4. Iterate through csv2 and find items that were not in csv1. These represent "Additions"
 5. During #4, when a primary key exists in both files, compare the value under each header to see which ones changed. These represent "Modified" rows. 

 Questions:
 1. What is the processing time? Does this work for up to a million rows in csv1 and csv2?
 O(n * m), where n is the number of rows and m is the number of headers. This assumes both csv files are relatively similar.

 2. How to optimize? 
 """
	"""
	How to compare 2 csv files such that I can identify:
	* Added/Modified/Removed Rows but know specifically which header changed

	There are two ways to look at diffs:
	1. Line-by-line diffs

	Line-by-line diffs are dumb and can't tell what a "Modified" item is.

	2. Diffs with primary keys.

	If we analyze the two csv files and take into account the headers they include, we can identify the "Modified" set based
	on primary keys, which could be a single header or several headers joined together.

	High level logic (csv1, csv2 where csv2 is the newer one):
	1. Have 2 arrays of dicts and we assume the headers match. A primary key will be created where we simply join multiple header values together.
	2. Process both csv files and fill them into the data structure for #1
	3. Iterate through csv1 and find items that don't exist in csv2. These represent "Deletions"
	4. Iterate through csv2 and find items that were not in csv1. These represent "Additions"
	5. During #4, when a primary key exists in both files, compare the value under each header to see which ones changed. These represent "Modified" rows.

	Questions:
	1. What is the processing time? Does this work for up to a million rows in csv1 and csv2?
	O(n * m), where n is the number of rows and m is the number of headers. This assumes both csv files are relatively similar.

	2. How to optimize?
	"""
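The high-level logic in the docstring could be sketched roughly as below. This is only an illustration, not the contents of csv_diff.py: the names `load_rows` and `diff_csv` and the sample data are hypothetical. It assumes both files share the same headers and that the primary key is unique per file. It also makes two deliberate swaps from the docstring. Rows are stored in a dict keyed by primary key rather than in an array of dicts, so each lookup is a hash lookup and the whole pass stays around O(n * m). The key is a tuple of header values rather than a joined string, so that values like ("a", "bc") and ("ab", "c") do not collide.

```python
import csv
import io


def load_rows(text, key_headers):
    """Index CSV rows by a primary key built from key_headers (hypothetical helper)."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        # Tuple key instead of a joined string, to avoid accidental collisions.
        key = tuple(row[h] for h in key_headers)
        rows[key] = row  # assumes the primary key is unique within a file
    return rows


def diff_csv(old_text, new_text, key_headers):
    """Return (added, removed, modified), where csv2/new_text is the newer file."""
    old = load_rows(old_text, key_headers)
    new = load_rows(new_text, key_headers)

    # Step 3: keys only in the old file are deletions.
    removed = [key for key in old if key not in new]
    # Step 4: keys only in the new file are additions.
    added = [key for key in new if key not in old]

    # Step 5: for keys present in both, record which headers changed.
    modified = {}
    for key, new_row in new.items():
        old_row = old.get(key)
        if old_row is None:
            continue
        changed = {
            header: (old_row.get(header), value)
            for header, value in new_row.items()
            if old_row.get(header) != value
        }
        if changed:
            modified[key] = changed
    return added, removed, modified


# Made-up sample data, with "id" as the primary key.
OLD = "id,name,price\n1,apple,1.00\n2,banana,0.50\n3,cherry,3.00\n"
NEW = "id,name,price\n1,apple,1.20\n3,cherry,3.00\n4,date,2.50\n"

added, removed, modified = diff_csv(OLD, NEW, ["id"])
print("added:", added)
print("removed:", removed)
print("modified:", modified)
```

The sketch reads CSV text from strings to stay self-contained. For real files, `csv.DictReader` can be given a file opened with `newline=""` instead. At a million rows per file, both dicts are held in memory at once, which is probably the first thing to revisit when optimizing.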