How to dedup files of MSISDNs

This assumes we're starting with a CSV file containing the MSISDNs and other fields. If you have a file that contains only MSISDNs, with one MSISDN per line, skip to Step 4 and use your file's name in place of msisdns-nohdr.txt.

Step 1: Look at the file

In a terminal type:

head myfile.csv

This shows the first ten lines of the file.

Note which column contains the MSISDN (for the rest of this we'll assume it's column 3).
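
If the file has many columns, counting them by eye is error-prone. A minimal sketch, assuming the first line is a header row and that no header name contains a comma. The file sample.csv and its column names are made up for illustration:

```shell
# Create a small sample file (hypothetical data, for illustration only).
printf 'name,city,msisdn\nAlice,Cape Town,27821234567\n' > sample.csv

# Print each header name on its own line, prefixed with its column number.
header_columns=$(head -n 1 sample.csv | tr ',' '\n' | cat -n)
echo "$header_columns"
```

The number printed next to the MSISDN column is the value to pass to -f in Step 2.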

Step 2: Extract just the column with the MSISDN

Type:

cut -d , -f 3 myfile.csv > msisdns.txt

The -d , specifies that fields are separated by commas. The -f 3 selects column 3 (if your MSISDNs are in a different column, use that column's number instead).

Check that the result looks good by typing:

head msisdns.txt
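
Note that cut treats every comma as a separator, including commas inside quoted fields, so a line like "Smith, J",Durban,... would yield the wrong column. A rough check, sketched here against a hypothetical sample.csv, is to count how many comma-separated fields each line has. If more than one field count shows up, some lines probably contain embedded commas and need a closer look:

```shell
# Sample with one line containing a quoted comma (hypothetical data).
printf 'name,city,msisdn\n"Smith, J",Durban,27831112222\nAlice,Cape Town,27821234567\n' > sample.csv

# For each distinct number of fields, show how many lines have that many.
field_counts=$(awk -F , '{ print NF }' sample.csv | sort -n | uniq -c)
echo "$field_counts"
```

Here two lines have 3 fields and one line has 4, which flags the quoted comma.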

Step 3: Remove the header line

The CSV file likely has a header row with column names. Remove that using:

tail -n +2 msisdns.txt > msisdns-nohdr.txt

The -n +2 tells tail to print from line 2 onwards, which skips the header.

Step 4: Sort and keep uniques

Type:

sort -u msisdns-nohdr.txt > msisdns-unique.txt

The -u tells sort to output only one copy of each line, so you now have a file of unique MSISDNs.
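
Steps 2 to 4 can also be chained into a single pipeline, which avoids the intermediate files. This is only a sketch, assuming a header row and MSISDNs in column 3. The names sample.csv and sample-unique.txt are placeholders:

```shell
# Sample file with a header row and one duplicated MSISDN (hypothetical data).
printf 'name,city,msisdn\nAlice,Cape Town,27821234567\nBob,Durban,27831112222\nAlice,Cape Town,27821234567\n' > sample.csv

# Drop the header, keep column 3, then sort and remove duplicates.
tail -n +2 sample.csv | cut -d , -f 3 | sort -u > sample-unique.txt
cat sample-unique.txt
```

The step-by-step version is easier to check along the way; the pipeline is handy once you trust the column number.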

Step 5: Count the MSISDNs

Type:

wc msisdns-unique.txt

This outputs three numbers: the number of lines (i.e. unique MSISDNs), the number of words (which should be the same) and the number of bytes (which we can ignore), followed by the file name.
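
If you only want the count of MSISDNs, wc -l prints just the number of lines. A small sketch, using a hypothetical sample-unique.txt:

```shell
# A small file of unique MSISDNs (hypothetical data).
printf '27821234567\n27831112222\n27845556666\n' > sample-unique.txt

# -l prints only the line count; reading via < keeps the file name out of the output.
unique_count=$(wc -l < sample-unique.txt)
echo "$unique_count"
```

Some versions of wc pad the number with leading spaces, which is harmless when reading it by eye.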

Extras

Combine files

Given two files of unique MSISDNs, do the following:

cat msisdns-unique-1.txt msisdns-unique-2.txt | sort -u > msisdns-unique-combined.txt

That will concatenate the two files into one, sort the result keeping only unique entries, and output the result to msisdns-unique-combined.txt.
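
As a variation, sort accepts several input files directly, so the cat is optional. A sketch with made-up file names and data:

```shell
# Two files of unique MSISDNs that share one entry (hypothetical data).
printf '27821234567\n27831112222\n' > sample-unique-1.txt
printf '27831112222\n27845556666\n' > sample-unique-2.txt

# sort reads every file named on the command line and -u drops the repeats.
sort -u sample-unique-1.txt sample-unique-2.txt > sample-unique-combined.txt
cat sample-unique-combined.txt
```

Both forms should give the same result; the shared MSISDN appears only once in the combined file.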
