@hodgestar
Created April 13, 2015 07:26
How to dedup files of MSISDNs

How I dedup files of MSISDNs

This assumes we're starting with a CSV file containing the MSISDNs and other fields. If you have a file that contains only MSISDNs, with one MSISDN per line, skip to Step 4 and use your file in place of msisdns-nohdr.txt.

Step 1: Look at the file

In a terminal type:

head myfile.csv

This shows the first ten lines of the file.

Note which column contains the MSISDN (for the rest of this we'll assume it's column 3).

Step 2: Extract just the column with the MSISDN

Type:

cut -d , -f 3 myfile.csv > msisdns.txt

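The -d , specifies that fields are separated by commas. The -f 3 selects column 3 (if your MSISDNs are in a different column, use that column's number instead). Note that cut splits on every comma, so this only works if no field before the MSISDN column contains a comma inside quotes.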

Check that the result looks good by typing:

head msisdns.txt
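
As a small illustration of Steps 1 and 2, here is a sketch that builds a tiny CSV and extracts its third column. The file names sample.csv and sample-msisdns.txt, the column names and the phone numbers are all made up for the example.

```shell
# Build a tiny sample CSV (hypothetical data, for illustration only).
printf 'name,surname,msisdn\nAlice,Smith,+27831234567\nBob,Jones,+27839876543\n' > sample.csv

# Extract column 3, using a comma as the field separator.
cut -d , -f 3 sample.csv > sample-msisdns.txt

# Show the result: the header "msisdn" followed by the two numbers.
cat sample-msisdns.txt
```

The header line is still present at this point, which is why the next step removes it.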

Step 3: Remove the header line

The CSV file likely has a header row with column names. Remove that using:

tail -n +2 msisdns.txt > msisdns-nohdr.txt
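
The +2 tells tail to start printing at line 2, which drops exactly one header line. If you don't need the intermediate file, Steps 2 and 3 can probably be combined with a pipe. A minimal sketch, again using a made-up sample.csv with invented data:

```shell
# Sample input (hypothetical data).
printf 'name,surname,msisdn\nAlice,Smith,+27831234567\nBob,Jones,+27839876543\n' > sample.csv

# Extract column 3, then drop the first (header) line, without an intermediate file.
cut -d , -f 3 sample.csv | tail -n +2 > sample-nohdr.txt

# Only the two numbers remain.
cat sample-nohdr.txt
```

If your file has no header row, skip the tail step; otherwise you would lose the first MSISDN.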

Step 4: Sort and remove duplicates

Type:

sort -u msisdns-nohdr.txt > msisdns-unique.txt

You now have a file of unique MSISDNs.
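
If you also want to see which MSISDNs were repeated, one option is uniq -d, which prints one copy of each line that occurs more than once. This is an optional extra rather than part of the steps above, and the sample file names and numbers below are invented. Keep in mind that sort -u compares whole lines exactly, so +27831234567 and 27831234567 count as different entries, as do lines that differ only in trailing spaces.

```shell
# Sample list with one repeated number (hypothetical data).
printf '+27831234567\n+27839876543\n+27831234567\n' > sample-nohdr.txt

# sort -u keeps a single copy of each distinct line.
sort -u sample-nohdr.txt > sample-unique.txt

# Optional: list the values that appeared more than once.
# uniq only compares adjacent lines, so the input must be sorted first.
sort sample-nohdr.txt | uniq -d > sample-dups.txt

cat sample-unique.txt
cat sample-dups.txt
```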

Step 5: Count the MSISDNs

Type:

wc msisdns-unique.txt

This outputs three numbers: the number of lines (i.e. unique MSISDNs), the number of words (which should be the same, since each line holds one MSISDN) and the number of bytes (which we can ignore).
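
If you only want the line count, wc -l prints just that. A small sketch with a made-up two-line file:

```shell
# Sample file of unique MSISDNs (hypothetical data).
printf '+27831234567\n+27839876543\n' > sample-unique.txt

# -l prints only the line count; reading from stdin keeps the file name out of the output.
wc -l < sample-unique.txt
```

Some versions of wc pad the number with leading spaces, which is harmless when reading it by eye.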

Extras

Combine files

Given two files of unique MSISDNs, do the following:

cat msisdns-unique-1.txt msisdns-unique-2.txt | sort -u > msisdns-unique-combined.txt

That will concatenate the two files into one, sort the result keeping only unique entries, and output the result to msisdns-unique-combined.txt.
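
Here is a worked sketch of the combine step, using two made-up input files that share one number. It also shows an equivalent form: sort accepts several input files directly, so the cat is optional.

```shell
# Two sample files of unique MSISDNs with one number in common (hypothetical data).
printf '+27831234567\n+27839876543\n' > sample-unique-1.txt
printf '+27839876543\n+27835550000\n' > sample-unique-2.txt

# The form used above: concatenate, then sort and keep unique lines.
cat sample-unique-1.txt sample-unique-2.txt | sort -u > sample-combined.txt

# Equivalent: pass both files straight to sort.
sort -u sample-unique-1.txt sample-unique-2.txt > sample-combined-alt.txt

# Both outputs should be identical; cmp prints nothing when they match.
cmp sample-combined.txt sample-combined-alt.txt

# The shared number appears only once in the combined file.
cat sample-combined.txt
```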
