I have been meaning to note down my checklist of *nix commands which are very handy for basic operations on data. I will update this post as and when I remember or come across something that fits here. These commands have been tested specifically on macOS.
Uniques
uniq - This is the unix command for filtering repeated lines, used primarily to remove duplicates from a file. It only compares adjacent lines, so the file has to be sorted first for uniq to catch every duplicate.
Consider file test which contains the following
$ cat test
aa
bb
bb
cc
cc
cc
Remove duplicates
$ uniq test
aa
bb
cc
Count occurrences of each item
$ uniq -c test
1 aa
2 bb
3 cc
Print only duplicate items in file
$ uniq -d test
bb
cc
Print only unique lines
$ uniq -u test
aa
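Since uniq only looks at adjacent lines, here is a small sketch of what happens with an unsorted file and how sorting first fixes it. The scratch directory and the file name unsorted are made up for this illustration.

```shell
# Hypothetical sample data in a scratch directory
workdir=$(mktemp -d)
printf 'bb\naa\nbb\ncc\naa\nbb\n' > "$workdir/unsorted"

# No two identical lines are adjacent here, so uniq alone removes nothing
uniq "$workdir/unsorted"

# Sort first so that identical lines end up next to each other
sort "$workdir/unsorted" | uniq

# Frequency table with the most frequent item first
sort "$workdir/unsorted" | uniq -c | sort -rn
```

The last pipeline is the one I reach for most often when I want a quick count of how often each value appears.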
Now consider that test contains
$ cat test
aa
bb
cc
AA
cC
Remove duplicates case-insensitively. This file is not sorted, so it has to be sorted before running uniq. The -i flag makes the comparison case-insensitive.
$ sort test | uniq -i
AA
bb
cC
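One caveat worth sketching: a plain sort can leave case variants apart (for example Bb and bb), and then uniq -i will not see them as adjacent. Sorting with -f (fold case) keeps them together. The sample file mixed below is made up, and LC_ALL=C is only there to make the sort order predictable.

```shell
# Hypothetical sample file with case variants
workdir=$(mktemp -d)
printf 'bb\nBb\naa\nAA\ncc\n' > "$workdir/mixed"

# Byte-order sort puts all upper case lines first, so the variants are not adjacent
LC_ALL=C sort "$workdir/mixed" | uniq -i

# Folding case while sorting puts aa next to AA and bb next to Bb
LC_ALL=C sort -f "$workdir/mixed" | uniq -i
```

Which spelling of each duplicate survives depends on the sort order, so I would not rely on that part.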
Sort a fixed width file by a field which begins from 10th byte and ends at 20th
sort -k1.10,1.20 file | head -10
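A sketch of the fixed-width sort on made-up data. By default sort splits fields on blanks, which can shift the character positions when the records are padded with spaces. One workaround, assuming the data contains no tabs, is to set the field separator to a tab so the whole line counts as field 1. The file fixed.txt and its layout (id in columns 1-9, name in columns 10-19) are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical fixed-width records: id in columns 1-9, name in columns 10-19
printf '%-9s%-10s%s\n' ID0001 charlie x ID0002 alpha y ID0003 bravo z > "$workdir/fixed.txt"

# Use a tab as separator (assumed absent from the data) so the line is one field
tab=$(printf '\t')

# Sort on characters 10 to 19 of the line, i.e. the name column
LC_ALL=C sort -t "$tab" -k1.10,1.19 "$workdir/fixed.txt"
```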
Case conversion
Convert all upper case in fileA to lower case and output as fileB
$ tr '[:upper:]' '[:lower:]' < fileA.txt > fileB.txt
Use tr to replace a character in a file. For example, convert all carriage returns (shown as ^M in vi) to newline chars
$ tr '\r' '\n' < input.csv > output.csv
Delete all CR and LF chars from a file
$ tr -d '\r\n' < inpfile.txt > outfile.txt
Remove extra spaces in a file (squeeze each run of spaces into a single space)
$ tr -s " " < file.txt > fileout.txt
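A common variant of the above is converting Windows (CRLF) line endings to unix ones by deleting only the carriage returns. A minimal sketch with made-up file names dos.txt and unix.txt:

```shell
workdir=$(mktemp -d)

# Sample file with Windows (CRLF) line endings
printf 'one\r\ntwo\r\n' > "$workdir/dos.txt"

# Delete the carriage returns and keep the newlines
tr -d '\r' < "$workdir/dos.txt" > "$workdir/unix.txt"

# od -c prints the raw characters, so you can confirm that no \r is left
od -c "$workdir/unix.txt"
```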
File comparison
comm expects both input files to be sorted.
Compare two files and keep strings present in fileA but not in fileB
$ comm -23 fileA fileB
Compare two files and keep strings present in fileB but not in fileA
$ comm -13 fileA fileB
Compare two files and keep only strings which are present in both files (-12 suppresses column 1, lines only in fileA, and column 2, lines only in fileB)
$ comm -12 fileA fileB
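To see the three comm columns in action, here is a sketch on two small made-up files. comm needs sorted input, so the files are sorted first; LC_ALL=C just keeps sort and comm agreeing on the order.

```shell
workdir=$(mktemp -d)
printf 'cc\naa\nbb\n' > "$workdir/fileA"
printf 'dd\nbb\ncc\n' > "$workdir/fileB"

# comm needs sorted input
LC_ALL=C sort "$workdir/fileA" > "$workdir/fileA.sorted"
LC_ALL=C sort "$workdir/fileB" > "$workdir/fileB.sorted"

# Column 1 = only in fileA, column 2 = only in fileB, column 3 = in both.
# The digits after the dash name the columns to suppress.
only_in_a=$(LC_ALL=C comm -23 "$workdir/fileA.sorted" "$workdir/fileB.sorted")
only_in_b=$(LC_ALL=C comm -13 "$workdir/fileA.sorted" "$workdir/fileB.sorted")
in_both=$(LC_ALL=C comm -12 "$workdir/fileA.sorted" "$workdir/fileB.sorted")

echo "only in fileA: $only_in_a"
echo "only in fileB: $only_in_b"
echo "in both files:"
echo "$in_both"
```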
Sed
The primary purpose of sed is string or pattern replacement.
Consider the following file as input
$ cat file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
Replacing or substituting a string
$ sed 's/unix/linux/' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.
By default, the sed command replaces the first occurrence of the pattern in each line and it won't replace the second, third...occurrence in the line. Here the "s" specifies the substitution operation. The "/" are delimiters. The "unix" is the search pattern and the "linux" is the replacement string. If you miss a delimiter then the expression errors out as below
$ sed 's/unix/linux' file.txt
sed: 1: "s/unix/linux": unterminated substitute in regular expression
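The delimiter does not have to be "/". sed uses whatever character follows the s, which is handy when the pattern itself contains slashes, such as a path. A small sketch; the path is made up:

```shell
# Using | as the delimiter avoids escaping every / in the path
new_path=$(echo '/usr/local/bin/tool' | sed 's|/usr/local|/opt|')
echo "$new_path"   # /opt/bin/tool
```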
Replacing the nth occurrence of a pattern in a line. Add a number (1, 2 etc.) as a flag after the last delimiter to replace the first, second etc. occurrence of a pattern in a line. The command below replaces the second occurrence of the word "unix" with "linux" in each line.
$ sed 's/unix/linux/2' file.txt
unix is great os. linux is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
Here is the first occurrence, which is the default option
$ sed 's/unix/linux/1' file.txt
linux is great os. unix is opensource. unix is free os.
learn operating system.
linuxlinux which one you choose.
And the third occurrence
$ sed 's/unix/linux/3' file.txt
unix is great os. unix is opensource. linux is free os.
learn operating system.
unixlinux which one you choose.
To replace all occurrences use the 'g' flag (global replacement)
$ sed 's/unix/linux/g' file.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linuxlinux which one you choose.
To make the search case insensitive, the sed that ships with macOS may not have a flag (older versions do not), but you can use a plain regex to achieve it. For example, modify file.txt as below
$ vi file.txt
unix is great os. Unix is opensource. unix is free os.
learn operating system.
Unixlinux which one you choose.
sed 's/[Uu]nix/linux/g' file.txt
linux is great os. linux is opensource. linux is free os.
learn operating system.
linuxlinux which one you choose.
To find a string in all the files contained in a directory, you can use grep or find.
grep -lr searchStr mydir
grep --recursive --ignore-case --files-with-matches "searchStr" mydir
find mydir -type f | xargs grep -l searchStr
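The find | xargs form splits on whitespace, so it can trip over file names with spaces. A sketch of the null-separated variant, which is the one I would use on directories I do not control. The directory and file names are made up.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/my dir"
echo 'has searchStr here' > "$workdir/my dir/with space.txt"
echo 'nothing to see' > "$workdir/plain.txt"

# -print0 and -0 separate file names with a NUL byte instead of whitespace
find "$workdir" -type f -print0 | xargs -0 grep -l searchStr
```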
To find/replace multiple strings use the -e flag.
sed -e 's/unix/linux/g' -e 's/Unix/Linux/g' file.txt
linux is great os. Linux is opensource. linux is free os.
learn operating system.
Linuxlinux which one you choose.
To replace a string only where it begins a line, anchor the pattern with ^ in the sed regex
sed 's/^learn/learn to use/g' file.txt
unix is great os. Unix is opensource. unix is free os.
learn to use operating system.
Unixlinux which one you choose.
To remove whitespace characters at the end of each field in a pipe-delimited file (that is, spaces and tabs just before a |)
sed 's/[[:blank:]]*|/|/g' file.txt
To check whether your file has trailing whitespace or tab characters, turn on list mode in vi (tabs show up as ^I and line ends as $)
vi file.txt
:set list
Unix command to remove BOM (Byte Order Mark) characters from your file. Open the file in binary mode using the -b flag to verify whether it has a BOM, and then remove it
vi -b file.txt
:set nobomb
:wq
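If you would rather not open vi, here is a non-interactive sketch for a UTF-8 BOM (the three bytes EF BB BF). It assumes the BOM, if present, sits at the very start of the first line. The file names bom.txt and nobom.txt are made up.

```shell
workdir=$(mktemp -d)

# The UTF-8 BOM written with octal escapes (EF BB BF)
bom=$(printf '\357\273\277')

# Sample file that starts with a BOM
printf '\357\273\277hello\n' > "$workdir/bom.txt"

# Check: the first three bytes print as ef bb bf when a BOM is present
head -c 3 "$workdir/bom.txt" | od -An -tx1

# Strip the BOM from the first line only. LC_ALL=C makes sed work on raw bytes.
LC_ALL=C sed "1s/^$bom//" "$workdir/bom.txt" > "$workdir/nobom.txt"
```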
Use the -i flag to overwrite the existing file and create a backup of the original file. For example, to remove all white spaces in a file:
sed -i .bak 's/ //g' file.txt
cat file.txt
unixisgreatos.Unixisopensource.unixisfreeos.
learnoperatingsystem.
Unixlinuxwhichoneyouchoose.
This will create a backup file called file.txt.bak with the original file contents and overwrite file.txt with the spaces removed.
To remove only the trailing spaces in a line, use a space followed by *$. The * character means "any number of the previous character" and $ refers to the end of the line.
sed -i .bak 's/ *$//g' file.txt
Verify the trailing whitespaces are removed by :set list
vi file.txt
:set list
unix is great os. Unix is opensource. unix is free os.$
learn operating system.$
Unixlinux which one you choose.$
To remove whitespaces between xml tags only.
sed -i .bak -e 's/> *</></g' file.xml
To replace a blank line with something else. You can match a blank line by specifying an end-of-line immediately after a beginning-of-line, i.e. with ^$. For example, edit file.txt so that it has an empty line after the first line
vi file.txt
unix is great os. Unix is opensource. unix is free os.

learn operating system.
Unixlinux which one you choose.
sed 's/^$/this used to be a blank line/' file.txt
unix is great os. Unix is opensource. unix is free os.
this used to be a blank line
learn operating system.
Unixlinux which one you choose.
To remove tabs at the end of a line. Ex: Add a tab to the end of first line, so :set list will show ^I
vi file.txt
unix is great os. Unix is opensource. unix is free os.^I$
learn operating system.$
Unixlinux which one you choose.$
To type a literal tab in your sed command, press ctrl + v and then ctrl + i. The tab is shown as <tab> below
sed -i.bak 's/<tab>*$//' file.txt
vi file.txt
:set list
unix is great os. Unix is opensource. unix is free os.$
learn operating system.$
Unixlinux which one you choose.$
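If typing a literal tab is awkward, the POSIX character class [[:blank:]] matches both spaces and tabs, so one expression covers both cases. A sketch on made-up input piped through sed rather than edited in place:

```shell
# Three sample lines ending in tabs, spaces, and a mix of both
cleaned=$(printf 'tabs at end\t\t\nspaces at end   \nmixed \t \n' | sed 's/[[:blank:]]*$//')

# od -c makes any leftover whitespace visible
printf '%s\n' "$cleaned" | od -c
```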
Consider file test which contains the following
$ cat test
(firstname).aa
(firstname).bb
(firstname).bb
(firstname).cc
(firstname).CC
(lastname).hh
(lastname).jj
(lastname).ll
To extract the content after firstname
sed -En 's/.*firstname\)\.([A-Za-z]+).*/\1/p' test
aa
bb
bb
cc
CC
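To extract everything before some content (the first form only prints lines that contain somecontent, the second prints every line)
sed -En 's/(.*)somecontent.*/\1/p' file.txt > output.file
or
sed 's/somecontent.*//' file.txt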
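The two forms differ when the marker appears more than once in a line: a leading .* is greedy and runs up to the last occurrence, while the second form cuts at the first one. A small sketch with a made-up marker STOP:

```shell
line='keep STOP drop STOP tail'

# Greedy (.*) keeps everything before the LAST marker
before_last=$(echo "$line" | sed -E 's/(.*)STOP.*/\1/')

# Deleting from the marker onwards keeps everything before the FIRST marker
before_first=$(echo "$line" | sed 's/STOP.*//')

echo "[$before_last]"    # [keep STOP drop ]
echo "[$before_first]"   # [keep ]
```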
To split by separator '_' and take the first part
awk -F '_' '{print $1}' file.txt
To add a comma after every word (space separated) in a file
sed -i.bak 's/ /, /g' file.txt
To add a comma at the end of every line in a text file
sed -i'.bak' 's/$/,/g' file.txt
To remove the trailing comma from each line of a file
sed -i.bak 's/,$//' file.txt
To remove all double quotes in a file
sed -i'.bak' 's/\"//g' file.txt
To remove all single quotes in a file
sed -i'.bak' "s/'//g" file.txt
To remove everything after first comma in lines of file
awk -F ',' '{print $1}' file.txt > file_temp.txt && mv file_temp.txt file.txt
or with sed
sed -i.bak 's/,.*$//' file.txt && rm file.txt.bak
To extract everything between first and second comma in a file
awk -F ',' '{print $2}' file.txt
To add a character at beginning of every line in a file
sed -i.bak 's/^/prefix/' file.txt
To add quotes around the first word of every line. Here , is the delimiter between words, $1 selects the first word, sub is the substitute function, and & in the replacement stands for the matched text (the first word). See here for more details https://superuser.com/questions/664125/unix-surround-first-column-of-csv-with-double-quotes
awk -F, '{sub($1, "\"&\""); print}' file.txt
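One thing to watch: sub treats $1 as a regular expression, so a first field containing characters like + or . may not match itself literally. A sketch of an alternative that assigns to $1 directly instead; OFS is set so awk rebuilds the line with commas. The sample rows are made up.

```shell
# Assigning to $1 avoids interpreting the field as a regex
quoted=$(printf 'a+b,one\nc.d,two\n' | awk -F, -v OFS=, '{ $1 = "\"" $1 "\""; print }')
echo "$quoted"
```

With the sub version, the row starting with a+b would come out unquoted, because the regex a+b does not match the literal text a+b.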
To copy the records containing a string 'FOO' in a large file and add them back with 'FOO' replaced by 'BAR'. Example:
cat fileA.txt
aaaa
bbb
ccccFOO
ddddFOO
First create a version of the file with BAR records and then merge the two files keeping unique records.
sed -i.bak 's/FOO/BAR/g' fileA.txt
This edits fileA.txt in place and keeps the original contents in fileA.txt.bak
cat fileA.txt
aaaa
bbb
ccccBAR
ddddBAR
To verify that the correct number of records exists and has been copied, you can use the following commands
grep -c 'FOO' fileA.txt.bak
grep -c 'BAR' fileA.txt
Also to get the num lines of each file
wc -l fileA.txt
wc -l fileA.txt.bak
Now merge the two files keeping only unique records.
sort -u fileA.txt fileA.txt.bak > fileA.txt_o && mv fileA.txt_o fileA.txt
Now fileA.txt should have everything. You can use the grep -c and wc -l to verify this file.
cat fileA.txt
aaaa
bbb
ccccBAR
ccccFOO
ddddBAR
ddddFOO
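The same result can be had without editing the original in place, which I find a bit safer on a large file. A sketch that streams the original plus a substituted copy through sort -u; merged.txt is a made-up output name and LC_ALL=C only fixes the sort order.

```shell
workdir=$(mktemp -d)
printf 'aaaa\nbbb\nccccFOO\nddddFOO\n' > "$workdir/fileA.txt"

# Original lines plus a FOO->BAR copy; sort -u drops the lines that did not change
{ cat "$workdir/fileA.txt"; sed 's/FOO/BAR/g' "$workdir/fileA.txt"; } \
  | LC_ALL=C sort -u > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```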
Search Strings
Total occurrences of searchStr in current directory
grep -ro searchStr . | wc -l | xargs echo "Total matches :"
Total number of files where searchStr occurs in current directory
grep -lor searchStr . | wc -l | xargs echo "Total matches :"
To get an exact word match use the -w flag.
grep -lwr searchStr mydir
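Note that grep -c counts matching lines, not matches, which is why the total above uses -o with wc -l. A quick sketch of the difference on a made-up file:

```shell
workdir=$(mktemp -d)
printf 'foo foo\nbar\nfoo\n' > "$workdir/sample.txt"

# -c counts lines that contain at least one match
matching_lines=$(grep -c foo "$workdir/sample.txt")

# -o prints every match on its own line, so wc -l counts occurrences
total_matches=$(grep -o foo "$workdir/sample.txt" | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $total_matches"   # lines: 2, occurrences: 3
```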
Recursively replace the string original with replacement in all files under the directory mydir on macOS (excludes hidden files and folders)
find mydir \( ! -regex '.*/\..*' \) -type f -exec sed -i '' 's/original/replacement/g' {} \;
OR
find mydir \( ! -regex '.*/\..*' \) -type f -exec sed -i '' 's/original/replacement/g' {} +
The regex excludes all hidden files and folders, which is particularly important if you want to avoid messing up your .DS_Store or .git files unknowingly. If you use zsh then the following would also work
sed -i '' 's/original/replacement/g' **/*(D.)
This isn't excluding hidden files though. The **/*(D.) glob is basically the zsh way of saying recursively go through all subdirectories and match all regular files, including hidden ones.
Delete all files of a certain type under current directory
find . -name "*.pyc" -exec rm -f {} \;
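Replace a string with another string in all .sh files under current directory (on macOS, sed -i needs a backup extension argument; pass '' for no backup)
find . -name '*.sh' -exec sed -i '' 's/foo/bar/g' {} \;
or, for all files under a directory
find <path-to-directory> -type f -print0 | xargs -0 sed -i '' 's/foo/bar/g'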
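The -i flag behaves differently in the macOS sed and GNU sed. As far as I know, attaching the extension directly (-i.bak, no space) is accepted by both, so a sketch of a variant that should work on either is to take the backup and delete it afterwards. The directory layout below is made up.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/scripts"
echo 'echo foo' > "$workdir/scripts/a.sh"
echo 'echo foo' > "$workdir/scripts/b.txt"

# Preview which files would be touched
find "$workdir" -type f -name '*.sh' -exec grep -l foo {} +

# Edit in place with a .bak backup, then remove the backups
find "$workdir" -type f -name '*.sh' -exec sed -i.bak 's/foo/bar/g' {} +
find "$workdir" -type f -name '*.sh.bak' -delete
```

Keeping the backups around until you have checked the result is the more cautious option; the delete step is only there to tidy up.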
Remove everything after first space in line. (Or extract first word from line)
awk '{ print $1 }' < input > output
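To see line numbers in vi
:set number
To copy a range of lines from a file into another file, print the range with sed and quit on the line after it so the rest of the file is not read
sed -n '105830,106694p;106695q' logfile > output
starting line number: 105830
ending line number: 106694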
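A sketch of the same line-range extraction with the bounds held in variables, so the quit line does not have to be worked out by hand. The file names and the range 10 to 12 are made up for the example.

```shell
workdir=$(mktemp -d)

# Hypothetical 100-line file where each line holds its own line number
seq 1 100 > "$workdir/logfile"

start=10
end=12

# Print lines start..end, then quit on the next line instead of reading to the end
sed -n "${start},${end}p;$((end + 1))q" "$workdir/logfile" > "$workdir/output"

cat "$workdir/output"
```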