@crazyhottommy
Last active October 3, 2016 18:06
A dummy example for testing

cat DATA.tsv 
ID	head1	head2	head3	head4
1	25.5	1364.0	22.5	13.2
2	10.1	215.56	1.15	22.2

cat LIST.TXT 
ID
head1
head4

I need to extract the columns ID, head1 and head4 from DATA.tsv.

## the column numbers to be extracted
## -x -F: match whole header names literally, so head1 does not also pick up head10

head -n 1 DATA.tsv | tr '\t' '\n' | grep -nxFf LIST.TXT | cut -d: -f1
1
2
5
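Why whole-line, literal matching matters here: by default grep treats every line of LIST.TXT as a regular expression that may match anywhere inside a header name. A minimal sketch of the pitfall, assuming a hypothetical header that also contains a `head10` column (not part of the dummy data above):

```shell
# Hypothetical inputs: a header with an extra head10 column and the same list.
list=$(mktemp)
printf 'ID\nhead1\nhead4\n' > "$list"
header=$(printf 'ID\thead1\thead10\thead4')

# Default matching: head1 also matches head10, so column 3 sneaks in.
loose=$(printf '%s\n' "$header" | tr '\t' '\n' | grep -nf "$list" | cut -d: -f1 | paste -sd, -)

# -x (whole line) and -F (fixed string): only exact header names match.
strict=$(printf '%s\n' "$header" | tr '\t' '\n' | grep -nxFf "$list" | cut -d: -f1 | paste -sd, -)

echo "loose:  $loose"    # loose:  1,2,3,4
echo "strict: $strict"   # strict: 1,2,4
rm -f "$list"
```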

## save to a variable, formatted as 1,2,5 for the cut command

cols=$(head -n 1 DATA.tsv | tr '\t' '\n' | grep -nxFf LIST.TXT | cut -d: -f1 | paste -sd, -)
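Names in LIST.TXT that do not occur in the header are silently dropped by the pipeline above, so it can be worth checking for them before running cut on a large file. A small sketch, assuming a scratch directory and a hypothetical list that asks for a non-existent `head9` column:

```shell
# Scratch copies of the dummy files; head9 is deliberately not in the header.
workdir=$(mktemp -d)
printf 'ID\thead1\thead2\thead3\thead4\n1\t25.5\t1364.0\t22.5\t13.2\n' > "$workdir/DATA.tsv"
printf 'ID\nhead1\nhead9\n' > "$workdir/LIST.TXT"

# One header name per line.
head -n 1 "$workdir/DATA.tsv" | tr '\t' '\n' > "$workdir/header.txt"

# Lines of LIST.TXT that are not an exact header name (empty if all were found).
# "|| true" keeps the assignment from failing when grep prints nothing.
missing=$(grep -vxFf "$workdir/header.txt" "$workdir/LIST.TXT" || true)

if [ -n "$missing" ]; then
    echo "not found in header: $missing" >&2
fi
```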

## cut out the columns (cut always prints them in file order, whatever the order in LIST.TXT)

cut -f "${cols}" DATA.tsv 
ID	head1	head4
1	25.5	13.2
2	10.1	22.2
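If the columns should come out in the order given in LIST.TXT rather than in file order, a single awk pass is one alternative to cut. This is only a sketch: `extract_by_name` is a hypothetical helper name, it assumes a non-empty list file and unique header names, and I have not benchmarked it against cut on the big file.

```shell
# Recreate the dummy files from above in a scratch directory.
workdir=$(mktemp -d)
printf 'ID\thead1\thead2\thead3\thead4\n1\t25.5\t1364.0\t22.5\t13.2\n2\t10.1\t215.56\t1.15\t22.2\n' > "$workdir/DATA.tsv"
printf 'ID\nhead1\nhead4\n' > "$workdir/LIST.TXT"

# extract_by_name LIST TSV  (hypothetical helper)
# Prints the columns named in LIST, in LIST order, for every line of TSV.
extract_by_name() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR { want[++n] = $0; next }   # first file: wanted names, in order
        FNR == 1 {                           # header line: map name -> column number
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++)
                if (want[j] in pos) idx[++m] = pos[want[j]]
        }
        {                                    # every line of the TSV, header included
            line = ""
            for (k = 1; k <= m; k++)
                line = (k == 1 ? "" : line OFS) $(idx[k])
            print line
        }
    ' "$1" "$2"
}

extract_by_name "$workdir/LIST.TXT" "$workdir/DATA.tsv"
```

On the dummy data this prints the same three columns as the cut command; with a list of `head4` then `ID`, the output starts with the head4 column.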

Benchmark on my 26G file:

time cut -f "${cols}" myfile.tsv > mysubset.txt

real    32m10.947s
user    31m42.511s
sys     0m26.686s


## memory usage is very low! (RES for cut is only 4808 KiB in the top output below)
top -M

top - 17:03:17 up 86 days,  4:43, 56 users,  load average: 13.99, 13.72, 13.05
Tasks: 754 total,   2 running, 742 sleeping,   5 stopped,   5 zombie
Cpu(s): 13.8%us,  5.2%sy,  0.0%ni, 80.3%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Mem:    31.354G total, 6535.461M used,   24.971G free,  274.668M buffers
Swap:   32.000G total, 2132.094M used,   29.918G free, 1367.434M cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18042 mtang1    20   0  102m 4808  604 R 100.0  0.0   5:41.71 cut