@iracooke
Last active January 23, 2020 00:02
Split a fasta file

Split a Fasta file

This method relies on bioawk. First make sure you have bioawk installed, then download the file split_fasta.awk from this repository. The instructions below assume this file is in your working directory.

Installing bioawk (instructions specific to the JCU HPC)

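  1. Make a bin directory if you haven't already
cd ~
mkdir -p bin
  2. Put this directory on your path (if you haven't already). The single quotes keep ${PATH} unexpanded until login, so the current PATH is not hard-coded into the profile
echo 'export PATH="${PATH}:${HOME}/bin"' >> ~/.bash_profile
  3. Clone bioawk
git clone https://github.com/lh3/bioawk.git
  4. Build bioawk and copy to ~/bin
cd bioawk
make
cp bioawk maketab ~/bin/
  5. Clean up (-f avoids prompts for the write-protected files inside .git)
cd ~
rm -rf bioawk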
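
Once the steps above are done, it can be worth confirming that the shell actually finds bioawk before moving on. The following is a minimal sketch; the helper name have_tool is made up for this example and is not part of bioawk.

```shell
# Hypothetical helper: succeeds if the named command is reachable through PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool bioawk; then
    echo "bioawk found at $(command -v bioawk)"
else
    # ~/.bash_profile is only read by new login shells
    echo "bioawk not found; open a new login shell or run: source ~/.bash_profile"
fi
```

If bioawk is reported as missing right after installation, the PATH change has most likely not been loaded into the current session yet.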

Usage

To split a file with default parameters (1000 records per chunk, output files prefixed with chunk_)

cat input.fasta | bioawk -c fastx -f split_fasta.awk

To customise the prefix

cat input.fasta | bioawk -c fastx -v prefix="mycustom_" -f split_fasta.awk

To customise the number of records per chunk

cat input.fasta | bioawk -c fastx -v prefix="mycustom_" -v nrec=5000 -f split_fasta.awk
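
A quick sanity check after splitting is to compare the number of records in the input with the total across all chunks. The sketch below assumes the example file name input.fasta and the default chunk_ prefix; count_records is a helper name invented for this illustration.

```shell
# Hypothetical helper: count FASTA records (lines starting with ">")
# across one or more files.
count_records() {
    cat "$@" | grep -c '^>'
}

# input.fasta and the chunk_ prefix are the example names used above;
# the two counts should be equal after a successful split.
if [ -f input.fasta ] && ls chunk_*.fa >/dev/null 2>&1; then
    echo "input:  $(count_records input.fasta)"
    echo "chunks: $(count_records chunk_*.fa)"
fi
```

If a custom prefix was used, replace chunk_ in the glob accordingly.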
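
split_fasta.awk

# split_fasta.awk: write every nrec records to a separate FASTA file.
# Run with: bioawk -c fastx -f split_fasta.awk
BEGIN {
    # Defaults apply when -v prefix=... or -v nrec=... is not given
    if (prefix == "") {
        prefix = "chunk_"
    }
    if (nrec == "") {
        nrec = 1000
    }
}
{
    # Start a new chunk every nrec records. Each file is named after the
    # zero-based index of its first record (chunk_0.fa, chunk_1000.fa, ...)
    if ((NR - 1) % nrec == 0) {
        if (file != "") {
            close(file)  # avoid running out of open file handles
        }
        file = sprintf("%s%d.fa", prefix, NR - 1)
    }
    # >> appends, so delete old chunk files before re-running
    printf(">%s\t%s\n%s\n", $name, $comment, $seq) >> file
}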
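
The naming rule in the script can be previewed without running a split. The sketch below uses plain awk (not bioawk) to list the file names that would be produced for a given record count. The function name chunk_names and its arguments are hypothetical and only mirror the NR, nrec and prefix variables of the script.

```shell
# Hypothetical helper: list the chunk files expected for a given record count.
# Arguments: total number of records, records per chunk, optional prefix.
chunk_names() {
    awk -v total="$1" -v nrec="$2" -v prefix="${3:-chunk_}" 'BEGIN {
        # One file per block of nrec records, named after the
        # zero-based index of the first record in the block
        for (i = 0; i < total; i += nrec) {
            printf("%s%d.fa\n", prefix, i)
        }
    }'
}

chunk_names 2500 1000
# prints:
# chunk_0.fa
# chunk_1000.fa
# chunk_2000.fa
```

Note that the number in each file name is a record offset, not a chunk counter, so the last file may hold fewer than nrec records.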