Skip to content

Instantly share code, notes, and snippets.

@darencard
Last active July 19, 2024 08:44
Show Gist options
  • Save darencard/e1933e544c9c96234e86d8cbccc709e0 to your computer and use it in GitHub Desktop.
Save darencard/e1933e544c9c96234e86d8cbccc709e0 to your computer and use it in GitHub Desktop.
Quick one-liner commands that are useful for genomics

Useful Genomics Oneliners

The following commands sometimes require non-standard software like bioawk and seqtk.

rename scaffold headers with sequential numbers and lengths ("scaffold-N ")

bioawk -c fastx '{ print ">scaffold-" ++i" "length($seq)"\n"$seq }' < genome.fasta > new_genome.fasta

make association table of old and renamed scaffold names after above renaming command

paste <(grep ">" old_genome.fasta) <(grep ">" new_genome.fasta) | sed 's/>//g'

wrap fasta sequence lines to desired length (usually 80bp)

seqtk seq -l 80 seqs.fasta > wrapped_seqs.fasta

calculate sequence N50 length

seqtk comp genome.fasta | cut -f 2 | sort -rn | \
awk '{ sum += $0; print $0, sum }' | tac | \
awk 'NR==1 { halftot=$2/2 } lastsize>halftot && $2<halftot { print $1 } { lastsize=$2 }'

keep lines in a file if a certain field repeats at least N times. for example: if you wanted to calculate scaffold-wide stats only from scaffolds with 10 or greater samples, you'd use this to filter away lines from scaffolds with less than 10 samples. Just vary each $1 to the desired column number and change the 10 in the term if (c[i] >= 10) to desired theshold.

cat file.txt | awk 'BEGIN { FS="\t" } { c[$1]++; l[$1,c[$1]]=$0 } END { for (i in c) { if (c[i] >= 10) for (j = 1; j <= c[i]; j++) print l[i,j] } }'

remove FASTA sequence lines from a GFF file (sometimes they are included)

sed '/^##FASTA$/,$d' <genome.fasta> > <genome.noseq.fasta>

rename FASTA headers using a 2-column lookup table output format: >New|Old lookup table: tab-delimited 2-column text file with Old ID in first column and New ID in second

awk 'FNR==NR { a[">"$1]=$2; next } $1 in a { sub(/>/,">"a[$1]"|",$1)}1' lookup.txt seqs.fasta

run a blast and automatically create GFF format output must output blast in format 6

blast -db <db> -query <query> -outfmt 6 | \
  awk -v OFS="\t" '{ if ($10 > $9) print $2, "tblastn", "match", $9, $10, $12, "+", ".", "ID="$1; \
    else print $2, "tblastn", "match", $10, $9, $12, "-", ".", "ID="$1 }'

average of a set of sequential column per row adjust 'X' in two places below if not all columns are needed

awk -v OFS="\t" '{ sum = 0; for(i=X; i<=NF; i++) sum += $i; print $0, sum, sum / (NF-(X-1)) }'

produce a lookup table of protein symbols from a FASTA sequence this relies on bioawk for parsing FASTA and jq for parsing json the command basically queries mygene using NCBI accession IDs and parses the output can then use this lookup table to rename the FASTA sequence with seqkit

bioawk jq seqkit

# create lookup table
while read id; \
do paste \
<(echo ${id}) \
<(wget -qO- "http://mygene.info/v3/query?q=${id}&species=all&fields=name,symbol,taxid" | jq '.hits[0] | "\(._id)_\(.symbol)"' | sed 's/"//g'); \
done < \
<(bioawk -c fastx '{ print $name }' GCF_000090745.1_AnoCar2.0_protein.faa)

# create renamed fasta
bioawk -c fastx '{ print ">"$name; print $seq }' GCF_000090745.1_AnoCar2.0_protein.faa | \
seqkit replace -p '^(.+)$' -r "\${1}_{kv}" -k GCF_000090745.1_AnoCar2.0_protein_lookup.txt \
> GCF_000090745.1_AnoCar2.0_protein_rename.faa
@kaydadaemon
Copy link

This is amazing. Thank you.

Wondering how "awk 'FNR==NR { a[">"$1]=$2; next } $1 in a { sub(/>/,">"a[$1]"|",$1)}1' lookup.txt seqs.fasta" will be written if Old ID in second column and New ID in first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment