Antonio Fernandez-Guerra (genomewalker)
#include <igraph/igraph.h>
#include <math.h>

/* qsort()-style comparator for doubles, ascending order. */
int compare (const void * a, const void * b)
{
  if (*(double*)a > *(double*)b) return 1;
  else if (*(double*)a < *(double*)b) return -1;
  else return 0;
}

Exploring PFAM similarities using consensus sequences

We want to know how the different clusters created by MMseqs2 aggregate at the domain-architecture level.

  1. Get all domain architectures
  2. Get clusters that belong to the domain architecture
  3. Get consensus sequences
  4. Calculate SSN per domain architecture

We need the following files:

Create FFINDEX files for each class

Get clustering file

We need the MMseqs2 clustering in FASTA format, with the data and index files:

  • marine_hmp_db_03112017_clu_fa
  • marine_hmp_db_03112017_clu_fa.index
  • marine_hmp_db_03112017_clu_rep.tsv
  • marine_hmp_db_03112017_clu_result2rep{.index}
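As a sketch of how the `*_clu_rep.tsv` file is typically used: MMseqs2's `createtsv` output maps representative to member, one tab-separated pair per line. A toy file stands in here for `marine_hmp_db_03112017_clu_rep.tsv`.

```shell
# Toy stand-in: <representative> TAB <member>, one pair per line.
printf 'rep_A\trep_A\nrep_A\tseq_7\nrep_B\trep_B\n' > clu_rep.tsv

# List all members of one cluster by its representative:
awk -F'\t' -v rep=rep_A '$1 == rep { print $2 }' clu_rep.tsv > rep_A.members
cat rep_A.members
```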

Genomic unknowns in RefSeq genomes

We will screen the RefSeq genomes for genomic unknowns, using the non-redundant set of genomes identified by MASH + t-SNE + PAM clustering.

Preparing RefSeq data

cd /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
mkdir genomes
rsync -Pauvz  /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/proteomes/downloads/ genomes/

Extract GU hits in refseq and plot genomic neighbourhood

Create needed folders

mkdir ft_files gb_files gb_files_filt gb_files_slice gb_files_ptt gb_files_geno

Get the assembly file from NCBI
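A minimal sketch of fetching and filtering the assembly summary. The download URL is NCBI's standard RefSeq path; a two-line toy file with the same tab layout (column 12 = `assembly_level`, column 20 = `ftp_path`) stands in so the filter can run offline.

```shell
# Download NCBI's assembly summary (uncomment to fetch):
# curl -sO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

# Toy stand-in with the same 20-column tab layout:
{
  printf 'GCF_000001.1\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\tComplete Genome\tx\tx\tx\tx\tx\tx\tx\tftp://x/GCF_000001.1\n'
  printf 'GCF_000002.1\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\tContig\tx\tx\tx\tx\tx\tx\tx\tftp://x/GCF_000002.1\n'
} > assembly_summary_refseq.txt

# Keep only complete genomes; print accession and FTP directory:
awk -F'\t' '!/^#/ && $12 == "Complete Genome" { print $1 "\t" $20 }' \
  assembly_summary_refseq.txt > complete_genomes.tsv
cat complete_genomes.tsv
```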

genomewalker / nog_id.sh (last active February 13, 2018 13:59)
Create TARA subsampled (at nt level) read files
#!/bin/bash
# 1. We read a folder and take all fastq files
# 2. We quality trim and remove short sequences
# 3. Run kaiju and generate output
#. "${MODULESHOME}"/init/bash
set -x
set -e
main(){

Exploration of the ubiquitous EUs

This is not related to the niche breadth analysis; we simply use the EUs that are found in every sample.

Exporting the clusters belonging to the components found in all samples

super_cl <- read_tsv("/Users/ufo/Downloads/all_cluster_components.tsv", col_names = TRUE, trim_ws = TRUE) %>%
  filter(component %in% clstrs_comp_eu_ubi) %>%
  select(clstr_name) %>%
# Detach all non-base packages from the search path.
detachAllPackages <- function() {
  basic.packages <- c("package:stats", "package:graphics", "package:grDevices",
                      "package:utils", "package:datasets", "package:methods",
                      "package:base")
  package.list <- search()[grepl("^package:", search())]
  package.list <- setdiff(package.list, basic.packages)
  for (package in package.list) detach(package, character.only = TRUE)
}

18S magicblast

| sample | dada2_input | filtered | denoised | merged | table | no_chimeras | perc_reads_survived |
|--------|-------------|----------|----------|--------|-------|-------------|---------------------|
| ERR1018543 | 6525 | 4002 | 3727 | 3629 | 3629 | 3076 | 47.1 |
| ERR1018546 | 2421 | 1513 | 1276 | 1229 | 1229 | 1095 | 45.2 |

Total counts: 4,171
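The survival percentages and the total can be recomputed from the table rows above (perc_reads_survived = no_chimeras / dada2_input * 100; the total is the sum of the no_chimeras column):

```shell
# Recompute perc_reads_survived and the total from the two samples above.
printf 'ERR1018543 6525 4002 3727 3629 3629 3076\nERR1018546 2421 1513 1276 1229 1229 1095\n' |
  awk '{ printf "%s %.1f\n", $1, 100 * $7 / $2   # no_chimeras / dada2_input
         total += $7 }
       END { print "total reads kept:", total }' > survival_check.txt
cat survival_check.txt
```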

|Kingdom | n|

# Let’s get the non-merged reads ------------------------------------------
# Summarise match/mismatch/indel stats for read pairs that failed to merge
purrr::map_df(merged, tidyr::extract, col = "sequence", into = "sequence") %>%
  filter(accept == FALSE) %>%
  select(nmatch, nmismatch, nindel) %>%
  skimr::skim()

# Keep non-overlapping pairs by concatenating instead of merging
concat <- mergePairs(dada_forward, derep_forward, dada_reverse, derep_reverse, justConcatenate = TRUE)

get_nonmerged <- function(X, merg = merg, conc = conc, fn = fn) {
  m <- merg[[X]]