Antonio Fernandez-Guerra (genomewalker)
#include <igraph/igraph.h>
#include <math.h>

/* qsort()-style comparator for doubles, ascending order. */
int compare (const void * a, const void * b)
{
  if (*(double*)a > *(double*)b) return 1;
  else if (*(double*)a < *(double*)b) return -1;
  else return 0;
}

Exploring PFAM similarities using consensus sequences

We want to know how the different clusters created by MMseqs2 aggregate at the domain-architecture level.

  1. Get all domain architectures
  2. Get clusters that belong to the domain architecture
  3. Get consensus sequences
  4. Calculate SSN per domain architecture

We need the following files:

Create FFINDEX files for each class

Get clustering file

We need the MMseqs2 clustering in FASTA format, with the data and index files:

  • marine_hmp_db_03112017_clu_fa
  • marine_hmp_db_03112017_clu_fa.index
  • marine_hmp_db_03112017_clu_rep.tsv
  • marine_hmp_db_03112017_clu_result2rep{.index}
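As a sketch of how the `*_clu_rep.tsv` file is typically used: MMseqs2's `createtsv` output maps representative to member, one tab-separated pair per line. A toy file stands in here for `marine_hmp_db_03112017_clu_rep.tsv`.

```shell
# Toy stand-in: <representative> TAB <member>, one pair per line.
printf 'rep_A\trep_A\nrep_A\tseq_7\nrep_B\trep_B\n' > clu_rep.tsv

# List all members of one cluster by its representative:
awk -F'\t' -v rep=rep_A '$1 == rep { print $2 }' clu_rep.tsv > rep_A.members
cat rep_A.members
```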

Genomic unknowns in RefSeq genomes

We will screen the RefSeq genomes for genomic unknowns, using the non-redundant set of genomes identified by MASH + t-SNE + PAM clustering.

Preparing RefSeq data

cd /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
mkdir genomes
rsync -Pauvz  /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/proteomes/downloads/ genomes/

Extract GU hits in refseq and plot genomic neighbourhood

Create needed folders

mkdir ft_files gb_files gb_files_filt gb_files_slice gb_files_ptt gb_files_geno

Get the assembly file from NCBI
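A minimal sketch of fetching and filtering the assembly summary. The download URL is NCBI's standard RefSeq path; a two-line toy file with the same tab layout (column 12 = `assembly_level`, column 20 = `ftp_path`) stands in so the filter can run offline.

```shell
# Download NCBI's assembly summary (uncomment to fetch):
# curl -sO https://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt

# Toy stand-in with the same 20-column tab layout:
{
  printf 'GCF_000001.1\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\tComplete Genome\tx\tx\tx\tx\tx\tx\tx\tftp://x/GCF_000001.1\n'
  printf 'GCF_000002.1\tx\tx\tx\tx\tx\tx\tx\tx\tx\tx\tContig\tx\tx\tx\tx\tx\tx\tx\tftp://x/GCF_000002.1\n'
} > assembly_summary_refseq.txt

# Keep only complete genomes; print accession and FTP directory:
awk -F'\t' '!/^#/ && $12 == "Complete Genome" { print $1 "\t" $20 }' \
  assembly_summary_refseq.txt > complete_genomes.tsv
cat complete_genomes.tsv
```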

genomewalker / nog_id.sh (last active February 13, 2018 13:59)
Create TARA subsampled (at nt level) read files
#!/bin/bash
# 1. We read a folder and take all fastq files
# 2. We quality trim and remove short sequences
# 3. Run kaiju and generate output
#. "${MODULESHOME}"/init/bash
set -x
set -e
main(){

Exploration of the ubiquitous EUs

This is not related to the niche breadth analysis; we simply use the EUs that are found in every sample.

Exporting the clusters belonging to the components found in all samples

super_cl <- read_tsv("/Users/ufo/Downloads/all_cluster_components.tsv", col_names = TRUE, trim_ws = TRUE) %>%
  filter(component %in% clstrs_comp_eu_ubi) %>%
  select(clstr_name) %>%
# Detach all non-base packages from the search path.
detachAllPackages <- function() {
  basic.packages <- c("package:stats", "package:graphics", "package:grDevices",
                      "package:utils", "package:datasets", "package:methods",
                      "package:base")
  package.list <- search()[grepl("^package:", search())]
  package.list <- setdiff(package.list, basic.packages)
  for (package in package.list) detach(package, character.only = TRUE)
}

18S magicblast

| sample | dada2_input | filtered | denoised | merged | table | no_chimeras | perc_reads_survived |
|--------|-------------|----------|----------|--------|-------|-------------|---------------------|
| ERR1018543 | 6525 | 4002 | 3727 | 3629 | 3629 | 3076 | 47.1 |
| ERR1018546 | 2421 | 1513 | 1276 | 1229 | 1229 | 1095 | 45.2 |

Total counts: 4,171
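The survival percentages and the total can be recomputed from the table rows above (perc_reads_survived = no_chimeras / dada2_input * 100; the total is the sum of the no_chimeras column):

```shell
# Recompute perc_reads_survived and the total from the two samples above.
printf 'ERR1018543 6525 4002 3727 3629 3629 3076\nERR1018546 2421 1513 1276 1229 1229 1095\n' |
  awk '{ printf "%s %.1f\n", $1, 100 * $7 / $2   # no_chimeras / dada2_input
         total += $7 }
       END { print "total reads kept:", total }' > survival_check.txt
cat survival_check.txt
```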

|Kingdom | n|

# Let’s get the non-merged reads ------------------------------------------
# Summarise match/mismatch/indel stats for read pairs that failed to merge
purrr::map_df(merged, tidyr::extract, col = "sequence", into = "sequence") %>%
  filter(accept == FALSE) %>%
  select(nmatch, nmismatch, nindel) %>%
  skimr::skim()

# Keep non-overlapping pairs by concatenating instead of merging
concat <- mergePairs(dada_forward, derep_forward, dada_reverse, derep_reverse, justConcatenate = TRUE)

get_nonmerged <- function(X, merg = merg, conc = conc, fn = fn) {
  m <- merg[[X]]