Skip to content

Instantly share code, notes, and snippets.

View genomewalker's full-sized avatar

Antonio Fernandez-Guerra genomewalker

View GitHub Profile

Extract GU hits in refseq and plot genomic neighbourhood

Create needed folders

mkdir ft_files gb_files gb_files_filt gb_files_slice gb_files_ptt gb_files_geno

Get the assembly file from NCBI

Genomic unknowns in RefSeq genomes

We will screen the RefSeq genomes for genoimc unknowns. We will use the non-redundant set of genomes identified by MASH+t-SNE+PAM clustering.

Preparing RefSeq data

cd /bioinf/home/afernand/SANDBOX/unk_vs_refseq/
mkdir genomes
rsync -Pauvz  /bioinf/projects/megx/UNKNOWNS/chiara/NETWORK/RefSeq82.ref.gen/proteomes/downloads/ genomes/

Create FFINDEX files for each class

Get clustering file

We need the MMSEQS2 file with the clusters in fasta format, the data and index files:

  • marine_hmp_db_03112017_clu_fa
  • marine_hmp_db_03112017_clu_fa.index
  • marine_hmp_db_03112017_clu_rep.tsv
  • marine_hmp_db_03112017_clu_result2rep{.index}

Exploring PFAM similarities using consensus sequences

We want to know how the different clusters created by MMSEQS2 aggregate together at the domain architecture level.

  1. Get all domain architectures
  2. Get clusters that belong to the domain architecture
  3. Get consensus sequences
  4. Calculate SSN per domain architecture

We need the files:

#include <igraph/igraph.h>
#include <math.h>
int compare (const void * a, const void * b)
{
if (*(double*)a > *(double*)b) return 1;
else if (*(double*)a < *(double*)b) return -1;
else return 0;
}
#!/bin/bash -l
#$ -cwd
#$ -j y
#$ -t 1-10:1
#$ -tc 25
#$ -N hmm_unk
#$ -pe threaded 2
#$ -V
# Where the models are
mmseqs concatdbs /PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/unkprot_db \
/PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/HMPSOAP_db \
/PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/unkprot_db.withNewSequences
mmseqs concatdbs /PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/unkprot_db_h \
/PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/HMPSOAP_db_h \
/PROCESSING/OSD/UNKNOWNS/MMSEQ_MPI/unkdb_update_hmp/unkprot_db.withNewSequences_h
mawk '{print NR"\t"$1}' unkprot_db.withNewSequences_h | sed 's/\x0//g' > unkprot_db.withNewSequences.lookup
big_dada <- function(X){
cat("Processing:", X, "\n")
cat("Derep F:", X, "\n")
derepF <- derepFastq(filtpathF[[X]])
cat("Inferring F:", X, "\n")
ddF <- dada(derepF, err=errF, multithread=TRUE)
cat("Derep R:", X, "\n")
derepR <- derepFastq(filtpathR[[X]])
cat("Inferring R:", X, "\n")
ddR <- dada(derepR, err=errR, multithread=TRUE)
@genomewalker
genomewalker / otu_table_analysis.r
Last active October 2, 2017 16:05
Scripts for analysing original alma data set
# UCLUST workflow ---------------------------------------------------------
library(tidyverse)
library(cowplot)
library(ggpubr)
setwd("~/Downloads/alma_original_test")
uclust_df <- read_tsv("16s.insilico_mock.tsv", col_names = T)
uclust_16S_summary <- uclust_df%>% dplyr::rename(otu_name = `#OTU ID`) %>%
@genomewalker
genomewalker / sqlite_blob.r
Last active September 22, 2017 06:37 — forked from jfaganUK/sqlite_blob.r
Storing R Objects as a SQLite Blob
#' ## Storing R Objects in a SQLite Database
#' Two packages we are using. The first is the ```RSQLite``` which will be used to create and manage an in-memory SQLite database. The second is ```igraph``` which I will use to create and visualize a random network. Some of the work I do is on network simulation. I often don't know the metrics I need from a simulated network when it's created, so I want to be able to store the networks that are created so that I can go back later and analyze them.
library(RSQLite)
library(igraph)
#' Create a database in memory.
con <- dbConnect(SQLite(), ":memory:")
#' The table has two columns, an *id* column and a column called *graph* which is a **blob** type. This type just stores binary data.