Sina Majidian sinamajidian

The input is the genotype data from Supplementary File S13 of the study by Edwards et al. There are 14112 genes for the 94 samples. I assume that in a sample name such as AA_SRR23446543#1, the first part (AA) is the population ID and the last character (the digit after #) is the haplotype ID. So there are 15 samples for AI (A. insularis), 15 for AW (A. woodhouseii), 14 for AC (A. coerulescens), and 1 each for AA (A. californica), CY (Cyanocorax yucatanicus), and CS (Cyanocitta cristata). These counts add up to 47 individuals, which is consistent with 94 haplotype-level samples (two per individual).

Note that the coordinates are arbitrary: each gene is assigned 100 bases. If we had the gene lengths, we could make the coordinates accurate, but it probably shouldn't matter. We could double-check whether CNV length matters in the PoMoCNV framework.

f= "data_S13.csv"
f_read= open(f,'r')
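
The naming assumption and the arbitrary 100-base coordinates described above can be sketched as follows. This is a minimal illustration, not the code of the original gist: the function names are hypothetical, treating the middle part of the name as a run accession is an assumption, and so is the choice of 1-based inclusive coordinates.

```python
from collections import Counter

GENE_LENGTH = 100  # arbitrary placeholder length per gene, as noted above


def parse_sample_name(name):
    """Split e.g. 'AA_SRR23446543#1' into (population, accession, haplotype).

    Assumes the layout <population>_<accession>#<haplotype>.
    """
    population, rest = name.split("_", 1)
    accession, haplotype = rest.rsplit("#", 1)
    return population, accession, haplotype


def gene_coordinates(gene_index, gene_length=GENE_LENGTH):
    """Return 1-based inclusive (start, end) for the gene at 0-based gene_index."""
    start = gene_index * gene_length + 1
    end = start + gene_length - 1
    return start, end


def individuals_per_population(sample_names):
    """Count distinct accessions (individuals) per population ID."""
    seen = {(pop, acc) for pop, acc, _ in map(parse_sample_name, sample_names)}
    return Counter(pop for pop, _ in seen)
```

Calling `individuals_per_population` on the header of the genotype table is one way to check the per-population counts listed above before running anything downstream.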

The code is from the DeepLearning.AI course "How Transformer LLMs Work".

Set up

conda create -n llm python=3.12
conda activate llm
conda install conda-forge::transformers
conda install conda-forge::jupyterlab  
conda install pytorch::pytorch
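
After the setup above, a quick way to confirm that the environment is usable is to check that the packages can be found by Python. The sketch below is an assumption-laden convenience, not part of the course material: `missing_packages` is a hypothetical helper, and the import names (`torch` for the `pytorch` conda package) are the usual ones but worth verifying in your own environment.

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be located."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names assumed for the three conda packages installed above.
    missing = missing_packages(["transformers", "torch", "jupyterlab"])
    print("missing:", missing if missing else "none")
```

`find_spec` only locates the packages and does not import them, so the check stays fast even for large libraries.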

In our Movi color study, we ran Metabuli using the following script. We used two read datasets from CAMI (one long-read and one short-read) for metagenomic classification.

Install

conda create -n metab python=3.12
conda activate metab

wget https://mmseqs.com/metabuli/metabuli-linux-avx2.tar.gz

import pyham
import logging

treeFile=fastoma_out+'/species_tree_checked.nwk'
orthoxmlFile=fastoma_out+'/FastOMA_HOGs.orthoxml'

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(message)s', level=logging.INFO, datefmt='%Y-%m-%d %H:%M:%S')

PhylogeneticTree with FastOMA

Updating the code from the F1000 paper on PhylogeneticTree on GitHub for FastOMA

Impute-first pipeline for variant calling

This bash script is based on the Impute-First GitHub repository and the preprint.

Inputs:

Reference genome
HGSVC2 Reference panel
PLINK genetic map
NovaSeq HG002 sequencing reads

I ran into this error several times, and searching the web only turned up suggestions to use sudo apt-get.

/usr/bin/ld: cannot find -lcurl
/usr/bin/ld: cannot find -lbz2
collect2: error: ld returned 1 exit status
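
For context, `-lcurl` asks the linker for a file named `libcurl.so` or `libcurl.a` in one of its search directories, so the error means no such file was found there. The sketch below only checks a list of directories for those file names. It is a diagnostic illustration in Python, not the fix from the original note: `find_link_candidates` is a hypothetical helper, the directory list is an example, and it does not reproduce the linker's full search logic.

```python
import os


def find_link_candidates(lib_name, search_dirs):
    """Return paths of lib<lib_name>.so / lib<lib_name>.a found in search_dirs."""
    hits = []
    for directory in search_dirs:
        for suffix in (".so", ".a"):
            path = os.path.join(directory, f"lib{lib_name}{suffix}")
            if os.path.exists(path):
                hits.append(path)
    return hits


if __name__ == "__main__":
    # Example directories only; the lib folder of an active conda
    # environment is added when CONDA_PREFIX is set.
    dirs = ["/usr/lib", "/usr/lib/x86_64-linux-gnu", "/usr/local/lib"]
    if os.environ.get("CONDA_PREFIX"):
        dirs.append(os.path.join(os.environ["CONDA_PREFIX"], "lib"))
    for name in ("curl", "bz2"):
        print(name, find_link_candidates(name, dirs) or "not found")
```

If a directory does contain the library, it can be passed to the linker with `-L<directory>`, which needs no root access.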

A diploid genome using SURVIVOR

Using the following bash code, you can create a diploid genome using SURVIVOR. At the end, you will have three files:

Two FASTA files: sim1.fasta and sim2.fasta.
A truly phased VCF file: sim_e_merg.vcf.
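
In a phased VCF, the alleles in the GT field are separated by `|` rather than `/`. A minimal sketch for checking this on a file such as sim_e_merg.vcf is shown below. The function name is hypothetical, and the code assumes a standard tab-separated VCF with a FORMAT column containing GT; it is not taken from the original bash script.

```python
def count_phasing(vcf_lines):
    """Count phased ('|') and unphased ('/') genotype calls in VCF lines."""
    phased = unphased = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and empty lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # record has no sample columns
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        gt_index = keys.index("GT")
        for sample in fields[9:]:
            values = sample.split(":")
            genotype = values[gt_index] if gt_index < len(values) else ""
            if "|" in genotype:
                phased += 1
            elif "/" in genotype:
                unphased += 1
    return phased, unphased
```

With a real file, the lines can be passed directly from an open file handle, for example `count_phasing(open("sim_e_merg.vcf"))`.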

Some lines of the intermediate files:

	import os
	import sys
	import pandas as pd
	import numpy as np

	"""
	python eval_kraken kraken_out.1st4th.csv true.csv nodes.dmp taxa.ids > out.stat

	Compares the Kraken output with the truth and calculates F1 when the true reads are defined at the species level.
	"""
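
The F1 calculation mentioned above can be sketched as follows. This is a simplified per-read version under stated assumptions, not the original script: `precision_recall_f1` is a hypothetical name, both inputs are assumed to already be dictionaries from read ID to species-level taxon ID, and the taxonomy lookup through nodes.dmp is not reproduced.

```python
def precision_recall_f1(predicted, truth):
    """Per-read precision, recall and F1.

    predicted: dict read_id -> predicted species taxid (None = unclassified)
    truth:     dict read_id -> true species taxid
    """
    # A true positive is a classified read whose taxid matches the truth.
    true_positives = sum(
        1 for read_id, taxid in predicted.items()
        if taxid is not None and truth.get(read_id) == taxid
    )
    n_classified = sum(1 for taxid in predicted.values() if taxid is not None)
    n_truth = len(truth)

    precision = true_positives / n_classified if n_classified else 0.0
    recall = true_positives / n_truth if n_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Here precision is taken over classified reads and recall over all reads in the truth set, so unclassified reads lower recall but not precision. Other definitions are possible and may be what the original script uses.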

	#!/usr/bin/python3

	import numpy as np
	from sys import argv


	file_fq_input_addrss = argv[1]
	file_fq_output_addrss = argv[2]

	file_fq_input = open(file_fq_input_addrss, 'r')