Skip to content

Instantly share code, notes, and snippets.

@RyanSchu
Last active February 8, 2024 01:38
Show Gist options
  • Save RyanSchu/301ea0a77a21414391b54193dfcea9e0 to your computer and use it in GitHub Desktop.
Save RyanSchu/301ea0a77a21414391b54193dfcea9e0 to your computer and use it in GitHub Desktop.
Convert FinalReport.txt files to plink lgen format

About

This gist details how to convert illumina final report genotype data to an input useable by PLINK. From there the genotype can be quality controlled and the end results exported to a .vcf file or otherwise parsed.

Lgen/PLINK format

lgen from the PLINK documentation "A text file with no header line, and one line per genotype call (or just not-homozygous-major calls if 'lgen-ref' was invoked) usually with the following five fields:

Family ID

Within-family ID

Variant identifier

Allele call 1 ('0' for missing)

Allele call 2"

Map format

Map from the PLINK documentation

"A text file with no header file, and one line per variant with the following 3-4 fields:

  1. Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

  2. Variant identifier

  3. Position in morgans or centimorgans (optional; also safe to use dummy value of '0')

  4. Base-pair coordinate"

Fam format

Fam from the PLINK documentation "A text file with no header line, and one line per sample with the following six fields:

Family ID ('FID')

Within-family ID ('IID'; cannot be '0')

Within-family ID of father ('0' if father isn't in dataset)

Within-family ID of mother ('0' if mother isn't in dataset)

Sex code ('1' = male, '2' = female, '0' = unknown)

Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)"

Creating the lgen and map file

The attached Rscript illumina_to_lgen creates the associated lgen and map file used by plink. It can be saved into your directory as is to run with cli. To run it from the command line you must supply it with the illumina finalreport.txt file and an outpout directory. This will output two files into the output directory "mets.map" and "mets.lgen." Not that this will overwrite any existing file in the output directory that already bears these names. Also not that for its two allele calls thsi script chooses forward allele 1 and forward allele 2.

Example Run

Rscript illumina_to_lgen.R --illumina PATH/TO/FinalReport.txt --outputdir PATH/TO/OutputDirectory/

dummy values This script will generate dummy values to be used by certain columns in the output file. Within the lgen file, the "fid" column is a dummy value. It is simply a duplicate of the sample ids. Within the map file, the "Genetic Distance" aka the third column is a dummy column. Its contents are all set to 0.

Creating the fam file

Creation the the fam file was done in R. Sex data was read into R as shown in the Sex_to_Fam.R code.

dummy values within the fam file there are 4 columns of dummy values. These columns are "fid", "paternal id"," maternal id", and "phenotype" corresponding to columns one, three, four, and six. As in the lgen files, fid values are duplicates of the sample ids. "Paternal id"," maternal id", and "phenotype" are all set to zero.

Quality Control

Please look into the GWAS QC pipeline and wiki to see if this will meet your needs.

## Originally written by Ryan Schubert for the Wheeler Lab, github@RyanSchu Lab github@wheelerlab
## script to convert illumina final report gentoype data to lgen data. This script will generate the .lgen file and .map file used for plink format. For instructions on how to generate the .fam file please see the .md file
library(dplyr)
library(tidyr)
library(argparse)
parser <- ArgumentParser()
parser$add_argument("--illumina", help="file path of the sample list")
parser$add_argument("-o", "--outputdir", help="output directory")
args <- parser$parse_args()
Finalreport<-as.data.frame(read.table(file=args$illumina, sep='\t', skip = 9, header = T))
Finalreport<-filter(Finalreport, Allele1...Forward != 'I')
Finalreport["empty"]="0"
Finalreport['fid']<-Finalreport$Sample.ID
map<-select(Finalreport, Chr, SNP.Name, empty, Position)
map<-map[!duplicated(map),]
map<-map[complete.cases(map),]
lgen<-select(Finalreport, fid, Sample.ID, SNP.Name, Allele1...Forward, Allele2...Forward)
lgen<-lgen[!duplicated(lgen),]
lgen<-lgen[complete.cases(lgen),]
lgen<-filter(lgen, Allele1...Forward != "-" & Allele2...Forward != "-")
write.table(map, file = paste(args$outputdir,"/mets.map",sep=""), sep = "\t", col.names = F, row.names = F, quote = F)
write.table(lgen, file = paste(args$outputdir,"/mets.lgen",sep=""), sep = "\t", col.names = F, row.names = F, quote = F)
sex<-as.data.frame(read.table(file="Z:/BLayden-LL-UIC-MEGA-AA-Oct1-2018_CallRate-Gender.csv", sep=',', header = T))
sex["empty"]<-"0"
sex["empty1"]<-"0"
sex["empty2"]<-"0"
sex['fid']<-sex$Sample.ID
sex<-select(sex, fid, Sample.ID, empty, empty1, Gender, empty2)
sex$Gender<-gsub('F', '2', sex$Gender, fixed = T)
sex$Gender<-gsub('M', '1', sex$Gender, fixed = T)
write.table(sex, file = "Z:/mets_analysis/mets.fam", sep = "\t", col.names = F, row.names = F, quote = F)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment