@romgrk
Created April 29, 2021 13:10
presentation

Sections:

#1 #2 • Hi everyone. Today I will present a project that I've been working on over the last year, as well as some open-source libraries that I released alongside my various projects. • There will be a question period after each section, but don't hesitate to stop me at any point during the presentation.

#3 - BGR • This first section will touch on the bacterial-genome-reconstruction tool. It's a web-based tool that I implemented, based on an idea by Ken Dewar and Matthew D'Iorio. Its main purpose is to provide an easy way to identify a bacterial genome and to optimize the read lengths of the input file.

#4 • Before I talk about what it does, I'll just mention the design principles that have guided the implementation of the tool: to make it intuitive, easy to use, and minimalist. Even though the underlying tools are available as command-line utilities, many people are not comfortable using the command line. The BGR tool is a way to make it less complicated for such users to have access to those resources.

#5 • Let's go into detail to understand what it does. • It uses the Basic Local Alignment Search Tool, also known as BLAST, to match the input files to the known bacterial genomes. • Its operation is separated into 4 steps.

#6 • The first step is to input the files, in either FASTA or FASTQ format. • The user can also provide both their R1 and R2 files, but this is optional. • The tool runs basic validation on the input files to automatically detect the file format, and reports any errors within the files.
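
As an illustration of the format detection mentioned above, here is a minimal sketch in Python. It is not the tool's actual code: the function name `detect_format` is hypothetical, and it assumes the format can be guessed from the first non-blank line alone (`>` for FASTA, `@` for FASTQ), which is a simplification of real validation.

```python
def detect_format(text: str) -> str:
    """Guess the format from the first non-blank line.

    Assumption for this sketch: FASTA headers start with '>' and
    FASTQ headers start with '@'. Nothing past that line is checked.
    """
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        raise ValueError(f"unrecognized header line: {line[:20]!r}")
    raise ValueError("input is empty")
```

A real validator would go on to check every record, for example that each FASTQ quality string has the same length as its sequence.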

#7 • The second step is to identify the genus present in the input files. The matches are presented in order of likelihood.

#8 • The third step is to identify the exact species present in the input files. The matches are again presented in order of likelihood.

#9 • After identifying the exact species, the tool finds the optimal read length for each input file and returns all the results.

#10 • The tool also implements a task queue to handle running jobs concurrently. The task runner provides a position and status indicator for each job, and tries to provide an accurate estimate of how long the task will take to run.
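
To make the position and time-estimate idea concrete, here is a small sketch of a first-in, first-out job queue in Python. Everything in it is an assumption made for illustration: the names `Job` and `TaskQueue`, the fixed average duration per job, and the fact that it runs one job at a time rather than concurrently. It does not reflect how the BGR task runner is actually implemented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    status: str = "queued"  # queued -> running -> done


class TaskQueue:
    """Hypothetical FIFO queue that reports position and a rough wait estimate."""

    def __init__(self, avg_seconds_per_job: float = 60.0):
        self.pending = deque()
        self.avg_seconds_per_job = avg_seconds_per_job  # assumed constant

    def submit(self, job_id: str) -> Job:
        job = Job(job_id)
        self.pending.append(job)
        return job

    def position(self, job_id: str):
        """1-based position in the queue, or None if the job is not waiting."""
        for index, job in enumerate(self.pending):
            if job.job_id == job_id:
                return index + 1
        return None

    def estimated_wait(self, job_id: str):
        """Seconds until the job starts, assuming each job ahead takes the average."""
        pos = self.position(job_id)
        return None if pos is None else (pos - 1) * self.avg_seconds_per_job

    def run_next(self, work) -> Job:
        """Run the oldest queued job; raises IndexError if the queue is empty."""
        job = self.pending.popleft()
        job.status = "running"
        work(job)
        job.status = "done"
        return job
```

A more accurate estimate would probably use the input size of each queued job instead of a single average.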

#11 • Questions?

#12 - nodejs modules • For the second section, I'll present some nodejs libraries that I have released while working on various projects.

#13 • The first one is a simple parser & writer for the FASTA and FASTQ file formats. It supports multi-sequence files, and provides detailed error reports when it fails. It has been used in the context of BGR and BGS.
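
For readers unfamiliar with the format, the sketch below shows what parsing a multi-sequence FASTA file with line-numbered errors can look like. It is written in Python for illustration only and is not the API of the nodejs library; the function name `parse_fasta` and the exact error message are assumptions.

```python
def parse_fasta(text: str):
    """Parse FASTA text into a list of (name, sequence) tuples.

    Raises ValueError with a line number when sequence data
    appears before the first '>' header.
    """
    records = []
    name, chunks = None, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                # a new header closes the previous record
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError(f"line {lineno}: sequence data before any '>' header")
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

For example, `parse_fasta(">a\nAC\nGT\n>b\nTT\n")` yields two records, with the two lines of the first sequence joined together.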

#14 • The second library here is a fork of an existing nodejs HDF5 library. The original HDF5 implementation is in C and is sometimes a bit hard to use correctly. This alternative offers a more secure API, and more detailed error messages than the original project. It also adds a few missing HDF5 features.

#15 • This third library is an implementation of the Loom file format, whose implementation, loompy, previously existed only in Python. The Loom file format is built on top of HDF5, and this library uses the HDF5 library that was presented just before. It has been used in the context of the RNAGET API that is implemented by the IHEC Data Portal.

#16 • And the final batch of libraries I will present here is about the fzy algorithm. fzy is an algorithm used for fuzzy matching, the most common use case for it being autocompletion. It's an algorithm that I find very useful. However, running it in an interpreted language such as Python, JavaScript or Lua always ends up hitting performance issues, because number crunching is better left to low-level languages. For that reason, I've released over the years bindings for a few environments: luajit, postgresql, and nodejs. The postgresql bindings are used in the new freezeman LIMS, to provide an easy way to jump to any kind of item.
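
To give a feel for what fuzzy matching does, here is a deliberately simplified sketch in Python. It is not the fzy algorithm: fzy computes an optimal alignment with its own scoring rules, while this version matches greedily from left to right and uses a made-up scoring scheme (2 points for a match at the very start or directly after the previous match, 1 point otherwise). The names `fuzzy_score` and `rank` are hypothetical.

```python
def fuzzy_score(needle: str, haystack: str):
    """Return a score if needle is a subsequence of haystack, else None.

    Simplified stand-in for fzy, with assumed scoring: consecutive
    matches (and a match at index 0) earn 2 points, others earn 1.
    """
    n, h = needle.lower(), haystack.lower()
    score = 0
    pos = -1  # index of the previous matched character
    for ch in n:
        nxt = h.find(ch, pos + 1)
        if nxt == -1:
            return None  # needle is not a subsequence
        score += 2 if nxt == pos + 1 else 1
        pos = nxt
    return score


def rank(needle: str, candidates):
    """Return the matching candidates, best score first."""
    scored = [(fuzzy_score(needle, c), c) for c in candidates]
    matches = [(s, c) for s, c in scored if s is not None]
    return [c for s, c in sorted(matches, key=lambda pair: -pair[0])]
```

Even this toy version loops over every character of every candidate, which hints at why scoring thousands of items on each keystroke becomes expensive in an interpreted language.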

#17 • That's it for my presentation. If you have any questions, please go ahead, and thank you for listening.
