TITLE:
Scaling Science: leveraging Dask for life sciences
SHORT ABSTRACT:
Big data poses real challenges for the life sciences, and scalable scientific computing is needed to keep up with the growing demands of modern biology and neuroscience. Dask is a Python library for parallel and distributed computation. In this talk, we'll look at several case studies where Dask is used to scale up data processing for the life sciences, with examples from statistical genetics, single cell analysis, and image visualization and analysis. You will leave with a better understanding of how to extend your own code with Dask to scale your analysis.
DESCRIPTION:
Advances in modern biology research bring an increasing demand for computational resources. We need ways to scale scientific computing to meet the demands of big data in biology and neuroscience. This talk gives an overview of how Dask can be used for more effective computing, and how it integrates with other tools in the scientific Python ecosystem.
We will walk through several case studies taken from a diverse range of biology and neuroscience applications. These include examples from:
- statistical genetics
- single cell analysis
- image analysis
Dask is an open source project for distributed computing in Python. In addition to the main Dask library, there are several more specialized Dask repositories of interest to biologists and neuroscientists, including but not limited to dask.distributed, dask-ml, and dask-image. The Dask organization can be found on GitHub at https://github.com/dask and documentation is available at https://dask.org/
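To give a flavour of what scaling with Dask looks like in practice, here is a minimal sketch, not taken from the talk itself. It assumes a small random array as a hypothetical stand-in for a large image stack, and the variable names and chunk sizes are illustrative only.

```python
import numpy as np
import dask.array as da

# Hypothetical stand-in for a large image stack: 20 frames of 256 x 256 pixels.
rng = np.random.default_rng(0)
frames = rng.random((20, 256, 256))

# Wrap the data in a Dask array, split into chunks of 5 frames each.
stack = da.from_array(frames, chunks=(5, 256, 256))

# Build a lazy task graph for the mean projection; nothing is computed yet.
mean_projection = stack.mean(axis=0)

# Trigger the computation; Dask schedules the per-chunk work for us.
result = mean_projection.compute()
```

The same code can typically be pointed at a cluster by creating a `dask.distributed.Client` before calling `compute()`, which is part of what makes this approach attractive for growing datasets.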
In addition, we touch on a number of other packages in the scientific Python ecosystem:
- xarray, a package for labelled multi-dimensional arrays
- napari, a Python-based viewer for out-of-core visualization
- sgkit, a statistical genetics toolkit
- scanpy, a single cell analysis toolkit
After this talk, you'll be aware of the range of potential approaches for scaling up analysis in the life sciences, and be better equipped to implement some of these approaches in your own work.
ADDITIONAL MATERIAL:
Presenter speaking samples:
- PyConAU 2020 talk - https://www.youtube.com/watch?v=MpjgzNeISeI&list=PLs4CJRBY5F1IEFq-wumrBDRCu2EqkpY-R&index=2
- SciPy 2019 talk - https://www.youtube.com/watch?v=ytEQl9xs8FQ&list=PLYx7XA2nY5GcDQblpQ_M1V3PQPoLWiDAC&index=79&t=0s
KEYWORDS:
- big data
- distributed computing
- life science