Introduction (Part I and II)
The first lectures aim to provide the theoretical foundations needed to tackle the research topics introduced in the course, as well as the main technological motivations behind big data. The course is aimed at computer scientists, physicists, statisticians, genetic epidemiologists, bioinformaticians, and genome biologists, and seeks to open a discussion on the challenges and opportunities in next-generation sequencing data analysis and massive data analysis.
Part I: Massive data, deep sequencing and indexing techniques. Software tools.
Part II: Moore's Law: current trends and the big data revolution. Approaches to work splitting: parallel algorithms, MapReduce, data streaming.
Inferring genetic diversity from Next Generation sequencing
Niko Beerenwinkel, ETH Zurich, www.cbg.ethz.ch
Computational Biology Group (Switzerland)
Genetic diversity is a hallmark of evolution and it plays a key role in the pathogenesis and treatment of rapidly evolving pathogens, such as viruses, bacteria, and cancer cells.
With high-coverage next-generation sequencing (NGS), the genetic diversity of mixed samples can be probed at an unprecedented level of detail in a cost-effective manner. However, NGS reads are error-prone and relatively short, which complicates the detection of low-frequency variants and the reconstruction of long haplotype sequences. In this lecture, I will introduce computational and statistical challenges associated with genetic diversity estimation from NGS data. I will discuss several approaches to their solution based on probabilistic graphical models and on combinatorial optimization techniques. Two major applications will be presented: the genetic diversity of HIV within patients and the genetic diversity of cancer cells within tumors.
Part 1: Detecting low-frequency single-nucleotide variants (SNVs)
Part 2: Local haplotype inference and global quasispecies assembly
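To make the difficulty behind Part 1 concrete, here is a minimal sketch (not the method presented in the lecture) of the simplest possible test for a low-frequency variant: ask whether the number of reads showing a non-reference base at a position is larger than sequencing error alone would plausibly produce. The function names, the uniform per-base error rate of 1%, and the significance threshold are all assumptions made for illustration; real callers model errors far more carefully.

```python
from math import exp, lgamma, log

def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are evaluated in log space so that deep coverage (large n)
    does not overflow.
    """
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total

def is_candidate_snv(alt_count, depth, error_rate=0.01, alpha=1e-3):
    """Flag a position when the alternative-base count is unlikely
    to be explained by sequencing error alone (hypothetical defaults)."""
    if depth == 0:
        return False
    return binomial_tail(alt_count, depth, error_rate) < alpha

# At depth 1000 and a 1% error rate, about 10 erroneous reads are expected.
print(is_candidate_snv(30, 1000))  # 30 reads is far above the error expectation
print(is_candidate_snv(12, 1000))  # 12 reads is compatible with error
```

The sketch also shows why short, error-prone reads are a problem: a true variant present at a frequency close to the error rate is statistically indistinguishable from noise unless coverage is very high or the error model is refined.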
The Paradigm of Data Stream for Next Generation Internet
Irene Finocchi, Università la Sapienza, Roma
Data stream algorithmics has gained increasing popularity in the last few years as an effective paradigm for processing massive data sets. A wide range of applications in computational sciences generate huge and rapidly changing streams of data that need to be continuously monitored and processed in one or few sequential passes, using a limited amount of working memory. Despite the heavy restrictions on time and space resources imposed by this data access model, major progress has been achieved in the last ten years in the design of streaming algorithms for several fundamental data sketching and statistics problems. The lectures will overview this rapidly evolving area and present basic algorithmic ideas, techniques, and challenges in data stream processing.
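As an illustration of the one-pass, limited-memory model described above, the following sketch implements the classic Misra-Gries frequent-items summary. It is offered only as a well-known example of a streaming algorithm, not as material taken from the lectures; the function name and the sample stream are assumptions. With at most k - 1 counters, any item occurring more than n/k times in a stream of length n is guaranteed to survive in the summary, and each stored count underestimates the true frequency by at most n/k.

```python
def misra_gries(stream, k):
    """One-pass frequent-items summary using at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "a" occurs 5 times in a stream of length 10, which is more than 10/3,
# so with k = 3 it must appear in the summary.
summary = misra_gries("abacadaeaf", k=3)
print(summary)
```

The stored counts are lower bounds only; if exact frequencies are needed, a second pass over the data can verify the surviving candidates.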
Next Generation sequencing data analysis
Nadia Pisanti, Università di Pisa
New sequencing technologies have dramatically decreased costs and thus opened the way to new challenges in applications such as metagenomics and transcriptome analysis by means of sequences; in particular, the low cost of re-sequencing applied to the human genome opens the way to new issues in personalised medicine. As a consequence, a new phase has begun for genome research. From the point of view of the computer scientist, the management of huge amounts of data, the small size of sequenced fragments (with respect to previous technologies), and the new applications that bring to sequences large volumes of data that used to be managed with arrays have led to several new problems in string algorithms. We will try to give an overview of these problems and of possible approaches to address them.
Niko Beerenwinkel, ETH Zurich, www.cbg.ethz.ch
Niko Beerenwinkel was born in Düsseldorf, Germany. He studied mathematics, biology, and computer science, and received his Diploma degree in Mathematics from the University of Bonn in 1999 and his PhD in Computer Science from Saarland University in 2004. He was a postdoctoral researcher at the University of California at Berkeley (2004-2006) and at Harvard University (2006-2007) before joining ETH Zurich as assistant professor of computational biology.
Niko Beerenwinkel's research is at the interface of mathematics, statistics, and computer science with biology and medicine. His interests range from mathematical foundations of biostatistical models to clinical applications. Current research topics include haplotype inference from ultra-deep sequencing data, somatic evolution of cancer, reconstruction of signaling pathways from RNAi screens, HIV drug resistance, graphical models, and algebraic statistics.
He has authored over 50 research articles in the areas of computational biology, bioinformatics, biostatistics, virology, and cancer biology. His honors include the Otto Hahn Medal of the Max Planck Society and the Emmy Noether Fellowship of the German National Science Foundation.
Irene Finocchi, Università la Sapienza, Roma
Irene Finocchi obtained a PhD in Computer Science (2002) from Sapienza University of Rome, where she is currently Associate Professor at the Department of Computer Science. Her research interests include the design, theoretical analysis, and experimental evaluation of algorithms and data structures, focusing on algorithmics for massive data sets, algorithms resilient to memory faults, and algorithm engineering. More recently, she has been exploring the application of algorithmic theory for data-intensive scenarios to the design and implementation of dynamic program analysis tools. Irene Finocchi has been PC co-chair of ALENEX'09, the 11th SIAM Workshop on Algorithm Engineering and Experiments, and has served on the program committees of many major conferences in the field of algorithmics including SODA (ACM-SIAM Symp. on Discrete Algorithms), ICALP (Int. Colloquium on Automata, Languages & Programming), and ESA (European Symp. on Algorithms). She is the recipient of a Distinguished Paper Award at OOPSLA 2011, the 26th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications.
Nadia Pisanti, Università di Pisa
Nadia Pisanti obtained the DEA "Informatique Fondamentale et Applications" on the subject "Genome Analysis" from the University of Marne la Vallée, France, and a PhD in Computer Science in 2002 from the University of Pisa. Her research interests include Bioinformatics and algorithms for Computational Biology. She has been a visiting researcher at the Institut Pasteur in Paris, at INRIA Rhône-Alpes, at the University of Haifa, at LIACS in Leiden, at King's College London, and at the University of Lyon 1. Since 2006 she has been a Research Assistant at the Department of Computer Science in Pisa. She has served on the Program Committees of many major conferences in Bioinformatics, including ICCABS, WABI and RECOMB.