GMS6907: Applied bioinformatics and Omics Data Analysis
Post-graduate module: 4MCs
Module opens once every year in Semester 1 (August)
Prerequisites:
None
Instructors:
Kuan Rong Chan, Justin Ooi, Clara Koh
Module description:
With the digitalisation of medical records and the availability of high-throughput omics technologies, big data is becoming increasingly accessible. However, without formal bioinformatics training, it is easy to get lost in big data, as there can be multiple ways of interpreting it. Moreover, investigator bias may lead to erroneous interpretation of big data, which can compromise decision-making processes. This module is designed to cover the concepts in big data analysis, including data mining, data preprocessing, data visualisation, pathway enrichment analysis and data dashboarding. Students will also learn the fundamentals of Python programming, which is useful for handling big datasets.
Learning Outcomes:
Students will first learn the different technologies involved in generating omics big data, and appreciate why data is important in decision-making processes. Students will also learn where to access such datasets and how to make their data available to the scientific community.
At the start of the course, students will master the use of Microsoft Excel and GraphPad Prism for analysing and visualising scientific data. The fundamentals of data preprocessing and data visualisation will also be taught.
After learning the basics, students will be introduced to bioinformatics, particularly the Python programming language, which is essential for analysis of big data. At the end of the module, students are expected to be able to work within Jupyter Notebooks to write and execute Python code, load data files into Notebooks, manipulate data tables with the pandas library, plot graphs with matplotlib, seaborn and Plotly, and use the voila package for data dashboarding.
Finally, to be able to interpret omics datasets, students will learn how to use the different web tools and advanced Python packages for enrichment analysis.
The course will be divided into lectures and tutorials, and students will gain hands-on experience with the different bioinformatics tools during the tutorial sessions. Preparation work will include reading additional materials related to data science and omics research.
Rationale:
As a researcher, it is critical to know how to obtain, analyse and interpret data from experiments. However, there are currently no graduate modules that cover the fundamental concepts of data analysis. This module should be of broad interest to most students who are pursuing a PhD in the sciences. The knowledge learnt can be applied immediately to their PhD projects, and will remain useful in their future careers, whether in industry or academia.
Tentative topics covered:
Introduction to data science
At the start of the course, students will be taught the basic concepts of data science, and how data collection can influence our decision-making processes. We will discuss the emerging platforms that can be used to generate big data and how the data can be curated and stored in databases. Students will learn about the different types of big data and how to use the different data repositories to gain access to omics datasets. Students will also learn about GitHub, and how they can use GitHub to upload and analyse their datasets publicly or privately. The fundamentals of Microsoft Excel will also be covered for simple data analysis.
Introduction to statistics and statistical computing
Students will learn about the fundamentals of statistics, such as the t-test, ANOVA, the chi-square test and correlation coefficients, and the assumptions that need to be satisfied for these statistical analyses to be valid. The concepts of p-value adjustment and false discovery rate will be explained so that students can understand when and how to adjust p-values. The tools that can be used to obtain these values will be described in detail.
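As an illustration of p-value adjustment, the sketch below applies the Benjamini-Hochberg procedure, one common way of controlling the false discovery rate. It is only a minimal example: the function name benjamini_hochberg and the p-values are made up for illustration, and the module may well teach a different tool or correction method.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Made-up raw p-values from five hypothetical tests.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.20]
adj_p = benjamini_hochberg(raw_p)

for raw, adj in zip(raw_p, adj_p):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.5f}")
```

In this made-up example the two tests with raw p-values just under 0.05 no longer pass a 0.05 threshold after adjustment, which is the kind of situation where the adjustment matters.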
Data visualisation – From basics to advanced
Students will learn about the anatomy of graphs, and the critical information that should be presented on the axes and data points for optimal data visualisation. Students will also learn to distinguish between categorical and continuous variables, and the best visualisation tools for each type of variable. The module will also examine how we can annotate data points and use colour effectively to present data trends in a multi-dimensional fashion.
After learning the basics of Python, students will revisit the topic of data visualisation and use Python to plot interactive bar charts that facilitate data exploration. Specifically, the plotly package will be described in detail for interactive graph plotting.
Python for big data analysis
Students will learn about the limitations of Excel and why Python can be a better tool for big data analysis. Jupyter notebooks will be described in detail and students will practise executing Python commands within Jupyter notebooks. The fundamentals of Python syntax will also be taught so that students can apply them to big data analysis.
Data preprocessing with Pandas package
Big data may be heterogeneous and incomplete. Students will learn to use the pandas package in Python to preprocess datasets before further data analysis can be performed. Specific topics, including how to manage missing values and how to detect and manage outliers, will be covered at length. Finally, we will cover normalisation and scaling, and describe when, why and how we can normalise datasets.
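The preprocessing steps named above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the module's prescribed workflow: the table is made up, median imputation and the 1.5 x IQR rule are just one common choice each for missing values and outliers, and z-scoring is only one of several scaling methods.

```python
import numpy as np
import pandas as pd

# Made-up expression table: rows are samples, columns are genes.
df = pd.DataFrame(
    {
        "gene_a": [5.1, 4.9, np.nan, 5.3, 5.0, 25.0],
        "gene_b": [2.0, 2.3, 2.1, np.nan, 1.9, 2.2],
    },
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

# 1. Missing values: count them per column, then impute with the column median.
n_missing = df.isna().sum()
filled = df.fillna(df.median())

# 2. Outliers: flag values more than 1.5 x IQR beyond the quartiles.
q1 = filled.quantile(0.25)
q3 = filled.quantile(0.75)
iqr = q3 - q1
is_outlier = (filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)

# 3. Scaling: z-score each column (mean 0, sample standard deviation 1).
z_scored = (filled - filled.mean()) / filled.std()

print(n_missing)
print(is_outlier)
print(z_scored.round(2))
```

Here the flagged value is left in place before scaling to keep the sketch short; in a real analysis, whether to remove, cap or keep an outlier is a judgement call that should be made before normalising.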
Feature detection with Python
There can be multiple ways of analysing big data. How can we systematically work through the data so that all the important features are captured? The content here covers how we can use Python to identify the features of a dataset. This includes using Python to plot volcano plots, stacked bar charts, correlation matrices, heatmaps and clustergrams to extract the most important information from a big dataset.
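To make the idea of a volcano plot concrete, the sketch below computes its two coordinates (log2 fold change and -log10 p-value) and labels each gene by simple cut-offs. The gene names, values and thresholds are all made up for illustration, and the plotting step itself is omitted so that the example needs only the standard library.

```python
import math

# Hypothetical differential-expression results: (gene, fold change, p-value).
results = [
    ("GENE1", 4.0, 0.0005),
    ("GENE2", 0.2, 0.002),
    ("GENE3", 1.1, 0.60),
    ("GENE4", 3.0, 0.20),
]

LFC_CUTOFF = 1.0  # assumed threshold on |log2 fold change|
P_CUTOFF = 0.05   # assumed significance threshold


def classify(fold_change, p_value):
    """Label a gene as up, down or not significant using the two cut-offs."""
    lfc = math.log2(fold_change)
    if p_value < P_CUTOFF and lfc >= LFC_CUTOFF:
        return "up"
    if p_value < P_CUTOFF and lfc <= -LFC_CUTOFF:
        return "down"
    return "not significant"


# x and y are the coordinates a volcano plot would show for each gene.
volcano = [
    {
        "gene": gene,
        "x": math.log2(fc),
        "y": -math.log10(p),
        "label": classify(fc, p),
    }
    for gene, fc, p in results
]
labels = [row["label"] for row in volcano]

for row in volcano:
    print(f"{row['gene']}: x={row['x']:.2f}, y={row['y']:.2f}, {row['label']}")
```

With real omics data the p-values would normally be adjusted for multiple testing before the cut-off is applied.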
Pathway enrichment analysis for omics studies
After feature detection, an enrichment analysis is often required to identify what the data mean in a biological context. We will cover the web tools that are available online to discover the biological functions of these features. In addition, we will cover how we can use GSEApy for transcriptomics analysis. Pathway analysis of transcriptomics, proteomics and metabolomics data will be covered in this topic.
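One simple form of enrichment analysis is over-representation analysis, which asks whether a pathway's genes appear in a selected gene list more often than chance would predict, using a hypergeometric test. The sketch below illustrates that calculation only. It does not use GSEApy or reproduce its API, the function name hypergeom_enrichment_p is hypothetical, and the gene counts are made up.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random
    without replacement from n_background genes, of which n_in_pathway
    belong to the pathway."""
    total = comb(n_background, n_selected)
    upper = min(n_in_pathway, n_selected)
    tail = sum(
        comb(n_in_pathway, k) * comb(n_background - n_in_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up example: 20 background genes, 5 in the pathway,
# 4 selected genes, 3 of which fall in the pathway.
p_enriched = hypergeom_enrichment_p(20, 5, 4, 3)
print(f"enrichment p-value = {p_enriched:.4f}")
```

When many pathways are tested at once, the resulting p-values would also need adjusting for multiple testing.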
Machine learning with Python
With the collection of sufficient samples, machine learning becomes possible. Students will be taught the fundamentals of machine learning and its potential applications. Students will also learn about multivariate analysis, and will be taught how to do multivariate analysis using web tools and Python.
Publishing with big data
Scientists often struggle to communicate big data effectively, which may cause confusion. We will discuss how to communicate big data effectively, and how to produce publication-quality figures for scientific publications. We will also cover how to do data dashboarding within Jupyter notebooks, which will help greatly in the communication of big data.
Mode of teaching and assessment
The lessons will be split into lectures and tutorials, where students will have hands-on sessions on how to download different data analysis tools and how to use them effectively. Hence, every student will need to bring a laptop to lessons to learn most effectively.
There will be continuous assessment throughout the length of the course. Students will be tasked to perform projects related to:
Exploratory data analysis and interpretation of data
Building and developing a database for omics analysis
Building a webtool for data analysis
The final exam will be an assignment in which students are given an omics dataset and use Python to gain insights from it. This will evaluate their problem-solving skills and whether they can apply their knowledge to real datasets. The instructors will be involved in facilitating these sessions and students will present their data analysis findings during classes.