Carl-Johan Ivarsson, chief executive officer of Qlucore, explains how principal component analysis (PCA) is helping to reveal hidden structures and patterns within high-dimensional data.
During the last decade, research into molecular biology has helped to identify a large number of genes associated with human disease and is therefore helping researchers to unpick the fundamental biology of major illnesses.
Complex gene expression experiments, in particular, are helping to support this process as they are able to create a global picture of cellular function by measuring the activity (often called the 'expression') of tens of thousands of genes at once.
Difficulties arise, however, as a result of the vast amount of data that is created by experiments such as these.
This data overload can present a serious problem for researchers, as it is essential to capture, explore and analyse this kind of data effectively in order to obtain the most meaningful results from their experiments.
In recent years, however, researchers have begun to use powerful software engines that enable them to visualise their data as full-colour 3D images that can be manipulated on a computer screen.
With this approach, scientists can identify hidden structures and patterns more easily, and can pick out interesting or significant results themselves, without having to rely on specialist bioinformaticians or biostatisticians.
Data visualisation of this kind works by projecting high-dimensional data down to lower dimensions, which can then be plotted in 3D on a computer screen, rotated (either manually or automatically) and examined by eye.
With the benefit of instant user feedback on all of these actions, scientists studying human disease can now analyse their findings in real time and in an easy-to-interpret graphical form.
To begin the visualisation process, however, researchers must first reduce their high dimensional data down to lower dimensions so that it can be plotted in 3D.
This is where PCA comes in. It is often used for this purpose because it transforms a number of possibly correlated variables into a set of uncorrelated variables (called principal components), ordered by how much of the variation in the data each one captures.
One of the key breakthroughs in the latest generation of bioinformatics software is the ability to combine PCA with immediate user interaction.
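As a rough illustration of this step, the sketch below reduces a synthetic dataset to three dimensions using NumPy. The function name `pca_project` and the randomly generated `expression` matrix are assumptions made for the example; they are not taken from any particular software package or real experiment.

```python
import numpy as np

def pca_project(data, n_components=3):
    """Project rows of `data` (samples x variables) onto the leading principal components."""
    centred = data - data.mean(axis=0)            # subtract each variable's mean
    # SVD of the centred matrix: the rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T          # samples x n_components scores

# Synthetic stand-in for expression data (hypothetical, not real measurements):
# 50 samples, 200 "genes" driven by 3 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 200))
expression = latent @ mixing + 0.1 * rng.normal(size=(50, 200))

scores = pca_project(expression)                  # 3D coordinates, ready to plot

# The original variables are correlated with one another...
gene_corr = np.corrcoef(expression, rowvar=False)
np.fill_diagonal(gene_corr, 0.0)
max_gene_corr = np.abs(gene_corr).max()

# ...whereas the principal component scores are not.
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))
```

Each row of `scores` gives one sample's position in the 3D plot, and the off-diagonal entries of `score_cov` are zero up to rounding error.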
This approach allows scientists to manipulate different PCA plots - interactively and in real time - directly on the computer screen, with all of their annotations and other links preserved.
Researchers are therefore free to explore different versions of the presented view, and so to visualise and analyse a large dataset interactively.
The high-dimensional nature of many datasets makes direct visualisation impossible, since people can only perceive three spatial dimensions.
The solution is to work with dimension reduction techniques such as PCA.
However, when using PCA to reduce the dimensions of such valuable data, it is important not to lose too much information in the process.
Here, the variation in a dataset can be seen as representing the information that researchers would like to keep.
PCA works well in this regard, as it is a well-established technique for reducing the dimensionality of data, while keeping as much variation as possible.
A meaningful reduction of the dimensionality is possible with PCA because the data is usually not spread evenly across all dimensions. Instead, there are often strong correlations among groups of variables, indicating a certain amount of redundancy in the variable set.
Thus, the true number of underlying factors (representing most of the information in the data) is usually much smaller than the number of measured variables.
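One way to see this redundancy is to look at how much of the total variance each principal component accounts for. The sketch below does this for synthetic data that is deliberately built from only three underlying factors; the data, the variable names and the 90% threshold are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for expression data (an assumption, not real measurements):
# 50 samples, 200 variables driven by only 3 underlying factors plus noise.
factors = rng.normal(size=(50, 3))
loadings = rng.normal(size=(3, 200))
data = factors @ loadings + 0.1 * rng.normal(size=(50, 200))

centred = data - data.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)

# Fraction of the total variance captured by each principal component
explained = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative share reaches 90%
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"components needed for 90% of the variance: {n_needed}")
print(f"variance kept by the first three: {cumulative[2]:.3f}")
```

With real measurements the drop-off is rarely this sharp, but the same cumulative curve shows how much information a 3D view retains.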
PCA is able to achieve this dimension reduction by creating artificial variables called principal components.
Each principal component is a linear combination of the observed variables. If the data has been centred by subtracting the mean value of each variable, the first principal component can be interpreted as the linear combination, with weights of unit length, that has maximal variance.
Each subsequent component then captures as much of the remaining variance as possible while staying uncorrelated with the previously extracted components.
The fact that the different principal components are uncorrelated ensures that they represent different characteristics of the original dataset.
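The maximal-variance property can be checked numerically. The following sketch, which uses made-up data and hypothetical variable names, takes the weights from the eigenvectors of the covariance matrix and compares the variance of the first principal component with that of an arbitrary unit-length combination of the same variables.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data (hypothetical): 100 samples, 5 variables, with variables 0 and 1
# strongly correlated so that there is some redundancy to exploit.
data = rng.normal(size=(100, 5))
data[:, 1] = data[:, 0] + 0.1 * rng.normal(size=100)

centred = data - data.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# Eigenvectors of the covariance matrix hold the weights of each linear
# combination; eigh returns eigenvalues in ascending order, so reverse them.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = centred @ eigvecs[:, 0]                     # first principal component
pc1_var = np.var(pc1, ddof=1)

# An arbitrary unit-length weight vector for comparison
random_w = rng.normal(size=5)
random_w /= np.linalg.norm(random_w)
random_var = np.var(centred @ random_w, ddof=1)

print(f"variance along PC1:              {pc1_var:.3f}")
print(f"variance along random direction: {random_var:.3f}")
```

The variance along the first component equals the largest eigenvalue of the covariance matrix, and no other unit-length combination of the variables can exceed it.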
Even though the exploration and analysis of large datasets can be challenging, tools such as PCA can provide a powerful way of identifying important structures and patterns very quickly. Interactive data visualisation gives the user instant feedback, with results presented as they are generated.
Larger studies, especially those that include multiple samples that need to be analysed on comprehensive array platforms, have traditionally been very time-consuming and have also required a considerable amount of computer power.
As humans, however, we are used to interpreting 3D pictures in our environment and so our brains are able to find structures in complex 3D figures very quickly.
Therefore, it is no surprise that a 3D presentation of complex mathematical/statistical data makes that data much easier for us to interpret.
Already, the latest technological advances in data visualisation are making it easier for scientists to compare the vast quantity of data generated by their studies and to test different hypotheses very quickly.
By building on techniques such as PCA, the latest data analysis tools are helping scientists to regain control of this analysis and to realise the potential of the research being conducted in this area.