The paper proposes a new algorithm to infer the spatial and temporal evolution of genetic mutations in cancers from cross-sectional mutational data. Cancers are the result of multiple genetic mutations that badly influence the behavior of the new cells. More in detail, cancer mutations target genetic pathways, a set of molecular regulator that govern the expression levels of genes. Understanding which are the "driving" mutations of cancers and what is the order in which they appear is of major interest to fight the disease. The algorithm provided is based on the experimentally supported assumption that each pathway is targeted by at most one driver mutations.
The algorithms maps a set of observation in a mutation matrix where each row represents a sample (individual), each column represents a gene and each elements is either 1 or 0 depending on weather a mutation has happened on that gene of that sample or not. The genes are then grouped in partitioning sets. The aim is to find the partitioning set which most closely satisfies the Pathway Linear Progression Model (PLPM). A Matrix M satisfies PLPM if each row has at most one mutation per partition, and if a partition Pk has a mutation, then all partitions Pj with j < k have a mutation has well.
The resulting ordered partitioning then provides a representation of the order in which the mutations have appeared, where each partitioning set is a pathway.
The algorithm itself tries to find the optimal partitioning by minimizing the difference of the observed matrix from randomly constructed matrix that satisfies PLPM on random partitions. Such delta is computed as the number of elements of the matrix that have to be flipped to reach the observed matrix.
From a computational point of view, the flips represent the passenger mutations, those mutations who randomly happen and have nothing to do with the cancer.
In order to prove the correctness of the approach, the authors have tested the algorithm on several matrices showing different probabilities of observing passenger mutations. As shown in figure (a), the success rate for this model is about 95% when the number of samples considered is larger than 1000. As a consequence, we need cancer studies of larger size compared to the ones we have at our disposal nowadays when dealing with model of this size. However, figure (b) shows that by slightly reducing the size of the model, we can achieve good result with smaller dataset and, in particular, dataset of the size that researchers have at their disposal. Moreover the algorithm has been proved to be robust to the inclusion of genes not related to the progression.
Apart from the theoretical proofs shown in the graphs, the authors applied their approach to 3 real dataset: one related to glioblastoma multiforme and one related to colorectal cancer.
- Algorithm explained in depth
- Results supported with real data
- Comparison with data obtained from other paper
- The paper obtains some theorems without providing any kind of demonstrations
- Biological meaning of the mathematical model not exhaustive
- Formulas within the text are sometimes difficult to read. Putting every formula on its line could be more effective
- Real data analysis is very technical without enough comments