Journal Publication

Finding Maximal Exact Matches in Graphs

We show an $O(n\cdot L \cdot d^{L-1} + m + M_{\kappa,L})$-time algorithm for finding all $\kappa$-MEMs (maximal exact matches of length at least $\kappa$) between a query string $Q$ and a labeled graph $G$ that span exactly $L$ nodes in $G$, where $n$ is the total length of node labels, $d$ is the maximum degree of a node in $G$, $m = |Q|$, and $M_{\kappa,L}$ is the number of output MEMs.
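To make the notion of a $\kappa$-MEM concrete, the following is a minimal brute-force sketch for the simplest special case, where the graph is a single path and its labels spell one plain string `t`. It is not the algorithm summarised above and makes no attempt at its running time; the function name `find_mems` and the 0-based `(i, j, length)` output format are assumptions made for illustration.

```python
def find_mems(q, t, kappa):
    """Return all maximal exact matches of length >= kappa between q and t.

    Each match is reported as (i, j, length), meaning
    q[i:i+length] == t[j:j+length], and the match cannot be
    extended by one character to the left or to the right.
    """
    mems = []
    for i in range(len(q)):
        for j in range(len(t)):
            if q[i] != t[j]:
                continue
            # Skip starting points that are not left-maximal.
            if i > 0 and j > 0 and q[i - 1] == t[j - 1]:
                continue
            # Extend to the right as far as possible (right-maximality).
            length = 0
            while (i + length < len(q) and j + length < len(t)
                   and q[i + length] == t[j + length]):
                length += 1
            if length >= kappa:
                mems.append((i, j, length))
    return mems


print(find_mems("abcab", "cabca", 2))
```

Raising `kappa` only filters the output, which is why limiting attention to $\kappa$-MEMs reduces the number of matches passed on to later stages.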

Width Helps and Hinders Splitting Flows

We show that, for acyclic graphs, considering the width of the graph yields advances in our understanding of the approximability of Minimum Flow Decomposition (MFD). For the version of the problem that uses only non-negative weights, we identify and characterise a new class of width-stable graphs, for which a popular heuristic is an $O(\log Val(X))$-approximation ($Val(X)$ being the total flow of $X$), and strengthen its worst-case approximation ratio from $\Omega(\sqrt{m})$ to $\Omega(m/\log{m})$ for sparse graphs. We also study a new problem on graphs with cycles, Minimum Cost Circulation Decomposition (MCCD), and show that it generalises MFD through a simple reduction. For the version that also allows negative weights, we give a $(\lceil \log ||X|| \rceil + 1)$-approximation ($||X||$ being the maximum absolute value of $X$ on any edge) using a power-of-two approach, combined with parity fixing arguments and a decomposition of unitary circulations ($||X||\le 1$), using a generalised notion of width for this problem.

A Safety Framework for Flow Decomposition Problems via Integer Linear Programming

We propose the first method for computing all safe solutions for an NP-hard problem, *minimum flow decomposition*. We obtain our results by developing a *safety test* for paths based on a general Integer Linear Programming (ILP) formulation. Moreover, we provide implementations with practical optimizations aimed at reducing the total ILP time. Experimental results on the transcriptome datasets of Shao and Kingsford (TCBB, 2017) show that all safe paths for minimum flow decompositions correctly recover up to 90% of the full RNA transcripts, which is at least 25% more than previously known safe paths. Moreover, despite the NP-hardness of the problem, we can report all safe paths for 99.8% of the over 27,000 non-trivial graphs of this dataset in only 1.5 hours.

Improving RNA Assembly via Safety and Completeness in Flow Decompositions

We give the first *local* characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the *complete* set of safe paths. Additionally, we evaluate our algorithm on RNA transcript datasets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers [TCBB 2021] and the popular heuristic *greedy-width*. Our safe and complete algorithm outperforms (by $\approx 20\%$) greedy-width on a unified metric (F-Score) considering both coverage and precision when the evaluated dataset has a significant number of complex graphs.
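For readers unfamiliar with the greedy-width baseline mentioned above, here is a minimal sketch of the heuristic as it is commonly described: repeatedly take the source-to-sink path with the largest bottleneck in the remaining flow and subtract it. The input format (a dict from edges to positive integer flows on a DAG that satisfies flow conservation, with distinct `source` and `sink`) and the function names are assumptions for illustration, not the implementation evaluated in the paper.

```python
def widest_path(residual, source, sink):
    """Return (bottleneck, path) of a maximum-bottleneck source-sink path.

    Returns (0, None) if no path with positive flow remains.
    """
    out = {}
    for (u, v), f in residual.items():
        out.setdefault(u, []).append((v, f))
    memo = {}

    def best(u):
        if u == sink:
            return (float("inf"), [sink])
        if u in memo:
            return memo[u]
        result = (0, None)
        for v, f in out.get(u, []):
            width, path = best(v)
            if path is not None and min(f, width) > result[0]:
                result = (min(f, width), [u] + path)
        memo[u] = result
        return result

    return best(source)


def greedy_width(flow, source, sink):
    """Decompose a flow into weighted paths, widest path first."""
    residual = {e: f for e, f in flow.items() if f > 0}
    decomposition = []
    while True:
        width, path = widest_path(residual, source, sink)
        if path is None:
            break
        # Subtract the path's bottleneck from every edge on the path.
        for e in zip(path, path[1:]):
            residual[e] -= width
            if residual[e] == 0:
                del residual[e]
        decomposition.append((width, path))
    return decomposition


example = {("s", "a"): 5, ("s", "b"): 3, ("a", "t"): 4,
           ("a", "b"): 1, ("b", "t"): 4}
for weight, path in greedy_width(example, "s", "t"):
    print(weight, path)
```

The heuristic always returns a valid decomposition, but it offers no guarantee that any individual path it reports belongs to the true set of transcripts, which is the gap that safe paths are meant to address.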

Safety in multi-assembly via paths appearing in all path covers of a DAG

We show how to compute maximal safe paths for constrained path covers, with applications to multi-assembly. Our experiments in transcript assembly show that maximal safe paths are very precise and cover 70% of transcripts.

Faster repetition-aware compressed suffix trees based on block trees

We present a new repetition-aware compressed suffix tree. It is slightly larger than state-of-the-art alternatives, but outperforms them in time, often by orders of magnitude.

Block Trees

Let $S$ be a string of length $n$ whose Lempel-Ziv parse has $z$ phrases. We introduce a data structure, the block tree, that represents $S$ in $O(z \log(n/z))$ space and extracts any symbol of $S$ in time $O(\log(n/z))$, among other space-time tradeoffs. The structure also supports other queries that are useful for building compressed data structures on top of $S$.