Projects

A follow-up on DBGWAS. We use the structure of the de Bruijn graph to define new genomic variants (e.g. gene presence) that accommodate polymorphisms. This amounts to defining one variant for each connected subgraph. We rely on a reverse search strategy to enumerate variants efficiently, and on the Tarone strategy to control the family-wise error rate (FWER) when testing their association with a bacterial phenotype.

CALDERA
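To make the Tarone strategy mentioned above concrete, here is a minimal sketch in Python. It is not CALDERA's implementation: it assumes a one-sided Fisher exact test on binary presence/absence patterns, and the function names `min_attainable_pvalue` and `tarone_threshold` are hypothetical. The idea is that a discrete test cannot reach arbitrarily small p-values, so patterns whose smallest attainable p-value exceeds the corrected threshold can never be rejected and need not be counted in the correction.

```python
from math import comb


def min_attainable_pvalue(x, n1, n):
    """Smallest p-value a one-sided Fisher exact test can reach for a
    pattern present in x of n samples, n1 of which are cases.
    (Assumption: one-sided test for enrichment among cases.)"""
    a_max = min(x, n1)  # most extreme table: as many carriers as possible are cases
    return comb(n1, a_max) * comb(n - n1, x - a_max) / comb(n, x)


def tarone_threshold(min_pvalues, alpha=0.05):
    """Return (k, testable): the smallest k such that at most k patterns
    have a minimum attainable p-value <= alpha / k, and their indices.
    Testable patterns are then tested at level alpha / k."""
    for k in range(1, len(min_pvalues) + 1):
        testable = [j for j, p in enumerate(min_pvalues) if p <= alpha / k]
        if len(testable) <= k:
            return k, testable
    return 1, []  # only reached when there are no patterns at all


# Toy example: 20 samples, 10 cases, five patterns carried by
# 1, 2, 10, 19 and 5 samples respectively.
carriers = [1, 2, 10, 19, 5]
min_p = [min_attainable_pvalue(x, 10, 20) for x in carriers]
k, testable = tarone_threshold(min_p, alpha=0.05)
print(k, testable)
```

In this toy example the correction factor is 2 rather than the Bonferroni factor of 5, because three of the five patterns are too rare or too frequent to ever reach significance.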

Convolutional kernel networks

We analyze convolutional and recurrent neural networks for biological sequences in the framework of positive definite kernels. In addition to providing a new interpretation of the genomic features underlying these networks, this analysis yields a version that is better regularized and performs better when little data is available. We also extended this work to graph-structured data.

Dexiong Chen, Laurent Jacob

Convolutional kernel networks

We describe the variation in bacterial genomes through their content in short sequences (k-mers). This avoids the bias usually caused by focusing on pre-defined SNPs or gene lists. We use this representation to pinpoint the genomic variation associated with antimicrobial resistance, and rely on a so-called de Bruijn graph to help interpret the identified k-mers in terms of SNPs or mobile genetic elements.

DBGWAS
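The k-mer representation and the de Bruijn graph described above can be illustrated with a short Python sketch. This is a simplified illustration, not the DBGWAS code: it ignores reverse complements and graph compaction, and the helper names `kmers`, `de_bruijn_edges` and `presence_matrix` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_edges(genomes, k):
    """Nodes are k-mers; an edge a -> b means b directly follows a
    (overlapping by k-1 letters) in at least one genome."""
    edges = defaultdict(set)
    for genome in genomes:
        ks = kmers(genome, k)
        for a, b in zip(ks, ks[1:]):
            edges[a].add(b)
    return edges


def presence_matrix(genomes, k):
    """One row per genome, one 0/1 column per distinct k-mer."""
    per_genome = [set(kmers(g, k)) for g in genomes]
    all_kmers = sorted(set().union(*per_genome))
    rows = [[int(km in s) for km in all_kmers] for s in per_genome]
    return all_kmers, rows


# Two toy genomes differing by a single substitution (T vs G).
genomes = ["ACGTACC", "ACGGACC"]
edges = de_bruijn_edges(genomes, k=3)
names, rows = presence_matrix(genomes, k=3)
print(sorted(edges["ACG"]))
print(names)
print(rows)
```

The single substitution shows up as a fork at node `ACG` whose two branches rejoin at `ACC`. The presence/absence columns are the features that would be tested for association with the phenotype.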

We train neural networks on sets of homologous sequences simulated from probabilistic evolution models to infer phylogenetic trees. We use self-attention to obtain permutation-equivariant functions and ensure that the inferred tree does not depend on the order in which the sequences are provided. Our hope is to perform inference that is orders of magnitude faster than state-of-the-art maximum likelihood methods, and as accurate or even more accurate under models with intractable likelihoods.

Phyloformer
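The permutation-equivariance property mentioned above can be checked numerically. The sketch below is a generic single-head self-attention layer in pure Python, not the Phyloformer architecture. The weight matrices, dimensions and function names are illustrative assumptions. It verifies that permuting the input rows (the sequences) permutes the output rows in the same way.

```python
import math
import random


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(wq[0]))
    scores = matmul(q, [list(c) for c in zip(*k)])  # q times k transposed
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


rng = random.Random(0)
n, d = 4, 3  # 4 "sequences", each embedded in 3 dimensions (toy sizes)
x = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
wq, wk, wv = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))

perm = [2, 0, 3, 1]
out = self_attention(x, wq, wk, wv)
out_perm = self_attention([x[i] for i in perm], wq, wk, wv)

# Row j of the permuted output should equal row perm[j] of the original output.
equivariant = all(
    math.isclose(a, b, abs_tol=1e-9)
    for j, i in enumerate(perm)
    for a, b in zip(out_perm[j], out[i])
)
print(equivariant)
```

The check relies only on the fact that attention weights depend on pairs of rows and not on their positions. Adding positional encodings would break the property.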

Convolutional neural networks on biological sequences are known to learn probabilistic motifs associated with the target phenotype. To go beyond this informal interpretation, we provide a statistical test quantifying the motif-phenotype association. This requires solving a post-selection inference problem, where the selection involves the infinite set of possible motifs.

Antoine Villié, Laurent Jacob