Alkes Price
The talk is "De Novo Identification of Repeat Families in Large Genomes", presented by Alkes Price. The slides are available here. A repeat family is a collection of similar sequences that appear many times in the genome, e.g. Alu repeats. In principle you pull out the Alu sequences, align them and build a consensus. But we don't know the regions, we don't know the boundaries, and repeats don't appear as full copies, only partial ones. Eddy concludes that the problem is messy.
Why do this? Repeats are biologically meaningful: genome rearrangements, drivers of evolution etc. For pragmatic reasons we need repeat masking. Why? To do comparative genomics. You need to mask repeats before alignment, and RepeatMasker is effective only if you know the library of repeats. So how do you identify the repeat families in large genomes?
In human and mouse many repeats are already known via manual annotation. In other organisms there is a need for de novo identification.
Take an input genome and look for pairwise similarity (mentions an algorithm, Eddy?). Disadvantages of the pairwise approach: computationally intractable (10^6 Alu copies -> 10^12 pairwise alignments) and difficult to determine the boundaries.
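To make the scaling argument concrete, here is a quick back-of-the-envelope check (my own sketch, not from the slides). The function name is made up for illustration, and the copy number is the rough 10^6 figure quoted in the talk.

```python
def pairwise_alignment_count(n_copies: int) -> int:
    """Number of unordered pairs among n_copies sequences: n * (n - 1) / 2."""
    return n_copies * (n_copies - 1) // 2


alu_copies = 10**6  # rough Alu copy number quoted in the talk
pairs = pairwise_alignment_count(alu_copies)
print(f"{pairs:.2e}")  # prints 5.00e+11
```

Counting unordered pairs gives about 5 x 10^11, and counting ordered pairs gives about 10^12, which matches the order of magnitude on the slide. Either way the work grows quadratically with the number of copies.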
Their algorithm is RepeatScout. The speaker is going through the algorithm.
Idea: greedily extend 1 bp at a time from a short l-mer seed, both left and right. Discard a sequence when it stops aligning to the consensus. Now he's talking about repeat boundaries; they deal with repeat boundaries by optimizing an objective function (see the slides for the function).
"Our algorithm is optimizing an objective function, this is different from most other algorithms. We are heuristically optimizing the function." This guy really likes the "objective function".
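A toy sketch of the seed-and-extend idea, as I understood it. This is NOT RepeatScout's actual method: it uses a plain majority vote with a mismatch cap and no gaps instead of their objective function. All function names and the `max_mismatches` / `min_support` parameters are my own assumptions.

```python
from collections import Counter


def find_seed_hits(genome, seed):
    """Start positions of every exact occurrence of the seed."""
    hits, start = [], genome.find(seed)
    while start != -1:
        hits.append(start)
        start = genome.find(seed, start + 1)
    return hits


def extend_right(genome, hits, seed, max_mismatches=1, min_support=2):
    """Grow a consensus to the right of the seed, one base at a time."""
    consensus = seed
    active = {h: 0 for h in hits}  # seed start -> mismatches so far
    while True:
        offset = len(consensus)
        # Base each still-active copy proposes for the next position.
        votes = {h: genome[h + offset] for h in active
                 if h + offset < len(genome)}
        if len(votes) < min_support:
            break
        base, count = Counter(votes.values()).most_common(1)[0]
        if count < min_support:
            break  # copies disagree: treat this as the repeat boundary
        # Assumed stand-in for "stops aligning to the consensus":
        # drop a copy once it exceeds the mismatch cap.
        active = {h: active[h] + (b != base) for h, b in votes.items()
                  if active[h] + (b != base) <= max_mismatches}
        consensus += base
    return consensus


def extend_both(genome, seed, **kw):
    """Extend right, then extend left by running on the reversed genome."""
    hits = find_seed_hits(genome, seed)
    right = extend_right(genome, hits, seed, **kw)
    rev_hits = [len(genome) - (h + len(seed)) for h in hits]
    left = extend_right(genome[::-1], rev_hits, seed[::-1], **kw)[::-1]
    return left[:-len(seed)] + right


# Three copies of a made-up repeat, each with different flanking bases.
repeat = "CATGACGTAC"
genome = "T" + repeat + "AG" + repeat + "GC" + repeat + "T"
print(extend_both(genome, "GACG"))  # prints CATGACGTAC
```

In this toy the boundary is simply where the copies stop agreeing. As far as I could follow, the real algorithm instead picks the boundary by scoring candidate extensions with the objective function.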
Post-processing. There is potentially a lot of "junk" that the algorithm may pull out, which requires post-processing of the results. They ran their software on the X chromosome. The Recon algorithm gives different results (paper). They benchmarked their algorithm against Recon using the C. briggsae genome. He shows a Venn diagram of the overlap between the results of the two algorithms: 4.8 MB of repetitive sequence was identified by RepeatScout but not by Recon. Now the results for human and mouse. This is a test of the library generated by RepeatScout vs. the manually curated library used with RepeatMasker.
Their algorithm is fast: linear time. Recon is quadratic time (due to the use of pairwise alignment).
Questions:
You don't seem to assume false positives? (I didn't get the answer.)

