Medline XML to database parser?

I recently downloaded Medline in XML format - the goal is to load it in a relational database (like mySQL), index it, and then somehow save the world with the data. I'm pretty sure tons (relatively speaking) of people have done the same thing before (except maybe the save the world part), and I'd prefer not to reinvent the wheel if I don't have to.

Anyone know of a good XML->database parser for Medline? If not I guess I'll code one myself! Indexing tips / advice would also be appreciated (first time I'm playing with a 50+ GB database). I heard Lucene is the bomb for indexing such a large database... but my database experience is mainly mySQL up so far.


Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

LingPipe's MEDLINE XML parser + DB + search demo

LingPipe (which can be downloaded with source from our site -- see links below) contains a parser for the 2006 MEDLINE distribution. It works through a document-object model, creating Java objects for MEDLINE citations. It covers the entire MEDLINE DTD. We'll update it as soon as the 2007 data replaces the 2006 data; the format changes slightly every year.

We also provide a program that will automatically download MEDLINE from NLM's FTP site. It checks the checksums and retries if necessary. It can be run as a scheduled job to get the daily updates, too; it only downloads what doesn't already match the checksums.

We have a tutorial on using the parser to create a Lucene (search engine) index from the data, and another tutorial on using the parser to create a MySQL DB. The latter includes complete from-scratch MySQL installation instructions.

In our tutorial on MEDLINE, we show how to load data from MEDLINE into MySQL using our parser. It's easily extendible, but if you're looking for something that off-the-shelf that puts all of MEDLINE data into a search engine or DB, this only cherry-picks the most useful information.

The DB tutorial also shows how to run some of our natural language processing tools to do things like break the text into sentences and find references to entities of various kinds (e.g. protein subdomains, malignancies, variations such as SNPs or mutations, organisms, cell lines, etc.)

Here are the links:

LingPipe Home Page (Description and download links)

LingPipe Tutorial: MEDLINE Parsing and Indexing (Download MEDLINE, parse MEDLINE and index MEDLINE in Lucene)

LingPipe DB Text Mining Tutorial (Parse MEDLINE, load MEDLINE into MySQL and run natural language text mining tools over MEDLINE)


Sedna, a native XML database

I guess I could add my experiences with Medline database here, too. My website's (ebounce.info) PubMedTool -tool is currently using E-utilities offered by NCBI for querying and fetching data, but offline I've been making tests with native XML databases.

Most of the native XML databases practically choke, because of the huge amount of data. Sedna is one of those, which I found to handle things nicely. I mean, if one doesn't require immediate results and have time to wait for few minutes.

Usual weaknessies, which become clear in first 5 minutes:
- out of memory errors
- not possible to join databases or query multiple databases at the same time
- (awful) slowness of indexing of xml-files and actual querying
- ..

So far, I've mostly been using Sedna for playing with XQueries. However, Sedna doesn't support all the XQuery-functions yet, but so far it hasn't been a big problem. Also, while learning more about Sedna, I've started to miss different kind of indexing methods, XSLT support and few other things. It does have a API for few languages (C, Java, Python..).

Sedna is a good choice, if you don't need to query Medline in full, but maybe just last 10 years or so. Otherwise you'd get bored of waiting. And for querying smaller databases like MeSH (240Mb), it is a really good choice.

I've heard that Berkeley DB XML might also be worth a try, so I'll try it next. Again. Earlier I had some problems with it (I've already forgot, what it was).


XML database performance

Hi, I'm a Sedna developer.

Marko, thank you for reporting on Sedna!

Theoretically, Sedna can store data sets of unlimited size. Practically, we have tested it successfully with 150 Gb of XML data as a single document (!) and as collections of documents.

As concerns speed of query execution, it should be ok for 50+ Gb if you configure indexes properly. Sedna is not too fast for data loading but it usually works very good for querying. It would be interesting to find out why Marko found Sedna awfully slow for indexing and querying.

Errors like "out of memory" might happen but they are bugs (not fundamental Sedna limitations) which we will fix if you contact us. We are always avaliable for support, questions and bugfixes.


This is quite late reply

This is quite late reply from me, but I wanted to add that I wasn't referring to Sedna with that "Usual weaknessies" -list, but to some others yes. I've never had out of memory problems with Sedna, for example.


XML database performance is

XML database performance is very meh I think, and I need split second fast queries. Right now I'm having fun with PHP::SimpleXML. It loads XML as an object in memory and then you can do fun things like foreach ($xml->MedlineCitation as $citation) { do stuff here }. Works wonderfully. 30000 records inserts in mySQL in 25 seconds... so it should take 5-10 hours loading the entire thing in. And then maybe a day to index it?


PyLucene

I have a Lucene index of the MEDLINE data, created with PyLucene - it's what HubMed uses for relevance-ranked searches. If you still want some help with the indexing let me know.


Mapping XML2RDB

There are a few Perl tools around for parsing XML and mapping it to a RDB. XML.com is a good place to start.

Are you sure that you really want to do this? 50 GB of Medline XML sounds like a living hell to me. I assume this is some sort of text mining effort?


XML looks easy

It's for my next web project (as biologynews.net is kinda finished, for now). I'm building something useful (I hope) for the biologists community. Still in the preliminary stages, but big databases don't scare me :) It could take time to build as I'm still busy with my PhD (wetlab and all, microarrays and HIV, tons of fun) and can only work by night / weekends on it. My goal is to have something functional by the end of the year.


I would use XSLT

Most time I need to parse this kind of XML, I use a XSLT transformation (either directly from xsltproc or using the JAVA transformation API or PHP).
For example to insert a Medline in a DB I would use:


<?xml version='1.0' ?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>
<xsl:output method='text'/>

<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>

<!-- =================================================================== -->
<!-- =================================================================== -->
<!-- =================================================================== -->

<xsl:template match="PubmedArticleSet">
use database DB;
<xsl:apply-templates select='PubmedArticle'/>
</xsl:template>

<!-- =================================================================== -->
<!-- =================================================================== -->
<!-- =================================================================== -->

<xsl:template match="PubmedArticle">
insert ignore into article(PMID,title) values
(
<xsl:value-of select='MedlineCitation/PMID'/>,
&apos;<xsl:value-of select="MedlineCitation/Article/ArticleTitle"/>&apos;
);

set @pmid:= select uid from article where pmid=<xsl:value-of select='MedlineCitation/PMID'/>;
<xsl:apply-templates select='MedlineCitation/Article/AuthorList' />
</xsl:template>

<!-- =================================================================== -->
<!-- =================================================================== -->
<!-- =================================================================== -->
<xsl:template match="AuthorList">
<xsl:apply-templates select='Author' />
</xsl:template>

<!-- =================================================================== -->
<!-- =================================================================== -->
<!-- =================================================================== -->
<xsl:template match="Author">
insert ignore into author(forename,lastname) values
(
&quot;<xsl:value-of select='ForeName' />&quot;,
&quot;<xsl:value-of select='LastName' />&quot;
);

set @authorid:= select uid from author where
forename=&quot;<xsl:value-of select='ForeName' />&quot; and
lastname=&quot;<xsl:value-of select='LastName' />&quot;
;

insert into paper2author(pmid,athorid) values (@pmid,@authorid);

</xsl:template>

Not checked, not SQL-escaped but you get the idea...

Pierre


Thanks for posting the code

Thanks for posting the code snippet, however the code formatting for the comments is less than ideal. If you want to preserve this maybe post it to the wiki, the syntax highlighting feature is infinitely better...

While XSLT maybe your weapon of choice for the small chunks of XML that get returned from NCBI's EUtils, 50GB of data is probably going to make it explode (XSLT builds an in memory model of the XML file, right ?). I've tried running XSLT on the XML serialization of dbSNP, with less than spectacular results. That is unless the Medline database is split into smaller chunks ? Maybe I'm just missing the point here...

The largest amount of XML formatted data I've had to deal with was the UniProt database serialized as RDF/XML (10 GB). At the time I used the XMLReader interface (via python) from libxml2 to extract a subset of the data for this article (which is poorly written I might add and needs updating).


about the file size

Oh, Sorry, I thought the post was about chunks of pubmed articles (such as the one you can fetch via EUtils).

I remember one day I inserted some data from dbSNP in a database. This time I used a lex/yacc (the grammar was trivial) to parse the XML and to quickly insert the records.

Pierre


The medline database is

The medline database (in XML format) is broken in ~150 megs chunks, so it's not that bad I guess.


I don't have an answer to

I don't have an answer to your question, except roll-your-own solution maybe. And I'm sure that something like this has been asked before on nodalpoint (I don't have time to search now). However I am curious about a few points.

I remember a postdoc at UNSW (Rohan Tesdale ?) had the whole thing sent to him on DAT tapes and that was only a few years ago. Maybe it is DVDs now. Also do you have to sign a licence agreement to get the data ?

I assume 50 gig means the entire database of 20 million abstracts ? I don't think it matters how you do it, it is still going to requre some serious hardware support (RAM).

Also for future reference, questions like this are best posted in the forums. No big deal, for the moment I'll let this through as note to those who read this and also want to post questions.


Rohan Williams

Rohan Williams (R Teasdale is at the IMB). Last I heard, that copy was lying around on a hard drive somewhere - the paperwork to get it released from USyd to UNSW across the city was stupendous!


My mistake (slap me if you

My mistake (slap me if you will), my memory must be declining with age.


Yeah you have to sign a

Yeah you have to sign a license agreement to get it; you even have to provide a specific ip adress to access their ftp. DLT tapes are an option, but next year it will be ftp only; no DVD format :( The database compressed is only 7 GB though, so on a good connection it took about a day.

I guess I'll roll my own solution :) I have yet to try Lingpipe to automatically download daily updates.

I almost used the forum, but the 'hey there's only 1 post there no one might read it' syndrome kicked in, so I blogged about it :) I will next time for sure.