Content Creation and Text Processing

Liam Quin from W3C has given a few useful tips on processing documents (e.g. error-prone retyped or scanned text) into XML.

Many of these practices are important for the sort of text processing tasks that seem to come up in bioinformatics.

Article summary: use lots of small one-off scripts to make small changes, continually validate your output, briefly document your steps, automate the steps with a meta-script or Makefile, and keep input and output text separate (... well, duh!).
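
To make that summary concrete, here is a rough sketch of what such a meta-script might look like in Python. It is not taken from the article: the two step functions, the STEPS list and run_pipeline are hypothetical names invented for illustration, and "validate" here only means checking XML well-formedness with the standard library, not validating against a DTD or schema.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path


def fix_ampersands(text):
    """Small one-off fix: escape bare '&' left behind by retyping or OCR."""
    # Leave existing entity and character references untouched.
    return re.sub(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)", "&amp;", text)


def strip_trailing_whitespace(text):
    """Small one-off fix: drop trailing whitespace on every line."""
    return "\n".join(line.rstrip() for line in text.splitlines()) + "\n"


# The ordered list of steps doubles as brief documentation of what was done.
STEPS = [fix_ampersands, strip_trailing_whitespace]


def is_well_formed(text):
    """Cheap validation: does the text parse as XML at all?"""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def run_pipeline(in_dir, out_dir):
    """Apply every step to each *.xml file in in_dir and write to out_dir.

    Input files are only read, never modified, so input and output stay
    separate. Returns a dict mapping file name to a well-formedness flag.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for src in sorted(in_dir.glob("*.xml")):
        text = src.read_text(encoding="utf-8")
        for step in STEPS:
            text = step(text)
        results[src.name] = is_well_formed(text)
        (out_dir / src.name).write_text(text, encoding="utf-8")
    return results
```

Calling run_pipeline("raw", "cleaned") would leave everything in raw/ untouched and report which cleaned files still fail to parse. In practice each step might just as well be its own tiny script driven by a Makefile, as the article suggests.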


Comments


scripting != programming

I like this part: "Before embarking on writing scripts, you need one of two things: the right frame of mind to write a script, or someone else to do it for you. In either case, the frame of mind is very different from what's needed for making a product, so not all professional programmers are good at doing this until they understand the differences."


clunky

Shouldn't we dream of a more streamlined approach, where you don't need glued-together ad-hocery to process data?


ad hoc scripts vs. integrated systems

Yes, I prefer the *idea* of a streamlined approach (just because I posted that article doesn't mean I agree with what it says :) ).

I think if the text being processed doesn't have a defined format, and it is a one-off task, an ad hoc set of tools (which occasionally may be reusable) is often the way to go. Luckily, most raw bioinformatics data has a defined format, making a streamlined approach much more sensible.

I guess the question you have to ask yourself is: am I ever likely to use this script again?
(Or am I feeling altruistic, and will someone else use it after me even if I never need it again?)

I guess it is the streamlined approach that is slowly emerging from BioPython, BioPerl and the other Bio* projects, which is nice.


clunky ^ 2

As Jim Kent puts it: It's safer on the lagging edge.