Repairing the headers of phylip bioinformatics files to accurately reflect the updated number of sa [Resolved]

I am working with a dataset of PHYLIP files that I have been editing. PHYLIP is a bioinformatics format whose header line gives the number of samples and the sequence length, followed by one line per sample with its name and sequence. For example:

5 10
sample_1 gaatatccga
sample_2 gaatatccga
sample_3 gaatatcgca
sample_4 caatatccga
sample_5 gaataagcga
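
As a minimal sketch of how the header relates to the body, the following shell function compares the declared sample count with the number of sample lines actually present. The name check_phylip_header is hypothetical, and the sketch assumes the simple layout shown above: one header line, then one sample per line, with blank lines ignored.

```shell
# check_phylip_header FILE   (hypothetical helper, not part of any PHYLIP tool)
# Prints the declared and actual sample counts; succeeds only if they match.
check_phylip_header() {
  phy_file=$1
  # The first field of line 1 is the declared sample count.
  phy_declared=$(awk 'NR == 1 { print $1; exit }' "$phy_file")
  # Every non-blank line after the header is counted as one sample.
  phy_actual=$(awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$phy_file")
  printf '%s: declared=%s actual=%s\n' "$phy_file" "$phy_declared" "$phy_actual"
  [ "$phy_declared" -eq "$phy_actual" ]
}
```

Running it on the example above reports declared=5 actual=5; on a file trimmed to three samples it would report declared=5 actual=3 and return a non-zero status.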

My issue is that after trimming these datasets, the sample count in the header is no longer accurate (e.g. the header above might say 5, but I've since trimmed the file down to three samples). I need to replace that sample count with the new, accurate count, but I cannot figure out how to do so without losing the sequence length (e.g. the 10).

I have 550 files, so doing this by hand is not an option. I can run wc in a for loop, but I still need to retain the sequence length and somehow combine it with the new, accurate count.

Question Credit: erikusrex
Asked May 15, 2019
Posted Under: Unix Linux
2 Answers

Another option would be to use ed (of course!):

for f in input*; do
  # Lines minus the header line = current sample count (assumes one sample per line).
  printf '1s/[[:digit:]][[:digit:]]*/%d/\nw\nq\n' "$(( $(wc -l < "$f") - 1 ))" | ed -s "$f"
done

This loops over the files (named, for example input-something) and sends a simple ed-script to ed:

  • on line 1, search and replace (s//) the first run of one or more digits with another number -- that replacement number being the number of lines in the input file minus one (the header line itself)
  • after that, w write the file out and
  • then q quit ed
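
If ed is not available, roughly the same steps can be sketched with awk and a temporary file. This is an alternative technique, not the ed approach above. The function name fix_phylip_header is hypothetical, and the sketch assumes one sample per line after the header. Unlike wc -l, it skips blank lines when counting.

```shell
# fix_phylip_header FILE   (hypothetical helper, awk-based alternative to ed)
# Sets the first field of line 1 to the number of non-blank lines that follow;
# the remaining header fields (the sequence length) are kept.
fix_phylip_header() {
  hdr_file=$1
  hdr_tmp=$(mktemp) || return 1
  # The file is read twice: pass 1 counts samples, pass 2 prints the
  # file with the corrected header.
  awk '
    NR == FNR { if (FNR > 1 && NF > 0) n++; next }
    FNR == 1  { $1 = n + 0 }   # rebuilds the header with single spaces
    { print }
  ' "$hdr_file" "$hdr_file" > "$hdr_tmp" && cat "$hdr_tmp" > "$hdr_file"
  hdr_status=$?
  rm -f "$hdr_tmp"
  return "$hdr_status"
}
```

It could be applied to the same set of files with: for f in input*; do fix_phylip_header "$f"; done. On a file whose header reads 5 10 but which now holds three samples, the header becomes 3 10. Note that assigning to $1 normalizes whitespace in the header line, so any leading padding there is lost.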

credit: Jeff Schaller
Answered May 15, 2019