fastq file modification help

on 06.06.2011 12:29:36 by Natalie Conte

Hi,

I need to remove the first 52 bp of each sequence read in a fastq
file; the sequence is on line 2.
The fastq format, from Wikipedia: A FASTQ file normally uses four lines per
sequence. Line 1 begins with a '@' character and is followed by a
sequence identifier and an /optional/ description. Line 2 is the raw
sequence letters. Line 3 begins with a '+' character and is /optionally/
followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence.

A minimal FASTQ file might look like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

I have written this script to remove the first 52 bp of each sequence
and write the new line to newfile.txt. It seems to do the job,
but what I need is to change my original fastq file so that it has the trimmed
sequence lines and keeps the other lines the same. I am not sure where to
start to modify the original fastq.
This is my script to trim my sequence:

#!/software/bin/perl
use warnings;
use strict;

open (IN, "/file.fastq") or die "can't open in:$!";
open (OUT, ">>newfile.txt") or die "can't open out: $!";

while (<IN>) {
next unless (/^[A-Z]/);
my $new_line=substr($_,52);
print OUT $new_line;

}

thanks for any suggestions
Nat

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: fastq file modification help

on 06.06.2011 12:57:17 by Rob Coops

Hi Nathalie,

I am not 100% sure about this, as I suspect that when modifying the original
file you also want to deal with that 4th line. In that case I have no idea
how to handle it, as I do not understand what its purpose is.

Anyway, assuming that you are dealing with the 2nd line only, life is a
whole lot simpler.
You know the number of characters you are removing is always 52, so we
don't have to work that out. Now we can take various routes: we
could take all characters from offset 52 to the end of the string (substr puts
the first character at position 0, not position 1 ;-) or we could simply
cut out the first 52 characters. We could do the cutting using a
regular expression or we could use substr for this purpose (I have no idea
which one is faster; please benchmark that if you are going to run a large
number of such operations, as it could save you a lot of time ;-)

Using substr to do all the work (the 4-argument form edits the string in
place and returns the part it removed): my $new_line2 = $_; substr $new_line2, 0, 52, "";
Using a regular expression to do the work: my $new_line3 = $_; $new_line3 =~
s/^[A-Z]{52}//;
Doing the counting thing...: my $new_line4 = substr $_, 52, length $_;

All three will give you the result you are looking for. I suspect that the
first one will be the fastest option, based on what little experience I have
with these types of operations, but please verify this before you start
working on thousands of files...
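
For reference, here is a minimal sketch that runs the three approaches side by side on the sample read from the question. The sub names and the $CUT constant are made up for illustration, and it assumes the sequence line has been chomped and is at least 52 uppercase letters long.

```perl
use strict;
use warnings;

my $CUT = 52;   # number of leading bases to drop (from the question)

# 4-argument substr: deletes the prefix from a copy, in place.
sub trim_substr4 {
    my ($seq) = @_;
    substr $seq, 0, $CUT, "";
    return $seq;
}

# Anchored regex: removes exactly $CUT leading uppercase letters.
sub trim_regex {
    my ($seq) = @_;
    $seq =~ s/^[A-Z]{$CUT}//;
    return $seq;
}

# 2-argument substr: everything from offset $CUT to the end.
sub trim_substr2 {
    my ($seq) = @_;
    return substr $seq, $CUT;
}

my $read = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT";
print "$_\n" for trim_substr4($read), trim_regex($read), trim_substr2($read);
```

On the 60-base sample read, all three should print the last 8 bases, CACAGTTT. To compare speed, the core Benchmark module (for example its cmpthese function) could be pointed at these three subs.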

Regards,

Rob


Re: fastq file modification help

on 06.06.2011 18:25:13 by Raymond Wan

Hi Nathalie,

On Mon, Jun 6, 2011 at 19:29, Nathalie Conte wrote:
> I need to remove the first 52 bp sequences reads in a fastq file,sequence is
> on line 2.
> fastq file from wikipedia:A FASTQ file normally uses four lines per
> sequence. Line 1 begins with a '@' character and is followed by a sequence
> identifier and an /optional/ description. Line 2 is the raw sequence
....
> #!/software/bin/perl
> use warnings;
> use strict;
>
>
> open (IN, "/file.fastq") or die "can't open in:$!";
> open (OUT, ">>newfile.txt") or die "can't open out: $!";
>
> while (<IN>) {
> next unless (/^[A-Z]/);
> my $new_line=substr($_,52);
> print OUT $new_line;
>
> }

I frequently play with FASTQ data but basically Rob has said it all.
Since FASTQ has a fixed format, you should use that to your advantage
by taking in the 4 lines at a time and then processing them as needed.

However, what you have above does not work because the fourth line
(the quality scores) can also contain A-Z. (Of course, if you are
trimming 52 bases from sequences, you probably want to trim 52 from
the quality scores, too. But that's a separate issue...)
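
To illustrate the point, here is a small sketch that compares the /^[A-Z]/ test with picking the sequence line by its position in the record. The helper name is_sequence_line and the toy record are invented for this example, and it assumes strict four-line records.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the 1-based line number is the
# sequence line, i.e. line 2 of a four-line record.
sub is_sequence_line {
    my ($line_no) = @_;
    return $line_no % 4 == 2;
}

# Toy record whose quality line happens to start with a capital letter.
my @lines = ('@SEQ_ID', 'GATTACA', '+', 'CCFFFHH');

for my $i (0 .. $#lines) {
    my $by_regex    = $lines[$i] =~ /^[A-Z]/     ? 1 : 0;
    my $by_position = is_sequence_line($i + 1)   ? 1 : 0;
    print "line ", $i + 1, ": regex=$by_regex position=$by_position\n";
}
```

Here the regex also flags line 4, while the position test only flags line 2. When reading from a filehandle, the built-in variable $. holds the current line number and could be passed to the same test.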

I would suggest you create a loop that loops through the file and
takes in the four lines. Then check that line 1 starts with a @ and
line 3 starts with a +. Then compare the lengths of line 2 and 4 to
make sure they're equal. If all checks out, then do the trimming that
Rob suggests.

The FASTQ standard technically allows lines 2 and 4 to span multiple
lines -- so the sanity check above is a good idea if you want to make
your script flexible. But sometimes, you may know for certain this
does not occur in your data; if so, then you can skip this sanity
check.
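
The loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: every record is exactly four lines (no wrapped sequence or quality lines), the same number of characters is cut from the sequence and the quality line, and the sub names trim_record and trim_fastq are invented here. The demo uses in-memory filehandles so it runs without any files.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the four chomped lines of one record and
# returns them with the first $cut characters removed from the sequence
# and quality lines. Dies if the record fails the sanity checks.
sub trim_record {
    my ($cut, $id, $seq, $plus, $qual) = @_;
    die "line 1 does not start with '\@': $id\n"  unless $id   =~ /^\@/;
    die "line 3 does not start with '+': $plus\n" unless $plus =~ /^\+/;
    die "sequence/quality length mismatch in $id\n"
        unless length($seq) == length($qual);
    die "read shorter than $cut in $id\n" if length($seq) < $cut;
    return ($id, substr($seq, $cut), $plus, substr($qual, $cut));
}

# Reads four lines at a time from $in and writes trimmed records to $out.
sub trim_fastq {
    my ($in, $out, $cut) = @_;
    until (eof($in)) {
        my @rec;
        for (1 .. 4) {
            my $line = <$in>;
            die "truncated record at end of input\n" unless defined $line;
            chomp $line;
            push @rec, $line;
        }
        print {$out} "$_\n" for trim_record($cut, @rec);
    }
}

# Demo on the sample record from the question.
my $sample = join "\n",
    '@SEQ_ID',
    'GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT',
    '+',
    q{!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65},
    '';
my $trimmed = '';
open my $in,  '<', \$sample  or die "can't open in: $!";
open my $out, '>', \$trimmed or die "can't open out: $!";
trim_fastq($in, $out, 52);
close $out;
print $trimmed;
```

For real data, the two filehandles would be opened on the input file and on a new output file. Writing to a new file and renaming it over the original once the run succeeds is one cautious way to end up with a modified "original".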

Good luck!

Ray
