another parsing script :)

on 13.05.2011 17:46:51 by Natalie Conte

Hi,

I have a file of sequences; each sequence is 200 pb long, and I have 30K
lines.

ATGGATAGATA\n
TTCGATTCATT\n
GCCTAGACAT\n
TTGCATAGACTA\n
I want to calculate the AT ratio of each base based on their position
(3/4) for the 1st position, 3/4 on the second, (0/4) on the 3rd...
I am beginner so please excuse my perl thinking!

my plan was to put everything in arrays, split on the digit and then
for each line put the 1st digit in another array,
my $fh ="./txt" ;
unless (open(REGIONS, $fh)){
print "Cannot open file \n";
}

my @list = <REGIONS>;
close REGIONS;

foreach my $line (@list){
chomp $line;
my @pb = split(/\d/, $line);
my @position = $pb[0]; # for the first position
$line++;

Do that in a loop 200 times (as we have 200 pb per sequence), which will
create 200 arrays with 30K digits in them. I would need an array of all
arrays at that point???

From them, use a conditional loop to assess the A or T composition of each
array in the big array, count them with a counter, and divide by the
size of each array.

Could you please help me with this?
Thanks
Nat


--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: another parsing script :)

on 13.05.2011 18:11:46 by John Francini

"200 pb" -- does pb mean petabytes?

If so, those aren't going to fit in memory; you're going to have to read the file line by line, accumulating totals and ratios as you go.

J

--
John Francini
"I have come to the conclusion that one useless man is called a disgrace; that two are called a law firm; and that three or more become a Congress. And by God I have had *this* Congress!" --John Adams



Re: another parsing script :)

on 13.05.2011 18:17:08 by Luca Cappelletti


Hi Nathalie

I'm an absolute newbie in terms of Perl, but consider using PDL (the Perl
Data Language), which will help you manage vectors, matrices, and
number-crunching calculations.

cheers,

Luca

--
Luca Cappelletti
http://developerinfodomestic.blogspot.com

Re: another parsing script :)

on 13.05.2011 18:17:33 by rent0n

On 13/05/11 17:11, John Francini wrote:
> "200 pb" -- does pb mean petabytes?
>
> If so, those aren't going to fit in memory; you're going to have to read the file line by line, accumulating totals and ratios as you go.
>
> J
>
> --
> John Francini

No, I'm quite sure pb (bp?) stands for base pairs, or nucleotides, the
units of DNA sequences. :)


--
rent0n


Re: another parsing script :)

on 13.05.2011 18:22:24 by Rob Dixon

On 13/05/2011 16:46, Nathalie Conte wrote:
>
> I have a file with sequences each sequence is 200 pb long and I have 30K
> lines
>
> ATGGATAGATA\n
> TTCGATTCATT\n
> GCCTAGACAT\n
> TTGCATAGACTA\n

Does your data look like this? With 10, 11, or 12 characters per line?
I'm afraid I don't know what a pb is, are you saying that each line is
200 characters long?

> I want to calculate the AT ratio of each base based on their position
> (3/4) for the 1st position, 3/4 on the second, (0/4) on the 3rd...
> I am beginner so please excuse my perl thinking!
>
> my plan was to put everything in arrays, split on the digit and then
> for each line put the 1st digit in another array,
> my $fh ="./txt" ;
> unless (open(REGIONS, $fh)){
> print "Cannot open file \n";
> }

OK, this has been mentioned before, but you should at least die instead
of just printing an error and continuing. The error message should
include the $! built-in variable, and ideally you would also use a
lexical file handle and the three-parameter form of open. Idiomatic Perl
would look like this:

my $filename ="./txt";
open my $regions, '<', $filename or die "Cannot open file: $!";

>
> my @list = <REGIONS>;
> close REGIONS;

Instead of reading the entire file into memory, especially with the
amount of data you have, you should read and process the file one line
at a time:

while (my $line = <$regions>) {
    chomp $line;
    :
}
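
A minimal sketch that puts the two fragments above together, offered as an illustration rather than as the posted solution. The helper name line_length_counts is an assumption made for this example; it reads the file one line at a time and tallies how many lines have each length, which would also answer whether every line really is 200 characters long.

```perl
use strict;
use warnings;

# line_length_counts is a hypothetical helper, not part of the original post.
# It reads the file line by line and returns a hash mapping each line length
# to the number of lines that have that length.
sub line_length_counts {
    my ($filename) = @_;

    # Lexical file handle, three-argument open, and $! in the error message
    open my $regions, '<', $filename or die "Cannot open '$filename': $!";

    my %count;
    while (my $line = <$regions>) {
        chomp $line;                  # remove the trailing newline first
        $count{ length $line }++;
    }
    close $regions;

    return %count;
}
```

Calling line_length_counts('./txt') and printing the hash sorted by key would show at a glance whether the sequences all share one length.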

> foreach my $line (@list){
> chomp $line;
> my @pb = split(/\d/, $line);
> my @position = $pb[0]; # for the first position
> $line++;

I'm afraid I don't follow your code. Although I can see that it
corresponds to your design above, there are no digits in the sample data
you show. Also, you are incrementing $line at the end, which is the most
recent line read from the file.
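
To illustrate the point about digits, here is a small sketch using one of the sample lines from the original post. It assumes the intent was to get one base per array element; the variable names are chosen for this example only.

```perl
use strict;
use warnings;

# One of the sample lines shown in the original post.
my $line = 'ATGGATAGATA';

# split /\d/ breaks the string at digits. There are none here,
# so the whole line comes back as a single element.
my @on_digits = split /\d/, $line;

# An empty pattern splits between every character: one base per element.
my @bases = split //, $line;

printf "split /\\d/ gives %d element(s)\n", scalar @on_digits;
printf "split //   gives %d element(s)\n", scalar @bases;
print  "first base: $bases[0]\n";
```

With an empty pattern, $bases[0] holds the base at the first position, $bases[1] the second, and so on.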

> do that in a loop 200 times ( as we have 200 pb per sequence) which will
> create 200 arrays with 30K digits in them. I would need an array of all
> arrays at that point???
>
> from them use a condition loop assessing the A or T compo for each
> array in the big array , count them with a counter and divide by the
> size of each array.
>
> Could you please help me with this?

I think the problem isn't a difficult one, but I am having problems
understanding what you need to do. Could you post a reasonable sample of
data and the corresponding output that you require? Perhaps it would
help to explain what pb, AT ratio, base, and so on mean in terms of the
data in the file.

Cheers,

Rob


Re: another parsing script :)

on 13.05.2011 18:44:31 by John SJ Anderson

On Fri, May 13, 2011 at 11:46, Nathalie Conte wrote:
> I have a file with sequences each sequence is 200 pb long and I have 30K
> lines
>
> ATGGATAGATA\n
> TTCGATTCATT\n
> GCCTAGACAT\n
> TTGCATAGACTA\n
> I want to calculate the AT ratio of each base based on their position
> (3/4) for the 1st position, 3/4 on the second, (0/4) on the 3rd...
[ snip ]
> foreach my $line (@list){
>    chomp $line;
>    my @pb = split(/\d/, $line);
>    my @position = $pb[0]; for the first position
>    $line++;
>
> do that in a loop 200 times ( as we have 200 pb per sequence) which will
> create 200 arrays with 30K digits in them. I would need an array of all
> arrays at that point???

You don't need to do it 200 times; you can loop over the file once.
Something like this (caveat, untested):

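my( @AT_count , $total );

# $file is assumed to hold the path to the sequence file
open( my $FH , '<' , $file ) or die( $! );
while( <$FH> ) {
    chomp;       # drop the newline so it is not treated as a base
    my @bases = split // , $_;
    $total++;    # one count per sequence, not per base
    foreach my $idx ( 0 .. $#bases ) {
        # add 0 rather than skipping, so every position ends up defined
        $AT_count[$idx] += ( $bases[$idx] eq 'A' or $bases[$idx] eq 'T' ) ? 1 : 0;
    }
}
close( $FH );

foreach my $position ( 0 .. $#AT_count ) {
    printf "%.2f%% AT at position %d\n" ,
        $AT_count[$position] / $total * 100 , $position + 1;
}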

Basically you're going to keep an array where each element corresponds
to the number of A or T residues at a given position across all your
sequences. So the first element is the total number of [AT] residues
in the first column, second element is [AT] in the second column, and
so on.

Read each line in the file, split into individual bases, loop over
each base, and if it is an A or a T, increment the corresponding
position in the array.

At the end, loop over your count array and convert each element into a
percentage and print it out.

NOTE: If some of the sequences are different lengths, you'll need to
store the total number of positions in an array too. You said they
were all 200bp long, so I only used a single counter for the total.
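
For the variable-length case mentioned in the note, a rough sketch follows. It keeps a second array of per-position totals, so each column is divided by the number of sequences that actually reach it. The helper name at_ratio_by_position is hypothetical, and the input is the four sample lines from the original post (which are 11, 11, 10, and 12 characters long).

```perl
use strict;
use warnings;

# at_ratio_by_position is a hypothetical helper name. It takes a list of
# sequences and returns one A/T fraction per column.
sub at_ratio_by_position {
    my @sequences = @_;
    my ( @at_count, @total );

    for my $seq (@sequences) {
        my @bases = split //, uc $seq;
        for my $idx ( 0 .. $#bases ) {
            $total[$idx]++;    # sequences that reach this column
            $at_count[$idx] += ( $bases[$idx] =~ /[AT]/ ) ? 1 : 0;
        }
    }

    # Divide each column's A/T count by that column's own total
    return map { $at_count[$_] / $total[$_] } 0 .. $#total;
}

my @ratios = at_ratio_by_position(
    qw( ATGGATAGATA TTCGATTCATT GCCTAGACAT TTGCATAGACTA )
);
printf "position %d: %.2f\n", $_ + 1, $ratios[$_] for 0 .. $#ratios;
```

For the first three columns this gives 0.75, 0.75, and 0.00, matching the 3/4, 3/4, 0/4 in the original question.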

chrs,
john.


Re: another parsing script :)

on 14.05.2011 14:08:32 by rvtol+usenet

On 2011-05-13 17:46, Nathalie Conte wrote:

> ATGGATAGATA\n
> I want to calculate the AT ratio of each base based on their position
> (3/4) for the 1st position, 3/4 on the second, (0/4) on the 3rd...

perl -MData::Dumper -wle '
my %pos;
my $s = $ARGV[0];
push @{ $pos{ substr $s, $_, 2 } }, $_ for 0 .. length($s) - 2;
print Dumper \%pos;
' ATGGATAGATA

$VAR1 = {
'AG' => [
'6'
],
'GA' => [
'3',
'7'
],
'TG' => [
'1'
],
'GG' => [
'2'
],
'AT' => [
'0',
'4',
'8'
],
'TA' => [
'5',
'9'
]
};

--
Ruud
