Parsing file

on 02.06.2011 12:41:12 by Aravind Venkatesan

Hi,

I want to parse a file whose contents look as follows:

ENTRY       K00001 KO
NAME        E1.1.1.1, adh
DEFINITION  alcohol dehydrogenase [EC:1.1.1.1]
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00071  Fatty acid metabolism
///
ENTRY       K14865 KO
NAME        U14snoRNA, snR128
DEFINITION  U14 small nucleolar RNA
CLASS       Genetic Information Processing; Translation; Ribosome
            Biogenesis [BR:ko03009]
///
ENTRY       K14866 KO
NAME        U18snoRNA, snR18
DEFINITION  U18 small nucleolar RNA
CLASS       Genetic Information Processing; Translation; Ribosome
            Biogenesis [BR:ko03009]
///

Each record ends with "///". The ultimate aim is to store the information
from each record (for instance ENTRY, NAME) in a data structure (a hash)
such as (ENTRY => K14865; NAME => [U14snoRNA, snR128]; and so on).

To start off, I have produced the following snippet:

use strict;
use warnings;
use Carp;
use Data::Dumper;

my $set = parse("D:/workspace/KEGG_Parser/data/ko");

sub parse {
    my $keggFile = shift;
    my $keggHash;
    open my $fh, '<', $keggFile
        or croak("Cannot open file '$keggFile': $!");
    my $contents = do { local $/; <$fh> };
    my @rec = split('///', $contents);

    foreach my $line (@rec) {
        next if ($line =~ /^\s*$/);
        if ($line =~ /^ENTRY\s{7}(.+?)\s+/) {
            $keggHash->{'ENTRY'} = $1;
        }
        elsif ($line =~ /^NAME\s{8}(.+?)$/) {
            push @{$keggHash->{'NAME'}}, $1;
        }
        else {}
    }
    print Dumper($keggHash);
    close $fh;
}

The output I get is

$VAR1 = {
'ENTRY' => 'K00001'
};

Not all the lines in each element of @rec are being read. I would
appreciate it if somebody could guide me through this.

Thanks to all,

Aravind


Re: Parsing file

on 02.06.2011 12:46:36 by John SJ Anderson

On Thu, Jun 2, 2011 at 06:41, venkates wrote:
> Hi,
>
> I want to parse a file with contents that looks as follows:
[ snip ]

Have you considered using this module? ->

Alternatively, I think somebody on the BioPerl mailing list was
working on another KEGG parser...

chrs,
j.

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Parsing file

on 02.06.2011 13:28:55 by Aravind Venkatesan

I am doing this as an exercise to learn parsing techniques, so I would
appreciate some guidance.

Aravind


Re: Parsing file

on 02.06.2011 14:44:49 by Rob Coops

This is a simple and ugly way of parsing your file:

use strict;
use warnings;
use Carp;
use Data::Dumper;

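my $set = parse("ko");

sub parse {
    my $keggFile = shift;
    my $keggHash;

    my $counter = 1;    # record number, used as the top-level hash key

    # 'or' rather than '||': '||' binds to $keggFile, so croak would never run
    open my $fh, '<', $keggFile
        or croak("Cannot open file '$keggFile': $!");
    while ( <$fh> ) {
        chomp;
        # '///' ends a record: move on to the next key
        if ( $_ =~ m!///! ) {
            $counter++;
            next;
        }

        if ( $_ =~ /^ENTRY\s+(.+?)\s/sm ) {
            ${$keggHash}{$counter} = { 'ENTRY' => $1 };
        }
        if ( $_ =~ /^NAME\s+(.*)$/sm ) {
            my $temp = $1;
            $temp =~ s/,\s/,/g;    # drop the space after each comma
            my @names = split /,/, $temp;
            push @{${$keggHash}{$counter}{'NAME'}}, @names;
        }
    }
    close $fh;
    print Dumper $keggHash;
    return $keggHash;    # so that $set actually receives the structure
}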

The output being:

$VAR1 = {
    '1' => {
        'NAME' => [
            'E1.1.1.1',
            'adh'
        ],
        'ENTRY' => 'K00001'
    },
    '3' => {
        'NAME' => [
            'U18snoRNA',
            'snR18'
        ],
        'ENTRY' => 'K14866'
    },
    '2' => {
        'NAME' => [
            'U14snoRNA',
            'snR128'
        ],
        'ENTRY' => 'K14865'
    }
};

Which to me looks more or less like what you are looking for.
The main thing I did was read the file one line at a time, to prevent an
unexpectedly large file from causing memory issues on your machine (in the
end, the structure that you are building will cause enough issues of its
own when handling a large file).
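
A middle ground between slurping the whole file and reading it line by line is to read one record at a time by changing Perl's input record separator. This is not what the code above does; it is only a minimal sketch, assuming every record ends with "///" followed directly by a newline. The sample data and the in-memory filehandle are stand-ins for the real file.

```perl
use strict;
use warnings;

# Sample data standing in for the KEGG file (layout assumed from the post).
my $data = <<'END';
ENTRY       K00001 KO
NAME        E1.1.1.1, adh
///
ENTRY       K14865 KO
NAME        U14snoRNA, snR128
///
END

# Reading from an in-memory string keeps the sketch self-contained;
# a real script would open the file instead.
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @entries;
{
    local $/ = "///\n";    # a "line" now ends at the record terminator
    while ( my $record = <$fh> ) {
        chomp $record;     # chomp removes the current value of $/
        next if $record =~ /^\s*$/;
        # /m lets ^ match at the start of any line inside the record
        push @entries, $1 if $record =~ /^ENTRY\s+(\S+)/m;
    }
}
close $fh;

print "$_\n" for @entries;
```

Only one record is held in memory at a time, and the /m modifier is what allows a line-anchored pattern to find fields anywhere inside the multi-line record.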

You already dealt with the ENTRY bit, so I'll leave that as it is, though I
slightly changed the regex; nothing spectacular there.
The NAME bit is simple: I pull out the whole list, remove the space after
each comma, split it into an array and feed the array to the hash. And hop,
time for the next step, which is up to you ;-)
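
The substitute-then-split step for NAME can also be written as a single split on a comma with optional surrounding whitespace. A minimal sketch, using a hypothetical NAME line in the layout shown in the original post:

```perl
use strict;
use warnings;

# Hypothetical NAME line in the layout shown in the original post.
my $line = 'NAME        E1.1.1.1, adh';

my @names;
if ( $line =~ /^NAME\s+(.*)$/ ) {
    # Split on a comma plus any surrounding whitespace in one step.
    @names = split /\s*,\s*/, $1;
}

print join( '|', @names ), "\n";    # prints "E1.1.1.1|adh"
```

This behaves the same as the two-step version for lists like the one above, and it also copes with a stray space before a comma.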

I hope it helps you a bit, regards,

Rob


Re: Parsing file

on 02.06.2011 16:41:03 by Aravind Venkatesan

While trying a similar thing for the DEFINITION record, instead of being
added to the current hash alongside ENTRY and NAME, the DEFINITION record
replaces the contents of the hash:

$VAR1 = {
    '4' => {
        'DEFINITION' => 'U18 small nucleolar RNA'
    },
    '1' => {
        'DEFINITION' => 'alcohol dehydrogenase [EC:1.1.1.1]'
    },
    '3' => {
        'DEFINITION' => 'U14 small nucleolar RNA'
    },
    '2' => {
        'DEFINITION' => 'alcohol dehydrogenase (NADP+) [EC:1.1.1.2]'
    },
    '5' => {
        'DEFINITION' => 'U24 small nucleolar RNA'
    }
};

The code, in addition to what you had suggested:

if ($_ =~ /^DEFINITION\s{2}(.+)?/) {
    ${$keggHash}{$counter} = {'DEFINITION' => $1};
}

Re: Parsing file

on 02.06.2011 17:06:02 by Rob Coops

What you do:       ${$keggHash}{$counter} = {'DEFINITION' => $1};
Try the following: ${$keggHash}{$counter}{'DEFINITION'} = $1;

To make things a little clearer, look at the following example.

my %hash;
$hash{'Key 1'} = { 'Nested Key 1' => 'Value 1' };

What you do is say: $hash{'Key 1'} = { 'Nested Key 2' => 'Value 2' };
What I do is:       $hash{'Key 1'}{'Nested Key 2'} = 'Value 2';

With your approach you will end up with the following:
$VAR1 = {
    'Key 1' => {
        'Nested Key 2' => 'Value 2',
    },
};

Whereas mine will result in:
$VAR1 = {
    'Key 1' => {
        'Nested Key 1' => 'Value 1',
        'Nested Key 2' => 'Value 2',
    },
};

Not that much different, but you are basically overwriting the value (
{NAME=>[], ENTRY=>''} ) associated with your key ($counter) with {
'DEFINITION' => ''}. If you instead add a new key to the hash that is
associated with your main key ($counter), you will get the result you
are looking for.
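
The difference can be checked with a small self-contained script. This is only an illustration; the hash names %overwritten and %extended are made up for the example and do not appear in the parser.

```perl
use strict;
use warnings;

my %overwritten = ( 1 => { ENTRY => 'K00001' } );
my %extended    = ( 1 => { ENTRY => 'K00001' } );

# Assigning a new anonymous hash replaces everything stored under key 1.
$overwritten{1} = { DEFINITION => 'alcohol dehydrogenase [EC:1.1.1.1]' };

# Assigning to a nested key adds to the hash that is already there.
$extended{1}{DEFINITION} = 'alcohol dehydrogenase [EC:1.1.1.1]';

print join( ',', sort keys %{ $overwritten{1} } ), "\n";    # DEFINITION
print join( ',', sort keys %{ $extended{1} } ),    "\n";    # DEFINITION,ENTRY
```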

Regards,

Rob


Re: Parsing file

on 02.06.2011 20:32:02 by Aravind Venkatesan

Hi,

Thanks a lot for the help. I have one more question: how can I add
different values from multiple lines to the same hash ref? For example,
given this snippet of the data

PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00071  Fatty acid metabolism
            ko00350  Tyrosine metabolism
            ko00625  Chloroalkane and chloroalkene degradation
            ko00626  Naphthalene degradation

I want it to be stored in the following manner:

'2' => {
    'PATHWAY' => {
        'ko00010' => 'Glycolysis / Gluconeogenesis',
        'ko00071' => 'Fatty acid metabolism'
    },
};

Thanks,

Aravind


Re: Parsing file

on 02.06.2011 22:03:20 by Rob Coops

In that case you need to do a few things. First of all, you need to
recognise where the PATHWAY segment begins, which is easy enough, as you are
already doing that for the NAME, DEFINITION etc. segments. You then need to
remember that you are now working on the PATHWAY segment (or any multiline
segment, to be more flexible). After that, all you do is process the lines
the way you normally would.

So first of all, let's create a $multiline variable before the while loop:
my $multiline = '';
while ( <$fh> ) {
    chomp;
    if ( $_ =~ m!///! ) {
        $counter++;
        next;
    }

    # If you find the start of any other segment, empty the $multiline variable
    if ( $_ =~ /^\w+/ ) { $multiline = ''; }

    if ( $_ =~ /^PATHWAY\s+(.+?)\s+(.*)/ ) {
        # If we find the PATHWAY segment, we set the $multiline variable to indicate this
        $multiline = 'PATHWAY';
        # Deal with the data found after the PATHWAY keyword and end the processing of this line
        ${$keggHash}{$counter}{'PATHWAY'}{$1} = $2;
        next;
    }

    # Continuation lines are indented; only store them if the pattern matched
    if ( $multiline eq 'PATHWAY' && $_ =~ /^\s+(.+?)\s+(.*)/ ) {
        ${$keggHash}{$counter}{'PATHWAY'}{$1} = $2;
    }

    # Now you can deal with any other lines below just like before
}

Of course, if you have other multiline situations, simply do the same, but
this time fill the $multiline variable with the name of that segment.
I have not tested this, so there might be a typo here or there, but the
principle is hopefully clear.
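
Putting the pieces of this thread together, a generalised version of the same idea might look like the sketch below. It is an illustration only, not the code from the earlier messages: the variable name $section, the in-memory sample record, and the assumption that keywords are all upper case at the start of a line while continuation lines are indented are all mine.

```perl
use strict;
use warnings;

# Sample record; the column layout is an assumption based on the thread.
my $data = <<'END';
ENTRY       K00001 KO
NAME        E1.1.1.1, adh
PATHWAY     ko00010  Glycolysis / Gluconeogenesis
            ko00071  Fatty acid metabolism
///
END

open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my %kegg;
my $counter = 1;
my $section = '';    # name of the segment the current line belongs to

while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ m{^///} ) { $counter++; $section = ''; next; }

    # An upper-case keyword at the start of a line opens a new segment and
    # is stripped; an indented line stays in the previous segment.
    if ( $line =~ s/^([A-Z][A-Z_]*)(?:\s+|$)// ) { $section = $1; }
    else                                         { $line =~ s/^\s+//; }

    if ( $section eq 'ENTRY' ) {
        ( $kegg{$counter}{ENTRY} ) = split ' ', $line;
    }
    elsif ( $section eq 'NAME' ) {
        push @{ $kegg{$counter}{NAME} }, split /\s*,\s*/, $line;
    }
    elsif ( $section eq 'PATHWAY' && $line =~ /^(\S+)\s+(.*)$/ ) {
        $kegg{$counter}{PATHWAY}{$1} = $2;
    }
}
close $fh;

print "$_ => $kegg{1}{PATHWAY}{$_}\n" for sort keys %{ $kegg{1}{PATHWAY} };
```

Because the keyword is remembered in one variable, supporting another multiline segment only needs one more elsif branch.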

Regards,

Rob
