solution for Regex

on 09.06.2011 10:48:23 by Aravind Venkatesan

Hi,

data snippet:

ENTRY       K00002 KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010 Glycolysis / Gluconeogenesis
            ko00561 Glycerolipid metabolism
            ko00930 Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730 KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510 N-Glycan biosynthesis
            ko00513 Various types of N-glycan biosynthesis
            ko04141 Protein processing in endoplasmic reticulum
MODULE      M00072 Oligosaccharyltransferase
CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
            Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
            Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
DBLINKS     GO: 0008250
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
            KLA: KLLA0A01287g
            VPO: Kpol_1054p35
            SSL: SS1G_13465
REFERENCE   PMID:15001703
AUTHORS     Zubkov S, Lennarz WJ, Mohanty S
TITLE       Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
JOURNAL     Proc Natl Acad Sci U S A 101:3821-6 (2004)
///

I need to retrieve all the gene entries and add them to a hash ref. My code
does that for the first record, but for the second record it also pulls in
the REFERENCE information. I have provided the code below. If someone
could tell me where exactly I am going wrong (is it in the regex, or
elsewhere?) I would be glad!

code :

use strict;
use warnings;
use Carp;
use Data::Dumper;


my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");

sub parse {

    my $kegg_file_path = shift;
    my $keggData; # Hash ref

    open my $fh, '<', $kegg_file_path or croak("Cannot open file '$kegg_file_path': $!");
    local $/ = "\n///\n";
    while (<$fh>){
        chomp;
        my $record = $_;
        $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
        my $entries = $1;
        if ($record =~ m/^GENES\s{7}(.+)$/xms){
            my $gene = $1;
            ${$keggData}{$entries}{'GENE'} = $gene;
            my @genes = split ('\s{13}', $gene);
            foreach my $gene_element (@genes){
                my $taxon_label = substr($gene_element, 0, 3);
                my $gene_label = substr($gene_element, 5);
                my @gene_label_array = split '\s', $gene_label;
                push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
            }
        }

    }
    print Dumper($keggData);
    close $fh;
}

Thanks,

Aravind


Re: solution for Regex

on 09.06.2011 17:13:05 by Jim Gibson

At 10:48 AM +0200 6/9/11, venkates wrote:
>Hi,
>
>data snippet:
>
>
>I need to retrieve all the gene entries to add it to a hash ref. My
>code does that in the first record but in the second case it also
>pulls out the REFERENCE information. I have provided the code below.
>If some one could tell me where exactly I am going wrong (is it in
>the regex? or otherwise) I would be glad!!
>
>code :
>
>use strict;
>use warnings;
>use Carp;
>use Data::Dumper;
>
>
>my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
>sub parse {
>
> my $kegg_file_path = shift;
> my $keggData; # Hash ref

Please simplify your program for posting by using a hash instead of a
hash reference. Your goal should be to make it as easy as possible
for people to help you. Once you learn how to solve your problems,
you can use the solution in your actual program with whatever
complexity is necessary.

>
> open my $fh, '<', $kegg_file_path or croak("Cannot open file '$kegg_file_path': $!");
> local $/ = "\n///\n";
> while (<$fh>){
> chomp;
> my $record = $_;


Why don't you just read into $record in the first place:

while( my $record = <$fh> ) {


> $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
> my $entries = $1;
> if ($record =~ m/^GENES\s{7}(.+)$/xms){


You are capturing everything from just after GENES to the end of the
record. Try putting in REFERENCE:

if ($record =~ m/^GENES\s{7}(.+)REFERENCE/xms){
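
One caveat with anchoring on REFERENCE: the first record in the posted data has no REFERENCE section, so that pattern would not match it at all. A possible alternative, sketched here under the assumption that every keyword starts in column 1 and continuation lines are indented, is to stop the capture at the next line that begins with a non-space character, or at the end of the record. The sub name genes_block is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of the GENES section of one record,
# or undef if the record has no GENES section.
sub genes_block {
    my ($record) = @_;
    # (.+?) is non-greedy; the lookahead ends the capture at the next line
    # that starts with a non-space character, or at the end of the string.
    return $record =~ m/^GENES\s{7}(.+?)(?=^\S|\z)/ms ? $1 : undef;
}

# A record fragment with a REFERENCE section after GENES ...
my $with_ref = join "\n",
    'DBLINKS     GO: 0008250',
    'GENES       SCE: YDL232W(OST4)',
    '            AGO: AGOS_ABL170C',
    'REFERENCE   PMID:15001703',
    'AUTHORS     Zubkov S, Lennarz WJ, Mohanty S';

# ... and one where GENES is the last section.
my $without_ref = join "\n",
    'GENES       HSA: 10327(AKR1A1)',
    '            CFA: 610537';

for my $record ($with_ref, $without_ref) {
    my $genes = genes_block($record);
    print "[$genes]\n";
}
```

The captured text still contains the newlines and the indentation of the continuation lines, so the later split on '\s{13}' should keep working unchanged.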


> my $gene = $1;
> ${$keggData}{$entries}{'GENE'} = $gene;
> my @genes = split ('\s{13}', $gene);
> foreach my $gene_element (@genes){
> my $taxon_label = substr($gene_element, 0, 3);
> my $gene_label = substr($gene_element, 5);
> my @gene_label_array = split '\s', $gene_label;
> push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
> }
> }
>
> }
> print Dumper($keggData);
> close $fh;
>}

Please use the DATA file handle to make it easier to run your
program. Put your file data at the end of the program after the line

__DATA__

then use <DATA> to read the data lines.

Thanks.

--
Jim Gibson
Jim@Gibson.org

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: solution for Regex

on 09.06.2011 19:48:03 by rvtol+usenet

On 2011-06-09 10:48, venkates wrote:

> my @gene_label_array = split '\s', $gene_label;

That '\s' is more clearly written as /\s/ or for example m{\s}.

But best just make it ' ' (see perldoc -f split, about that special case).
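
A small sketch of the difference, using a made-up sample string: with /\s/ every single whitespace character is a separator, so leading whitespace and runs of spaces produce empty fields, whereas the special case ' ' skips leading whitespace and splits on runs of whitespace.

```perl
use strict;
use warnings;

# Made-up sample: two leading spaces and a run of three spaces in the middle.
my $gene_label = '  10327(AKR1A1)   741418(AKR1A1)';

my @by_regex = split /\s/, $gene_label;   # one separator per whitespace character
my @by_space = split ' ',  $gene_label;   # awk-style: leading whitespace is dropped

printf "/\\s/ gives %d fields, ' ' gives %d fields\n",
    scalar @by_regex, scalar @by_space;
```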

--
Ruud


Re: solution for Regex

on 09.06.2011 20:34:12 by Shawn H Corey

On 11-06-09 01:48 PM, Dr.Ruud wrote:
> On 2011-06-09 10:48, venkates wrote:
>
>> my @gene_label_array = split '\s', $gene_label;
>
> That '\s' is more clearly written as /\s/ or for example m{\s}.
>
> But best just make it ' ' (see perldoc -f split, about that special case).
>

FYI, some people find it hard to distinguish between ' ' and '', so they
write it as "\x20". If you ever see this, you now know why. :)


--
Just my 0.00000002 million dollars worth,
Shawn

Confusion is the first step of understanding.

Programming is as much about organization and communication
as it is about coding.

The secret to great software: Fail early & often.

Eliminate software piracy: use only FLOSS.


Re: solution for Regex

on 09.06.2011 23:11:23 by John Delacour

At 10:48 +0200 09/06/2011, venkates wrote:

> I need to retrieve all the gene entries to add it to a hash ref.

Your code is very fussy with all those substrings etc. What about
something like this:

#!/usr/local/bin/perl
use strict;
my $read = 0; my @genes; my %hash;
while (<DATA>){
    chomp;
    $read = 1 if /^GENES/;
    $read = 0 unless /\s[A-Z]{3}:/;
    if ($read){
        s/^GENES//;
        s/^\s+//;
        push @genes, $_;
    }
}
for (@genes){
    my ($taxon_label, $gene_label) = split /:\s*/;
    $hash{$taxon_label} = $gene_label;
}
for (keys %hash){
    print "Key: $_; Val: $hash{$_}\n";
}
__DATA__
Your file contents here



JD


Re: solution for Regex

on 10.06.2011 00:48:29 by Rob Dixon

On 09/06/2011 09:48, venkates wrote:
> Hi,
>
> I need to retrieve all the gene entries to add it to a hash ref. My code
> does that in the first record but in the second case it also pulls out
> the REFERENCE information. I have provided the code below. If some one
> could tell me where exactly I am going wrong (is it in the regex? or
> otherwise) I would be glad!!
>
> code :
>
> use strict;
> use warnings;
> use Carp;
> use Data::Dumper;
>
>
> my $set = parse("/home/venkates/workspace/KEGG_Parser/data/ko");
>
> sub parse {
>
> my $kegg_file_path = shift;
> my $keggData; # Hash ref
>
> open my $fh, '<', $kegg_file_path or croak("Cannot open file
> '$kegg_file_path': $!");
> local $/ = "\n///\n";
> while (<$fh>){
> chomp;
> my $record = $_;
> $record =~ m/^ENTRY\s{7}(.+?)\s+/xms;
> my $entries = $1;
> if ($record =~ m/^GENES\s{7}(.+)$/xms){
> my $gene = $1;
> ${$keggData}{$entries}{'GENE'} = $gene;
> my @genes = split ('\s{13}', $gene);
> foreach my $gene_element (@genes){
> my $taxon_label = substr($gene_element, 0, 3);
> my $gene_label = substr($gene_element, 5);
> my @gene_label_array = split '\s', $gene_label;
> push @{${$keggData}{$entries}{'GENES'}{$taxon_label}}, @gene_label_array;
> }
> }
>
> }
> print Dumper($keggData);
> close $fh;
> }

I would prefer to read the file a line at a time. The code below seems
to do what you want.

HTH,

Rob


use strict;
use warnings;

use Data::Dumper;

my $kegg_file = '/home/venkates/workspace/KEGG_Parser/data/ko';

my $fh;
unless (open $fh, '<', $kegg_file) {
    warn "Failed to open file: $!. Defaulting to DATA.";
    $fh = *DATA;
}

parse($fh);

sub parse {

    my $kegg_file_handle = shift;
    my $keggData;

    my $entry;
    my $key;

    # Read from the handle that was passed in, one line at a time.
    while (<$kegg_file_handle>) {

        next unless /\S/;
        if (m|///|) {
            # End of record: forget the current entry and section.
            undef $entry;
            undef $key;
            next;
        }

        chomp;

        # $1 is the keyword at the start of the line (empty on
        # continuation lines), $2 is the rest of the line.
        next unless m|^(.{0,11}?)\s+(.+)|;

        $key = $1 if $1;
        my $val = $2;

        if ($key eq 'ENTRY') {
            ($entry) = $val =~ /(\S+)/;
        }
        elsif ($key eq 'GENES') {
            die "No current entry" unless $entry;
            my ($taxon_label, @gene_label_array) = split /:?\s+/, $val;
            push @{$keggData->{$entry}{$key}{$taxon_label}}, @gene_label_array;
        }
    }

    print Dumper($keggData);
}

__DATA__
ENTRY       K00002 KO
NAME        E1.1.1.2, adh
DEFINITION  alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
PATHWAY     ko00010 Glycolysis / Gluconeogenesis
            ko00561 Glycerolipid metabolism
            ko00930 Caprolactam degradation
CLASS       Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis [PATH:ko00010]
            Metabolism; Lipid Metabolism; Glycerolipid metabolism [PATH:ko00561]
            Metabolism; Xenobiotics Biodegradation and Metabolism; Caprolactam degradation [PATH:ko00930]
DBLINKS     RN: R00746 R01041 R05231
            COG: COG0656
            GO: 0008106
GENES       HSA: 10327(AKR1A1)
            PTR: 741418(AKR1A1)
            PON: 100173796(AKR1A1)
            MCC: 693380(AKR1A1)
            MMU: 58810(Akr1a4)
            RNO: 78959(Akr1a1)
            CFA: 610537
///
ENTRY       K00730 KO
NAME        OST4
DEFINITION  oligosaccharyl transferase complex subunit OST4
PATHWAY     ko00510 N-Glycan biosynthesis
            ko00513 Various types of N-glycan biosynthesis
            ko04141 Protein processing in endoplasmic reticulum
MODULE      M00072 Oligosaccharyltransferase
CLASS       Metabolism; Glycan Biosynthesis and Metabolism; N-Glycan biosynthesis [PATH:ko00510]
            Metabolism; Glycan Biosynthesis and Metabolism; Various types of N-glycan biosynthesis [PATH:ko00513]
            Genetic Information Processing; Folding, Sorting and Degradation; Protein processing in endoplasmic reticulum [PATH:ko04141]
DBLINKS     GO: 0008250
GENES       SCE: YDL232W(OST4)
            AGO: AGOS_ABL170C
            KLA: KLLA0A01287g
            VPO: Kpol_1054p35
            SSL: SS1G_13465
REFERENCE   PMID:15001703
AUTHORS     Zubkov S, Lennarz WJ, Mohanty S
TITLE       Structural basis for the function of a minimembrane protein subunit of yeast oligosaccharyltransferase.
JOURNAL     Proc Natl Acad Sci U S A 101:3821-6 (2004)
///


Re: solution for Regex

on 10.06.2011 00:50:34 by Uri Guttman

>>>>> "JD" == John Delacour writes:

JD> use strict;

use warnings ;

JD> my $read = 0; my @genes; my %hash;

why the $read flag?

JD> while (<DATA>){
JD> chomp;
JD> $read = 1 if /^GENES/;

you can always just do this and test for it. i don't know the data logic
so i can't go further. at least you can run this and assign it to $read
to remove redundancy. also you can declare $read here.

my $read = s/^GENES//;

JD> $read = 0 unless /\s[A-Z]{3}:/;

since you don't do the work unless that passes, just next away:

next if /\s[A-Z]{3}:/;

now you don't need to test $read at all. again, i haven't checked the
flow logic so i could be wrong. i just smell better logic here. having
data in this example would help (sure i can look at the OP's post but i am tired. :)

JD> if ($read){
JD> s/^GENES//;

that line isn't needed.

JD> s/^\s+//;
JD> push @genes, $_;
JD> }
JD> }
JD> for (@genes){
JD> my ($taxon_label, $gene_label) = split /:\s*/;
JD> $hash{$taxon_label} = $gene_label;
JD> }

if the list isn't that long you can map/split in one cleaner line. and
don't use %hash for a hash name.

my %labeled_genes = map { split /:\s*/ } @genes ;
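
a hedged sketch of that one-liner on a few lines from the OP's data: split inside map returns a key/value pair per element, and the flattened list initialises the hash. the limit of 2 is an addition for this illustration, so that a value containing a further colon stays in one piece; a line with no colon at all would still yield a lone key and misalign the pairs.

```perl
use strict;
use warnings;

# Gene lines as they look after the GENES keyword and indentation are removed.
my @genes = ('HSA: 10327(AKR1A1)', 'PTR: 741418(AKR1A1)', 'CFA: 610537');

# Each element yields (taxon, gene label); map flattens the pairs into a hash.
my %labeled_genes = map { split /:\s*/, $_, 2 } @genes;

print "$_ => $labeled_genes{$_}\n" for sort keys %labeled_genes;
```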

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------


Re: solution for Regex

on 10.06.2011 10:05:20 by John Delacour

At 18:50 -0400 09/06/2011, Uri Guttman wrote:


>...i don't know the data logic so i can't go further. at least you
>can run this and assign it to $read to remove redundancy. also you
>can declare $read here.
>
> my $read = s/^GENES//;

No it isn't. If you don't "know the data logic" then it's because
you have not read the message that started the thread, which contains
the data.

>> $read = 0 unless /\s[A-Z]{3}:/;
>
>since you don't do the work unless that passes, just next away:
>
> next if /\s[A-Z]{3}:/;

Wrong again. That will collect zilch. Do your homework.



JD


Re: solution for Regex

on 10.06.2011 10:09:45 by Uri Guttman

>>>>> "JD" == John Delacour writes:

JD> At 18:50 -0400 09/06/2011, Uri Guttman wrote:
>> ...i don't know the data logic so i can't go further. at least you
>> can run this and assign it to $read to remove redundancy. also you
>> can declare $read here.
>>
>> my $read = s/^GENES//;

JD> No it isn't. If you don't "know the data logic" then it's because you
JD> have not read the message that started the thread, which contains the
JD> data.

>>> $read = 0 unless /\s[A-Z]{3}:/;
>>
>> since you don't do the work unless that passes, just next away:
>>
>> next if /\s[A-Z]{3}:/;

JD> Wrong again. That will collect zilch. Do your homework.

not interested right now. i am sure i can clean it up if i did. flags
like that are a red flag that there is something wrong in the
design. too late for me to get into it now.

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------


Re: solution for Regex

on 10.06.2011 17:28:23 by Gurpreet Singh

Hi,
Correct me if i am wrong -

$read = 0 unless /\s[A-Z]{3}:/;
This might pick up wrong values also - since one of the DBLINKS data (the first record) might also get picked up - it should match this regex.



Re: solution for Regex

on 11.06.2011 11:47:08 by John Delacour

At 08:28 -0700 10/06/2011, Gurpreet Singh wrote:

>Correct me if i am wrong -
>
>$read = 0 unless /\s[A-Z]{3}:/;
>This might pick up wrong values also - since one of the DBLINKS data
>(the first record) might also get picked up - it should match this
>regex.

Run my script with the data and see. $read is already false at that
point. If the data were differently arranged then maybe, but I was
not proposing a universal solution - just a suggestion as to a
different approach.
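
A small sketch of the point, using a few lines from the first record (the leading indentation of the continuation lines is assumed to be as in the KEGG file). The COG: line does match the pattern, but a match only ever leaves the flag as it is; only a line starting with GENES sets it.

```perl
use strict;
use warnings;

my @lines = (
    'DBLINKS     RN: R00746 R01041 R05231',
    '            COG: COG0656',
    'GENES       HSA: 10327(AKR1A1)',
    '            PTR: 741418(AKR1A1)',
    '///',
);

my $read = 0;
my @collected;
for my $line (@lines) {
    $read = 1 if $line =~ /^GENES/;            # only a GENES line switches reading on
    $read = 0 unless $line =~ /\s[A-Z]{3}:/;   # any non-matching line switches it off
    push @collected, $line if $read;
}

print "$_\n" for @collected;
```

If a section whose lines match the pattern came directly after GENES, those lines would be collected too, which is the "differently arranged" case mentioned above.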

JD


