looking for efficient way to parse a file

on 12.01.2008 22:40:10 by Eric Martin

Hello,

I have a file with the following data structure:
#category
item name
data1
data2
item name
data1
data2
#category
item name
data1
data2
... etc.

Any line that starts with # indicates a new category. Between
categories, there can be any number of items, with associated data.
Each item has exactly two data properties.

My plan was to just get an array that contained the index of each of
the categories and then parse each item from there, since they are in
a set format...but I was wondering if there were any suggestions for a
more efficient way...

Re: looking for efficient way to parse a file

on 12.01.2008 23:59:13 by Gunnar Hjalmarsson

Eric Martin wrote:
> I have a file with the following data structure:
> #category
> item name
> data1
> data2
> item name
> data1
> data2
> #category
> item name
> data1
> data2
> ... etc.
>
> Any line that starts with #, indicates a new category. Between
> categories, there can be any number of items, with associated data.
> Each item has exactly two data properties.
>
> My plan was to just get an array that contained the index of each of
> the categories and then parse each item from there, since they are in
> a set format...

Not sure what you mean by that. Could you please expand?

> but I was wondering if there were any suggestions for a
> more efficient way...

Efficient - in what sense?

To me, the described data structure would suggest a HoHoA (hash of
hashes of arrays):

use Data::Dumper;

my (%HoHoA, $cat);
while ( <DATA> ) {
    chomp;
    if ( substr($_, 0, 1) eq '#' ) {
        $cat = substr $_, 1;
        next;
    }
    for my $item ( 0, 1 ) {
        chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
    }
}
print Dumper \%HoHoA;

__DATA__
#category1
item1
data1
data2
item2
data1
data2
#category2
item1
data1
data2

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: looking for efficient way to parse a file

on 13.01.2008 00:09:20 by xhoster

Eric Martin wrote:
> Hello,
>
> I have a file with the following data structure:
> #category
> item name
> data1
> data2
> item name
> data1
> data2
> #category
> item name
> data1
> data2
> ... etc.
>
> Any line that starts with #, indicates a new category. Between
> categories, there can be any number of items, with associated data.
> Each item has exactly two data properties.
>
> My plan was to just get an array that contained the index of each of
> the categories

That suggests the categories are already in an array; otherwise, what is
the index an index to? I'd probably not bother to load them into an array
in the first place, just parse the file on the fly. Maybe not, though,
depending on where it was coming from and how big I expected it to
plausibly get.

> and then parse each item from there, since they are in
> a set format...but I was wondering if there were any suggestions for a
> more efficient way...

Efficient in what sense? Memory? CPU time? Programmer maintenance time?

Xho


Re: looking for efficient way to parse a file

on 13.01.2008 03:38:14 by jurgenex

Eric Martin wrote:
>I have a file with the following data structure:
>#category
>item name
>data1
>data2
>item name
>data1
>data2
>#category
>item name
>data1
>data2
>... etc.
>
>Any line that starts with #, indicates a new category. Between
>categories, there can be any number of items, with associated data.
>Each item has exactly two data properties.

That suggests to me a Hash (category) of Hash (item name) of Array (two
data elements).

>My plan was to just get an array that contained the index of each of
>the categories and then parse each item from there, since they are in

What's an index of a category?

>a set format...but I was wondering if there were any suggestions for a
>more efficient way...

Reading the file line by line in a linear manner is about as efficient as
you can possibly get, because you need to read each item at least once,
and you don't read it more than once either. The suggested data structure
would support linear reading, too.
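A single-pass parse along those lines could be sketched like this (the sub name `parse_items` and the lexical filehandle are just illustrative, not from the thread):

```perl
use strict;
use warnings;

# Parse the file in one linear pass, building a hash of hashes of
# arrays: $data{category}{item name} = [ data1, data2 ].
sub parse_items {
    my ($fh) = @_;
    my (%data, $cat);
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^#(.*)/ ) {    # a '#' line starts a new category
            $cat = $1;
            next;
        }
        # $line is an item name; the next two lines are its data
        for my $i ( 0, 1 ) {
            chomp( $data{$cat}{$line}[$i] = <$fh> );
        }
    }
    return \%data;
}
```

Each line is read exactly once, and nothing but the resulting structure is kept in memory.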

jue

Re: looking for efficient way to parse a file

on 13.01.2008 16:46:26 by Eric Martin

On Jan 12, 2:59 pm, Gunnar Hjalmarsson wrote:
> Eric Martin wrote:
> > I have a file with the following data structure:
> > #category
> > item name
> > data1
> > data2
> > item name
> > data1
> > data2
> > #category
> > item name
> > data1
> > data2
> > ... etc.
>
> > Any line that starts with #, indicates a new category. Between
> > categories, there can be any number of items, with associated data.
> > Each item has exactly two data properties.
>
> > My plan was to just get an array that contained the index of each of
> > the categories and then parse each item from there, since they are in
> > a set format...
>
> Not sure what you mean by that. Could you please expand?

I was thinking of loading the file into an array, iterating over it to
find the index values for each category, and then parsing the data
between each pair of category indexes. However, your suggestion to use
a HoHoA, along with the code sample, proved to be exactly what I needed.
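For comparison, the index-array approach described here could be sketched roughly as follows (the sub name `parse_by_index` is hypothetical; unlike a single-pass parse, it keeps the whole file in memory):

```perl
use strict;
use warnings;

# Load all lines, record the index of each '#category' line,
# then parse the items between consecutive category indexes.
sub parse_by_index {
    my @lines = @_;
    chomp @lines;
    my @cat_idx = grep { $lines[$_] =~ /^#/ } 0 .. $#lines;
    my %data;
    for my $n ( 0 .. $#cat_idx ) {
        my $start = $cat_idx[$n] + 1;
        my $end   = $n < $#cat_idx ? $cat_idx[$n + 1] - 1 : $#lines;
        my $cat   = substr $lines[ $cat_idx[$n] ], 1;
        # items come in fixed groups of three lines: name, data1, data2
        for ( my $i = $start; $i + 2 <= $end; $i += 3 ) {
            $data{$cat}{ $lines[$i] } = [ @lines[ $i + 1, $i + 2 ] ];
        }
    }
    return \%data;
}
```

It produces the same HoHoA in the end, which is why the single-pass version is simpler for the same result.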

>
> > but I was wondering if there were any suggestions for a
> > more efficient way...
>
> Efficient - in what sense?

I probably should have said effective ;)

>
> To me, the described data structure would suggest a HoHoA (hash of
> hashes of arrays):
>
> use Data::Dumper;
>
> my (%HoHoA, $cat);
> while ( <DATA> ) {
>     chomp;
>     if ( substr($_, 0, 1) eq '#' ) {
>         $cat = substr $_, 1;
>         next;
>     }
>     for my $item ( 0, 1 ) {
>         chomp( $HoHoA{$cat}{$_}[$item] = <DATA> );
>     }
> }
>
> print Dumper \%HoHoA;
>
> __DATA__
> #category1
> item1
> data1
> data2
> item2
> data1
> data2
> #category2
> item1
> data1
> data2
>
> --
> Gunnar Hjalmarsson
> Email: http://www.gunnar.cc/cgi-bin/contact.pl

Thanks for the code sample; it worked great! I didn't realize that
reading <DATA> inside the while block would advance the read position
in the data file.
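That behavior is easy to see in isolation: every read from a filehandle, whether in the `while` condition or inside the loop body, advances the same read position. A tiny self-contained illustration (using an in-memory filehandle, which is not from the thread):

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\nfour\n";
open my $fh, '<', \$text or die "open: $!";

my @pairs;
while ( my $name = <$fh> ) {        # reads lines 1 and 3
    chomp $name;
    chomp( my $next = <$fh> );      # reads lines 2 and 4
    push @pairs, [ $name, $next ];
}
```

After the loop, @pairs holds ("one", "two") and ("three", "four"): the body's read consumed the line the condition would otherwise have seen next.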

-Eric