Shrink large file according to REG_EXP

on 16.01.2008 18:28:26 by thellper

Hello,
I've a problem to solve, and I need some help, please.
I have as input a large text file (up to 5GB) which I need to filter
according to some REG_EXP, and then I need to write the filtered
(hopefully smaller) output to another file.
The filtering applies row by row: a row is split into various pieces
according to some rules, then some of the pieces are checked against
some REG_EXP, and if a match is found, the whole line is written to
the output.

The problem is that this solution is slow.
I'm now reading the whole file line by line and then applying the
reg_exp... but it is very slow.
I've noticed that the time to read and write the file without doing
anything is very small, so I'm losing a lot of time in my
reg_exps.

OK, the whole program is more complicated: the files may have
different syntax, and I have syntax files which tell me how to split
each line into its fields. Then I separately load files with the rules
(the reg_exps) used to filter them.
Anyway, my idea was to try to use the FORKS.pm module (see CPAN) to
split the file into chunks and let each thread work on a chunk of the
file: can somebody tell me how to do this? Or is there a better way?

Any help is really appreciated.

Best regards,
Davide

Re: Shrink large file according to REG_EXP

on 16.01.2008 18:54:13 by xhoster

thellper wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .

Figure out which regex is slow, why it is slow, and then make it faster.

If you did the first step and posted the culprit with some sample input, we
might be able to help with the latter two.
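
One way to do that first step is to time each rule on its own. This is a minimal sketch, assuming the rules are held in a hash of name => compiled regex; the rule names, the patterns and the `time_rules` helper are hypothetical and not the OP's code:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical rules; the real ones would come from the OP's rule files.
my %rules = (
    error_code => qr/\bERR\d{3}\b/,
    user_field => qr/^user=\w+/,
);

# Run every rule against every line and accumulate wall-clock time per rule.
sub time_rules {
    my ($rules, $lines) = @_;
    my %elapsed = map { $_ => 0 } keys %$rules;
    for my $name (keys %$rules) {
        my $re    = $rules->{$name};
        my $start = time;
        for my $line (@$lines) {
            my $matched = $line =~ $re;
        }
        $elapsed{$name} += time - $start;
    }
    return \%elapsed;
}

# Made-up sample input; a slice of the real file would be more telling.
my @sample  = ("user=bob ERR404 disk full\n", "user=amy ok\n") x 1000;
my $elapsed = time_rules(\%rules, \@sample);

# Slowest rule first.
printf "%-12s %.6f s\n", $_, $elapsed->{$_}
    for sort { $elapsed->{$b} <=> $elapsed->{$a} } keys %$elapsed;
```

The rule at the top of the list is the one worth posting together with some sample input.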

> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

I'd try to make the single-threaded one faster first, and turn to
parallelization only as a last resort. Also, if I were doing
parallelization of this, I probably wouldn't use forks.pm to do it. Once
started, your threads (or processes) really don't need to communicate with
each other (as long as you make independent output files to be combined
later), so a simpler solution would do, like Parallel::ForkManager or just
calling fork yourself. Or just start the jobs as separate processes in the
first place.

If the order of the lines in the output files isn't important, I'd give
each job a different integer token (from 0 to num_job-1) and then have each
job process only those lines where
$token == $. % $num_job
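
The modulo test above can be sketched as follows. The `filter_share` name, the sample rows and the pattern are assumptions made for illustration; in real use each job would open the same input file and its own output file:

```perl
use strict;
use warnings;

# Copy to $out only the lines that belong to this job and match $re.
# $token runs from 0 to $num_job - 1; $. is Perl's current input line number.
sub filter_share {
    my ($in, $out, $token, $num_job, $re) = @_;
    while (my $line = <$in>) {
        next unless $token == $. % $num_job;
        print {$out} $line if $line =~ $re;
    }
    return;
}

# Demo on in-memory files: job 1 of 2 sees lines 1, 3 and 5 only.
my $data = join '', map { "row $_\n" } 1 .. 6;
open my $in,  '<', \$data      or die "open in: $!";
open my $out, '>', \my $result or die "open out: $!";
filter_share($in, $out, 1, 2, qr/row [1-4]/);
close $out;
print $result;    # prints "row 1" and "row 3"
```

Every job still reads the whole file, so this only spreads the regex work across jobs, not the I/O.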

Xho

Re: Shrink large file according to REG_EXP

on 16.01.2008 19:00:36 by Ted Zlatanov

On Wed, 16 Jan 2008 09:28:26 -0800 (PST) thellper wrote:

t> The problem is that this solution is slow. I'm now reading line by
t> line the whole file, and then I'm applying the reg_exp... but it is
t> very slow. I've noticed that the time to read and write the file
t> without doing anything is very small, so I'm loosing a lot of time
t> for my reg_exps... .

t> Ok, the whole program is more complicated: the files may have
t> different syntax, and I have syntax files which tell me how to split
t> each line in its fields. Then I load separately files with the rules
t> (the reg_exps) used to filter them.... . Anyway, my idea was to try
t> to use the FORKS.pm module (s. CPAN) to split the file in chunks and
t> let each thread work on a chunk of the file: can somebody tell me how
t> to do this ? Or a better way?

Please post a practical example of what's slow (with sample input) so we
can see, comment on, and test it. There's a Benchmark module that will
measure the performance of a function well.
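
As a rough illustration of the Benchmark module, the sketch below compares two ways of applying the same filter. The sample rows, the patterns and the two sub names are invented for the example; only `cmpthese` is Benchmark's own:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up input: one row in seven has status=fail.
my @lines = map { "id=$_ status=" . ($_ % 7 ? 'ok' : 'fail') . "\n" } 1 .. 1000;

# Candidate 1: three separate regexes tried one after another.
my @separate = (qr/status=fail/, qr/status=error/, qr/status=timeout/);
# Candidate 2: the same three rules as a single alternation.
my $combined = qr/status=(?:fail|error|timeout)/;

sub count_separate {
    my $hits = 0;
    LINE: for my $line (@lines) {
        for my $re (@separate) {
            if ($line =~ $re) { $hits++; next LINE }
        }
    }
    return $hits;
}

sub count_combined {
    my $hits = 0;
    for my $line (@lines) { $hits++ if $line =~ $combined }
    return $hits;
}

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese(-1, { separate => \&count_separate, combined => \&count_combined });
```

The table it prints shows runs per second for each candidate and their relative speed; which one wins depends on the real regexes and data.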

Ted

Re: Shrink large file according to REG_EXP

on 16.01.2008 19:02:20 by Jim Gibson

thellper wrote:

> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .
>
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?

If your program is I/O bound, then it might be faster to work on
different parts simultaneously. However, you are going to suffer some
head thrashing as your multiple processes attempt to read different
parts of the same file at the same time.

If your program is cpu bound, then splitting up the work won't help
unless you are using a multi-processor system.

If, as you say, reading the file without doing any processing is quick
enough, then it is the processing of the data that is the bottleneck.
You should concentrate on improving that part of your program. People
here can help, if you post short examples of what you are trying to do.
Show us some of your regexes, at least, and samples of these "syntax
files".

--
Jim Gibson

Re: Shrink large file according to REG_EXP

on 16.01.2008 19:13:04 by it_says_BALLS_on_your forehead

On Jan 16, 12:28 pm, thellper wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>
> The problem is that this solution is slow.
> I'm now reading line by line the whole file, and then I'm applying the
> reg_exp... but it is very slow.
> I've noticed that the time to read and write the file without doing
> anything is very small, so I'm loosing a lot of time for my
> reg_exps... .
>
> Ok, the whole program is more complicated: the files may have
> different syntax, and I have syntax files which tell me how to split
> each line in its fields. Then I load separately files with the rules
> (the reg_exps) used to filter them.... .
> Anyway, my idea was to try to use the FORKS.pm module (s. CPAN) to
> split the file in chunks and let each thread work on a chunk of the
> file: can somebody tell me how to do this ? Or a better way?
>

check out /REGEX/o

and qr/REGEX/

..also, if you keep a history of which filters get used the most,
stick those at the top. this will speed up the file processing if the
trend does not change. may want to do this periodically in case it
does change.
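
A minimal sketch of both ideas, with made-up patterns and helper names (`line_matches`, `reorder_filters`): the filters are compiled once with qr//, each one counts its hits, and the list is re-sorted so the busiest filter is tried first.

```perl
use strict;
use warnings;

# Hypothetical filters, compiled once with qr// before the loop starts.
my @filters = map { +{ re => qr/$_/, hits => 0 } } ('alpha', 'beta', 'gamma');

# Return true if any filter matches; count which filter did the work.
sub line_matches {
    my ($line) = @_;
    for my $f (@filters) {
        if ($line =~ $f->{re}) {
            $f->{hits}++;
            return 1;
        }
    }
    return 0;
}

# Move the most frequently matching filters to the front.
sub reorder_filters {
    @filters = sort { $b->{hits} <=> $a->{hits} } @filters;
    return;
}

my @lines = ("gamma 1\n", "gamma 2\n", "beta 3\n", "delta 4\n");
my @kept  = grep { line_matches($_) } @lines;
reorder_filters();
print "kept ", scalar(@kept), " lines; first filter is now $filters[0]{re}\n";
```

Reordering only pays off when a match on any one filter is enough to keep the line, as it is here.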

Re: Shrink large file according to REG_EXP

on 16.01.2008 20:17:02 by Uri Guttman

>>>>> "nc" == nolo contendere writes:

nc> check out /REGEX/o

obsolete and probably useless.

nc> and qr/REGEX/

we still haven't seen his code so that is not a solution. more likely
his loops are clunky and slow and his regexes are worse.

nc> ...also, if you keep a history of which filters get used the most,
nc> stick those at the top. this will speed up the file processing if the
nc> trend does not change. may want to do this periodically in case it
nc> does change.

or which are the slowest regexes and speed those up. there are too many
ways to optimize unknown code. let's see if the OP will actually post
some data and code.

uri

--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Re: Shrink large file according to REG_EXP

on 16.01.2008 20:22:37 by it_says_BALLS_on_your forehead

On Jan 16, 2:17 pm, Uri Guttman wrote:
> >>>>> "nc" == nolo contendere writes:
>
>   nc> check out /REGEX/o
>
> obsolete and probably useless.
>

really? is this since 5.10?

>   nc> and qr/REGEX/
>
> we still haven't seen his code so that is not a solution. more likely
> his loops are clunky and slow and his regexes are worse.
>
>   nc> ...also, if you keep a history of which filters get used the most,
>   nc> stick those at the top. this will speed up the file processing if the
>   nc> trend does not change. may want to do this periodically in case it
>   nc> does change.
>
> or which are the slowest regexes and speed those up. there are too many
> ways to optimize unknown code. let's see if the OP will actually post
> some data and code.
>

yeah, Xho already suggested the speed-up-the-slowest-regex solution,
so I was going for something different. you're right though, code +
data would help enormously.

Re: Shrink large file according to REG_EXP

on 16.01.2008 20:54:32 by Uri Guttman

>>>>> "nc" == nolo contendere writes:

nc> On Jan 16, 2:17 pm, Uri Guttman wrote:
>> >>>>> "nc" == nolo contendere writes:
>>
>>   nc> check out /REGEX/o
>>
>> obsolete and probably useless.
>>

nc> really? is this since 5.10?

since at least when qr// came in. also dynamic regexes (those with
interpolation) are not recompiled unless some variable in them
changes. this is what /o was all about in the early days of perl5. so
its purpose of not recompiling has been moot for eons. and qr//
makes it even more useless.

uri

Re: Shrink large file according to REG_EXP

on 16.01.2008 21:00:28 by it_says_BALLS_on_your forehead

On Jan 16, 2:54 pm, Uri Guttman wrote:
> >>>>> "nc" == nolo contendere writes:
>
>   nc> On Jan 16, 2:17 pm, Uri Guttman wrote:
>   >> >>>>> "nc" == nolo contendere writes:
>   >>
>   >>   nc> check out /REGEX/o
>   >>
>   >> obsolete and probably useless.
>   >>
>
>   nc> really? is this since 5.10?
>
> since at least when qr// came in. also dynamic regexes (those with
> interpolation) are not recompiled unless some variable in them
> changes. this is what /o was all about in the early days of perl5. so
> it's purpose of not recompiling has been moot for eons. and qr// even
> makes it even more useless.

Ok, thanks for the info.

Re: Shrink large file according to REG_EXP

on 16.01.2008 21:43:55 by Martien Verbruggen

On Wed, 16 Jan 2008 10:02:20 -0800,
Jim Gibson wrote:
> thellper wrote:
>

>> The problem is that this solution is slow.
>> I'm now reading line by line the whole file, and then I'm applying the
>> reg_exp... but it is very slow.
>> I've noticed that the time to read and write the file without doing
>> anything is very small, so I'm loosing a lot of time for my
>> reg_exps... .

> If your program is I/O bound, then it might be faster to work on
> different parts simultaneously.

If the process is I/O bound, then it's unlikely that it'll speed up if
you work on multiple parts simultaneously, unless you can guarantee that
those multiple parts are going to come from a different part of your I/O
subsystem, i.e. ones that don't compete with each other for resources.
Given that it's one single file as input, it's very unlikely that you'll
be able to pick your parts to work on in such a way that you avoid I/O
contention.

You might see some improvement if you're lucky, but you could also see a
marked decrease in total I/O speed, if you're unlucky.

Splitting a process into multiple worker processes is generally only
better if each worker process can then utilise a piece of hardware that
wasn't used before, like another I/O system, or another CPU.

> However, you are going to suffer some
> head thrashing as your multiple processes attempt to read different
> parts of the same file at the same time.

Indeed, at least, if your file is on a single disk. If it's on a RAID
system, the O/S might be able to avoid contention on disks. Or not. For
linear access patterns you generally do get some improvement.

> If your program is cpu bound, then splitting up the work won't help
> unless you are using a multi-processor system.

Indeed.

But CPU bound processes can benefit from algorithm improvements, or even
small tweaks to code if that code is in a place that gets executed a
lot.

Profiling would be able to identify that.

> If, as you say, reading the file without doing any processing is quick
> enough, then it is the processing of the data that is the bottleneck.

Agreed :) It is also really the only bit which is likely to be
Perl-specific. Everything before it is not.

Martien
--
|
Martien Verbruggen | The Second Law of Thermodenial: In any closed
| mind the quantity of ignorance remains
| constant or increases.

Re: Shrink large file according to REG_EXP

on 17.01.2008 10:17:05 by bugbear

Uri Guttman wrote:
>>>>>> "nc" == nolo contendere writes:
>
> nc> On Jan 16, 2:17 pm, Uri Guttman wrote:
> >> >>>>> "nc" == nolo contendere writes:
> >>
> >> nc> check out /REGEX/o
> >>
> >> obsolete and probably useless.
> >>
>
> nc> really? is this since 5.10?
>
> since at least when qr// came in. also dynamic regexes (those with
> interpolation) are not recompiled unless some variable in them
> changes. this is what /o was all about in the early days of perl5. so
> it's purpose of not recompiling has been moot for eons. and qr// even
> makes it even more useless.

I won't ask you lots of questions - but do you have a link
to this info that I can read - it's of (substantial) interest
to me.

BugBear

Re: Shrink large file according to REG_EXP

on 17.01.2008 10:38:08 by Uri Guttman

>>>>> "b" == bugbear writes:

b> Uri Guttman wrote:
>>>>>>> "nc" == nolo contendere writes:
nc> On Jan 16, 2:17 pm, Uri Guttman wrote:
>> >> >>>>> "nc" == nolo contendere writes:
>> >> >> nc> check out /REGEX/o
>> >> >> obsolete and probably useless.
>> >> nc> really? is this since 5.10?
>> since at least when qr// came in. also dynamic regexes (those with
>> interpolation) are not recompiled unless some variable in them
>> changes. this is what /o was all about in the early days of perl5. so
>> it's purpose of not recompiling has been moot for eons. and qr// even
>> makes it even more useless.

b> I won't ask you lots of questions - but do you have a link
b> to this info that I can read - it's of (substantial) interest
b> to me.

this should be in perlop under the regexp quote like ops but it doesn't
mention that /o is useless now. the faq covers it. and 5.6 is pretty old
so /o has been useless for years.


perlfaq6: What is /o really for? (code snipped)

The /o option for regular expressions (documented in perlop and
perlreref) tells Perl to compile the regular expression only once. This
is only useful when the pattern contains a variable. Perls 5.6 and later
handle this automatically if the pattern does not change.

Since the match operator m//, the substitution operator s///, and the
regular expression quoting operator qr// are double-quotish constructs,
you can interpolate variables into the pattern. See the answer to "How
can I quote a variable to use in a regex?" for more details.

Versions of Perl prior to 5.6 would recompile the regular expression for
each iteration, even if $pattern had not changed. The /o would prevent
this by telling Perl to compile the pattern the first time, then reuse
that for subsequent iterations:

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the /o
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if the
variable changes (thus, only using its initial value), you still need
the /o.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The use re 'debug'
pragma (comes with Perl 5.005 and later) shows the details. With Perls
before 5.6, you should see re reporting that it's compiling the regular
expression on each iteration. With Perl 5.6 or later, you should only
see re report that for the first iteration.
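
The "only using its initial value" behaviour of /o described above can be checked with a small sketch. The words and variable names are invented for the demonstration:

```perl
use strict;
use warnings;

# Uncomment to watch each compilation on STDERR (verbose):
# use re 'debug';

my (@plain, @with_o);
for my $word (qw(foo bar)) {
    my $pattern = $word;
    # Without /o the pattern follows the variable.
    push @plain,  ("foobar" =~ /^$pattern/  ? 1 : 0);
    # With /o the pattern is frozen at its first value, "foo".
    push @with_o, ("foobar" =~ /^$pattern/o ? 1 : 0);
}
print "plain: @plain\n";       # plain: 1 0
print "with /o: @with_o\n";    # with /o: 1 1
```

On the second pass $pattern is "bar", which does not match at the start of "foobar", yet the /o match still succeeds because it keeps using "foo".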

uri

Re: Shrink large file according to REG_EXP

on 17.01.2008 12:10:41 by bugbear

Uri Guttman wrote:
>
> b> I won't ask you lots of questions - but do you have a link
> b> to this info that I can read - it's of (substantial) interest
> b> to me.

(helpful stuff snipped)

Thank you for that - most helpful.

BugBear

Re: Shrink large file according to REG_EXP

on 17.01.2008 21:59:12 by Charles DeRykus

On Jan 16, 9:28 am, thellper wrote:
> Hello,
> I've a problem to solve, and I need some help, please.
> I've as input a large text file (up to 5GB) which I need to filter
> according some REG_EXP and then I need to write the filtered
> (hopefully smaller) output to another file.
> The filtering applies row-by-row: a row is splitted according to some
> rules in various pieces, then some of the pieces are checked according
> to some REG_EXP, and if a match is found, the whole line is written to
> the output.
>...

Just a guess, but splitting into pieces and then applying the regex
to each piece may well be a significant slowdown. Have you considered
tweaking the regex to avoid the split and the resulting copies?
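
To illustrate what that could look like, here is a sketch under an assumed row format (semicolon-separated fields, with the rule applied to the third field). The data and both patterns are hypothetical, since the OP's syntax files were never posted:

```perl
use strict;
use warnings;

# Hypothetical rows: semicolon-separated, third field is the status.
my @lines = ("1;bob;fail\n", "2;amy;ok\n", "3;fail;ok\n");

# Split first, then test the third field.
my @by_split = grep { (split /;/, $_)[2] =~ /^fail$/ } @lines;

# Same test as one anchored regex on the whole row: skip two fields, then match.
my @by_regex = grep { /^(?:[^;]*;){2}fail$/ } @lines;

print "split: @by_split", "regex: @by_regex";
```

Both versions keep only the first row. The second never builds the list of fields, which is the copying the guess above is about; whether it is actually faster for the real rules would have to be measured.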


--
Charles DeRykus

Re: Shrink large file according to REG_EXP

on 17.01.2008 23:52:41 by Ilya Zakharevich

[A complimentary Cc of this posting was sent to Uri Guttman], who wrote:
> In versions 5.6 and later, Perl won't recompile the regular expression
> if the variable hasn't changed, so you probably don't need the /o
> option. It doesn't hurt, but it doesn't help either.

Yet another case of broken documentation. Still, //o helps (though
nowhere near as dramatically as before). It avoids CHECKING that the
pattern did not change.

Hope this helps,
Ilya

Re: Shrink large file according to REG_EXP

on 18.01.2008 02:36:38 by John Bokma

Ilya Zakharevich wrote:

> Yet another case of broken documentation.

Important question: how can this be fixed?

Preferably both:

- the documentation itself,
- and a way to make the fixing process easier (wiki?)

--
John

http://johnbokma.com/mexit/