"negative" regexp
am 30.01.2008 02:10:04 von Petr Vileta
I have a problem constructing a regexp; it is beyond me ;-) Please help.
I have a string
$string="
href='abc.htm'>click";
and I need to remove all HTML tags except <img>. The result should be
$string="click";
Now I do it this way:
# replace < with Ctrl-B and > with Ctrl-E for all <img> tags
$string=~s//\cb$1\ce/g;
# remove all html tags
$string=~s/<.+?>//g;
# replace back all Ctrl-B with <
$string=~s/\cb/</g;
# replace back all Ctrl-E with >
$string=~s/\ce/>/g;
but maybe there is another way.
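For reference, a minimal sketch of the placeholder approach described above. The sample string and the pattern that picks out the img tags are assumptions, since the original ones did not survive in this post, and the sketch assumes the input never contains Ctrl-B or Ctrl-E itself.

```perl
use strict;
use warnings;

# Hypothetical sample input; the original string is not recoverable.
my $string = "<b><a href='abc.htm'><img src='x.png'>click</a></b>";

# Protect <img ...> tags: swap their angle brackets for Ctrl-B / Ctrl-E.
$string =~ s/<(img\b[^>]*)>/\cB$1\cE/gi;
# Remove every remaining tag.
$string =~ s/<[^>]*>//g;
# Restore the protected brackets.
$string =~ s/\cB/</g;
$string =~ s/\cE/>/g;

print "$string\n";   # <img src='x.png'>click
```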
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your mail from
another non-spammer site please.)
Please reply to
Re: "negative" regexp
am 30.01.2008 03:32:46 von Abigail
Petr Vileta (stoupa@practisoft.cz) wrote on VCCLXV September MCMXCIII:
;; I have problem to construct regexp, this is out of my brain ;-) please help.
;;
;; I have string
;;
;; $string="
;; href='abc.htm'>click";
;;
;; and I need to remove all HTML tags except <img>. The result should be
;;
;; $string="click";
;;
;; Now I do it this way
;;
;; # replace < with Ctrl-B and > with Ctrl-E for all tags
;; $string=~s//\cb$1\ce/g;
;; # remove all html tags
;; $string=~s/<.+?>//g;
;; # replace back all Ctrl-B with <
;; $string=~s/\cb/</g;
;;
;; # replace back all Ctrl-E with >
;; $string=~s/\ce/>/g;
;;
;; but maybe exist another way.
Well, for your example,
s/<(?!img)[^>]*>//g
ought to do it (untested).
But that assumes no '>' is present inside a tag, which doesn't have
to be the case.
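A small sketch of both points, the negative lookahead and the caveat about '>' inside a tag. The helper name strip_except_img and both input strings are made up for illustration.

```perl
use strict;
use warnings;

# Remove every tag whose name does not start with "img".
sub strip_except_img {
    my ($html) = @_;
    $html =~ s/<(?!img)[^>]*>//g;
    return $html;
}

# Hypothetical inputs, made up for illustration.
print strip_except_img("<b><a href='abc.htm'><img src='x.png'>click</a></b>"), "\n";
# prints: <img src='x.png'>click

# A '>' inside an attribute value ends the match too early.
print strip_except_img("<a title='a > b' href='abc.htm'>click</a>"), "\n";
# prints:  b' href='abc.htm'>click
```

Note that the lookahead is case-sensitive as written, so an upper-case IMG tag would be removed unless /i is added.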
The "right" way to do it is to use a proper HTML parser.
Get one from your nearest CPAN.
Abigail
--
perl -swleprint -- -_=Just\ another\ Perl\ Hacker
Re: "negative" regexp
am 30.01.2008 16:10:15 von Petr Vileta
Abigail wrote:
> _
> Petr Vileta (stoupa@practisoft.cz) wrote on VCCLXV September MCMXCIII
> in :
> ;; I have problem to construct regexp, this is out of my brain ;-)
> please help. ;;
> ;; I have string
> ;;
> ;; $string="
> ;; href='abc.htm'>click";
> ;;
> ;; and I need to remove all html tags except . The result
> should be ;;
> ;; $string="click";
> ;;
>
> Well, for your example,
>
> s/<(?!img)[^>]*>//g
>
Thanks, this works OK.
>
> The "right" way to do it is to use a proper HTML parser.
>
> Get one from your nearest CPAN.
>
I'm tending to not use HTMP parsers because these construct a huge hashes and
this is usually not needed for my purposes.
Re: "negative" regexp
am 30.01.2008 17:47:04 von Michele Dondi
On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta" wrote:
>I'm tending to not use HTMP parsers because these construct a huge hashes and
>this is usually not needed for my purposes.
Huh?!? Evidence?
Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^
..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
Re: "negative" regexp
am 30.01.2008 19:58:04 von Ted Zlatanov
On Wed, 30 Jan 2008 17:47:04 +0100 Michele Dondi wrote:
MD> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
MD> wrote:
>> I'm tending to not use HTMP parsers because these construct a huge hashes and
>> this is usually not needed for my purposes.
MD> Huh?!? Evidence?
Well, it's right in the name: Huge Temporary Memory Packrat, HTMP. I
try not to use HTMP parsers myself as well for this very reason.
Ted
Re: "negative" regexp
am 31.01.2008 01:54:11 von Petr Vileta
Michele Dondi wrote:
> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
> wrote:
>
>> I'm tending to not use HTMP parsers because these construct a huge
>> hashes and this is usually not needed for my purposes.
>
> Huh?!? Evidence?
>
>
Here is an example of converting any basic HTML page to plain text.
# remove all except body content
$html=~s/^.+?(.+?)<\/body>.*$/$1/si;
# remove all scripts
$html=~s///sig;
# remove all images
$html=~s///sig;
# remove all html comments
$html=~s/<\!\-\-.+?\-\->//sig;
# replace possible table end-of-row or <br> with new line
$html=~s/(<\/tr>|<br>)/\n/sig;
# remove all remaining html tags
$html=~s/<.+?>//sg;
Now I have plain text. Yes, this way is not ideal, but it is quick and
consumes little memory.
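For reference, a runnable sketch of the steps above. The body, script and image patterns are assumptions (the original ones did not survive in this post), and the sample page is made up.

```perl
use strict;
use warnings;

# Hypothetical page, made up for illustration.
my $html = <<'HTML';
<html><head><title>t</title></head>
<body>
<script type="text/javascript">var x = 1;</script>
<!-- a comment -->
<table><tr><td>one</td></tr><tr><td>two</td></tr></table>
<img src="x.png" alt="x">three<br>four
</body></html>
HTML

# Keep only the body content (assumed pattern).
$html =~ s/^.+?<body[^>]*>(.+?)<\/body>.*$/$1/si;
# Remove scripts together with their content (assumed pattern).
$html =~ s/<script.+?<\/script>//sig;
# Remove images (assumed pattern).
$html =~ s/<img.+?>//sig;
# Remove HTML comments.
$html =~ s/<!--.+?-->//sg;
# Turn end-of-row and <br> into newlines.
$html =~ s/(<\/tr>|<br>)/\n/sig;
# Remove all remaining tags.
$html =~ s/<.+?>//sg;

print $html;
```

On this tidy input the tags are gone and the rows end up on separate lines. A '>' inside an attribute value, or markup inside a comment, would already be enough to confuse it.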
Re: "negative" regexp
am 31.01.2008 11:44:39 von Martien Verbruggen
On Thu, 31 Jan 2008 01:54:11 +0100,
Petr Vileta wrote:
> Michele Dondi wrote:
>> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
>> wrote:
>>
>>> I'm tending to not use HTMP parsers because these construct a huge
>>> hashes and this is usually not needed for my purposes.
>>
>> Huh?!? Evidence?
>>
>>
> Example for convert any basic html page to plain text.
>
> # remove all except body content
> $html=~s/^.+?(.+?)<\/body>.*$/$1/si;
> # remove all scripts
> $html=~s///sig;
> # remove all images
> $html=~s///sig;
> # remove all html coments
> $html=~s/<\!\-\-.+?\-\->//sig;
> # replace possible table end-of-row or
with new line
> $html=~s/(<\/tr>|
)/\n/sig;
> # remove all remaining html tags
>
> $html=~s/<.+?>//sg;
>
> Now I have plain text.
No, you don't. At least, you probably do, but that's only because HTML
files are plain text to start off with. What you do not have is a file
completely cleared of HTML markup. And you also possibly have removed bits
of text that you meant to leave in place.
> Yes, this way is not ideal but is quickly and consumpt
> low memory.
If by "not ideal" you mean incorrect, you're right.
You really need a HTML parser to do this correctly, and it's simply not
as trivial as you seem to think to roll one yourself.
You still haven't given any evidence for your statement that HTML
parsers construct huge hashes. I don't believe they necessarily, or
ever, do. Even if that was simply a clumsy attempt to make a more
general statement about why an HTML parser isn't going to work for you,
I'd still like to hear some clarification. What is the high performance
task that you need to perform on your memory-starved machine that
doesn't allow a HTML parser?
Martien
--
|
Martien Verbruggen | The Second Law of Thermodenial: In any closed
| mind the quantity of ignorance remains
| constant or increases.
Re: "negative" regexp
am 31.01.2008 13:01:06 von Tad J McClellan
Petr Vileta wrote:
> $html=~s/<\!\-\-.+?\-\->//sig;
Unnecessary backslashes make your code much harder to read
and understand. You should backslash only when you actually
need to.
There is not much point in ignoring case when your pattern
does not contain any letters...
$html =~ s/<!--.+?-->//sg;
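A quick check that the backslashes and the /i flag make no difference for this pattern. The input string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical input, made up for illustration.
my $html = "before<!-- first\ncomment -->middle<!--second-->after";

# Same substitution, once with the extra backslashes and /i, once without.
(my $escaped = $html) =~ s/<\!\-\-.+?\-\->//sig;
(my $plain   = $html) =~ s/<!--.+?-->//sg;

print "same result\n" if $escaped eq $plain;
print "$plain\n";   # beforemiddleafter
```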
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
Re: "negative" regexp
am 31.01.2008 14:47:57 von Petr Vileta
Tad J McClellan wrote:
> Petr Vileta wrote:
>
>
>> $html=~s/<\!\-\-.+?\-\->//sig;
>
>
> Unnecessary backslashes make your code much harder to read
> and understand. You should backslash only when you actually
> need to.
>
> There is not much point in ignoring case when your pattern
> does not contain any letters...
>
>
> $html =~ s///sg;
Yes, you are right, but the O'Reilly book "Programming Perl" says "... any other
escaped character is the character itself". Maybe this is not an exact quote; I have
the Czech version. In other words, the character - is sometimes a "range operator", say
in [a-z], and the character ! sometimes means "not". So, to be sure that a
character is a character and not an operator, I am used to escaping all possibly
ambiguous characters ;-)
Re: "negative" regexp
am 31.01.2008 15:05:35 von Petr Vileta
Martien Verbruggen wrote:
> On Thu, 31 Jan 2008 01:54:11 +0100,
> Petr Vileta wrote:
>> Michele Dondi wrote:
>>> On Wed, 30 Jan 2008 16:10:15 +0100, "Petr Vileta"
>>> wrote:
>>>
>>>> I'm tending to not use HTMP parsers because these construct a huge
>>>> hashes and this is usually not needed for my purposes.
>>>
>>> Huh?!? Evidence?
>>>
>>>
>> Example for convert any basic html page to plain text.
>>
>> # remove all except body content
>> $html=~s/^.+?(.+?)<\/body>.*$/$1/si;
>> # remove all scripts
>> $html=~s///sig;
>> # remove all images
>> $html=~s///sig;
>> # remove all html coments
>> $html=~s/<\!\-\-.+?\-\->//sig;
>> # replace possible table end-of-row or
with new line
>> $html=~s/(<\/tr>|
)/\n/sig;
>> # remove all remaining html tags
>>
>> $html=~s/<.+?>//sg;
>>
>> Now I have plain text.
>
> No, you don't. At least, you probably do, but that's only because HTML
> files are plain text to start off with. What you do not have is a file
> completely cleared of HTML markup. And you also possibly have removed
> bits of text that you meant to leave in place.
>
What text? Image titles and alts? Links (anchors)? Form fields? Those are
unimportant for me in this particular case.
>> Yes, this way is not ideal but is quickly and
>> consumpt low memory.
>
> If by "not ideal" you mean incorrect, you're right.
>
No, I mean not ideal for universal use. I have a concrete goal and I use as
few resources as possible. For example, if I want to extract clickable email
addresses from HTML source, I need to extract all
/href=['"]*mailto:\s*(.+?)['"\s>/
only.
> You really need a HTML parser to do this correctly, and it's simply
> not as trivial as you seem to think to roll one yourself.
>
Yes, an HTML parser knows how to parse correctly, but it sometimes fails on invalid HTML
pages. For example, I have often seen pages generated by PHP from templates
which contain or tags twice or more ;-)
> You still haven't given any evidence for your statement that HTML
> parsers construct huge hashes. I don't believe they necessarily, or
> ever, do. Even if that was simply a clumsy attempt to make a more
> general statement about why an HTML parser isn't going to work for
> you, I'd still like to hear some clarification. What is the high
> performance task that you need to perfomr on your memory starved
> machine that
> doesn't allow a HTML parser?
>
HTML::Parser and WWW::Mechanize are good modules, but in many cases they are "too
big a gun" :-)
Re: "negative" regexp
am 31.01.2008 15:32:48 von Michele Dondi
On Thu, 31 Jan 2008 01:54:11 +0100, "Petr Vileta" wrote:
>>> I'm tending to not use HTMP parsers because these construct a huge
>>> hashes and this is usually not needed for my purposes.
>>
>> Huh?!? Evidence?
[snip]
>Now I have plain text. Yes, this way is not ideal but is quickly and consumpt
>low memory.
I was asking for evidence that "HTML parsers construct huge hashes"
which is what I think you meant.
Michele
Re: "negative" regexp
am 31.01.2008 17:11:33 von Uri Guttman
>>>>> "PV" == Petr Vileta writes:
PV> No, I mean not ideal for using universally. I have concrete goal and I
PV> use as minimal resource as possible. For example if I want to extract
PV> clicable email addresses from html source I need to extract all
PV> /href=['"]*mailto:\s*(.+?)['"\s>/
PV> only.
besides the typo (no close ] on the right), that wouldn't always
work. it allows for an open ' and a closing " which is wrong. it doesn't
handle html comments which shouldn't be parsed for email
addresses. there are other problems with it that i can't get into. so
even a 'simple' thing like that is much harder to extract with a regex
than you think. use a module designed and tested to parse html and email
addresses. it is actually simpler coding from your point of view and
correct as well! and correct beats efficient every day.
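A small sketch of two of these problems. The input is made up, and the "stricter" pattern is only an illustration of the quoting issue, not a recommended extractor.

```perl
use strict;
use warnings;

# Hypothetical input, made up for illustration.
my $html = <<'HTML';
<a href="mailto:good@example.com">ok</a>
<a href='mailto:odd@example.com">mismatched quotes</a>
<!-- <a href="mailto:hidden@example.com">commented out</a> -->
HTML

# The quoted pattern with the missing ] restored.
my @loose = $html =~ /href=['"]*mailto:\s*(.+?)['"\s>]/g;

# Slightly stricter: the closing quote must equal the opening one (\1).
my @strict;
while ($html =~ /href=(["'])mailto:\s*(.+?)\1/g) {
    push @strict, $2;
}

print "loose:  @loose\n";
print "strict: @strict\n";
```

The loose pattern accepts all three addresses, including the one with mismatched quotes. The stricter one rejects that, but still picks up the address inside the comment.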
uri
PV> HTML:Parser and WWW:Mechanize are good modules but in many case these
PV> are "too big gun" :-)
better a big accurate gun than a tiny pistol with no accuracy. you might
even shoot your eye out!
uri
--
Uri Guttman ------ uri@stemsystems.com -------- http://www.sysarch.com --
----- Perl Architecture, Development, Training, Support, Code Review ------
----------- Search or Offer Perl Jobs ----- http://jobs.perl.org ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
Re: "negative" regexp
am 31.01.2008 17:13:14 von Uri Guttman
>>>>> "PV" == Petr Vileta writes:
PV> Tad J McClellan wrote:
>> Petr Vileta wrote:
>>
>>
>>> $html=~s/<\!\-\-.+?\-\->//sig;
>>
>>
>> Unnecessary backslashes make your code much harder to read
>> and understand. You should backslash only when you actually
>> need to.
>>
>> There is not much point in ignoring case when your pattern
>> does not contain any letters...
>>
>>
>> $html =~ s///sg;
PV> Yes, you are right, but O'Reilly book "Programin Perl" say "... any
PV> other escaped character is character itself". Maybe this is not
PV> correct cite, I have Czech version. In other word the character - is
PV> sometime "range operator" say in case [a-z] and character ! sometime
PV> mean "not". So for to be sure a character is a character but not
PV> operator I'm used to escape all possible ambiguous characters ;-)
that is chicken programming. it leads to noise and potentially buggy
programs. i wouldn't accept it for any production code or code that i
review.
uri
Re: "negative" regexp
am 31.01.2008 21:59:36 von Michele Dondi
On Thu, 31 Jan 2008 15:05:35 +0100, "Petr Vileta" wrote:
>No, I mean not ideal for using universally. I have concrete goal and I use as
>minimal resource as possible. For example if I want to extract clicable email
>addresses from html source I need to extract all
Do not misunderstand me: I also use hand-rolled regexen to *extract*
some info from specific pages. Since the kind of data I need depends
on the actual structure of the pages, using a real parser would still
make the program fail if the structure were to change, for example.
OTOH if I need to extract links (a classic task) from heterogeneous
pages, then a parser is best suited. Do not confuse
*extracting* needs with *parsing* needs, although the two can
overlap.
I was *just* commenting on your claim that HTML parsing modules "build
large hashes", which IMHO is not (necessarily) the case. And I'm still
asking you for some evidence.
Michele
Re: "negative" regexp
am 01.02.2008 01:10:33 von Tad J McClellan
Petr Vileta wrote:
> Tad J McClellan wrote:
>> Petr Vileta wrote:
>>
>>
>>> $html=~s/<\!\-\-.+?\-\->//sig;
>>
>>
>> Unnecessary backslashes make your code much harder to read
>> and understand. You should backslash only when you actually
>> need to.
>>
>> There is not much point in ignoring case when your pattern
>> does not contain any letters...
>>
>>
>> $html =~ s///sg;
> Yes, you are right, but O'Reilly book "Programin Perl" say "... any other
> escaped character is character itself".
.... and in the regular expression language, many UNescaped characters
also are the character itself!
> Maybe this is not correct cite, I have
> Czech version.
You have not given enough information to be able to find it in the
Camel book. What chapter/section? What edition?
> In other word the character - is sometime "range operator"
.... and it is sometimes "subtraction operator".
> say
> in case [a-z]
That is not the regular expression language (grammar), that is
in the character class language.
In the Perl language, the character - is subtraction.
In the regular expression language, the character - is not special,
it matches a - character.
In the character class language, the character - forms a range.
You have to know which language you are in before you can properly
discern what all those funny characters mean.
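A short illustration of the same character in the three "languages" (all values made up):

```perl
use strict;
use warnings;

# Perl language: - is subtraction.
my $diff = 7 - 2;

# Regular expression language: - is an ordinary character.
my $has_dash = "a-z" =~ /a-z/   ? 1 : 0;   # matches the literal text "a-z"
my $no_range = "m"   =~ /^a-z$/ ? 1 : 0;   # no range here, so "m" does not match

# Character class language: - forms a range.
my $in_range = "m" =~ /^[a-z]$/ ? 1 : 0;

print "$diff $has_dash $no_range $in_range\n";   # 5 1 0 1
```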
> and character ! sometime mean "not".
.... in the Perl language.
It is not special in either the regular expression language nor in
the character class language.
> So for to be sure a
> character is a character but not operator
then all you need to know is which language you are currently writing in.
> I'm used to escape all possible
> ambiguous characters ;-)
If you code from ignorance, you end up with ignorant code.
Simply learn Perl and its "mini languages", then you will be really sure,
and your code won't look so embarrassingly amateurish (as well as be
much easier to maintain).
For a bit more on this, see:
http://groups.google.com/group/comp.lang.perl.misc/msg/a218a97e390c892a
Re: "negative" regexp
am 01.02.2008 01:10:33 von Tad J McClellan
Uri Guttman wrote:
>>>>>> "PV" == Petr Vileta writes:
[ snip "parsing" HTML with a regex ]
> PV> HTML:Parser and WWW:Mechanize are good modules but in many case these
> PV> are "too big gun" :-)
>
> better a big accurate gun than a tiny pistol with no accuracy. you might
> even shoot your eye out!
Or your foot!
http://groups.google.com/group/comp.lang.perl.misc/msg/a97f4d7d02afa8ff
Re: "negative" regexp
am 01.02.2008 01:34:48 von Uri Guttman
>>>>> "TJM" == Tad J McClellan writes:
TJM> Uri Guttman wrote:
>>>>>>> "PV" == Petr Vileta writes:
TJM> [ snip "parsing" HTML with a regex ]
PV> HTML:Parser and WWW:Mechanize are good modules but in many case these
PV> are "too big gun" :-)
>>
>> better a big accurate gun than a tiny pistol with no accuracy. you might
>> even shoot your eye out!
TJM> Or your foot!
TJM> http://groups.google.com/group/comp.lang.perl.misc/msg/a97f4 d7d02afa8ff
ha!
i was referring to jean shepherd's a christmas story. :)
use Red::Ryder::BB ;
uri
Re: "negative" regexp
am 01.02.2008 02:14:16 von Petr Vileta
Michele Dondi wrote:
> On Thu, 31 Jan 2008 15:05:35 +0100, "Petr Vileta"
> wrote:
>
> I was *just* commenting on you claim that HTML parsing modules "build
> large hashes" which IMHO is not (necessarily) the case. And I'm still
> asking you for some evidence.
>
Michele, I have no time to prepare a concrete example, but please do the math with
me:
when I load an HTML page of, say, 100kB into a string variable, the script occupies
100kB + 4 (?) bytes for the variable pointer. When I parse it with HTML::Parser into
a hash, I will get a hash with 100, 200, 1000 (?) items. Each of these items
must occupy space for its own name (as text) and for pointers to parent and child
items. Maybe this is not a correct description of how a hash is laid out in memory, but
it may be close to the truth ;-) In other words, if you use my way and dump all
memory occupied by the Perl script into a file, that file may be about
200kB. If you use the parser and dump to a file, the file may be anywhere from
300 to 500kB, depending on the complexity of the HTML. I'm an old-fashioned programmer
who began with assembler on 8-bit computers, so I still tend to save
memory, disk space and CPU time whenever possible :-) We have a
saying in Czech: "You can not teach an old dog to do new stunts".
Re: "negative" regexp
am 01.02.2008 03:41:57 von Petr Vileta
Tad J McClellan wrote:
>> Yes, you are right, but O'Reilly book "Programin Perl" say "... any
>> other escaped character is character itself".
>
>
> ... and in the regular expression language, many UNescaped characters
> also are the character itself!
>
Of course, sir ;-) But if I'm not sure whether a character could be an operator, I
escape it; if I'm not sure about precedence in a calculation, I add
"needless" parentheses ;-)
> You have not given enough information to be able to find it in the
> Camel book. What chapter/section? What edition?
>
I will try, but I have the Czech edition, so my translation may not be
accurate and the page numbers may differ.
Larry Wall, Tom Christiansen & Randal L. Schwartz
Programming Perl
Original copyright: 1996 O'Reilly and Associates Inc.
Translation: 1997 Computer Press, Prague, Czech Republic
Chapter "2. Basic program parts", page 69, "comparing by patterns"
Code Meaning
-----------------------
\a signal
\n new line
.....
\S other than blank character
.....
The character "c" preceded by a backslash and followed by a single character, for
example \cD, is identical to the corresponding control character.
Any other character preceded by a backslash is identical to the character itself.
> Simply learn Perl and its "mini languages", then you will be really
> sure, and your code won't look so embarrassingly amateurish (as well
> as be much easier to maintain).
>
I endeavour to learn Perl and all its parts, nuances and tricks, but I'm from the
"lost post-communist generation". I'm 50+ now and I started learning English
only a few years ago, which is rather late. We had the "iron curtain" here for 40
years and had no chance to get information from the "free world". When you are
20 or 30, you can learn 2-3 human languages and many programming
languages because your memory is able to absorb information. But as you get
older, your memory gets worse and worse at absorbing new
information.
But enough lamenting - I'm happy that we no longer have communism here :-) and that I
can communicate with people all over the world.
Re: "negative" regexp
am 01.02.2008 04:40:44 von John Bokma
"Petr Vileta" wrote:
> Michele Dondi wrote:
>> On Thu, 31 Jan 2008 15:05:35 +0100, "Petr Vileta"
>> wrote:
>>
>> I was *just* commenting on you claim that HTML parsing modules "build
>> large hashes" which IMHO is not (necessarily) the case. And I'm still
>> asking you for some evidence.
>>
> Michele, I have no time to prepare concrete example, but please
> compute with
> me:
>
> when I load html page say 100kB into string type variable then script
> occupy 100kB + 4 (?) bytes for varable pointer. When I parse it by
> HTML::Parser into has then I will get hash with 100, 200, 1000 ? hash
> items. All of these items must ocupy space for own name (as text) and
> pointers to parent and child items. Maybe this is not correct
> definition of hash structure in memory, but maybe is near to true ;-)
Not all HTML parsers create an entire tree in memory. I have more
experience with XML parsers, but with XML you have parsers that generate
events for each element encountered (to be more exact, the start of an
element, character data, end of an element, and possibly some more). If
you don't store it yourself, nothing is stored. Those are great if you
want to get some information from a huge file, for example.
I just did a quick check, and HTML::PullParser does sound to me like it
works along those lines:
"repeatedly call $parser->get_token to obtain the tags and text found in
the parsed document."
And I have the feeling that even HTML::Parser works (or can work) that
way.
> memory, disk space, number of CPU usage anytime it is possible :-) We
> have saying in Czech "You can not teach an old dog to do new stunts".
But you're not a dog ;-) The problem is that an old dog often has become a
part of the family and knows it gets its food and walks anyway. There is
no need to learn new tricks, the old ones will work.
--
John
Arachnids near Coyolillo - part 1
http://johnbokma.com/mexit/2006/05/04/arachnids-coyolillo-1.html
Re: "negative" regexp
am 01.02.2008 06:25:00 von jurgenex
"Petr Vileta" wrote:
>Michele Dondi wrote:
>> On Thu, 31 Jan 2008 15:05:35 +0100, "Petr Vileta"
>> wrote:
>>
>> I was *just* commenting on you claim that HTML parsing modules "build
>> large hashes" which IMHO is not (necessarily) the case. And I'm still
>> asking you for some evidence.
>>
> Michele, I have no time to prepare concrete example, but please compute with
>me:
>
>when I load html page say 100kB into string type variable then script occupy
>100kB + 4 (?) bytes for varable pointer. When I parse it by HTML::Parser into
>has then I will get hash with 100, 200, 1000 ? hash items. All of these items
>must ocupy space for own name (as text) and pointers to parent and child
>items.
Well, no. Or at least not necessarily. Just check the documentation for e.g.
HTML::Parser. It clearly says:
Objects of the "HTML::Parser" class will recognize markup and separate
it from plain text (alias data content) in HTML documents. As different
kinds of markup and text are recognized, the corresponding event
handlers are invoked.
In other words unless _YOU_ define a call-back that stores those elements
nothing will be stored. This way you can extract exactly _what_ you want and
store it in the _way_ you want it.
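A minimal sketch of that, applied to the original question (drop every tag except img). It assumes HTML::Parser is installed from CPAN; the input string is made up. The only thing kept in memory is the output string that the callbacks build.

```perl
use strict;
use warnings;
use HTML::Parser ();   # from CPAN, not a core module

# Hypothetical input, made up for illustration.
my $html = "<b><a title='a > b' href='abc.htm'><img src='x.png'>click</a></b>";

my $out = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    # Keep the original text of <img> start tags, drop all other start tags.
    start_h => [ sub { my ($tag, $text) = @_; $out .= $text if $tag eq 'img' },
                 'tagname, text' ],
    # Keep character data unchanged.
    text_h  => [ sub { $out .= shift }, 'text' ],
);
$parser->parse($html);
$parser->eof;

print "$out\n";   # <img src='x.png'>click
```

End tags, comments and other events have no handler here, so they are simply skipped.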
Re: "negative" regexp
am 01.02.2008 11:06:30 von rvtol+news
Petr Vileta schreef:
> Example for convert any basic html page to plain text.
Use "|links -dump" as a filter.
--
Affijn, Ruud
"Gewoon is een tijger."
Re: "negative" regexp
am 01.02.2008 15:54:51 von Petr Vileta
Jürgen Exner wrote:
> In other words unless _YOU_ define a call-back that stores those
> elements nothing will be stored. This way you can extract exactly
> _what_ you want and store it in the _way_ you want it.
Yes, sorry, I confused HTML::Parser with HTML::Tree.