Parse HTML

on 25.07.2011 22:17:57 by Jeffrey Joh

Hello, I'm trying to parse HTML files. I want to extract values from tables (1) and from text fields (2).

(1) <br ="" width="1" height="1" border="0">

Floor plan:

Ranch #1

(2)
e="04/01/2004" size="10" disabled>

I would want to retrieve the floor plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file (along with many other text boxes). What is an easy way of doing that?

Jeff


Re: Parse HTML

on 25.07.2011 22:30:17 by Shlomi Fish

Hi Jeffrey,

On Mon, 25 Jul 2011 13:17:57 -0700
Jeffrey Joh wrote:

> Hello, I'm trying to parse HTML files. I want to extract values from tables
> (1) and from text fields (2). (1) ""<br > width="1" height="1" border="0"> > valign="top">Floor plan:
> Ranch #1
> (2)
> > value="04/01/2004" size="10" disabled> I would want to retrieve the floor
> plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file
> (along with many other text boxes). What is an easy way of doing that? Jeff

You should use an HTML parser for that:

http://perl-begin.org/uses/text-parsing/

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Interview with Ben Collins-Sussman - http://shlom.in/sussman

Had I not been already insane, I would have long ago driven myself mad.
    -- The Enemy and how I Helped to Fight It

Please reply to list if it's a mailing list post - http://shlom.in/reply .

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Parse HTML

on 25.07.2011 22:35:35 by Jim Gibson

On 7/25/11 Mon Jul 25, 2011 1:30 PM, "Shlomi Fish"
scribbled:

> On Mon, 25 Jul 2011 13:17:57 -0700
> Jeffrey Joh wrote:

>>
>> Hello, I'm trying to parse HTML files.
>>
>
> You should use an HTML parser for that:
>
> http://perl-begin.org/uses/text-parsing/

Also look at HTML::TableExtract (I have not used it).
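
For readers who want to try that suggestion, here is a minimal sketch of reading label/value rows with HTML::TableExtract. The markup is hypothetical (the HTML in the original question did not survive intact), and the "Floor plan:" label test is an assumption about how the real pages are laid out. Note that, as far as I understand the module, it returns the text of table cells, so a date stored in the value attribute of an input element would still need a tree-based parser.

```perl
use strict;
use warnings;

use HTML::TableExtract;

# Hypothetical markup, assumed for illustration only.
my $html = <<'END_HTML';
<table>
  <tr><td valign="top">Floor plan:</td><td>Ranch #1</td></tr>
  <tr><td valign="top">Floor plan:</td><td>Mission #3</td></tr>
</table>
END_HTML

# With no headers/depth/count constraints, every table is extracted.
my $te = HTML::TableExtract->new;
$te->parse($html);
$te->eof;

my @plans;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        # Empty cells may come back undefined, so normalise them first.
        my @cells = map { defined $_ ? $_ : '' } @$row;
        next unless @cells >= 2;

        my ($label, $value) = @cells[0, 1];
        s/^\s+|\s+$//g for $label, $value;

        push @plans, $value if $label eq 'Floor plan:';
    }
}

print "Floor plan: $_\n" for @plans;
```

If the real tables have header rows, the headers option of the constructor may be a better way to select columns than matching a label cell.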

pm>


Re: Parse HTML

on 26.07.2011 00:36:03 by rvtol+usenet

On 2011-07-25 22:35, Jim Gibson wrote:
> Shlomi:
>> Jeffrey:

>>> Hello, I'm trying to parse HTML files.
>>
>> You should use an HTML parser for that:
>>
>> http://perl-begin.org/uses/text-parsing/
>
> Also look at HTML::TableExtract (I have not used it).
>
>

The 'permalink' is on the top-right of such a page:
http://search.cpan.org/perldoc?HTML::TableExtract

--
Ruud


Re: Parse HTML

on 26.07.2011 17:48:41 by Rob Dixon

On 25/07/2011 21:17, Jeffrey Joh wrote:
>
> Hello, I'm trying to parse HTML files. I want to extract values from
> tables (1) and from text fields (2). (1) > src="/image.gif" alt="" width="1" height="1" border="0">
>
>
> Floor plan:
>
> Ranch #1
> (2)
> I would want to retrieve the floor plan (Ranch #1) and the date constructed (04/01/2004) from each HTML file (along with many other text boxes). What is an easy way of doing that? Jeff

Hello Jeff

I am unclear what you want to do. The HTML fragments you have shown are
syntactically incorrect, and in any case are irrelevant out of the
context of a complete HTML document.

However I think I can help a little. The HTML::TreeBuilder module will
build an HTML::Element object for you that you can navigate, modify, and
extract data from. It is very forgiving of incorrect syntax, and will
try to build a complete HTML document from any fragment that you offer it.

The program below seems to do what you want, but without testing against
the complete data that you are dealing with I cannot vouch for its
correctness. In particular you should add checks to verify that the HTML
you are working with looks as you expect it to. I have written a couple
such checks, but only you can improve on those.

HTH,

Rob

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print "Working from HTML:\n\n";
print $tree->as_HTML(undef, ' '), "\n\n";

# Find an element with an 'id' attribute of 'date_constructed'
# (there should be only one). The date required comes from the 'value'
# attribute of that element.
#
my $date_tr = $tree->look_down(
    _tag => 'input',
    id   => 'date_constructed',
)
    or die "No construction date";
my $plan_date = $date_tr->attr('value');

# Now look up the tree to the containing tr element, and find its previous
# sibling tr, which contains the floor plan text in its second child td
# element
#
my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
my @tds = $plan_tr->look_down(_tag => 'td');
die "Unexpected format" unless @tds == 2;

my $plan_text = $tds[1]->as_trimmed_text;

print "Plan found: $plan_text on $plan_date\n";

__DATA__

Floor plan:

Ranch #1

**OUTPUT**

Working from HTML:

Floor plan:	Ranch #1

Plan found: Ranch #1 on 04/01/2004

Tool completed successfully
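
Because the markup after __DATA__ did not survive in this archive, the program above cannot be run as shown. The following is a self-contained sketch of the same approach, with the extra checks the post recommends. The table layout in $html is an assumption reconstructed from the discussion (a row with two cells, followed by a row holding the input), and find_plan is a hypothetical helper name, not part of HTML::TreeBuilder.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical markup: an assumption based on the thread, not the
# poster's real document.
my $html = <<'END_HTML';
<table>
  <tr>
    <td valign="top">Floor plan:</td>
    <td>Ranch #1</td>
  </tr>
  <tr>
    <td colspan="2"><input type="text" id="date_constructed" value="04/01/2004" size="10" disabled></td>
  </tr>
</table>
END_HTML

# Returns (plan text, date string), or an empty list if the document
# does not have the expected shape.
sub find_plan {
    my ($tree) = @_;

    my $input = $tree->look_down(_tag => 'input', id => 'date_constructed')
        or return;

    my $date_row = $input->look_up(_tag => 'tr')
        or return;

    # In scalar context, left() returns the immediate left sibling, which
    # may be undefined or a plain text string rather than an element.
    my $plan_row = $date_row->left;
    return unless ref($plan_row) && $plan_row->tag eq 'tr';

    my @cells = $plan_row->look_down(_tag => 'td');
    return unless @cells == 2;

    return ($cells[1]->as_trimmed_text, $input->attr('value'));
}

my $tree = HTML::TreeBuilder->new_from_content($html);
my ($plan, $date) = find_plan($tree);

print defined $plan ? "Plan found: $plan on $date\n" : "No plan found\n";

$tree->delete;
```

Returning an empty list instead of calling die lets the caller decide whether a missing plan is fatal, which may be convenient when many files are processed in one run.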


RE: Parse HTML

on 26.07.2011 22:12:13 by Jeffrey Joh

Hey Rob,

This is awesome! However, let's say I have an unknown number of floorplans in a table that looks like this:

Floor plan:
Ranch #1

ed" value="04/01/2004" size="10" disabled>

Floor plan:
Mission #3

ed" value="08/01/2009" size="10" disabled>

Floor plan:
Big house #9

ed" value="last summer" size="10" disabled>

RE: Parse HTML

on 26.07.2011 22:12:43 by Jeffrey Joh

Hey Rob,

This is awesome! However, let's say I have an unknown number of floorplans in a table that looks like this:

Floor plan:
Ranch #1

ed" value="04/01/2004" size="10" disabled>

Floor plan:
Mission #3

ed" value="08/01/2009" size="10" disabled>

Floor plan:
Big house #9

ed" value="last summer" size="10" disabled>

I would like to retrieve all of the plan/date/IDs, AND discard all those plans that do not have a proper date_constructed such as "last summer". How could I do that?

Jeff

RE: Parse HTML

on 27.07.2011 00:27:33 by Shawn Wilson

Ya know, I'm sure there's a place for all of these. However, Web::Scraper
works great with the XPath that element inspectors return. It's real easy to
use and you can easily return variable types that suit your output best,
i.e. a hash with field names per table element for DBIC.
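
A minimal sketch of that idea, assuming the same hypothetical table layout discussed earlier in the thread (the real markup did not survive here). The two XPath expressions and the result keys plans and dates are illustrative choices, not something taken from the original pages.

```perl
use strict;
use warnings;

use Web::Scraper;

# Hypothetical markup, assumed for illustration only.
my $html = <<'END_HTML';
<table>
  <tr><td valign="top">Floor plan:</td><td>Ranch #1</td></tr>
  <tr><td colspan="2"><input type="text" id="date_constructed" value="04/01/2004" size="10" disabled></td></tr>
</table>
END_HTML

# Each process line maps a selector (XPath here) to a key in the result
# hash; a trailing [] on the key collects every match into an array.
my $plan_scraper = scraper {
    process '//tr/td[2]', 'plans[]' => 'TEXT';
    process '//input[@id="date_constructed"]', 'dates[]' => '@value';
};

# A reference to a string is scraped as HTML content rather than fetched.
my $result = $plan_scraper->scrape(\$html);

my @plans = @{ $result->{plans} || [] };
my @dates = @{ $result->{dates} || [] };

for my $i (0 .. $#plans) {
    my $date = defined $dates[$i] ? $dates[$i] : 'unknown';
    print "Plan found: $plans[$i] on $date\n";
}
```

Pairing plans and dates by position only works if every plan row is followed by exactly one date input; if that cannot be guaranteed, walking the tree from each input, as in the HTML::TreeBuilder programs in this thread, is likely to be more robust.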

Re: Parse HTML

on 28.07.2011 21:25:26 by Rob Dixon

On 26/07/2011 21:12, Jeffrey Joh wrote:
> On 26 Jul 2011 16:48, Rob Dixon wrote:
>> On 25/07/2011 21:17, Jeffrey Joh wrote:
>>>
>>> Hello, I'm trying to parse HTML files. I want to extract values from
>>> tables (1) and from text fields (2). (1) >>> src="/image.gif" alt="" width="1" height="1" border="0">
>>>
>>> Floor plan:
>>> Ranch #1
>>> (2)
>>> >>> value="04/01/2004" size="10" disabled>
>>>
>>> I would want to retrieve the floor plan (Ranch #1) and the date
>>> constructed (04/01/2004) from each HTML file (along with many
>>> other text boxes). What is an easy way of doing that? Jeff
>>
>> I am unclear what you want to do. The HTML fragments you have shown are
>> syntactically incorrect, and in any case are irrelevant out of the
>> context of a complete HTML document.
>>
>> However I think I can help a little. The HTML::TreeBuilder module will
>> build an HTML::Element object for you that you can navigate, modify, and
>> extract data from. It is very forgiving of incorrect syntax, and will
>> try to build a complete HTML document from any fragment that you offer it.
>>
>> The program below seems to do what you want, but without testing against
>> the complete data that you are dealing with I cannot vouch for its
>> correctness. In particular you should add checks to verify that the HTML
>> you are working with looks as you expect it to. I have written a couple
>> such checks, but only you can improve on those.
>>
>>
>> use strict;
>> use warnings;
>>
>> use HTML::TreeBuilder;
>>
>> my $tree = HTML::TreeBuilder->new_from_file(*DATA);
>>
>> print "Working from HTML:\n\n";
>> print $tree->as_HTML(undef, ' '), "\n\n";
>>
>> # Find an element with an 'id' attribute of 'date_constructed'
>> # (there should be only one). The date required comes from the 'value'
>> # attribute of that element.
>> #
>> my $date_tr = $tree->look_down(
>> _tag => 'input',
>> id => 'date_constructed',
>> )
>> or die "No construction date";
>> my $plan_date = $date_tr->attr('value');
>>
>> # Now look up the tree to the containing element, and find its previous
>> # sibling which contains the floor plan text in the second child
>> # element
>> #
>> my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
>> my @tds = $plan_tr->look_down(_tag => 'td');
>> die "Unexpected format" unless @tds == 2;
>>
>> my $plan_text = $tds[1]->as_trimmed_text;
>>
>> print "Plan found: $plan_text on $plan_date\n";
>>
>> __DATA__
>>
>> Floor plan:
>> Ranch #1
>>
>>
>>
>> **OUTPUT**
>>
>> Plan found: Ranch #1 on 04/01/2004
>
> This is awesome! However, let's say I have an unknown number of
> floorplans in a table that looks like this:
>
>
> Floor plan:
> Ranch #1
>
> > value="04/01/2004" size="10" disabled>
>
>
>
> Floor plan:
> Mission #3
>
> > value="08/01/2009" size="10" disabled>
>
>
>
> Floor plan:
> Big house #9
>
> > value="last summer" size="10" disabled>
>
>

Hi Jeff

Please bottom-post your replies here. It is the standard for the list,
and long and complex threads can quickly become incomprehensible if
posts are made at both ends of the quoted message. Thank you.

To achieve this, all you need to do is find all of the input elements
with an id attribute of 'date_constructed'. The plan name can be found
from each of these as before. Take a look at the program below.

HTH,

Rob

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print "Working from HTML:\n\n";
print $tree->as_HTML(undef, ' '), "\n\n";

# Find all elements with an 'id' attribute of 'date_constructed'.
#
my @date_tr = $tree->look_down(
    _tag => 'input',
    id   => 'date_constructed',
)
    or die "No construction dates";

# Look at each element found, taking the date string from its 'value'
# attribute
#
for my $date_tr (@date_tr) {

    my $plan_date = $date_tr->attr('value');

    # Now look up the tree to the containing tr element, and find its previous
    # sibling tr, which contains the floor plan text in its second child td
    # element
    #
    my $plan_tr = $date_tr->look_up(_tag => 'tr')->left;
    my @tds = $plan_tr->look_down(_tag => 'td');
    die "Unexpected format" unless @tds == 2;

    my $plan_text = $tds[1]->as_trimmed_text;

    print "Plan found: $plan_text on $plan_date\n";
}

__DATA__

Floor plan:
Ranch #1

Floor plan:
Mission #3

Floor plan:
Big house #9

**OUTPUT**

Working from HTML:

Floor plan:	Ranch #1

Floor plan:	Mission #3

Floor plan:	Big house #9

Plan found: Ranch #1 on 04/01/2004
Plan found: Mission #3 on 08/01/2009
Plan found: Big house #9 on last summer
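
The output above still lists "Big house #9 on last summer", so the second half of the question (discarding plans without a proper date) needs one more step. Below is a sketch of one way to do it using only core Perl. It assumes the dates are written as MM/DD/YYYY, which the thread does not confirm (04/01/2004 could equally be DD/MM/YYYY), and is_valid_date is a hypothetical helper name.

```perl
use strict;
use warnings;

# Returns 1 if $text is a real calendar date in MM/DD/YYYY form, else 0.
# The MM/DD/YYYY ordering is an assumption.
sub is_valid_date {
    my ($text) = @_;
    return 0 unless defined $text;

    my ($month, $day, $year) = $text =~ m{\A(\d{2})/(\d{2})/(\d{4})\z}
        or return 0;

    return 0 if $month < 1 || $month > 12;

    my @days_in_month = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31);
    my $is_leap = ($year % 4 == 0 && $year % 100 != 0) || $year % 400 == 0;
    $days_in_month[1] = 29 if $is_leap;

    return $day >= 1 && $day <= $days_in_month[$month - 1] ? 1 : 0;
}

# The plan/date pairs printed by the program above.
my @found = (
    [ 'Ranch #1',     '04/01/2004'  ],
    [ 'Mission #3',   '08/01/2009'  ],
    [ 'Big house #9', 'last summer' ],
);

# Keep only the pairs whose date passes the check.
my @kept = grep { is_valid_date($_->[1]) } @found;

print "Plan kept: $_->[0] on $_->[1]\n" for @kept;
```

Inside the loop of the program above, the same check could be applied with a line such as `next unless is_valid_date($plan_date);` placed before the print statement.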
