Regex to get the <html></html>

am 02.08.2007 17:48:24 von FFMG

Hi,

I want to get the <head></head> code, and a 'simple?' solution seems to
be...

preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
$matches, PREG_SET_ORDER);

but I want to make sure that there isn't a better solution to the
problem especially if the head contains invalid code like...

//--

//--

Unfortunately this is my own HTML, so I cannot ignore invalid code
like the example above.

So...
How can I change my regex to ignore head tags inside double or single
quotes?
How can I match code inside the head that spans multiple lines?

Any suggestions?

FFMG

--

'webmaster forum' (http://www.httppoint.com) | 'Free Blogs'
(http://www.journalhome.com/) | 'webmaster Directory'
(http://www.webhostshunter.com/)
'Recreation Vehicle insurance'
(http://www.insurance-owl.com/other/car_rec.php) | 'Free URL
redirection service' (http://urlkick.com/)
------------------------------------------------------------ ------------
FFMG's Profile: http://www.httppoint.com/member.php?userid=580
View this thread: http://www.httppoint.com/showthread.php?t=19012

Message Posted via the webmaster forum http://www.httppoint.com, (Ad revenue sharing).

Re: Regex to get the <html></html>

am 02.08.2007 18:33:00 von luiheidsgoeroe

On Thu, 02 Aug 2007 17:48:24 +0200, FFMG wrote:
> I want to get the code and a 'simple?' solution seems to be
> be...
>
> preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> $matches, PREG_SET_ORDER);

Euhm, nope. You start on an undefined tag (lose the square brackets around
'[html]'; as written it is a character class, not a tag name), and you're
matching the html tag, not the head tag.
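
A small sketch of the point about the brackets. The sample strings are made up for illustration, and the second pattern is only one plausible reading of the fix, not something posted in the thread:

```php
<?php
// The opening-tag part of the pattern from the original post.
// [html]+ is a character class: one or more of the letters h, t, m, l.
$open = '/<[html]+[^>]*>/i';

$matchesHtml  = preg_match($open, '<html lang="en">'); // the intended match
$matchesTitle = preg_match($open, '<title>');          // "t" satisfies [html]+, "itle" satisfies [^>]*
$matchesBody  = preg_match($open, '<body>');           // "b" is not in the class

var_dump($matchesHtml, $matchesTitle, $matchesBody);

// Literal tag name, plus the "s" modifier so that "." also matches newlines.
// The lazy (.*?) stops at the first </head>, even one inside a quoted
// attribute, so this does not address the invalid-markup case.
$head = '/<head\b[^>]*>(.*?)<\/head>/is';
preg_match($head, "<head>\n<title>x</title>\n</head>", $m);
var_dump($m[1]);
```

The first pattern accepts `<title>` just as readily as `<html>`, which is why the brackets have to go.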

> but I want to make sure that there isn't a better solution to the
> problem especially if the head contains invalid code like...
>
> //--
>
> " />
>
> //--

DOM functions?

> How can I change my regex to ignore head tags inside double or single
> quotes?

Could be done by setting a greedy match starting on a quote until the
end quote. Then again, if you're concerned with invalid attributes, you'd
have to allow for the possibility the quotes are erroneous too, i.e.
someone forgot to open or close them.

I've taken a stab at it with regexes in the past, which works quite well
as long as you can be sure it's strictly valid HTML. If it isn't, or you're
using outside sources where this isn't known, don't use regular
expressions for something a parser ought to be doing.
--
Rik Wasmus

Re: Regex to get the <html></html>

am 02.08.2007 19:36:32 von gosha bine

FFMG wrote:
> Hi,
>
> I want to get the code and a 'simple?' solution seems to be
> be...
>
> preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> $matches, PREG_SET_ORDER);
>
> but I want to make sure that there isn't a better solution to the
> problem especially if the head contains invalid code like...
>
> //--
>
>
>
> //--
>
> unfortunately this is my html code so I cannot ignore invalid
> like the one above.
>
> So...
> How can I change my regex to ignore head tags inside double or single
> quotes?

I'd suggest

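$re = <<<HTML
~
<\w+ \b
(?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
>
| </\w+ \s* >   # closing tag; this alternative is missing from the archived post and is reconstructed here
| [^<]+
| <
~six
HTML;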

This should be able to parse most HTML or HTML-like streams, even
hopelessly malformed ones.

This is how it works with your example:

///

$html = 'text

more text';

preg_match_all($re, $html, $m);
print_r($m[0]);

///

output:

Array
(
[0] => text
[1] =>
[2] =>

[3] =>
[4] =>

[5] =>
[6] => more text
)
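
A self-contained sketch of this tokenising approach. The input string and the closing-tag alternative are assumptions, since the archived post does not show them; the rest follows the pattern as posted:

```php
<?php
// Lexer pattern in the style suggested in this post. The closing-tag
// alternative (second line) is an assumption for illustration.
$re = <<<'RE'
~
    <\w+ \b (?: " [^"]* " | ' [^']* ' | [^"'>]+ )* >   # opening tag, quote-aware
  | </\w+ \s* >                                        # closing tag (assumed)
  | [^<]+                                              # text between tags
  | <                                                  # stray "<"
~six
RE;

// Hypothetical input: a meta tag whose attribute value contains "</head>".
$html = "text<head>\n<meta name='x' content=\"</head>\">\n</head>more text";

preg_match_all($re, $html, $m);
print_r($m[0]);
```

Because quoted attribute values are consumed as part of the opening-tag token, the `</head>` inside `content="..."` never shows up as a token of its own.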

--
gosha bine

extended php parser ~ http://code.google.com/p/pihipi
blok ~ http://www.tagarga.com/blok

Re: Regex to get the <html></html>

am 02.08.2007 19:39:12 von FFMG

Rik;84869 Wrote:
> On Thu, 02 Aug 2007 17:48:24 +0200, FFMG wrote:
> > I want to get the code and a 'simple?' solution seems to be
> > be...
> >
> > preg_match_all("/<[html]+[^>]*>\s*(.*\s*)<\/html>\s*/i", $html,
> > $matches, PREG_SET_ORDER);
>
> Euhm, nope. You start on an undefined tag (lose the square brackets around
> '[html]'), and you're matching the html tag, not the head tag.
>

Of course, thanks. Must have been a typo.

Rik;84869 Wrote:
>
>
> DOM functions?
>
> > How can I change my regex to ignore head tags inside double or
> single
> > quotes?
>
> Could be done by setting a greedy match starting on a quote until the
> end quote. Then again, if you're concerned with invalid attributes,
> you'd have to allow for the possibility the quotes are erroneous too,
> i.e. someone forgot to open or close them.
>
> I've taken a stab at it with regexes in the past, which works quite
> well as long as you can be sure it's strictly valid HTML. If it isn't,
> or you're using outside sources where this isn't known, don't use
> regular expressions for something a parser ought to be doing.
> --
> Rik Wasmus

Thanks, are you suggesting that I walk the text, first look for the
open tag, then look for the close tag that is not within a quote?

I guess a simple function could do that.

Would you know of such a function, or would I need to write one :)?

FFMG


Re: Regex to get the <html></html>

am 02.08.2007 19:42:56 von gosha bine

Rik wrote:
> DOM functions?

Yes, or a SAX parser like pear HTML_SAX. DOM is still too picky about
invalid html.

> I've taken a stab at it with regexes in the past, which works quite well
> as long as you can be sure it's strictly valid HTML. If it isn't, or
> you're using outside sources where this isn't known, don't use regular
> expressions for something a parser ought to be doing.

Yes, regexps suck as a parser, however in most cases you need just a
lexer and that's the job regexps do quite well.
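
To make the lexer-versus-parser distinction concrete, here is a rough sketch that tokenises first and then walks the tokens to pull out the head content. `extract_head` is a hypothetical helper name, and the closing-tag alternative in the pattern is an assumption:

```php
<?php
// Hypothetical helper: tokenise, then collect everything between the
// first <head ...> token and the first </head> token.
function extract_head(string $html): ?string
{
    // Compact form of the lexer idea; the </\w+\s*> alternative is assumed.
    $re = '~<\w+\b(?:"[^"]*"|\'[^\']*\'|[^"\'>]+)*>|</\w+\s*>|[^<]+|<~si';
    preg_match_all($re, $html, $m);

    $inside = false;
    $buffer = '';
    foreach ($m[0] as $token) {
        if (!$inside) {
            // An opening head tag, with or without attributes.
            if (preg_match('~^<head\b~i', $token)) {
                $inside = true;
            }
            continue;
        }
        if (preg_match('~^</head\s*>$~i', $token)) {
            return $buffer; // reached the real closing tag
        }
        $buffer .= $token;
    }
    return null; // no explicit <head>...</head> pair found
}

// Made-up input with "</head>" inside a quoted attribute value.
$in = "text<head>\n<meta name='x' content=\"</head>\">\n</head>more text";
var_dump(extract_head($in));
```

The regex only splits the stream into tokens; the decision about where the head starts and ends is made by ordinary code, which is the division of labour described above. It still returns null when the head tags are omitted altogether.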

--
gosha bine


Re: Regex to get the <html></html>

am 02.08.2007 21:21:07 von FFMG

gosha bine;84884 Wrote:
>
> I'd suggest
>
> $re = <<<HTML
> ~
> <\w+ \b
> (?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
> >
> |
> | [^<]+
> | <
> ~six
> HTML;
>
>

This looks great, and works for all cases.

I am just curious, why does...
// --

";

// --
Return
// --
0:
1:
2:
3:
4:
5:

What are the blank/empty lines matching?

FFMG


Re: Regex to get the <html></html>

am 03.08.2007 01:15:54 von gosha bine

FFMG wrote:
> gosha bine;84884 Wrote:
>> I'd suggest
>>
>> $re = <<<HTML
>> ~
>> <\w+ \b
>> (?: " [^"]* " | ' [^']* ' | [^"'>]+ )*
>> >
>> |
>> | [^<]+
>> | <
>> ~six
>> HTML;
>>
>>
>
> This looks great, and works for all cases.
>
> I am just curious, why does...
> // --
>
>
> ";
>
> // --
> Return
> // --
> 0:
> 1:
> 2:
> 3:
> 4:
> 5:
>
> What are the blank/empty lines matching?
>
> FFMG
>
>

2 and 4 match newlines that are there after head and meta tags.

--
gosha bine


Re: Regex to get the <html></html>

am 03.08.2007 12:20:02 von Toby A Inkster

FFMG wrote:

> I want to get the code and a 'simple?' solution seems to be
> be...

There is no simple solution. In HTML, the start and end tags for the
<head> element are *optional* -- in other words, the following valid
document is considered to have a head element containing one TITLE and
one META element:

"http://www.w3.org/TR/html4/strict.dtd">
Foobar

Foobar

Foo bar baz.

Your regular expression will not find the <head> element, which *is*
there, even if you can't explicitly see the beginning and end!

Best to use PHP's DOM stuff, as Rik mentioned.
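
A minimal sketch of the DOM route, assuming PHP's DOMDocument (backed by libxml2) supplies the implied head element. The document below is a made-up stand-in with no explicit html, head or body tags, and its meta content is invented for illustration:

```php
<?php
// Hypothetical document: no explicit <html>, <head> or <body> tags.
$html = <<<'HTML'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<title>Foobar</title>
<meta name="description" content="</head>">
<h1>Foobar</h1>
<p>Foo bar baz.
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// The parser is expected to create the head element itself.
$head    = $doc->getElementsByTagName('head')->item(0);
$names   = [];
$content = null;
if ($head !== null) {
    foreach ($head->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $names[] = $child->nodeName;
        }
    }
    $meta = $head->getElementsByTagName('meta')->item(0);
    if ($meta !== null) {
        $content = $meta->getAttribute('content');
    }
}

print_r($names);
var_dump($content);
```

How forgiving the parser is with badly broken markup depends on the libxml2 version in use, so this is worth trying against the real pages before relying on it.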

--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.12-12mdksmp, up 43 days, 13:49.]

Command Line Interfaces, Again
http://tobyinkster.co.uk/blog/2007/08/02/command-line-again/