Extracting body from HTML document?

on 14.11.2007 22:23:54 by Andre-John Mas

Hi,

I am wanting to be able to get a section of an HTML document, by
specifying an XPath. For example:

$title= GetSection ( '/html/head/title');
$body= GetSection ( '/html/body');

I made a simple parser myself some time back, but it is failing with
certain types of documents. Instead of maintaining the code, I would
rather find an existing solution, so that I can concentrate my
development efforts elsewhere. Does anyone have anything they can
recommend?

Andre

Re: Extracting body from HTML document?

on 14.11.2007 22:47:27 by Andre-John Mas

My current implementation is very basic. The main issue I am having is
that if there are any attributes associated with the start element,
then nothing is returned. While I can eventually solve this, I would
rather use a robust API, since there are certainly other issues I
might run into.

// Returns the text between the first $start tag and the next $end tag,
// or false if either is missing. Matching is literal, so a start tag that
// carries attributes (e.g. <body class="x">) is not found.
function GetElementByName($xml, $start, $end) {
    $startpos = strpos($xml, $start);
    if ($startpos === false) {
        return false;
    }
    $contentpos = $startpos + strlen($start);
    $endpos = strpos($xml, $end, $contentpos);
    if ($endpos === false) {
        return false;
    }
    return substr($xml, $contentpos, $endpos - $contentpos);
}

function XPathValue($XPath, $XML) {
    $node = $XML;
    // Skip the empty segment produced by the leading "/" in the path.
    foreach (array_filter(explode("/", $XPath), 'strlen') as $value) {
        $node = GetElementByName($node, "<$value>", "</$value>");
        if ($node === false) {
            return false;
        }
    }
    return $node;
}
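
To illustrate the attribute problem described above: a literal search for "<body>" cannot match "<body class=...>". The sketch below shows a possible regex-based stopgap. The helper name extractElement is made up for this example, and the pattern is an assumption of mine rather than part of the original code. It still breaks on nested elements of the same name and on attribute values containing ">", which is one reason a real parser is the safer choice.

```php
<?php
// Hypothetical stopgap helper (not from the original post): returns the
// content of the first <$name ...>...</$name> pair, tolerating attributes
// on the start tag, or false when no such pair is found.
function extractElement(string $xml, string $name)
{
    $quoted = preg_quote($name, '/');
    // \b stops "<b" from matching "<body"; [^>]* swallows any attributes.
    $pattern = '/<' . $quoted . '\b[^>]*>(.*?)<\/' . $quoted . '>/is';
    return preg_match($pattern, $xml, $m) === 1 ? $m[1] : false;
}

$withAttr = '<html><body class="main"><p>Hi</p></body></html>';

// The literal search used by the simple parser finds nothing here.
var_dump(strpos($withAttr, '<body>'));

// The regex variant returns the content between the body tags.
var_dump(extractElement($withAttr, 'body'));
```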

Re: Extracting body from HTML document?

on 14.11.2007 22:54:13 by luiheidsgoeroe

On Wed, 14 Nov 2007 22:23:54 +0100, Andre-John Mas wrote:

> Hi,
>
> I am wanting to be able to get a section of an HTML document, by
> specifying an XPath. For example:
>
> $title= GetSection ( '/html/head/title');
> $body= GetSection ( '/html/body');
>
> I made a simple parser myself some time back, but it is failing with
> certain types of documents. Instead of maintaining the code, I would
> rather find an existing solution, so that I can concentrate my
> development efforts elsewhere. Does anyone have anything they can
> recommend?

http://www.php.net/dom

<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('test.html');

// just by tag name:
$title = $doc->getElementsByTagName('title')->item(0);

// or XPath:
$xpath = new DOMXPath($doc);
$tables = $xpath->query('//table');
?>
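
As a rough sketch of how the calls above could be wrapped into something like the GetSection() from the original question: the version below takes the HTML as a string (an assumption made so the example is self-contained, the original signature only shows a path argument) and returns the markup of the first matching node. It relies on DOMDocument::saveHTML() accepting a node argument, which needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper modelled on the GetSection() call in the question.
// Returns the serialized markup of the first node matching $path,
// or null when the XPath matches nothing.
function GetSection(string $html, string $path): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // With a node argument, saveHTML() serializes only that subtree.
    return $doc->saveHTML($nodes->item(0));
}

$html = '<html><head><title>Demo</title></head>'
      . '<body class="main"><p>Hello</p></body></html>';

echo GetSection($html, '/html/head/title'), "\n";
echo GetSection($html, '/html/body'), "\n";
```

Because the parser handles the start tag, attributes such as class="main" on the body element no longer prevent a match.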
--
Rik Wasmus

Re: Extracting body from HTML document?

on 14.11.2007 23:46:43 by Andre-John Mas

On Nov 14, 4:54 pm, "Rik Wasmus" wrote:
>
> http://www.php.net/dom
>
> <?php
> $doc = new DOMDocument();
> $doc->loadHTMLFile('test.html');
>
> //just by tagname:
> $title= $doc->getElementsByTagName('title')->item(0);
>
> //or XPATH
> $xpath = new DOMXPath($doc);
> $tables = $xpath->query('//table');
> ?>
> --
> Rik Wasmus

Thanks for the answer, though I am not sure how to go from here to having
a sub-section of the HTML text. Basically what I am wanting to do is
extract the body section of an HTML document, to be able to insert it
into another.

Andre

Re: Extracting body from HTML document?

on 15.11.2007 05:26:31 by jebblue

On Wed, 14 Nov 2007 14:46:43 -0800, Andre-John Mas wrote:

>
> Thanks for the answer, though I am not sure how to go from here to having a
> sub-section of the HTML text. Basically what I am wanting to do is
> extract the body section of an HTML document, to be able to insert it
> into another.
>
> Andre

<?php
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.some_site_goes_here_or_some_file.nnn');

// just by tag name:
$body = $doc->getElementsByTagName('body')->item(0);
print nodeDump($body);

// or XPath:
//$xpath = new DOMXPath($doc);
//$tables = $xpath->query('//table');
//print $tables

// courtesy:
// "Dennis Shearin"
// 04-Jul-2007 04:17
// http://php.benscom.com/manual/fr/function.dom-domelement-construct.php
// Dumps a node via print_r(), then appends its tag name, attributes and value.
function nodeDump($node)
{
    $output = print_r($node, TRUE);
    $output = str_replace(")\n", '', $output);
    $output .= ' ' . '[tagName] => ' . $node->tagName . " \n";

    $numOfAttribs = $node->attributes->length;
    for ($i = 0; $i < $numOfAttribs; $i++)
    {
        $output .= ' [' . $node->attributes->item($i)->nodeName . '] => '
            . $node->attributes->item($i)->nodeValue . " \n";
    }

    $output .= ' [nodeValue] => ' . $node->nodeValue;
    $output .= ')';
    return $output;
}
?>
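
nodeDump() prints a debugging view of the node rather than its markup. If the goal is the body content as an HTML string that can be pasted into another document, one possible approach is to serialize each child of the body element with DOMDocument::saveHTML(). This is a minimal sketch under assumptions: the helper name bodyInnerHtml, the sample strings and the sprintf template are all invented for illustration, and saveHTML() with a node argument needs PHP 5.3.6 or later.

```php
<?php
// Hypothetical helper: returns the content of <body> without the
// <body> tags themselves, or an empty string if there is no body.
function bodyInnerHtml(string $html): string
{
    $doc = new DOMDocument();
    // Suppress warnings about invalid markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $inner = '';
    // Serialize each child node in document order.
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

$source = '<html><body id="src"><h1>Title</h1><p>Text</p></body></html>';
$template = '<html><body><div id="content">%s</div></body></html>';

// Insert the extracted body content into the (made-up) target template.
echo sprintf($template, bodyInnerHtml($source)), "\n";
```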

--
// This is my opinion.

Re: Extracting body from HTML document?

on 15.11.2007 16:47:32 by Andre-John Mas

Thanks for the help :)
