regular expression to extract text
on 25.11.2007 22:48:22 by suzanne.boyle
Hi
I have an html file with headings followed by one or more paragraphs
like this:
<h2>blah blah 1</h2>
<p>more blah blah blah</p>
<h2>blah blah 2</h2>
<p>more blah blah blah</p>
<p>even more blah blah blah</p>
I'd like to extract the text of the headings and the related
paragraphs and insert them into a database. So far I've managed to
get the heading text but can't figure out how to get the associated
paragraphs. I've been using regular expressions; here is the
expression I have so far: (.+?) (.+?). This gets the text
of the headings but not the paragraphs, and now I'm basically stumped.
Any help would be appreciated.
Re: regular expression to extract text
on 25.11.2007 23:00:17 by shimmyshack
On Nov 25, 9:48 pm, suzanne.bo...@gmail.com wrote:
> Hi
>
> I have an html file with headings followed by one or more paragraphs
> like this
>
> <h2>blah blah 1</h2>
> <p>more blah blah blah</p>
>
> <h2>blah blah 2</h2>
> <p>more blah blah blah</p>
> <p>even more blah blah blah</p>
>
> I'd like to extract the text of the headings and the related
> paragraphs and insert them into a database. So far I've managed to
> get the heading text but cant figure out how to get the associated
> paragraphs. I've been using regular expressions, here is the
> expression I have so far (.+?) (.+?). This gets the text
> of the headings but not the paragraphs and now I'm basically stumped.
>
> Any help would be appreciated.
You could do this another way, although regular expressions are a fine
tool for the job. Have you thought about using XML to do this?
Since you are obviously starting with something that is basically
XML, why not just load the string as XML (topping and tailing it if
needed) and then extract what you need using XPath?
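A minimal sketch of that idea, assuming the headings are <h2> elements, the paragraphs are <p> elements, and the fragment is well-formed once wrapped in a root element. The sample string and the <root> wrapper name are made up for illustration:

```php
<?php
// Hypothetical sample input standing in for the real file.
$html = "<h2>blah blah 1</h2>\n<p>more blah blah blah</p>\n"
      . "<h2>blah blah 2</h2>\n<p>more blah blah blah</p>\n"
      . "<p>even more blah blah blah</p>";

// "Top and tail" the fragment so it has exactly one root element.
$doc = new DOMDocument();
$doc->loadXML('<root>' . $html . '</root>');
$xpath = new DOMXPath($doc);

$sections = [];
$n = 0;
foreach ($xpath->query('/root/h2') as $h2) {
    $n++;
    $paragraphs = [];
    // A paragraph belongs to heading number $n when exactly $n
    // headings come before it among its siblings.
    $query = "/root/p[count(preceding-sibling::h2) = $n]";
    foreach ($xpath->query($query) as $p) {
        $paragraphs[] = trim($p->textContent);
    }
    $sections[trim($h2->textContent)] = $paragraphs;
}

print_r($sections);
```

This only works if loadXML() accepts the input, so markup that is not well-formed would need cleaning up first.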
Re: regular expression to extract text
on 26.11.2007 04:07:11 by Kailash Nadh
Slightly unorthodox, but this works.
<?php
preg_match_all("/((<h2>(.+?)<\/h2>(.+?)<p>(.+?)<\/p>))/is", $html, $matches);
print_r($matches);
// $matches[3] holds the headings and $matches[5] holds the related
// paragraph text
?>
Re: regular expression to extract text
on 26.11.2007 17:19:13 by suzanne.boyle
The problem with using xml is that the html is coming from Word so it
contains a lot of unnecessary crap and isn't valid xml. And since I
don't have much experience parsing xml in php I thought it would be
easier to use regular expressions to extract the sections I want.
I'm almost there now: the expression Kailash wrote almost works,
but it only captures the first paragraph after each heading. I just need
to work out how to extract the rest of the paragraphs.
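One possible way to pick up the remaining paragraphs is to match in two passes: first grab each heading together with everything up to the next heading, then collect every paragraph inside that chunk. This is only a sketch, assuming <h2> headings and <p> paragraphs; the sample string and variable names are made up for illustration:

```php
<?php
// Hypothetical sample input standing in for the Word export.
$html = "<h2>blah blah 1</h2>\n<p>more blah blah blah</p>\n"
      . "<h2>blah blah 2</h2>\n<p>more blah blah blah</p>\n"
      . "<p>even more blah blah blah</p>";

$sections = [];

// Pass 1: a heading, then (lazily) everything up to the next <h2>
// or the end of the input. The lookahead does not consume the next
// heading, so it is still available for the following match.
$blockPattern = '/<h2[^>]*>(.+?)<\/h2>(.*?)(?=<h2[^>]*>|\z)/is';
preg_match_all($blockPattern, $html, $blocks, PREG_SET_ORDER);

foreach ($blocks as $block) {
    // Pass 2: every paragraph inside this heading's chunk.
    preg_match_all('/<p[^>]*>(.+?)<\/p>/is', $block[2], $paras);

    // strip_tags() drops any inline markup left inside the text.
    $heading = trim(strip_tags($block[1]));
    $sections[$heading] = array_map(function ($p) {
        return trim(strip_tags($p));
    }, $paras[1]);
}

print_r($sections);
```

The [^>]* parts are there so that attributes on the tags do not stop the match; the sample above does not exercise them.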
Re: regular expression to extract text
on 26.11.2007 20:48:56 by Toby A Inkster
suzanne.boyle wrote:
> The problem with using xml is that the html is coming from Word so it
> contains a lot of unnecessary crap and isn't valid xml. And since I
> don't have much experience parsing xml in php I thought it would be
> easier to use regular expressions to extract the sections I want.
You could do worse than trying XML_HTMLSax3. I've previously posted an
example of using it to parse HTML:
http://tobyinkster.co.uk/blog/2007/07/20/html-table-parsing/
Note that it does not require documents to be well-formed XML.
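XML_HTMLSax3 is a PEAR package and is not shown here. As a different technique with the same tolerance for markup that is not well-formed, the sketch below uses PHP's bundled DOM extension via DOMDocument::loadHTML(). The Word-style sample fragment is made up for illustration, and <h2>/<p> are assumed to be the tags of interest:

```php
<?php
// Hypothetical Word-style fragment: unquoted attributes, unclosed <p> tags.
$html = '<h2>blah blah 1</h2><p class=MsoNormal>more blah blah blah'
      . '<h2>blah blah 2</h2><p class=MsoNormal>more blah blah blah'
      . '<p class=MsoNormal>even more blah blah blah';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$sections = [];
foreach ($xpath->query('//h2') as $h2) {
    $paragraphs = [];
    // Walk the siblings after this heading until the next heading.
    for ($node = $h2->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'h2') {
            break;
        }
        if ($node->nodeName === 'p') {
            $paragraphs[] = trim($node->textContent);
        }
    }
    $sections[trim($h2->textContent)] = $paragraphs;
}

print_r($sections);
```

The sibling walk assumes each heading and its paragraphs share the same parent element, which may not hold for every Word export.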
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 2 days, 2:18.]
It'll be in the Last Place You Look
http://tobyinkster.co.uk/blog/2007/11/21/no2id/