preg_split problem

on 28.11.2007 10:44:14 by taps128

Hi, all. I have a problem, and I'd really appreciate it if someone helped
me solve it.

The problem is, I want to take the first two sentences of a string. To
do that I need to split the string wherever a dot occurs and then join the
first two array elements into a new string. But I have a problem because
the dot in the Croatian language is not always used as a sentence
delimiter; it is often used in conjunction with numbers and acronyms. So
I wanted to use a regular expression to split a string on every dot
occurrence, but not when the dot is preceded by a number or a 'd' or an 'o'.
This is my best shot at it:

$string='Glavna skupština Društva će se održati 27.12.2007. (četvrtak) u
11 sati u prostorijama Doma hrvatske vojske u Lori u Splitu.Atlas
turistička agencija d.d. stekla je 22. i 23. studenog 2007. godine 2800
vlastitih dionica.';
$uvod=preg_split('/((d\.o\.o\.)!|(d\.d\.)!|[0-9]!)|\./', $string);
print_r($uvod);

But it doesn't work right. If someone knows how to solve this problem,
any help will be really appreciated.

TIA

Nikola
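
A note on why the pattern above behaves like a plain split on every dot: in PCRE the `!` is a literal exclamation mark, not a "not" operator, so the first three alternatives can only match text that actually contains `!`. The sketch below uses a shortened, hypothetical sample string (not the full string from the post) to compare the original pattern with a bare `\.`.

```php
<?php
// Shortened sample, chosen for illustration only.
$sample = 'Atlas d.d. stekla je 22. i 23. studenog.';

// Pattern from the post: "!" is matched literally, so the alternatives
// "d.o.o.!", "d.d.!" and "[0-9]!" never match a string without "!".
$original = preg_split('/((d\.o\.o\.)!|(d\.d\.)!|[0-9]!)|\./', $sample);

// Plain split on every dot, for comparison.
$plainDot = preg_split('/\./', $sample);

var_dump($original === $plainDot); // bool(true)
print_r($original);
```

Negation in a regex is normally expressed with a negated character class such as `[^0-9]` or with a negative lookbehind such as `(?<!\d)`.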

Re: preg_split problem

on 28.11.2007 11:36:39 by taps128

Well I've made some progress.

$uvod=preg_split('(\D[^dDoO]\.\s)', $string );

I used this regex. It splits the string OK, but the last two characters
before the dot are gone from the split string.

From 'Lori u Splitu' the last letters 'tu' are gone.
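
The missing letters follow from how `preg_split` works: everything the pattern matches is treated as the delimiter and discarded, and here `\D[^dDoO]` matches the two characters in front of the dot. A minimal sketch of one way around this, assuming the same two conditions are kept, is to move them into a zero-width lookbehind so that only the dot and the whitespace are consumed. The sample text is hypothetical.

```php
<?php
// Hypothetical sample text, not the string from the thread.
$text = 'Bili smo u Splitu. Tvrtka d.d. raste. Kraj.';

// Pattern from the post: the two characters before the dot are part of
// the match, so preg_split() removes them together with the dot.
$consuming = preg_split('/\D[^dDoO]\.\s/', $text);

// Same conditions inside a lookbehind: the characters are checked but
// not consumed, so they stay in the resulting pieces.
$lookbehind = preg_split('/(?<=\D[^dDoO])\.\s/', $text);

print_r($consuming);  // first element: "Bili smo u Spli"
print_r($lookbehind); // first element: "Bili smo u Splitu"
```

PCRE lookbehinds must have a fixed length, which holds here because `\D[^dDoO]` always matches exactly two characters. The dot itself is still dropped in both variants.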

Re: preg_split problem

on 28.11.2007 13:33:39 by luiheidsgoeroe

Well, your main problem here is to decide WHEN a dot is ending a sentence.
Not a very simple task without lists of known acronyms. Also, a dot after
a number can end a sentence: "The coldest winter I remember was in 1985.
The temperature the day my sister was born dropped as low as -21C.". How
do you propose this is handled? Formulating the exact requirements before
writing the regex is more than half the work.

A start for you (by no means complete):
A sentence is ended by a dot:
- followed by either $ (for which case we don't need a regex, it will end
automatically there), or:
- at least one whitespace character (\s) (well, it should be, damn those
kids nowadays), followed by a capital letter (\p{Lu}, use utf-8 mode).

That would be:
preg_split('/\.\s+(?=\p{Lu})/u',$string);
...which doesn't split your string anywhere, as my rules for
'sentence-ending' seem to be inadequate for your string, or no sentence
is ended.
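
To make the behaviour of that rule concrete, here is a small sketch that runs the pattern on the English example from above (shortened) and on an ASCII-only excerpt adapted from the sample string. The `$relaxed` variant with `\s*` is an assumption added for illustration, not part of the rule as stated; it tolerates a missing space after the dot.

```php
<?php
// Rule from the post: dot, at least one whitespace, then a capital letter.
$pattern = '/\.\s+(?=\p{Lu})/u';

$english = 'The coldest winter I remember was in 1985. The temperature dropped as low as -21C.';
$parts = preg_split($pattern, $english);
print_r($parts); // two pieces, split after "1985"

// ASCII-only excerpt adapted from the thread's sample: there is no
// whitespace between "Splitu." and "Atlas", so the rule finds no split.
$tight = 'u Lori u Splitu.Atlas d.d. stekla je 22. i 23. studenog';
$unsplit = preg_split($pattern, $tight);
print_r($unsplit); // one piece

// Assumed relaxation: allow zero or more whitespace characters.
$relaxed = preg_split('/\.\s*(?=\p{Lu})/u', $tight);
print_r($relaxed); // two pieces, split between "Splitu" and "Atlas"
```

With the `/u` modifier the subject must be valid UTF-8, otherwise `preg_split` returns `false`. The relaxed variant would also split inside something like `a.B`, so it trades one kind of error for another.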

I think this will require some hefty '(not) preceded/followed by' operators
(lookaround assertions), as you'd like the matched text to be in the split.
Even then a 100% success rate will most definitely be out of the question.
--
Rik Wasmus
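
The lookaround idea from the reply above can be sketched roughly as follows. This is a heuristic under stated assumptions, not a complete solution: the helper name `first_sentences` is hypothetical, the abbreviation list contains only `d.d.` and `d.o.o.` from the original post, and a sentence boundary is assumed to be a dot followed by optional whitespace and a capital letter. The sample is an ASCII-only adaptation of the thread's string.

```php
<?php
// Hypothetical helper: returns the first $count sentences of $text.
function first_sentences(string $text, int $count = 2): string
{
    // (?<!\bd\.d\.)     not directly after the abbreviation "d.d."
    // (?<!\bd\.o\.o\.)  not directly after the abbreviation "d.o.o."
    // (?<=\.)           directly after a dot, which stays in the piece
    // \s*               optional whitespace, the only consumed text
    // (?=\p{Lu})        next character is an uppercase letter
    $pattern = '/(?<!\bd\.d\.)(?<!\bd\.o\.o\.)(?<=\.)\s*(?=\p{Lu})/u';

    // A limit of $count + 1 leaves the rest of the text in the last piece.
    $parts = preg_split($pattern, $text, $count + 1);
    if ($parts === false) {
        return $text; // e.g. invalid UTF-8 input; fall back to the input
    }
    return implode(' ', array_slice($parts, 0, $count));
}

$text = 'Skupstina ce se odrzati 27.12.2007. u 11 sati u Splitu.Atlas d.d. '
      . 'stekla je 22. i 23. studenog 2800 dionica. Treca recenica.';

echo first_sentences($text), "\n";
```

Dots after numbers are left alone here only because the next word starts with a lowercase letter. A date or ordinal followed by a capitalised word, or a sentence that really ends in `d.d.`, would still be handled wrongly, which matches the point that a 100% success rate is out of reach.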

Re: preg_split problem

on 28.11.2007 13:58:18 by taps128

Thanks, that was what I feared.