Regular expression example on PHP.net

on 07.09.2007 08:02:07 by Zenofobe

Howdy folks,

On this page at php.net
http://www.php.net/features.http-auth
there's a regular expression in Example 34.2. It's supposed to parse out
the different values being passed in the header. I know what it's
supposed to do, so I have a vague idea of what's being done in the RE,
but I've been having a heck of a time figuring out what each part of the
RE is actually doing. Here's what I have so far:

preg_match_all('@(\w+)=(?:([\'"])([^\2]+)\2|([^\s,]+))@', $txt, $matches,
PREG_SET_ORDER);

//'@
//(\w+) Any word character (letter/digit/_), 1 or more
//= Equal sign
//(?: This submatch will not be captured (still available for
later matching)
//([\'"]) A single or double quote
//([^\2]+) Not start of text (STX)?, 1 or more
//\2|
//([^\s,]+) Not whitespace or comma, 1 or more
//)
//@'

I'm unclear as to what the second \2 does, as well as which parts the OR
applies to. And what are the @s for?

Thanks for any help,
ZF

--
Posted via a free Usenet account from http://www.teranews.com

Re: Regular expression example on PHP.net

on 07.09.2007 09:56:02 by luiheidsgoeroe

On Fri, 07 Sep 2007 08:02:07 +0200, Zenofobe wrote:

> Howdy folks,
>
> On this page at php.net
> http://www.php.net/features.http-auth
> there's a regular expression in Example 34.2. It's supposed to parse out
> the different values being passed in the header. I know what it's
> supposed to do, so I have a vague idea of what's being done in the RE,
> but I've been having a heck of a time figuring out what each part of the
> RE is actually doing. Here's what I have so far:
>
> preg_match_all('@(\w+)=(?:([\'"])([^\2]+)\2|([^\s,]+))@', $txt, $matches,
> PREG_SET_ORDER);
>
> //'@
> //(\w+) Any word character (letter/digit/_), 1 or more
> //= Equal sign
> //(?: This submatch will not be captured (still available for
> later matching)
> //([\'"]) A single or double quote
> //([^\2]+) Not start of text (STX)?, 1 or more
> //\2|
> //([^\s,]+) Not whitespace or comma, 1 or more
> //)
> //@'

Quick tip for starting with regexes: use the x modifier, so you can
comment the regex itself for later.

preg_match_all('@ #starting delimiter
(\w+) #any word character (one or more) in match 1
= #literal =
(?: #start of non-capturing subpattern
([\'"]) #either \' or " in match 2
([^\2]+) #match one or more characters in match 3 that are NOT in match 2
\2 #match the same character as matched in 2
| #or
([^\s,]+) #character not whitespace or comma in match 4
) #end of non-capturing subpattern
@ #ending delimiter
x', $txt, $matches,PREG_SET_ORDER);

> I'm unclear as to what the second \2 does,

It's a 'reference' to the match already captured in match 2

> as well as which parts the OR
> applies to.

The pattern seems to try to capture name/value pairs, where either the
value is quoted with a ' or ", or consists of "characters not whitespace or
comma". So it will match "foo='bar'" & "foo=bar", but in "foo=bar baz"
still only 'bar' will be matched in 4, not 'bar baz'.

> And what are the @s for?

(Almost) any character can be used as 'delimiter' of the pattern, usually
/, but it's @ here. Being able to choose a delimiter for the pattern helps
you to avoid having to quote an often matched character that is used as a
delimiter. Any characters following the second delimiter (x in mine) will
be considered modifiers to the pattern.
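
A small sketch of the delimiter and modifier point. The `$path` value and the i modifier are made up for illustration and are not part of the php.net example:

```php
<?php
// Hypothetical subject string, chosen because it is full of slashes.
$path = '/usr/local/bin';

// With / as the delimiter, every literal slash has to be escaped.
preg_match('/^\/usr\/(\w+)/', $path, $a);

// With @ as the delimiter, the slashes can stay as they are.
preg_match('@^/usr/(\w+)@', $path, $b);

// Letters after the closing delimiter are modifiers:
// i = case-insensitive, x = ignore whitespace and allow # comments.
preg_match('@ ^/USR/ (\w+)   # first directory below /usr
            @ix', $path, $c);

echo $a[1], ' ', $b[1], ' ', $c[1], "\n";   // prints: local local local
```

One thing to watch in x mode: the delimiter character must not appear inside a comment, or the pattern ends there.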
--
Rik Wasmus

Re: Regular expression example on PHP.net

on 07.09.2007 11:47:58 by gosha bine

On 07.09.2007 09:56 Rik Wasmus wrote:

> ([^\2]+) #match one or more characters in match 3 that are NOT
> in match 2

[^\2] doesn't mean "negate group 2" as you and the manual people seem to
think. It means "any character except that with ascii code 2".
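
This can be checked quickly. A minimal sketch, assuming an invented sample string `$txt` with two quoted values:

```php
<?php
// Inside a character class, \2 is read as an octal escape: chr(2), i.e. STX.
$matchesQuote = preg_match('/^[^\2]$/', '"');     // 1: a quote is "not STX"
$matchesStx   = preg_match('/^[^\2]$/', "\x02");  // 0: STX itself is excluded

// Consequence for the php.net pattern: group 3 does not stop at the first
// closing quote, it runs on to the last matching quote in the string.
$txt = 'a="1", b="2"';
preg_match('@(\w+)=(?:([\'"])([^\2]+)\2|([^\s,]+))@', $txt, $m);

var_dump($matchesQuote, $matchesStx);   // int(1), int(0)
echo $m[3], "\n";                       // prints: 1", b="2
```

So the pattern only behaves as intended when a string holds a single quoted value, or when the greedy match happens to be cut short by something else.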


--
gosha bine

makrell ~ http://www.tagarga.com/blok/makrell
php done right ;) http://code.google.com/p/pihipi

Re: Regular expression example on PHP.net

on 07.09.2007 14:55:58 by luiheidsgoeroe

On Fri, 07 Sep 2007 11:47:58 +0200, gosha bine wrote:

> On 07.09.2007 09:56 Rik Wasmus wrote:
>
>> ([^\2]+) #match one or more characters in match 3 that are NOT
>> in match 2
>
> [^\2] doesn't mean "negate group 2" as you and the manual people seem to
> think. It means "any character except that with ascii code 2".

Hmmz, a quick check indicates you're right, mea culpa.

The manual is quite confusing at this point though:
"Inside a character class, or if the decimal number is greater than 9 and
there have not been that many capturing subpatterns, PCRE re-reads up to
three octal digits following the backslash, and generates a single byte
from the least significant 8 bits of the value. Any subsequent digits
stand for themselves. For example:
.....
\7
is always a back reference
\11
might be a back reference, or another way of writing a tab"

According to this, I'd expect it to be a back reference. Which brings me
to the question: what is the way to get a backreference into a negated
character class, if there is one?

--
Rik Wasmus

Re: Regular expression example on PHP.net

on 07.09.2007 15:07:52 by gosha bine

On 07.09.2007 14:55 Rik Wasmus wrote:
> On Fri, 07 Sep 2007 11:47:58 +0200, gosha bine wrote:
>
>> On 07.09.2007 09:56 Rik Wasmus wrote:
>>
>>> ([^\2]+) #match one or more characters in match 3 that are
>>> NOT in match 2
>>
>> [^\2] doesn't mean "negate group 2" as you and the manual people seem
>> to think. It means "any character except that with ascii code 2".
>
> Hmmz, a quick check indicates you're right, mea culpa.
>
> The manual is quite confusing at this point though:
> "Inside a character class, or if the decimal number is greater than 9
> and there have not been that many capturing subpatterns, PCRE re-reads
> up to three octal digits following the backslash, and generates a single
> byte from the least significant 8 bits of the value. Any subsequent
> digits stand for themselves. For example:
> ....
> \7
> is always a back reference
> \11
> might be a back reference, or another way of writing a tab"

Well, it's clear enough: "Inside a character class..."

>
> According to this, I'd expect it to be a back reference. Which brings me
> to the question: what is the way to get a backreference into a negated
> character class, if there is one?
>

Character classes are... hm... classes of _characters_, there's no way
to put references (which are _strings_) there.
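
A workaround that is not mentioned in this thread, offered only as a sketch: a negative lookahead in front of a dot, `(?:(?!\2).)+`, consumes one character at a time as long as the backreference does not match at that position. The header-like string `$txt` below is invented for illustration:

```php
<?php
// Same structure as the php.net pattern, but group 3 now stops at the
// first character that equals the opening quote captured in group 2.
$pattern = '@(\w+)=(?:([\'"])((?:(?!\2).)+)\2|([^\s,]+))@';

// Hypothetical input mixing both quote styles and a bare value.
$txt = 'username="joe", realm=\'Restricted "area"\', nc=00000001';

preg_match_all($pattern, $txt, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    // Quoted values land in group 3, bare values in group 4.
    $pairs[$m[1]] = ($m[3] ?? '') !== '' ? $m[3] : ($m[4] ?? '');
}

print_r($pairs);
```

With this input, `$pairs` comes out as username => joe, realm => Restricted "area", nc => 00000001. Backslash-escaped quotes inside a value are still not handled.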



--
gosha bine

Re: Regular expression example on PHP.net

on 07.09.2007 15:29:38 by luiheidsgoeroe

On Fri, 07 Sep 2007 15:07:52 +0200, gosha bine wrote:

> On 07.09.2007 14:55 Rik Wasmus wrote:
>> On Fri, 07 Sep 2007 11:47:58 +0200, gosha bine wrote:
>>
>>> On 07.09.2007 09:56 Rik Wasmus wrote:
>>>
>>>> ([^\2]+) #match one or more characters in match 3 that are
>>>> NOT in match 2
>>>
>>> [^\2] doesn't mean "negate group 2" as you and the manual people seem
>>> to think. It means "any character except that with ascii code 2".
>> Hmmz, a quick check indicates you're right, mea culpa.
>> The manual is quite confusing at this point though:
>> "Inside a character class, or if the decimal number is greater than 9
>> and there have not been that many capturing subpatterns, PCRE re-reads
>> up to three octal digits following the backslash, and generates a
>> single byte from the least significant 8 bits of the value. Any
>> subsequent digits stand for themselves. For example:
>> ....
>> \7
>> is always a back reference
>> \11
>> might be a back reference, or another way of writing a tab"
>
> Well, it's clear enough: "Inside a character class..."

Yes it states "inside a character class IF THE NUMBER IS GREATER THAN 9"
And continues on saying that inside a character class \7 should still be a
backreference..

>> According to this, I'd expect it to be a back reference. Which brings
>> me to the question: what is the way to get a backreference into a
>> negated character class, if there is one?
>>
>
> Character classes are... hm... classes of _characters_, there's no way
> to put references (which are _strings_) there.

Although single characters can be, and that's all we're after; we're not
matching 'a specific string', just 'not any of a collection of characters',
but I see your point. It should be done with something like
'/=(\'|").*?\1/', although escaped (by \) quoting characters require some
more care (always 'ignore' a single character after '\')
--
Rik Wasmus

Re: Regular expression example on PHP.net

on 07.09.2007 16:03:47 by gosha bine

On 07.09.2007 15:29 Rik Wasmus wrote:
> On Fri, 07 Sep 2007 15:07:52 +0200, gosha bine wrote:
>
>> On 07.09.2007 14:55 Rik Wasmus wrote:
>>> On Fri, 07 Sep 2007 11:47:58 +0200, gosha bine wrote:
>>>
>>>> On 07.09.2007 09:56 Rik Wasmus wrote:
>>>>
>>>>> ([^\2]+) #match one or more characters in match 3 that are
>>>>> NOT in match 2
>>>>
>>>> [^\2] doesn't mean "negate group 2" as you and the manual people
>>>> seem to think. It means "any character except that with ascii code 2".
>>> Hmmz, a quick check indicates you're right, mea culpa.
>>> The manual is quite confusing at this point though:
>>> "Inside a character class, or if the decimal number is greater than 9
>>> and there have not been that many capturing subpatterns, PCRE
>>> re-reads up to three octal digits following the backslash, and
>>> generates a single byte from the least significant 8 bits of the
>>> value. Any subsequent digits stand for themselves. For example:
>>> ....
>>> \7
>>> is always a back reference
>>> \11
>>> might be a back reference, or another way of writing a tab"
>>
>> Well, it's clear enough: "Inside a character class..."
>
> Yes it states "inside a character class IF THE NUMBER IS GREATER THAN 9"
> And continues on saying that inside a character class \7 should still be
> a backreference..

No, just read it again. "Inside a character class OR if the decimal
number etc". Can everybody see OR? ;)))

The wording is unambiguous, but I agree it might be confusing.

>
>>> According to this, I'd expect it to be a back reference. Which
>>> brings me to the question: what is the way to get a backreference
>>> into a negated character class, if there is one?
>>>
>>
>> Character classes are... hm... classes of _characters_, there's no way
>> to put references (which are _strings_) there.
>
> Although single characters can be, and that's all we're after; we're not
> matching 'a specific string', just 'not any of a collection of
> characters', but I see your point. It should be done with something like
> '/=(\'|").*?\1/', although escaped (by \) quoting characters require
> some more care (always 'ignore' a single character after '\')

I'd use the class for quotes: (\\w+)=([\'"])(.*?)\\2

As for escaping, it's practical not to rely on (fuzzy) php escaping
rules and to double every pcre-specific backslash.
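
A sketch of how that lazy-quantifier pattern behaves, with an invented subject string. Note that once `(\w+)` sits in front, the opening quote is captured in group 2, so that is the group the closing backreference has to point at:

```php
<?php
// In a single-quoted PHP string '\w' and '\\w' are the same two characters,
// so doubling the backslash is always safe.
$same = ('\w' === '\\w');

// Group 1 = name, group 2 = opening quote, group 3 = value (lazy match).
$pattern = '/(\\w+)=([\'"])(.*?)\\2/';

// Hypothetical input: a double-quoted value and a single-quoted value
// that contains the other kind of quote.
$subject = 'a="1", b=\'it"s\'';
preg_match_all($pattern, $subject, $sets, PREG_SET_ORDER);

// Pointing the backreference at group 1 (the name) instead finds nothing here.
$wrongGroup = preg_match('/(\\w+)=([\'"])(.*?)\\1/', $subject);

var_dump($same);                                 // bool(true)
echo $sets[0][1], ' => ', $sets[0][3], "\n";     // prints: a => 1
echo $sets[1][1], ' => ', $sets[1][3], "\n";     // prints: b => it"s
var_dump($wrongGroup);                           // int(0)
```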

--
gosha bine

Re: Regular expression example on PHP.net

on 08.09.2007 03:53:50 by Zenofobe

gosha bine wrote in news:46e11e53$0$31118$6e1ede2f@read.cnntp.org:

> On 07.09.2007 09:56 Rik Wasmus wrote:
>
>> ([^\2]+) #match one or more characters in match 3 that are NOT
>> in match 2
>
> [^\2] doesn't mean "negate group 2" as you and the manual people seem to
> think. It means "any character except that with ascii code 2".

Thanks for your help, guys.

This seems strange to me though, why bother matching anything but STX? Is
STX actually used by anything anywhere? Googling seems to indicate that
this is something that printers use, not very useful when parsing HTTP.

Re: Regular expression example on PHP.net

on 08.09.2007 03:57:40 by Zenofobe

"Rik Wasmus" wrote in news:op.tx9xnong5bnjuv@metallium.lan:
> On Fri, 07 Sep 2007 08:02:07 +0200, Zenofobe wrote:
>
>> Howdy folks,
>>
>> On this page at php.net
>> http://www.php.net/features.http-auth
>> there's a regular expression in Example 34.2. It's supposed to parse out
>> the different values being passed in the header. I know what it's
>> supposed to do, so I have a vague idea of what's being done in the RE,
>> but I've been having a heck of a time figuring out what each part of the
>> RE is actually doing. Here's what I have so far:
>>
>> preg_match_all('@(\w+)=(?:([\'"])([^\2]+)\2|([^\s,]+))@', $txt, $matches,
>> PREG_SET_ORDER);
>>
>> //'@
>> //(\w+) Any word character (letter/digit/_), 1 or more
>> //= Equal sign
>> //(?: This submatch will not be captured (still
>> available for later matching)
>> //([\'"]) A single or double quote
>> //([^\2]+) Not start of text (STX)?, 1 or more
>> //\2|
>> //([^\s,]+) Not whitespace or comma, 1 or more
>> //)
>> //@'
>
> Quick tip for starting with regexes: use the x modifier, so you can
> comment the regex itself for later.
>
> preg_match_all('@ #starting delimiter
> (\w+) #any word character (one or more) in match 1
> = #literal =
> (?: #start of non-capturing subpattern
> ([\'"]) #either \' or " in match 2
> ([^\2]+) #match one or more characters in match 3 that are NOT in match 2
> \2 #match the same character as matched in 2
> | #or
> ([^\s,]+) #character not whitespace or comma in match 4
> ) #end of non-capturing subpattern
> @ #ending delimiter
> x', $txt, $matches,PREG_SET_ORDER);
>
>> I'm unclear as to what the second \2 does,
>
> It's a 'reference' to the match already captured in match 2
>
>> as well as which parts the OR
>> applies to.
>
> The pattern seems to try to capture name/value pairs, where either the
> value is quoted with a ' or ", or consists of "characters not
> whitespace or comma". So it will match "foo='bar'" & "foo=bar", but
> in "foo=bar baz" still only 'bar' will be matched in 4, not 'bar
> baz'.

So in other words the OR selects between

([\'"])([^\2]+)\2

and

([^\s,]+).


Correct? In other words, it binds least strongly in comparison with all
the other operators.
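
That reading can be checked with a toy pattern (the strings here are invented, not taken from the php.net example). Without a group, | splits the entire pattern; the `(?: ... )` in the original is what confines the alternation so that `(\w+)=` applies to both branches:

```php
<?php
// Ungrouped: | has the lowest precedence, so this reads "^ab" OR "cd$".
$loose = preg_match('/^ab|cd$/', 'abXX');       // 1: the "^ab" branch matches

// Grouped: both anchors now apply to whichever branch is chosen.
$tight = preg_match('/^(?:ab|cd)$/', 'abXX');   // 0
$exact = preg_match('/^(?:ab|cd)$/', 'cd');     // 1

var_dump($loose, $tight, $exact);   // int(1), int(0), int(1)
```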

Re: Regular expression example on PHP.net

on 08.09.2007 05:41:37 by Jerry Stuckle

Zenofobe wrote:
> gosha bine wrote in news:46e11e53$0$31118$6e1ede2f@read.cnntp.org:
>
>> On 07.09.2007 09:56 Rik Wasmus wrote:
>>
>>> ([^\2]+) #match one or more characters in match 3 that are NOT
>>> in match 2
>> [^\2] doesn't mean "negate group 2" as you and the manual people seem to
>> think. It means "any character except that with ascii code 2".
>
> Thanks for your help, guys.
>
> This seems strange to me though, why bother matching anything but STX? Is
> STX actually used by anything anywhere? Googling seems to indicate that
> this is something that printers use, not very useful when parsing HTTP.
>

It's used in the very lowest level of many protocols, but by the time
you see it, the STX and related control characters have been stripped.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================