Help with Regular Expression

on 16.05.2009 00:18:35 by Barry Brevik

I am running Active Perl 5.8.8.

I am converting a large enterprise database into a new system and have
run across a free-form text field in which users have entered all manner
of garbage.

One scenario is where two sentences have been run together with no
ending '.' or space. Here are some examples:

madeStyle
facilitatedOne
Anti-magneticQuality

As you can see, the new sentence begins with an upper-case letter, so if
I can just break apart the construct like this I'll be OK: "madeStyle"
should become "made. Style".

Difficulty: the fields contain hundreds of words both preceding and
following the "bad" words, so I have to be able to pick out the
lower-case words that contain one embedded upper-case character.

Any ideas?

Barry Brevik
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: Help with Regular Expression

on 16.05.2009 01:20:22 by Ari Constancio

On Fri, May 15, 2009 at 11:18 PM, Barry Brevik wrote:
> I am running Active Perl 5.8.8.
>
> I am converting a large enterprise database into a new system and have
> run across a free-form text field in which users have entered all manner
> of garbage.
>
> One scenario is where two sentences have been run together with no
> ending '.' or space. Here are some examples:
>
>    madeStyle
>    facilitatedOne
>    Anti-magneticQuality
>
> As you can see, the new sentence begins with an upper-case letter, so if
> I can just break apart the construct like this I'll be OK: "madeStyle"
> should become "made. Style".
>
> Difficulty: the fields contain hundreds of words both preceding and
> following the "bad" words, so I have to be able to pick out the
> lower-case words that contain one embedded upper-case character.
>
> Any ideas?
>
> Barry Brevik

Hi Barry,

Maybe something like this would help:

$ cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality

$ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
made. Style
facilitated. One
Anti-magnetic. Quality

Regards,
Ari Constancio

Re: Help with Regular Expression

on 16.05.2009 03:55:01 by Williamawalters

hi ari and barry --

In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time,
ari.constancio@gmail.com writes:

> On Fri, May 15, 2009 at 11:18 PM, Barry Brevik wrote:
>
> > I am running Active Perl 5.8.8.
> > ...
> > Difficulty: the fields contain hundreds of words both preceding and
> > following the "bad" words, so I have to be able to pick out the
> > lower-case words that contain one embedded upper-case character.
> > ...
> > Barry Brevik
>
> Hi Barry,
>
> Maybe something like this would help:
>
> $ cat test.txt
> madeStyle
> facilitatedOne
> Anti-magneticQuality
>
> $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
> made. Style
> facilitated. One
> Anti-magnetic. Quality
>
> Regards, Ari Constancio

the replacement string in a s/// should use capture variables rather
than backreferences; perl warns about this if warnings are on (always
a good idea). a '.' (period) character in a replacement string is not
a metacharacter and needs no escape.

also, the regex used, /(\w+)([A-Z])/, will allow any number greater than
zero of upper case letters, digits or underscores to precede the uc letter
that is supposed to be the initial letter of a new sentence: probably not
what is intended.

>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO

>cat test.txt | perl -wMstrict -pe
"s/(\w+)([A-Z])/\1\. \2/g"
\1 better written as $1 at -e line 1.
\2 better written as $2 at -e line 1.
made. Style
facilitated. One
Anti-magnetic. Quality
123FO. O

a better approach might be something like:

>cat test.txt | perl -wMstrict -pe
"s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
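
For anyone who would rather keep this in a script than a one-liner, the same substitution might be wrapped up roughly as follows. This is only a sketch: the sub name `split_sentences` and the hard-coded sample list are made up for illustration, and the regex is the one shown just above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split run-together sentences: a lower-case letter followed directly by an
# upper-case letter plus a lower-case letter is taken as a sentence boundary.
# (split_sentences is a hypothetical helper name, not from the thread.)
sub split_sentences {
    my ($text) = @_;
    $text =~ s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg;
    return $text;
}

# Sample data taken from the test file used in this thread.
my @samples = ( 'madeStyle', 'facilitatedOne', 'Anti-magneticQuality', '123FOO' );

print split_sentences($_), "\n" for @samples;
```

Working on a copy inside the sub leaves the caller's data untouched, which is handy when the original field value has to be logged next to the repaired one.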

hth -- bill walters



Re: Help with Regular Expression

on 16.05.2009 12:37:15 by Williamawalters

hi guys --

In a message dated 5/15/2009 8:55:30 PM Eastern Standard Time,
Williamawalters@aol.com writes:

> In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time,
> ari.constancio@gmail.com writes:
>
> > On Fri, May 15, 2009 at 11:18 PM, Barry Brevik wrote:
> >
> > > I am running Active Perl 5.8.8.
> > > ...
> > > Difficulty: the fields contain hundreds of words both preceding and
> > > following the "bad" words, so I have to be able to pick out the
> > > lower-case words that contain one embedded upper-case character.
> > > ...
> > > Barry Brevik
> >
> > Hi Barry,
> >
> > Maybe something like this would help:
> >
> > $ cat test.txt
> > madeStyle
> > facilitatedOne
> > Anti-magneticQuality
> >
> > $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
> > made. Style
> > facilitated. One
> > Anti-magnetic. Quality
> >
> > Regards, Ari Constancio
>
> ...
>
> a better approach might be something like:
>
> >cat test.txt | perl -wMstrict -pe
> "s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg"
> made. Style
> facilitated. One
> Anti-magnetic. Quality
> 123FOO
>
> hth -- bill walters

well, english is a complicated thing, as, i guess, are all natural
languages.

it occurred to me that the solution i suggested, that a new sentence begins
with a uc letter and at least one lc letter (which was how i interpreted the
original 'lower-case words that contain one embedded upper-case character'
spec), fails for a very common word: a one-letter sentence opener such as
"A". the approach below makes separate regex definitions for
end-of-sentence and beginning-of-sentence patterns; these are more easily
adapted as requirements mature.

of course, the new approach fails for BiCapitalized words. sigh.
using separate regex definitions might come into play here: one might,
for instance, define a list of bi-capitalized words that would be used with
a look-around to avoid improper substitutions.

(i cannot think of a case in which a proper sentence ends with
anything other than an lc letter before the period. if there is such,
the separate regex approach could, i think, be easily adapted to handle
it.)

>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO
the endA new
PowerPoint

>cat test.txt | perl -wMstrict -pe
"INIT {
my $sen_end = qr{ [[:lower:]] }xms;
my $new_sen = qr{ [[:upper:]] }xms;
sub S { s{ ($sen_end) ($new_sen) }{$1. $2}xmsg }
}
S;
"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
the end. A new
Power. Point
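
The bi-capitalized exception list mentioned above could be sketched along these lines. Everything here is an assumption for illustration: the list `@bicap_words`, the sub name `split_sentences`, and the choice to treat each listed word as having a single internal lower/upper boundary. For each listed word the code builds a look-behind/look-ahead pair that is true only at the boundary inside that word, and the substitution skips any boundary where one of those pairs matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical exception list; a real one would be built from the data.
my @bicap_words = qw(PowerPoint);

# For each listed word, build a look-around pair that is true exactly at the
# lower/upper boundary inside it, e.g. between "Power" and "Point".
my @inside;
for my $word (@bicap_words) {
    if ( my ( $head, $tail ) = $word =~ m{\A (.*[[:lower:]]) ([[:upper:]].*) \z}xms ) {
        push @inside, qr{(?<=\Q$head\E)(?=\Q$tail\E)};
    }
}

# With an empty list, '(?!)' never matches, so no boundary is ever skipped.
my $inside_known = @inside ? join( '|', @inside ) : '(?!)';

# A zero-width sentence boundary: lower-case before, upper-case after,
# and not sitting inside one of the listed words.
my $boundary = qr{ (?<=[[:lower:]]) (?=[[:upper:]]) (?! $inside_known ) }xms;

sub split_sentences {
    my ($text) = @_;
    $text =~ s{$boundary}{. }g;
    return $text;
}

print split_sentences($_), "\n"
    for 'madeStyle', 'the endA new', 'PowerPoint', 'usedPowerPointToday';
```

Because the match is zero-width, a listed word is still separated from whatever is glued onto either side of it; only the boundary inside the word is protected.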

again, hth -- bill walters



RE: Help with Regular Expression

on 16.05.2009 14:55:40 by Curtis Leach

Here's something a bit simpler, based on the original example Barry sent.
It looks for a single upper-case letter preceded by a single character that
is neither upper case nor whitespace. \w doesn't do that. We also don't
need the "+" quantifier, since all we care about is matching a single
char. (Better performance if not searching for a variable-length string.)

perl -we 'my $t="madeStyle\nfacilitatedOne\nAnti-magneticQuality\n123FOO BAR";
          $t=~s/([^A-Z\s])([A-Z])/$1. $2/g;
          print "----------\n$t\n";'
----------
made. Style
facilitated. One
Anti-magnetic. Quality
123. FOO BAR

Curtis



Re: Help with Regular Expression

on 16.05.2009 16:53:11 by Williamawalters

hi curtis --

In a message dated 5/16/2009 7:56:26 AM Eastern Standard Time,
cleach@harrahs.com writes:

> 123FOO BAR
> ...
> ----------
> ...
> 123. FOO BAR

but i was thinking that 123FOO was *not* something
that would need punctuation: it's probably not the end of
one sentence and the beginning of the next.

br -- bill walters



Re: Help with Regular Expression

on 18.05.2009 19:09:41 by Andy_Bach

$ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
made. Style
facilitated. One
Anti-magnetic. Quality


RE pedanticism: \1 et alia are only supposed to be used on the LHS of the
subst cmd. You'd want:
cat test.txt |perl -pe 's/(\w+)([A-Z])/$1. $2/g'

or no need for cat (ye olde pipeline debate ;-):
perl -pe 's/(\w+)([A-Z])/$1. $2/g' test.txt


You're supposed to use the \1 format to match a current match, like a
duplicated word
$ echo "her here hear hear hop hip hip ho!" | perl -pe
's/(\w+)\s+\1\s+/double "${1}s" /g;'

her here double "hears" hop double "hips" ho!

Might you need to worry about 2 capital letters?
perl -pe 's/([a-z])([A-Z])/$1. $2/g' test.txt

Non-ascii text (ranges like 'a-z' are only true ranges in ascii)? Use
POSIX class shorthand names (if your Perl is new enough):
perl -pe 's/([[:lower:]])([[:upper:]])/$1. $2/g' test.txt
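
If the data really does contain non-ASCII letters, Unicode properties are another option. The snippet below is only a sketch: the sample string is made up, and it assumes the text has already been decoded into Perl characters (here via `use utf8` on a literal). How `[[:lower:]]` treats such characters can depend on the Perl version, the locale and how the string is stored internally, so `\p{Ll}` and `\p{Lu}` are used instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';    # print characters as UTF-8 bytes

my $text = 'entréeÉté';                # made-up run-together sample

# ASCII ranges: "É" is outside A-Z, so nothing is split.
( my $ascii = $text ) =~ s/([a-z])([A-Z])/$1. $2/g;

# Unicode properties: Ll = lower-case letter, Lu = upper-case letter.
( my $unicode = $text ) =~ s/(\p{Ll})(\p{Lu})/$1. $2/g;

print "$ascii\n";
print "$unicode\n";
```

The first line printed is the input unchanged; the second has ". " inserted between "entrée" and "Été".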

a
----------------------
Andy Bach
Systems Mangler
Internet: andy_bach@wiwb.uscourts.gov
Voice: (608) 261-5738;
Cell: (608) 658-1890

Civilization advances by the number of important operations
which we can perform without thinking about them.
--Alfred North Whitehead

RE: Help with Regular Expression

on 19.05.2009 19:17:45 by Andy_Bach

>> You're supposed to use the \1 format to match a current match,
>> like a duplicated word
>> $ echo "her here hear hear hop hip hip ho!" | perl \
>>     -pe 's/(\w+)\s+\1\s+/double "${1}s" /g;'

Barry B wrote:
> I am confused about this. I thought that a back-reference looks like
> "$1", not "\1". Is there a difference?

Yeah, mostly w/ placement. Back refs on the left hand side (LHS) of the
subst:
s/(\w+)\s+\1\s+/

are backslash digit. Backreferences to the captured match on the RHS use
$1 as they do outside the subst command. This got made more concrete
somewhere in early v5, I believe. As noted, warnings will tell you:
\1 better written as $1 at -e line 1

if you had tried:
$ echo "her here hear hear hop hip hip ho!" | perl \
-w -pe 's/(\w+)\s+\1\s+/double "\1s" /g;'


though it still works. But the idea is the \1 version can be used during
the course of the matching phase, but $1 version is used during the
replacement phase. In a sense, the \1 'magic var' is supposed to be
localized to the LHS:
s/ ... /

context, while $1 et alia are actual globals so you can do:
perl -pe 'if ( s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a $1\n"; }'

and have a value outside the subst command. Trying "\1" in the warn():
warn "found a \1\n";

would get you ... well you get the "001" char ;->

$ echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a $1\n" };'
found a hear
her here double "hears" hop hip hip ho

but (note, I dropped the "/g"):
$ echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" / ) { warn "found a \1\n" };'
found a
her here double "hears" hop hip hip ho

Interesting, in a way, is how with the '/g' you get:
echo "her here hear hear hop hip hip ho" | perl -pe 'if (
s/(\w+)\s+\1\s+/double "${1}s" /g ) { warn "found a $1\n" };'
found a h
her here double "hears" hop double "hips" ho

I think what happens here is the capture parens matched the 'h' of the
final 'ho' but there's no match for the \1 part. So no subst is done.
However, $1 keeps the captured value (it did match a \w+ char). Not
exactly what I expected, to be honest - I'd've thought if the LHS RE
failed, $1 wouldn't be 'updated' but would keep the last full match (i.e.
'hip').
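
One way to sidestep the question of what $1 holds once s///g has finished is to record each capture while the substitution is running. A rough sketch, using the /e modifier so the replacement is evaluated as code (the variable names are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'her here hear hear hop hip hip ho';
my @found;

# Copy first, then substitute on the copy. With /e the replacement is code:
# it records $1 for every successful match, and its last expression
# (the qq{} string) becomes the replacement text.
( my $out = $line ) =~ s{(\w+)\s+\1\s+}{ push @found, $1; qq{double "${1}s" } }ge;

print "found a $_\n" for @found;
print "$out\n";
```

Here @found ends up holding "hear" and "hip", one entry per doubled word, without relying on the capture variables after the loop is done.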

Wrong again ...

a

RE: Help with Regular Expression

on 19.05.2009 19:49:12 by Andy_Bach

Sorry, I really don't do the concept justice. A more comprehensive answer
is: if you want help w/ REs, get "Mastering Regular Expressions"
by J. Friedl
http://oreilly.com/catalog/9780596528126/

It's a great, great book, up there w/ the Camel and "Perl Best
Practices". It covers more than just Perl REs too. Even just going to
the book store and reading the appropriate parts helps. But buy it and
read it - you'll solve lots of your RE problems.

Wait until Perl6 RE/regexs get here ;->

http://search.cpan.org/~dconway/Perl6-Rules-0.03/Rules.pm
http://www.ibm.com/developerworks/linux/library/l-cpregex.html?ca=dgr-lnxw01Perl6Gram

I know there are better links, but I don't have them at the moment.

a

Some people, when confronted with a problem, think
"I know, I'll use regular expressions."
Now they have two problems
-- Jamie Zawinski