Help with Regular Expression
on 16.05.2009 00:18:35 by Barry Brevik
I am running Active Perl 5.8.8.
I am converting a large enterprise database into a new system and have
run across a free-form text field in which users have entered all manner
of garbage.
One scenario is where two sentences have been run together with no
ending '.' or space. Here are some examples:
madeStyle
facilitatedOne
Anti-magneticQuality
As you can see, the new sentence begins with an upper-case letter, so if
I can just break apart the construct like this I'll be OK: "madeStyle"
should become "made. Style".
Difficulty: the fields contain hundreds of words both preceding and
following the "bad" words, so I have to be able to pick out the
lower-case words that contain one embedded upper-case character.
Any ideas?
Barry Brevik
_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
Re: Help with Regular Expression
on 16.05.2009 01:20:22 by Ari Constancio
On Fri, May 15, 2009 at 11:18 PM, Barry Brevik wrote:
> I am running Active Perl 5.8.8.
>
> I am converting a large enterprise database into a new system and have
> run across a free-form text field in which users have entered all manner
> of garbage.
>
> One scenario is where two sentences have been run together with no
> ending '.' or space. Here are some examples:
>
>    madeStyle
>    facilitatedOne
>    Anti-magneticQuality
>
> As you can see, the new sentence begins with an upper-case letter, so if
> I can just break apart the construct like this I'll be OK: "madeStyle"
> should become "made. Style".
>
> Difficulty: the fields contain hundreds of words both preceding and
> following the "bad" words, so I have to be able to pick out the
> lower-case words that contain one embedded upper-case character.
>
> Any ideas?
>
> Barry Brevik
Hi Barry,
Maybe something like this would help:
$ cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
$ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
made. Style
facilitated. One
Anti-magnetic. Quality
Regards,
Ari Constancio
Re: Help with Regular Expression
on 16.05.2009 03:55:01 by Williamawalters
hi ari and barry --
In a message dated 5/15/2009 6:20:40 PM Eastern Standard Time,
ari.constancio@gmail.com writes:
> On Fri, May 15, 2009 at 11:18 PM, Barry Brevik wrote:
>
> > I am running Active Perl 5.8.8.
> > ...
> > Difficulty: the fields contain hundreds of words both preceding and
> > following the "bad" words, so I have to be able to pick out the
> > lower-case words that contain one embedded upper-case character.
> > ...
> > Barry Brevik
>
> Hi Barry,
>
> Maybe something like this would help:
>
> $ cat test.txt
> madeStyle
> facilitatedOne
> Anti-magneticQuality
>
> $ cat test.txt |perl -pe 's/(\w+)([A-Z])/\1\. \2/g'
> made. Style
> facilitated. One
> Anti-magnetic. Quality
>
> Regards, Ari Constancio
the replacement string in a s/// should use capture variables ($1, $2)
rather than backreferences (\1, \2); perl warns about this if warnings
are on (always a good idea). a '.' (period) character in a replacement
string is not a metacharacter and needs no escape.
also, the regex used, /(\w+)([A-Z])/, allows one or more upper case
letters, digits or underscores (anything \w matches) to precede the uc
letter that is supposed to be the initial letter of a new sentence:
probably not what is intended.
>cat test.txt
madeStyle
facilitatedOne
Anti-magneticQuality
123FOO
>cat test.txt | perl -wMstrict -pe "s/(\w+)([A-Z])/\1\. \2/g"
\1 better written as $1 at -e line 1.
\2 better written as $2 at -e line 1.
made. Style
facilitated. One
Anti-magnetic. Quality
123FO. O
a better approach might be something like:
>cat test.txt | perl -wMstrict -pe "s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg"
made. Style
facilitated. One
Anti-magnetic. Quality
123FOO
hth -- bill walters
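
For the case Barry describes, where the run-together words sit among hundreds of ordinary words in one field, the same substitution can be wrapped in a small script. This is only a sketch: the sub name split_run_together and the sample field text are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the thread):
# insert ". " wherever a lower-case letter is directly followed by an
# upper-case letter that itself starts a lower-case run.
sub split_run_together {
    my ($text) = @_;
    # /x lets the pattern be spaced out; /g repeats across the whole field.
    $text =~ s{ ([[:lower:]]) ([[:upper:]] [[:lower:]]) }{$1. $2}xmsg;
    return $text;
}

# Made-up sample field: ordinary words around the "bad" ones.
my $field = 'The part was madeStyle is Anti-magneticQuality checked by 123FOO';
print split_run_together($field), "\n";
```

Words that already begin with a capital ("The") and all-caps tokens ("123FOO") are left alone, because the pattern requires a lower-case letter on both sides of the capital. One caveat: legitimate mixed-case words are split as well, so "McDonald" would come out as "Mc. Donald". It may be worth reviewing the matches before updating the database.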