Help needed for a bit of PERL and REGEX

on 22.07.2006 04:31:18 by Chris Newman

I am working on a script to process a large number of old electoral records.
There are about 100,000 records in all, but here is a representative sample.

BTW, hd = household duties.

ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver

Note that the first names are in the same sequence as the occupations. An
occupation may consist of one or two words, e.g. 'hd' or 'tractor driver'. The
last of these sample records has three Marshalls (Margaret, Charles and Herbert),
though other records include up to six family members. In all cases there is
a pattern:

1 person . . . the occupation is immediately followed by a line return
(naturally)
2 people . . . the first occupation is followed by an '&', the last occupation
by a line return
3 or more people . . . all but the last two occupations are followed by
commas, and the last two follow the two-person pattern above

My initial thoughts:
Use a global regex that would step through the line and match the next
occupation each time, but it has not proved that easy. I need a way to move
the matching point forward to an ampersand, comma or line return, depending on
context. I would appreciate any insight into whether regular expressions can
provide this level of control, or a pointer to a more appropriate solution.
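
On moving the matching point forward: Perl keeps a match position (pos) per string when //g is used in scalar context, and \G anchors the next attempt at that position. The following is only a minimal sketch. It assumes the occupation part has already been separated from the names and that occupations are made of lowercase words; step_occupations is a hypothetical helper name, not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: walk an occupation string from left to right,
# taking one occupation per step.
sub step_occupations {
    my ($text) = @_;
    my @found;

    # Scalar-context //g resumes where the previous match ended, and \G
    # pins each attempt to that position. Each step captures one or more
    # lowercase words, then consumes a ',', an '&' or the end of the string.
    while ($text =~ /\G\s*([a-z]+(?:\s+[a-z]+)*)\s*(?:,|&|\z)/gc) {
        push @found, $1;
    }
    return @found;
}

print join('|', step_occupations('hd, ganger & tractor driver')), "\n";
# prints: hd|ganger|tractor driver
```

The separator is consumed as part of each match, so the next attempt starts right after it and no context-dependent logic is needed.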

Here is the relevant code snippet:

# preceding code deals with last names, addresses etc.; that part works well

@matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record

foreach $FirstName (@matches) {

(m/(\s[a-z]+(.*?)(&|,|$))/g); # fails to match any occupation but the first

$Occupation = $1; # should store the next matching occupation on each successive loop

print("\"$FirstName\",\"$Occupation\"");

}

Re: Help needed for a bit of PERL and REGEX

on 23.07.2006 03:16:13 by Petroleum

Chris Newman wrote:
> @matches = (m/\s([A-Z][a-z]+\s)/g); # holds all first names for a record

I'm very sorry not to be of help, but could I just ask something
simple, as I'm very new to Perl?

About the @matches array:
would it hold the entire line for each match,

or would it just hold the words/strings matched?

Thanks

:)
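
For what it's worth, a match with the /g flag in list context returns the text captured by the parentheses for every match, not the whole line. A small sketch to illustrate; first_names is a hypothetical helper, and the pattern is a slightly simplified variant of the one in the original post (the trailing \s is replaced by \b so that a name followed by a comma is also caught):

```perl
use strict;
use warnings;

# Hypothetical helper: collect every capitalised word that follows whitespace.
sub first_names {
    my ($record) = @_;

    # List context + /g: one list element per match, holding only what the
    # capture group matched.
    my @matches = $record =~ /\s([A-Z][a-z]+)\b/g;
    return @matches;
}

my @names = first_names(
    'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver');
print scalar(@names), " matches: @names\n";
# prints: 3 matches: Margaret Charles Herbert
```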

Re: Help needed for a bit of PERL and REGEX

on 24.07.2006 03:15:48 by mgarrish

Re: Help needed for a bit of PERL and REGEX

on 24.07.2006 03:21:01 by mgarrish

mgarr...@gmail.com wrote:

> Chris Newman wrote:
>
> > I am working on a script to process a large number of old electoral records.
> > There are about 100,000 records in all but here is a representative sample
> >
> > BTW hd =household duties
> >
> >
> > ALLISON, Winifred hd
> > BRACKENREG, Helen & James hd & lands officer
> > MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
> >
>
> The problem I gather you're running into is how to separate the names
> from the occupations, and without an explicit delimiter your results
> aren't going to be terribly reliable.
>
> The following might give you some ideas:
>
> use strict;
> use warnings;
>
> while (my $line = <DATA>) {
>
> my $info;
>
> if ($line =~ /^([^,]+), (.*)/) {
> print "last name: $1\n";
> $info = $2;
> }
>
> else {
> warn "Invalid line: $line";
> next;
> }
>
> # here's where the guesswork begins...
>
> my @parts = split(/ /, $info);
>

Sorry, that array was unnecessary. The code from the comment down can
be compacted to the following:

# here's where the guesswork begins...

if ($info =~ /^(.*?\w) (\w.*)/) {

print "first name(s): $1\n";
print "occupation(s): $2\n";

}

else {
print "couldn't match resident info from: $info\n";
}

}

__DATA__
ALLISON, Winifred hd
BRACKENREG, Helen & James hd & lands officer
MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver
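
Building on the split between first names and occupations above, each name can then be paired with its occupation by splitting both parts on ',' and '&'. This is only a sketch under stated assumptions: first names are single capitalised words, occupations start with a lowercase letter, and pair_residents is a hypothetical helper name. Lines where the two counts differ are rejected rather than guessed at.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [first name, occupation] pairs,
# or an empty list when the line does not fit the assumed layout.
sub pair_residents {
    my ($line) = @_;

    # Assumption: surname, comma, capitalised first names joined by ',' or
    # '&', then occupations that start with a lowercase letter.
    my ($names, $jobs) =
        $line =~ /^[^,]+,\s*([A-Z][a-z]+(?:\s*[,&]\s*[A-Z][a-z]+)*)\s+([a-z].*)$/
        or return;

    my @names = split /\s*[,&]\s*/, $names;
    my @jobs  = split /\s*[,&]\s*/, $jobs;

    # A count mismatch means the line needs a manual look.
    return unless @names == @jobs;

    return map { [ $names[$_], $jobs[$_] ] } 0 .. $#names;
}

my @sample = (
    'ALLISON, Winifred hd',
    'BRACKENREG, Helen & James hd & lands officer',
    'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver',
);

for my $line (@sample) {
    my @pairs = pair_residents($line);
    warn "could not pair: $line\n" unless @pairs;
    print qq{"$_->[0]","$_->[1]"\n} for @pairs;
}
```

With 100,000 records it is probably worth logging the rejected lines to a separate file and checking them by hand, since names or occupations that break the capitalisation assumption will end up there.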

Re: Help needed for a bit of PERL and REGEX

on 28.07.2006 03:15:04 by Greg Jetter

Petroleum wrote:
> I'm very sorry to not be of help, but if i could just ask something
> simple as i'm very new to perl.
>
> the array of @matches.
> would that hold the entire line of matches?
>
> or would it just hold the words/strings matched?
>
> thanks
>
> :)

Why don't you simplify the problem by extracting each kind of record into
a separate file? That is, one file for single-person lines, one for
those that have two people and one for those with three or more. Then you can
write a regexp to process each one, and each would only have to do one pass.
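
A rough sketch of how the lines could be sorted into those three groups, assuming the sample layout holds throughout: after the surname, the names and the occupations each contribute (people - 1) separators, so counting ',' and '&' gives the number of people. bucket_for and the bucket names are hypothetical; the sketch only counts records per bucket, and writing each bucket to its own file would be the obvious next step.

```perl
use strict;
use warnings;

# Hypothetical helper: name the bucket (and so the output file) for a record.
sub bucket_for {
    my ($line) = @_;
    my ($rest) = $line =~ /^[^,]+,(.*)/ or return 'invalid';

    # Count every ',' and '&' after the surname.
    my $separators = () = $rest =~ /[,&]/g;

    # Names and occupations each contribute (people - 1) separators,
    # so an odd total means the line does not fit the pattern.
    return 'invalid' if $separators % 2;

    my $people = $separators / 2 + 1;
    return $people == 1 ? 'one' : $people == 2 ? 'two' : 'three_or_more';
}

my @sample = (
    'ALLISON, Winifred hd',
    'BRACKENREG, Helen & James hd & lands officer',
    'MARSHALL, Margaret, Charles & Herbert hd, ganger & tractor driver',
);

my %lines_in;
push @{ $lines_in{ bucket_for($_) } }, $_ for @sample;

for my $bucket (sort keys %lines_in) {
    print "$bucket: ", scalar @{ $lines_in{$bucket} }, " record(s)\n";
}
```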
--
"You are what you is" - Frank Zappa