Parse transcripts on speaker's name and grab subsequent paragraphs
on 26.01.2008 23:26:26 by Perchance
Here's the sort of text I'm looking at that's driving me nuts.
####
JOE: Hello, Jane.
How are you?
Has it been a good day?
JANE: Hey, Joe.
It's been good for me.
JOE: Great.
####
I'd like to parse the transcripts into an ordered hash that would have
[speaker => name,
statement => concatenation of multiple lines of text spoken by that
person
order => For instance, Joe's first statement is 1, Jane's 2, et
cetera.
]
I've tried stepping through the text file line by line with a foreach,
and also treating it as one string, using split() and regexes built
around /[A-Z]+:/, but I can't get it to line up. I fear the regex is
beyond me. Can anyone help?
Thanks.
Re: Parse transcripts on speaker's name and grab subsequent paragraphs
on 27.01.2008 01:28:31 by Tad J McClellan
perchance wrote:
> I'd like to parse the transcripts into an ordered hash that would have
There is no such thing as an "ordered hash"...
> [speaker => name,
> statement => concatenation of multiple lines of text spoken by that
> person
> order => For instance, Joe's first statement is 1, Jane's 2, et
> cetera.
> ]
>
> I've tried stepping through the text file with a foreach $line, or as
> a total string, with split()'s and regexes built around /[A-Z]+:/ but
BILLY BOB: But what about matching my name Perchance?
> I can't get it line up. I fear the regex is beyond me.
The regex is of "Hello World" complexity, it must be something
else that is beyond you.
:-)
> Can anyone
> help?
You simply need a better data structure.
If you want ordering, then you want an array.
If you want to save several attributes in each array element,
then you want a hash.
If you want ordering and named attributes, you want a LoH.
(List of Hashes, really an array containing hash references.)
See:
perldoc perlreftut
etc...
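
A minimal sketch of the LoH idea on its own, assuming the records are written by hand rather than parsed. The variable name @stmts and the keys speaker and stmt are only illustrative choices. It shows that the "order" attribute does not have to be stored, since it falls out of the array index.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An array keeps the order; each element is a reference to a hash
# holding the named attributes of one statement.
my @stmts = (
    { speaker => 'JOE',  stmt => 'Hello, Jane.' },
    { speaker => 'JANE', stmt => 'Hey, Joe.'    },
);

# Append another record at the end of the list.
push @stmts, { speaker => 'JOE', stmt => 'Great.' };

# The order is simply the array index plus one, so it is not stored.
for my $i ( 0 .. $#stmts ) {
    my $rec = $stmts[$i];
    printf "%d: %s said: %s\n", $i + 1, $rec->{speaker}, $rec->{stmt};
}
```

If the order really must travel with each record, it could be added as a third key when the record is pushed, but the index is usually enough.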
--------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($speaker, $stmt);
my @stmts;
while ( <DATA> ) {
    next if /^\s+$/;    # skip blank lines

    if ( /^([A-Z ]+):\s+(.*)/ ) {  # new speaker
        push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
        $speaker = $1;
        $stmt    = $2;
    }
    else {                         # more dialog
        chomp;
        $stmt .= " $_";
    }
}
push @stmts, { speaker => $speaker, stmt => $stmt};  # save the last statement

foreach ( 0 .. $#stmts ) {  # Hash Slice to get attributes out
    my($speaker, $stmt) = @{ $stmts[$_] }{ qw/ speaker stmt / };
    print "$_: $speaker\n $stmt\n\n";
}
__DATA__
JOE: Hello, Jane.
How are you?
Has it been a good day?
JANE: Hey, Joe.
It's been good for me.
JOE: Great.
--------------------------------
--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"
Re: Parse transcripts on speaker's name and grab subsequent paragraphs
on 27.01.2008 03:31:11 by Ilya Zakharevich
[A complimentary Cc of this posting was sent to Tad J McClellan],
who wrote:
> my($speaker, $stmt);
> my @stmts;
> while ( <DATA> ) {
> next if /^\s+$/;
Do not see a switch to a paragraph mode.
>
> if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
> push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
> $speaker = $1;
> $stmt = $2;
> }
> else { # more dialog
> chomp;
> $stmt .= " $_";
Chomp()ing looks suspicious... I would remove NL from each paragraph,
and would separate same-speaker paragraphs by a double-NL (if this is
what the OP wanted).
Hope this helps,
Ilya
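
One way to read the suggestion above, as a sketch rather than the poster's own code: read the transcript in paragraph mode (an empty $/), flatten the newlines inside each paragraph, and join paragraphs of the same speaker with a double newline. The helper name parse_paragraphs and the in-memory sample are assumptions made for illustration, and it presumes paragraphs are separated by blank lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse a transcript whose paragraphs are
# separated by blank lines, returning a list of hash references.
sub parse_paragraphs {
    my ($text) = @_;
    my @stmts;

    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = '';    # paragraph mode: one record per blank-line-separated block

    while ( my $para = <$fh> ) {
        $para =~ s/\n+\z//;         # drop the newlines that end the record
        $para =~ s/\s*\n\s*/ /g;    # join wrapped lines inside the paragraph
        next unless length $para;

        if ( $para =~ /^([A-Z ]+):\s+(.*)/s ) {    # new speaker
            push @stmts, { speaker => $1, stmt => $2 };
        }
        elsif (@stmts) {                           # same speaker, next paragraph
            $stmts[-1]{stmt} .= "\n\n$para";
        }
    }
    close $fh;
    return @stmts;
}

my $sample = "JOE: Hello, Jane.\n\nHow are you?\n\nJANE: Hey, Joe.\n";
my @stmts  = parse_paragraphs($sample);
print "$_->{speaker}: $_->{stmt}\n---\n" for @stmts;
```

Whether same-speaker paragraphs should be joined with a space or with a double newline depends on what the OP wants to do with the text afterwards.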
Re: Parse transcripts on speaker's name and grab subsequent paragraphs
on 27.01.2008 13:35:13 by rvtol+news
Tad J McClellan wrote:
> __DATA__
> JOE: Hello, Jane.
>
> How are you?
>
> Has it been a good day?
>
> JANE: Hey, Joe.
>
> It's been good for me.
>
> JOE: Great.
Yesterday I asked
BOB: How are you?
;)
--
Affijn, Ruud
"Gewoon is een tijger."
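
The example above shows a continuation line that happens to look like a speaker label. One possible mitigation, sketched here under the assumption that the cast is known in advance, is to accept a label only when the name is in a fixed set. The %known set and the helper speaker_of are hypothetical and not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the speakers are known beforehand; this set is hypothetical.
my %known = map { $_ => 1 } qw( JOE JANE );

# Return the speaker name if the line opens a statement by a known
# speaker; return nothing (undef in scalar context) otherwise.
sub speaker_of {
    my ($line) = @_;
    return unless $line =~ /^([A-Z ]+):\s+/;
    return $1 if $known{$1};
    return;
}

for my $line ( 'JOE: Yesterday I asked', 'BOB: How are you?' ) {
    my $who = speaker_of($line);
    print defined $who ? "new speaker: $who\n" : "continuation: $line\n";
}
```

This does not help when the quoted name belongs to a real speaker in the transcript; that case cannot be decided from the line alone.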