Parse transcripts on speaker's name and grab subsequent paragraphs

on 26.01.2008 23:26:26 by Perchance

Here's the sort of text I'm looking at that's driving me nuts.

####

JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.

####

I'd like to parse the transcripts into an ordered hash that would have

[speaker => name,
statement => concatenation of multiple lines of text spoken by that
person
order => For instance, Joe's first statement is 1, Jane's 2, et
cetera.
]

I've tried stepping through the text file with a foreach $line, or as
a total string, with split()'s and regexes built around /[A-Z]+:/ but
I can't get it to line up. I fear the regex is beyond me. Can anyone
help?

Thanks.

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

on 27.01.2008 01:28:31 by Tad J McClellan

perchance wrote:


> I'd like to parse the transcripts into an ordered hash that would have


There is no such thing as an "ordered hash"...


> [speaker => name,
> statement => concatenation of multiple lines of text spoken by that
> person
> order => For instance, Joe's first statement is 1, Jane's 2, et
> cetera.
> ]
>
> I've tried stepping through the text file with a foreach $line, or as
> a total string, with split()'s and regexes built around /[A-Z]+:/ but


BILLY BOB: But what about matching my name Perchance?
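
The quip can be checked directly. This is only a small sketch, and the variable names are made up for the illustration: the unanchored /[A-Z]+:/ picks up just the last word of a two-word name, while an anchored class that also allows spaces captures the whole label.

```perl
use strict;
use warnings;

my $line = "BILLY BOB: But what about matching my name Perchance?";

# Unanchored, letters only: the match starts at the last word before the colon.
my ($narrow) = $line =~ /([A-Z]+):/;

# Anchored at the start of the line, spaces allowed inside the name.
my ($wide) = $line =~ /^([A-Z ]+):/;

print "narrow: $narrow\n";    # prints "narrow: BOB"
print "wide: $wide\n";        # prints "wide: BILLY BOB"
```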


> I can't get it to line up. I fear the regex is beyond me.


The regex is of "Hello World" complexity, it must be something
else that is beyond you.

:-)


> Can anyone
> help?


You simply need a better data structure.

If you want ordering, then you want an array.

If you want to save several attributes in each array element,
then you want a hash.

If you want ordering and named attributes, you want a LoH.

(List of Hashes, really an array containing hash references.)

See:
perldoc perlreftut
etc...
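
As a rough picture of what such a structure looks like when written out by hand, here is a minimal sketch. The 'order' key is an assumption taken from the original request; the array index already carries the same information, so it is arguably redundant.

```perl
use strict;
use warnings;

# A list of hashes: the array keeps the order, each hash reference
# holds the named attributes of one statement.
my @stmts = (
    { speaker => 'JOE',  stmt => 'Hello, Jane. How are you?', order => 1 },
    { speaker => 'JANE', stmt => 'Hey, Joe.',                 order => 2 },
);

# Append another record; its order is the current length plus one.
push @stmts, { speaker => 'JOE', stmt => 'Great.', order => scalar(@stmts) + 1 };

# Walk the records in array order and dereference each hash.
for my $rec (@stmts) {
    print "$rec->{order}. $rec->{speaker}: $rec->{stmt}\n";
}
```

An individual field is reached as $stmts[1]{speaker}, which is shorthand for $stmts[1]->{speaker}.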

--------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($speaker, $stmt);
my @stmts;
while ( <DATA> ) {
    next if /^\s+$/;    # skip blank lines

    if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
        # save the previous speaker's statement before starting a new one
        push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
        $speaker = $1;
        $stmt = $2;
    }
    else { # more dialog
        chomp;
        $stmt .= " $_";
    }
}
push @stmts, { speaker => $speaker, stmt => $stmt};    # the last statement

foreach ( 0 .. $#stmts ) { # Hash Slice to get attributes out
    my($speaker, $stmt) = @{ $stmts[$_] }{ qw/ speaker stmt / };
    print "$_: $speaker\n $stmt\n\n";
}

__DATA__
JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.
--------------------------------



--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

on 27.01.2008 03:31:11 by Ilya Zakharevich

[A complimentary Cc of this posting was sent to Tad J McClellan],
who wrote:
> my($speaker, $stmt);
> my @stmts;
> while ( <DATA> ) {
> next if /^\s+$/;

I do not see a switch to paragraph mode here.

>
> if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
> push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
> $speaker = $1;
> $stmt = $2;
> }
> else { # more dialog
> chomp;
> $stmt .= " $_";

Chomp()ing looks suspicious... I would remove NL from each paragraph,
and would separate same-speaker paragraphs by a double-NL (if this is
what the OP wanted).
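
A rough sketch of that suggestion, not a drop-in replacement for the code quoted above. parse_transcript() is a name invented for the illustration, and the in-memory filehandle stands in for the real transcript file. Setting $/ to the empty string switches readline to paragraph mode, so each read returns one blank-line-delimited paragraph.

```perl
use strict;
use warnings;

# Hypothetical helper: reads a filehandle in paragraph mode and returns
# a list of { speaker, stmt } hash references in the order encountered.
sub parse_transcript {
    my ($fh) = @_;
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my @stmts;
    while ( my $para = <$fh> ) {
        $para =~ s/\s+\z//;     # drop the newlines that end the record
        $para =~ tr/\n/ /;      # remove NL inside the paragraph
        next unless length $para;
        if ( $para =~ /^([A-Z ]+):\s+(.*)/s ) {    # new speaker
            push @stmts, { speaker => $1, stmt => $2 };
        }
        elsif (@stmts) {        # same speaker, another paragraph
            $stmts[-1]{stmt} .= "\n\n$para";
        }
    }
    return @stmts;
}

my $text = "JOE: Hello, Jane.\n\nHow are you?\n\nJANE: Hey, Joe.\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @stmts = parse_transcript($fh);
print "$_->{speaker}: $_->{stmt}\n---\n" for @stmts;
```

Whether same-speaker paragraphs should be joined by a double NL or by a single space depends on what the OP wants to do with the text afterwards.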

Hope this helps,
Ilya

Re: Parse transcripts on speaker's name and grab subsequent paragraphs

on 27.01.2008 13:35:13 by rvtol+news

Tad J McClellan wrote:

> __DATA__
> JOE: Hello, Jane.
>
> How are you?
>
> Has it been a good day?
>
> JANE: Hey, Joe.
>
> It's been good for me.
>
> JOE: Great.

Yesterday I asked

BOB: How are you?

;)
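
A pattern alone cannot tell such a paragraph from a real change of speaker. One possible mitigation, sketched here under the assumption that the set of speakers is known in advance, is to accept a label only if it is on a whitelist. Both %known and speaker_of() are hypothetical names for this illustration.

```perl
use strict;
use warnings;

# Hypothetical whitelist; a real transcript would need its own list.
my %known = map { $_ => 1 } ( 'JOE', 'JANE' );

# Returns the speaker name if the line starts with a known label,
# otherwise returns nothing (undef in scalar context).
sub speaker_of {
    my ($line) = @_;
    if ( $line =~ /^([A-Z ]+):\s/ and $known{$1} ) {
        return $1;
    }
    return;
}

for my $line ( 'JOE: Yesterday I asked', 'BOB: How are you?' ) {
    my $who = speaker_of($line);
    print defined $who ? "speaker $who\n" : "continuation: $line\n";
}
```

This still fails when the quoted name is itself one of the speakers, so it narrows the problem rather than solving it.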

--
Affijn, Ruud

"Gewoon is een tijger."