Parsing table in rtf file

on 30.12.2007 05:17:49 by Peter Jamieson

I am trying to extract data from the table in a large number of rtf files.
I tried RTF::Tokenizer and RTF::Parser but could not make progress
so have decided to try regular expressions.

My project is to get the tabular data into a db for further analysis.
My problem is that I cannot see how to parse the data rows so
that they match the correct field headings.

Any advice or suggestions appreciated!


###########################################
# Perl code to parse table in rtf files #
###########################################

#!/usr/bin/perl -w
use strict;
use warnings;

use Time::Local;
use Win32::ODBC;
# use RTF::Tokenizer; # unsuccessful
# use RTF::Parser; # unsuccessful
use DBI;
use Getopt::Long;

my $ett = localtime();
print "\n Time : $ett \n";

my $file_ = 'BURN_RDX_01.rtf';
my @lines;

open(my $info, '<', $file_) || die("Unable to open $file_: $!");
@lines = <$info>;
close($info);


# get the useful line data
my $line;
my $useful_data = '';

foreach $line (@lines) {
    if ($line =~ /\\pard\\intbl/) {
        # append with .= (a "." inside double quotes is just a literal dot)
        $useful_data .= $line;
    }
}
print "useful_data are: $useful_data \n";


Inspection of the table headings reveals that they may vary (sometimes
there is no telemetry data for a particular range, or the table has
different ranges), but typical headings look like this:

\pard\intbl {\b\f1\fs24\qc Propellant Burn Times \cell }\pard\intbl
{\f1\fs20\qc 22000m\par 20000m\cell
20000m\par 18000m\cell 18000m\par 16000m\cell 16000m\par 14000m\cell
14000m\par 12000m\cell 12000m\par
10000m\cell 10000m\par 8000m\cell 8000m\par 6000m\cell 6000m\par 4000m\cell
4000m\par 2000m\cell
2000m\par BURN CUT OFF\cell }\pard\intbl {\b\f1\qc 17812\cell }\pard\intbl
{\row }

There may be 6 to 30 data rows in the table; a typical row looks like this:

\pard\intbl {\b\f1\fs20\qc 1\cell 40\cell Composition (RDX1)\cell \b0\fs16
\cell \b \cell \cell
1319\cell [90]\cell 1293\cell [90]\cell 1321\cell [90]\cell 1273\cell
[90]\cell 1245\cell [90]\cell
1173\cell [90]\cell 1117\cell [100]\cell 1102\cell [70]\cell 1119\cell
[10]\cell 1218\cell [10]\cell
17817 \cell }\pard\intbl {\row }
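
For the regular-expression route, a minimal sketch of splitting one row into its cells might look like the following. It assumes that every cell ends with \cell, that the row ends with \row, and that the cells contain no escaped characters such as \'e9. The helper name rtf_row_cells is made up for illustration and does not come from any RTF module, and this is not a full RTF parser.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any RTF module): return the cell texts of one
# table row. Assumes the row contains no escaped characters such as \'e9.
sub rtf_row_cells {
    my ($row) = @_;
    $row =~ s/\\row\b.*//s;                  # discard the \row terminator and what follows
    my @cells = split /\\cell\b/, $row, -1;  # every \cell closes one cell
    pop @cells;                              # whatever trails the last \cell is not a cell
    for my $cell (@cells) {
        $cell =~ s/\\par\b/ /g;              # a \par inside a cell becomes a space
        $cell =~ s/\\[a-z]+-?\d* ?//g;       # remaining control words and their delimiter space
        $cell =~ tr/{}//d;                   # group braces
        $cell =~ s/\s+/ /g;                  # collapse whitespace, including wrapped lines
        $cell =~ s/^ | $//g;                 # trim
    }
    return @cells;
}

# A shortened version of the data row shown above.
my $row = '\pard\intbl {\b\f1\fs20\qc 1\cell 40\cell Composition (RDX1)\cell '
        . '\b0\fs16 \cell 1319\cell [90]\cell 17817 \cell }\pard\intbl {\row }';
print join('|', rtf_row_cells($row)), "\n";
```

For the shortened row this prints 1|40|Composition (RDX1)||1319|[90]|17817. Empty cells are kept as empty strings so that positions stay stable, which matters when the cells are later matched against the headings.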

Re: Parsing table in rtf file

on 31.12.2007 00:07:48 by skye.shaw

Peter Jamieson wrote:
> I am trying to extract data from the table in a large number of rtf files.
> I tried RTF::Tokenizer and RTF::Parser but could not make progress
> so have decided to try regular expressions.

What problem(s) were you having with the RTF modules?

I know looking at RTF can be fun and all, but why hammer out some
regexes to parse RTF when a module already exists for this?

> My project is to get the tabular data into a db for further analysis.
> My problem is that I cannot see how to parse the data rows so
> that they match the correct field headings.
>
> Any advice or suggestions appreciated!

I'm not familiar with the format's tokens, but from a quick look it
appears that the type of token is given after the text portion, so you
can try something like:

#your sub class of RTF::Parser
#not tested

my $tables = [];
my $cells = [];
my $rows = [];

my $token;

#define tokens...


sub text {
$token = $_[1];
}


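my %do_on_control = (

    '__DEFAULT__' => sub {

        # the control word name (e.g. 'cell') arrives in $type;
        # $arg is its numeric parameter, if it has one
        my ( $self, $type, $arg ) = @_;

        if ( $type eq $CELL_END ) {
            push @$cells, $token;
            $token = '';    # so an empty cell does not repeat the previous text
        }
        elsif ( $type eq $ROW_END ) {
            push @$rows, $cells;
            $cells = [];
        }
        elsif ( $type eq $TABLE_END ) {
            push @$tables, $rows;
            $rows = [];
        }
    },
);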

sub parse
{
    my ($self, $file) = @_;
    $self->control_definition( \%do_on_control );
    open(my $IN, '<', $file) || die "Cannot open $file: $!";
    $self->parse_stream($IN);
    close($IN);

    return $tables;
}
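
The accumulate-on-control-word idea above can also be tried without RTF::Parser installed. The following is only a sketch in core Perl: collect_rows is a hypothetical stand-in for the parser callbacks, not part of RTF::Parser, and its small regex tokenizer simply skips control symbols such as \'e9 instead of decoding them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parser callbacks: walk a string of RTF,
# remember the text seen so far, and file it when \cell or \row arrives.
sub collect_rows {
    my ($rtf) = @_;
    my ( @rows, @cells );
    my $text = '';

    # Token types: control word (+ optional parameter and delimiter space),
    # control symbol, group brace, or a run of plain text.
    while ( $rtf =~ /\G(?:\\([a-z]+)(-?\d+)? ?|\\(.)|[{}]|([^\\{}]+))/gcs ) {
        my ( $word, $param, $symbol, $chars ) = ( $1, $2, $3, $4 );
        if ( defined $chars ) {
            $chars =~ tr/\r\n//d;          # raw line breaks carry no meaning in RTF
            $text .= $chars;
        }
        elsif ( defined $word and $word eq 'par' ) {
            $text .= ' ';                  # paragraph break inside a cell
        }
        elsif ( defined $word and $word eq 'cell' ) {
            $text =~ s/^\s+|\s+$//g;
            push @cells, $text;            # the text seen so far belongs to this cell
            $text = '';
        }
        elsif ( defined $word and $word eq 'row' ) {
            push @rows, [@cells] if @cells;
            @cells = ();
            $text  = '';
        }
        # all other control words, symbols and braces are ignored
    }
    return \@rows;
}

my $rows = collect_rows('\pard\intbl {\b 1\cell 40\cell \cell }\pard\intbl {\row }');
print scalar(@$rows), " row(s), first row has ", scalar( @{ $rows->[0] } ), " cells\n";
```

Resetting the text buffer after each \cell is what keeps an empty cell from inheriting the text of the cell before it.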

Re: Parsing table in rtf file

on 31.12.2007 04:04:54 by Peter Jamieson

Thanks for the input Skye!
I read all I could find on the RTF parsing and tokenizing modules
and came to the conclusion that they were good for text data but
not well suited to tabular data. However, I would be more than happy
to be proven wrong! I can get the header and footer info from the
RTF files into a db OK, but could not make progress with the tabular
data. The sticking point was getting the data rows to line up with the
field headings. I had previously used VBA code in MS Excel and MS Word
for this project, but file bloat and unreliability have me searching for a
Perl solution.
I will have a close look at your suggestions asap.
Thanks for your help...very much appreciated!...all the best for 2008!
....cheers, Peter
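
Once a row has been split into a list of cells, the line-up step described above can be sketched with a hash slice. Everything named here is an assumption for illustration: row_to_record is a hypothetical helper, and the column names are placeholders, since the thread does not say what the leading columns mean or how the bracketed values map to the range headings.

```perl
use strict;
use warnings;

# Hypothetical helper: pair a list of column names with one row of cells.
# Dies when the counts differ, so a misaligned row is noticed straight away
# instead of being loaded into the db under the wrong headings.
sub row_to_record {
    my ( $names, $cells ) = @_;
    die sprintf( "expected %d cells, got %d\n", scalar @$names, scalar @$cells )
        unless @$names == @$cells;
    my %record;
    @record{@$names} = @$cells;    # hash slice: names become keys, cells become values
    return \%record;
}

# Placeholder column names (assumed, not taken from the real files).
my @names = ( 'col_1', 'col_2', 'name', '22000m-20000m', '22000m-20000m_bracket' );
my $rec   = row_to_record( \@names, [ '1', '40', 'Composition (RDX1)', '1319', '[90]' ] );
print "$_ = $rec->{$_}\n" for @names;
```

Because the headings vary between files, the name list would have to be built per file from the heading row, and the count check is what flags a file whose layout differs from what was expected.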