Reading contents of an excel file from a test file

on 15.05.2007 08:49:12 by Karan Arora

Hi

I am writing a script to read various file types (doc, xls, pdf, html
etc.) and search for certain keywords. Without caring about the file
formats, I ran 'findstr' through a system call from Perl with the keyword
for each file and redirected the output to a text file. The results were
not as bad as I had expected :)
In the text file created, I have the list of all the files and the
original text from each file where the search string occurs, but
this file has some Unicode characters which prevent it from being read and
processed properly. :(
I basically get a lot of "" in the result text file, which makes Perl
act weird.

Can you please suggest a way to read a text file which has Unicode
characters?
I do NOT want to create separate parsers for the different file types
(things like ParseExcel) as that would increase the complexity and
need a lot of effort.

Cheeeeers!!

KRN!!?!

Re: Reading contents of an excel file from a test file

on 15.05.2007 10:16:34 by Karan Arora

Sorry guys, but the Unicode characters, which appear as rectangles
in my text file (they all look the same), are not printed when posting
a message on this forum!!

Re: Reading contents of an excel file from a test file

on 15.05.2007 11:36:11 by Ian Wilson

Top-posting corrected. Please don't top-post.

Mick wrote:
>
> On May 15, 11:49 am, Mick wrote:
>
>> I am writing a script to read various file types (doc, xls, pdf,
>> html etc.) and search for certain keywords. Without caring for the
>> file formats, I used 'findstr' system call from perl with the
>> keyword for each file and directed the output to a text file. The
>> results were not as bad as I had expected :) In the txt file
>> created, I have the list of all the files and the original text
>> from the file where the string to be searched occurs but this file
>> has some Unicode characters which prevent it from being read and
>> processed properly. :( I basically get a lot of " " in the result
>> text file, which makes Perl act weird.
>>
>> Can you please suggest a way to read a text file which has Unicode
>> characters?

I'm pretty sure that current versions of Perl are happy to process
Unicode. See:
    perldoc perlunicode

If I wanted to ignore characters that are outside the ASCII printable
set then I'd investigate Perl's 'tr'. `perldoc perlop` suggests:
    tr/a-zA-Z/ /cs; # change non-alphas to single space
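
A minimal sketch of how that 'tr' idea might be applied to the findstr
output, assuming the file is read as raw bytes and that everything outside
printable ASCII, tab and newline can be thrown away. The sub name
clean_line is made up for illustration, and the input files are whatever
is passed on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each run of bytes outside printable
# ASCII (space through tilde), tab and newline with a single space.
sub clean_line {
    my ($line) = @_;
    (my $clean = $line) =~ tr/\x20-\x7E\t\n/ /cs;
    return $clean;
}

# Process every file named on the command line, e.g. the text file
# that the findstr output was redirected to.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        print clean_line($line);
    }
    close $fh;
}
```

Unlike the perlop example, this keeps digits and punctuation, which is
probably closer to what a keyword search wants.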

>> I do NOT want to create separate parsers for the
>> different file types (things like ParseExcel) as it will increase
>> the complexity and will need a lot of effort.
>>

I suspect there's no guarantee that arbitrary file types will store your
keywords in a recognisable form. A file might store "KEYWORD" as
"KExxxxxYWxxxxxOxxxxRD" for example. I'd guess this is particularly
likely in PDF, especially if it is kerning text. Some might use UTF-8
encoding, others might use UTF-16 or some non-Unicode encoding. Some might
compress or encode the text so it no longer appears in ASCII.
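
To illustrate the encoding point, here is a small sketch that assumes a
file stores its text as UTF-16LE (only one of the possibilities mentioned
above). The sample string is invented; Encode is a core Perl module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $keyword = 'KEYWORD';

# The same text as it might sit inside a file that stores UTF-16LE:
# every ASCII character is followed by a zero byte.
my $stored = encode('UTF-16LE', "some $keyword here");

# A plain byte search misses it ...
my $raw_hit = index($stored, $keyword) >= 0 ? 1 : 0;

# ... but it is found once the bytes are decoded with the right encoding.
my $decoded_hit = index(decode('UTF-16LE', $stored), $keyword) >= 0 ? 1 : 0;

print "raw: $raw_hit, decoded: $decoded_hit\n";   # raw: 0, decoded: 1
```

A byte-oriented search can only find the keyword if the file happens to
store it in the same encoding as the search string.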

>
> Sorry guys, but the Unicode characters, which appear as rectangles
> in my text file (they all look the same), are not printed when
> posting a message on this forum!!
>

You are using Google Groups and it seems to think your character set is
Latin-1, not Unicode. Your posting has this header:
    Content-Type: text/plain; charset="iso-8859-1"

Possibly you are viewing your "text file" in an application that is not
Unicode-aware or is not using a font that has glyphs for the particular
Unicode characters in the file.
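
One way to find out what those rectangles really are, independent of any
viewer or font, is to dump the bytes in hex. This is only a sketch; the
sub name hex_bytes is made up, and the files to inspect are taken from the
command line.

```perl
use strict;
use warnings;

# Hypothetical helper: render a byte string as space-separated hex,
# so unprintable bytes can be identified instead of shown as rectangles.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Dump each line of every file named on the command line.
for my $file (@ARGV) {
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        print hex_bytes($line), "\n";
    }
    close $fh;
}
```

A 00 after every letter, for instance, would suggest UTF-16 text rather
than stray "Unicode characters".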