Word to text translation

on 11.04.2008 21:56:14 by kenoli

Does anyone know a class or other script for translating the contents
of a MSWord document into a text file with simple formatting, e.g.
paragraph breaks, not totally mangling lists, etc. so it can be stored
in a text field in a MySQL database?

The point of this is storing data from documents so that selections
can be cut and pasted into another database where it will be utilized
as text content in a database driven web site.

I realize that one way to do this is to simply link to the actual
MSWord file located in a directory. Putting it into a database field,
however, would be useful as I don't care about the formatting, aside
from keeping it readable. Having it in this form makes it possible to
easily copy and paste stuff from fields in the one database to fields
in the database driving the web site.

Thanks,

--Kenoli

Re: Word to text translation

on 12.04.2008 04:19:14 by Preventer of Work

kenoli wrote:
> Does anyone know a class or other script for translating the contents
> of a MSWord document into a text file with simple formatting, e.g.
> paragraph breaks, not totally mangling lists, etc. so it can be stored
> in a text field in a mysql database.
>
> The point of this is storing data from documents so that selections
> can be cut and pasted into another database where it will be utilized
> as text content in a database driven web site.
>
> I realize that one way to do this is to simply link to the actual
> MSWord file located in a directory. Putting it into a database field,
> however, would be useful as I don't care about the formatting, aside
> from keeping it readable. Having it in this form makes it possible to
> easily copy and paste stuff from fields in the one database to fields
> in the database driving the web site.
>
> Thanks,
>
> --Kenoli

Don't know of anything that does that directly.
You could export them from Word as html files - it is at least text, and
there are parsers for html.
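
A rough sketch of that HTML route, written in Python with only the standard library (the thread asks about PHP, so treat this as an illustration of the idea, not a drop-in script). The class name `TextExtractor`, the function `html_to_text`, and the choice of which tags count as block-level are all assumptions made for this example; real Word exports will likely need more tags handled.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, keeping paragraph breaks and list items."""

    # Assumed tag sets; extend them for the documents you actually have.
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
                  "ul", "ol", "table", "tr"}
    SKIP_TAGS = {"head", "title", "style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a tag whose text we ignore

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse runs of whitespace inside a text node.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return readable plain text for an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    # Never keep more than one blank line between blocks.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Calling `html_to_text()` on each exported file gives a string with blank lines between paragraphs and one `- ` line per list item, which is about the level of formatting the original question asks for.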

Re: Word to text translation

on 12.04.2008 06:05:41 by kenoli

Have you ever seen the gack that Word puts in its html files? They
are really xml files with all kinds of special definitions. I have
found a web site that will remove it all, one file at a time, which is
useful for cleaning up a file now and then. What I am trying to do is
find something that will let me batch upload files and let a php
script do the work. I have more material than I can handle one file
at a time.

Thanks,

--Kenoli

On Apr 11, 7:19 pm, Preventer of Work wrote:
> kenoli wrote:
> > Does anyone know a class or other script for translating the contents
> > of a MSWord document into a text file with simple formatting, e.g.
> > paragraph breaks, not totally mangling lists, etc. so it can be stored
> > in a text field in a mysql database.
>
> > The point of this is storing data from documents so that selections
> > can be cut and pasted into another database where it will be utilized
> > as text content in a database driven web site.
>
> > I realize that one way to do this is to simply link to the actual
> > MSWord file located in a directory. Putting it into a database field,
> > however, would be useful as I don't care about the formatting, aside
> > from keeping it readable. Having it in this form makes it possible to
> > easily copy and paste stuff from fields in the one database to fields
> > in the database driving the web site.
>
> > Thanks,
>
> > --Kenoli
>
> Don't know of anything that does that directly.
> You could export them from Word as html files - it is at least text, and
> there are parsers for html.

Re: Word to text translation

on 12.04.2008 06:24:30 by No_One

On 2008-04-12, kenoli wrote:
> Have you ever seen the gack that Word puts in its html files? They
> are really xml files with all kinds of special definitions. I have
> found a web site that will remove it all, one file at a time, which is
> useful for cleaning up a file now and then. What I am trying to do is
> find something that will let me batch upload files and let a php
> script do the work. I have more material than I can handle one file
> at a time.
>
> Thanks,
>
> --Kenoli

First of all, there are no reliable programs for converting MS Word to text.
The binary format is built on proprietary formatting commands that can change,
and have changed, from version to version. By and large, converters are hit
and miss.

You can search Google for converters and try them, but the results will be
less than you might like. antiword comes to mind.

However, if you have access to MS Word you can try the following:

Check the Save As dialog and see if it has a save-as-text option; it had
this option at one time. If it does, write a quick and dirty macro to
save a list of files as text files.

Check the File menu for various export options. Get it into another
format, then convert that format: maybe Word => RTF => text, or Word => strict
HTML (this option should be there) => text.

You might also try looking at hotscripts.com.

ken
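
For the batch case, a converter such as antiword can be driven from a script. The following is a minimal sketch in Python, assuming the `antiword` binary is installed and on the PATH; the function names `run_antiword` and `convert_folder`, the folder layout, and the UTF-8 decoding of the output are assumptions for illustration. As noted above, the quality of the text depends on the converter and the Word version.

```python
import subprocess
from pathlib import Path


def run_antiword(doc_path):
    """Return antiword's plain-text output for one .doc file."""
    result = subprocess.run(
        # "-w 0" asks antiword to put each paragraph on a single line.
        ["antiword", "-w", "0", str(doc_path)],
        capture_output=True,
        check=True,
    )
    # Assumption: output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def convert_folder(src_dir, dst_dir, converter=run_antiword):
    """Convert every .doc in src_dir to a .txt file in dst_dir.

    Returns a (converted, failed) pair of file-name lists.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted, failed = [], []
    for doc in sorted(src.glob("*.doc")):
        try:
            text = converter(doc)
        except (OSError, subprocess.CalledProcessError):
            # Missing binary or a file the converter rejects: note it, move on.
            failed.append(doc.name)
            continue
        (dst / (doc.stem + ".txt")).write_text(text, encoding="utf-8")
        converted.append(doc.name)
    return converted, failed
```

The `converter` parameter is there so a different tool can be swapped in without touching the loop, and the `failed` list makes it easy to see which files need to be handled by hand.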

Re: Word to text translation

on 12.04.2008 06:39:12 by Preventer of Work

kenoli wrote:
> Have you ever seen the gack that Word puts in its html files? They
> are really xml files with all kinds of special definitions. I have
> found a web site that will remove it all, one file at a time, which is
> useful for cleaning up a file now and then. What I am trying to do is
> find something that will let me batch upload files and let a php
> script do the work. I have more material than I can handle one file
> at a time.
>
> Thanks,
>
> --Kenoli
>
> On Apr 11, 7:19 pm, Preventer of Work wrote:
>> kenoli wrote:
>>> Does anyone know a class or other script for translating the contents
>>> of a MSWord document into a text file with simple formatting, e.g.
>>> paragraph breaks, not totally mangling lists, etc. so it can be stored
>>> in a text field in a mysql database.
>>> The point of this is storing data from documents so that selections
>>> can be cut and pasted into another database where it will be utilized
>>> as text content in a database driven web site.
>>> I realize that one way to do this is to simply link to the actual
>>> MSWord file located in a directory. Putting it into a database field,
>>> however, would be useful as I don't care about the formatting, aside
>>> from keeping it readable. Having it in this form makes it possible to
>>> easily copy and paste stuff from fields in the one database to fields
>>> in the database driving the web site.
>>> Thanks,
>>> --Kenoli
>> Don't know of anything that does that directly.
>> You could export them from Word as html files - it is at least text, and
>> there are parsers for html.
>

The MS Visual Studio languages come with Word APIs. You can search,
extract text, stuff like that (I've not used them, but do know they
exist). You could write a program that pulls out all the text from as
many files as you want at one time.

You can also do that with OpenOffice.org on any platform. You can have a
program tell it to open and import Word files, then pull content out -
same as VS/Word operations.
http://api.openoffice.org/

I know this isn't what you wanted, but maybe someone else will remember
seeing something based on these. Such a tool should be handy to lots of
people.

Re: Word to text translation

on 19.04.2008 22:19:14 by John Hosking

On Fri, 11 Apr 2008 12:56:14 -0700 (PDT), kenoli wrote:

> Does anyone know a class or other script for translating the contents
> of a MSWord document into a text file with simple formatting, e.g.
> paragraph breaks, not totally mangling lists, etc. so it can be stored
> in a text field in a mysql database.

I believe HTML Tidy can do good things with Word docs, although I have
never tried using it for that myself. (I have cleaned Word docs *by hand*,
however, and I can say that software that can do it is something valuable.)

Go to http://www.w3.org/People/Raggett/tidy/#word2000 and nose around a
little. See if it does (or can be made to do) what you need.

--
John
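
A hedged sketch of running Tidy over a Word HTML export from a script, in Python. It assumes a `tidy` binary on the PATH that accepts the `word-2000` and `bare` configuration options; the function names are made up for this example, and whether the cleaned output is good enough is something to check against your own files.

```python
import subprocess


def build_tidy_command(src_html, dst_html, binary="tidy"):
    """Build a tidy invocation that targets Word-generated HTML."""
    return [
        binary,
        "-q",                   # suppress non-essential output
        "--word-2000", "yes",   # strip Word 2000 specific markup
        "--bare", "yes",        # drop Microsoft-specific HTML, smart quotes
        "-o", str(dst_html),
        str(src_html),
    ]


def clean_word_html(src_html, dst_html):
    """Run tidy; return its exit code (0 clean, 1 warnings)."""
    result = subprocess.run(
        build_tidy_command(src_html, dst_html),
        capture_output=True,
        text=True,
    )
    # tidy conventionally exits with 1 for warnings and 2 for errors,
    # so only codes above 1 are treated as failure here.
    if result.returncode > 1:
        raise RuntimeError(f"tidy failed on {src_html}: {result.stderr.strip()}")
    return result.returncode
```

The cleaned HTML still has to be turned into plain text afterwards, for example with an HTML parser.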