html to ascii conversion: quick google translate from the command line

on 16.04.2008 06:29:27 by Andre Steinert

Is there a way to convert an HTML snippet "sensibly" to ASCII plain text?
I just want to display a no-frills version of this Google Translate
query quickly from the command line:

curl -s 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'

"cat" could be replaced by "dog", "beer", whatever, and lo and behold I'd
have a German translation on the command line (I wish!). As it stands, this
command throws a load of HTML at me. Is there an easy way to convert it to a
"displayable" format? Basically just column formatting, or at most
bold etc. that my xterm-color console can support. HTML has all this
information embedded in its tags, right? So it looks possible in theory; I'm
just wondering what the best tool for the job is.

I have no intention of browsing further from that page, so lynx seems like
overkill.



--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 07:03:20 by Barry Margolin

In article ,
Rahul wrote:

> Is there a way to convert a html snippet "sensibly" to ascii plain-text.
> I just want to display a no-frills version of this google translate
> query quickly from the command-line:
>
> curl -s
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> "cat" could be replaced by "dog" "beer" whatever and lo and behold I've
> a German translation on the command line (I wish!). This snippet throws
> a load of html at me. Is there a easy way to convert it to a
> "displayable" format? Basically just column-formatting or at most using
> bold etc. that my xterm-color console can support. html has all this
> info. embedded in its tags, right? So looks possible in theory; just
> wondering what's the best tool for the job.
>
> I have no intention of browsing further from that page so lynx seems an
> overkill.

How about the -dump option to lynx? It just displays the result,
without going into an interactive browser.

--
Barry Margolin, barmar@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***
*** PLEASE don't copy me on replies, I'll read them in the group ***

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 10:08:45 by Stephane CHAZELAS

2008-04-16, 04:29(+00), Rahul:
> Is there a way to convert a html snippet "sensibly" to ascii plain-text.
> I just want to display a no-frills version of this google translate
> query quickly from the command-line:
>
> curl -s
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
[...]

See elinks or w3m. In the old days you would have used lynx,
but it handles tables and frames quite badly.

Compare:

elinks -no-references -no-numbering -dump \
'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'

w3m -dump \
'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'

lynx -dump -nolist \
'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'

--
Stéphane

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 18:43:34 by Andre Steinert

Stephane CHAZELAS wrote in
news:slrng0bd0d.8cn.stephane.chazelas@spam.is.invalid:

> See elinks or w3m. In the old ages, you would have used lynx,
> but it's quite bad on tables and frames.
>
> Compare:
>
> elinks -no-references -no-numbering -dump \
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> w3m -dump \
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> lynx -dump -nolist \
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>

I like these options much better. Thanks Stephane! I only have to solve
some font issues now. They seem to be a problem with all three.

dünne Eisschicht --> dünne Eisschicht
Kätzin --> Kätzin
Hühner -->Hühner

It seems to have something to do with umlaut rendering in my font set... Any
ideas?

--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 20:12:43 by Allodoxaphobia

On Wed, 16 Apr 2008 10:08:45 +0200 (CEST), Stephane CHAZELAS wrote:
> 2008-04-16, 04:29(+00), Rahul:
>> Is there a way to convert a html snippet "sensibly" to ascii plain-text.
>> I just want to display a no-frills version of this google translate
>> query quickly from the command-line:
>>
>> curl -s
>> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
> [...]
>
> See elinks or w3m. In the old ages, you would have used lynx,
> but it's quite bad on tables and frames.
>
> Compare:
>
> lynx -dump -nolist \
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'

lynx won't work unless you spoof the useragent.

$ lynx -dump -nolist 'http://google.com/'
Google
Error
Bad Request

Your client has issued a malformed or illegal request.
Please see Google's Terms of Service posted at
http://www.google.com/terms_of_service.html

......

They have no compunction about crawling all over your web site, indexing
all your images, and enabling email and usenet spam. But, gawd forbid
that you might try to use a text-only browser to visit their website(s).

Jonesy
--
Marvin L Jones | jonz | W3DHJ | linux
38.24N 104.55W | @ config.com | Jonesy | OS/2
*** Killfiling google posts:

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 20:23:13 by Allodoxaphobia

On 16 Apr 2008 18:12:43 GMT, Allodoxaphobia wrote:
> On Wed, 16 Apr 2008 10:08:45 +0200 (CEST), Stephane CHAZELAS wrote:
>> 2008-04-16, 04:29(+00), Rahul:
>>> Is there a way to convert a html snippet "sensibly" to ascii plain-text.
>>> I just want to display a no-frills version of this google translate
>>> query quickly from the command-line:
>>>
>>> curl -s
>>> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>> [...]
>>
>> See elinks or w3m. In the old ages, you would have used lynx,
>> but it's quite bad on tables and frames.
>>
>> Compare:
>>
>> lynx -dump -nolist \
>> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> lynx won't work unless you spoof the useragent.
>
> $ lynx -dump -nolist 'http://google.com/'
> Google
> Error
> Bad Request
>
> Your client has issued a malformed or illegal request.
> Please see Google's Terms of Service posted at
> http://www.google.com/terms_of_service.html
>
> ......

mea culpa.

I now see this is b0rk3d on only one of the machines here. And,
Murphy's Law required that it be _my_ workstation. sigh...

OK, now to slink off and find out what the problem is on this box.

Jonesy
--
Marvin L Jones | jonz | W3DHJ | linux
38.24N 104.55W | @ config.com | Jonesy | OS/2
*** Killfiling google posts:

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 20:42:33 by Jan Kandziora

Rahul wrote:
>
> curl -s
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> "cat" could be replaced by "dog" "beer" whatever and lo and behold I've
> a German translation on the command line (I wish!).
>
BTW, if you are just looking for an English<->German dictionary and
phrasebook, there are far better ones than Google's. Try

http://dict.leo.org/ende?lang=en&search=cat

Kind regards

Jan

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 20:44:31 by Jan Kandziora

Rahul wrote:
> dünne Eisschicht --> dünne Eisschicht
> Kätzin --> Kätzin
> Hühner -->Hühner
>
> Seems like something to do with umlaut rendering in my font set.....Any
> ideas?
>
LANG=de_DE.utf8 lynx... or similar could help.

Kind regards

Jan

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 20:51:39 by Andre Steinert

Allodoxaphobia wrote in
news:slrng0cgcr.2kbf.bit-bucket@shell.config.com:

>
> lynx won't work unless you spoof the useragent.
> [snip]
> They have no compunction about crawling all over your web site,
> indexing all your images, and enabling email and usenet spam. But,
> gawd forbid that you might try to use a text-only browser to visit
> their website(s).

Funny! They all worked for me. Google didn't crib.



--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 21:16:47 by Dave Uhring

On Wed, 16 Apr 2008 16:43:34 +0000, Rahul wrote:

> I like these options much better. Thanks Stephane! I only have to solve
> some font issues now. Seem to be a problem with all three.
>
> dünne Eisschicht --> dÃŒnne Eisschicht
> Kätzin --> KÀtzin
> Hühner --> HÃŒhner
>
> Seems like something to do with umlaut rendering in my font set.....Any
> ideas?

View the output in xterm or similar. You should see dünne, but maybe not
in that windows POS you are using. There is no need to set any special
locale.

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 22:07:52 by ebenZEROONE

In article ,
Rahul wrote:
> Stephane CHAZELAS wrote in
> news:slrng0bd0d.8cn.stephane.chazelas@spam.is.invalid:
>
> > See elinks or w3m. In the old ages, you would have used lynx,
> > but it's quite bad on tables and frames.
> >
> > Compare:
> >
> > elinks -no-references -no-numbering -dump \
> > 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
> >
> > w3m -dump \
> > 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
> >
> > lynx -dump -nolist \
> > 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
> >
>
> I like these options much better. Thanks Stephane! I only have to solve
> some font issues now. Seem to be a problem with all three.
>
> dünne Eisschicht --> dünne Eisschicht
> Kätzin --> Kätzin
> Hühner -->Hühner
>
> Seems like something to do with umlaut rendering in my font set.....Any
> ideas?

Maybe it's assuming UTF-8 and your screen is assuming ISO 8859-1? Or
something like that. The brute-force method would be to run the page
through "sed -e 's/Ã¼/ü/g' -e 's/Ã¤/ä/g'" etc. Is the last one meant
to be "Höhner"? Are there capital versions of these letters? How about
things like "Æ"?
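
A minimal sketch of the suspected mismatch, assuming an iconv that knows the names UTF-8 and ISO-8859-1 (the variable names are made up for illustration). It takes the UTF-8 bytes of "dünne", misreads them as ISO 8859-1, and then reverses the mistake, which is what a sed table would otherwise do one character at a time:

```shell
# Encode "dünne" as UTF-8 bytes; \303\274 is "ü" in UTF-8.
utf8_word=$(printf 'd\303\274nne')

# Pretend those bytes are ISO 8859-1 and convert them to UTF-8 again.
# On a UTF-8 terminal the result displays as "dÃ¼nne".
garbled=$(printf '%s' "$utf8_word" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$garbled"

# Undo the damage by converting in the opposite direction.
repaired=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)
printf '%s\n' "$repaired"
```

If the garbling seen on the terminal matches the first output, the page is probably UTF-8 and the display side is treating it as a single-byte charset.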

--
-eben QebWenE01R@vTerYizUonI.nOetP royalty.mine.nu:81
Your pretended fear lest error might step in is like the man who
would keep all wine out of the country lest men should be drunk.
-- Oliver Cromwell

Re: html to ascii conversion: quick google translate from the command line

on 16.04.2008 23:53:44 by Andre Steinert

ebenZEROONE@verizon.net (Hactar) wrote in
news:180id5-309.ln1@royalty.mine.nu:

> Maybe it's assuming UTF-8 and your screen is assuming ISO 8859-1?

echo $LANG
en_US.UTF-8
echo $TERM
xterm-color

Is that helpful?

> Or
> something like that. The brute-force method would be to run the page
> through "sed -e 's/Ã¼/ü/g' -e 's/Ã¤/ä/g'" etc. Is the last one meant
> to be "Höhner"?

Almost... It's supposed to be "Huehner": u-umlaut.

I tried to paste the right words into my Xnews client when I posted. They
looked correct here on my screen, but in the reply-quoted snippets they seem
messed up, so you guys probably couldn't see the correct versions. Sorry.
Again, it seems to be an encoding issue! I guess talking between two languages
is harder than it seems! The exact, correct rendering is online at Google:

http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde

> Are there capital versions of these letters? How
> about things like "Æ"?
>
The letters it prints are really strange. Some are ones I've never seen
before, like a small stylized A, similar to the Angstrom symbol. Plus more.


--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 01:07:53 by unknown

Post removed (X-No-Archive: yes)

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 01:32:31 by Andre Steinert

ebenZEROONE@verizon.net (Hactar) wrote in
news:4maid5-02j.ln1@royalty.mine.nu:

>
> Try setting LANG to en.US or C.

Thanks eben! Partial success.

Tried both en.US and C: lynx and w3m still are not cured.

But elinks no longer spits out the strange characters.

It falls back to a sort of "poor man's umlaut": u-umlaut = ue,
o-umlaut = oe, etc. (a common convention).

I could live with that, unless someone has other ideas for getting my console
to print "real" umlauts! :)


--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 06:07:52 by ebenZEROONE

In article ,
Rahul wrote:
> ebenZEROONE@verizon.net (Hactar) wrote in
> news:4maid5-02j.ln1@royalty.mine.nu:
>
> >
> > Try setting LANG to en.US or C.
>
> Thanks eben! Partial success.
>
> Tried both en.US and C: lynx and w3m still are not cured.
>
> But elinks now does not spit out the strange characters anymore:
>
> It uses the "compromise" sort of a "poor-man's-umlaut": u-umlaut = ue o-
> umlaut=oe etc. (common convention).
>
> Could live with that; unless someone has any other ideas to get my console to
> print "real" umlauts! :)

LANG=de.DE?

--
-eben QebWenE01R@vTerYizUonI.nOetP royalty.mine.nu:81
Your pretended fear lest error might step in is like the man who
would keep all wine out of the country lest men should be drunk.
-- Oliver Cromwell

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 18:16:11 by Andre Steinert

ebenZEROONE@verizon.net (Hactar) wrote in
news:49sid5-kj5.ln1@royalty.mine.nu:

>
> LANG=de.DE?
>

Thanks eben! Almost... LANG=de_DE seems to be the setting that works for
me. I don't know why, but only that one works! Now all my umlauts are
perfect. Thanks for all those helpful leads, guys.

There's still a problem with apostrophes, though. It was there with EN and
persists now: all my apostrophes are rendered as '*' by elinks.

elinks -no-references -no-numbering -dump \
'http://translate.google.com/translate_dict?q=dog&hl=en&langpair=en%7Cde'

the dog doesn't bite ==> the dog doesn*t bite etc.

This is a funny one. I tried looking at the curl output for the raw HTML:

curl -s 'http://translate.google.com/translate_dict?q=dog&hl=en&langpair=en%7Cde'

the dog doesn’t bite


On my screen I again see a space (probably an unprintable character). But
when I copy-paste it here into Xnews, my apostrophe is visible again!

Sorry guys, I seem to have a really messed-up terminal!



--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 23:07:52 by ebenZEROONE

In article ,
Rahul wrote:
> ebenZEROONE@verizon.net (Hactar) wrote in
> news:49sid5-kj5.ln1@royalty.mine.nu:
>
> > LANG=de.DE?
>
> Thanks eben! Almost... LANG = de_DE seems to be the option that works for
> me. I don't know why; but only that one works! Now all my umlauts are
> perfect. Thanks for all those helpful leads guys.
>
> Although, there's a problem with apostrope marks still. Was with EN and
> also persists now. All my apostrophe's seem to be rendered as '*'by elinks.
>
> elinks -no-references -no-numbering -dump
> 'http://translate.google.com/translate_dict?q=dog&hl=en&langpair=en%7Cde'
>
> the dog doesn't bite ==> the dog doesn*t bite etc.
>
> This is a funny one. Tried looking at the curl op for the raw html.
>
> curl -s 'http://translate.google.com/translate_dict?q=dog&hl=en&langpair=en%7Cde'
>
> the dog doesn’t bite


That's not a quote (0x27), it's an 0x92, some sort of "smart quote", I
presume. Your browser may render it as a quote, but that's not relevant
here. "pr" may fix that, "tr" or "sed" definitely will. "curl" may have
a relevant option.

Just for kicks, "od" wouldn't show that:

0003460 64 6f 65 73 6e 27 74 20 62 69 74 65 20 3d 3d 3e >doesn't bite ==><
                       ^^
so I had to use less:

the dog doesn<92>t bite
              ^^
Anyone know a more reliable method?

--
-eben QebWenE01R@vTerYizUonI.nOetP royalty.mine.nu:81
Your pretended fear lest error might step in is like the man who
would keep all wine out of the country lest men should be drunk.
-- Oliver Cromwell

Re: html to ascii conversion: quick google translate from the command line

on 17.04.2008 23:18:18 by Bill Marcum

["Followup-To:" header set to comp.unix.shell.]
On 2008-04-17, Hactar wrote:
>
>
> That's not a quote (0x27), it's an 0x92, some sort of "smart quote", I
> presume. Your browser may render it as a quote, but that's not relevant
> here. "pr" may fix that, "tr" or "sed" definitely will. "curl" may have
> a relevant option.
>
> Just for kicks, "od" wouldn't show that:
>
> 0003460 64 6f 65 73 6e 27 74 20 62 69 74 65 20 3d 3d 3e >doesn't bite ==><
>                        ^^
> so I had to use less:
>
> the dog doesn<92>t bite
>               ^^
> Anyone know a more reliable method?
>
It's strange that od and less show different characters. Were you using
the contents of a file or a pipe? Try "LANG=C od" or pipe the text
through "recode cp1252..iso-8859-15" or "recode cp1252..utf-8"
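
A rough sketch of both routes, using iconv as a stand-in for recode (an assumption; recode may not be installed everywhere) and a made-up sample line rather than the real page. The byte 0x92 is the CP1252 "smart" apostrophe, written below as octal 222:

```shell
# A sample line containing byte 0x92 (octal 222) instead of a plain apostrophe.
sample=$(printf 'the dog doesn\222t bite')

# Show the raw bytes: look for 92 where a plain apostrophe would be 27.
printf '%s' "$sample" | od -An -tx1

# Option 1: replace that byte with an ASCII apostrophe (0x27).
# LC_ALL=C makes tr operate on single bytes rather than multibyte characters.
ascii=$(printf '%s' "$sample" | LC_ALL=C tr '\222' "'")
printf '%s\n' "$ascii"

# Option 2: convert the whole stream from CP1252 to UTF-8, which keeps the
# typographic apostrophe (U+2019) instead of flattening it.
utf8=$(printf '%s' "$sample" | iconv -f CP1252 -t UTF-8)
printf '%s\n' "$utf8"
```

Option 1 suits a terminal that only handles ASCII or Latin-1; option 2 only helps if the terminal and font can display UTF-8.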

Re: html to ascii conversion: quick google translate from the command line

on 18.04.2008 02:50:10 by Andre Steinert

ebenZEROONE@verizon.net (Hactar) wrote in
news:2hokd5-vbt.ln1@royalty.mine.nu:

> That's not a quote (0x27), it's an 0x92, some sort of "smart quote", I
> presume. Your browser may render it as a quote, but that's not
> relevant here. "pr" may fix that, "tr" or "sed" definitely will.
> "curl" may have a relevant option.

Thanks Hactar and Bill! I could tr/sed the raw curl output, but the
best way out for me seems to be either lynx or elinks. Here's the alias I've
hacked together so far in tcsh:

alias germanfor "setenv LANG de_DE; elinks -no-references -no-numbering -dump 'http://translate.google.com/translate_dict?q=\!*&hl=en&langpair=en%7Cde' ; setenv LANG en_US.UTF-8"

[or similarly with lynx]

What I have not been able to figure out is how to set up lynx or elinks
to translate 0x92 to 0x27 before they display it.

I like lynx a shade better, since I had the idea of calling
lynx -color -dump ... Too ambitious, it seems (or I am doing something
stupid): lynx does not seem to like being called with -color plus -dump.
(It does display color highlighting really nicely if I just open a site
in lynx: lynx www.google.com etc.)
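
For readers on a Bourne-style shell, the same idea could be sketched as a function. The helper name germanfor_url is invented here; the URL and the elinks flags are the ones used in this thread, and the word is passed through without URL-encoding, so this only suits simple ASCII words:

```shell
# Build the query URL used in this thread for a given English word.
germanfor_url() {
    printf 'http://translate.google.com/translate_dict?q=%s&hl=en&langpair=en%%7Cde\n' "$1"
}

# Dump the page as text. LANG is overridden for this one command only,
# so there is nothing to restore afterwards.
germanfor() {
    LANG=de_DE elinks -no-references -no-numbering -dump "$(germanfor_url "$1")"
}

# Print the URL that "germanfor cat" would fetch.
germanfor_url cat
```

Setting LANG as a prefix to the command avoids the set-then-reset dance of the tcsh alias, since the variable never changes in the calling shell.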


--
Rahul

Re: html to ascii conversion: quick google translate from the command line

on 18.04.2008 07:07:52 by unknown

Post removed (X-No-Archive: yes)

Re: html to ascii conversion: quick google translate from the command line

on 19.04.2008 18:56:39 by Enrique Perez-Terron

On Wed, 16 Apr 2008 04:29:27 +0000, Rahul wrote:

> Is there a way to convert a html snippet "sensibly" to ascii plain-text.
> I just want to display a no-frills version of this google translate
> query quickly from the command-line:
>
> curl -s
> 'http://translate.google.com/translate_dict?q=cat&hl=en&langpair=en%7Cde'
>
> "cat" could be replaced by "dog" "beer" whatever and lo and behold I've
> a German translation on the command line (I wish!). This snippet throws
> a load of html at me. Is there a easy way to convert it to a
> "displayable" format? Basically just column-formatting or at most using
> bold etc. that my xterm-color console can support. html has all this
> info. embedded in its tags, right? So looks possible in theory; just
> wondering what's the best tool for the job.
>
> I have no intention of browsing further from that page so lynx seems an
> overkill.

I just tried
elinks -dump 'http:....'

in my gnome-terminal, and it displayed the text just fine, including some
umlauts, like "dünne Eisschicht". However, in the tabular layout, the
next entry was shifted one column to the left, as elinks got confused
about the number of characters in the word "dünne".

I have en_US.UTF-8.

To further investigate, I created the following file /tmp/test.html:




Trallala hopp’sann. Æh, bæ.


The special characters are <2019> (’) and AE and ae ligatures.


Then I ran elinks -dump file:///tmp/test.html, and it printed perfectly:

Trallala hopp’sann. Æh, bæ.

Then I changed the charset=UTF-8 to charset=ISO-8859-1, and the output
became

Trallala hoppâ**sann. Ã*h, bæ.

I think that indicates pretty much what the problem might be.

Regards
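
A sketch for repeating this experiment. The markup of the original test file did not survive in this archive, so the HTML below is a guess (a plain page with a meta charset declaration), and the function and file names are invented. The text bytes are UTF-8 in both files; only the declared charset differs:

```shell
# Write a small test page. $1 = charset to declare, $2 = output file.
make_test_page() {
    {
        printf '<html><head>\n'
        printf '<meta http-equiv="Content-Type" content="text/html; charset=%s">\n' "$1"
        printf '</head><body>\n'
        # \342\200\231 = U+2019, \303\206 = AE ligature, \303\246 = ae ligature
        printf 'Trallala hopp\342\200\231sann. \303\206h, b\303\246.\n'
        printf '</body></html>\n'
    } > "$2"
}

make_test_page UTF-8      /tmp/test-utf8.html
make_test_page ISO-8859-1 /tmp/test-latin1.html

# Then compare the two dumps by hand:
#   elinks -dump file:///tmp/test-utf8.html
#   elinks -dump file:///tmp/test-latin1.html
```

If the second dump shows the garbling while the first is clean, the browser is trusting the declared charset, which points at a wrong or missing charset declaration on the real page.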