HTML::Parser & <plaintext> tag

on 02.11.2004 19:57:49 by kappa

Good day to all!

As far as I can understand, HTML::Parser simply ignores the closing
</plaintext> tag. I read the tests and Changes, so I see that this is
intended behaviour and <plaintext> is special-cased among all CDATA
elements.

Does someone know the reasoning behind this decision? :) It is just plain
interesting. Does HTML::Parser imitate some old browser here? It
results in weird effects for me as I write an HTML sanitizer for
WebMail.

-- 
Alex Kapranoff,
#!/usr/bin/perl -w
$SIG{__WARN__}=sub{print substr("@_",-43+ord$_,1)for
'6.823O1US90:350:739OJ;0:*'=~m}.}g},$}='PJlshrk';reset$}+43;

Re: HTML::Parser & <plaintext> tag

on 10.11.2004 19:25:47 by gisle

Alex Kapranoff <kappa@rambler-co.ru> writes:

> As far as I can understand, HTML::Parser simply ignores the closing
> </plaintext> tag. I read the tests and Changes, so I see that this is
> intended behaviour and <plaintext> is special-cased among all CDATA
> elements.
>
> Does someone know the reasoning behind this decision? :) It is just plain
> interesting.

A long time ago the HTTP protocol did not have MIME-like headers. The
client sent a "GET foo" line and the server responded with HTML and
then closed the connection. Since there was no way for the server to
indicate any other Content-Type than text/html, the <plaintext> tag was
introduced so that text files could be served by just prefixing the
file content with this tag.

This was before the <img> tag was invented, so luckily we don't have a
similar unclosed <gif> tag :)

> Does HTML::Parser imitate some old browser here?

Yes, it was there in the beginning but still seems well supported. Of
my current browsers both Konqueror and MSIE support this. Firefox
supports it in the same way as <xmp>, i.e. it allows you to escape out
of it with </plaintext>.

The <plaintext> tag is described in this historic document:

http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html#7

> It results in weird effects for me as I write an HTML sanitizer for
> WebMail.

How come? Do you have a need to suppress this behaviour in HTML::Parser?

Regards,
Gisle

Re: HTML::Parser & <plaintext> tag

on 11.11.2004 09:11:14 by kappa

* Gisle Aas <gisle@activestate.com> [November 10 2004, 21:25]:
> then closed the connection. Since there was no way for the server to
> indicate any other Content-Type than text/html, the <plaintext> tag was
> introduced so that text files could be served by just prefixing the
> file content with this tag.
>
> This was before the <img> tag was invented, so luckily we don't have a
> similar unclosed <gif> tag :)

Thank you very much for this enlightenment! It explains everything!
BTW, by that time I had even seen computers once or twice from far
away :)

> my current browsers both Konqueror and MSIE support this. Firefox
> supports it in the same way as <xmp>, i.e. it allows you to escape out
> of it with </plaintext>.

This Firefox behaviour is likely what confused me. Look, what if
I've got HTML like this: `<plaintext></plaintext><script>nasties;</script>'?
HTML::Parser stops parsing after `<plaintext>', so no interesting
event is triggered on the `<script>' tag and my sanitizer has no chance to
rip out the nasties. Firefox (my 1st browser to test) happily resumes
parsing after `</plaintext>' and that's the problem. Maybe it is the
gecko people who are at fault.

> > It results in weird effects for me as I write an HTML sanitizer for
> > WebMail.
> How come? Do you have a need to suppress this behaviour in HTML::Parser?

Yes, I'd like to have an option to resume parsing after `</plaintext>'
just as Firefox does. Now that I understand the original intentions, I'll
try to produce a patch.

-- 
Alex Kapranoff,
#!/usr/bin/perl -w
$SIG{__WARN__}=sub{print substr("@_",-43+ord$_,1)for
'6.823O1US90:350:739OJ;0:*'=~m}.}g},$}='PJlshrk';reset$}+43;

Re: HTML::Parser & <plaintext> tag

on 11.11.2004 10:52:00 by kappa

* Alex Kapranoff <kappa@rambler-co.ru> [November 11 2004, 11:11]:
> > > It results in weird effects for me as I write an HTML sanitizer for
> > > WebMail.
> > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> Yes, I'd like to have an option to resume parsing after `</plaintext>'
> just as Firefox does. Now that I understand the original intentions, I'll
> try to produce a patch.

I've filed ticket 8362 at rt.cpan.org with the patch. It creates an
additional boolean attribute `closing_plaintext'. Not that I insist on
the naming.

-- 
Alex Kapranoff,
#!/usr/bin/perl -w
$SIG{__WARN__}=sub{print substr("@_",-43+ord$_,1)for
'6.823O1US90:350:739OJ;0:*'=~m}.}g},$}='PJlshrk';reset$}+43;

Re: HTML::Parser & <plaintext> tag

on 11.11.2004 11:21:12 by gisle

Alex Kapranoff <kappa@rambler-co.ru> writes:

> * Alex Kapranoff <kappa@rambler-co.ru> [November 11 2004, 11:11]:
> > > > It results in weird effects for me as I write an HTML sanitizer for
> > > > WebMail.
> > > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> > Yes, I'd like to have an option to resume parsing after `</plaintext>'
> > just as Firefox does. Now that I understand the original intentions, I'll
> > try to produce a patch.
>
> I've filed ticket 8362 at rt.cpan.org with the patch. It creates an
> additional boolean attribute `closing_plaintext'. Not that I insist on
> the naming.

Seems good; and I've just uploaded HTML-Parser-3.38 with this patch.
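
For readers who want to see the difference discussed in this thread, here is
a minimal sketch. It assumes HTML-Parser 3.38 or later (the release that
carries the `closing_plaintext' attribute) and feeds the example string from
above to the parser with the attribute off and on. The helper name
start_tags_seen is made up for illustration; it is not part of HTML::Parser.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of HTML::Parser): returns the names of all
# start tags the parser reports for $html.
sub start_tags_seen {
    my ($html, $closing) = @_;
    my @tags;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @tags, $_[0] }, 'tagname' ],
    );
    # Assumption: HTML-Parser >= 3.38, where this boolean attribute exists.
    # False keeps the historic behaviour (nothing closes <plaintext>);
    # true lets </plaintext> end the element so parsing resumes after it.
    $p->closing_plaintext($closing);
    $p->parse($html);
    $p->eof;
    return @tags;
}

my $html = '<plaintext></plaintext><script>nasties;</script>';
for my $closing (0, 1) {
    my @tags = start_tags_seen($html, $closing);
    print "closing_plaintext=$closing: @tags\n";
}
```

With the attribute enabled, a sanitizer built on start events should get a
chance to see the `<script>' tag that follows `</plaintext>'. Whether that is
the right setting probably depends on which browsers the sanitized mail will
be shown in, since they disagree on how `</plaintext>' is handled.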