HTML::Parser & <plaintext> tag

on 02.11.2004 19:57:49 by kappa

Good day to all!

As far as I can understand HTML::Parser simply ignores closing
</plaintext> tag. I read the tests and Changes so I see that this is
intended behaviour and <plaintext>
is special-cased of all CDATA elements.

Does someone know the reasoning of this decision? :) It is just plain
interesting. Does HTML::Parser imitate some old browser here?

It results in weird effects for me as I write a HTML sanitizer for
WebMail.

-- Alex Kapranoff,
#!/usr/bin/perl -w
$SIG{__WARN__}=sub{print substr("@_",-43+ord$_,1)for
'6.823O1US90:350:739OJ;0:*'=~m}.}g},$}='PJlshrk';reset$}+43;
</article>
<article>
<h2>Re: HTML::Parser & <plaintext> tag</h2>
on 10.11.2004 19:25:47 by gisle

Alex Kapranoff <kappa@rambler-co.ru> writes:

> As far as I can understand HTML::Parser simply ignores closing
> </plaintext> tag. I read the tests and Changes so I see that this is
> intended behaviour and <plaintext> is special-cased of all CDATA
> elements.
>
> Does someone know the reasoning of this decision? :) It is just plain
> interesting.

A long time ago the HTTP protocol did not have MIME-like headers. The
client sent a "GET foo" line and the server responded with HTML and
then closed the connection. Since there was no way for the server to
indicate any other Content-Type than text/html, the <plaintext> tag was
introduced so that text files could be served by just prefixing the
file content with this tag.

This was before the <img> tag was invented, so luckily we don't have a
similar unclosed <gif> tag :)

> Does HTML::Parser imitate some old browser here?

Yes, it was there in the beginning but still seems well supported. Of
my current browsers both Konqueror and MSIE support this. Firefox
supports it in the same way as <xmp>, i.e. it allows you to escape out
of it with </plaintext>.

The <plaintext> tag is described in this historic document:

  http://www.w3.org/History/19921103-hypertext/hypertext/WWW/MarkUp/Tags.html#7

> It results in weird effects for me as I write a HTML sanitizer for
> WebMail.

How come? Do you have a need to suppress this behaviour in HTML::Parser?

Regards,
Gisle
</article>
<article>
<h2>Re: HTML::Parser & <plaintext> tag</h2>
on 11.11.2004 09:11:14 by kappa

* Gisle Aas <gisle@activestate.com> [November 10 2004, 21:25]:

> then closed the connection. Since there was no way for the server to
> indicate any other Content-Type than text/html, the <plaintext> tag was
> introduced so that text files could be served by just prefixing the
> file content with this tag.
>
> This was before the <img> tag was invented, so luckily we don't have a
> similar unclosed <gif> tag :)

Thank you very much for this enlightenment! It explains everything!
BTW, by that time I had even seen computers once or twice from far
away :)

> my current browsers both Konqueror and MSIE support this. Firefox
> supports it in the same way as <xmp>, i.e. it allows you to escape out
> of it with </plaintext>.

This Firefox behaviour is likely what confused me. Look, what if
I've got HTML like this:
`<plaintext></plaintext><script>nasties;</script>'? HTML::Parser stops
parsing after `<plaintext>', so no interesting event is triggered on the
`<script>' tag and my sanitizer has no chance to rip out the nasties.
Firefox (the first browser I tested) happily resumes parsing after
`</plaintext>', and that's the problem. Maybe it is the Gecko people who
are at fault.

> > It results in weird effects for me as I write a HTML sanitizer for
> > WebMail.
> How come? Do you have a need to suppress this behaviour in HTML::Parser?

Yes, I'd like to have an option to resume parsing after `</plaintext>'
just as Firefox does. Now that I understand the original intentions, I'll
try to produce a patch.

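The mismatch described above can be checked with a short script. This is only a sketch, assuming HTML-Parser 3.38 or later, where the `closing_plaintext` attribute named in this thread is available. The `start_tags` helper is a hypothetical name used for illustration; it just collects the start-tag names that the parser reports.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the names of all start tags that
# HTML::Parser reports for $html. $closing is handed to the
# closing_plaintext attribute discussed in this thread
# (assumes HTML-Parser 3.38 or later).
sub start_tags {
    my ($html, $closing) = @_;
    my @seen;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @seen, shift }, 'tagname' ],
    );
    $p->closing_plaintext($closing);
    $p->parse($html);
    $p->eof;
    return @seen;
}

# The input from the post above.
my $html = '<plaintext></plaintext><script>nasties;</script>';

print "default:           @{[ start_tags($html, 0) ]}\n";
print "closing_plaintext: @{[ start_tags($html, 1) ]}\n";
```

With the attribute off, only the `plaintext` start tag should be reported, because everything after it is handed over as text. With it on, the `script` start tag should show up as well, which is what a sanitizer needs in order to act on it.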
-- Alex Kapranoff
</article>
<article>
<h2>Re: HTML::Parser & <plaintext> tag</h2>
on 11.11.2004 10:52:00 by kappa

* Alex Kapranoff <kappa@rambler-co.ru> [November 11 2004, 11:11]:

> > > It results in weird effects for me as I write a HTML sanitizer for
> > > WebMail.
> > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> Yes, I'd like to have an option to resume parsing after `</plaintext>'
> just as Firefox does. Now that I understand the original intentions, I'll
> try to produce a patch.

I've filed ticket 8362 at rt.cpan.org with the patch. It adds a
boolean attribute `closing_plaintext'. Not that I insist on the
naming.

-- Alex Kapranoff
</article>
<article>
<h2>Re: HTML::Parser & <plaintext> tag</h2>
on 11.11.2004 11:21:12 by gisle

Alex Kapranoff <kappa@rambler-co.ru> writes:

> * Alex Kapranoff <kappa@rambler-co.ru> [November 11 2004, 11:11]:
> > > > It results in weird effects for me as I write a HTML sanitizer for
> > > > WebMail.
> > > How come? Do you have a need to suppress this behaviour in HTML::Parser?
> > Yes, I'd like to have an option to resume parsing after `</plaintext>'
> > just as Firefox does. Now that I understand the original intentions, I'll
> > try to produce a patch.
>
> I've filed ticket 8362 at rt.cpan.org with the patch. It adds a
> boolean attribute `closing_plaintext'. Not that I insist on the
> naming.

Seems good; and I've just uploaded HTML-Parser-3.38 with this patch.
</article>
</main> </body> </html>