"Tidy" bug with utf-8

The tidy node seems to not work properly with all the new unicorn stuff.
Please look at attached sample patch.

Tidy_Bug.v4p (8.1 kB)

thanks. presumably fixed. please check latest alpha and report your findings. note that i’ve removed the In/Out Encoding pins as i’d argue they no longer make sense. vvvv is now unicorn only and you cannot get any other string into tidy anyway…

hei herbst, can you confirm this working? or still having problems?

Mh, tidy is used to tidy up “real world HTML” which still comes in many flavors (encodings). How do you propose to deal with those?

right, then i’d argue a conversion has to happen before tidy and even before the xml-string is available on any vvvv string-pin.

so the question is where do you get the xml-string from?

or am i missing something?

So when I get my string with HTTP (Network Get String) and specify the right encoding, is it automatically converted to utf-8? If not, the encoding (when different from utf-8) still has to be specified on Tidy, doesn’t it? Is there any other node that does that kind of conversion?

What happens when I want to output another encoding (for whatever reason)?

actually not utf-8, but vvvv's internal representation of strings (which is in fact utf-16, as are all delphi, .net and windows strings).

the important thing is that once the string is in vvvv the encoding is no longer relevant. only when getting strings into vvvv (reader, http, …) or writing strings out of vvvv (writer, …) you need to deal with encodings. since Tidy is just a string-manipulation node it does not deal with encodings.
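To illustrate the point about boundaries, here is a minimal Python sketch (not vvvv code). It assumes the bytes arrive as Latin-1; `read_in`, `tidy_like` and `write_out` are hypothetical names standing in for an input node, a string-only node and an output node.

```python
# Bytes as they might arrive from a server that uses Latin-1 (an assumption).
# "\u00fc" is u-umlaut, "\u00df" is sharp s.
raw = "<p>Gr\u00fc\u00dfe</p>".encode("latin-1")


def read_in(data: bytes, encoding: str) -> str:
    """Input boundary: turn encoded bytes into an encoding-free string."""
    return data.decode(encoding)


def tidy_like(text: str) -> str:
    """Hypothetical string-only step; it never sees bytes or encodings."""
    return text.strip()


def write_out(text: str, encoding: str) -> bytes:
    """Output boundary: an encoding is chosen only when leaving the program."""
    return text.encode(encoding)


text = read_in(raw, "latin-1")
cleaned = tidy_like(text)
out_utf8 = write_out(cleaned, "utf-8")
out_utf16 = write_out(cleaned, "utf-16-le")
```

The string in the middle is the same whichever encoding is picked at either end; only the byte sequences differ.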

“output” is the trigger here. for that you need an output node that can convert a vvvv-string to anything else, eg. a string with a special encoding, which then no longer is a vvvv-string.

hope that makes sense.
but more important still: herbst! does it work?

Actually it does, it changes the encodings. Setting an input encoding makes it assume the incoming string is encoded that way. Setting the output-encoding makes tidy (re-)encode the string accordingly.
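As a rough illustration of what an input and an output encoding setting amount to at the byte level, here is a small Python sketch. It is not the tidy library's API, just plain decoding and encoding of the same data under different assumptions.

```python
# One character (u-umlaut) stored as UTF-8 bytes.
data = "\u00fc".encode("utf-8")

# "Input encoding": the same bytes read under two different assumptions.
as_utf8 = data.decode("utf-8")      # the intended single character
as_latin1 = data.decode("latin-1")  # two characters, the usual mojibake

# "Output encoding": re-encode the correctly decoded string as Latin-1.
reencoded = as_utf8.encode("latin-1")
```

Once the text is already a decoded string, as on a vvvv pin, there are no bytes left to reinterpret, which is the argument for doing this at the input and output nodes only.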

nono, yesyes.
so the original tidy library has this functionality of dealing with in/out-encodings. useful for people using the library dealing with strings of different encodings.

but vvvv-strings cannot have different encodings. so what goes into and comes out of the tidy-node is (and must always be) just a vvvv string.

if you want to mess with encodings you have to do this when getting the string into vvvv or getting it out of vvvv. no encoding-mess inside of vvvv.

no?

Well, ok. One thing though (just a side note) besides changing the actual encoding tidy also changed the charset information in the html header:

If someone wants to use it the way it was meant to be used, they have to build a module that mimics this behavior.

fair point. that's indeed changed behaviour. but charset should now always be utf-8, which will be correct as long as you don’t choose a different encoding when saving to a file. in that case, as you mentioned, one would have to take care of this manually.
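Taking care of it manually could look roughly like this Python sketch. It is a naive illustration, not what Tidy did internally: `save_bytes` is a hypothetical helper, and the regex simply rewrites the first `charset=` it finds, so it assumes the declaration comes before any other occurrence of that text.

```python
import re

# Matches the charset value in <meta charset="..."> as well as in the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
CHARSET_RE = re.compile(r'(charset\s*=\s*["\']?)([A-Za-z0-9_.:-]+)', re.IGNORECASE)


def save_bytes(html: str, encoding: str) -> bytes:
    """Rewrite the declared charset, then encode with that same encoding."""
    declared = CHARSET_RE.sub(lambda m: m.group(1) + encoding, html, count=1)
    return declared.encode(encoding)


html = '<html><head><meta charset="utf-8"></head><body>\u00fc</body></html>'
out = save_bytes(html, "iso-8859-1")
```

The point is only that the declaration and the actual byte encoding are changed together, so the saved file does not claim utf-8 while containing something else.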

Still not working (or another problem, don’t know - have a quite big string input here).

See attached patch. If you change the doctype of Tidy to “XML” or “XML (No Header)”, vvvv freezes for some time and then just crashes.

Edit: tested in latest beta and latest alpha, both have the same problem.

download-instrument-library.v4p (7.1 MB)

please try with a pre-unicorn beta and see if it worked there.

beta25.1 crashes on both XML and XMLNoHeader, using ASCII or utf8 (had to restart the PC after the crash, like in the good ol' days. seems you made some very good improvements to vvvv's crashing behaviour in the last few versions).

beta 27.2, on the other hand, works with XML but crashes on XMLNoHeader.

probably won’t help, but here are the errors tidy reports for your document before it crashes.

2:251:Error: unexpected in
2:255:Error: unexpected in
2:260:Error: unexpected in
2:508:Error: unexpected in
2:512:Error: unexpected in
2:517:Error: unexpected in

the error is inside tidy so there isn’t much we can do about it…

Hm, interestingly tidy seems to resolve that error in non-xml mode correctly.
I know the error is there (some “oldstyle” html formatting: ), and I thought this is exactly what tidy is supposed to solve. And it does in non-xml mode (correctly formats it as ).
And with less input (only feeding “” into tidy) it works as advertised. Strange.