HTML Parsing

Does anyone have an example handy of how to parse human-readable text out of an HTTP stream? I know there has to be a simple solution, and I feel like I’m just missing something here. Or is this a case of having to learn way more about RegEx than I really want to?

In case my summary isn’t clear enough: I’m trying to, for example, feed vvvv the URL of my LiveJournal and have it spit out just the entries, without visible HTML. I can read HTML, but I don’t expect my audience to be able to do so ;>


You could simply use Renderer (HTML)…
… or check out the transmedia redef.v4p in your girlpower folder…

Hm. It seems the latest release, Beta 9.11, no longer includes the HTTP (Network Get) node? Quite likely a bug.
I suggest you try an older beta to understand the transmedia demo…

Or you can do it by parsing the XML behind the HTML: HTTP (Network Get) --> Tidy (XML) --> XPath (XML)

Tidy turns badly formed HTML into well-formed XML so that the XML parser in the XPath node can read it. XPath then extracts the content of the tags it points to.
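
For anyone who wants to see the same fetch --> tidy --> XPath idea outside of a patch, here is a rough Python sketch using only the standard library. It is not what the vvvv nodes do internally: the TidyLikeBuilder class is a crude, hypothetical stand-in for Tidy, ElementTree only supports a small subset of XPath, and the sample markup and the "entry" class name are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# HTML tags that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TidyLikeBuilder(HTMLParser):
    """Crude stand-in for Tidy: builds a well-formed tree from sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []
        # Wrap everything in one artificial root element.
        self.builder.start("root", {})

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void tags immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close any unclosed children first, then the tag itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.builder.end(open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def finish(self):
        """Flush the parser, close everything still open, return the root."""
        super().close()
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("root")
        return self.builder.close()


def fetch(url):
    """Download a page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_text(html, xpath):
    """Return the whitespace-normalised text of every element matching xpath."""
    parser = TidyLikeBuilder()
    parser.feed(html)
    root = parser.finish()
    return [" ".join("".join(el.itertext()).split())
            for el in root.findall(xpath)]


# Made-up sample with unclosed <p>, <br> and <html> tags.
SAMPLE = (
    '<html><body>'
    '<div class="entry"><p>First <b>post</b><br> of the day</div>'
    '<div class="entry"><p>Second post &amp; more</div>'
    '<div class="sidebar">links</div>'
    '</body>'
)

if __name__ == "__main__":
    for entry in extract_text(SAMPLE, ".//div[@class='entry']"):
        print(entry)
```

For a live page you would pass fetch(your_url) instead of SAMPLE and adjust the XPath expression to whatever tag actually wraps the entries. Real-world pages are messier than this builder can handle, so a proper Tidy binding is the safer choice once the idea is clear.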

By the way, we have just figured out that HTTP (Network Get) and HTTP (Network Post) are missing in the latest release. Joreg?


I am not sure what you mean. For me, the HTTP (Network Get) and HTTP (Network Post) nodes and their helpfiles are in the release. I just re-downloaded it and they are still there.

Also, the transmedia patch runs fine with the Get node inside.

You are right. Somebody here at meso was wrong! Sorry for the confusion.

max: Renderer (HTML) doesn’t do what I want (I want to output just the text, say, as strings and manipulate them). However, the transmedia redef gives me a good place to start. I guess I need to work my way through girlpower more carefully. ;>



The HTTP nodes are in the new beta, as that’s what I’m using right now. (It’s working great, except that it sometimes doesn’t seem to quit cleanly, but I need to research that more to see if it’s a bug or just my system.)

david: I will definitely try that approach. I didn’t even think of Tidy. Thanks for the feedback.

I had used the now-withdrawn beta 11. The latest release is Beta 11.1, and that has got all the nodes.