Filtering image tags from HTML content with RegExp node

guest · September 22, 2011, 2:07pm

Hey guys,

I’m currently trying to filter images from a list of websites. However, I’m stuck with the RegExp node, because my regular expression just won’t match the way I expect.

This

<img .*? src=["'](http://.*?[\.jpg]+)["'>\s]+

should get me all the *.jpg URLs in the source code. That partly works, but about 20% of the filtered strings look like this afterwards:

http://bla.com/bla/somepicture.gif" alt="bla"></a></tag><whatever>

I just can’t find the reason why 1) some GIFs get through and 2) it doesn’t cut them at the first “>”, quote character, or space. I’ve tried everything from ^ and the dollar sign to “.*?”, but I just can’t get it right.

Anyway, does anyone have a hint for me? Thanks in advance! :)

david · September 22, 2011, 4:51pm

Hey Mr. Trompeter,

Without going through your expression code…
Why don’t you use XPath (XML) to retrieve your image paths? This would at least avoid the issues with malformed HTML strings, which are probably the reason for your regexp problems.
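
To illustrate the idea of letting a parser find the tags instead of a regular expression, here is a minimal sketch in Python rather than in the RegExp node itself. It is not XPath: it swaps in the standard library's `html.parser`, which tolerates sloppy HTML, and reads the `src` attribute of each `<img>` tag. The names `ImgSrcCollector`, `jpg_sources` and the `sample` string are made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value) pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def jpg_sources(html):
    """Return the src of every <img> in html whose URL ends in .jpg."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    # Filter on the extension after parsing, so attribute order and
    # quote style in the markup no longer matter.
    return [s for s in collector.sources if s.lower().endswith(".jpg")]


# Hypothetical sample markup, loosely modelled on the strings in the question.
sample = (
    '<a href="#"><img class="x" src="http://bla.com/bla/one.jpg" alt="a"></a>'
    "<img src='http://bla.com/bla/somepicture.gif' alt=\"bla\">"
    '<p><img alt="b" src="http://bla.com/two.JPG"></p>'
)
print(jpg_sources(sample))
# ['http://bla.com/bla/one.jpg', 'http://bla.com/two.JPG']
```

As a side note on the pattern in the question: `[\.jpg]` is a character class that matches any single one of `.`, `j`, `p` or `g`, not the literal text `.jpg`. Since `.*?` can also match quotes and `>`, a match can run past the end of a `.gif` URL until it reaches one of those characters followed by a quote, `>` or whitespace, which would explain both symptoms.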