Web Sources Framework: Help needed

sundance · October 14, 2010, 11:47am

Hello,

I would like to iterate through an HTML table that looks like this:

.
.
<table>
  <tr>
    <td>Info 1a</td>
    <td>Info 1b</td>
  </tr>
  <tr>
    <td>Info 2a</td>
    <td>Info 2b</td>
  </tr>  
</table>
.
.

(there could be blank lines in between, also)

I tried with something like this:

findline "<table>"
do
  findline "<tr>"
    findline "<td>"
    findinline "<td>"
    sayuntil "</td>"
  findline "</tr>"
  say "|"

  findline "<tr>"
    findline "<td>"
    findinline "<td>"
    sayuntil "</td>"
  findline "</tr>"
  saynewline
until ???

How would you create the until condition?

.sundance.

pone · October 14, 2010, 1:15pm

I don't think there is a "do ... until" command. It would be useful, but I have never seen it. If there is one, please tell me.

Instead there is "do ... while" which requires that there is always the same text before the information like

.
The problem with the blank lines is easy to handle with a command.

You want output like this:

Info 1a | Info 1b
Info 2a | Info 2b

so I would try: findline "

"

joinuntil "</table>"
regexpreplace "</td><td>" " \| "
regexpreplace "</tr><tr>" "\r\n"
regexpReplace "<(.*?)>" ""
sayrest
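
For readers who want to check the idea outside Mp3tag, here is a rough Python sketch of the same join-then-replace approach. It is not Web Sources script: the function name `table_to_text` and the sample data are made up for illustration, and it assumes that joining the lines drops blank lines and leading/trailing whitespace (whether joinuntil does that is not confirmed in this thread). It also uses "\n" where the script above uses "\r\n".

```python
import re

# Sample input modelled on the simplified table from the first post,
# including a blank line between the rows.
SAMPLE = """<table>
  <tr>
    <td>Info 1a</td>
    <td>Info 1b</td>
  </tr>

  <tr>
    <td>Info 2a</td>
    <td>Info 2b</td>
  </tr>
</table>""".splitlines()


def table_to_text(lines):
    # Join everything up to and including the closing </table> into one line.
    # Assumption: whitespace around each line is dropped while joining.
    joined = ""
    for line in lines:
        joined += line.strip()
        if "</table>" in line:
            break
    # Cell boundaries become " | ", row boundaries become line breaks.
    joined = re.sub(r"</td><td>", " | ", joined)
    joined = re.sub(r"</tr><tr>", "\n", joined)
    # Finally strip every remaining tag.
    return re.sub(r"<.*?>", "", joined)


print(table_to_text(SAMPLE))
```

The order matters: the tag-stripping step has to come last, otherwise the cell and row boundaries are gone before they can be turned into separators.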

sundance · October 14, 2010, 5:55pm

Hello pone,

thanks for your replies!
And yes, you're right, there's no "until", it's a do/while loop...

Nice idea to do it with regexes, but it's not as easy as in my example, which I tried to simplify (too much, I guess...)

The actual output looks like this:

.
.
  <tr class="significant">
    <td class="relevance text-center">
      <div class="bar" style="width:100%" title="100%"></div>
    </td>

    <td class="text-center">
      <a href="[some link]"><img src="[some pic]" width="9px" height="10px"></a>
    </td>

    <td><a href="http://[some album url]">Post Card</a></td>

    <td>Mary Hopkin</td>

    <td>Toshiba EMI</td>

    <td>1969</td>

    <td>Pop/Rock</td>
  </tr>
.
.

From the first <td> only "100%" is needed,
the second must be discarded,
from the 3rd I need the URL and the album title "Post Card"
and the remaining 4 I need completely, so my result string is:

(I can't name the web site this script is intended for, though it's not difficult to guess, since it's somehow "incorrect" to run scripts against their web site. But lately they changed their code and so there's work to do...)

By now, I'm able to collect all the information needed for the indexpage, but my do/while loop won't work. Maybe I have to reconstruct it to find a proper "while" condition.
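
The per-cell rules described above (take the title attribute from the first cell, skip the second, take URL and link text from the third, keep the remaining four as they are) could be sketched in Python roughly like this. It is only an illustration of the extraction logic, not Web Sources script; the function name `parse_row` and the choice to return a plain list are assumptions, and the placeholders from the sample row are kept as they appear.

```python
import re

# The sample row from the post, placeholders left untouched.
ROW = """<tr class="significant">
    <td class="relevance text-center">
      <div class="bar" style="width:100%" title="100%"></div>
    </td>

    <td class="text-center">
      <a href="[some link]"><img src="[some pic]" width="9px" height="10px"></a>
    </td>

    <td><a href="http://[some album url]">Post Card</a></td>

    <td>Mary Hopkin</td>

    <td>Toshiba EMI</td>

    <td>1969</td>

    <td>Pop/Rock</td>
  </tr>"""


def parse_row(row_html):
    # Collect the inner HTML of every <td>, across line breaks.
    cells = re.findall(r"<td[^>]*>(.*?)</td>", row_html, flags=re.DOTALL)
    # Cell 1: only the title attribute ("100%") is needed.
    relevance = re.search(r'title="([^"]*)"', cells[0]).group(1)
    # Cell 2 is discarded. Cell 3: album URL and album title.
    link = re.search(r'<a href="([^"]*)">([^<]*)</a>', cells[2])
    # Cells 4 to 7 are taken completely.
    rest = [cell.strip() for cell in cells[3:7]]
    return [relevance, link.group(1), link.group(2)] + rest


print(parse_row(ROW))
```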

.sundance.

dano · October 14, 2010, 6:16pm

findline "

" 1 1
unspace
while ""

sundance · October 18, 2010, 1:05pm

dano,

thanks for the hint, you made my day...
Meanwhile I managed to fill the index page properly.

Being the regex specialist that you are, it's probably a no-brainer for you to help me with my next problem:
In the [ParserScriptAlbum] section I'm trying to enumerate the tracks and their composers/artists from the web site in question (www.dummy.com). I get this result:

.
.
<td id="expand-title" class="cell">
  <a href="http://www.dummy.com/artist/blackmore-p15867">Blackmore</a>, <a href="http://www.dummy.com/artist/gillan-p4366">Gillan</a>, <a href="http://www.dummy.com/artist/glover-p17896">Glover</a>, <a href="http://www.dummy.com/artist/lord-p19001">Lord</a>&hellip;</td>
.
...

To extract the list of composers, I tried with "SayRegExp":

SayRegExp "(?<=<a href=.+>)[^<]+(?=</a>)" ", " "</td>"

But all I get is an error message:

Script-Line    : 236
Command        : sayregexp
Parameter 1    : >(?<=<a href=.+>)[^<]+(?=</a>)<
Parameter 2    : >, <
Parameter 3    : ></td><

Output         : >Regular expression

Invalid lookbehind assertion encountered in the regular expression.<

I checked my regex with this tool, which delivers the result I expect.
-> I'm clueless...

.sundance.

dano · October 18, 2010, 1:42pm

Mp3tag regex is based on Perl regex, so lookbehind has limitations: it needs a fixed width.
Inside (?<=...) you can't use .+

.NET regex is a bit more powerful; that's why it works in your test tool.
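
The same restriction can be reproduced outside Mp3tag. Python's `re` module happens to reject variable-width lookbehind as well, so the small sketch below shows the difference between the failing pattern and a fixed-width one. This is Python, not Mp3tag; the helper `compiles`, the alternative pattern and the sample links are hypothetical and only serve as illustration.

```python
import re

# Pattern from the error message: ".+" inside the lookbehind has no fixed width.
VARIABLE_WIDTH = r"(?<=<a href=.+>)[^<]+(?=</a>)"
# A lookbehind on exactly two characters (quote + bracket) has a fixed width.
FIXED_WIDTH = r'(?<=">)[^<]+(?=</a>)'


def compiles(pattern):
    # True if the engine accepts the pattern, False if it rejects it.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(compiles(VARIABLE_WIDTH))  # rejected by Python's re
print(compiles(FIXED_WIDTH))     # accepted

# Made-up sample links, just to show the fixed-width pattern matching.
sample = '<a href="x">Blackmore</a>, <a href="y">Gillan</a>'
print(re.findall(FIXED_WIDTH, sample))
```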

sundance · October 19, 2010, 5:29am

dano,

thank you very much for your valuable feedback.
Now it works. I added another RegexpReplace to be able to supply a fixed width lookbehind parameter:

.
RegexpReplace "<a href=\x22http://www.dummy.com/artist/.+?\x22>" "<ax>"
.
OutputTo "Composer"
SayRegExp "(?<=<ax>)[^<]+(?=</a>)" ", " "</td>"
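
For comparison, the two-step workaround above (first replace the variable-length opening tag with a fixed marker, then use a fixed-width lookbehind) can be sketched in Python. Again this is not Web Sources script; the function name `composers` is made up, and cutting the input at "</td>" is my reading of what the third SayRegExp parameter does, which is an assumption.

```python
import re

# The artist line from the earlier post, split across string literals.
LINE = (
    '<a href="http://www.dummy.com/artist/blackmore-p15867">Blackmore</a>, '
    '<a href="http://www.dummy.com/artist/gillan-p4366">Gillan</a>, '
    '<a href="http://www.dummy.com/artist/glover-p17896">Glover</a>, '
    '<a href="http://www.dummy.com/artist/lord-p19001">Lord</a>&hellip;</td>'
)


def composers(line):
    # Step 1: collapse every variable-length opening tag into the marker <ax>.
    marked = re.sub(r'<a href="http://www\.dummy\.com/artist/.+?">', "<ax>", line)
    # Assumption: matching stops at the closing </td>.
    head = marked.split("</td>", 1)[0]
    # Step 2: the lookbehind now has a fixed width of four characters.
    names = re.findall(r"(?<=<ax>)[^<]+(?=</a>)", head)
    return ", ".join(names)


print(composers(LINE))
```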

(to be continued when the next issue surfaces...)

.sundance.

sundance · October 20, 2010, 4:57am

@Admin,
I was trying to upload a file here (an XML file) but it was not possible. Do I need a special permission to do so?

Solved:
Notepad___mp3tag.zip (905 Bytes)
Thanks for the hint, DetlevD.
I uploaded a language extension for (my favourite) text editor Notepad++, which enables syntax highlighting for the Web Sources Framework. I find it quite useful because you instantly see typos when you create or edit your .src files.

.sundance.


DetlevD · October 20, 2010, 5:10am

I am not "@Admin", but ...
... what happens when you rename the file to e.g. "YourXMLFile.xml.txt" or pack the file into a zip archive?

DD.20101020.0910.CEST

LyricsLover · January 12, 2022, 9:00am

I have added my UDL version here, with detailed instructions.