edrikk
August 31, 2010, 5:38pm
1
Hi guys,
Could you please help me with a regular expression? I'm trying to pull the list of producers from IMDB's combined page (see below for an example):
http://www.imdb.com/title/tt0126029/combined
I have this so far, but it's not working:
findline "Produced by" 1 1
unspace
if "
"
findinline "Produced by"
outputto "PRODUCERS"
joinuntil ""
sayregexp "/">[^<]+(?=<)" "@@" ""
Thanks in advance folks!
dano
August 31, 2010, 6:11pm
2
It's working fine, but you don't have an endif at the end.
edrikk, because I rather hate this sort of crippled webscript language, I can't offer much help with your request.
Below is the complete TABLE structure that contains the producer names.
How would you get all the names from the first TD cell of each TR using the webscript language?
<table border="0" cellpadding="1" cellspacing="1">
<tr>
<td colspan="3" align="left"><h5><a class="glossary" name="producers" href="/glossary/P#producer">Produced by</a></h5></td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0254645/">Ted Elliott</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/C#co-producer">co-producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0277896/">Penney Finkelman Cox</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/E#executive_producer">executive producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0367286/">Jane Hartwell</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/A#assoc_producer">associate producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0005076/">Jeffrey Katzenberg</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/P#producer">producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0513502/">David Lipman</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top">co-executive producer </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0704968/">Sandra Rabins</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/E#executive_producer">executive producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0744429/">Terry Rossio</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/C#co-producer">co-producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0912403/">Aron Warner</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/P#producer">producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0930964/">John H. Williams</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/P#producer">producer</a> </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm1306049/">Linda Olszewski</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top">assistant producer (uncredited) </td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0000229/">Steven Spielberg</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/E#executive_producer">executive producer</a> (uncredited) </td>
</tr>
<tr>
<td colspan="4"> </td>
</tr>
</table>
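For comparison, outside the webscript language the "first TD of each TR" question can be answered with a small parser instead of a regular expression. The following is only a sketch in Python using the standard library's html.parser; the class name FirstCellNames, the helper producer_names and the shortened sample table are made up for illustration, and it assumes that producer links always start with /name/ as they do in the table above.

```python
from html.parser import HTMLParser


class FirstCellNames(HTMLParser):
    """Collect the link text found in the first <td> of every <tr>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self.cell_index = 0    # number of <td> tags seen in the current row
        self.in_link = False   # True while inside a /name/ link in cell 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.cell_index = 0
        elif tag == "td":
            self.cell_index += 1
        elif tag == "a" and self.cell_index == 1:
            href = dict(attrs).get("href") or ""
            # Assumption: person links start with /name/; this skips the
            # "Produced by" header link, which points at /glossary/.
            self.in_link = href.startswith("/name/")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.names.append(data.strip())


def producer_names(table_html):
    """Return the names from the first cell of each row, in page order."""
    parser = FirstCellNames()
    parser.feed(table_html)
    return parser.names


# Shortened copy of the table above, used only as sample input.
sample = """
<table>
<tr>
<td colspan="3"><h5><a class="glossary" name="producers" href="/glossary/P#producer">Produced by</a></h5></td>
</tr>
<tr>
<td valign="top"><a href="/name/nm0254645/">Ted Elliott</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/C#co-producer">co-producer</a> </td>
</tr>
<tr> <td valign="top"><a href="/name/nm0367286/">Jane Hartwell</a></td>
<td valign="top" nowrap="1"> .... </td>
<td valign="top"><a href="http://www.imdb.com/glossary/A#assoc_producer">associate producer</a> </td>
</tr>
</table>
"""

print("@@".join(producer_names(sample)))
```

Counting cells per row means the role links in the third column are ignored without needing any special pattern for them.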
DD.20100831.2215.CEST
edrikk
August 31, 2010, 6:57pm
4
Thanks Dano,
Sorry, the endif was missed in a copy-and-paste.
It works fine, EXCEPT each producer has an extra /"> before their name.
/">Ted Elliott@@/">Penney Finkelman Cox@@/">Jane Hartwell@@/">Jeffrey Katzenberg@@/">David Lipman@@/">Sandra Rabins@@/">Terry Rossio@@/">Aron Warner@@/">John H. Williams@@/">Linda Olszewski@@/">Steven Spielberg
My error is in the first portion of my regular expression (the leading /"> in the line below), for which I'm seeking help:
sayregexp "/">[^<]+(?=<)" "@@" ""
edrikk
August 31, 2010, 7:12pm
5
I seem to have fixed it...
(?<=/">)[^<]+(?=<)
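The difference between the two patterns is easy to see outside the webscript engine. The snippet below is a rough sketch in Python's re module, which is an assumption on my part: the webscript's regex flavour may differ, although it evidently supports lookbehind since the fix works. The sample string is two cells taken from the table posted above.

```python
import re

# Two name cells taken from the producer table earlier in the thread.
html = (
    '<td valign="top"><a href="/name/nm0254645/">Ted Elliott</a></td>'
    '<td valign="top"><a href="/name/nm0277896/">Penney Finkelman Cox</a></td>'
)

# Original pattern: the literal /"> is consumed, so it is part of each match.
with_prefix = re.findall(r'/">[^<]+(?=<)', html)

# Fixed pattern: the lookbehind only asserts that /"> precedes the name,
# so the match starts at the first character of the name itself.
names_only = re.findall(r'(?<=/">)[^<]+(?=<)', html)

print("@@".join(with_prefix))
print("@@".join(names_only))
```

Only the /name/ links end in /"> in that table, which is why the glossary links in the third column are not picked up by either pattern.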
This should also work to detect all HTML tags:
</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>
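A quick way to try the tag pattern is shown below, again as a sketch in Python rather than the webscript language. It assumes the pattern is meant to carry * quantifiers after the tag-name class and after [^>] (they appear to have been lost in posting); the names TAG, row, tag_names and text_only are illustrative only.

```python
import re

# Assumed form of the tag pattern, with the * quantifiers in place.
TAG = re.compile(r'</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>')

# One cell from the producer table earlier in the thread.
row = '<td valign="top"><a href="/name/nm0254645/">Ted Elliott</a></td>'

# Group 1 holds the tag name for both opening and closing tags.
tag_names = [m.group(1) for m in TAG.finditer(row)]

# Removing every tag leaves only the text content.
text_only = TAG.sub("", row)

print(tag_names)
print(text_only)
```

A pattern like this stops at the first > it meets, so an attribute value that itself contains > would be cut short; for the IMDB table above that case does not arise.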
DD.20100831.2321.CEST