Can't get full browser HTML source for IMDB titles

Hello,

I have an IMDB web sources script that scrapes movie metadata for my library. A few months ago the structure of the returned HTML appears to have changed significantly, and I am now working to fix the script. The HTML in the debug output differs significantly from the HTML source I see directly in my browser (Chrome), and much of the data I want to scrape is only available in the source Chrome receives. I presume IMDB is detecting the browser type and returning different HTML to the Web Sources Framework. Does anyone have a suggestion for working around this so I can receive the richer HTML the browser receives?

Thank you

Hello Chuck,

Are you familiar with the curl command? You could use it to test whether sending a different user agent string solves this.

Could you post the URL of the request that gives the different output?

Kind regards
— Florian

Hello Florian,

Thanks for responding. Here is an example URL (Blade Runner 2049):

I will experiment with cURL in parallel. How would I override the user agent string in the Mp3tag Web Sources Framework as well?

Thank you,

Chuck

This is not possible (yet). We'd first need to check whether this would address the issue at all.

There is a setting for sending Mp3tag's own user agent (e.g., Mp3tag/2.86). You can check whether this already solves the issue by adding

[UserAgent]=1

to the upper header part of the tag source. By default, it currently sends Mozilla/5.0 (compatible).
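
If it helps to try this outside Mp3tag, here is a minimal Python sketch for comparing responses under different user agent strings. It is not part of the Web Sources Framework. The function names, the Chrome-like user agent value, and the idea of checking for a marker string are assumptions chosen for illustration; only the standard library is used.

```python
import urllib.request

# The first two values are the ones mentioned in this thread.
DEFAULT_UA = "Mozilla/5.0 (compatible)"
MP3TAG_UA = "Mp3tag/2.86"
# Assumption: an illustrative Chrome-like string, not a captured header.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent):
    """Create a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent, timeout=30):
    """Download the page and decode it using the charset the server declares."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def contains_marker(html, marker):
    """Return True if the marker text occurs anywhere in the HTML."""
    return marker in html
```

Calling `fetch_html` once per user agent string and then checking each result with `contains_marker` for a piece of markup you only see in the browser shows quickly whether the user agent makes any difference.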

P.S. Good choice of URL by the way :smiley:

Thanks Florian. Unfortunately, the HTML I receive does not appear to be any different :frowning: Any other suggestions?

Can you tell me which significant parts are missing when you access the page through the Web Sources Framework?

You can wrap code in a code block using the </> icon from the toolbar here.

Here is an example: the plot keywords, which are completely different.

First, this is what I see in the IMDB output from the Web Sources Framework:

<tr class="ipl-zebra-list__item">
    <td class="ipl-zebra-list__label">Plot Keywords</td>
    <td>
        <ul class="ipl-inline-list">
                <li class="ipl-inline-list__item">
                    <a href="/keyword/shepard-tone">shepard-tone</a>
                </li>
                <li class="ipl-inline-list__item">
                    <a href="/keyword/snow">snow</a>
                </li>
                <li class="ipl-inline-list__item">
                    <a href="/keyword/los-angeles-california">los-angeles-california</a>
                </li>
                <li class="ipl-inline-list__item">
                    <a href="/keyword/las-vegas-nevada">las-vegas-nevada</a>
                </li>
                <li class="ipl-inline-list__item">
                    <a href="/keyword/blade-runner">blade-runner</a>
                </li>
            <li class="ipl-inline-list__item">
                <a href="/title/tt1856101/keywords">See All (575) &raquo;</a>
            </li>
        </ul>
    </td>
</tr>

Next is the HTML I see in Chrome via view source:

<div class="see-more inline canwrap" itemprop="keywords">
    <h4 class="inline">Plot Keywords:</h4>
<a href="/keyword/short-skirt?ref_=tt_stry_kw"
> <span class="itemprop" itemprop="keywords">short skirt</span></a>
                        <span>|</span>
<a href="/keyword/bare-breasts?ref_=tt_stry_kw"
> <span class="itemprop" itemprop="keywords">bare breasts</span></a>
                        <span>|</span>
<a href="/keyword/female-nudity?ref_=tt_stry_kw"
> <span class="itemprop" itemprop="keywords">female nudity</span></a>
                        <span>|</span>
<a href="/keyword/micro-mini-skirt?ref_=tt_stry_kw"
> <span class="itemprop" itemprop="keywords">micro mini skirt</span></a>
                        <span>|</span>
<a href="/keyword/miniskirt?ref_=tt_stry_kw"
> <span class="itemprop" itemprop="keywords">miniskirt</span></a>
            <span>|</span>&nbsp;<nobr><a href="/title/tt1856101/keywords?ref_=tt_stry_kw"
>See All (575)</a>&nbsp;&raquo;</nobr>
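
To compare the two responses without reading the markup by eye, the keyword links can be pulled out of either snippet with a small Python sketch like the one below. It assumes that keyword links are the `<a>` elements whose `href` path starts with `/keyword/`, as in both snippets above; the class and function names are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

KEYWORD_PREFIX = "/keyword/"


class KeywordLinkParser(HTMLParser):
    """Collect the slug of every link whose path starts with /keyword/."""

    def __init__(self):
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Drop query strings such as ?ref_=tt_stry_kw before comparing.
        path = urlsplit(href).path
        if path.startswith(KEYWORD_PREFIX):
            self.slugs.append(path[len(KEYWORD_PREFIX):].strip("/"))


def extract_keyword_slugs(html):
    """Return keyword slugs in document order."""
    parser = KeywordLinkParser()
    parser.feed(html)
    parser.close()
    return parser.slugs
```

Running `extract_keyword_slugs` on both outputs gives two plain lists that can be compared directly. The "See All" link is skipped because its path starts with `/title/`.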

I tried a simple test script and got the very same keywords via the Web Sources Framework that you get in Chrome.

# ###################################################################
# Mp3tag Tag Source Test
# ###################################################################

[Name]=imdb.com
[BasedOn]=www.imdb.com
[PreviewUrl]=https://www.imdb.com/title/tt1856101/?ref_=nv_sr_1
[AlbumUrl]=https://www.imdb.com/title/tt1856101/?ref_=nv_sr_1
[WordSeparator]=%20
[SearchBy]=%title%
[Encoding]=url-utf-8

[ParserScriptIndex]=...
#
[ParserScriptAlbum]=...
# ###################################################################
#					A  L  B  U  M
# ###################################################################
debugwriteinput "C:\Users\florian\Desktop\debug.out"
debug "on" "C:\Users\florian\Desktop\debug.txt"

Thanks Florian. That was the key I needed. The script I started with used the IMDB "references" view for the AlbumUrl, and that view changed dramatically at the end of last year. This explains why the HTML differs from what I saw in Chrome: the standard view and the references view contain different data (e.g. the plot keywords).
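
For anyone hitting the same thing, here is a small Python sketch of the two page addresses involved. The standard form matches the URL in Florian's test script without its `ref_` parameter. The `/reference` suffix is my assumption about how the references view is addressed, and the function names are hypothetical.

```python
import re

BASE_URL = "https://www.imdb.com/title/"
# Title IDs look like tt1856101: "tt" followed by digits.
TITLE_ID_PATTERN = re.compile(r"tt\d+")


def _check_title_id(title_id):
    """Reject anything that is not a plain IMDB title ID."""
    if not TITLE_ID_PATTERN.fullmatch(title_id):
        raise ValueError(f"not an IMDB title ID: {title_id!r}")
    return title_id


def standard_view_url(title_id):
    """URL of the standard title page, the one a browser shows by default."""
    return f"{BASE_URL}{_check_title_id(title_id)}/"


def reference_view_url(title_id):
    """URL of the references view (assumed to be the /reference suffix)."""
    return f"{BASE_URL}{_check_title_id(title_id)}/reference"
```

Fetching both for the same title and comparing the results makes it obvious which view a script's AlbumUrl is really pointing at.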
