Discogs: Parse Album (Release)


#1

The Web Sources logic is to first locate the release ID (album) in ParserScriptIndex and then to parse the album info in the ParserScriptAlbum step.
Source (.src) files posted in the Mp3tag forums use different approaches to parsing the album info: HTML and XML.
I wonder why we don't always use XML, since it is much better structured than HTML. On top of that, HTML is not stable: Discogs might modify the HTML tag structure of release pages at any time, making ParserScriptAlbum unstable or unusable.

Ideally, using XML in both ParserScriptIndex and ParserScriptAlbum, combined with appropriate customizing, would make it really powerful.
E.g. search by artist and album: first, search only by artist (API artist search), and then filter the (XML) results by the given album title. Filtering could be extended to all possible fields (Year, Format, Genre, etc.). In that case a kind of filtering popup would be needed, similar to the current .src popup but with a dedicated label and field for each possible tag.
I know the interface between a filtering popup and the parser script is difficult, but I believe it might be feasible.
Customizing would be needed to let the user choose which fields should be extracted from the album release. The scripting code would take the flag from customizing into account and then include or exclude the tag from the process.

Fewer script files would be needed with this approach.


Changes at discogs.com
#2

Scripts are different because they are user generated, and every user can write them the way they want.
In my case, I simply didn't know about the XML documents of Discogs' API when I first created the scripts, and I never changed them.
XML is more stable, yes, but not more powerful.

The filtering you propose can't be done with Mp3tag web scripts, as far as I know. I have tried to run ParserScriptIndex a second time on the results of the first run, but without success. I tried this by searching for master releases in a first step and indexing all releases belonging to the master in a second step. But as I said, no success.
The filtering you propose doesn't make much sense anyway. As the API artist search is now constructed, you can sort the releases alphabetically and scroll to the releases of the album you want. That's as easy as using an extra filter to see only these releases.

Have you seen my new scripts? The customization you propose is possible there. For the first time in such an easy way for several scripts at once, if I may add this with a certain pride :wink:


#3

Of course I've seen them (I also posted a comment there: INCLUDE command :wink: ).

I still believe an album parser based on XML is not inferior to one based on HTML. The XML response has some great advantages (for me at least), e.g. artist full name vs. abbreviations. I prefer the full artist name (e.g. Pat Benatar instead of P. Benatar or even Benatar). In HTML, Discogs puts the artist abbreviation between the tags (the artist name can be extracted from the tag itself, but that increases the complexity of your scripts), while in XML you have them both.

HTML:
<a href="/artist/Pat+Benatar?anv=P.+Benatar" class="rollover_link">P. Benatar</a>

XML:
<artist>
<name>Pat Benatar</name>
<anv>P. Benatar</anv>

Another example is the track info for the persons involved. In HTML you need some fancy (complex) code to assign the personnel to the correct track. In XML it is simple.
I encourage you to switch the album parser to XML. It will save you time.
Nevertheless, I respect the good work you have done in your scripts. CONGRATULATIONS.


#4

Yes, you are right. That would be the big advantage of XML/API, besides the more stable character of the scripts.
It's not full artist name vs. artist abbreviation but main artist name vs. artist name variation (anv). It appears in every place artist names are listed: Albumartist, Artist, Track Extra Artists, Credits, Notes.

I could extract that from the HTML documents too in most cases. But I have problems there with special characters, as the main artist name is hidden in a link, and there special characters are URL encoded (%21 for !, %22 for ", %23 for #, ...).
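To illustrate what decoding such a link involves, here is a minimal sketch in Python (standard library only, not Mp3tag script). The function name `names_from_href` is made up for this example, and the link layout is assumed to match the HTML snippet quoted in post #3.

```python
from urllib.parse import urlsplit, parse_qs, unquote_plus


def names_from_href(href):
    """Return (main_name, anv) from a link like /artist/Pat+Benatar?anv=P.+Benatar.

    Assumption: the main artist name is the last path segment and the
    name variation, if any, is carried in the 'anv' query parameter.
    """
    parts = urlsplit(href)
    # Last path segment holds the URL-encoded main name: '+' is a space, %21 is '!'
    main_name = unquote_plus(parts.path.rsplit("/", 1)[-1])
    # parse_qs already decodes the value; anv is None when the parameter is absent
    anv = parse_qs(parts.query).get("anv", [None])[0]
    return main_name, anv


print(names_from_href("/artist/Pat+Benatar?anv=P.+Benatar"))
```

The same decoding step is what makes the HTML route more work than reading `<name>` and `<anv>` from the XML.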

So yes, switching to XML/API is on my to-do list. I just haven't done it yet because personally I prefer the artist names as given on the release pages.


#5

Actually you find both artist variations in XML/API.

<artist>
<name>Pat Benatar</name>
<anv>P. Benatar</anv>
<join/>
<role>Written-By</role>
<tracks>4, 5, 7, 8, 15, 19</tracks>
</artist>
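
As a rough sketch of how both names can be read from such an element, here is a Python example using only the standard library (it is not Mp3tag script). The function name `display_name` and the `prefer_anv` switch are illustrative, not part of any existing script.

```python
import xml.etree.ElementTree as ET

# The <artist> element quoted above, copied as sample input
ARTIST_XML = """
<artist>
  <name>Pat Benatar</name>
  <anv>P. Benatar</anv>
  <join/>
  <role>Written-By</role>
  <tracks>4, 5, 7, 8, 15, 19</tracks>
</artist>
"""


def display_name(artist_elem, prefer_anv=False):
    """Pick the main name, or the name variation (anv) when requested and present."""
    name = (artist_elem.findtext("name") or "").strip()
    anv = (artist_elem.findtext("anv") or "").strip()
    return anv if (prefer_anv and anv) else name


artist = ET.fromstring(ARTIST_XML)
print(display_name(artist))                   # main artist name
print(display_name(artist, prefer_anv=True))  # name as credited on the release
```

A single flag is enough here to serve both preferences discussed in this thread: main artist name or the name as printed on the release.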

There are some things that should be amended in Web Sources development.

  • It is impossible (AFAIK) to process or access a tag value filled by the parse process. E.g. you want to merge Genre and Styles into Genre. A value separator should be added to Genre if it is already filled before adding Styles, but there is no way to know whether Genre is blank.
  • I am struggling to assign involved artists to each track, but it is a pain, e.g. "xxx yyyy Written-By 2 to 5, 7 to 8, 15, 19". Single track numbers can be parsed with different searches and regexps, but it is impossible to process ranges "N1 to N2".
  • No return code is provided for the find commands (FindInLine/FindLine). If a specific tag is not present in the page being parsed, the whole script will fail.
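
For the range problem in the second point, a preprocessing step outside Mp3tag could expand "N1 to N2" into single track numbers. The following is only a sketch in Python; the function name `expand_tracks` is hypothetical, and it assumes purely numeric track positions written as in the example above.

```python
import re


def expand_tracks(spec):
    """Expand a string like '2 to 5, 7 to 8, 15, 19' into a list of track numbers."""
    tracks = []
    for part in spec.split(","):
        part = part.strip()
        match = re.fullmatch(r"(\d+)\s+to\s+(\d+)", part)
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            tracks.extend(range(start, end + 1))  # inclusive range
        elif part.isdigit():
            tracks.append(int(part))
        # Anything else (empty parts, non-numeric positions) is skipped in this sketch
    return tracks


print(expand_tracks("2 to 5, 7 to 8, 15, 19"))
```

Non-numeric positions (for example vinyl sides) would need the release's track order to resolve, which this sketch does not attempt.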

Discogs provides a compressed XML dump of all Artists/Labels/Releases on a monthly basis.
I am seriously thinking of converting those 3 XML files into a format that is friendlier to Mp3tag.
The Mp3tag web source would be used to find the correct release ID, and then the local converted XML files would be used to fill the desired tags.


#6

I don't know what you mean. I think you are talking about my Discogs scripts. I write Style before Genre, and I found a way to check whether Style is blank, which leaves the value separator out.
I think there is no Discogs release without Genre. At least I have not found one, and I looked at many for script development and also while tagging my own music collection.

This will be hard; I think there is no standard at Discogs for the way this is written. And it's not enough to be able to parse a number. You would have to sort these things in the order of the track listing to be able to parse them. Texts which are different for every track are difficult, because you need a do ... while loop which must also take care of the tracks for which these credits should not be written.

Use
findline "..." 1 1
and
findinline "..." 1 1
This jumps to the end of the line or script if the text is not found and prevents the script from failing.

What is the compressed XML dump? Can you download the whole discogs database???

It is a bit hard to answer your questions. I never know exactly whether you are posting

  • questions about how things can be done,
  • proposals for how things should be done,
  • or proposals for how Mp3tag's web sources script system should be changed by the programmer of Mp3tag.

Are you already working on a web sources script for Discogs' XML/API? I'm always ready to help if you have any problems. Just post your questions here!


#7

My point is that it would be much better if we could use some commands on an already filled tag. To achieve this we use workarounds (as you have already done too), which increase script complexity a lot.

THANKS. I had seen the extension "1 1" but didn't find any info in the forums.

Yes, you can: Discogs RAW data
It is not difficult to read those XML files and amend them.

  • Put credits on the correct tracks
  • Add discnumber
  • Add artist info into the release XML file
  • etc.

Then with a simple script we save the Discogs release ID to the files, and a 2nd script would parse the local (huge) XML file.
It might even be possible to split the XML release file into several files by range of release ID.

Thank you.
I have attached a simple .src file based on Florian's discogs.src, but your scripts helped a lot too.
Please note that you have to add your Discogs API_KEY to the script file. Anyone can get one for free on the Discogs site.

_discogs___VKOSTAS__Search_by__Artist___Album.src (6.25 KB)


#8

See section "Web Sources Framework/List of Parser Commands" in the Mp3tag help manual.
FindLine S n Find line with first or Nth occurrence of S (starting from the current position)
FindInLine S n Find the next/Nth occurrence of S within the current line

DD.20110411.1120.CEST


#9

I've seen and clearly understood that.

What I couldn't find was what happens if a "Find" command fails.
Pone made that clear to me.

Thank you for your comment


#10

Regarding the discussion about using discogs monthly updated xml database as a data source for Mp3tag I want to mention ...
... I downloaded the three gz files and unpacked them ...
discogs_20110401_artists.xml 216 MB (227.083.986 Bytes)
discogs_20110401_labels.xml 32,9 MB (34.554.966 Bytes)
discogs_20110401_releases.xml 4,62 GB (4.961.080.036 Bytes)

The huge "releases" XML file with about a million lines cannot be opened in the text editor KEDIT, in Notepad++, or in XML Notepad, perhaps because the file is larger than 4 GB. Astonishingly, TextPad can read the file; it takes its time, good for a break and a cup of tea.

I assume that there is no practical chance of using such a huge file as a source for the Mp3tag source scripting feature.
Even with XML Path Language and a high-end professional XML reader, hiccups occur; I know this from other applications.
XML is a lame duck when it comes to large amounts of data.
For practical use, the "releases" XML file has to be split into smaller parts.
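
For what it is worth, a streaming parser avoids loading the whole file at once. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`; it assumes (this is not confirmed in the thread) that the dump is one root element containing `<release id="...">` children, and the function name `find_release` is made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def find_release(source, release_id):
    """Stream through a releases dump and return one <release> as an XML string.

    source may be a file name or a binary file object. Returns None if no
    release with the given id is found.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "release":
            if elem.get("id") == str(release_id):
                return ET.tostring(elem, encoding="unicode")
            root.clear()  # discard releases already seen so memory use stays flat
    return None


# Tiny stand-in for the multi-gigabyte dump
sample = io.BytesIO(
    b"<releases>"
    b"<release id='1'><title>A</title></release>"
    b"<release id='2'><title>B</title></release>"
    b"</releases>"
)
print(find_release(sample, 2))
```

The same loop could also write each release to its own small file, which is one way to do the splitting mentioned above.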

DD.20110411.1640.CEST


#11

Yep, you are CORRECT. First I thought of splitting the huge releases XML into one file per release ID, but then a better idea came to my mind.
Since we locate the release ID anyway, we can download and convert the XML file of just that release :wink: .
Actually, with the "debugging ON" option we already download the source XML file (ParserScriptAlbum section).
I can write a small program to convert the Discogs XML file into a simplified XML file for the Mp3tag web source.
Now, can a Web Source script read the converted XML file directly from the local disk? Possibly yes, but I need to check. Can someone confirm this? Thank you.

The question is what should be amended in the discogs XML file?

  • Move credits, artists, etc. to track level.
  • Merge Genre/Styles under the same XML tag (Genres).
  • Add a new tag holding the number of values to XML tags that contain several values (e.g. number of tracks, number of artists, etc.).
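
A minimal sketch of the second and third points, in Python with the standard library. The element names (`genres`/`genre`, `styles`/`style`) are an assumption about the release XML layout, the function name is hypothetical, and the count is stored as a `count` attribute rather than a separate tag purely to keep the example short.

```python
import xml.etree.ElementTree as ET


def merge_styles_into_genres(release):
    """Move every <style> into <genres> as a <genre>, skipping duplicates."""
    genres = release.find("genres")
    if genres is None:
        genres = ET.SubElement(release, "genres")
    seen = {g.text for g in genres.findall("genre")}

    styles = release.find("styles")
    if styles is not None:
        for style in styles.findall("style"):
            if style.text not in seen:
                ET.SubElement(genres, "genre").text = style.text
                seen.add(style.text)
        release.remove(styles)

    # Record how many values the merged tag holds
    genres.set("count", str(len(genres.findall("genre"))))
    return release


release = ET.fromstring(
    "<release>"
    "<genres><genre>Rock</genre></genres>"
    "<styles><style>Pop Rock</style><style>Rock</style></styles>"
    "</release>"
)
merge_styles_into_genres(release)
print(ET.tostring(release, encoding="unicode"))
```

With the values merged and counted beforehand, the web source script no longer has to know whether Genre was already filled before appending Styles.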

Please send your comments/suggestions.


#12

Hmm, as I understand the Web Source feature, it needs at least a local web server to call a web page. See the configuration in the .src header.
See also: Possible to process locally saved file (file://) instead of http:// ?
But that might be easy to modify.

DD.20110411.1811.CEST


#13

Local Web server works ok. I am working on converting the XML file now...
In the next few days I will come back.

Thank you :sunglasses:


#14

Just a question:
What do you want to do with album credits which are not assigned to specific tracks? These credits are sometimes for all tracks, and sometimes they have nothing to do with the tracks at all:

See here:
http://www.discogs.com/release/378017
You have credits for A&R, Artwork, Booking, Tour Management, ....
In what tag field do you want to write this?


#15

Discogs_Credits_Multi is OK, I guess.
You are right. There are credits which are not directly related to the track itself. Someone might not care who the photographer is or who designed the album cover, etc.
The first step is to apply a simple rule: assign credits to the correct track; no track info means all tracks.
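
That rule can be sketched in a few lines of Python. The function name, the tuple layout and the sample credit "Jane Doe" are all made up for illustration; the output format "Role: Name" is just one possible choice.

```python
def credits_per_track(track_numbers, credits):
    """Distribute credits over tracks.

    credits is a list of (name, role, tracks) tuples, where tracks is a list
    of track numbers. An empty list means the credit has no track info and
    therefore applies to all tracks.
    """
    per_track = {number: [] for number in track_numbers}
    for name, role, tracks in credits:
        targets = tracks if tracks else track_numbers  # no track info -> all tracks
        for number in targets:
            if number in per_track:  # ignore track numbers not on this release
                per_track[number].append(f"{role}: {name}")
    return per_track


result = credits_per_track(
    [1, 2, 3],
    [
        ("Pat Benatar", "Written-By", [2]),
        ("Jane Doe", "Artwork", []),  # hypothetical album-level credit
    ],
)
for number, lines in result.items():
    print(number, lines)
```

A filter on the role (for example, skipping Artwork) could later be added at the point where `targets` is chosen.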
Later we can add options/filters so users can define which credits are included in the tracks. We could even split credits into different track tags.
I don't see a big issue here, though. If too much information is assigned to a track, it can be removed somehow (automatically or manually).
The first priority is to be able to convert the Discogs XML file with less hassle for the user. The rest will come.
:sunglasses:


#16

Could you please elaborate on this idea in more detail?

DD.20110412.1613.CEST


#17

Currently, the implemented Web Source scripts are split into 2 steps:

  1. Parse candidate releases (ParseIndex). Discogs provides HTML results but no XML ones.
  2. Parse the selected release (ParseAlbum). Discogs provides both HTML and XML results.

Thanks to your suggestion (use a local web server) I came up with an idea: AlbumURL (parse step 2) would not request the release directly from Discogs but from the local web server instead. The local server requests the release info (XML format) from Discogs. The XML is amended and then returned to the Mp3tag Web Source. A PHP script must be implemented on the local web server.
Nothing changes in the way the Web Source works.
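
The shape of such an intermediate server can be sketched as follows. Note that this uses Python's standard `http.server` instead of the PHP script described above, the `/release/<id>` path is an assumption, and `fetch_xml` and `amend_xml` are placeholders the caller must supply (how to request the XML from Discogs is deliberately left out).

```python
import re
from http.server import BaseHTTPRequestHandler, HTTPServer


def release_id_from_path(path):
    """Extract the numeric release ID from a request path such as /release/378017."""
    match = re.fullmatch(r"/release/(\d+)", path.split("?", 1)[0])
    return match.group(1) if match else None


def make_handler(fetch_xml, amend_xml):
    """Build a handler class around two caller-supplied functions.

    fetch_xml(release_id) -> str   gets the original release XML (placeholder)
    amend_xml(xml_text)   -> str   returns the simplified/normalized XML
    """
    class ReleaseProxy(BaseHTTPRequestHandler):
        def do_GET(self):
            release_id = release_id_from_path(self.path)
            if release_id is None:
                self.send_error(404, "Expected /release/<id>")
                return
            body = amend_xml(fetch_xml(release_id)).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/xml; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ReleaseProxy


def serve(fetch_xml, amend_xml, port=8080):
    """Run the proxy on localhost only; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(fetch_xml, amend_xml)).serve_forever()
```

Binding to 127.0.0.1 keeps the server reachable only from the same machine, which matters if the script runs on a home computer.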

There are plenty of local web servers on the net. I prefer portable versions (no installation required).
I used the XAMPP package, which contains a web server, PHP, FileZilla and others.

What do you think?


#18

I think ... well ... today many people are permanently linked together via several networks and services ... and they all have open doors to their hard disks ... but you cannot expect a normal user to agree to the installation of a web server on their home computer.
There might be a simpler way.
No extra web server, just a simple HTTP requester.

Maybe of interest ... there is a source where different access methods and tools are described ...
http://techsupt.winbatch.com/webcgi/webbat...h~Web~Pages.txt
Could XMLStarlet be of some use?

DD.20110412.1748.CEST


#19

I will have a look. Thanks.
In general, my goal is to implement an intermediate script between the (Discogs) XML and the album parser of the Mp3tag Web Source script. Thus, the album XML parser script will be simplified too.
Of course, installing a local web server is a disadvantage.
Thank you again for your findings.


#20

I have finished (a beta version of) 2 PHP scripts which will help a lot with the Discogs site XML data.
The 1st script is used to filter the releases of one given artist. Currently Discogs returns all artist releases in one big XML without offering any kind of filter... Well, until now.

The 2nd script is a "normalization" of the Discogs release XML:
Distribute Extraartists to track level.
Calculate artistmulti for each artists tag.
Calculate composer, lyricist, involvedartists at track level.

With those 2 scripts we deal only with (amended) XML files. The scripts can easily be amended to keep the Mp3tag web scripts simple.

With a local PHP server it is pretty easy.
I will come back with more details.
Any thoughts?