Local freedb lookup based on selected audio files

Per Local freedb – Mp3tag Documentation, "accessing the database based on the audio files is not available at local databases".

... Why is that? You know the selected files’ durations, right? For lossless files you know if they are 16/44.1 and multiples of 588 samples. So if all criteria are met, you could calculate a disc ID which should work as long as there was no data track. Even if there was a data track, you could offer a special calculation/lookup option for that.

What am I missing?
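
For illustration, the eligibility check described above (16-bit/44.1 kHz, length a multiple of 588 samples) could look roughly like this. It is only a sketch: the function name and parameters are made up for the example, and `num_samples` is assumed to be the per-channel sample count reported by whatever reads the file.

```python
SAMPLES_PER_FRAME = 588  # 44100 samples per second / 75 CD frames per second


def cd_frame_count(num_samples, sample_rate=44100, bits_per_sample=16):
    """Return the length in CD frames, or None if the file cannot be
    an exact, frame-aligned CD track (hypothetical helper)."""
    if sample_rate != 44100 or bits_per_sample != 16:
        return None
    if num_samples <= 0 or num_samples % SAMPLES_PER_FRAME != 0:
        return None
    return num_samples // SAMPLES_PER_FRAME


# A track of 200 seconds plus 50 frames, cut on a frame boundary:
print(cd_frame_count(588 * 15050))
```

If any selected file returns `None`, a disc ID derived from the selection could not be trusted.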

It's possible in theory, but practical tests at the time I implemented this showed that trying to reconstruct the actual TOC of a CD is not always reliable. Even if the selected files have the correct duration down to the CD frame level, there are often things like lead-in/lead-out, gaps between tracks, and possibly other things like hidden tracks that influence the calculation of the discid.

IIRC, the freedb server software works in two stages: if it finds an exact match for a discid, it returns the matching result right away. In case it doesn't, it tries to approximate a matching result by inspecting the provided track count and track lengths, and returns a list of matching results. This way, even non-matching discids often returned valid (and a couple of seemingly unrelated) results.

While it's possible to replicate something like this locally, it wasn't something I tried to build at that time, especially given that I already offered the text search locally.
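
To make the two-stage idea concrete, here is a rough sketch of how it might be replicated locally. Everything in it is assumed for illustration: the in-memory `db` dict, the entry layout with a `lengths` list in seconds, and the 2-second tolerance. It is not how the freedb server or Mp3tag actually does it.

```python
def lookup(db, disc_id, track_lengths, tolerance=2):
    """Stage 1: exact disc ID match. Stage 2: fall back to entries with the
    same track count whose track lengths are all within `tolerance` seconds.

    `db` maps a disc ID string to a list of entries (hypothetical layout).
    """
    exact = db.get(disc_id)
    if exact:
        return exact
    fuzzy = []
    for entries in db.values():
        for entry in entries:
            lengths = entry["lengths"]
            if len(lengths) == len(track_lengths) and all(
                abs(a - b) <= tolerance for a, b in zip(lengths, track_lengths)
            ):
                fuzzy.append(entry)
    return fuzzy
```

A real local database would need an index on track count to make the second stage fast; the linear scan here is only meant to show the logic.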

As I see it, the disc ID has three parts:

  • A one-byte checksum derived from audio track start points. In practice, the number of potential checksums is no more than the number of tracks +1.
  • A two-byte “disc length” in seconds. In effect, it is the actual sum of the audio track durations, plus (if present) the data track and the gap preceding it, and rounded either up or down somewhat unpredictably to a whole second.
  • A one-byte count of tracks on the disc, including data track.

Assuming we have selected audio files which are cut on frame boundaries, the only uncertain factors which matter are 1. whether there’s a data track, and 2. what is the start point of the first track—which might be phrased as what is the length of track 01’s “pregap” or “Hidden Track One Audio (HTOA)”. I mean, these are the only other pieces of info which the calculation needs but doesn’t have access to without the CD in the drive. But even without that info, you can make some guesses and come up with some potential matches, at least for discs with no data track.

You mentioned gaps “between” tracks, but that’s not how they work, and they don’t affect the calculation except, as mentioned, to the extent we need to know about the gap prior to the first track—which could be a “hidden track” but almost always is just a split-second of silence or quiet hiss. You also mentioned lead-in, but that’s a fixed length and isn’t part of the calculation. Lead-in is not the same as the HTOA/track 01 pregap.

So for example, given just a selection of audio files, you can calculate a disc ID based only on track durations and the assumption that there’s no data track and that track 1 starts at sector 0. More candidates could be derived by assuming track 1 uses e.g. the very common 32, 33, or 37 sector start points. Candidate IDs for discs with data tracks could be obtained by adding 1 to the track count and 152 seconds to the disc length.
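
A minimal sketch of that candidate generation, under some assumptions of mine: track lengths are given in CD frames, there is no data track, and TOC offsets follow the freedb convention of counting sector 0 as frame 150 (the 2-second lead-in offset). The function names are invented for the example.

```python
FRAMES_PER_SECOND = 75
LEAD_IN_FRAMES = 150  # assumption: sector 0 corresponds to frame offset 150


def digit_sum(n):
    return sum(int(d) for d in str(n))


def disc_id(offsets, leadout):
    """freedb-style ID from track start offsets and lead-out, all in frames."""
    checksum = sum(digit_sum(o // FRAMES_PER_SECOND) for o in offsets) % 255
    length = leadout // FRAMES_PER_SECOND - offsets[0] // FRAMES_PER_SECOND
    value = (checksum << 24) | (length << 8) | len(offsets)
    return f"{value:08x}"


def candidate_ids(track_frames, start_sectors=(0, 32, 33, 37)):
    """Map each candidate disc ID to the track 1 start sectors that produce it."""
    candidates = {}
    for start in start_sectors:
        offsets = []
        pos = start + LEAD_IN_FRAMES
        for length in track_frames:
            offsets.append(pos)
            pos += length
        # After the loop, pos is the lead-out position.
        candidates.setdefault(disc_id(offsets, pos), []).append(start)
    return candidates


print(candidate_ids([15050, 22500, 18000]))
# {'0d02e403': [0], '0f02e503': [32, 33, 37]}
```

In this made-up example the four start points collapse to two distinct IDs, because the ID only changes when an offset crosses a whole-second boundary.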

Alternatively, the user could supply some of these details for the lookup. I have several ideas of how it could work.

Regardless, I know this is esoteric, all-but-deprecated stuff. I was just thinking about how I frequently refer to gnudb to try to authenticate rips of obscure CDs not in MusicBrainz, and how it would be nice to be able to do this with a local database.

I was replying off the top of my head on a subject I dealt with 20+ years ago. It looks like I have less active knowledge on the subject than you.

You're correct, the gaps I mentioned are only used for the frame offsets used when querying a freedb/gnudb server. Thanks for pointing this out.

And you can add the obscure CDs to MusicBrainz, cf. the MusicBrainz Beginners Guide.

Just a small addition to the disc ID calculation (I had a look at my code in the meantime :) ).

The disc ID consists of three parts:

  • A one-byte checksum: the sum of decimal digits of each track's start time (in seconds), modulo 255. So checksums can range from 0 to 254.

  • A three-byte total disc length in seconds, computed from the CD's lead-out minus the first track start (with frame values truncated to whole seconds).

  • A one-byte count of audio tracks on the disc.

As you can see, any slight difference in one of these values already gives a different disc ID.
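
The calculation can be sketched in a few lines. This is an illustrative version, not the Mp3tag code: it takes track start offsets and the lead-out in CD frames (75 per second) and packs the result into the usual 8-hex-digit form, which I am assuming here. The function name is made up.

```python
def freedb_disc_id(offsets, leadout):
    """Compute a freedb-style disc ID from frame offsets (illustrative sketch)."""
    # Checksum: digit sum of every track's start time in whole seconds, mod 255.
    n = 0
    for offset in offsets:
        n += sum(int(d) for d in str(offset // 75))
    checksum = n % 255
    # Length: lead-out minus first track start, each truncated to whole seconds.
    length = leadout // 75 - offsets[0] // 75
    return f"{(checksum << 24) | (length << 8) | len(offsets):08x}"


print(freedb_disc_id([150, 15150, 37650], 55650))  # 0d02e403
# Moving track 2 one frame earlier crosses a second boundary and changes the ID:
print(freedb_disc_id([150, 15149, 37650], 55650))  # 0c02e403
```

A shift that stays within the same whole second (for example 15151 instead of 15150) leaves the ID unchanged, so the sensitivity is to second boundaries rather than to individual frames.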

Well, it would not be good to add info about obscure CD rips (found in the wild) to MusicBrainz. My goal is to gather evidence of whether such a rip could be legit. One way to do that is to see if the exact track durations match up with disc TOC data in cddb/freedb/gnudb and MusicBrainz. This is difficult to look up but the info is there.

Disc length is 2 bytes. To need 3 bytes, the CD would have to be over 18 hours long. Maybe you’re using 3 (for minutes, seconds, and frames), but ultimately it is expressed in the disc ID as 2 bytes (4 hex characters).
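
The arithmetic behind that, as a quick check. The helper name is only for illustration, and it assumes the 8-hex-character layout of checksum (2 characters), length (4), and track count (2).

```python
MAX_TWO_BYTE_SECONDS = 0xFFFF  # largest value that fits in two bytes
hours = MAX_TWO_BYTE_SECONDS / 3600
print(f"{MAX_TWO_BYTE_SECONDS} s = {hours:.1f} h")  # 65535 s = 18.2 h


def length_field(disc_id):
    """Extract the disc length in seconds from an 8-hex-character disc ID."""
    return int(disc_id[2:6], 16)


print(length_field("0d02e403"))  # 740
```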

As for disc length being officially defined by certain things, yes, but my point is that if we only have a set of track durations, then we have enough info to generate 2 possible disc IDs, only one of which is correct, and this will be fine for narrowing down the candidates from the entire database to a manageable list for the user to choose from. Track durations for all but the last track can also be checked against those entries to narrow the list quite a bit further. (Similarly, the fact that the track 1 start is more likely to be on certain common sectors gives us another opportunity to prioritize possible disc IDs.)
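
The duration check could be sketched like this, assuming each database entry exposes its track frame offsets (as the "# Track frame offsets:" comment block in freedb entry files does) and that the selected files' lengths are known in CD frames. The last track is skipped because the entry only stores the total disc length in whole seconds. Function names are hypothetical.

```python
def lengths_from_offsets(offsets):
    """Frame lengths of every track except the last, from consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


def entry_matches(file_frames, entry_offsets):
    """True if the entry's TOC agrees with the files on all but the last track."""
    if len(file_frames) != len(entry_offsets):
        return False
    return lengths_from_offsets(entry_offsets) == file_frames[:-1]


files = [15050, 22500, 18000]
print(entry_matches(files, [182, 15232, 37732]))  # True
print(entry_matches(files, [150, 15150, 37650]))  # False
```

Because the comparison uses differences between offsets, it does not depend on where track 1 starts, so it works across all the candidate IDs for a given selection.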

Of course you’re right about the checksum not having a track-based limit. I was doing some cursory tests to see how many disc IDs there can be for a given set of durations with an unknown start point, and got my results mixed up.

[update:] Attached is a tested proposal I worked on with Claude AI for how the lookups based on selected audio files could work. What's interesting is that even with the track 1 start points being uncertain, there are usually only a few possible IDs, and once you dig into the results, most of them can be ruled out because the durations don't match the TOCs in the database.

potential_local_freedb_lookup_strategy.txt (3.6 KB)

The only disappointment was that I couldn't think of a way to generate disc IDs for CDs with a data track, at least not without building a separate index of TOCs to match durations against, and if you're gonna do that, you might as well not even bother with disc IDs and just make everything based on TOCs instead.