Analyzing audio for instrumental or voice

I am looking for a program that can analyze audio tracks in batch mode and determine whether each one is purely instrumental or contains human voices. An internet search turned up a number of research reports and a cloud-based AI web application. It stands to reason that something like this exists, now that AI research is booming everywhere. There are already solutions that distinguish between instruments, genres, and moods and, in the process, also promise to classify a track as music with singing, spoken text, or instrumental music.

However, a cloud-based AI web application where you upload your music and get back an analysis result is not well suited to my project. I would rather be able to analyze tens of thousands of tracks locally in batch operation. Uploading songs one by one to an AI service for analysis is not practical, especially since it would certainly cause problems with music for which you do not hold the rights yourself.

So is there something (perhaps not 100% perfect) that could accomplish this task locally in batch operation?

Maybe you can find something suitable if you search for "VAD" (voice activity detection).
(In German it goes by the somewhat misleading name "Sprechpausenerkennung".)

Two examples:


Google's YAMNet (a TensorFlow model) can be run locally using Python, and it seems to do the classification you are asking for. I have no idea about its quality or requirements.
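For what it's worth, here is a rough sketch of how a local batch run with YAMNet might look. Treat it as a starting point, not a tested solution. It assumes tensorflow, tensorflow_hub and librosa are installed and that the model can be fetched from TF Hub on first use. The names has_voice, classify_folder and VOICE_LABELS are my own, the label set is an assumed subset of the model's class names, and the threshold values are arbitrary guesses that would need tuning on tracks you already know.

```python
import csv
from pathlib import Path

# Assumption: these display names exist in the YAMNet class map and
# are a reasonable (not exhaustive) set of "human voice" labels.
VOICE_LABELS = {"Speech", "Singing", "Choir", "Rapping", "Vocal music"}


def has_voice(frame_scores, class_names, threshold=0.3, min_fraction=0.05):
    """Return True if at least `min_fraction` of the frames have a
    voice-related class scoring at or above `threshold`.
    Both default values are arbitrary guesses, not tuned."""
    voice_idx = [i for i, name in enumerate(class_names) if name in VOICE_LABELS]
    if not voice_idx or len(frame_scores) == 0:
        return False
    voiced = sum(
        1 for frame in frame_scores
        if max(frame[i] for i in voice_idx) >= threshold
    )
    return voiced / len(frame_scores) >= min_fraction


def classify_folder(folder, pattern="*.mp3"):
    """Print 'voice' or 'instrumental' for every matching file below `folder`."""
    # Heavy third-party imports are kept local so has_voice() works without them.
    import librosa
    import tensorflow as tf
    import tensorflow_hub as hub

    model = hub.load("https://tfhub.dev/google/yamnet/1")
    class_map = model.class_map_path().numpy().decode("utf-8")
    with tf.io.gfile.GFile(class_map) as f:
        class_names = [row["display_name"] for row in csv.DictReader(f)]

    for path in sorted(Path(folder).rglob(pattern)):
        # YAMNet expects mono float audio sampled at 16 kHz.
        waveform, _ = librosa.load(str(path), sr=16000, mono=True)
        scores, _, _ = model(waveform)  # scores has shape (frames, classes)
        label = "voice" if has_voice(scores.numpy(), class_names) else "instrumental"
        print(f"{label}\t{path}")
```

Calling classify_folder("/path/to/music") would then print one tab-separated line per file. Counting frames rather than averaging over the whole track is a deliberate choice here, so that a song with only a short vocal passage can still be flagged, but I can't say how reliable the result is on real music.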

Otherwise I would browse Hugging Face and see whether you can find suitable models and programs there.

Many thanks for the suggestions.

It seems to me that there is currently no finished or even halfway ready-to-use product that would move my project forward. I don't want to build a solution in, for example, a Python environment, or deal with APIs and code snippets. The barrier to entry is a bit too high for what I am trying to achieve.