Regex to find invalid characters in filter or actions

I am trying to develop a regex for the filter view or an action that finds audio files with any unwanted character in their tags. I already have a regex that works fine in a text editor when applied to a text file exported from Mp3tag. However, I can't get my regexes to work in the Mp3tag filter view, so I haven't yet tried to apply them in an action.

Computer setup: Mp3tag version 2.70, Windows 7 64-bit

Background: My music library consists of FLAC audio files only (i.e., Vorbis comments as tags). Most of them were created with dBpoweramp CD Ripper ripping to FLAC, some with dBpoweramp Music Converter converting SHN files to FLAC. My car stereo (BMW, model year 2014) fortunately can play FLACs from a USB stick. However, the car's audio player somehow fails to find all audio files on a given stick, both in its music library and in its directory viewer. Approx. 1-3% of the audio files are missing. Before trying to convince the car manufacturer and its stereo supplier to check their equipment and software/firmware, I would like to make sure nothing is wrong with my audio files.

One might suspect that unwanted characters sometimes slip through the programs (e.g., dBpoweramp ripper) that gather tags from music databases. Therefore, I would like to develop regexes that narrow my music library down to those files that contain, in any of their tags, a character which is not in a set of "allowed characters" per my definition.

In particular, I try to apply either of the following file filter view expressions in Mp3tag:

%title% MATCHES [^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
or
%title% MATCHES [^\w '"\-,;.:!?&/*#_+()[\]<>]
The first is a more restrictive version of a character class containing my "allowed" characters, while the second is more general because it uses \w. Instead of %title%, I might have %album% or %comment% or any other tag field.

In Mp3tag, the filter view still shows the entire music library when one of the filters above is applied. I had expected the filter view to be empty if none of my audio files contained any unwanted character. I think the filter view behaves this way when the regex in the filter is incorrect.

In order to check these regexes, I exported the relevant tags from my audio files to a text file using Mp3tag's export action with the predefined export format. Applying the regexes above to this text file in a text editor (EditPad Pro) indicates that they work fine. Remark: On a Windows system, \r\n has to be added to the character class to avoid matching every line break, i.e.

[^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>\r\n]
or
[^\w '"\-,;.:!?&/*#_+()[\]<>\r\n]
This might serve as a workaround for me, but I am interested in why the regexes don't seem to work in Mp3tag as a filter. Of course, I have also tried the extended regexes in Mp3tag, in case a tag contains \r\n.
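
The text-file check can also be scripted. Here is a minimal sketch in Python, assuming the exported tags are available as lines of text. Note that Python's `re` module is not the regex engine inside Mp3tag, so this only illustrates the negated character class; the name `find_unwanted` and the sample lines are made up for this example.

```python
import re

# Negated character class from the post: matches any single character
# that is NOT in the list of allowed characters.
# Assumption: evaluated by Python's re engine, not by Mp3tag.
UNWANTED = re.compile(r"""[^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]""")


def find_unwanted(lines):
    """Return (line_number, unwanted_characters) for every offending line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Strip the line break first, so \r\n need not be part of the class.
        found = UNWANTED.findall(line.rstrip("\r\n"))
        if found:
            hits.append((number, found))
    return hits


# Hypothetical sample data standing in for an exported text file.
sample = ["Hello World (Live)\r\n", "Café del Mar\r\n", "Track #1 [Remix]\r\n"]
print(find_unwanted(sample))
```

Stripping the line break before matching avoids having to append \r\n to the character class.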

Thus, my questions are:

  • What's wrong with my regexes in the Mp3tag filter?
  • Is there any other character to be included in the class of allowed characters (FLAC has Vorbis comments)?
  • Maybe more than that, is there anything wrong in Mp3tag itself?
Thanks in advance for support!

I should also like to say how great a program Mp3tag is ...

Don't you need a "NOT"? I gathered that the illegal characters are NOT in the list.

And since the string contains special characters that separate parts of a regexp from others, you would have to look for those separately.

NOT %title% MATCHES "[a-z|A-Z|0-9|äöüÄÖÜß,;.:!?&/*#_+()[]<>]"

Also, I think you have to separate ranges from individual characters. [a-z] might work, but a-zA-z is no range - separate these with a bar |
e.g.

%title% MATCHES "[0-9]|[äöüÄÖÜß]"
(This shows all tracks with a number or an umlaut in it)
If you increase your expression bit by bit you will get the knack of it.

Thanks for the comment, dear ohrenkino. However, I do still think my regexes are correct and do what I want them to do.

Explanation of the regexes: I have a list of my allowed characters, which would be formulated as a character class like

[a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
which has ranges of letters and digits as well as some literal characters from the space to the >. Within this character class, only the hyphen and the right square bracket need to be escaped (unlike in a regex outside a character class). Now I want to search for any character NOT in this character class, so I need the negated character class
[^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
to be used as a search pattern in
%title% MATCHES [^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
A file filter like
NOT %title% MATCHES [a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
wouldn't work, since it would show only files whose title has NONE of the allowed characters, i.e. consists of unwanted characters only. But I am looking for files that have at least one unwanted character buried somewhere among the allowed ones.
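
The difference between the two filters can be checked outside Mp3tag. The following is a small sketch in Python, under two assumptions: Python's `re` stands in for Mp3tag's engine, and MATCHES is taken to find the pattern anywhere in the field. The function names are invented for this illustration.

```python
import re

# The allowed characters from the post, without the surrounding brackets.
ALLOWED = r"""a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>"""

has_allowed = re.compile("[" + ALLOWED + "]")    # at least one allowed character
has_unwanted = re.compile("[^" + ALLOWED + "]")  # at least one unwanted character


def negated_class_filter(title):
    """Analogue of: %title% MATCHES [^...]"""
    return has_unwanted.search(title) is not None


def not_matches_filter(title):
    """Analogue of: NOT %title% MATCHES [...]"""
    return has_allowed.search(title) is None


# Hypothetical titles: clean, one unwanted character, only unwanted characters.
for title in ["Plain Title", "Caf\u00e9 del Mar", "\u00e9\u00e9\u00e9"]:
    print(repr(title), negated_class_filter(title), not_matches_filter(title))
```

Only the negated class flags the title with a single stray character; the NOT variant reacts only when no allowed character is present at all.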

To me, one of the best references for regexes is the Regular Expression Tutorial by Jan Goyvaerts. I have tested my regexes with his editor EditPad Pro as well as his Regex Buddy; both prove them to work fine.

Now, there is more information for the programmers of Mp3Tag:

It seems that my regexes formulated in my original post don't work as a file filter, i.e. something like

%title% MATCHES [^a-zA-Z0-9äöüÄÖÜß '"\-,;.:!?&/*#_+()[\]<>]
or
%title% MATCHES [^\w '"\-,;.:!?&/*#_+()[\]<>]
doesn't work.

However, these regexes work fine in the "Replace with Regular Expression" ("Ersetzen mit regulärem Ausdruck") action.

Thus, I feel all the more inclined to escalate my observations to a bug report unless there is a more trivial explanation. I will wait a little while for more insight ...

See the Mp3tag help manual, section "Filter"; it has example filter expressions to try out and to understand.
Here are some more examples using character classes, which may help to reduce the complexity of the regular expression itself ...

Any control character ...

TITLE MATCHES "[[:cntrl:]]"
TITLE MATCHES [[:cntrl:]]

Any extended character whose code point is above 255 in value ...
TITLE MATCHES "[[:unicode:]]"
TITLE MATCHES [[:unicode:]]

Any word character (alphanumeric characters plus the underscore) ...
TITLE MATCHES "[[:word:]]"
TITLE MATCHES [[:word:]]

Any digit character ...
TITLE MATCHES "[[:digit:]]"
TITLE MATCHES [[:digit:]]

Reverse or negate a filter ...
NOT TITLE MATCHES "[[:word:]]"
NOT TITLE MATCHES "[[:digit:]]"

Filter combinations ...

NOT TITLE MATCHES "[[:digit:]]" AND NOT TITLE MATCHES "[[:word:]]"
... same as ...
NOT (TITLE MATCHES "[[:digit:]]" OR TITLE MATCHES "[[:word:]]")
... same as ...
NOT (TITLE MATCHES "[[:digit:][:word:]]")

Negated ...
NOT (NOT TITLE MATCHES "[[:digit:]]" AND NOT TITLE MATCHES "[[:word:]]")
... same as ...
TITLE MATCHES "[[:digit:]]" OR TITLE MATCHES "[[:word:]]"
... same as ...
TITLE MATCHES "[[:digit:][:word:]]"
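
The equivalence of these combinations is plain Boolean logic (De Morgan's laws) and can be checked with a short sketch in Python. Python's `re` has no `[[:digit:]]` or `[[:word:]]` classes, so `\d` and `\w` are used as stand-ins; that substitution, the function name `check` and the sample titles are assumptions made only for this example.

```python
import re


def has_digit(title):
    """Stand-in for: TITLE MATCHES "[[:digit:]]" (using \\d)."""
    return re.search(r"\d", title) is not None


def has_word(title):
    """Stand-in for: TITLE MATCHES "[[:word:]]" (using \\w)."""
    return re.search(r"\w", title) is not None


def has_either(title):
    """Stand-in for: TITLE MATCHES "[[:digit:][:word:]]"."""
    return re.search(r"[\d\w]", title) is not None


def check(title):
    """Evaluate the three 'same as' variants of the combined NOT filter."""
    first = (not has_digit(title)) and (not has_word(title))
    second = not (has_digit(title) or has_word(title))
    third = not has_either(title)
    return first, second, third


# Hypothetical titles; all three variants should agree for each of them.
for title in ["Track 01", "???", "", "1999", "Intro"]:
    print(repr(title), check(title))
```

Since word characters already include the digits, the combined class selects the same titles as the word class alone.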

DD.20150701.1206.CEST

Thank you, DetlevD, for your suggestions on making a file filter easier to read. I like the [:cntrl:] and [:unicode:] expressions - I hadn't seen them before. What regex flavor are these from?

Still, I feel inclined to stick to my conclusion that there is something wrong in Mp3tag's regex parsing in the file filter. As I said earlier, my regexes have been tested in two other programs using, among others, the Perl regex flavor, and they work fine in Mp3tag's "Ersetzen mit regulärem Ausdruck". I am including a screenshot of Regex Buddy's analysis of my two regexes, choosing Perl as the regex flavor.


So, I'll wait some more time before escalating this to a bug report.


Well, there seems to be something obscure about the Mp3tag filter regarding the MATCHES operator.
I did some tests, and what I can say for now is the following.

This works ...
TITLE MATCHES "[(]"
TITLE MATCHES "[)]"
TITLE MATCHES "[()]"
TITLE MATCHES "[)(]"
TITLE MATCHES [(]
but this does not work ...
TITLE MATCHES [)]
TITLE MATCHES [()]
TITLE MATCHES [)(]

So the rule seems to be ...
as soon as the "character set" of the regular expression contains a closing round bracket, ...
the regular expression has to be enclosed in quote characters.
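
If filter strings are put together by a script, the simplest way to follow this rule is to quote the regular expression always. Below is a minimal sketch in Python; `matches_filter` is a hypothetical helper, not part of Mp3tag. How a double quote inside the pattern would have to be written is not established in this thread, so the sketch rejects such patterns instead of guessing.

```python
def matches_filter(field, pattern, negate=False):
    """Build an Mp3tag-style filter string with the regex always quoted.

    Hypothetical helper: the escaping rule for a double quote inside the
    pattern is unknown here, so such patterns raise an error.
    """
    if '"' in pattern:
        raise ValueError("pattern contains a double quote; escaping rule unknown")
    expression = f'{field} MATCHES "{pattern}"'
    return f"NOT {expression}" if negate else expression


print(matches_filter("TITLE", "[()]"))
print(matches_filter("TITLE", "[[:digit:]]", negate=True))
```

The resulting strings would then be pasted into the filter box by hand.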

This filter expression works too ...
NOT TITLE MATCHES "[-äöüÄÖÜß,;.:!?&/*#_+<>[]{}0-9A-Z()]"

DD.20150701.1718.CEST

See also ...
http://www.boost.org/doc/libs/1_50_0/libs/...er_classes.html

DD.20150701.2248.CEST

Now that we agree that there is something strange about the MATCHES operator in file list filter expressions, I think it is time for Florian Heidenreich (or his coworkers, if any) to step in and clarify this matter. Unless you, dear DetlevD, are a programming contributor to Mp3tag ... I think there is not much use in trial and error.

The questions to Florian are:

  1. What exactly is/are the regex flavor(s) used by Mp3tag?
  2. Please provide more detailed definitions and explanations in the Mp3tag Help regarding regexes. Items to cover would include the applicable regex flavors, the use of quotation marks around a regex in filter expressions (MATCHES), and the use of escape characters depending on the regex flavor.
  3. Please revisit the filter expression code, in particular the MATCHES code.
  4. Please do also check other code where regexes are being used.
I hope Florian or his coworkers will come across this thread. Thanks in advance!

My point of concern is: Regexes are awesome and powerful at achieving things quickly, but they also have the power to destroy things pretty fast ... I would not exactly like to put my music library in the hands of a regex engine which I don't know enough about, either regarding its flavor or regarding how Mp3tag parses the regexes in filter expressions and actions. Filters aren't that bad, but actions might end in a disaster with a large music library which took quite some time to build.

To DetlevD (we could also talk German, of course ...): Thanks for your testing and for agreeing with me, as well as for pointing me to the Boost C++ library. Are you implying that Mp3tag is using this library, which provides Perl and POSIX (extended and more) regex handling? Or maybe this is just your favorite place to read about regexes, as mine is Jan Goyvaerts' tutorial (cf. above).

Again, I hope for clarification from Florian now! Thanks.

Yes.

In the "About" dialog of older versions of the Mp3tag application there was a human-readable note about the built-in "boost" regex engine.
Even in the code of the latest version 2.70 of Mp3tag.exe one can find text from the boost "regex_constants.hpp".
I am still convinced that Mp3tag uses the boost regex engine.
See also ....
[X] Replace with Regular Expression
Capitalizing names with Mc/Mac
Expression to delete defined strings from tag

DD.20150702.1556.CEST