Merging several Tag Fields into CONTENTGROUP

Hi,

I'm using many Tag Source scripts to gather all available data about tracks/albums. Since many of them write tags to the same tag field, I've edited the scripts so that each script writes its tags to a unique tag field. After running those scripts I want an Action to merge them into a tag called CONTENTGROUP (used in iTunes).

The tags I have from the scripts are:
%lastfmtags% = "Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop"
%dostyle% = "Pop/Rock, Alternative Rock, Punk, Indie Rock"
%rateyourmusicstyles% = " "
%amgstyles% = "Alternative/Indie Rock, Alternative Pop/Rock, Garage Rock Revival"
%amgmoods% = "Bright, Fun, Stylish, Uplifting, Nostalgic, Exciting"
%amgthemes% = "Girls Night Out, Breakup"

I can merge them into one by using a simple action like Field:"CONTENTGROUP": %lastfmtags%, %dostyle%, etc.

but this will give me lots of duplicate entries, since many of the values in the different fields are the same; "Soul", for example, might be one of the values in 4 of the fields. On top of that, some of the tag fields might be blank, and then I'll just get a double comma with a space in between.

What I want is an Action that writes the sum of the values in the tag fields into CONTENTGROUP, with each value listed only once and no empty entries.
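
To pin down the requested behaviour independently of Mp3tag, here is a minimal sketch in Python (it is not an Mp3tag action). The function name `merge_unique` and the choice to compare values case-insensitively are assumptions made for illustration; the field values are the ones listed above.

```python
def merge_unique(fields, sep=","):
    """Merge comma-delimited tag values, dropping blanks and repeats.

    Hypothetical helper: the first spelling of each value is kept, and
    values are compared case-insensitively (an assumption).
    """
    seen = set()
    merged = []
    for field in fields:
        for item in field.split(sep):
            item = item.strip()
            key = item.casefold()
            if item and key not in seen:
                seen.add(key)
                merged.append(item)
    return ", ".join(merged)


fields = [
    "Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop",  # lastfmtags
    "Pop/Rock, Alternative Rock, Punk, Indie Rock",                       # dostyle
    " ",                                                                  # rateyourmusicstyles (blank)
    "Alternative/Indie Rock, Alternative Pop/Rock, Garage Rock Revival",  # amgstyles
    "Bright, Fun, Stylish, Uplifting, Nostalgic, Exciting",               # amgmoods
    "Girls Night Out, Breakup",                                           # amgthemes
]
print(merge_unique(fields))
```

With these values, "Indie Rock" ends up in the result once, and the blank field contributes nothing.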

Any ideas?

Kind Regards,

Windjammer

This topic went stale without any response. However, I must say that I was personally interested in this issue and even attempted to craft such a regex myself, but I eventually ended up with only the conviction that it can be done using regular expressions.

This is a puzzle for sure. Are you still concerned? Maybe you have found your answer somewhere else?

I wonder if any of the forum members skilled with regex (I know that DetlevD and dano are) could give an opinion on whether it is possible to achieve :slight_smile:

Good question.
There is a Regular Expression ...
http://www.regular-expressions.info/regexb...plicatecsv.html
... which should match duplicate comma-delimited items.
(?<=,|^)([^,]*)(,\1)+(?=,|$)

Having this list of items ...
"Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop, Pop/Rock, Alternative Rock, Punk, Indie Rock, Alternative/Indie Rock, Alternative Pop/Rock, Garage Rock Revival, Bright, Fun, Stylish, Uplifting, Nostalgic, Exciting, Girls Night Out, Breakup"
... the Regular Expression matches nothing.

We have to sort the list of items ...
Alternative Pop/Rock,Alternative Rock,Alternative,Alternative/Indie Rock,Breakup,Bright,Exciting,Female Vocalists,Fun,Garage Rock Revival,Girls Night Out,Indie Pop,Indie Rock,Indie Rock,Indie,Nostalgic,Pop/Rock,Punk,Rock,Stylish,Uplifting
... now the Regular Expression can find the duplicates (the two adjacent "Indie Rock" items).
But it looks a bit as if there is a quirk in the Regular Expression, because the comma delimiter between the duplicate items is matched as well.
The first matching group is: "Indie Rock".
The second matching group is: ",Indie Rock".
Well, when one of the matching items is removed, the itemlist will be kept intact as a comma-delimited list of items.
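
The effect of sorting can be checked outside Mp3tag. The sketch below uses Python's `re` module, which is an assumption of this example and not the engine Mp3tag uses. Because `re` rejects a lookbehind whose alternatives differ in width, `(?<=,|^)` is rewritten as `(?:(?<=,)|^)`. The names `ADJACENT_DUPES`, `unsorted_list` and `sorted_list` are made up for illustration.

```python
import re

# Adjacent-duplicate pattern quoted above, with the lookbehind rewritten
# so that Python's re accepts it (an adaptation for this sketch).
ADJACENT_DUPES = re.compile(r"(?:(?<=,)|^)([^,]*)(,\1)+(?=,|$)")

unsorted_list = (
    "Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop, "
    "Pop/Rock, Alternative Rock, Punk, Indie Rock, Alternative/Indie Rock, "
    "Alternative Pop/Rock, Garage Rock Revival, Bright, Fun, Stylish, "
    "Uplifting, Nostalgic, Exciting, Girls Night Out, Breakup"
)

# The duplicates are not adjacent, so nothing matches.
print(ADJACENT_DUPES.search(unsorted_list))  # None

# Sort the items and join them with a bare comma.
sorted_list = ",".join(sorted(item.strip() for item in unsorted_list.split(",")))

match = ADJACENT_DUPES.search(sorted_list)
print(match.group(1))  # first group: the item itself
print(match.group(2))  # second group: comma plus the repeated item

# Replacing each match with its first group keeps the list well-formed.
deduped = ADJACENT_DUPES.sub(r"\1", sorted_list)
print(deduped)
```

Python's default sort order differs slightly from the sorted list shown above, but the duplicates end up next to each other either way, which is all the expression needs.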

Mp3tag has no sort function yet (a request was opened a few days ago).

... to be continued ...

DD.20100920.2332.CEST

... and here it goes ...

Note:
If we could sort a delimited itemlist within a tag-field, Mp3tag life would be much easier!
If we could repeat one action in a loop, Mp3tag life would be much easier!

Example itemlist:
ITEMLIST="aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii"

The itemlist contains 15 items:
"aaa" x 3, "bbb" x 2, "ccc" x 2, "ddd" x 3, "eee" x 1, "fff" x 1, "GGG" x 1, "ggg" x 1, "iii" x 1.

This regular expression:
$regexp(%ITEMLIST%,'(?:^|,)([^,]*)(,\1)+(?=,|$)','$2')
will reduce the itemlist by removing adjacent duplicate items
from:
"aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii"
to:
",aaa,bbb,ccc,bbb,ddd,ccc,eee,fff,GGG,ggg,iii"

This regular expression:
$regexp(%ITEMLIST%,'(?:^|,)([^,]*)(.*)(,\1)+(?=,|$)',',$1$2')
will further reduce the itemlist by one duplicate pair
from:
",aaa,bbb,ccc,bbb,ddd,ccc,eee,fff,GGG,ggg,iii"
to:
",aaa,bbb,ccc,ddd,ccc,eee,fff,GGG,ggg,iii"

There are still some duplicate items in the itemlist.
Mp3tag cannot execute an action in a loop, so we have to repeat this step sequentially.
How many times? Use a count that fits your needs.

After some steps the itemlist will be reduced
from:
",aaa,bbb,ccc,bbb,ddd,ccc,eee,fff,GGG,ggg,iii"
to:
",aaa,bbb,ccc,ddd,eee,fff,GGG,ggg,iii"

Surrounding commas can easily be trimmed with
$trim(%ITEMLIST%,',')

Now the itemlist contains nine unique mixed-case items:
ITEMLIST="aaa,bbb,ccc,ddd,eee,fff,GGG,ggg,iii"
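
The sequence of steps can be simulated outside Mp3tag to check the intermediate results. The following is a sketch in Python's `re` (an assumption of this example; Mp3tag's `$regexp` has its own engine and writes replacements as `$1`, where Python writes `\1`). The loop stands in for repeating the second action by hand, and the names `ADJACENT`, `DISTANT`, `remove_dup_items` and `max_passes` are hypothetical.

```python
import re

# Step 1: collapse each run of adjacent duplicates into one item.
ADJACENT = re.compile(r"(?:^|,)([^,]*)(,\1)+(?=,|$)")
# Step 2: remove one later, non-adjacent copy of an item per match.
DISTANT = re.compile(r"(?:^|,)([^,]*)(.*)(,\1)+(?=,|$)")


def remove_dup_items(itemlist, max_passes=20):
    """Apply step 1 once, then repeat step 2 until nothing changes."""
    current = ADJACENT.sub(r"\2", itemlist)
    for _ in range(max_passes):
        reduced = DISTANT.sub(r",\1\2", current)
        if reduced == current:
            break
        current = reduced
    # Stand-in for the final trim step: strip the surrounding commas.
    return current.strip(",")


itemlist = "aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii"
print(ADJACENT.sub(r"\2", itemlist))
# ,aaa,bbb,ccc,bbb,ddd,ccc,eee,fff,GGG,ggg,iii
print(remove_dup_items(itemlist))
# aaa,bbb,ccc,ddd,eee,fff,GGG,ggg,iii
```

For this example list two passes of step 2 are enough; `max_passes` only caps the repetition, much like choosing how many copies of the action to put into the group.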

If the process should ignore differences in upper and lower case, the regular expressions need a modification:
$regexp(%LIST%,'(?i)(?:^|,)([^,]*)(,\1)+(?=,|$)','$2')
$regexp(%LIST%,'(?i)(?:^|,)([^,]*)(.*)(,\1)+(?=,|$)',',$1$2')
This will reduce the itemlist
from:
"aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,eee,fff,GGG,ggg,iii"
to:
",aaa,bbb,ccc,ddd,eee,fff,ggg,iii"

Put all steps into an action group:
Test_2010_20100921.RemoveDupItems.mta (1021 Bytes)

Now it is possible to do ...
From:
ITEMLIST="Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop, Pop/Rock, Alternative Rock, Punk, Indie Rock, Alternative/Indie Rock, Alternative Pop/Rock, Fun, Garage Rock Revival, Alternative Rock, Bright, FUN, Stylish, Uplifting, Nostalgic, Exciting, Fun, Girls Night Out, FUN, Breakup, Indie Rock"
To:
ITEMLIST="Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop, Pop/Rock, Alternative Rock, Punk, Alternative/Indie Rock, Alternative Pop/Rock, Fun, Garage Rock Revival, Bright, Stylish, Uplifting, Nostalgic, Exciting, Girls Night Out, Breakup"

... alas ... still not sorted!

DD.20100921.0930.CEST

To be specific: it matches only adjacent duplicate comma-delimited items. This limitation calls for a new approach to the problem.

I think I've come up with the right one:

Edit: It is not the right one, actually. See the posts below.

Action: Replace with regular expression

Field: {comma- or comma+space- delimited list}
Regular expression: (?<=,|\A) ?([^,]*),(?=.*?(?<=,) ?\1(?=,|\z))
Replace matches with:

It's case-insensitive by default (i.e. it will also delete items which differ from the others only in case).
As we may end up with a space at the beginning after removal from a comma+space-delimited list, I added a second action to the group to strip that space.

Here's how it works all together:

From:

aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii

To:

aaa,bbb,ddd,ccc,eee,fff,ggg,iii
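
For anyone who wants to try the idea outside Mp3tag, here is a sketch in Python's `re` module (an assumption of this example, not Mp3tag's engine). Two adaptations are made for Python: the leading lookbehind is written as `(?:(?<=,)|\A)`, and `\z` becomes `\Z`, which is Python's end-of-string anchor. The names `LATER_DUPES` and `dedupe_keep_last` are hypothetical.

```python
import re

# Match every item (plus its trailing comma) that occurs again later in
# the list, so that only the last occurrence of each item survives.
LATER_DUPES = re.compile(
    r"(?:(?<=,)|\A) ?([^,]*),(?=.*?(?<=,) ?\1(?=,|\Z))",
    re.IGNORECASE,  # mirrors the case-insensitive behaviour described above
)


def dedupe_keep_last(itemlist):
    """Drop earlier duplicates, then strip a leftover leading space."""
    return LATER_DUPES.sub("", itemlist).lstrip(" ")


print(dedupe_keep_last("aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii"))
# aaa,bbb,ddd,ccc,eee,fff,ggg,iii
print(dedupe_keep_last("Fun, Indie, FUN, Indie"))
# FUN, Indie
```

Note the design consequence: because an item is removed whenever a later copy exists, the surviving order follows the last occurrences, not the first ones.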

Yes, I certainly agree with both!

Hey pmj1989, so far, good work!

Your regular expression works in RegEx Tester v3.0.0.0 and matches the duplicate items together with their commas in both test lists:

aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii

Female Vocalists, Indie, Indie Rock, Alternative, Rock, Indie Pop, Pop/Rock, Alternative Rock, Punk, Indie Rock, Alternative/Indie Rock, Alternative Pop/Rock, Fun, Garage Rock Revival, Alternative Rock, Bright, FUN, Stylish, Uplifting, Nostalgic, Exciting, Fun, Girls Night Out, FUN, Breakup, Indie Rock

Sadly the caveat is the same as with the regular expression "(?<=,|^)([^,]*)(,\1)+(?=,|$)" from http://www.regular-expressions.info.

Mp3tag does not like this regular expression: "Invalid lookbehind assertion encountered ..."

DD.20100921.1923.CEST

Oops! I guess I was too lazy and too convinced of the compatibility between my regex crafting tool (RegexBuddy with Perl flavor enabled) and Mp3tag.

The regex is not working as intended and that's a fact, however I don't get any error!

Firstly, this is what I get:

Format string:

$regexp('aaa,aaa,aaa,bbb,ccc,bbb,ddd,ddd,ddd,ccc,eee,fff,GGG,ggg,iii','(?<=,|\A) ?([^,]*),(?=.*?(?<=,) ?\1(?=,|\z))',,1)

Results in: aaa,aaa,bbb,ddd,ccc,eee,fff,ggg,iii

No error. But still not our goal.
Please try to reproduce this by pasting the string into the Tag-Filename dialog (the most convenient way for me) and check whether the error occurs.

"Invalid lookbehind assertion encountered..." is the kind of error that shows up due to the lookbehind limitation in the Perl flavor.

The dubious part of our expression is the initial lookbehind (?<=,|\A), since it's not really fixed-length: the comma is one character long, while the start-of-string anchor is zero-length. I'll try to fix this problem tomorrow.

Thanks to the links you provided me a few months ago :slight_smile:

The invalid expression is not recognized as such by versions earlier than 2.46b.

Here it is

(?:(?<=,)|(?<=\A)) ?([^,]*),(?=.*?(?<=,) ?\1(?=,|\z))
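
A comparable restriction can be observed in other engines. As an illustration only, Python's `re` module (not the engine Mp3tag uses) also rejects a lookbehind whose alternatives have different lengths, but accepts an alternation of two lookbehinds. The helper name `compiles` is made up for this sketch, and only the leading part of the expression is tested.

```python
import re


def compiles(pattern):
    """Return True if Python's re accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# One lookbehind containing an alternation of a one-character comma and
# the zero-length \A anchor: the alternatives differ in length.
print(compiles(r"(?<=,|\A) ?x"))  # False

# An alternation of two lookbehinds, each with a fixed length.
print(compiles(r"(?:(?<=,)|(?<=\A)) ?x"))  # True
```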

Good that you solved that. Are there any other changes concerning the regex engine that don't appear in the changelog?

Remove_duplicates_from_comma__or_comma_space__delimited_list.mta (226 Bytes)

I do not quite understand ...
is the expression invalid
... or ...
the Mp3tag 2.46b regex machine?

DD.20100924.1444.CEST

pmj1989, yes, it is!

Should I say, just perfect or just nearly perfect?
I never thought that the given problem to "de-dup" an item list could be solved with such a small linear expression. Regular expressions can be really mighty.

I modified the expression a little bit to work with multi-line textfields too, but I have not tested it in full depth.

(?:(?<=,|\A|^))\s?([^,]*),(?=.*?(?<=,)\s?\1(?=,|\Z|$))(?#de-dup comma- or comma+space- delimited list)

DD.20100924.1456.CEST

It's not supported so it must be considered invalid by Mp3tag.

You've made the same mistake as I did. You haven't tested your regex inside Mp3tag (supposing you want to use it there). To use it in Mp3tag you would need to replace the alternation inside the lookbehind with an alternation of lookbehinds (hope it sounds clear). I mentioned the reason a few posts above.

Also, FYI:

There is no need to use both the ^ and \A tokens, since ^ already matches in every place in which \A does (the same applies to $ and \Z).

Actually, my regex would work with a list containing multi-line entries. Yours in contrast could produce false positives due to including ^$ anchors e.g.

multi-line

string,

multi-line
string
blabla

or

blabla

multi-line

string,

multi-line
string

The other thing is the actual necessity of this, since I can't see a reason why one would merge multi-line texts into a single field.

Lastly (am I getting dull?), the \s token. Whitespace. Note that it matches more than just a space. It also matches line break characters, tab, form feed and several more exotic ones. Again, it doesn't matter so much in this case.
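
Both remarks are easy to check in an engine with similar semantics. The sketch below uses Python's `re` as a stand-in (an assumption; Mp3tag's engine may differ in details such as whether ^ and $ match at line breaks by default), so the multi-line behaviour is switched on explicitly with `re.MULTILINE`. The variable names are made up for illustration.

```python
import re

# Space, tab, newline, carriage return, form feed, vertical tab.
whitespace = " \t\n\r\f\v"

# \s matches every one of these characters; a literal space matches one.
print(len(re.findall(r"\s", whitespace)))  # 6
print(len(re.findall(r" ", whitespace)))   # 1

text = "multi-line\nstring,multi-line\nstring"

# In multi-line mode ^ matches at the start of every line, while \A
# matches only at the very start of the string.
line_starts = [m.start() for m in re.finditer(r"^", text, re.MULTILINE)]
string_starts = [m.start() for m in re.finditer(r"\A", text, re.MULTILINE)]
print(line_starts)    # [0, 11, 29]
print(string_starts)  # [0]
```

The extra positions where ^ matches are the places where an item boundary could be assumed in the middle of a multi-line entry.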

Cheers : )

Yes, you are right. I did my tests mainly with a Regex Tester application, to create a regex which should work in general on most incarnations of regular expression machines. I know that Mp3tag always needs something special, which makes it so outstanding. Yes, the \s whitespace class also includes control characters, which would be too much in this case and will probably lead to inconsistent results. Rather than cleaning a "horizontal" list of items, I sometimes need to clean a "vertical" list of items, and also free-formatted text from an imported text file or otherwise combined data. I will look into the problem again. Thank you for your response!

DD.20100927.0600.CEST

Fantastic! Worked great! :smiley: