Case conversion...

yog-sothoth · March 17, 2011, 2:54pm

Another update: I've done some more testing of the script and found a few problems. Firstly, Doug was right in saying that without the "words begin after" command in the native case conversion, it preserves or causes case errors. For instance, any word immediately after a bracket (like "(Featuring..." ) either remains or becomes lower-case after conversion. The original script I described in my first post used the mixed case function, with the following "words begin after" instruction: ({[]})-_",./+&@:;* Therefore, this somehow needs to be accounted for in the new script using a reg-ex action.

Secondly, DetlevD's solution has one minor flaw. If a bracket is left open, i.e. "(Remix", then the following error occurs: [ SYNTAX ERROR IN FORMATTING STRING ], and the field info is lost.

I really appreciate all the help I'm getting. Please help me fix this, once and for all.

DetlevD · March 17, 2011, 3:06pm

I want to mention that I did not offer a "solution", only a workaround for a possible bug in, or possible misuse of, the action "Replace using Regular Expression".
yog-sothoth, you should file a bug report about both error cases you have found.

You should also read the manual and check what the second parameter of the $caps2 function can do for you.

DD.20110317.1706.CET

dano · March 17, 2011, 5:04pm

Instead of the first mixed case action try these two:

Action type: Replace with regular expression
Field: _TAG
Regular expression: ([-({\[\]}) _",./+&@:;*])(\l)
Replace matches with: $1$upper($2)
[x] case-sensitive comparison

Action type: Replace with regular expression
Field: _TAG
Regular expression: ^(\l)
Replace matches with: $upper($1)
[x] case-sensitive comparison
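
For anyone who wants to try the idea outside Mp3tag, here is a minimal Python sketch of the two actions above. It is only an approximation: Python's `re` module has no `\l` class, so `[a-z]` stands in for it (ASCII letters only), and the function name `capitalize_after_boundaries` is made up for illustration.

```python
import re

# Boundary characters taken from the first action; [a-z] approximates \l.
BOUNDARY_THEN_LOWER = re.compile(r'([-({\[\]}) _",./+&@:;*])([a-z])')
# Second action: a lowercase letter at the very start of the field.
LEADING_LOWER = re.compile(r'^([a-z])')


def capitalize_after_boundaries(text: str) -> str:
    """Upper-case letters that follow a boundary character or start the text."""
    text = BOUNDARY_THEN_LOWER.sub(
        lambda m: m.group(1) + m.group(2).upper(), text
    )
    return LEADING_LOWER.sub(lambda m: m.group(1).upper(), text)


print(capitalize_after_boundaries("some song (featuring someone)"))
print(capitalize_after_boundaries("(remix"))
```

Neither step lowercases anything, so existing capitals are left alone, and an unclosed bracket such as "(remix" is handled like any other boundary character.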

yog-sothoth · March 17, 2011, 5:47pm

Excellent! Thank you so much dano, great job!

One more question. You use the _TAG field, which I assume covers all tag fields but not the filename, right? Since I want to apply these actions to the filename too, would changing the field to _ALL have any drawbacks? The same question applies to _DIRECTORY. Thanks again.

DetlevD · March 17, 2011, 7:07pm

I beg your pardon, just coming in for a moment.

You have no need to apply this cleaning procedure for the filename too, because once all the tag fields have been cleaned and got their correct values, you can assemble the file name and all the folder names in the folder tree from the content of the tag fields.
This is the recommended work flow.

Be aware that the file name, or rather the full path name, needs additional housekeeping to become valid, because some characters are forbidden in the file system.

DD.20110317.2108.CET

dano · March 17, 2011, 7:11pm

I don't recommend _ALL for the first action (the second is ok). It would change your filename extensions: .mp3 -> .Mp3
Well you could make an additional action to fix that.

You can use these actions on _DIRECTORY

The first action can be changed to a Format value action so it does not mess with the extension:

Action type: Format value
Field: _FILENAME
Formatstring: $regexp(%_filename%,'([-({\[\]}) _",./+&@:;*])(\l)',$1\u$2)
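
The same point can be sketched in Python: apply the capitalisation to the name without its extension and then put the extension back. This assumes, as the Format value action above implies, that only the part before the extension should be touched. `capitalize_stem` is a hypothetical helper name, and `[a-z]` again approximates `\l`.

```python
import os
import re

BOUNDARY_THEN_LOWER = re.compile(r'([-({\[\]}) _",./+&@:;*])([a-z])')


def capitalize_stem(filename: str) -> str:
    """Capitalise after boundary characters, leaving the extension untouched."""
    stem, ext = os.path.splitext(filename)  # "a b.mp3" -> ("a b", ".mp3")
    stem = BOUNDARY_THEN_LOWER.sub(
        lambda m: m.group(1) + m.group(2).upper(), stem
    )
    return stem + ext


print(capitalize_stem("01 - my song.mp3"))
```

Running the pattern over the whole name instead would match ".m" and turn the extension into ".Mp3", which is the side effect described above.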

yog-sothoth · March 17, 2011, 8:14pm

I accept that. However, I've already long ago edited the tags of my entire collection and formatted the filename and parent directory names accordingly. I have no desire to do it again if I can avoid it, because doing so would only cause more problems. For example, some tracks (when in albums with various artists) include the artist field in the filename, while tracks from single-artist albums don't. There are other idiosyncrasies that together would make it a pretty laborious task to have to differentiate between them. All that I want to do now is standardise all fields to title case.

Correct me if I'm wrong, but doesn't mp3tag automatically remove illegal characters when formatting filenames?

yog-sothoth · March 17, 2011, 8:17pm

Ah, I see. So, just to recapitulate, here is the script in chronological order:

Action type: Replace with regular expression
Field: _TAG
Regular expression: ([-({\[\]}) _",./+&@:;*])(\l)
Replace matches with: $1$upper($2)
case-sensitive comparison

Action type: Format value
Field: _FILENAME
Formatstring: $regexp(%_filename%,'([-({\[\]}) _",./+&@:;*])(\l)',$1\u$2)

Action type: Replace with regular expression
Field: _All
Regular expression: ^(\l)
Replace matches with: $upper($1)
case-sensitive comparison

Followed by...

Action type: Replace with regular expression
Field: _ALL
Regular expression: (?<![/-:;(){}])\s(a|the|of|for|as|at|an|by|off|on|from|in|to|and|with|or|nor|von|de)(?=\s)(?!\s[-(){}])
Replace matches with: $lower($0)
case-sensitive comparison

Action type: Replace with regular expression
Field: _ALL
Regular expression: (^|\s|\(|\[|/)'(.{1})
Replace matches with: $1'$upper($2)
case-sensitive comparison

Please let me know if there's anything out of place here. Thanks dano.
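
As a rough illustration of the "small words" step in the recap, here is a Python sketch. Two things in it are assumptions rather than part of the thread: the match is made case-insensitive (otherwise "Of" or "The" would never be found), and the hyphen in the lookbehind class is moved to the front so that it is a literal. In the class as posted, `/-:` is read as a range from "/" to ":", which also covers the digits and does not include the hyphen itself. `lowercase_small_words` is a hypothetical name.

```python
import re

SMALL_WORDS = "a|the|of|for|as|at|an|by|off|on|from|in|to|and|with|or|nor|von|de"

SMALL_WORD = re.compile(
    # Not directly after a separator, then whitespace and a small word that is
    # followed by whitespace but not by whitespace plus a separator.
    r"(?<![-/:;(){}])\s(" + SMALL_WORDS + r")(?=\s)(?!\s[-(){}])",
    re.IGNORECASE,  # assumption: ignore case so capitalised words are found
)


def lowercase_small_words(text: str) -> str:
    """Lower-case small words that sit between two other words."""
    return SMALL_WORD.sub(lambda m: m.group(0).lower(), text)


print(lowercase_small_words("Live At The Apollo"))
print(lowercase_small_words("Song - The Remix"))
```

A small word at the very start or very end of the field is never matched, because the pattern requires whitespace on both sides.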

DetlevD · March 18, 2011, 1:37am

Hmm, yes, but it does not replace the illegal characters with legal characters.
Additionally, the probability of getting duplicate filenames rises.

You may get "problems" with ... e. g.

"AC/DC"
"Harp Concerto in B-flat Major Op. 4 Nr. 6: 1: Andante-Allegro"
... but I am not entirely sure at the moment whether it will be a problem for you.
You will lose the slash and the colon.
Check it out yourself.
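
To make the difference between removing and replacing concrete, here is a small Python sketch. It assumes the Windows set of forbidden filename characters (`\ / : * ? " < > |`); the function name `sanitize` and the choice of "-" as the substitute are purely illustrative.

```python
import re

# Characters that Windows does not allow in file names.
WINDOWS_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def sanitize(name: str, replacement: str = "-") -> str:
    """Replace every forbidden character with the given replacement string."""
    return WINDOWS_FORBIDDEN.sub(replacement, name)


print(sanitize("AC/DC"))      # replaced with a legal character
print(sanitize("AC/DC", ""))  # simply removed
```

With an empty replacement, "AC/DC" collapses to "ACDC", so a name that only differed by the removed character now collides with another one. That is how duplicate filenames become more likely.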

DD.20110318.0338.CET

Doug_Mackie · March 19, 2011, 10:03pm

Hello Yog,

You may be interested to hear that I have followed your lead and am now using both methods. I now begin with a modified version of Dano's excellent example, followed by corrections, some using word lists. Some of the latter handle less common situations, such as source files that are all lowercase. The combination is working very well.

I did find that Dano's first regular expression could be simplified because many of the characters that he specified as word boundary markers are never used that way in the material that I work with. He used 19 markers:

([-({\[\]}) _",./+&@:;*])(\l)

I find that just seven are enough for my file names:

([-({ _.+])(\l)

and nine are enough for my artist, album, and title tags:

([-({ _.+\x{201c}"])(\l)

The square brackets are omitted as boundaries because I reserve those for my comments, which are always lowercase. The Unicode reference is to left curly quotation marks (“).

At first, I thought that more markers might be needed, but so far not. Having fewer markers speeds up the scripts, since every character in the source string must be compared with every boundary marker.
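
For comparison, the two reduced marker sets can be tried in Python like this. It is a sketch only: `[a-z]` stands in for `\l`, the helper name `capitalize_after` is made up, and Python spells the left curly quote as `\u201c` in a string literal rather than `\x{201c}`.

```python
import re

FILENAME_MARKERS = "-({ _.+"                 # the seven file name markers
TAG_MARKERS = FILENAME_MARKERS + "\u201c\""  # plus U+201C and the straight quote


def capitalize_after(markers: str, text: str) -> str:
    """Upper-case a lowercase letter that follows one of the marker characters."""
    # re.escape keeps every marker literal inside the character class.
    pattern = re.compile("([" + re.escape(markers) + "])([a-z])")
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


print(capitalize_after(FILENAME_MARKERS, "01 - my [live] song"))
print(capitalize_after(TAG_MARKERS, "\u201chello\u201d world"))
```

Because the square brackets are not in either set, a bracketed comment such as "[live]" stays lowercase.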

I've tweaked and refined other elements in my scripts, and have updated the zip file:

Title Case Scripts Link updated 22 April 2016

I also fixed a bug in the Latinisms sections of my previously posted scripts. There was an unescaped period that could cause the misspelled contraction Ive to be forced to lowercase.

Cheers,
Doug

yog-sothoth · March 19, 2011, 11:38pm

Good work, Doug. That script is taking shape quite nicely. Actually, I'm in the process of putting together my own one, which I intend to post here when finished. I'd like to incorporate some bits of your script into it too - with your permission, of course. There's just one obstacle that remains: Roman numerals. I've made a new thread in the support forum to address this. Perhaps when you have the time you could take a look at it? Thanks.

Doug_Mackie · March 20, 2011, 4:02pm

Yes, certainly. As far as I am concerned, this is a group project.

As for Roman numerals, it looks like DetlevD has a better solution than I could come up with. I simply included the two-letter values below ten in my abbreviation list, which was adequate for my purposes.

dano · March 20, 2011, 5:45pm

([-({ _.+\u201c"])(\l) is not correct in Mp3tag.

This should be the right syntax:
([-({ _.+\x{201c}"])(\l)

Doug_Mackie · March 20, 2011, 9:37pm

Dano, thank you for spotting my error. What confused me was that the \u201c syntax appeared in MTA files (which is how I discovered it), and the fact that it worked as expected. The first time that I typed “ into a regular expression field, it appeared as \u201c in the MTA file. Since I was having trouble seeing this character in the tiny font used in Mp3tag input boxes, I thought that a Unicode reference would be clearer.

It seems that in MTA files, Mp3tag uses double backslashes (\\u) to distinguish an uppercase character reference from an internal \u-style Unicode reference. Even more confusing is that in languages like .NET that support regular expressions, \x is used for hex values and \u for Unicode numbers!

Anyway, I have edited my posted MTA files and changed my example above to the correct syntax.

Regards,
Doug Mackie

DetlevD · March 21, 2011, 6:58am

MTA files look like INI-style plain-text files, in which special characters need escaping.
See also: http://en.wikipedia.org/wiki/INI_file
The Mp3tag developer might explain how Mp3tag handles this internally, but since MTA files are not documented as something for the user to inspect or edit by hand, the answer is already given: hands off!

DD.20110321.0858.CET

yog-sothoth · March 23, 2011, 3:07pm

Update: The script is now complete, except for a couple of minor bugs which I'd like to get fixed before release. Once again, I'm relying on you guys to help me out here, as I'm completely stuck.

The first bug relates to the spacing inside acronyms. For acronyms without a stop at the end (e.g. "M.I.A"), the script places a space between the last stop and the last character. Hence, "M.I.A" becomes "M.I. A". Notice the space before "A". Obviously, I want to prevent the extra space from occurring.

I looked closely at the script and noticed that in the "replace matches with" field, there was a trailing space after $0. I removed the space and repeated the action. This time the acronym "M.I.A" was not split by an extra space. My concern, then, is this: was the space after $0 intentional, or is it a mistake? Can I safely remove it without making the action invalid?
Here's the relevant action:

Description: Word Spacing
Action type: Replace with regular expression
Field: _All
Regular expression: (\.)(?![\s.':;)\]"])(?![\l\u]\.)(?!$)(?!\d)(?!(co|net|org|gov|edu|mil))
Replace matches with: $0 (ed: there is a space after 0)
[ ] case-sensitive comparison

The second bug concerns case conversion of "Dj". If Dj is the first word after a parenthesis, a space is placed between the parenthesis and the D. Hence, "(Dj Remix)" becomes "( DJ Remix)". I'd like to prevent the redundant space from occurring. As in the first bug, there is a space in the "replace matches with" field that, if removed, stops the error. This time it's at the beginning of the line: " DJ$1". Again, I'm not sure if it is a mistake or intentional. If I remove it, will it invalidate the script? Edit: The space actually occurs anywhere in the line, not only after a parenthesis.

Description: Upper-Case conversion of Dj
Action type: Replace with regular expression
Field: _All
Regular expression: dj($|\s)
Replace matches with: DJ$1 (ed: the first character is a space)
[ ] case-sensitive comparison
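
The effect of the leading space can be reproduced with a short Python sketch. It assumes a case-insensitive match, since the case-sensitive box above is unchecked; Python writes the back-reference as `\1` instead of `$1`, and `fix_dj` is a hypothetical helper name.

```python
import re

DJ = re.compile(r"dj($|\s)", re.IGNORECASE)


def fix_dj(text: str, replacement: str) -> str:
    """Replace 'dj' followed by whitespace or end of text with the replacement."""
    return DJ.sub(replacement, text)


print(fix_dj("(Dj Remix)", r" DJ\1"))  # replacement with the leading space
print(fix_dj("(Dj Remix)", r"DJ\1"))   # replacement without it
```

The pattern never consumes the character before "dj", so a space at the start of the replacement is always added on top of whatever already precedes the match.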

Thanks again for your assistance, guys.

dano · March 23, 2011, 6:44pm

A change in the first action can solve the problem:
Regular expression: (\.)(?![\s.':;)\]"])(?![\l\u](\.|$))(?!$)(?!\d)(?!(co|net|org|gov|edu|mil))

The space in the second action is most likely a bug.

yog-sothoth · March 23, 2011, 7:58pm

This only works if there are no words after the acronym. For instance, "M.I.A featuring..." converts to "M.I. a Featuring...". Also, a space is now placed between the extension in the filename and the stop before it, a la ". mp3". This makes the file inaccessible until the space is removed.

I'm wondering what exactly this action tries to fix, and whether I should remove it altogether. The action above is part of a group of eleven actions, all of which concern spacing. From the end of the reg ex, I can deduce that it does something to URLs. I'm not entirely sure what the rest does, though. If you can figure it out, could you explain it to me? Thanks.

BTW, this is the source.

dano · March 23, 2011, 8:39pm

Read the description:

Add a space after a period. [...]

And it does nothing to URLs. It excludes them, so that, for example, ".net" is not changed to ". net".

You could also add the space:
RE: (\.)(?![\s.':;)\]"])(?![\l\u](\.|$| ))(?!$)(?!\d)(?!(co|net|org|gov|edu|mil))

If you want to use this on the filename use the $regexp() function.
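
Here is how the final expression behaves when translated to Python, as a sketch under a few assumptions: the leading `(.)` is taken to be an escaped period, as the action's description ("Add a space after a period") suggests; `[a-zA-Z]` replaces `[\l\u]`; and the match ignores case because the case-sensitive box is unchecked. `add_space_after_period` is a made-up name.

```python
import re

SPACE_AFTER_PERIOD = re.compile(
    r"(\.)"                            # a literal period
    r"(?![\s.':;)\]\"])"               # not already followed by space or punctuation
    r"(?![a-zA-Z](\.|$| ))"            # not followed by a lone letter, as in M.I.A
    r"(?!$)(?!\d)"                     # not at the end, not before a digit
    r"(?!(co|net|org|gov|edu|mil))",   # not before a common domain suffix
    re.IGNORECASE,
)


def add_space_after_period(text: str) -> str:
    """Insert a space after each period that passes all the lookaheads."""
    return SPACE_AFTER_PERIOD.sub(r"\g<0> ", text)


for sample in ["M.I.A", "M.I.A featuring", "Mr.Smith", "example.net", "v1.5"]:
    print(repr(add_space_after_period(sample)))
```

Only "Mr.Smith" changes here. One side effect of the domain lookahead is that any word starting with "co", "net" and so on is skipped as well, so "Mr.Cooper" would not get a space either.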

yog-sothoth · March 23, 2011, 9:09pm

Thanks, that works. Much appreciated.