Remove Trailing Special Characters


#1

I frequently see trailing "boxes" which indicate non-word characters (not alpha-numeric) and I would like to remove them.

I've tried creating an ACTION, but I can't get REPLACE WITH REGULAR EXPRESSION to do what I want. The following action does not work.

Field: COMPOSER
Regular Expression: \W$
Replace With: (this is left blank)

This leaves the Composer Field intact.

As a workaround, I'm using the following, which deletes the last character, but this is not what I want because it also trims descriptive text.

Field: COMPOSER
Regular Expression: .{1}$
Replace With: (this is left blank)

This trims the field by one character.

How can I remove/delete trailing special characters?


#2

If you use MP3tag in an XP environment and read files whose tag data contains e.g. Chinese characters, these can be displayed as boxes because XP does not have the correct screen display font.
And as these characters are valid characters (they are just not displayed correctly), it is not really possible to find a way to remove them.


#3

Text with "boxes" in single-line tag fields does not appear by itself; it can be introduced, for example, by copy & paste.

In most cases, these are the non-displayable control characters CR for "carriage return" or LF for "line feed", as used in multi-line texts as end of line, sometimes it is the control character for TAB "tabulator".

As 'ohrenkino' already said in post #2, the box characters can also be readable foreign text or graphical symbols that cannot be represented by the current screen font, so that a box character appears as a replacement.

In case of CR or LF you can use the Mp3tag trimming functions, e. g. ...

$trimRight(%COMPOSER%,$char(13)$char(10))

... or a regexp function, which removes all non-word characters at end of the string ...

$regexp(%COMPOSER%,'\W*\Z',)
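To illustrate what these two expressions do, here is a rough Python sketch. Python's str.rstrip and re.sub only stand in for $trimRight and $regexp, and the sample composer values are made up. Note that the regular expression also removes trailing punctuation, not just control characters.

```python
import re

def trim_crlf(value: str) -> str:
    # Rough analogue of $trimRight(%COMPOSER%,$char(13)$char(10)):
    # strip only trailing CR and LF characters.
    return value.rstrip("\r\n")

def strip_trailing_nonword(value: str) -> str:
    # Rough analogue of $regexp(%COMPOSER%,'\W*\Z',):
    # remove every non-word character at the end of the string.
    return re.sub(r"\W*\Z", "", value)

# Hypothetical sample values, only for illustration.
print(repr(trim_crlf("Johann Sebastian Bach\r\n")))
print(repr(strip_trailing_nonword("Mozart (arr.)\r\n")))
```

The second call also drops the closing ".)" of the sample value, which may or may not be wanted for descriptive text.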

See also ...
probleme mit ä ü ö

DD.20140314.0849.CET
DD.20140320.2012.CET


#4

Thank you for your help. I can't wait to try your solutions. At the moment I don't have any "boxes" to remove. I will post back when I can apply your solutions.


#5

I got the chance to experiment with your idea, but I must not understand how to apply it.
This is what I've tried with no success:

Method 1...
I created an ACTION with "REPLACE using REGULAR EXPRESSION"
Field: COMPOSER
Regular Expression:' \W*\Z'
Replace With: (this is left blank)

This leaves the Composer Field intact. I still see the boxes.

Method 2...
I tried the regex: \W*\Z (unquoted)

Method 3...
I tried the regex: \W* (unquoted)

Method 4...
I created an ACTION using FORMAT VALUES
Field: COMPOSER
Format String:$regexp(%COMPOSER%,'\W*\Z',)

This leaves the Composer Field intact. I still see the boxes.

I'm sorry that I didn't know what else to do with your helpful suggestion.

What did I do wrong?


#6

Method 1: Bad syntax, here it should be only: \W*\Z

Method 2: This should work.

Method 3: This should remove at least all spaces, punctuation characters and other non-word characters from the given string.

Method 4: This should work.

Hm, your tests actually show that there are no non-word characters at the end of the string.
If there had been non-word characters, they would have been removed.
To be certain, look into the media file's tag with a hex editor.
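If no hex editor is at hand, a few lines of Python can produce a similar view. This is only a sketch: the function name hexdump is made up here, and bytes outside printable ASCII are shown as "." (a real hex editor may display them differently).

```python
def hexdump(data: bytes, width: int = 16) -> list:
    """Return classic hex-dump lines: offset, hex bytes, printable ASCII."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Show printable ASCII as-is, everything else as a dot.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        # Pad the hex column so the text column lines up on short last lines.
        lines.append(f"{offset:08X} {hex_part:<{width * 3 - 1}} {text_part}")
    return lines

# Sample bytes: the start of an ID3v2.3 tag followed by a TPE1 frame header.
sample = bytes.fromhex("49 44 33 03 00 00 00 00 10 13 54 50 45 31 00 00")
print("\n".join(hexdump(sample)))
```

To look at a real file, read its first few hundred bytes in binary mode (open(path, "rb")) and pass them to hexdump; the tag normally sits at the very beginning of the mp3 file.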

DD.20140318.2134.CET


#7

I'm unfamiliar with using a hex editor.

  1. Does MP3tag have this feature?
  2. What am I looking for?
  3. Is there any other way to determine what is causing the "boxes" to appear?

#8

I would like to point you again at post #2 ...


#9
  1. No.
  2. Locate the tag-field's key and value and examine the binary characters.
  3. Don't know (in general the user is the cause).
  • You can provide a text file, containing the hex list view of the tag data which exposes the behaviour.

  • You can provide a sample mp3 file, containing tag data which exposes the behaviour.
    We do not need the music, but only the tag data.
    Create a mp3 file, which may contain only one second of silence.
    Or use the notepad editor and create a simple text file of three characters, and save it as "Dummy.txt.mp3".
    Then use Mp3tag, load both mp3 files, original and dummy, then clipboard copy/paste the entire tag data from the first original media file to the second dummy file.

  • You can provide a text file, containing a tag dump, created by Mp3tag export feature, ...
    maybe this text file can be examined using a hex editor/viewer/lister.
    Download/copy/move this MTE file ...
    http://forums.mp3tag.de/index.php?act=atta...ost&id=5497
    ... into the folder ...
    (Win XP) %APPDATA%\Mp3tag\Export
    or (newer OS, where %APPDATA% already points to ...\AppData\Roaming) %APPDATA%\Mp3tag\Export\

Execute the Mp3tag export file of filetype "mte" against the file in question, ...
that means ... create a report output text file using the Mp3tag Export feature.
Then attach the output text file to your next forum message.

DD.20140320.0843.CET


#10

Back from traveling. Thank you for your suggestions. I will explore your suggestions and post results when I again have another "box".

BTW, I didn't mention before that I use WMP 10.0 on a WinXP Pro SP3 to Rip CDs. This is the program that creates the tag data and then I use MP3tag to edit these data. So it appears that there is some issue with MSFT's implementation of their own character sets. I use this older configuration b/c I still use an older MP3 player (Creative Nomad Zen Xtra).


#11

You may try experiments with changing the tag type and version, changing the character set of the tag and/or the codepage of the tag fields.
Search the forum for matching threads.

DD.20140324.1140.CET


#12

like e.g. this thread:
Unicode-Zeichen aus der Zwischenablage
(it's German though :-()
In short: it tells you that a universal font is missing in XP - so there is not much of a point to get this straight.
And as XP XPires in April anyway ...


#13

It's not quite so dramatic as that. :-)

Microsoft support for Windows XP expires in April (unless you pay for extended support) and no more "bug fixes" or service packs will be released. So if it is currently 'broken' it will remain 'broken' for all eternity.


#14

I understand WinXP is obsolete and accept the future problems it will bring (a slow death with regards to webpage rendering).

But back to the discussion. I wanted to provide you with the hex dump of what is in my mp3 tags.

It appears that an extra character is spuriously inserted into the tags. This is a completely random event and I don't know why this occurs. But here are the hex dumps for the sake of providing more information to anyone who may have a similar problem and who wishes to find a remedy for deleting these extra characters.

I got a hex editor and viewed the contents of the *.mp3. Here's what I've found.

When a "box" appears in the Artist Tag...
54 00 68 00 65 00 20 00 00 46 00 72 00 61 00 79 00 00 79 54 50 45 32
T h e F r a y y T P E 2

When the artist tag does not have a "box"
54 00 68 00 65 00 20 00 00 46 00 72 00 61 00 79 00 54 50 45 32
T h e F r a y T P E 2

I also noticed that when the "box" appears, the 'y' could also be any other extra character (e.g. 'e').

As for the composer tag, I've seen these extra characters after the last ASCII character:

'/' (hex: 00 2F) /TCON (rather than TCON)
'ÿÓ' (hex: FF D3) ÿÓTCON (rather than TCON)
'uR' (hex: 75 52) uRTCON (rather than TCON)
'=' (hex: 3D 00) = TCON (rather than TCON)
'v' (hex: 00 76) vTCON (rather than TCON)

As for the artist tag, I've seen this extra character after the last ASCII character:
'R' (hex: 00 52) RTPE2 (rather than TPE2)

The good news is that I've found the cause for the "box". It is due to an extra character. It is not due to a non-alphanumeric character (in most cases). And therefore using "Replace using RegEx", will not work.

The bad news is that this extra character throws off the character count for the tag. So now the remedy for automating the removal of the "box" is more complicated. The question must now be changed to ....

how can I verify the tag length?
how can I remove the character immediately preceding the termination code: TCON, TPE1, TPE2, etc. ?

Thank you all for helping in guiding me towards finding the source of the problem.


#15

It seems that your hex example is bad.
For a fresh ID3v2.4 UTF-8 tag, containing only two tag-fields, ARTIST (TPE1) and ALBUMARTIST (TPE2), the hex dump should look like this ...
00000000 49 44 33 04 00 00 00 00 10 13 54 50 45 31 00 00 ID3.......TPE1..
00000010 00 09 00 00 03 54 68 65 20 46 72 61 79 54 50 45 .....The FrayTPE
00000020 32 00 00 00 09 00 00 03 54 68 65 20 46 72 61 79 2.......The Fray

Where do you see an extra character?

Hmm, there might be some confusion with the character encoding.
Presumably the other application, which you use, has a problem.

For a fresh ID3v2.3 UTF-16 tag, containing only two tag-fields, ARTIST (TPE1) and ALBUMARTIST (TPE2), the hex dump should look like this ...

00000000 49 44 33 03 00 00 00 00 10 13 54 50 45 31 00 00 ID3.......TPE1..
00000010 00 13 00 00 01 FF FE 54 00 68 00 65 00 20 00 46 .....ÿþT.h.e. .F
00000020 00 72 00 61 00 79 00 54 50 45 32 00 00 00 13 00 .r.a.y.TPE2.....
00000030 00 01 FF FE 54 00 68 00 65 00 20 00 46 00 72 00 ..ÿþT.h.e. .F.r.
00000040 61 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 a.y.............
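To verify the length of a tag-field by hand, the dump above can be walked frame by frame. The following Python sketch does this for the ID3v2.3 example only; the function names are made up, and it assumes the simple case of text frames without compression, encryption or an extended header.

```python
import struct

# The ID3v2.3 UTF-16 dump from above, byte for byte.
DUMP = bytes.fromhex(
    "49 44 33 03 00 00 00 00 10 13 54 50 45 31 00 00 "
    "00 13 00 00 01 FF FE 54 00 68 00 65 00 20 00 46 "
    "00 72 00 61 00 79 00 54 50 45 32 00 00 00 13 00 "
    "00 01 FF FE 54 00 68 00 65 00 20 00 46 00 72 00 "
    "61 00 79 00 00 00 00 00 00 00 00 00 00 00 00 00 "
)

def decode_text(body: bytes) -> str:
    # First byte of a text frame is the encoding: 0 = ISO-8859-1, 1 = UTF-16 with BOM.
    if body[0] == 0:
        return body[1:].decode("latin-1").rstrip("\x00")
    if body[0] == 1:
        return body[1:].decode("utf-16").rstrip("\x00")
    raise ValueError("unexpected text encoding byte for ID3v2.3")

def parse_id3v23_frames(data: bytes) -> list:
    if data[:3] != b"ID3" or data[3] != 3:
        raise ValueError("not an ID3v2.3 tag")
    # The tag size is "synchsafe": four bytes carrying 7 bits each.
    tag_size = 0
    for b in data[6:10]:
        tag_size = (tag_size << 7) | (b & 0x7F)
    pos, end = 10, min(len(data), 10 + tag_size)
    frames = []
    while pos + 10 <= end:
        frame_id = data[pos:pos + 4]
        if frame_id == b"\x00\x00\x00\x00":
            break  # reached the padding
        # In ID3v2.3 the frame size is a plain 32-bit big-endian number.
        size = struct.unpack(">I", data[pos + 4:pos + 8])[0]
        body = data[pos + 10:pos + 10 + size]
        frames.append((frame_id.decode("latin-1"), size, decode_text(body)))
        pos += 10 + size
    return frames

for name, size, text in parse_id3v23_frames(DUMP):
    print(name, size, repr(text))
```

Each frame declares 0x13 = 19 bytes: 1 encoding byte, 2 bytes for the byte order mark FF FE, and 16 bytes for the eight UTF-16 characters of "The Fray". If a stray character were present in a file, the declared size or the position of the next frame label would no longer fit this pattern.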

DD.20140331.1345.CEST


#16

The extra character is the final "00 79" immediately before "54 50 45 32" (TPE2) in the first dump below. It shouldn't be there. And in an earlier post, I showed that this extra character can be something else.

This extra character appears as a "box" in the Artist Tag in MP3tag despite the fact that it looks like a valid ASCII character (the letter y).
54 00 68 00 65 00 20 00 00 46 00 72 00 61 00 79 00 00 79 54 50 45 32
T h e F r a y y T P E 2

When the artist tag does not have a "box" in MP3tag:
54 00 68 00 65 00 20 00 00 46 00 72 00 61 00 79 00 54 50 45 32
T h e F r a y T P E 2

Thanks for spending time thinking about this. I know that MP3tag did not insert this extra "y". And that MP3tag is working perfectly. I'm just trying to figure out how to create an action that can automate removing these "boxes". All I now know is that using regex \W doesn't work b/c "00 79" is not a non-alpha numeric character.


#17

You are still providing a bad hex view example.
It does not show the leading label TPE1 or the byte size of the following ARTIST value.
The label TPE2 is the start point of the next value, ALBUMARTIST.

You provide an example of the tag-type ID3v2.3 UTF-16.
Due to UTF-16 encoding each character consumes two bytes.
The letter "y" is encoded into "79 00" (little-endian).
Mp3Tag writes UTF-16 LE tags.

In your example the failing letter "y" is encoded into "00 79".
This looks like the character "y" is written as UTF-16 BE (big-endian).

This suggests that another application or some special hardware (PowerPC?) is involved, which writes the bad encoding.

Because of the swapping of the two bytes from "79 00" to "00 79" the Latin character "y" (UTF-8: 79) changes to the Kanji character "礀" (UTF-8: E7A480), which is still a legal word character.
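The effect of the byte swap can be reproduced with a short Python sketch. Python's re module is only a stand-in for the regular expression engine in Mp3tag, so treat the \W result as an illustration rather than a guarantee.

```python
import re

# Decode both byte orders as UTF-16 little-endian, the way the tag is read.
good = bytes([0x79, 0x00]).decode("utf-16-le")     # bytes in the expected order
swapped = bytes([0x00, 0x79]).decode("utf-16-le")  # the same two bytes, swapped

print(good, swapped, swapped.encode("utf-8").hex())

# The swapped character is a CJK ideograph, which counts as a word character,
# so stripping trailing non-word characters leaves the string untouched.
artist = "The Fray" + swapped
cleaned = re.sub(r"\W*\Z", "", artist)
print(cleaned == artist)
```

This is why an action based on \W leaves the field intact even though a box is displayed.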

You can perform an experiment.

  1. In the dialog "Mp3tag Options/Tags/Mpeg" change the writing to "ID3v2.3 ISO-8859-1".
  2. Save the test file.
  3. Use the dialog "Extended Tags..." and check the content of the tag-fields.
  4. In the dialog "Mp3tag Options/Tags/Mpeg" change the writing to "ID3v2.3 UTF-16".
  5. Save the test file.
  6. Use the dialog "Extended Tags..." and check the content of the tag-fields.
  7. Please report the results.

Presumably the foreign letter "礀" will be transformed into a question mark "?".
You can remove this question mark by using Mp3tag action or scripting function.
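What this experiment is expected to do can be sketched in Python. The sketch assumes that a character without an ISO-8859-1 equivalent is replaced by "?", which is how Python's "replace" error handler behaves; whether Mp3tag does exactly the same is the point of the experiment.

```python
import re

# Sample value with the mis-encoded trailing character U+7900.
artist = "The Fray\u7900"

# Steps 1-3: writing as ISO-8859-1 cannot represent U+7900, so it becomes "?".
latin1_bytes = artist.encode("latin-1", errors="replace")
round_trip = latin1_bytes.decode("latin-1")
print(round_trip)

# Afterwards: remove a single trailing question mark.
# The "?" must be escaped, because it is a quantifier in regular expressions.
fixed = re.sub(r"\?$", "", round_trip)
print(fixed)
```

Keep in mind that this would also remove a question mark that legitimately ends a title or name.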

DD.20140403.1120.CEST.


#18

Sorry. I would have copied more of the ASCII but the hex editor I chose didn't allow for cut & paste. So I transcribed a small snippet. And being that I don't know what this means (where it starts and ends) I didn't give a full and complete section.

Thanks for explaining that the TPE2 is the start point.

You're a Genius! Your solution worked perfectly. Thank you.
Yes, the "boxes" were converted into "?"s and then I created an action which removed the trailing "?"

Action
FORMAT VALUE
FIELD: ARTIST
FORMAT STRING: $regexp(%ARTIST%,'\?$',)

Fantastic. Thanks. Problem solved.


#19

Instead of transforming the entire tag into ISO-8859-1 character encoding and back, you may use ...

Action "Format value" or Convert "Tag - Tag"

Field: ARTIST
Formatstring: $regexp(%ARTIST%,'[[:unicode:]]',)

This removes any extended character whose code point is above 255 in value.
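For comparison, the same idea in Python. Python's re module has no [[:unicode:]] class, so the sketch below uses an explicit range instead; the function name is made up.

```python
import re

def strip_above_255(value: str) -> str:
    # Remove every character whose code point is greater than 255 (0xFF).
    return re.sub(r"[^\x00-\xff]", "", value)

print(strip_above_255("The Fray\u7900"))
# Caution: legitimate letters above 255 are removed as well,
# e.g. the "ř" (U+0159) in this name, while "á" (U+00E1) is kept.
print(strip_above_255("Dvořák"))
```

So this approach is best applied only to the files that actually show the boxes, not to a whole library containing names in other scripts.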

DD.20140403.1619.CEST


#20

Thanks for this improvement and thank you teaching me how to use a few more of MP3tag's powerful functions. It's truly amazing what you have built into MP3tag. I have yet to find something that it can not do.