[F] Shift-JIS - some characters cause problems

bug-fixed

#1

This bug has been annoying me for as long as I can remember. I've been using Mp3tag for over a year now and thought I should report it, although looking through some of the support forum posts, I'm sure it's been touched on already. Sorry in advance if this post seems a little long-winded, but I aim to cover everything that I have found.

Scenario:

I often need to tag a large number of files and find Mp3tag a fantastic tool for allowing me to get the job done quickly. With freedb integration, tagging a ripped CD is almost as easy as just clicking a button. At least, I'm sure it is for most people.

Unfortunately, most of the files I tag use Japanese text encoded using Shift-JIS. 95% of the time, everything works correctly. But sometimes things go horribly wrong.

Example:

I'll introduce you to the bug through a worked example.

Below, we have a fictional CD single that I created for the purpose of this post containing two tracks - "Mysterious" and "Mahou no Soda" - performed live by a non-existent artist, Hanako Ishida. As you can see, everything appears almost as you would expect it to. The only clue that something isn't quite right is that the title for track 2 in the list only says "Mahou no" - the end is missing.

:slight_smile: Now let's try track 2.

:frowning:

Okay, let's have one last go at trying to title our track using the tag panel to edit the title instead of editing it in-line in the list.

:slight_smile: Although the list on the right still only shows the start of the track title. <_<

Explanation: (or an attempt at one)

As I said at the start of this post, I've been living with this annoyance for over a year now - plenty of time to further investigate what might be causing it.

The problem only occurs when certain characters are encountered in a string, which is why the first track was easy to rename while the second wasn't. The second track title contains a "problem" character - that innocent-looking bar towards the end that makes up part of the word "soda".

For those unfamiliar with multi-byte character encoding schemes such as Shift-JIS, a character may be represented by more than one byte, not just one as in ASCII or most traditional Western encodings. Shift-JIS uses two bytes for most Japanese characters (kana, kanji and full-width symbols). In Shift-JIS, this dash-like character (a chouon, to be exact) is represented by the two bytes 0x81 and 0x5B. Testing has led me to discover that any two-byte character containing any of the bytes 0x5B, 0x5C or 0x5D causes serious problems in Mp3tag.
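To make the byte values concrete, here is a minimal Python sketch that lists the two-byte characters in a string whose second byte is 0x5B, 0x5C or 0x5D. It assumes Python's `shift_jis` codec agrees with Windows for these characters, and that "soda" is spelled ソーダ in katakana; the name `risky_chars` is made up for illustration.

```python
# Byte values that also mean "[", "\" and "]" in ASCII.
PROBLEM_BYTES = {0x5B, 0x5C, 0x5D}

def risky_chars(text, encoding="shift_jis"):
    """Return (character, hex bytes) pairs for two-byte characters
    whose second byte is one of PROBLEM_BYTES."""
    found = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) == 2 and encoded[1] in PROBLEM_BYTES:
            found.append((ch, encoded.hex()))
    return found

# Assumed katakana spelling of "soda" from the example title.
print(risky_chars("ソーダ"))
```

Under those assumptions, the chouon (0x81 0x5B) is reported, and so is ソ, whose second byte is 0x5C, the backslash.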

0x5B and 0x5D represent the bracket characters "[" and "]" in ASCII, while 0x5C is a backslash. Entering brackets normally doesn't cause any strange behaviour. The problem only appears when one of these bytes is part of a sequence that Windows identifies as a multi-byte character.

I'm unsure exactly why this is as I obviously don't know what's going on under that nice, user-friendly interface. It is possible that an internal mechanism for escaping certain bytes prior to some kind of parsing process is to blame as it would appear to be working on a per-character level rather than a per-byte level. If this is the case, it would be fine if the underlying parser also understood the concept of characters, but it seems to be byte-oriented. I am also unsure why using the text field in the tag panel on the left to alter the ID3 tag seems to bypass this parsing process.
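The difference between a character-oriented and a byte-oriented search can be sketched in a few lines of Python. This is purely hypothetical and says nothing about how Mp3tag is actually implemented; it only shows that a byte-level scan sees ASCII bracket and backslash codes that a character-level scan does not. The title string is an assumed katakana spelling of "soda".

```python
title = "ソーダ"                      # assumed spelling, for illustration only
encoded = title.encode("shift_jis")  # the bytes a non-Unicode program would see

# Character-aware search: the text contains no bracket or backslash.
char_hits = [c for c in title if c in "[\\]"]

# Byte-oriented search: the same ASCII codes turn up as second bytes.
byte_hits = [hex(b) for b in encoded if b in b"[\\]"]

print(char_hits)
print(byte_hits)
```

A parser that works on bytes would therefore react to a backslash and an opening bracket that the user never typed.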

As for the backslash, entering it in a title, either in-line in the list or via the tag panel, produces some undesirable results, although the exact result differs depending on which mechanism you use (but why should it?). Surely there should be a system in place for handling/escaping user-entered backslashes?

I'd love to continue to use Mp3tag for tagging, but as the number of files I need to tag increases, this issue is becoming more and more annoying. If there is anything I can do to help or if you would like more information, please let me know. Also, if anyone has any solutions/workarounds, they would be most appreciated.

  • SoZ

(Mp3tag 2.32a on Windows XP)


#2

Did you try the latest development build? Your Mp3tag version does not support Unicode - Unicode was added in 2.32p.


#3

I tried 2.32s on the off chance that adding Unicode support might have had a bonus side effect of allowing the particular bytes I mentioned in my first post to be handled by Mp3tag without being eaten. I pointed it in the direction of the same fictional CD I used earlier and this was the result:

Screenshot

Not quite what I was hoping for. I'm sure if I re-entered all of the text again, it would quite happily use UTF-8/16 to store the string with no issues. But I don't want to use Unicode. I want to use my native OS encoding scheme. There is an option [Always write ISO-8859-1 tags instead of UTF-16], but as the name would suggest, this only enforces the encoding used for writing tags, not reading them.

However, as I suspected, the new build does include some changes related to the handling of certain bytes. In 2.32s, I can freely edit the text in the filename field and hitting enter will cause the filename to be saved correctly. No mangling of text, no errors. Everything works perfectly. Attempting the same procedure in 2.32a for track 2 on my example CD would have renamed the file to "[ SYNTAX ERROR IN FORMATTING STRING ].mp3".

A combination of elements from these two builds would be nice - the new code for handling certain bytes used in 2.32s coupled with the 2.32a's OS-dependent encoding for characters. :slight_smile: Perhaps if the option was available to disable Unicode support completely (without disabling the parser changes that were implemented to make Unicode support possible) in builds based on the current 2.32s codebase, it would work as I want it to.

  • SoZ

P.S. Apologies for mangling the forum layout with my oversized screenshots.

P.P.S. Thanks dano. :slight_smile:


#4

What are your reasons?


#5

All of the MP3 files I store have ID3 tags with strings encoded in Shift-JIS. Both my software and hardware players support Shift-JIS. To move over to Unicode would require re-tagging all of my files, finding new software utilities to work with them and the purchase of a new hardware MP3 player. It's a solution which is feasible, but it's not one that fills me with joy.

Yes, I know this is the 21st century. :slight_smile: But the sad truth is that Unicode encoding schemes such as UTF-8 haven't yet been adopted widely enough for region-specific encoding schemes to be phased out. I'd love to see everything in Unicode, but until the rest of the world moves on, I can't either.

  • SoZ

#6

Thanks for your good explanation :slight_smile: (and the detailed bug report, too)
What I can tell you ATM is:

  1. Mp3tag will probably not support the system codepage again.
  2. The next build will have a codepage conversion action to transfer tags to Unicode.

I don't know how easy it would be to fix the behaviour described in your first post, or whether Florian wants to release a non-Unicode build again.
Wait till Florian answers here and makes a decision.


#7

I had a go at using the codepage conversion function in 2.32t to see how well it would handle my problematic tags. Actually, it did quite a reasonable job:

:slight_smile:

  • SoZ

#8

SoZ,

thanks for your detailed reports and your feedback on the convert codepage feature. Can you please send me a file with tags that caused the problems? What's the number behind the codepage name you've used for conversion?

Thank you!

Best regards,
~ Florian


#9

It's probably easier if I attach the example files here. I hope you don't mind.

example1.mp3 (10 KB)
example2.mp3 (10 KB)

I created two example files, both wonderful 64kbps transcodings of the Windows "ding" sound - the two example files I was using would have been rather large to upload. I thought I'd vary the software used to tag the files to see what difference it made, so I dragged an old copy of Winamp out to see how it fared.

example1.mp3 was tagged in Winamp v5.08e using in_mp3 v3.08.

example2.mp3 was tagged in Mp3tag v2.32a.

Both files are 10,240 bytes in size and contain an ID3v2 tag.

Winamp in_mp3 tag editor

example1.mp3
:frowning:

Mp3tag v2.32a

example1.mp3

example2.mp3

Mp3tag 2.32a displays both tags correctly in the tag panel on the left. It still has issues displaying the end of the title tag that it created in the main list though.

Mp3tag v2.32t

I tried to perform a local codepage to Unicode conversion using 2.32t. My local codepage is Japanese (932), which is the option I chose in Mp3tag. Note: The screenshots below show the files after codepage conversion.

example1.mp3

example2.mp3

Interesting. Mp3tag had no problem converting the Winamp ID3 tag. The second file, tagged by Mp3tag, fared less well.
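Conceptually, a local codepage to Unicode conversion reinterprets bytes that were read with the wrong encoding. The Python sketch below is an assumption about the general idea, not a description of Mp3tag's conversion action: it simulates a Shift-JIS (codepage 932) string being read as ISO-8859-1 and then repaired. Both function names are hypothetical.

```python
def misread_as_latin1(text, codepage="cp932"):
    """Simulate a reader that treats codepage bytes as ISO-8859-1."""
    return text.encode(codepage).decode("latin-1")

def reinterpret(misread, codepage="cp932"):
    """Recover the original bytes and decode them with the real codepage."""
    return misread.encode("latin-1").decode(codepage)

original = "ソーダ"   # assumed katakana spelling of "soda"
garbled = misread_as_latin1(original)
print(repr(garbled))
print(reinterpret(garbled))
```

The round trip only works when the stored bytes are intact. A title that has already been split into two fields at a 0x5C byte no longer forms one valid Shift-JIS sequence, which may be why the second file fared worse.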

I hope this information is useful.

  • SoZ



#10

Sorry, forgot to give you an explanation:
The \ character is used to separate multiple tag fields of the same name. E.g. if you enter Test1\Test2 in the Artist field in the tag panel, you will have 2 Artist fields. Check with ALT+T.
So your symbol triggered this. In the tag panel, it looks like one tag field, but it actually is two.
In the last version, it was changed to \\ to allow saving single \ chars.
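The splitting rule described above can be sketched in Python as follows. This is a hypothetical illustration, not Mp3tag's code: `split_values` follows the rule that a single backslash separates values and a doubled backslash stands for a literal one, while `split_bytes` shows what a byte-level split does to Shift-JIS text. The katakana string is an assumed spelling of "soda".

```python
def split_values(text):
    """Split on single backslashes; a doubled backslash is a literal one."""
    values, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if text[i + 1:i + 2] == "\\":
                current.append("\\")   # escaped backslash, keep one
                i += 2
                continue
            values.append("".join(current))  # separator, start a new value
            current = []
        else:
            current.append(ch)
        i += 1
    values.append("".join(current))
    return values

def split_bytes(text, encoding="shift_jis"):
    """Byte-level split on 0x5C, ignoring character boundaries."""
    return text.encode(encoding).split(b"\\")

print(split_values("Test1\\Test2"))   # two values
print(split_values("AC\\\\DC"))       # one value containing a backslash
print(split_bytes("ソーダ"))           # the bytes of ソ are cut in half
```

With the byte-level split, the first piece ends in a lone lead byte (0x83), which fits the symptom of a title being saved as two fields.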


#11

Yes, dano is right: the backslash triggered the saving of two TITLE fields in Mp3tag V.2.32a.

Removing the two yen symbols (or backslashes) from the tag panel on the left should fix the problem.

Thanks again!

Best regards,
~ Florian


#12

You're quite right. The title on example2.mp3 has been split between two fields. I hadn't noticed that before and it's not exactly clear from the tag panel on the left as it seems to concatenate all of the title fields.

While repairing a UTF-16 tag under 2.32t is relatively simple, how does one perform the same procedure with 2.32a? I cannot see any backslashes to remove in that version. I also tried to modify the tag using the tag viewer (Alt-T), but one of the title fields (the second "title" field in the image for example2.mp3) could not be removed or modified.

example1.mp3

example2.mp3

Before we go any further, do you honestly think there's a chance that Mp3tag will one day properly support more challenging regional encoding schemes, or am I better off just making the switch to Unicode now? Switching to Unicode still seems like a lot of hard work (not to mention the expense of new equipment), but I'm starting to wonder if it would actually be easier in the long term.

  • SoZ

#13

Sorry, but bringing back support for ID3v2 tagging in the system codepage is not very likely at the moment. The main reason is that it's not specified in the ID3v2 standard.

Best regards,
~ Florian


#14

Understood. I guess it's time I stopped trying to resist change and made the move to Unicode. It's not going to be easy, but it's only going to get harder the longer I leave it. :slight_smile: I'll have a retagging party sometime next month.

Thanks for all your help over the last few days. Florian, I'm afraid I can't donate much (out of work at the moment :frowning: ), but it should be enough for you to buy yourself a drink or two with. (I'm sure you can argue that it's going to help Mp3tag somehow. :slight_smile: )

Thanks again~

  • SoZ

#15

That's exactly why I decided to implement Unicode support some months ago :slight_smile:

Thanks for your donation and thanks again for your detailed feedback! It's both much appreciated :slight_smile:

Best regards,
~ Florian