[X] Regexp error

DetlevD · February 23, 2010, 9:17pm

tihS can happen ... so it goes ... is there any chance to use a special notation like "U+..." or such to address unicode characters in a regular expression?

DD.20100223.2317.CET

I've read this ...

To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. E.g. \u00E0 matches à, but only when encoded as a single code point U+00E0.

Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.
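The "only when encoded as a single code point" remark in the quote can be tried out with a small sketch. It uses Python's `re` module purely as a stand-in: that module accepts `\uXXXX` and the eight-digit `\UXXXXXXXX` form, it is not the regex engine used by Mp3tag, and the variable names are made up for the illustration.

```python
import re
import unicodedata

# "à" as one precomposed code point, U+00E0.
composed = "\u00E0"
# The same letter as "a" followed by the combining grave accent U+0300.
decomposed = unicodedata.normalize("NFD", composed)

# Python's re module understands \uXXXX (exactly four hex digits) in a pattern.
four_digit = re.compile(r"\u00E0")
matches_composed = four_digit.search(composed) is not None
matches_decomposed = four_digit.search(decomposed) is not None

# Code points above U+FFFF need the eight-digit \UXXXXXXXX form in Python;
# four digits are not enough to address them.
astral = re.compile(r"\U0001D49E")
matches_astral = astral.search(chr(0x1D49E)) is not None

print(matches_composed, matches_decomposed, matches_astral)
```

Other engines spell the long form differently (Perl and PCRE use `\x{1D49E}`, as quoted above), so this only shows the principle, not a syntax that Mp3tag is known to accept.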

... and that ...

Java, XML and the .NET framework use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that PCRE is far less flexible in what it allows for the \p tokens, despite its name "Perl-compatible". The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
See also:
http://www.regular-expressions.info/unicode.html

There is evidently a regex limitation of four hex digits, so the escapes with five or more hex digits in the UniToAsc.Style.mta file are what cause the error.

Is there a way around?
When 0x1D49E is saved to a Unicode file, the hex dump shows the following bytes in sequence: FF FE 35 D8 9E DC
This is a BOM followed by UTF-16 Little Endian encoded Unicode text.
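The byte sequence can be rechecked with a few lines of Python. This is only a sketch of the standard UTF-16 surrogate arithmetic; the function name `utf16_surrogates` is made up for the illustration.

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


high, low = utf16_surrogates(0x1D49E)

# BOM for little endian (FF FE) followed by the character in UTF-16LE.
file_bytes = b"\xff\xfe" + chr(0x1D49E).encode("utf-16-le")
hex_dump = " ".join("%02X" % b for b in file_bytes)

print("%04X %04X" % (high, low))   # D835 DC9E
print(hex_dump)                    # FF FE 35 D8 9E DC
```

In little endian order each 16-bit unit is written low byte first, which is why D835 appears as 35 D8 and DC9E as 9E DC.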

Perhaps these erroneous five-hex-digit regex literals can easily be transcoded from UTF-32LE to UTF-16LE surrogate pairs, giving "two characters" each, but this will collide with the regex machine's "one code point = one character" view.
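Such a transcoding could look roughly like the following sketch. It assumes that the literals in the .mta file are written in the `\x{...}` notation and that the regex engine would accept two four-digit escapes in a row; neither is confirmed here, and `expand_astral_escapes` is a hypothetical helper, not an Mp3tag function.

```python
import re

# Matches escapes such as \x{1D49E} (assumed notation, see above).
_BRACED_HEX = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def expand_astral_escapes(pattern):
    """Rewrite \\x{...} escapes above U+FFFF as two surrogate escapes."""
    def replace(match):
        code_point = int(match.group(1), 16)
        if not 0x10000 <= code_point <= 0x10FFFF:
            return match.group(0)          # four digits or fewer: leave as is
        offset = code_point - 0x10000
        high = 0xD800 + (offset >> 10)
        low = 0xDC00 + (offset & 0x3FF)
        return "\\x{%04X}\\x{%04X}" % (high, low)
    return _BRACED_HEX.sub(replace, pattern)


print(expand_astral_escapes(r"\x{1D49E}"))   # \x{D835}\x{DC9E}
print(expand_astral_escapes(r"\x{00E0}"))    # \x{00E0}
```

The collision mentioned above stays: inside a character class, or in front of a quantifier, the rewritten pair acts as two separate units, so every affected expression would still have to be checked by hand.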

Maybe a new Mp3tag function $ChrUnicodeToHex(), which converts a Unicode string to a hex string, could help by allowing search and replace within the hex string. I do not know; at the moment I am at my wits' end.

Stevest, for my part, as a simple Creativity Sharer, I will have a look at the problem, but this will take some time, so please be patient.

In the meantime I would suggest making a backup of the UniToAsc.Style.mta file and modifying the working copy by deleting all five-hex-digit regex literals.
The remaining part will then run with Mp3tag v2.45b.

Or use UniToAsc.Diacritics.mta, which may give good results on accented characters too.

DD.20100224.1747.CET
Edit.
DD.20100224.1920.CET