[X] Regexp error

Stevest · February 23, 2010, 5:47am

Hi,

one of my DetlevD's action files with lots of regexp replace actions doesn't work anymore; I get an error message repeatedly (one message per action in the file).

There is no problem with v2.45a.

Action file attached.

st

UniToAsc_Filename.mta (30 KB)

DetlevD · February 23, 2010, 10:18am

Hi Stevest, the name of the mta file suggests that you use one of my published UniToAsc action groups, maybe somewhat modified by you.

I've checked all released versions of my different UniToAsc mta files and cannot detect any problem. The action groups run fine and do their work of converting special Unicode characters to ASCII/ANSI characters using Mp3tag's regular expression functionality.

There have been no complaints since they were published sometime in August 2008.

But now, with the current developer version Mp3tag v2.45b, there is a problem, as you have pointed out.

The original UniToAsc.Style.mta action group does not work correctly from action [#37] to [#88], which handle the conversion of stylized Unicode characters to ASCII A to Z and a to z.
I have no concrete guess as to why this happens. It is rather mysterious.
And as you have confirmed, it was working before.

Stevest, as you have noted, there is a significant difference between versions v2.45a and v2.45b at this point.
The Mp3tag error message indicates an error in the regex engine, so Florian should look into it.

DD.20100223.1217.CET

Stevest · February 23, 2010, 1:17pm

Hi DetlevD,

yes, the mta file is your great work (sorry for saying my script, I should have said a script used by me), I use it very frequently, and it worked very well until v2.45a. I've reinstalled v2.45a, and it works again.

st

P.S.
Thanks for the action file again, it is one of the most useful scripts I have in MP3Tag.

DetlevD · February 23, 2010, 5:43pm

Steve, thanks for the credit, I am glad that you appreciate my work and that you can make good use of the tool. In this way I receive many well-intentioned brain waves. This in turn gives me a good feeling. Thank you!

DD.20100223.1943.CET

Florian · February 23, 2010, 7:30pm

The \x{} hexadecimal sequence only supports 16-bit sequences (e.g., \x{dddd}) and I currently see no way of expressing sequences above that limit.

If you still have 2.45a installed, you'll see that the longer sequences actually produce no match and do not result in the correctly replaced character.
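
To make the limit concrete: a 16-bit \x{dddd} sequence can only express values up to FFFF, while a value such as 1D49E needs 17 bits. The following Python sketch shows just that arithmetic. The helper name fits_in_16_bits is made up for illustration and says nothing about how Mp3tag's regex engine is implemented.

```python
def fits_in_16_bits(hex_digits):
    """Return True if the hex value can be held in one 16-bit unit."""
    return int(hex_digits, 16) <= 0xFFFF


# A four-digit value fits; a five-digit value such as 1D49E does not.
print(fits_in_16_bits("00E0"))
print(fits_in_16_bits("1D49E"))

# Number of bits needed to store 0x1D49E.
print(int("1D49E", 16).bit_length())
```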

DetlevD · February 23, 2010, 9:17pm

tihS can happen ... so it goes ... is there any chance of using a special notation like "U+..." or similar to address Unicode characters in a regular expression?

DD.20100223.2317.CET

I've read this ...

To match a specific Unicode code point, use \uFFFF where FFFF is the hexadecimal number of the code point you want to match. You must always specify 4 hexadecimal digits. E.g., \u00E0 matches à, but only when encoded as a single code point U+00E0.

Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.

... and that ...

Java, XML and the .NET framework use Unicode-based regex engines. Perl supports Unicode starting with version 5.6. PCRE can optionally be compiled with Unicode support. Note that PCRE is far less flexible in what it allows for the \p tokens, despite its name "Perl-compatible". The PHP preg functions, which are based on PCRE, support Unicode when the /u option is appended to the regular expression.
See also:
http://www.regular-expressions.info/unicode.html
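
The remark quoted above, that \u00E0 matches à only when it is encoded as a single code point, can be tried out in Python. This is only a sketch of the general behaviour: Python's re module is not the engine Mp3tag uses, and it takes the \uFFFF notation rather than \x{FFFF}.

```python
import re
import unicodedata

# Python's re module understands \uXXXX escapes inside a pattern.
pattern = re.compile(r"\u00E0")

composed = "\u00E0"                                  # one code point: U+00E0
decomposed = unicodedata.normalize("NFD", composed)  # 'a' followed by U+0300

print(pattern.search(composed) is not None)    # precomposed form is found
print(pattern.search(decomposed) is not None)  # decomposed form is not found
print([hex(ord(c)) for c in decomposed])       # code points of the NFD form
```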

Because the regex is limited to four hex digits, the use of five or more hex digits in the UniToAsc.Style.mta file is an error.

Is there a workaround?
When U+1D49E is saved to a Unicode file, the hex dump shows the following byte sequence: FF FE 35 D8 9E DC
This is a BOM followed by UTF-16 little-endian encoded text.
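
The byte sequence can be reproduced with a few lines of Python. This is a minimal sketch that builds the bytes in memory instead of writing a file. The variable names are arbitrary.

```python
import codecs

char = chr(0x1D49E)  # the code point from the example above

# BOM for UTF-16 little endian, followed by the encoded character.
data = codecs.BOM_UTF16_LE + char.encode("utf-16-le")

dump = " ".join(f"{byte:02X}" for byte in data)
print(dump)  # FF FE 35 D8 9E DC
```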

Perhaps these erroneous regex literals with five hex digits could easily be transcoded from UTF-32 to UTF-16 surrogate pairs, giving "two characters", but this would collide with the regex engine's "one code point = one character" view.
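
As a rough sketch of that transcoding idea, the following Python code splits a code point above U+FFFF into its UTF-16 surrogate pair and rewrites a \x{...} literal accordingly. The function names are invented for this example. Whether Mp3tag's regex engine would treat the two resulting 16-bit literals as one character is exactly the open question, so this shows only the arithmetic, not a confirmed fix.

```python
import re

# Matches a \x{...} literal and captures its hex digits.
LITERAL = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF have a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


def rewrite_literal(match):
    """Replace one literal by two 16-bit literals if it exceeds FFFF."""
    code_point = int(match.group(1), 16)
    if code_point <= 0xFFFF:
        return match.group(0)         # already fits, leave untouched
    high, low = to_surrogate_pair(code_point)
    return "\\x{%04X}\\x{%04X}" % (high, low)


def transcode(regex_text):
    """Rewrite every over-long \\x{...} literal in the given text."""
    return LITERAL.sub(rewrite_literal, regex_text)


print(transcode(r"\x{1D49E}"))  # \x{D835}\x{DC9E}
```

The pair D835 DC9E agrees with the bytes 35 D8 9E DC in the hex dump above, which stores each 16-bit unit low byte first.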

Maybe a new Mp3tag function $ChrUnicodeToHex(), which converts a Unicode string to a hex string, could help by allowing search and replace within the hex string. I do not know; at the moment I am at my wits' end.

Stevest, from my side as a simple Creativity Sharer, I will have a look at the problem, but this will take some time, so please be patient.

In the meantime, I suggest making a backup of the UniToAsc.Style.mta file and modifying the working copy by deleting all regex literals with five hex digits.
The remaining part will then run with Mp3tag v2.45b.
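
To locate the literals in question before editing by hand, a small Python sketch like the one below could list the lines that contain a \x{...} value above FFFF. It only reports them and deletes nothing. The sample text is made up, and the function name is hypothetical. The exact layout and encoding of an mta file are not described in this thread, so check those before running it on a real file.

```python
import re

# Matches a \x{...} literal and captures its hex digits.
LITERAL = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")


def find_long_literals(text):
    """Return (line number, literals) for every line with a value > FFFF."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        found = [m.group(0) for m in LITERAL.finditer(line)
                 if int(m.group(1), 16) > 0xFFFF]
        if found:
            hits.append((number, found))
    return hits


# Made-up sample lines, not taken from the real action file.
sample = "\n".join([
    r"\x{00E0}",
    r"\x{1D49E}",
    r"[\x{1D4D0}\x{1D504}]",
])

for number, found in find_long_literals(sample):
    print(number, found)
```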

Or use UniToAsc.Diacritics.mta, which may give good results on accented characters too.

DD.20100224.1747.CET
Edit.
DD.20100224.1920.CET
