[regex?] How can I remove lines with Chinese characters from the lyrics field?

Some of my lyrics have Chinese language on every other line.

[00.12.34]Message in a bottle
[01.23.45]收到我的漂流瓶
[02.34.56]Message in a bottle
[03.45.67]收到我的漂流瓶
etc.

I've searched as much as I can but can't find a way to remove the Chinese lines with regex.

Also, if it can't be done in one go with an Action, is there a way to filter for all the tracks with Chinese characters so I know which ones need fixing?

Many thanks.

Did you ask this same question elsewhere too?
Just to save the time to search for and post a solution...

Here is a thread that deals with a filter for non-ASCII characters ..

And here is one to split text:

Did you ask this same question elsewhere too?
Just to save the time to search for and post a solution...

I don't quite understand.

I don't have a solution to the Chinese characters conundrum.

In that other thread you kindly helped me find the 3 decimal places files, and I'd found another way so I took the time to spell it out in case it would be useful for anyone else searching in the future.

Then I posted this thread about a separate question.

Then I linked this thread to you, and you only, just because we were already communicating and you seemed very knowledgeable about regex.

I really want to find out how to get rid of the Chinese lines in all my lyrics, but I've searched high and low and cannot find a way. Hence this thread. I only mentioned it to you in the other thread as I was already writing a post to thank you and just thought I'd add that in.

Anyway, this is why. This is [what I see on my phone's music player]

Can anyone help me on this please? I'll be so grateful.

This is absolutely wonderful. Thank you so much, ohrenkino.

Another kind person had already helped me remove the <> karaoke lyrics and to change lyrics with timestamp seconds from 3 decimal places to 2.

So now I have this wonderful Action group.

You've made me very happy.

A couple of things ...

Regular expression to filter path names with diacritics

That's not finding Chinese characters for me.

When I entered NOT %_path% MATCHES ^[a-zA-Z0-9\W_]*$ the filter worked fine, picking up Motörhead etc.

But when I replaced %_path% with %lyrics%, the filter only found 4 tracks (out of hundreds).

It would be handy to be able to search for Chinese characters, if by any chance you have another solution. But given I'll be running that Mp3tag Action group on all my new lyrics, I don't think I actually need it, so don't worry if it's not straightforward.

I also searched for the most common Chinese characters and after filtering by 的, I found a whole bunch. From there I realised that there were a ton of tracks with

作词 which apparently means lyricist
作曲 composer and
制作人 producer

hence the Replace actions in my screenshot above.

I'd also like to put [lyrics not found] into the %unsyncedlyrics% field of some tracks but I couldn't find a way to do it.

I can put in lyrics not found, just not with the brackets. Any ideas please?

This is what I'm using, with the Format string as: %unsyncedlyrics%lyrics not found

I've tried a few things ...

%unsyncedlyrics%[lyrics not found] returns blank
%unsyncedlyrics%"[lyrics not found]" returns ""
%unsyncedlyrics%"["lyrics not found"]" returns ""

Perhaps you could teach me what I'm doing wrong?

Anyway, thanks a million again. I really appreciate it.

If you don't do it directly in the extended tags dialogue or the tag panel but with an action, then please note that square brackets serve a special purpose for conditional output. Enclose them in single quotes to output them literally, e.g. with this
Format string: '['lyrics not found']'

See the documentation on Format strings:

Ah, single quote instead of double.

I should've looked at the documentation. Thanks for the link.

Oh, and any idea what I'm doing wrong with

NOT %unsyncedlyrics% HAS ^[a-zA-Z0-9\W_]*$

Actually, it's odd because NOT %_path% HAS ^[a-zA-Z0-9\W_]*$ and NOT %artist% HAS ^[a-zA-Z0-9\W_]*$ work fine, but the lyrics searches just return ALL files, whether there's anything in the lyrics field or not.

It's the wrong filter keyword - to use regular expressions in a filter you need MATCHES instead of HAS

Please also note that the beginning of a UNSYNCEDLYRICS field starts with a language token for MP3 files...

It's the wrong filter keyword - to use regular expressions in a filter you need MATCHES instead of HAS

NOT %unsyncedlyrics% MATCHES ^[a-zA-Z0-9\W_]*$ doesn't return anything when I put a Ñ in one of the tracks' unsyncedlyrics field.

Please also note that the beginning of a UNSYNCEDLYRICS field starts with a language token for MP3 files...

That would explain it returning all the tracks with unsynced lyrics, but that previous filter returned all tracks, whether there was anything in unsyncedlyrics or not. But I guess that was because I was using HAS instead of MATCHES.

But still, MATCHES isn't returning anything, so the language token isn't relevant, yet.

Do you mean that the language token means filtering for non-ascii characters in unsyncedlyrics won't work?

Oh, now I'm confused ...

If I put Ñ (and only that) into the unsyncedlyrics field, NOT %unsyncedlyrics% MATCHES ^[a-zA-Z0-9\W_]*$ finds the track.

But if I put the Ñ into here:

eng||
[instrumentalÑ]

NOT %unsyncedlyrics% MATCHES ^[a-zA-Z0-9\W_]*$ returns nothing.
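One possible explanation for the difference, offered as an assumption and not as confirmed Mp3tag behaviour: if the regex engine treats ^ and $ as line anchors, the expression is already satisfied by the clean first line (eng||), so the NOT filter never fires. The Python sketch below mimics that with re.MULTILINE. Python's re module is only a stand-in for Mp3tag's engine, and found_by_not_matches is a made-up helper name.

```python
import re

# The filter expression from the thread: "the text consists only of ASCII
# letters, digits, underscores or non-word characters".
ASCII_ONLY = r'^[a-zA-Z0-9\W_]*$'

single_line = 'Ñ'
multi_line = 'eng||\n[instrumentalÑ]'

def found_by_not_matches(text, flags=0):
    """Mimic 'NOT field MATCHES pattern': True means the track is listed."""
    return re.search(ASCII_ONLY, text, flags) is None

# Anchors bound to the whole string: the Ñ is caught in both cases.
whole_string = [found_by_not_matches(t) for t in (single_line, multi_line)]

# Anchors bound to each line (assumed engine behaviour): the first line
# 'eng||' satisfies the pattern, so the multi-line field is not listed.
per_line = [found_by_not_matches(t, re.MULTILINE) for t in (single_line, multi_line)]

print(whole_string)
print(per_line)
```

If that assumption holds, a filter that looks for one offending character anywhere in the field would be more robust than one that tries to describe the whole field.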

To actually find and filter out Chinese characters in an action group, you would want something else instead:

$regexp(%lyrics%,'[\x{2E80}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE10}-\x{FE1F}\x{FE30}-\x{FE6F}\x{FF00}-\x{FFEF}]',)

It is technically infeasible to filter out only Chinese, since CJK characters are intermixed across different Unicode ranges. The regex above also removes most Japanese and some Korean characters. But for the purpose of modifying lyrics, it should work fine.
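To see which text the ranges above actually cover, here is a minimal Python sketch. The ranges are copied from the expression in this post (\x{2E80} is written \u2E80 in Python); contains_cjk is a hypothetical helper name, and Python's re module only approximates what Mp3tag does.

```python
import re

# Python spelling of the character class from the post above.
CJK_CLASS = '[\u2E80-\u9FFF\uF900-\uFAFF\uFE10-\uFE1F\uFE30-\uFE6F\uFF00-\uFFEF]'
CJK_RE = re.compile(CJK_CLASS)

def contains_cjk(text):
    """Return True if any character of text falls inside the ranges above."""
    return CJK_RE.search(text) is not None

samples = [
    'Message in a bottle',   # plain ASCII
    '收到我的漂流瓶',          # Chinese
    'Motörhead',             # non-ASCII, but not CJK
    'カタカナ',               # Japanese katakana, inside 2E80-9FFF
    '한국어',                 # Hangul syllables, outside the listed ranges
]
for sample in samples:
    print(sample, contains_cjk(sample))
```

Unlike a generic non-ASCII test, this leaves names such as Motörhead alone, which is the behaviour one would want from a "tracks with Chinese characters" search.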

Thanks again.

I can't get that to work.

Here's a screenshot of the quick action I tried running with the lyrics and unsyncedlyrics columns behind it so you can see what I'm trying to remove.

The first few lines of the lyrics look like this:

作词 : Billy Gibbons/Frank Beard/Dusty Hill
作曲 : Billy Gibbons/Frank Beard/Dusty Hill
[Verse 1]
Well, I was rolling down the road
In some cold blue steel

and I'd like to remove the lines

作词 : Billy Gibbons/Frank Beard/Dusty Hill
作曲 : Billy Gibbons/Frank Beard/Dusty Hill

Actually in this particular case, I'd be happy to just replace 作词 : with Lyrics by: but let's pretend it's a whole line of lyrics like:

[01:02:34]作曲 作词 作曲 作词 作曲 作词 作曲 作词

As you can see in the screenshot, I tried running:

Guess values
Source format:

$regexp(%lyrics%,'[\x{2E80}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE10}-\x{FE1F}\x{FE30}-\x{FE6F}\x{FF00}-\x{FFEF}]',)

Guessing pattern:
%lyrics%===%chinese%

But nothing happens.

(I'm just using %chinese% instead of %comment% because I use the comment field.)

I thought your previous suggestion was working fine:

Guess values
Source format:

$regexp(%lyrics%,'[^\x00-\x7F]+',)===$regexp(%lyrics%,'[\x00-\x7F]+',)

Guessing pattern:
%lyrics%===%chinese%

I've run it on more than a thousand files and I can't turn back now, lol.

Is the newer version homing in on Chinese, Japanese and Korean characters more specifically (rather than just all non-ascii characters)?

I think so. Can you tell me how to run it properly, please? My Guess values attempt just isn't working.
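The difference between the two character classes can be illustrated outside Mp3tag. This is only a sketch in Python, with a made-up sample line; it assumes the classes behave the same way in Mp3tag's $regexp.

```python
import re

# "Everything that is not ASCII", as in the older suggestion.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')
# The CJK-oriented ranges from the newer suggestion.
CJK = re.compile('[\u2E80-\u9FFF\uF900-\uFAFF\uFE10-\uFE1F\uFE30-\uFE6F\uFF00-\uFFEF]')

line = 'Señor “15” 英尺'

without_non_ascii = NON_ASCII.sub('', line)  # drops ñ, curly quotes and Chinese
without_cjk = CJK.sub('', line)              # drops only the Chinese characters

print(repr(without_non_ascii))
print(repr(without_cjk))
```

So the older class also removes accented letters and curly quotes, while the newer one keeps them.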

By the way, the only issue I had with the $regexp(%lyrics%,'[^\x00-\x7F]+',)===$regexp(%lyrics%,'[\x00-\x7F]+',) version was that it seemed to be removing the first round bracket on any line of lyrics that had round brackets. Maybe that was happening only when I ran it on lyrics that didn't have any non-ascii characters in it, not sure. But I've seen a few lyrics that now have e.g.

Tell her about it Tell her about it)
instead of
Tell her about it (Tell her about it)

Is that possibly due to the Guess values Source format?

The reason I'm running your original Guess values string on all tracks is just because I can't work out how to filter tracks with Chinese characters. And I'd come to the conclusion that it would leave tracks with no non-ascii characters in the lyrics untouched. Not sure again.

I'm writing too much and should probably edit this down but hopefully you see what I'm trying to do and where I'm failing.

Many thanks.

Where in the source format do you divide the supplied string with ==?
If those are missing, then the pattern does not match and you get no change.

I'm sorry, I don't understand.

I'm just copying your suggestions without actually understanding them properly because it's all a bit over my head.

I don't really want to use that %chinese% (%comment%) field or "split" the lyrics like the original poster in that other thread, it was just the only way I could think to make it work.

I simply have some lyrics where every other line is in Chinese, and I'd like to remove those.

Anyway, so I don't know what you mean by ...

Where in the source format do you divide the supplied string with ==?

Could you phrase it differently please?

Sorry, yes, that was also my impression.

An action of the type "Guess values" works almost like the Convert > Filename - Tag function, except that the source is not always the filename but any string that you supply.
And just as you have to define in a pattern for the filename which part of the string is meant to end up in which field, you have to do that in "Guess values" too.
And just like in Convert > Filename - Tag: if the pattern does not match the string, then you get no result.
In your case you applied a pattern that had == as divider between the target fields, but I could not see that anywhere in the source.
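The matching rule described here can be sketched in Python. This is a rough stand-in, not how Mp3tag implements the action: guess_values is a hypothetical function that turns the guessing pattern into a regular expression and returns nothing when the divider is absent from the source.

```python
import re

def guess_values(source, pattern):
    """Rough stand-in for a 'Guess values' action (hypothetical helper).

    Turns a pattern such as '%lyrics%===%chinese%' into a regular expression
    with one named group per %field% and matches it against the source.
    Returns a dict of field values, or None when the pattern does not fit.
    """
    # re.split with a capture group alternates literal text and field names.
    parts = re.split(r'%(\w+)%', pattern)
    regex = ''
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)        # literal divider text
        else:
            regex += f'(?P<{part}>.*?)'     # a target field
    match = re.fullmatch(regex, source, re.DOTALL)
    return match.groupdict() if match else None

with_divider = guess_values('Message in a bottle===收到我的漂流瓶',
                            '%lyrics%===%chinese%')
without_divider = guess_values('Message in a bottle', '%lyrics%===%chinese%')

print(with_divider)
print(without_divider)
```

The second call returns None because the source contains no divider, which corresponds to the "nothing happens" result.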

As an attempt:
Try an action of the type "Format value" for UNSYNCEDLYRICS
Format string: $regexp(%unsyncedlyrics%,'[\x{2E80}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE10}-\x{FE1F}\x{FE30}-\x{FE6F}\x{FF00}-\x{FFEF}]',)

This should leave only the timing.
You could then remove that timing pattern if it is still necessary.
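One caveat, shown as a Python sketch under the assumption that $regexp replaces matches the same way re.sub does: replacing individual characters removes only the characters inside the listed ranges, so the timestamp and any digits, Latin letters or other punctuation on the same line are left behind.

```python
import re

CJK = re.compile('[\u2E80-\u9FFF\uF900-\uFAFF\uFE10-\uFE1F\uFE30-\uFE6F\uFF00-\uFFEF]')

english = '[02:33.20]and off the side of the side road there was another fifteen foot cliff'
chinese = '[02:33.20]就在小路边上有一个15英尺高的悬崖'
lyrics = english + '\n' + chinese

# Character-level replacement: only the matched characters disappear.
stripped = CJK.sub('', lyrics)
print(stripped)
```

The second line of the output still carries its timestamp and the digits 15, so a line-level deletion is needed to get rid of it completely.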

That nearly works.

I'd definitely like to remove the timestamps too and I can't think how to do that.

The timestamps are actually being deleted (which is what I want) except for some of the lines which are left intact with some curly quotes (sometimes one, sometimes two) or numbers. I guess that's because they aren't Chinese characters.

Is there a way around that to cover (nearly) all possibilities?

The previous method I'd been using left in the 15 and 45 lines (and their timestamps). I didn't know that until just now and that's disappointing because I think I've gone too far to go back to my backups! But what the heck, it's a big improvement on having all the Chinese lines.

But the old method didn't keep the curly quotes, which is good.

My mission is to completely delete (including timestamps) any line that contains a Chinese character.

Three versions below.

  1. Result of your new Format value method.
  2. Original lyrics including Chinese.
  3. Result of the old method (where I also then delete the %chinese% field in the same Action group).

Here's some of the original lyrics:

[00:39.30]that's just the name of the song, and that's why I called the song, Alice's Restaurant.
[00:39.30]它只是这歌的名字,这也是为什么我称这首歌为“爱丽丝的餐厅”。
[01:11.15]An' you can get anything you want at Alice's Restaurant
[01:11.15]你想要的全都有,就在爱丽丝的餐厅里。”
[02:12.30]Well we got there and there was a big sign and a chain across across the dump saying, "Closed on Thanksgiving."
[02:12.30]好吧,我们到了,发现垃圾场被链子锁着,而且有一个大大的标志,上面写着:“感恩节关门。”
[02:33.20]and off the side of the side road there was another fifteen foot cliff
[02:33.20]就在小路边上有一个15英尺高的悬崖,

Here's what happened with the new method:

[00:39.30]that's just the name of the song, and that's why I called the song, Alice's Restaurant.
[00:39.30]“”
[01:11.15]An' you can get anything you want at Alice's Restaurant
[01:11.15]”
[02:12.30]Well we got there and there was a big sign and a chain across across the dump saying, "Closed on Thanksgiving."
[02:12.30]“”
[02:33.20]and off the side of the side road there was another fifteen foot cliff
[02:33.20]15

Here's the result of the old Guess values method:

[00:39.30]that's just the name of the song, and that's why I called the song, Alice's Restaurant.
[01:11.15]An' you can get anything you want at Alice's Restaurant
[02:12.30]Well we got there and there was a big sign and a chain across across the dump saying, "Closed on Thanksgiving."
[02:33.20]15

Try an action of the type "Replace" for UNSYNCEDLYRICS
Search string: ]“
Replace with: ]

and a similar one with
Search string: ]”

And then another "Replace with regular expression"
Search string: \[\d\d\:\d\d\.\d\d\]
Replace with:
(leave empty)

Thank you again.

Some of the lines still have quotes in them, and the problem, of course, is that the Arabic numbers are there too.

At [05:01.649] he talks about "twenty seven eight-by-ten color glossy photographs", which in Chinese has "拍了27张标记着圆圈箭头的8x10彩色光面照片," and then, after your string, becomes "278x10" on its Chinese line. There's a few things like that. There's 473758, W, ~~, and ......

You were right, the timestamps do remain, it's just that my lyrics panel in foobar2000 doesn't show them when I select Edit lyric. My phone app will display them though, so I will want to get rid of all the timestamps for the, now otherwise empty, 'Chinese' lines.

Here's an example of what remains in the unsyncedlyrics field in the tags:

[05:01.649]and they took twenty seven eight-by-ten color glossy photographs with circles
[05:01.649]278x10

Is there a way to delete all lines that have a Chinese (or non-ascii perhaps) character in them?
Is there a way to just filter/search for tracks with Chinese (or non-ascii) characters?

Oh, please! Do not clog the forum with such long dumps.
One line or perhaps 2 of those that you want to treat should be enough.

Have you tried the suggested actions to remove the otherwise empty timing?

Or see what this can do for you:
$regexp(%unsyncedlyrics%,'^\[.+?[\x{2E80}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE10}-\x{FE1F}\x{FE30}-\x{FE6F}\x{FF00}-\x{FFEF}](.*?)$',)

Sorry, I didn't realise that that long drop of all the lyrics would be a problem. Posts amended.

Yes, I had tried to remove the lines that only have timestamps like this for example.

[00:27.558]
[00:35.974]

The following seemingly has no effect.

"Replace with regular expression"
Field: UNSYNCEDLYRICS
Regular expression: [\d\d:\d\d.\d\d]
Replace matches with: (leave empty)
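A possible reason for the lack of effect, assuming the expression was entered with escaped brackets as originally suggested: it expects exactly two decimals, whereas the sample lines above have three. The Python sketch below compares both variants and adds a hypothetical third pattern that removes a line only when it holds nothing but a timestamp. Mp3tag's engine may differ in details.

```python
import re

TWO_DECIMALS = re.compile(r'\[\d\d:\d\d\.\d\d\]')
TWO_OR_THREE = re.compile(r'\[\d\d:\d\d\.\d{2,3}\]')
# Hypothetical variant: match only lines that contain nothing but a timestamp,
# including the line break that follows.
ONLY_TIMESTAMP = re.compile(r'^\[\d\d:\d\d\.\d{2,3}\][ \t]*(?:\n|\Z)', re.MULTILINE)

empty_lines = '[00:27.558]\n[00:35.974]\n[00:39.30]'
mixed = "[00:27.558]\n[00:39.30]that's just the name of the song\n[00:35.974]"

two = TWO_DECIMALS.sub('', empty_lines)      # three-decimal stamps survive
flexible = TWO_OR_THREE.sub('', empty_lines) # all stamps removed
cleaned = ONLY_TIMESTAMP.sub('', mixed)      # timed lyric line is kept

print(repr(two))
print(repr(flexible))
print(repr(cleaned))
```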

And it's the same with your latest suggestion:

"Format value"
Field: UNSYNCEDLYRICS
Format string: $regexp(%unsyncedlyrics%,'^[.+?\x{2E80}-\x{9FFF}\x{F900}-\x{FAFF}\x{FE10}-\x{FE1F}\x{FE30}-\x{FE6F}\x{FF00}-\x{FFEF}$',)

No effect.

The problem remains that so far we've been unable to remove:

  • whole lines
  • ASCII characters that are on the same line as the Chinese characters

Forgetting Chinese/non-ASCII for now, is it possible to do a search-and-replace with regex to say, for example, "Look for any line with the number 7 on it and, if you find a number 7 on a line, then delete that whole line"?

That's what I'm trying to do, except with Chinese (or non-ASCII) characters instead of the number 7.
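The general idea of "delete every line that contains X" can be sketched in Python as follows. This is not an Mp3tag action; drop_lines is a hypothetical helper, and the single-expression variant at the end only hints at what a regex replacement would have to look like. Whether . stops at line breaks and whether ^ anchors at each line in Mp3tag is an assumption that would need checking.

```python
import re

CJK = re.compile('[\u2E80-\u9FFF\uF900-\uFAFF\uFE10-\uFE1F\uFE30-\uFE6F\uFF00-\uFFEF]')

def drop_lines(text, pattern):
    """Keep only the lines in which pattern finds no match."""
    return '\n'.join(line for line in text.splitlines() if not pattern.search(line))

english = '[05:01.649]and they took twenty seven eight-by-ten color glossy photographs with circles'
chinese = '[05:01.649]拍了27张标记着圆圈箭头的8x10彩色光面照片'

# Line-level deletion: the timestamp, 27 and 8x10 go away with the line.
without_chinese = drop_lines(english + '\n' + chinese, CJK)

# The "number 7" example from the question, done two ways.
sample = 'keep me\nline with a 7\nkeep me too'
without_seven = drop_lines(sample, re.compile('7'))
single_pass = re.sub(r'(?m)^.*7.*(?:\n|\Z)', '', sample)

print(without_chinese)
print(without_seven)
print(single_pass)
```

Swapping the 7 in the single expression for the CJK character class gives the whole-line version of the earlier suggestions.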