Splitting a title with mixed English and non-English

sivaram2 · June 12, 2020, 7:09am

Suppose I have a song title in which the first part is in English and the second part is the title in the song's own language, for example "I'm Watching a Loneliness Just Arisen 나는 새롭게 떠오른 외로움을 봐요", where the first half is English and the second half is the Korean name.

Is there a $regexp expression I can use to capture the English part and the Korean part as separate groups? I want to move the Korean part to the %title% tag and the English part to the %titlesort% tag.

As in the example above, I know that the language switch happens at a whitespace character (\s) in the middle.

Appreciate the help in advance!

Crissov · June 12, 2020, 8:15am

Will it always be two different scripts, not just languages? Can the English part contain diacritic marks like accents and umlauts?

sivaram2 · June 12, 2020, 8:30am

Great question! For simplicity’s sake, yes, let’s consider the case where the first language uses Latin script and the second language uses something non-Latin. So yeah, this could cover Korean, Japanese, Hindi, etc.

ohrenkino · June 12, 2020, 8:54am

$regexp(%title%,'[^\x00-\x7F]+',)
leaves the ASCII part.
$regexp(%title%,'[\x00-\x7F]+',)
gives the non-ASCII part.

So an action of the type "Guess value" could work:
Source pattern: $regexp(%title%,'[^\x00-\x7F]+',)===$regexp(%title%,'[\x00-\x7F]+',)
Target string: %titlesort%===%title%
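
To see what the two character classes above do to the sample title, here is a minimal Python sketch. It is an illustration only, not Mp3tag code: `re.sub` stands in for `$regexp`, and `split_title` is a hypothetical helper that splits at the first non-ASCII character instead. Two things worth checking in Mp3tag itself: spaces are ASCII, so deleting every ASCII run also deletes the spaces between the Korean words, and deleting the non-ASCII runs leaves trailing spaces on the English part.

```python
import re

title = "I'm Watching a Loneliness Just Arisen 나는 새롭게 떠오른 외로움을 봐요"

# Same character classes as in the post above, applied with re.sub.
# Deleting non-ASCII runs keeps the English words plus the leftover spaces.
ascii_only = re.sub(r"[^\x00-\x7F]+", "", title)
# Deleting ASCII runs also removes the spaces between the Korean words.
non_ascii_only = re.sub(r"[\x00-\x7F]+", "", title)

# Hypothetical alternative: split once, at the first non-ASCII character.
# Group 1 is the leading ASCII text (lazy, so the separating whitespace is
# not included); group 2 is everything from the first non-ASCII character on.
SPLIT_RE = re.compile(r"^([\x00-\x7F]*?)\s*([^\x00-\x7F].*)$")

def split_title(text):
    """Return (ascii_part, rest); rest is empty if the text is pure ASCII."""
    match = SPLIT_RE.match(text)
    if match is None:
        return text, ""
    return match.group(1), match.group(2)

english, korean = split_title(title)
print(repr(ascii_only))
print(repr(non_ascii_only))
print(repr(english), repr(korean))
```

Because the split is ASCII-based, an accented Latin letter such as "é" counts as non-ASCII and would start the second part early, which is the case raised earlier in the thread about diacritics. Whether the same single-pattern split carries over to $regexp unchanged is an assumption that would need testing in Mp3tag.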