Help with Long RegEx

Rijkstra · March 30, 2016, 9:23pm

I used a lengthy RegEx to find all of the permutations of the word CLIMATE in the New York Times Crossword database. It works, but may be too long:

88 results for regular expression ([CLIMATE])(?!\1)([CLIMATE])(?!\1|\2)([CLIMATE])(?!\1|\2|\3)([CLIMATE])(?!\1|\2|\3|\4)([CLIMATE])(?!\1|\2|\3|\4|\5)([CLIMATE])(?!\1|\2|\3|\4|\5|\6)[CLIMATE]

ACCLIMATE ACCLIMATED ACCLIMATES ACHROMATICLENS ANASTIGMATICLENS ARISTOLOCHIACLEMATITIS ARITHMETICAL ARITHMETICALAID ARITHMETICALLY ASTRONOMICALTELESCOPE AUTOMATICELEVATOR BIOLOGICALTIME CELESTIALMECHANICS ...
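
For anyone who wants to try the expression outside the crossword site, here is a minimal sketch. It assumes Python's standard `re` module, which is not the engine the site uses. The first three sample words come from the result list above; CLIMATIC is a non-matching word added purely for illustration, and `find_anagram` is a hypothetical helper name.

```python
import re

# The long expression from the post, unchanged (split across lines for readability).
PATTERN = re.compile(
    r"([CLIMATE])(?!\1)([CLIMATE])(?!\1|\2)([CLIMATE])(?!\1|\2|\3)([CLIMATE])"
    r"(?!\1|\2|\3|\4)([CLIMATE])(?!\1|\2|\3|\4|\5)([CLIMATE])"
    r"(?!\1|\2|\3|\4|\5|\6)[CLIMATE]"
)

def find_anagram(word):
    """Return the first 7-letter run using each letter of CLIMATE once, or None."""
    match = PATTERN.search(word)
    return match.group(0) if match else None

for word in ["ACCLIMATE", "ARITHMETICAL", "BIOLOGICALTIME", "CLIMATIC"]:
    print(word, find_anagram(word))
```

Each negative lookahead only rules out the letters already captured, so a run is accepted only when all seven characters are distinct members of the class. CLIMATIC is rejected because every 7-letter window in it contains the letter I twice.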

I believe the problem is that you can back reference the result of a character class match, but not its terms. I'd love to be proven wrong, BTW. You can see the entire result here:

http://wordplay.blogs.nytimes.com/2016/03/...permid=18048550

DetlevD · March 31, 2016, 4:15am

Note: In a regular expression, the term [CLIMATE] is a set of seven letters.
A permutation is a rearrangement of a given ordered set of elements into a different order, which can give the new arrangement a different meaning.

Given a word of 7 letters, the new word also has 7 letters: the same letters in a different order.
CLIMATE, MACLITE, METALIC, LACTIME, CALMITE, METICAL, LITECAM, MALETIC, MALTICE, CALTIME, ...

DD.20160331.0815.CEST

Rijkstra · March 31, 2016, 5:14am

I'm not just looking for seven-letter words, but also seven-letter sequences in longer words. My RegEx does that. I've tried to shorten it this way with subroutine calls, but I can't get it to work:

Error: parsing "([CLIMATE])(?!\1)((?1))(?!\1|\2)((?1))(?!\1|\2|\3)((?1))(?!\1|\2|\3|\4)((?1))(?!\1|\2|\3|\4|\5)((?1))(?!\1|\2|\3|\4|\5|\6)(?1)" - Unrecognized grouping construct.

Do the (?1) subroutine calls not support another set of parens so that their results can be back-referenced?

ohrenkino · March 31, 2016, 5:25am

If you are looking for a string constant, why not use a simple search?
If you do not use "word only", you should find all entries that are or contain the string constant.

DetlevD · March 31, 2016, 5:39am

At the risk of having misunderstood your problem, ...
if you have a list of words and you want to know whether a word contains a predefined set of letters, ...
then you could do something like this:

get a word from the list of words (dictionary database);
remove from the word all letters that belong to the predefined set;
measure the length of the resulting word.
If the result is shorter than the original word, ...
by at least the number of predefined letters, ...
then add this word to the result list.
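
The steps above could be sketched roughly as follows. This is only one reading of the procedure, written in Python; the function names and the sample words are assumptions made for illustration.

```python
LETTERS = set("CLIMATE")

def passes_letter_filter(word, letters=LETTERS):
    """Strip the predefined letters, then compare lengths as described above."""
    stripped = "".join(ch for ch in word if ch not in letters)
    # The word qualifies if at least len(letters) characters were removed.
    return len(word) - len(stripped) >= len(letters)

def filter_words(words):
    """Collect the words that pass the length comparison."""
    return [w for w in words if passes_letter_filter(w)]

# Hypothetical sample list, not taken from the crossword database.
print(filter_words(["ACCLIMATE", "CLIMATIC", "ZEBRA"]))
```

Note that this check only counts how many characters belong to the set. It does not look at their order, at repeats, or at whether they are adjacent, so CLIMATIC passes even though it contains no rearrangement of CLIMATE.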

DD.20160331.0939.CEST

Rijkstra · March 31, 2016, 5:40am

There's nothing simple about it. The search is for anagrams of CLIMATE within words of 7+ length. As you can see, I have a working RegEx in the original post that I just want to shorten. I've tried (?1) and \g<1> as subroutine calls without success.

Rijkstra · March 31, 2016, 5:52am

That won't work. The seven letters of CLIMATE must be consecutive within a word with no repeats or intervening letters. As I said, the long RegEx works. Note that you will find an anagram of CLIMATE within each hit, if not the word itself.
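
The requirement stated here (seven consecutive characters, each letter of CLIMATE exactly once) can also be checked without a regular expression. The following is a minimal sketch in Python, not something used in this thread; `anagram_windows` is a hypothetical name.

```python
def anagram_windows(word, letters="CLIMATE"):
    """Return every run of len(letters) consecutive characters that is a
    rearrangement of letters."""
    target = sorted(letters)
    n = len(letters)
    return [
        word[i:i + n]
        for i in range(len(word) - n + 1)
        if sorted(word[i:i + n]) == target  # same letters, any order
    ]

for word in ["ACCLIMATE", "ARITHMETICAL", "CLIMATIC"]:
    print(word, anagram_windows(word))
```

Comparing sorted windows enforces "no repeats and no intervening letters" directly, at the cost of having to run outside the site's search box.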

DetlevD · March 31, 2016, 8:10am

OK, as you said, your solution works, so use it.
What is the benefit of all this effort?
Is there any prize money?

DD.20160331.1210.CEST

DetlevD · March 31, 2016, 11:22am

Assuming some regexp dialects do not support recursion, ...
maybe there is a way to rewrite the recursive expression as a linear one ...
see here ...
http://www.regular-expressions.info/subroutine.html

DD.20160331.1522.CEST

Rijkstra · March 31, 2016, 2:33pm

I was on that page yesterday trying to solve the problem. No, there is no prize money for shortening the working RegEx. I'm just frustrated that I can't eliminate the multiple occurrences of the [CLIMATE] search character class. I'll have to find out exactly which dialect of RegEx the site is using.
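
Where the engine offers no way to reuse the character class, one workaround is to generate the repetitive expression instead of typing it. The sketch below is in Python and is an illustration only; `build_pattern` is a hypothetical helper, and it assumes the letters of the word are all distinct, as they are in CLIMATE.

```python
import re

def build_pattern(letters):
    """Build the long form: capture each position except the last, and before
    each later position forbid every letter captured so far."""
    cls = "[" + letters + "]"
    n = len(letters)
    parts = []
    for i in range(1, n + 1):
        if i > 1:
            refs = "|".join("\\" + str(k) for k in range(1, i))
            parts.append("(?!" + refs + ")")
        # The final position needs no capture group, matching the original post.
        parts.append("(" + cls + ")" if i < n else cls)
    return "".join(parts)

pattern = build_pattern("CLIMATE")
print(pattern)
print(re.search(pattern, "ACCLIMATE").group(0))
```

For CLIMATE this reproduces the working expression from the first post character for character, so the result can be pasted into any engine that supports back-references and negative lookahead.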

Rijkstra · March 31, 2016, 5:24pm

I found out why my shortened RegEx won't work. The site uses Microsoft's .NET, which is considered less full-featured than the more standard Perl, PCRE or PHP. .NET doesn't support what I am trying to do at all.

I'm hoping Mp3Tag uses one of the Perl-based versions. Does anyone here know exactly which version is used?

ohrenkino · March 31, 2016, 5:47pm

see this thread: /t/6109/1

Rijkstra · March 31, 2016, 7:15pm

Doesn't tell me much about the Perl version, but I assume that Florian keeps it up to date. My failed shortened RegEx works fine on this PHP-based website that the failed .NET site uses as a tutorial!

([CLIMATE])(?!\1)((?1))(?!\1|\2)((?1))(?!\1|\2|\3)((?1))(?!\1|\2|\3|\4)((?1))(?!\1|\2|\3|\4|\5)((?1))(?!\1|\2|\3|\4|\5|\6)(?1)

On

http://www.visca.com/regexdict/tutorial.html

And yes, it also works in Mp3Tag!
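
To see why the short and long forms are equivalent: a (?1) subroutine call re-runs the subpattern of group 1, here the class [CLIMATE], rather than matching the text that group 1 captured. The Python sketch below illustrates this by expanding the calls textually. Python's standard `re` module, like .NET, has no (?1) syntax, which is why the short form is only expanded here and never compiled. The helper names are hypothetical, and a plain string replacement like this is only safe because group 1 is a simple character class.

```python
import re

SHORT = (
    r"([CLIMATE])(?!\1)((?1))(?!\1|\2)((?1))(?!\1|\2|\3)((?1))"
    r"(?!\1|\2|\3|\4)((?1))(?!\1|\2|\3|\4|\5)((?1))(?!\1|\2|\3|\4|\5|\6)(?1)"
)

def expand_subroutine_calls(pattern, group_body="[CLIMATE]"):
    """Replace each (?1) call with the subpattern of group 1."""
    return pattern.replace("(?1)", group_body)

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

LONG = expand_subroutine_calls(SHORT)
print(compiles(SHORT), compiles(LONG))
print(re.search(LONG, "ACCLIMATE").group(0))
```

The expanded string is identical to the long expression from the first post, so engines without subroutine calls can still run the same search.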