Fragments Of Time

Happiness only real when shared.

Open Sourced my JavaScript Regular Expression Generator - RegexGen.js

RegexGen.js - JavaScript Regular Expression Generator

RegexGen.js is a JavaScript Regular Expression Generator that helps to construct complex regular expressions, inspired by JSVerbalExpressions.

The Problems

RegexGen.js tries to ease two problems.

  1. While creating a regular expression, it's hard to remember the correct syntax and what characters to escape.
  2. Once a regular expression is written, it's hard to read and to remember what it does.

The Goals

RegexGen.js is designed to achieve the following goals.

  1. The code you write should be easy to read and easy to understand.
  2. The generated code should be as compact as possible, e.g., no redundant brackets and parentheses.
  3. No character escaping is required (except for '\', or when you use a regex overwrite).
  4. If the generated code is not good enough, the bad parts can be easily replaced directly in the code you write.

Getting Started

The generator is exported as a regexGen() function.

To generate a regular expression, pass sub-expressions as parameters to the regexGen() function.

Sub-expressions passed as comma-separated parameters are concatenated to form the whole regular expression.

Sub-expressions can be a string, a number, a RegExp object, or any value generated by the functions owned by the regexGen() function object, i.e., the regex-generator() functions in the informal BNF syntax below.

Strings passed into the regexGen(), the text(), the maybe(), the anyCharOf() and the anyCharBut() functions, are always escaped as necessary, so you don't have to worry about which characters to escape.
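
As a rough illustration of what "escaped as necessary" means, the sketch below escapes regex metacharacters in a literal string using plain JavaScript. The escapeLiteral() helper is hypothetical and written only for this example; it is not RegexGen.js's actual implementation, which may differ.

```javascript
// Hypothetical helper, not part of RegexGen.js:
// prefix every regex metacharacter in a literal string with a backslash.
function escapeLiteral( text ) {
    return text.replace( /[.*+?^${}()|[\]\\\/]/g, '\\$&' );
}

// Build a regex that matches the literal text 'a.b(c)' and nothing else.
var escaped = escapeLiteral( 'a.b(c)' );
var literal = new RegExp( '^' + escaped + '$' );

var matchesItself = literal.test( 'a.b(c)' );
var matchesOther = literal.test( 'axb(c)' );
```

Without the escaping step, the dot would match any character and the parentheses would form a group, so 'axb(c)' would not be rejected for the right reason.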

The result of calling the regexGen() function is a RegExp object.

The basic usage can be expressed as the following informal BNF syntax.

RegExp object = regexGen( sub-expression [, sub-expression ...] [, modifier ...] )

sub-expression ::= string | number | RegExp object | term

term ::= regex-generator() [.term-quantifier()] [.term-lookahead()]

regex-generator() ::= regexGen.startOfLine() | regexGen.endOfLine()
    | regexGen.wordBoundary() | regexGen.nonWordBoundary()
    | regexGen.text() | regexGen.maybe() | regexGen.anyChar() | regexGen.anyCharOf() | regexGen.anyCharBut()
    | regexGen.either() | regexGen.group() | regexGen.capture() | regexGen.sameAs()
    | regex() | ... (see regexGen.js for all termGenerator()s.)

term-quantifier() ::= .term-quantifier-generator() [.term-quantifier-modifier()]

term-quantifier-generator() ::= term.any() | term.many() | term.maybe() | term.repeat() | term.multiple()

term-quantifier-modifier() ::= term.greedy() | term.lazy() | term.reluctant()

term-lookahead() ::= term.contains() | term.notContains() | term.followedBy() | term.notFollowedBy()

modifier ::= regexGen.ignoreCase() | regexGen.searchAll() | regexGen.searchMultiLine()

Please check out regexgen.js and the wiki for API documentation, and check out test.js for more examples.

Installation

If you are managing package dependencies with bower, you can install RegexGen.js using the bower install command.

bower install git://github.com/amobiz/regexgen.js.git

Or you can just download regexgen.js or regexgen.min.js and put it wherever your project's scripts are located.

Usage

The hard (but safe) way

Since the generator is exported as the regexGen() function, everything must be referenced from it.
To simplify the code, it is preferable to assign it to a short variable.

var _ = regexGen;

var regex = regexGen(
    _.startOfLine(),
    _.capture( 'http', _.maybe( 's' ) ), '://',
    _.capture( _.anyCharBut( ':/' ).repeat() ),
    _.group( ':', _.capture( _.digital().multiple(2,4) ) ).maybe(), '/',
    _.capture( _.anything() ),
    _.endOfLine()
);
var matches = regex.exec( url );

Mixin to global object

If you still find this inconvenient and don't mind polluting the global object,
use the regexGen.mixin() function to export all member functions of the regexGen() function object to the global object.

regexGen.mixin( window );

var regex = regexGen(
    startOfLine(),
    capture( 'http', maybe( 's' ) ), '://',
    capture( anyCharBut( ':/' ).repeat() ),
    group( ':', capture( digital().multiple(2,4) ) ).maybe(), '/',
    capture( anything() ),
    endOfLine()
);
var matches = regex.exec( url );

Use the with keyword

Or, if you don't use strict mode (the 'use strict' directive),
you can use the with keyword to refer to all member functions of the regexGen() function object.

with( regexGen ) {
    var regex = regexGen(
        startOfLine(),
        capture( 'http', maybe( 's' ) ), '://',
        capture( anyCharBut( ':/' ).repeat() ),
        group( ':', capture( digital().multiple(2,4) ) ).maybe(), '/',
        capture( anything() ),
        endOfLine()
    );
    var matches = regex.exec( url );
}
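
The strict-mode restriction can be checked without RegexGen.js. The sketch below is only an illustration: compiles() is a hypothetical helper that uses the Function constructor to test whether a function body parses, with and without the 'use strict' directive.

```javascript
// Hypothetical helper: returns true when the given function body compiles,
// false when the engine rejects it with a SyntaxError.
function compiles( body ) {
    try {
        new Function( body );
        return true;
    } catch ( e ) {
        if ( e instanceof SyntaxError ) {
            return false;
        }
        throw e;
    }
}

// The with statement is accepted in non-strict code...
var sloppyOk = compiles( 'with ( { a: 1 } ) { return a; }' );
// ...and rejected at parse time once the body opts into strict mode.
var strictOk = compiles( '"use strict"; with ( { a: 1 } ) { return a; }' );
```

If your project's scripts or modules run in strict mode, the short-variable or mixin approaches are the ones that remain available.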

Examples

Simple Password Validation

This example is taken from the article: Mastering Lookahead and Lookbehind.

regexGen.mixin( window );
var regex = regexGen(
    // Anchor: the beginning of the string
    startOfLine(),
    // Match: six to ten word characters
    word().multiple(6,10).
        // Look ahead: anything, then a lower-case letter
        contains( anything().reluctant(), anyCharOf(['a','z']) ).
        // Look ahead: anything, then an upper-case letter
        contains( anything().reluctant(), anyCharOf(['A','Z']) ).
        // Look ahead: anything, then one digit
        contains( anything().reluctant(), digital() ),
    // Anchor: the end of the string
    endOfLine()
);

Generates:

/^(?=.*?[a-z])(?=.*?[A-Z])(?=.*?\d)\w{6,10}$/

Matching an IP Address

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var d1 = group( anyCharOf( '0', '1' ).maybe(), digital(), digital().maybe() );
var d2 = group( '2', anyCharOf( ['0', '4'] ), digital() );
var d3 = group( '25', anyCharOf( ['0', '5'] ) );
var d255 = capture( either( d1, d2, d3 ) );
var regex = regexGen(
    startOfLine(),
    d255, '.', d255, '.', d255, '.', d255,
    endOfLine()
);

Generates:

/^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$/

Matching Balanced Sets of Parentheses

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regex = regexGen(
    '(',
    anyCharBut( '()' ).any(),
    group(
        '(',
        anyCharBut( '()' ).any(),
        ')',
        anyCharBut( '()' ).any()
    ).any(),
    ')'
);

Generates:

/\([^()]*(?:\([^()]*\)[^()]*)*\)/

Matching Balanced Sets of Parentheses within Any Given Levels of Depth

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
function nestingParentheses( level ) {
    if ( level < 0 ) {
        return '';
    }
    if ( level === 0 ) {
        return anyCharBut( '()' ).any();
    }
    return either(
            anyCharBut( '()' ),
            group(
                '(',
                nestingParentheses( level - 1 ),
                ')'
            )
        ).any();
}

Given 1 level of nesting:

var regex = regexGen(
    '(', nestingParentheses( 1 ), ')'
);

Generates:

/\((?:[^()]|\([^()]*\))*\)/
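
To see how the recursion unrolls, here is a minimal sketch that builds the same pattern from plain strings, without RegexGen.js. The nestingSource() helper is hypothetical; the check compares its output with the 1-level expression generated above, then anchors the pattern so it has to cover a whole string.

```javascript
// Hypothetical plain-string counterpart of nestingParentheses();
// it is not part of RegexGen.js.
function nestingSource( level ) {
    if ( level === 0 ) {
        return '[^()]*';
    }
    return '(?:[^()]|\\(' + nestingSource( level - 1 ) + '\\))*';
}

// The expression generated above for 1 level of nesting.
var generated = /\((?:[^()]|\([^()]*\))*\)/;
var built = '\\(' + nestingSource( 1 ) + '\\)';
var sameSource = ( built === generated.source );

// Anchor the pattern so that it must match the whole string.
var whole = new RegExp( '^' + built + '$' );
var oneLevel = whole.test( '(a(b)c)' );       // one nested pair inside the outer pair
var twoLevels = whole.test( '(a(b(c))d)' );   // one level deeper than the pattern allows
```

Each extra level wraps the previous pattern in one more alternation, which is why the generated expression grows quickly with depth.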

Given 3 levels of nesting:

var regex = regexGen(
    '(', nestingParentheses( 3 ), ')'
);

Generates:

/\((?:[^()]|\((?:[^()]|\((?:[^()]|\([^()]*\))*\))*\))*\)/

Matching an HTML Tag

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regex = regexGen(
    '<',
    either(
        group( '"', anyCharBut('"').any(), '"' ),
        group( "'", anyCharBut("'").any(), "'" ),
        group( anyCharBut( '"', "'", '>' ) )
    ).any(),
    '>'
);

Generates:

/<(?:"[^"]*"|'[^']*'|[^"'>])*>/

Matching an HTML Link

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regexLink = regexGen(
    '<a',
    wordBoundary(),
    capture(
        anyCharBut( '>' ).many()
    ),
    '>',
    capture(
        label( 'Link' ),
        anything().lazy()
    ),
    '</a>',
    ignoreCase(),
    searchAll()
);
var regexUrl = regexGen(
    wordBoundary(),
    'href',
    space().any(), '=', space().any(),
    either(
        group( '"', capture( anyCharBut( '"' ).any() ), '"' ),
        group( "'", capture( anyCharBut( "'" ).any() ), "'" ),
        capture( anyCharBut( "'", '"', '>', space() ).many() )
    ),
    ignoreCase()
);

Generates:

/<a\b([^>]+)>(.*?)<\/a>/gi
/\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))/i
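
The two generated expressions can be tried on a plain string, without a browser document. This is only a sketch: the sample HTML and the found array are made up for the example, and the regex literals are copied from the output above.

```javascript
// Regex literals copied from the generated output above.
var linkPattern = /<a\b([^>]+)>(.*?)<\/a>/gi;
var urlPattern = /\bhref\s*=\s*(?:"([^"]*)"|'([^']*)'|([^'">\s]+))/i;

// Made-up sample covering double-quoted, single-quoted and unquoted href values.
var sample = '<p><a href="https://example.com/a">First</a> and ' +
    '<A HREF=\'/b\'>Second</A> and <a href=/c>Third</a></p>';

var found = [];
var linkMatch, urlMatch;
// The g flag makes exec() resume after the previous match on each call.
while ( (linkMatch = linkPattern.exec( sample )) ) {
    urlMatch = urlPattern.exec( linkMatch[ 1 ] );
    found.push( {
        // Only one of the three capture groups is set, depending on the quoting.
        url: urlMatch ? ( urlMatch[ 1 ] || urlMatch[ 2 ] || urlMatch[ 3 ] ) : null,
        text: linkMatch[ 2 ]
    } );
}
```

Each href style lands in a different capture group of the second expression, which is why the three groups are combined with ||.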

Here's how to iterate over all links:

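// Use `match` rather than `capture`, so the mixed-in capture() function is not overwritten.
var match, guts, link, url, html = document.documentElement.outerHTML;
while ( (match = regexLink.exec( html )) ) {
    guts = match[ 1 ];
    link = match[ 2 ];
    // Reset url so that a link without an href does not report the previous one.
    url = undefined;
    if ( (match = regexUrl.exec( guts )) ) {
        url = match[ 1 ] || match[ 2 ] || match[ 3 ];
    }
    console.log( url + ' with link text: ' + link );
}

Examining an HTTP URL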

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regex = regexGen(
    startOfLine(),
    'http', maybe( 's' ), '://',
    capture( anyCharBut( '/:' ).many() ),
    group( ':', capture( digital().many() ) ).maybe(),
    capture( '/', anything() ).maybe(),
    endOfLine()
);

Generates:

/^https?:\/\/([^/:]+)(?::(\d+))?(\/.*)?$/
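
Outside a browser, the generated expression can be wrapped in a small function. The sketch below is an assumption-laden example: parseHttpUrl() is a hypothetical helper, the sample URLs are made up, and the fallbacks for port and path mirror the ones used elsewhere in this post. It also returns null when the input is not an HTTP(S) URL, since the match is null in that case.

```javascript
// Regex literal copied from the generated output above.
var httpUrl = /^https?:\/\/([^/:]+)(?::(\d+))?(\/.*)?$/;

// Hypothetical helper: split an HTTP(S) URL into host, port and path.
function parseHttpUrl( url ) {
    var match = httpUrl.exec( url );
    if ( !match ) {
        return null;   // not an http:// or https:// URL
    }
    return {
        host: match[ 1 ],
        port: match[ 2 ] || 80,    // same fallback as this post's snippet
        path: match[ 3 ] || '/'
    };
}

var full = parseHttpUrl( 'https://example.com:8080/a/b?x=1' );
var bare = parseHttpUrl( 'http://example.com' );
var other = parseHttpUrl( 'ftp://example.com/file' );
```

Note that a captured port is a string while the fallback is a number, and that the expression does not capture the scheme, so the fallback cannot distinguish http from https.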

Here's a snippet to report about a URL:

var capture = location.href.match( regex );
var host = capture[1];
var port = capture[2] || 80;
var path = capture[3] || '/';
console.log( 'host:' + host + ', port:' + port + ', path:' + path );

Validating a Hostname

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regex = regexGen(
    startOfLine(),
    // One or more dot-separated parts . . .
    either(
        group(
            anyCharOf( ['a', 'z'], ['0', '9'] ),
            '.'
        ),
        group(
            anyCharOf( ['a', 'z'], ['0', '9'] ),
            anyCharOf( '-', ['a', 'z'], ['0', '9'] ).multiple( 0, 61 ),
            anyCharOf( ['a', 'z'], ['0', '9'] ),
            '.'
        )
    ).any(),
    // Followed by the final suffix part . . .
    either(
        'com', 'edu', 'gov', 'int', 'mil', 'net', 'org', 'biz', 'info', 'name', 'museum', 'coop', 'aero',
        group( anyCharOf( ['a', 'z'] ), anyCharOf( ['a', 'z'] ) )
    ),
    endOfLine()
);

Generates:

/^(?:[a-z0-9]\.|[a-z0-9][-a-z0-9]{0,61}[a-z0-9]\.)*(?:com|edu|gov|int|mil|net|org|biz|info|name|museum|coop|aero|[a-z][a-z])$/

Parsing CSV Files

This example is taken from the book: Mastering Regular Expressions

regexGen.mixin( window );
var regex = regexGen(
    either( startOfLine(), ',' ),
    either(
        // Either a double-quoted field (with "" for each ")
        group(
            // double-quoted field's opening quote
            '"',
            capture(
                anyCharBut( '"' ).any(),
                group(
                    '""',
                    anyCharBut( '"' ).any()
                ).any()
            ),
            // double-quoted field's closing quote
            '"'
        ),
        // Or some non-quote/non-comma text....
        capture(
            anyCharBut( '",' ).any()
        )
    )
);

Generates:

/(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^",]*))/
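
To show how the two capture groups are meant to be read, here is a minimal sketch that extracts only the first field of a line. The firstField() helper and the sample lines are hypothetical. Looping over every field with the g flag needs extra care around zero-length matches, which this sketch deliberately leaves out.

```javascript
// Regex literal copied from the generated output above.
var csvField = /(?:^|,)(?:"([^"]*(?:""[^"]*)*)"|([^",]*))/;

// Hypothetical helper: return the first field of a CSV line.
function firstField( line ) {
    var match = csvField.exec( line );
    if ( match[ 1 ] !== undefined ) {
        // Group 1: a double-quoted field; collapse each "" back into a single ".
        return match[ 1 ].replace( /""/g, '"' );
    }
    // Group 2: an unquoted field (possibly empty).
    return match[ 2 ];
}

var quoted = firstField( '"say ""hi""",x' );
var plain = firstField( 'plain,x' );
```

The expression can always match an empty unquoted field at the start of the line, so exec() does not return null here.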