Go, Rust, Haxeによる正規表現エンジン実装の読み比べ

この記事はRust Language Advent Calendar 2014の20日目であると同時に、Haxe Advent Calendar 2014の20日目でもあります。

正規表現エンジン

正規表現エンジンのGoによる実装sre2、Rustによる実装libregex、そしてはHaxeによる実装HxREを読み比べます。

sre2は、もともとC++で実装されていたRE2をGoで実装したものです。

libregexは、Rustで実装された正規表現エンジンです。これはRE2をベースに実装されていていますが、独自の特徴として、マクロを使うことでコンパイル時に特殊化したコードを生成できることが挙げられます。

HxReは、Rustのlibregexを参考にHaxeで実装したもので、こちらもマクロによる特殊化機能を備えています。実験的な実装で未実装の機能も多いです。

仮想マシンアプローチによる正規表現マッチング実装

上記の正規表現エンジンは、正規表現を命令列に変換し、仮想マシンで実行することでマッチングを行っています。この手法についてはRegular Expression Matching: the Virtual Machine Approachで詳細に語られているので、これを読んでおくとlibregexの実装を理解するのに役立つでしょう。

ファイル構成

sre2の構成

ファイル	内容
ascii.go	文字クラスの定義
data.go	`RuneFilter`の定義
sparser.go	`SafeReader`の定義
regexp.go	正規表現をパースし命令列にする
simple.go	命令列を解釈実行する仮想マシン

libregexの構成

ファイル	内容
parse.rs	正規表現をパースし抽象構文木にする
compile.rs	抽象構文木をコンパイルし命令列にする
vm.rs	命令列を解釈実行する仮想マシン
re.rs	公開されるインターフェース
lib.rs	crateドキュメント

regex!マクロはlibregex_macro crateに分割されています。

HxReの構成

ファイル	内容
Types.hx	型の定義
Ast.hx	抽象構文木の定義
Parser.hx	正規表現をパースし抽象構文木にする
Inst.hx	命令の定義
Compiler.hx	抽象構文木をコンパイルし命令列にする
Thread.hx	仮想マシン上のスレッド
NfaVM.hx	命令列を解釈実行する仮想マシン
StringWindow.hx	文字列を読み込むためのオブジェクト
Regex.hx	公開されるインターフェース
Specializer.hx	特殊化された仮想マシンの生成

Rust, Haxe, Goの比較

代数的データ型とオブジェクト

抽象構文木や命令といったデータに対して、代数的データ型が活躍します。RustもHaxeも、代数的データ型を持っているので、これらのデータを自然に表現できます。

// Rust
#[deriving(Show, Clone)]
pub enum Inst {
    Match,
    OneChar(char, Flags),
    CharClass(Vec<(char, char)>, Flags),
    Any(Flags),
    EmptyBegin(Flags),
    EmptyEnd(Flags),
    EmptyWordBoundary(Flags),
    Save(uint),
    Jump(InstIdx),
}

// Haxe
enum Inst {
	AddThread;
	Jump(x : Index<Inst>);
	Split(x : Index<Inst>, y : Index<Inst>);
	PassTurnstile(i : Index<Bool>);
	SaveBegin(i : Index<Index<Char>>);
	SaveEnd(i : Index<Index<Char>>);
	OneChar(c : Char);
	CharClass(ranges : Array<Range<Char>>);
	Begin;
	End;
}

どちらの実装も、命令の種類の列挙になっています。この命令をどのように解釈するかは、このデータ型を扱う関数に任されています。

一方、代数的データ型を持たないGoによる実装sre2では、sre2ではパースの段階で抽象構文木を生成せず、直接命令列に変換しています。命令の型は、命令の機能を保持した構造体です。命令自身がその機能を持っているという点で、オブジェクト指向のプログラムになっています。

// Go
type instr struct {
	idx  int       // index of this instr
	mode instrMode // mode (as above)
	out  *instr    // next instr to process

	// alternate path, for iSplit
	out1 *instr

	// boundary mode, for iBoundaryCase
	lr boundaryMode

	// rune class to match against, for iRuneClass
	rf RuneFilter

	// identifier of submatch for iIndexCap
	cid   int    // numbered index
	cname string // string identifier (blank=none)
}

modeフィールドには、命令の種類が格納されています。

// instrMode describes a particular instruction type for the regexp internal
// state machine.
type instrMode byte

// Enum-style definitions for the instrMode type.
const (
	iSplit        instrMode = iota // proceed down out & out1
	iIndexCap                      // capturing start/end parenthesis
	iBoundaryCase                  // match left/right runes here
	iRuneClass                     // if match rune, proceed down out
	iMatch                         // success state!
)

VM

Goのsre2では、正規表現についての情報ほ保持する構造体sregexpのメソッドで、マッチング処理が行われています。

// sregexp struct. Just a list of states and a number of subexpressions.
type sregexp struct {
	prog []*instr // List of instruction states that comprise this RE.

	start int // start instr

	// Number of paired subexpressions [()'s], including the outermost brackets
	// (i.e. which match the entire string).
	caps int
}

func (r *sregexp) run(src string, submatch bool) (success bool, capture []int) {
	curr := makeStateList(len(r.prog))
	next := makeStateList(len(r.prog))
	parser := NewSafeReader(src)

	return r._run(curr, next, &parser, src, submatch)
}


func (r *sregexp) _run(curr *stateList, next *stateList, parser *SafeReader, src string, submatch bool) (success bool, capture []int) {
	// always start with state zero
	curr.addstate(parser, r.prog[r.start], submatch, nil)

	for parser.nextCh() != -1 {
		ch := parser.curr()
		if len(curr.states) == 0 {
			return false, nil // no more possible states, short-circuit failure
		}

		// move along rune paths
		for _, st := range curr.states {
			i := r.prog[st.idx]
			if i.match(ch) {
				next.addstate(parser, i.out, submatch, st.capture)
			}
		}
		curr, next = next, curr
		next.clear() // clear next so it can be re-used
	}

	// search for success state
	for _, st := range curr.states {
		if r.prog[st.idx].mode == iMatch {
			return true, st.capture.list(r.caps)
		}
	}
	return false, nil
}

Rustのlibregexでは、Nfa構造体のメソッドで同様の処理が行われています。

struct Nfa<'r, 't> {
    which: MatchKind,
    prog: &'r Program,
    input: &'t str,
    start: uint,
    end: uint,
    ic: uint,
    chars: CharReader<'t>,
}

impl<'r, 't> Nfa<'r, 't> {
    fn run(&mut self) -> CaptureLocs {
        // omitted 長いよ
    }

    fn step(&self, groups: &mut [Option<uint>], nlist: &mut Threads,
            caps: &mut [Option<uint>], pc: uint)
           -> StepState {
        // omitted
    }

    fn add(&self, nlist: &mut Threads, pc: uint, groups: &mut [Option<uint>]) {
        // omitted
    }

    // omitted
}

HaxeのhxREでは、NfaVMクラスのメソッドで処理を行うことにしました。

class NfaVM {
	public var ignoreCase (default, null) : Bool = false;
	public var multiline (default, null) : Bool = false;
	public var global (default, null) : Bool = false;

	public var lastIndex (default, null) : Int = 0;

	var prog : Null<Program>;
	var w : Window;
	var cts : Array<Thread>;
	var nts : Array<Thread>;
	var ncap : Index<Option<Index<Char>>>;
	var turnstile : Array<Bool>;
	var matchedThread : Option<Thread>;

	public function new(prog : Null<Program>, flags : Flags) {
		this.prog = prog;
		if (prog != null) {
			ncap = prog.ncap;
			turnstile = [for (i in 0 ... prog.nturnstile) false];
		}
		ignoreCase = flags.ignoreCase;
		multiline = flags.multiline;
		global = flags.global;
	}

	public function match(w : Window) : Bool {
		this.w = w;
		if (global) {
			w.index = lastIndex;
		}
		cts = [];
		nts = [];
		clearTurnstiles();
		matchedThread = None;

		while (true) {
			if (matchedThread == None) {
				cts.push(new Thread(0, [w.index], [], ncap));
			}

			if (cts.length == 0) {
				break;
			}

			for (t in cts) {
				if (run(t)) {
					break;
				}
			}

			if (terminal()) {
				break;
			}

			advance();
		}
		return matchedThread != None;
	}

	// omitted
}