WoW:HOWTO: Speed up string match lookups

From AddOn Studio

When you have a large number of patterns (dozens) to scan to find out which one matches a given string, there are a few things you can do to speed up the job.

If the patterns are hard-coded, there are of course any number of ways to be clever. It gets harder when you do not know what the patterns look like beforehand, which is the case when you match input strings against patterns in GlobalStrings.lua (http://wowprogramming.com/utils/xmlbrowser/live/FrameXML/GlobalStrings.lua, link marked as deprecated) using a format-string-to-pattern utility like BabbleLib's Deformat() function.


The approach below works by making lists of words used by patterns, and then looking at words in the input strings to determine which list(s) to look for matches in.

The process actually takes two passes. The first pass counts how often each word occurs across all patterns. The second pass files each pattern under its LEAST commonly used word only, which keeps the lists short.


  • Note: The example contains a very simplistic "MyDeformatterFunc()" that only converts "%s" to "(.*)". It does not handle other format specifiers or escape pattern magic characters, and it will not work for locales other than English. Please do not use it in the real world.
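
To illustrate what the simplistic converter leaves out, here is a minimal sketch of a slightly sturdier one. It is not BabbleLib's Deformat(); the name SketchDeformat is hypothetical, and it assumes the format string uses only "%s" and "%d" (no "%%", no positional specifiers such as "%1$s"). It escapes pattern magic characters first and anchors the result so that the whole input has to match.

```lua
-- Hypothetical helper, not part of any library: converts a format string
-- that uses only %s and %d into an anchored Lua pattern.
function SketchDeformat(fmt)
  -- Escape every pattern magic character, including "%" itself.
  local pattern = string.gsub(fmt, "([%^%$%(%)%%%.%[%]%*%+%-%?])", "%%%1")
  -- After escaping, "%s" reads "%%s" and "%d" reads "%%d".
  pattern = string.gsub(pattern, "%%%%s", "(.-)")
  pattern = string.gsub(pattern, "%%%%d", "(%%d+)")
  -- Anchor at both ends so partial matches are rejected.
  return "^" .. pattern .. "$"
end

local _, _, who, target, amount =
  string.find("Alice hits Bob for 12.", SketchDeformat("%s hits %s for %d."))
print(who, target, amount)
```

Because the pattern is anchored, the captures can use the non-greedy "(.-)" and still have to account for the entire input string.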


-- Functions that we want called for different string matches
function RoughPokeFunc(v1,v2) print("RoughPokeFunc "..v1.." "..v2); end
function SoftPokeFunc(v1,v2) print("SoftPokeFunc "..v1.." "..v2); end
function SoftNudgeFunc(v1,v2) print("SoftNudgeFunc "..v1.." "..v2); end
function ChickenFunc(v1,v2) print("ChickenFunc "..v1.." "..v2); end


-- Strings to match mapped to functions that we want called
MatchStrings = {
  ["%s roughly pokes %s"] = RoughPokeFunc,
  ["%s softly pokes %s"] = SoftPokeFunc,
  ["%s softly nudges %s"] = SoftNudgeFunc,
  ["%s gets nudged by %s and runs away screaming"] = ChickenFunc,
}

-- VERY simplistic deformatter function. 
-- You probably want a real deformatting library for this.
function MyDeformatterFunc(str)	
  return (string.gsub(str, "%%s", "(.*)"));
end


-- The loops below use pairs() and string.gmatch(), as required by Lua 5.1 and later.

-- First pass: count how many occurrences there are of each word
WordCounts = {}
for str,func in pairs(MatchStrings) do
  for word in string.gmatch(str, "[^ ]+") do
    -- ignore format specifiers such as "%s"
    if(not string.find(word, "^%%")) then
      WordCounts[word] = (WordCounts[word] or 0) + 1;
    end
  end
end

-- Second pass: for each string, pick the least common word and place string in that hash bucket
MatchStringsHash = {}
for str,func in pairs(MatchStrings) do
  local bestword, num;
  for word in string.gmatch(str, "[^ ]+") do
    -- ignore format specifiers such as "%s"
    if(not string.find(word, "^%%")) then
      if(not num or WordCounts[word] < num) then
        num = WordCounts[word];
        bestword = word;
      end
    end
  end

  -- fails for a string that consists of format specifiers only
  assert(bestword);

  if(not MatchStringsHash[bestword]) then MatchStringsHash[bestword] = {}; end
  MatchStringsHash[bestword][MyDeformatterFunc(str)] = func;
end

WordCounts = nil; -- now we don't need the counts anymore


-- Dump our MatchStringsHash on-screen so we can see what it looks like!
print "Examining hash buckets"
print "----------------------"
for word,strings in pairs(MatchStringsHash) do
  print("  "..word..":");
  for str,func in pairs(strings) do
    print("    \""..str.."\"");
  end
end


-- Function that scans for matches and calls the resulting function
function ScanForMatch(str)
  local bDone = false;
  local nCompares = 0;

  for word in string.gmatch(str, "[^ ]+") do
    if(MatchStringsHash[word]) then
      -- only the patterns filed under this word need to be tried
      for pattern,func in pairs(MatchStringsHash[word]) do
        nCompares = nCompares + 1;
        local success,_,v1,v2,v3,v4 = string.find(str, pattern);
        if(success) then
          func(v1,v2,v3,v4);
          bDone = true;
          break;
        end
      end
    end

    if(bDone) then break; end
  end

  print("  \""..str.."\": "..nCompares.." string.finds actually executed\n");
end

print("");
print("Executing!");
print("----------");


ScanForMatch("Alice roughly pokes Bob");
ScanForMatch("Bob softly pokes Charles");
ScanForMatch("Charles softly nudges Denise");
ScanForMatch("Denise gets nudged by Eve and runs away screaming");
ScanForMatch("This string does not exist");


Running the above produces output like the following (the order of the hash buckets may differ, since table traversal order is unspecified):

Examining hash buckets
----------------------
  roughly:
    "(.*) roughly pokes (.*)"
  nudges:
    "(.*) softly nudges (.*)"
  gets:
    "(.*) gets nudged by (.*) and runs away screaming"
  softly:
    "(.*) softly pokes (.*)"

Executing!
----------
RoughPokeFunc Alice Bob
  "Alice roughly pokes Bob": 1 string.finds actually executed

SoftPokeFunc Bob Charles
  "Bob softly pokes Charles": 1 string.finds actually executed

SoftNudgeFunc Charles Denise
  "Charles softly nudges Denise": 2 string.finds actually executed

ChickenFunc Denise Eve
  "Denise gets nudged by Eve and runs away screaming": 1 string.finds actually executed

  "This string does not exist": 0 string.finds actually executed

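For comparison, a plain linear scan over the same four patterns might look like the sketch below. The names NaivePatterns and NaiveScan are hypothetical and only serve to count comparisons; the patterns are written out already converted so that the block runs on its own.

```lua
-- The four example patterns, already converted from their format strings.
NaivePatterns = {
  "(.*) roughly pokes (.*)",
  "(.*) softly pokes (.*)",
  "(.*) softly nudges (.*)",
  "(.*) gets nudged by (.*) and runs away screaming",
}

-- Hypothetical baseline: try every pattern in turn.
-- Returns the number of string.find calls, plus the captures on a match.
function NaiveScan(str)
  local nCompares = 0
  for _, pattern in ipairs(NaivePatterns) do
    nCompares = nCompares + 1
    local found, _, v1, v2 = string.find(str, pattern)
    if found then return nCompares, v1, v2 end
  end
  return nCompares
end

print(NaiveScan("Charles softly nudges Denise"))  -- match found on the 3rd compare
print(NaiveScan("This string does not exist"))    -- all 4 patterns tried, no match
```

With the linear scan, an input that matches nothing costs one string.find per pattern, whereas the bucket lookup above rejects the same input without calling string.find at all.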

Problems with this approach

There is no guarantee as to the order in which the patterns will be tried.

For example, assume these two patterns:

  1. "%s hits %s."
  2. "%s hits %s hard."

Now, given the input string "Alice hits Bob.", only #1 will match, and all is good.

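But with the input string "Alice hits Bob hard.", both patterns can match, and there is NO guarantee which one will be used. You can get #1 with the arguments "Alice", "Bob hard", or you can get #2 with the arguments "Alice", "Bob". Which one wins depends on which bucket each pattern was filed under, which in turn depends on the whole pattern set, and on the order in which the tables are traversed, which Lua leaves unspecified.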
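
One possible way around this, sketched below under stated assumptions, is to try every candidate pattern instead of stopping at the first hit, and to keep the match with the longest pattern string as a rough proxy for "most specific". ScanForBestMatch and ExampleBuckets are hypothetical names; the bucket table has the same shape as MatchStringsHash, the handlers simply return a label, and the trailing period is escaped as "%." by hand.

```lua
-- Hypothetical buckets for the two patterns discussed above.
ExampleBuckets = {
  ["hits"]  = { ["(.*) hits (.*)%."] = function(a, b) return "hit", a, b end },
  ["hard."] = { ["(.*) hits (.*) hard%."] = function(a, b) return "hard hit", a, b end },
}

-- Try all candidate patterns and call the handler of the longest one that matches.
function ScanForBestMatch(str, hash)
  local bestLen, bestFunc, b1, b2, b3, b4 = -1, nil
  for word in string.gmatch(str, "[^ ]+") do
    local bucket = hash[word]
    if bucket then
      for pattern, func in pairs(bucket) do
        local found, _, v1, v2, v3, v4 = string.find(str, pattern)
        -- Prefer the longer pattern; ties keep whichever was seen first.
        if found and string.len(pattern) > bestLen then
          bestLen, bestFunc = string.len(pattern), func
          b1, b2, b3, b4 = v1, v2, v3, v4
        end
      end
    end
  end
  if bestFunc then return bestFunc(b1, b2, b3, b4) end
end

print(ScanForBestMatch("Alice hits Bob hard.", ExampleBuckets))
print(ScanForBestMatch("Alice hits Bob.", ExampleBuckets))
```

The trade-off is that the early exit is gone, so more string.find calls are executed per input, and two matching patterns of equal length are still resolved in an unspecified order.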