Riley on Life: benchmarks

Showing posts with label benchmarks. Show all posts

Thursday, July 31, 2014

Join Activity

I've been an expert system tool developer for the last twenty-nine years, but it's only been the last nine years that I've had a significant amount of ongoing year-to-year experience developing and maintaining programs from the other side as a user of expert system tools.

One of the pain points I encountered was trying to determine which rules were causing performance issues. A typical scenario would be that the rules would pass validation testing by the business users, but then weeks or months later an atypically large data file would be processed uncovering a performance issue (requiring hours for the data to be processed rather than seconds). Isolating the problem was accomplished by strategically deleting rules until the performance issue went away. The offending rule(s) were then inspected and analyzed with respect to the data to determine the cause of the problem. This was hardly an ideal approach and only worked best if there were a few rules causing the issue.

From a user perspective, It would be nice if you could profile rule performance much as you can profile the amount of time spent in functions/methods in languages such as C and Java. In fact, CLIPS does have profiling functions which allow you to track the amount of time spent in the actions of a rule. Unfortunately, the performance issues in rules are usually caused by the way the conditions have been written. Because rule conditions can be shared, it's difficult to assign time spent while pattern matching to a specific rule.

When I was developing performance improvements for CLIPS 6.3, I added code to track the number of operations performed on partial matches in the join network to gauge the benefit of using hash tables rather than linked lists for the join memories. It occurred to me that this could be a useful metric for identifying performance issues, and from this idea came the join-activity command in CLIPS 6.3.

This new command easily identifies the offending rule in the manners benchmark program:

CLIPS> (clear)
CLIPS> (unwatch all)
CLIPS> (load manners.clp)
:%%%%%%%********
TRUE
CLIPS> (reset)
CLIPS> (load-facts manners128.fct)
TRUE
CLIPS> (run)
CLIPS> 
(progn$ 
   (?r (get-defrule-list)) 
   (printout t ?r ": " (join-activity ?r terse) crlf))
assign_first_seat: (1314 1315 1315)
find_seating: (5847660 7067211 7067211)
make_path: (49549 16510 16510)
path_done: (127 254 254)
are_we_done: (128 255 255)
continue: (0 127 127)
print_results: (257 258 128)
all_done: (0 1 0)
CLIPS>

When used in terse mode, the join-activity command returns a multifield value containing the number of comparisons made between the left and right memories, the number of partial matches added to the join memories, and the number of partial matches removed from the join memories. For this program, the find_seating rule has much more join activity than all the other rules combined.

Using the verbose mode allows you to examine the information for each join in the rule.

CLIPS> (join-activity find_seating verbose)
Activity for CE 1
   Compares:          0
   Adds:            127
   Deletes:         127
Activity for CE 2
   Compares:       8128
   Adds:           8128
   Deletes:        8128
Activity for CE 3
   Compares:      28255
   Adds:          28255
   Deletes:       28255
Activity for CE 4
   Compares:    2487321
   Adds:        1241614
   Deletes:     1241614
Activity for CE 5
   Compares:    2483228
   Adds:        2483228
   Deletes:     2483228
Activity for CE 6
   Compares:     823428
   Adds:        1660994
   Deletes:     1660994
Activity for CE 7
   Compares:      17300
   Adds:        1644865
   Deletes:     1644865
(5847660 7067211 7067211)
CLIPS>

In conjunction with the sort function, you can create a ranked listing of all the rules. Here's an example using the waltz benchmark.

CLIPS> (clear)
CLIPS> (unwatch all)
CLIPS> (load waltz.clp)
%%%%:!!!!!!********************************
TRUE
CLIPS> (reset)
CLIPS> (load-facts waltz50.fct)
TRUE
CLIPS> (run)
CLIPS> 
(deffunction ja-lt (?r1 ?r2)
   (< (+ (expand$ (join-activity ?r1 terse))) 
      (+ (expand$ (join-activity ?r2 terse)))))
CLIPS>
(progn$ (?r (sort ja-lt (get-defrule-list)))
   (format t "%-35s %10d%n" 
             (str-cat ?r ":")
             (+ (expand$ (join-activity ?r terse)))))
initial_boundary_junction_L:           1165683
initial_boundary_junction_arrow:        511101
make_L:                                 163364
make-3_junction:                        128960
label_arrow-1B:                          59799
label_tee_B:                             55991
label_arrow-2B:                          39455
label_arrow-1A:                          35692
match_edge:                              32072
second_boundary_junction_L:              26984
label_tee_A:                             17095
label_arrow-2A:                          16217
label_L:                                 14926
label_arrow-4A:                          13811
label_arrow-4B:                          13811
label_fork-1:                            12727
reverse_edges:                           11618
plot_boundaries:                          9830
second_boundary_junction_arrow:           9641
done_reversing:                           3878
label_fork-2:                             2213
label_fork-3:                             2213
label_arrow-3A:                           1838
label_arrow-3B:                           1838
label_arrow-5A:                           1838
label_arrow-5B:                           1838
plot_remaining:                           1808
label_fork-4:                              614
done_detecting:                              7
done_plotting:                               5
done_labeling:                               2
begin:                                       0
""
CLIPS>

In this case I've added the compares, adds, and deletes together to create a single join activity value for each rule displayed in the output. Hopefully this new command will provide CLIPS developers with more options for analyzing the performance of their rules.

Thursday, May 8, 2014

River Benchmark

I came across another river crossing problem that’s similar to the farmer’s dilemma example problem for CLIPS. It’s a little bit more complex by virtue of the number of things that have to be moved, but it’s essentially the same type of problem.

One of the issues with existing benchmarks such as waltz and manners is validation. The manners benchmark only runs in CLIPS with the depth conflict resolution strategy and the waltz benchmark executes different numbers of rules depending upon the conflict resolution strategy chosen.

Clearly this is an issue. I'm a proponent of having lots of benchmarks, rather than one or two, but in order to have lots of benchmarks, they need to be dead simple to translate and verify. In the case of the existing benchmarks, they're not.

What this means is that you have to design the benchmarks with this is mind. I thought it would be an interesting exercise to demonstrate how this can be done. So I wrote a CLIPS program to solve a variation of the river crossing problem and once I had it working, set the conflict resolution strategy to random to see if the same number of rules were executed with each run. There weren’t.

It took several iterations before I had a version that produced the same number of rule executions regardless of the order in which rules of the same priority/salience were placed on the agenda. The primary mechanism used to get an exact number of rules executed was to assign weights to each of the possible moves that could be made so that the search was always made in the same order. In total, there were 19 rules and 5 salience values for the different groups of rules. If I’d used modules (which would have made translation to other languages more difficult), there wouldn’t have been any need for salience values at all.

Like the manners and waltz programs, the river program runs considerably faster in version 6.3 of CLIPS than in version 6.24. On a 2.66 GHz Intel Core 2 Duo it completes in 0.7 seconds in the newer version as opposed to 29 seconds in the older version.

The river program is available here.

Monday, March 17, 2014

Manners and Waltz Benchmarks

The two most widely used benchmarks for rule-based systems are Manners and Waltz. Daniel Selman nicely summarizes the major issues with these benchmarks in The Good, The Bad, and the Ugly - Rule Engine Benchmarks.

CLIPS didn’t implement the optimizations needed to run these benchmarks efficiently until version 6.30, so it compared unfavorably to other engines when these were the only metrics used for comparing performance.

As the following graphs show, the performance of CLIPS 6.30 is orders of magnitude faster than CLIPS 6.24 for the larger data sets used by these benchmarks.

In particular, hashing the memory nodes and optimizations for handling large numbers of activations caused the dramatic improvement in the benchmark results.

How did these optimizations improve the performance of a real world application? I benchmarked some of the larger sample data sets for a production system I developed that’s run hundreds of thousands of time a month. The system consists of hundreds of rules and the amount of data processed can range up to ten thousand or more facts.

Each of the samples showed improvement, but not nearly as dramatic as Manners or Waltz:

A process is created and the rules loaded each time the system is run, so a more accurate picture of the total processing time would include the time to load the rules:

There’s nothing wrong with modest improvements, but if your expectations of performance were based on Manners and Waltz, you’d surely be disappointed.

That’s not to say there weren’t performance benefits of the 6.30 optimizations in real world situations. Occasionally, I’d write one or more rules that were efficient for a small data set, but acceptance testing did not include a large data set. The system would run fine until a sufficiently large data set with the appropriate types of facts was submitted, at which point the process would display non-optimized Manners/Waltz behavior (i.e. it would appear to hang).

When using CLIPS 6.24, I rewrote the offending rules to be more efficient for large data sets. With 6.30, since the system is more tolerant of inefficient rules, these situations occur less frequently and are easier to correct.

CLIPS versions of the Manners and Waltz benchmarks are available here and here.

Tuesday, January 7, 2014

Pattern Matching Constants in CLIPS

In CLIPS, there are three basic templates for matching a constant in a rule pattern. The constant, in this case 387, can be placed directly in the pattern as a literal constraint:

(defrule rule-387
   (data (value 387))
   =>)

A predicate constraint can be used to evaluate an expression testing for equality:

(defrule rule-387
   (data (value ?x&:(eq ?x 387)))
   =>)

A test conditional element can be used to evaluate an expression testing for equality:

(defrule rule-387
   (data (value ?x))
   (test (eq ?x 387))
   =>)

An empirical analysis can be performed to determine which template is most efficient as the number of rules and distinct constants increase. A common set of constructs will be used with each group of rules to trigger pattern matching as a single fact is repeatedly modified:

(deftemplate data
   (slot index)
   (slot value))

(deffacts start
   (data (index 1000000) (value 100000)))

(defrule loop
   (declare (salience 1))
   ?f <- (data (index ?i&~0))
   =>
   (modify ?f (index (- ?i 1)) (value (- ?i 1))))

Using CLIPS 6.24 to run the common rules in conjunction with groups of 1, 100, 400, 700, and 1000 rules using each of the three templates produces the following results.

The literal constraint is the most efficient, which is what you’d intuitively expect after an examination of the code. It directly compares a value in a fact to a value stored in the pattern network.

The predicate constraint and test conditional element are less efficient as they require a function evaluation. The predicate constraint is slightly more efficient because it’s evaluated at an earlier stage of pattern matching than the test conditional element. If all of the referenced variables in a predicate constraint are bound within the pattern, then the evaluation can be performed in the pattern network. Test conditional elements and expressions containing variables bound in other patterns are evaluated in the join network, which is primarily used to unify variable bindings across multiple patterns.

CLIPS 6.30 provides significantly better performance for literal constraints:

In situations where multiple patterns satisfy the criteria for sharing, literal constraints are not evaluated one at a time. Instead the next node in the pattern network is determined using a hashing algorithm which requires just a single computation regardless of the number of literal constraints. So for this particular benchmark, the execution time does not increase for literal constraints as the number of rules and distinct literals is increased.

Bottom line: regardless of which version of CLIPS you’re using, use literal constraints rather than predicate constraints or test conditional elements.

The code for these benchmarks is available here.

Monday, September 16, 2013

Changes

When change happens gradually over a long period of time, it’s easy to lose sight of how far you’ve come from where you started. When I first started working with CLIPS in 1985, I used the monkey and banana (MAB) problem as a benchmark for improving performance of the rule engine. The original version had 20 rules, 5 initial facts, and executed 14 rules. To make the program a bit more complex, I added chests and keys so that objects could be locked away and the monkey required additional planning to gain access to them. This extended program had 32 rules, 13 initial facts, and executed 81 rules. It’s a toy problem and today it seems silly that I used it for benchmarking anything, but it was at the time invaluable in improving the performance of CLIPS as well as comparing the performance of various rule engines and hardware platforms. The usefulness of this benchmark is more apparent if you view it as measuring best case performance rather than performance under load as other benchmarks such as Manners and Waltz measure.

In 1986 even the state of the art Automated Reasoning Tool (ART) software running on expensive Lisp machines executed MAB at only 72 rules per second. CLIPS running on a VAX-11/780 clocked in at 16.8 rules per second and OPS5 on this same platform executed 62.3 rules per second. When I first got CLIPS running on an IBM AT, execution speed was less than 1 rule per minute. You read that right—it took over an hour for the monkey to get his hands on the banana. CLIPS dynamically allocates and deallocates memory as it executes and in those days—when memory on PCs was more often measured in KB rather than MB or GB—the memory allocation libraries were often painfully inefficient. It was only after I removed some of the unnecessary allocations/deallocations and had CLIPS cache deallocated memory that I was able to improve the performance to 4.3 rules per second.

Over 25 years later, CLIPS clocks in at well over 100,000 rules per second running MAB on my laptop. Think about that for a minute. That’s over 25,000 times faster. Twenty. Five. Thousand. And most of that, at least for MAB, is attributable to the hardware (and possibly the compiler/operating system), not improvements to CLIPS (which at most might have improved performance for MAB by a factor of 2 to 4 times). Consider that improvement in terms of the problems you can solve today that you couldn’t solve 25 years ago.

And it’s not just that hardware is faster—it’s also cheaper. I was downright ecstatic when I got a Macintosh IIfx in the early 90s at NASA. Programmers hate waiting for code to compile so we always want the fastest computer available. At the time the IIfx was “wicked fast” with a 40 MHz CPU and up to 128 MB of memory. It also had a hefty price of around $10,000 (roughly $16,000 in today’s dollars adjusting for inflation). After I left NASA and was working from home, I bought a Mac clone from Power Computing in 1997 for $4700 (roughly $6700 today). Since that time, my desire to have the fastest computer available has melted away as the lower end computers became fast enough for almost everything I do. The last desktop computer I bought was an iMac in 2009 for $1500 (roughly $1600 today). It came with a 2.66 GHz processor and 4 GB of memory. I upgraded to 8 GB of memory in 2011 for around $115 when I started running Windows in a virtual machine. Occasionally there’s some sluggishness when switching between the virtual machine and Mac OS X—which could probably be fixed by adding more RAM (if alas the computer weren’t already maxed out)—but for the most part this computer still feels new.

Usually after around four to five years, as the operating system and applications introduce new functionality that increasingly taxes the hardware, sluggishness becomes noticeable to the point that it’s easy for me to justify a new computer based on productivity alone. Yet I’m at the point where I usually start thinking about a replacement, but feel no serious need for one. If I feel that way and use a computer most of the day for programming, I wonder how the average consumer feels about replacing their computer when all they do is use it for activities like email and surfing the web. Obviously traditional computers (desktops and laptops) aren’t in danger of extinction, but if I were a company that just made these devices (and not smart phones and tablets), I’d be worried about my long term profitability.

Which gets me back to my original point. We had computers 25 years ago and we have them today, but boy have they changed. With smart phones, we literally walk around with computers having tricorder like functionality in our pockets. What will I be comparing these devices to in 25 years?