Wanda experiments for FSCD 2020

These experiments were done for the paper "WANDA – a Higher Order Termination Tool", presented at FSCD 2020. All experiments were performed with WANDA version 2.2, with a 60-second timeout.

We performed four series of experiments, three of which are (partially) reported in the paper. In the first, we evaluated WANDA on the Termination Problem Database (TPDB), version 10.6, using several configurations to evaluate the power of the various techniques described in the paper. This gives the following results:

| Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
|---|---|---|---|---|---|
| WANDA without restrictions | ./wanda.exe | 188 | 16 | 78% | 1.14 |
| Only rule removal | ./wanda.exe -d nt,dp | 123 | 0 | 47% | 1.13 |
| Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp | 111 | 0 | 42% | 0.24 |
| Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp | 59 | 0 | 23% | 0.07 |
| Only dependency pairs | ./wanda.exe -d rr,nt | 186 | 0 | 71% | 1.02 |
| Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt | 152 | 0 | 58% | 0.55 |
| Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt | 167 | 0 | 64% | 1.30 |
| Full WANDA, but without a first-order tool | ./wanda.exe -i none | 183 | 9 | 74% | 0.90 |
| Full WANDA, but without simplifying fake patterns | ./wanda.exe --dontsimplify | 155 | 16 | 66% | 0.75 |

(full result table for the TPDB 2018)
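For reference, here is a minimal sketch (in Python) of the kind of harness that could reproduce such a table. The benchmark path, the .afs file extension, and the assumption that WANDA prints its verdict (YES, NO or MAYBE) as the first token of its output are placeholders rather than WANDA's documented interface; the Perc. column is computed here as the fraction of benchmarks with a definite YES or NO answer, which appears to match the numbers above.

```python
import subprocess
import time
from pathlib import Path

TIMEOUT = 60  # seconds, as in all experiments


def run_configuration(benchmark_dir, wanda_args=()):
    """Run ./wanda.exe on every benchmark file and tally the verdicts."""
    counts = {"YES": 0, "NO": 0, "MAYBE": 0, "TIMEOUT": 0}
    total_time = 0.0
    benchmarks = sorted(Path(benchmark_dir).rglob("*.afs"))  # extension assumed
    if not benchmarks:
        raise ValueError("no benchmarks found in " + str(benchmark_dir))
    for benchmark in benchmarks:
        start = time.monotonic()
        try:
            result = subprocess.run(
                ["./wanda.exe", *wanda_args, str(benchmark)],
                capture_output=True, text=True, timeout=TIMEOUT,
            )
            # Assumption: the verdict is the first token of WANDA's output.
            tokens = result.stdout.split()
            verdict = tokens[0] if tokens and tokens[0] in counts else "MAYBE"
        except subprocess.TimeoutExpired:
            verdict = "TIMEOUT"
        counts[verdict] += 1
        total_time += time.monotonic() - start
    solved = counts["YES"] + counts["NO"]
    counts["Perc."] = round(100.0 * solved / len(benchmarks))
    counts["Avg. time"] = total_time / len(benchmarks)  # averaged over all benchmarks here
    return counts


# Example: the "only rule removal" configuration from the table above.
# The benchmark directory name is hypothetical.
print(run_configuration("TPDB", ["-d", "nt,dp"]))
```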

The second set of experiments is not included in the paper. It considers the TPDB without the two subdirectories Uncurried_Applicative_11 and Hamana_Kikuchi_18. The reason for this separation is that these two directories seem to consist largely of generated benchmarks: the same few rules (mostly fold, rec and map) are repeated many times, coupled with different first-order systems or called in slightly different ways. While it is interesting to see how tools perform on these variations, including them carries a significant risk of skewing the results. We therefore isolated the results on the remaining benchmarks, which allows us to assess WANDA's power on genuinely higher-order systems more accurately. This gives the following results:

| Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
|---|---|---|---|---|---|
| WANDA without restrictions | ./wanda.exe | 67 | 5 | 82% | 0.45 |
| Only rule removal | ./wanda.exe -d nt,dp | 39 | 0 | 44% | 0.27 |
| Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp | 29 | 0 | 33% | 0.12 |
| Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp | 27 | 0 | 31% | 0.06 |
| Only dependency pairs | ./wanda.exe -d rr,nt | 66 | 0 | 66% | 0.69 |
| Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt | 51 | 0 | 58% | 0.35 |
| Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt | 54 | 0 | 61% | 0.81 |
| Full WANDA, but without a first-order tool | ./wanda.exe -i none | 67 | 4 | 81% | 0.37 |
| Full WANDA, but without simplifying fake patterns | ./wanda.exe --dontsimplify | 64 | 5 | 78% | 0.63 |

(full result table for TPDB 2018, with two directories removed)
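A minimal sketch of how such a filtered run could be obtained; the file extension is again an assumption, and the helper pairs with the (hypothetical) harness sketched earlier.

```python
from pathlib import Path

# The two largely generated subdirectories excluded from this second table.
EXCLUDED = {"Uncurried_Applicative_11", "Hamana_Kikuchi_18"}


def filtered_benchmarks(root):
    """All benchmark files that lie outside the two excluded directories."""
    return [path for path in sorted(Path(root).rglob("*.afs"))  # extension assumed
            if EXCLUDED.isdisjoint(path.parts)]
```

In the harness sketched earlier, this list would take the place of the sorted(...) call that collects the benchmark files.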

As the remaining benchmarks do not have significant first-order parts, the first-order tool makes very little difference. It is also notable that StarHorpo and Polynomial Interpretations now perform much more comparably, as do static and dynamic dependency pairs.

The third set of experiments is mentioned in the paper, although the tables were not included. This set tests WANDA's performance when coupled with different first-order termination back-ends. The first three tests couple WANDA with AProVE, NaTT and MU-TERM respectively, for both termination and non-termination; the fourth uses NaTT for termination and AProVE for non-termination. The first table below considers all benchmarks in the TPDB; the second considers all except the most recently added category, since the difference between the two is striking.

| Setup | WANDA configuration | #Yes | #No | #Maybe | #Timeout | Perc. | Avg. time (s) |
|---|---|---|---|---|---|---|---|
| AProVE for termination and non-termination | ./wanda.exe | 188 | 16 | 24 | 33 | 78% | 1.20 |
| NaTT for termination and non-termination | ./wanda.exe -i nattprover -n nattprover | 201 | 10 | 35 | 15 | 81% | 1.87 |
| MU-TERM for termination and non-termination | ./wanda.exe -i mutermprover -n mutermprover | 188 | 9 | 31 | 33 | 75% | 0.87 |
| NaTT for termination, AProVE for non-termination if NaTT fails | ./wanda.exe -i nattprover | 200 | 17 | 24 | 20 | 83% | 1.88 |
| No first-order tool | ./wanda.exe -i none | 183 | 9 | 46 | 23 | 74% | 0.89 |

(full result table for WANDA with various first-order tools on the TPDB 2018)
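The fourth configuration combines two back-ends: NaTT is asked to prove termination of first-order subproblems, and only when it fails is AProVE given a chance to prove non-termination. Below is a minimal sketch of that fallback logic, where prove(tool, problem) is a hypothetical helper, returning "YES", "NO" or "MAYBE", that stands in for the actual tool invocations; this is not WANDA's internal API.

```python
def solve_first_order(problem, prove):
    """Try NaTT first; fall back to AProVE for non-termination only.

    `prove(tool, problem)` is a hypothetical helper returning
    "YES", "NO" or "MAYBE"; it stands in for the actual tool calls.
    """
    verdict = prove("NaTT", problem)
    if verdict == "MAYBE":
        # NaTT could not decide; AProVE may still prove non-termination.
        if prove("AProVE", problem) == "NO":
            return "NO"
    return verdict


# Toy demonstration with a stub prover that always answers MAYBE.
print(solve_first_order("some first-order problem", lambda tool, p: "MAYBE"))
```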

| Setup | WANDA configuration | #Yes | #No | #Maybe | #Timeout | Perc. | Avg. time (s) |
|---|---|---|---|---|---|---|---|
| AProVE for termination and non-termination | ./wanda.exe | 155 | 16 | 21 | 6 | 86% | 0.72 |
| NaTT for termination and non-termination | ./wanda.exe -i nattprover -n nattprover | 154 | 9 | 32 | 3 | 82% | 0.37 |
| MU-TERM for termination and non-termination | ./wanda.exe -i mutermprover -n mutermprover | 155 | 9 | 28 | 6 | 83% | 0.38 |
| NaTT for termination, AProVE for non-termination if NaTT fails | ./wanda.exe -i nattprover | 153 | 16 | 21 | 8 | 85% | 0.41 |
| No first-order tool | ./wanda.exe -i none | 150 | 9 | 36 | 3 | 80% | 0.41 |

(full result table for WANDA with various first-order tools on the TPDB 2017)

It is notable that NaTT's advantage is very localised: on the benchmarks in the 2017 TPDB, all three tools perform comparably. It seems likely that later updates of MU-TERM and AProVE will also handle the systems in Hamana_Kikuchi_18/. Otherwise, investigating the cases where a first-order tool is needed to conclude YES, we observe that all of these benchmarks have a simple higher-order part: in every case, a proof with static dependency pairs and the subterm criterion (or computable subterm criterion) suffices for the higher-order part. This may suggest that using a first-order tool for termination analysis is not worthwhile when the higher-order part has a different form. However, it might also just be an artifact of the database: for most benchmarks, a first-order tool is not needed at all, and the few where it is needed do not form a sufficiently diverse sample to draw any confident conclusions.

The final set of experiments considers WANDA's power on the Confluence Problems database (COPS), 2019 version. As WANDA cannot read HRS input, the benchmarks were converted by CSI^ho, with some manual changes (these were necessary because, due to a lack of documentation of WANDA's input restrictions, some of the generated AFSMs could not be read by WANDA). We used the same configurations as before, except for the "simplifying fake patterns" configuration, which only makes sense for AFS input.

| Technique | WANDA configuration | #Yes | #No | Perc. | Avg. time (s) |
|---|---|---|---|---|---|
| WANDA without restrictions | ./wanda.exe --betafirst | 43 | 30 | 78% | 0.09 |
| Only rule removal | ./wanda.exe -d nt,dp --betafirst | 37 | 0 | 40% | 0.11 |
| Only rule removal with StarHorpo | ./wanda.exe -d poly,nt,dp --betafirst | 33 | 0 | 35% | 0.17 |
| Only rule removal with Polynomial Interpretations | ./wanda.exe -d horpo,nt,dp --betafirst | 21 | 0 | 23% | 0.01 |
| Only dependency pairs | ./wanda.exe -d rr,nt --betafirst | 43 | 0 | 46% | 0.51 |
| Only static dependency pairs | ./wanda.exe -d dynamic,rr,nt --betafirst | 37 | 0 | 40% | 0.26 |
| Only dynamic dependency pairs | ./wanda.exe -d static,rr,nt --betafirst | 40 | 0 | 43% | 0.77 |
| Full WANDA, but without a first-order tool | ./wanda.exe -i none --betafirst | 43 | 30 | 78% | 0.07 |

(full result table for COPS 2019 (only pattern HRSs))

The longer average time when using dependency pairs is largely due to a single benchmark (433), for which rule removal is very effective, but on which the non-monotonic reduction pair processor takes much longer.