Today, AI systems replace humans in an increasing number of decisions affecting people's lives.
Therefore, it is important to evaluate the performance of such systems {\it offline}, i.e., before they are deployed in real settings,
and to compare it to the performance of the human decisions they aim to replace.
The data on which such an evaluation is performed poses two major challenges, each of which biases any direct evaluation of the decision makers under consideration.
First, in most cases the data does not include all factors that play a role in the decisions recorded in it.
Second, the past decisions recorded in the data may skew the data and, in particular, the outcomes recorded in it.
%Another major challenge in such cases is that often past decisions have skewed the data on which the evaluation is performed.
For example, when a bank decides whether to grant a customer a loan, it wants to grant loans to customers who would honor the loan conditions, but not to those who would violate them.
However, we can directly evaluate only the decisions to grant a loan, because we cannot observe whether customers who were denied a loan would indeed have violated its conditions.
Such bias appears in the decisions of both human and AI decision makers, and it should be properly taken into account in the evaluation.
%Further complications arise since commonly not all features that the decisions are based on are observed. DISCUSS UNOBSERVABLES IN THE INTRO?
We use a detailed causal model of the decision-making process that also takes unobserved features into account.
Based on this model, we evaluate counterfactual outcomes to correct for the aforementioned biases.
% to infer unobserved outcomes.
Compared to previous methods for this setting, the approach estimates the quality of decisions more accurately and with lower variance.
The approach is also %demonstrated to be
robust to different variations in the decision mechanisms in the data.
