Status Planned
Created by Avi Yashchin
Created on Mar 12, 2024

Measure real-time synthetic respondent forms against real respondents. Evaluate real-time statistical power.

A key issue is convincing an independent observer that a generated respondent produces results that are better than a random guess.


Suppose that we have a study, and in the course of this study respondents complete a standard form. Suppose that we have 2000 respondents.

Now, for every respondent, we will produce a matching artificial respondent that has exactly the same covariates. E.g., suppose that respondent "Resp1" completed a form F resulting in F1. We have his covariates Cov1 (e.g., demographic data), either as part of F1 or as a separate attachment.


Now we will generate an artificial respondent, whom we will call Resp1G. This respondent has the same covariates Cov1; however, the answers to the questions in the form F1G are obtained from ChatGPT.


To evaluate whether the generated respondents produce forms that are similar to those of real respondents, we need to introduce a distance d(F1G, F1) between forms F1G and F1. Ideally, F1G = F1, in which case the distance is 0. However, we need a general formula, and we will use


d(F1G, F1) = Sum_k {w_k * d(F1G_k, F1_k)} (1)


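where F1_k represents the k-th field of the completed form F1, F1G_k is the corresponding field of F1G, and w_k is the weight reflecting the importance of field k. We could agree, for example, that all the weights are positive and sum to 1, i.e.


Sum_k (w_k) = 1. In what follows, we will assume for simplicity that all fields are equally important and set w_k = 1 for every k. This drops the normalization, but it only rescales the distance by a constant factor and does not affect the per-field comparisons below.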
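
As a rough illustration of formula (1), here is a minimal Python sketch. The field names (q1, q2, q3), the sample answers, and the use of an absolute difference as the per-field distance are all assumptions made for the example; the formula itself does not prescribe them.

```python
def form_distance(generated, real, weights, field_distance=lambda a, b: abs(a - b)):
    """Weighted sum of per-field distances, following formula (1).

    generated, real: dicts mapping a field name to a numeric answer.
    weights: dict mapping a field name to its weight w_k.
    field_distance: per-field distance; absolute difference is an assumed default.
    """
    return sum(weights[k] * field_distance(generated[k], real[k]) for k in real)


# Hypothetical three-field form with answers coded on numeric scales.
real_form = {"q1": 1, "q2": 0, "q3": 2}
generated_form = {"q1": 2, "q2": 0, "q3": 0}

unit_weights = {"q1": 1, "q2": 1, "q3": 1}                # all w_k = 1
normalized_weights = {"q1": 0.5, "q2": 0.25, "q3": 0.25}  # positive, sum to 1

print(form_distance(generated_form, real_form, unit_weights))        # 3
print(form_distance(generated_form, real_form, normalized_weights))  # 1.0
```

Passing the per-field distance as a parameter keeps the sketch usable for fields where an absolute difference makes no sense (e.g., unordered categories).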


For example, if the k-th field is the question:


F_k: How would you rate having an embedded hand sanitizer dispenser in a car:


0 = not important

1 = moderately important (2)

2 = very important


and the first respondent (Cov1 = 21-year-old college-educated male) responded F1_k = 1, then we would generate an artificial respondent of type Cov1 and obtain his response from ChatGPT, say F1G_k = 2.


Now we need to agree on a distance d(F1G_k, F1_k). The simplest way is to take the absolute value of the difference between the two responses on the numerical scale stated above:


d(F1G_k, F1_k) = |2 - 1| = 1. (3)


So, based on the field k, the distance of Resp1G from Resp1 is 1.

We could agree that if Resp1 did not respond to some field k, then Resp1G is treated as not having responded to it either, and so for such a field d(F1G_k, F1_k) = 0. Note that in real applications we can still use the field F1G_k; we simply do not use it for computing the distance, since F1_k is empty.
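
The per-field rule, including the convention for unanswered fields, could be sketched as follows. Representing an empty field as None is an assumption of this example, and so is raising an error when the generated answer is missing while the real one is present (the text does not cover that case).

```python
from typing import Optional


def field_distance(generated: Optional[int], real: Optional[int]) -> int:
    """Absolute difference between two coded answers, as in (3).

    An unanswered real field (None, an assumed encoding) contributes 0,
    whatever the generated respondent answered.
    """
    if real is None:
        return 0
    if generated is None:
        # Not specified in the text; treated as an error in this sketch.
        raise ValueError("generated answer missing for an answered field")
    return abs(generated - real)


print(field_distance(2, 1))     # 1, the example in (3)
print(field_distance(2, None))  # 0, Resp1 left the field empty
```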


Thus, we have a measure of discrepancy between the responses of Resp1 and Resp1G. Suppose that the observed distance is 1, as in (3). Should we be excited about it? Not really, because we can compare our distance 1 in (3) to what a random agent would produce. Such an agent answers


0, 1, 2, each with probability 1/3. His expected distance from F1_k = 1 is therefore |0 - 1| * (1/3) + |1 - 1| * (1/3) + |2 - 1| * (1/3) = 2/3 ≈ 0.67 < 1. In other words, if Resp1G is not able to deliver the same response 1 as Resp1, he would be considered less capable than the purely random agent.

Therefore, the inefficiency of Resp1G with respect to the field k turns out to be


Ineff_k (Resp1G, Resp1) = 1 / (2/3) = 1.5


i.e., the ratio of the distance delivered by the generated Resp1G to the average distance of a random agent.


Note that if the responses F1G_k and F1_k were reversed, i.e. F1G_k = 1, F1_k = 2, then the distance d(F1G_k, F1_k) = |1 - 2| = 1, as before. However, the measure of inefficiency is more favorable, since a random agent would produce an average distance


|2 - 2| * (1/3) + |1 - 2| * (1/3) + |0 - 2| * (1/3) = 1, and thus


Ineff_k (Resp1G, Resp1) = 1 / 1 = 1.
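
Both inefficiency calculations above can be reproduced with a short Python sketch. It assumes the 0/1/2 scale from the example question and a random agent that picks each value with equal probability; the function names are made up for illustration. Exact fractions are used so that 2/3 is not rounded.

```python
from fractions import Fraction

SCALE = (0, 1, 2)  # coded answers of the example question


def expected_random_distance(real, scale=SCALE):
    """Expected |answer - real| for an agent choosing uniformly from scale."""
    return Fraction(sum(abs(answer - real) for answer in scale), len(scale))


def inefficiency(generated, real, scale=SCALE):
    """Distance of the generated answer divided by the random-agent baseline."""
    return Fraction(abs(generated - real)) / expected_random_distance(real, scale)


print(expected_random_distance(1))  # 2/3
print(inefficiency(2, 1))           # 3/2, i.e. 1.5
print(expected_random_distance(2))  # 1
print(inefficiency(1, 2))           # 1
```

On this scale the baseline is 2/3 when the true answer is the middle value and 1 when it is either extreme, which is why the same distance of 1 yields different inefficiencies.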


Actually, there is no need to compute the inefficiency for every individual response: we can compute the average distance between generated and true respondents over all 2000 people in the sample (for field k), then compute the average distance for the 2000 random agents, and construct the inefficiency ratio (for field k) from these averages. It would be lovely if this ratio were lower than 1: this would indicate that, on average, the respondents generated by ChatGPT are doing better than random agents for field k. Repeating this exercise for every field of the form F, we would be able to evaluate the performance of generated respondents for these fields and get a good picture of their overall performance (e.g., we could compile statistics of the inefficiency coefficients for the various fields and combine them into an overall inefficiency coefficient).
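
A minimal sketch of the sample-level ratio for one field might look like this. Two simplifications are assumptions of the sketch rather than part of the proposal: every real respondent is assumed to have answered the field, and the random-agent average is replaced by its exact expected value (uniform choice over the scale) instead of drawing 2000 random agents, so the result is deterministic. The four-person sample is invented for illustration.

```python
from fractions import Fraction


def aggregate_inefficiency(generated, real, scale):
    """Mean generated-vs-real distance divided by the mean expected
    distance of a uniformly random agent, for a single field."""
    n = len(real)
    mean_generated = Fraction(sum(abs(g - r) for g, r in zip(generated, real)), n)
    mean_random = sum(
        Fraction(sum(abs(answer - r) for answer in scale), len(scale)) for r in real
    ) / n
    return mean_generated / mean_random


# Invented answers of four respondents to one field on the 0/1/2 scale.
real_answers = [1, 2, 0, 1]
generated_answers = [1, 2, 1, 2]

print(aggregate_inefficiency(generated_answers, real_answers, (0, 1, 2)))  # 3/5
```

Here the ratio is below 1, so on this toy sample the generated respondents would beat the random baseline for the field.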


