For instance, MemGPT uses an ‘LLM judge’, which is instructed to evaluate whether the generated response is consistent with the gold response, on the Multi-Session Chat (MSC) dataset: https://parl.ai/projects/msc/.
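To make the idea concrete, here is a rough sketch of what an LLM-judge check could look like. This is not MemGPT's actual code or prompt: the prompt wording, the function names (`judge_consistency`, `accuracy`) and the `ask_llm` callable are all made up for illustration, and `toy_judge` is just a substring check standing in for a real model call.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge prompt; the wording is an assumption, not MemGPT's prompt.
JUDGE_PROMPT = (
    "You are grading a chat assistant.\n"
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Generated answer: {generated}\n"
    "Is the generated answer consistent with the gold answer? "
    "Reply with exactly CORRECT or WRONG."
)


def judge_consistency(
    question: str,
    gold: str,
    generated: str,
    ask_llm: Callable[[str], str],
) -> bool:
    """Ask the judge whether `generated` is consistent with `gold`."""
    prompt = JUDGE_PROMPT.format(question=question, gold=gold, generated=generated)
    verdict = ask_llm(prompt).strip().upper()
    return verdict.startswith("CORRECT")


def accuracy(
    examples: Iterable[Tuple[str, str, str]],
    ask_llm: Callable[[str], str],
) -> float:
    """Fraction of (question, gold, generated) triples the judge accepts."""
    results = [judge_consistency(q, g, r, ask_llm) for q, g, r in examples]
    return sum(results) / len(results) if results else 0.0


def toy_judge(prompt: str) -> str:
    """Stand-in for a real LLM call: accepts if the gold text appears in the answer."""
    fields = dict(
        line.split(": ", 1) for line in prompt.splitlines() if ": " in line
    )
    gold = fields["Gold answer"].lower()
    generated = fields["Generated answer"].lower()
    return "CORRECT" if gold in generated else "WRONG"


if __name__ == "__main__":
    examples = [
        ("Where did I say I live?", "Lisbon", "You told me you live in Lisbon."),
        ("What is my dog's name?", "Biscuit", "I don't think you mentioned a dog."),
    ]
    print(accuracy(examples, toy_judge))
```

In practice you would swap `toy_judge` for a function that sends the prompt to whatever model you use as the judge and returns its text reply.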
Anything else?
submitted by /u/lorepieri