Are there any good benchmarks for chatbot memory capabilities?

For instance, MemGPT uses an "LLM judge": an LLM instructed to decide whether the generated response is consistent with the gold response. The conversations come from the Multi-Session Chat (MSC) dataset: https://parl.ai/projects/msc/.
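
To make the LLM-judge idea concrete, here is a rough sketch of how such a check could be wired up. This is not MemGPT's actual code: the prompt wording, the function names, and the `stub_judge` placeholder (a crude substring match standing in for a real LLM call) are all assumptions for illustration.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical prompt template; the real MemGPT judge prompt may differ.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Gold answer: {gold}\n"
    "Generated answer: {generated}\n"
    "Is the generated answer consistent with the gold answer? "
    "Reply with exactly CORRECT or WRONG."
)


def build_judge_prompt(question: str, gold: str, generated: str) -> str:
    """Fill the judge template for one (question, gold, generated) triple."""
    return JUDGE_PROMPT.format(question=question, gold=gold, generated=generated)


def parse_verdict(reply: str) -> bool:
    """Treat any reply starting with CORRECT (case-insensitive) as a pass."""
    return reply.strip().upper().startswith("CORRECT")


def judge_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    judge: Callable[[str], str],
) -> float:
    """Fraction of examples the judge marks as consistent with the gold answer."""
    verdicts = [
        parse_verdict(judge(build_judge_prompt(q, gold, gen)))
        for q, gold, gen in examples
    ]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def stub_judge(prompt: str) -> str:
    """Placeholder for an LLM call: passes if the gold answer appears verbatim
    in the generated answer. Swap in a real model client here."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    gold = fields["Gold answer"].lower()
    generated = fields["Generated answer"].lower()
    return "CORRECT" if gold in generated else "WRONG"
```

Calling `judge_accuracy(examples, stub_judge)` on a list of `(question, gold, generated)` triples returns a score between 0 and 1. Passing the judge in as a plain callable keeps the scoring loop independent of whichever model ends up doing the judging.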

Anything else?

submitted by /u/lorepieri