Curious how the performance compares to a standard Llama 8B on benchmarks - interpretability usually comes with a quality tax.
SHAP basically does point-by-point ablation across all possible subsets, which really doesn't make sense for LLMs. It is simultaneously too specific and too general.
It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than those in earlier layers, and the ways they compose can completely confuse SHAP.
It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis), so you need some way of splitting a single parameter into interpretable parts, and SHAP has no way of doing that.
I don't know the specifics of what this particular model's approach is.
But SHAP unfortunately does not work for LLMs at all.
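To make the "ablation across all possible subsets" point concrete, here is a minimal sketch of exact Shapley values on a toy model. This is not the `shap` library's API (which relies on approximations), and `shapley_values` and `toy_model` are hypothetical names made up for illustration. The toy model has a two-unit "circuit" (units 0 and 1 only matter together) plus one unit that acts alone.

```python
from itertools import combinations
from math import factorial


def shapley_values(n, value):
    """Exact Shapley values for n players, given value(frozenset) -> float."""
    players = range(n)
    phi = [0.0] * n
    for i in players:
        others = [p for p in players if p != i]
        # Enumerate every subset of the other players: 2**(n-1) per player.
        for k in range(len(others) + 1):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of player i when added to subset s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(active):
    # Hypothetical toy: units 0 and 1 contribute only jointly (a "circuit"),
    # unit 2 contributes on its own.
    return 1.0 * (0 in active and 1 in active) + 0.5 * (2 in active)


phi = shapley_values(3, toy_model)
print(phi)
```

Under these assumptions all three units end up with roughly the same attribution (about 0.5 each), so the per-unit numbers alone don't tell you that units 0 and 1 only do anything as a pair. The enumeration also grows as 2^(n-1) subsets per unit, which is why the exact version is hopeless at LLM scale.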
Given the example I saw about CRISPR, what does this model give me over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from Arxiv or Wikipedia?
I find that LLM outputs are subtly wrong, not obviously wrong.
The example shows how much of the reason for an answer is due to data from Wikipedia. Can it drill down to the paragraph or sentence level to show what influenced the answer?
I believe that the plagiarism complaint about LLMs comes from the assumption that there is a one-to-one relationship between training data and answers. I think the real and delightfully messier situation is that there is a many-to-one relationship.
We'll see.