167 points by amichail 13 days ago | 6 comments
cs702 13 days ago
The two issues I see with this work:

* There's no mention of model performance on recall tasks. Models with attention do well on recall tasks, but models without it, like this one, tend to do poorly.[a] What is the performance of this model on recall tasks?

* As others here have pointed out, the GitHub link is dead. In addition, the pretrained weights don't seem to be available on HF or anywhere else. It's hard to check any claims if we have neither code nor weights!

---

[a] https://arxiv.org/abs/2402.01032

phh 13 days ago
> * There's no mention of model performance on recall tasks. Models with attention do well on recall tasks, but models without it, like this one, tend to do poorly.[a] What is the performance of this model on recall tasks?

For what it's worth, the website for RWKV (another "transformer-less" model) says on that matter that yes, it's bad on recall, but for the vast majority of tasks you can just ask the question *before* the content, and it'll handle the task just fine. (I'm just reporting; I haven't spent time trying it myself.)
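
Roughly, the ordering trick as I understand it (an illustrative sketch only; the strings are made up and this isn't from RWKV's docs verbatim):

    # For a recurrent model, putting the question first lets the state
    # "know what to look for" while it reads the document; question-last
    # is the usual transformer-style layout.
    question = "What colour was the lost umbrella?"
    document = "...long document text..."

    question_first = f"Q: {question}\n\n{document}\n\nA:"
    question_last = f"{document}\n\nQ: {question}\nA:"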

a2128 13 days ago
I thought the recommendation for long contexts with RWKV was to put the question after the content, otherwise it can forget the question
ronsor 13 days ago
This is no longer the case for RWKV-5/6
refulgentis 13 days ago
Section 4.3* addresses this and runs 3 benchmarks. tl;dr: the 7B roundly beats LLaMA 2 7B and almost matches LLaMA 2 7B-L, which got an extra 500K of training tokens specifically at long context length.

* Section 4.3 in both the paper you linked and the paper we're commenting on. Why did I notice that? It took me several minutes to understand "how the paper changed": it was 2 papers all along, and I had just switched tabs without realizing. And it's only Tuesday.

cs702 13 days ago
Those benchmarks are about long context. The question isn't about long context; it's about recall, i.e. the ability to fetch and repeat parts of the input context. There's nothing about recall in section 4.3.
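
For concreteness, a toy recall probe of the kind I mean (made-up example, not from either paper):

    # Associative-recall style probe: the model must fetch and repeat a value
    # it saw earlier in its own input context.
    pairs = {"apple": "7", "river": "3", "stone": "9"}
    context = " ".join(f"{k} {v}" for k, v in pairs.items())
    prompt = f"{context} river ->"  # a model with good recall should answer "3"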
refulgentis 13 days ago
Do I have the wrong paper open again? :)

See "Long-Context QA tasks in Scrolls", I don't want to copy and paste the whole thing, I'll elide the words in between: "...long-context open-book question answering (QA), we use a simple prompt {CONTEXT} Q: {QUESTION} A"

n.b. It's literally the same eval in both papers :) I know, it's buried and non-obvious; it has taken me 20 minutes so far today to double-check it 3x, even after reading it yesterday.

euclaise 13 days ago
This one does have attention; it's just chunked into segments of 4096.
cs702 12 days ago
Yes, but the claim is about "unlimited context length." I doubt attention over each segment can be as good at recall as attention over the full input context.
qwertox 13 days ago
I was just chatting with ChatGPT about unlimited context length, and even if you could theoretically achieve a personal assistant this way, one which would know all your chat history, an unlimited context length doesn't seem efficient enough.

It would make more sense to create a new context every day and integrate it into the model at night. Or, every day, a new context aggregating the last several days. That gives it time to sleep on it every day, and it can use that knowledge the next day without it needing to be passed in via the context again.

zingelshuher 13 days ago
If we could keep unlimited memory but use only a selected, relevant subset in each chat session, that should help. Of course the key is 'selected'; that's another big problem, like short-term memory. We could probably build summaries from different perspectives during idle or 'sleep' time. Training knowledge into the model is very expensive and can only be done from time to time, so it's better to add only the most important or most-used fragments. It's likely impossible to do on a mobile robot, a sort of 'thin agent'. If done on a supercomputer, we could aggregate new knowledge collected by all the agents and then push the new model back to them. All of this is an engineering approach of sorts.
maxma1987 12 days ago
We are sorry that we temporarily closed the repo because we were unfamiliar with the code release policy from Meta. We had to re-organize a small part of the code.

Now the repo has been re-opened at https://github.com/XuezheMax/megalodon

The model checkpoint is still under Meta legal review. We will release it once we get approval.

YetAnotherNick 13 days ago
This model has attention; the sequence is just broken into chunks of length 4096 and attention is only applied within each chunk. Llama 2 was trained on chunks of length 4096, so this model has the same quadratic complexity for any sequence that fits within Llama 2's context size.
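
A rough sketch of what that chunk-wise attention means (my reading of the setup, not the authors' code; causal masking and multiple heads omitted):

    import torch

    def chunked_attention(q, k, v, chunk_size=4096):
        # q, k, v: (batch, seq_len, dim); assumes seq_len is a multiple of chunk_size.
        b, n, d = q.shape
        c = n // chunk_size
        q, k, v = (t.view(b, c, chunk_size, d) for t in (q, k, v))
        # Full quadratic attention, but only inside each chunk, so the cost is
        # O(num_chunks * chunk_size^2) rather than O(seq_len^2).
        scores = torch.einsum("bcqd,bckd->bcqk", q, k) / d ** 0.5
        out = torch.einsum("bcqk,bckd->bcqd", scores.softmax(dim=-1), v)
        return out.reshape(b, n, d)

Within a single 4096-token chunk this is exactly standard attention, which is why the complexity matches Llama 2 for sequences that fit in one chunk.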
patrickhogan1 13 days ago
Show me a working request/response that does better than the state of the art.
zer00eyz 13 days ago
1. Open paper

2. Find GitHub

3. Read Source

https://github.com/XuezheMax/megalodon (dead link)

I have stopped reading the papers; I only care about working code that can be used. arXiv LLM papers have reached the level of academic masturbation in general.

Fix the bad link and then we have something to talk about.

abathur 13 days ago
bitvoid 13 days ago
I think that is a fork from before the actual repo was made private.

XuezheMax's GitHub profile shows that his contributions in April were all to private repositories.

abathur 13 days ago
Seems like a reasonable reading.

It's only got 3 commits, and 2 are from 18 hours ago (one of which fiddled the repo-made-public date and added the arXiv link), so perhaps there's a private working repo and they decided to make a ~clean public repo at the last minute (i.e., maybe the authors are self-conscious about the commit log).

elvircrn 13 days ago
Perhaps it was made private as part of a conference submission process?
amitport 13 days ago
Unlikely. Having code somewhere is OK, you just don't refer to it in the submitted version.
maleldil 12 days ago
If you link to a private repository, your GitHub profile can still be seen, which violates anonymity.
dartos 13 days ago
I wonder what happened
lumost 13 days ago
This happened to WizardLM2 yesterday as well: https://wizardlm.github.io/WizardLM2/
viksit 13 days ago
They released an announcement on this.

> We are sorry for that.

> It’s been a while since we’ve released a model months ago, so we’re unfamiliar with the new release process now: We accidentally missed an item required in the model release process - toxicity testing.

> We are currently completing this test quickly and then will re-release our model as soon as possible.

https://x.com/wizardlm_ai/status/1780101465950105775?s=46

kristjansson 13 days ago
very happy my first step with any model with big claims is 'huggingface-cli download ...'
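
For anyone who'd rather do it from Python, the rough equivalent (the repo id below is a placeholder, since no checkpoint has been published yet):

    from huggingface_hub import snapshot_download

    # Downloads the whole model repo into the local HF cache and returns its path.
    # "some-org/megalodon-7b" is hypothetical; no official checkpoint exists yet.
    local_dir = snapshot_download(repo_id="some-org/megalodon-7b")
    print(local_dir)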
zer00eyz 13 days ago
Well that's interesting.

It took a bunch of detective work for what should have just been a NOTE in the README of the repo.

Is this why "science communicators" are needed?

https://xkcd.com/1254/ <<< very relevant.

marci 12 days ago
> https://xkcd.com/1254/ <<< very relevant.

This made me laugh so hard I hate it. I hate that it feels just like something I would do/have done.

padthai 12 days ago
It is up again; as for why it was down: https://news.ycombinator.com/item?id=40061362