Bamba: An open-source LLM that crosses a transformer with an SSM (research.ibm.com)

207 points by shallow-mind 142 days ago | 69 comments

adt 142 days ago [-]

https://lifearchitect.ai/models-table/

Love those GPQA scores hovering around 5% when chance (on 4-way multi-choice) would have got them 25%!

montebicyclelo 142 days ago [-]

So could do better than chance by excluding the option it's picked?
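A rough sketch of the arithmetic behind that idea. It assumes the ~5% figure is a true accuracy on 4-way multiple choice and that, when the model is wrong, the three unpicked options are equally likely to be correct. Both are assumptions, and the function name is made up for illustration:

```python
def accuracy_after_excluding_pick(reported_accuracy: float, num_options: int = 4) -> float:
    """Expected accuracy of guessing uniformly among the options the model did NOT pick."""
    # The model's pick is wrong with probability (1 - reported_accuracy). In that
    # case the correct answer is one of the remaining (num_options - 1) options,
    # assumed here to be equally likely.
    return (1.0 - reported_accuracy) / (num_options - 1)


chance = 1 / 4
inverted = accuracy_after_excluding_pick(0.05)
print(f"chance={chance:.3f}, after excluding the pick={inverted:.3f}")
```

Under those assumptions, excluding the picked answer lands at about 32%, a little above chance. If the 5% instead comes from a scoring or prompting mistake, this says nothing about the model.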

gryfft 142 days ago [-]

A stopped clock is right twice a day, but a running clock set to the wrong time is always wrong.

cwt137 142 days ago [-]

Not always true! Your statement is only true when the running clock's speed is the same as time. Thus, regular time and the clock's time will never meet.

If the clock is running faster than regular time, it will at some point catch up to regular time and thus be correct for a split second. If the clock is slower than regular time, regular time will catch up to the clock and the clock will be right for a split second.

actionfromafar 142 days ago [-]

If we are being pedantic, running clocks never run exactly the same as time. So they'll be right (very) much more seldom than the stopped clock, which is right twice a day.

nathan_douglas 141 days ago [-]

If the clock is running backwards at very high speed, it would be right infinitely many times but the proportion of the time that it is right would approach some finite constant.

k__ 141 days ago [-]

My girlfriend's microwave-clock runs faster than normal.

Somehow this thing manages to accumulate an error of ~15 minutes in a month.

patapong 141 days ago [-]

And we haven't even touched on the issue of 24-hour format digital clocks, which can at most be right once per day if stopped!

parrit 142 days ago [-]

The RMS of wrongness of the running clock is probably lower.

nthingtohide 141 days ago [-]

> a running clock set to the wrong time is always wrong.

Could be right within 15 min accuracy in the appropriate timezone. And such a mechanism can be corrected for in the postprocessing step.

dudeinhawaii 141 days ago [-]

or.. A stopped clock is right twice a day; a mis-prompted LLM is wrong 19 times out of 20—but only because we handed it the wrong instruction sheet.

Procedural error in testing perhaps? I'm not familiar with the methodology for GPQA.

mh- 142 days ago [-]

SSM = state-space model, for the unfamiliar.

https://en.wikipedia.org/wiki/State-space_representation

jwilber 142 days ago [-]

LLM/state space models have been popular for some years now, see: https://arxiv.org/abs/2212.14052

More recently, hybrid architectures that utilize attention plus other operators are gaining traction.

See https://arxiv.org/abs/2503.01868

mentalgear 142 days ago [-]

> chose to make just about everything associated with Bamba open-source — the training recipes, the data, the data loader IBM designed for largescale distributed training, and a quantization framework aimed at shaving storage and inferencing costs.

cubefox 142 days ago [-]

Another recent transformer/SSM hybrid is "M1", with a more than 3x claimed inference speed-up compared to equivalent transformers: https://arxiv.org/pdf/2504.10449

IBM is claiming at least a 2x inference speed-up with Bamba. Both groups say that future SSM optimizations to vLLM would lead to further inference speed improvement.

bushbaba 141 days ago [-]

Wonder if the name is inspired by my favorite snack, bamba. The best are the hazelnut bamba.

Btw bamba if given to kids at a young age can drastically reduce the chance of peanut allergies

flaviolivolsi 141 days ago [-]

Bamba means cocaine in Italian. Better not to give it to kids

ericol 141 days ago [-]

Well, have you ever heard of the Mitsubishi Pajero? [1]

[1] https://en.wikipedia.org/wiki/Mitsubishi_Pajero

visarga 141 days ago [-]

Let me show you the etymology of Bamba:

SSM (state space model) -> SSSM (structured state space model) -> (it's like a snake ssss...) Mamba -> Bamba

zaptrem 141 days ago [-]

Where does the B come from?

cubefox 141 days ago [-]

Bamba is a traditional Mexican dance. An earlier MAMBA based SSM was called "SAMBA", a Brazilian dance I believe.

anentropic 141 days ago [-]

> they added another trillion tokens and shrank the model from 18 GB to 9 GB through quantization, reducing its bit width from Mamba2’s 16-bit floating-point precision to 8-bits.

This sounds like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.

I thought generally we were naming models "nB" by their number of params and treating quantisation as a separate concern. Are there any other models that instead treat the name as an indicative memory requirement?

Is this an attempt to hide that it fares poorly vs other ~18B parameter models?

EDIT: no, I just misunderstood

cubefox 141 days ago [-]

> This sounds like what they call "Bamba-9B" is actually an 18B model quantised to 8 bits.

No it doesn't? The fact that it is 18 GB with 16 bit per parameter before quantization means that it is a 9B parameter model.
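The size arithmetic can be checked in a few lines. This is only a sketch: it assumes decimal gigabytes and ignores any non-weight data in the checkpoint, and the helper name is made up for illustration:

```python
GB = 1e9  # assumption: decimal gigabytes, as file sizes are usually quoted


def param_count(size_bytes: float, bits_per_param: int) -> float:
    """Number of parameters that fit in size_bytes at the given bit width."""
    return size_bytes * 8 / bits_per_param


before = param_count(18 * GB, 16)  # 16-bit weights, before quantization
after = param_count(9 * GB, 8)     # 8-bit weights, after quantization
print(f"before: {before / 1e9:.1f}B params, after: {after / 1e9:.1f}B params")
```

Both come out to roughly 9B parameters: quantization halves the bytes per parameter, not the parameter count.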

anentropic 141 days ago [-]

Ah thanks, I see where I got confused now.

tmalsburg2 141 days ago [-]

Yeah, that's confusing, but the HuggingFace page says it has 9.78 B parameters.

https://huggingface.co/ibm-ai-platform/Bamba-9B-fp8

jmward01 142 days ago [-]

This type of architecture is definitely the future. Unlimited attn is a dead end. As a human you don't need to scan an entire book just to guess what the next word will be and LLMs shouldn't need that either.

og_kalu 141 days ago [-]

Humans can re-attend to material whenever necessary (i.e., you can just re-read a book, re-watch a documentary, etc. when you feel you have missed crucial context), so it's not the end of the world. These SSMs or modern RNNs can't, and if crucial context has been discarded by the end of the query then, well, too bad. Transformers are of course always re-attending, so it's not an issue for them either. Until that issue is resolved, I don't think attention will be going anywhere.

imtringued 140 days ago [-]

As you said. Transformers are using linear attention for each token. It's just that n times n is quadratic. There is no way around this other than by adding a separate token that indicates rerunning the SSM from the beginning. Then you have a dynamically scaling system that seamlessly switches between linear and quadratic complexity depending on the problem.

MLA is probably the closest thing that is in-between both.
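To make the "linear per token, quadratic overall" point concrete, here is a small counting sketch. It assumes plain causal attention with no caching tricks or sparsity, and it counts only token-to-token interactions, not real FLOPs. The function names are made up for illustration:

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token pairs scored by causal attention over a sequence of n_tokens."""
    # Token t attends to itself and the t - 1 tokens before it: linear per token,
    # but summed over the sequence this is n * (n + 1) / 2, i.e. quadratic.
    return sum(t for t in range(1, n_tokens + 1))


def recurrent_update_count(n_tokens: int) -> int:
    """State updates made by a recurrent/SSM-style layer: one per token."""
    return n_tokens


for n in (1_000, 10_000):
    print(n, attention_pair_count(n), recurrent_update_count(n))
```

Growing the sequence 10x grows the attention count roughly 100x and the recurrent count 10x.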

quantadev 142 days ago [-]

Not to be contrarian, but if the next word prediction happens to be someone's name or a place or something discussed in multiple places in the book, then often, yes, a knowledge of the full plot of the book is "required" just to predict the next word, as you get to the middle or end of a book.

For example you could never fill in the last chapter of any good book without having knowledge of every previous chapter. Not highly detailed knowledge, but still knowledge.

parrit 142 days ago [-]

What an LLM does is stuff it all into short term memory. Humans dump the first pages into long term memory and "make sense" of it. Humans have a massive context window because of this (and sheer brain size and efficiency).

boroboro4 141 days ago [-]

We don’t put things into long term memory right after we read them. We usually put them there after a night of sleep. I personally think that context (and the kv cache correspondingly) in the models is akin to our short term memory, while the training process (and the actual weights) is akin to our long term memory. And we can’t be sure our short term memory doesn’t work by matching the current context against what is currently stored in short term memory. From this perspective transformers are enough and just fine.

parrit 141 days ago [-]

So if you now hide my original comment and try to recall what I said, do you know it word for word (and are you thinking of every word, e.g. did I use one or 2 spaces somewhere, as that would change tokens) or do you have a rough concept of what I said?

OTOH if you had to remember a phone number to write it down, how does that differ?

boroboro4 141 days ago [-]

I think in a way it makes transformers superior to humans, their short term memory is much more powerful =) Supporting extra long contexts also makes transformers superhuman. Because, again, human short term memory is exactly this: short term. And much shorter than the millions of tokens we expect from models nowadays.

As for SSMs - I think they compress model memory state way too much. Mixed global/local attention layers do just as well. And sparse/block attention seems like a way forward much more (https://arxiv.org/abs/2502.11089).

littlestymaar 141 days ago [-]

> And much shorter than millions of tokens we expect from models nowadays.

Yet all current models still suck above 32k. (Yes, some can do needle in a haystack fine, but they still fail at anything even slightly more complex over a long context.)

32k is still much higher than humans' though, so I agree with you that it gives them some kind of super human abilities over moderately long context, but they are still disappointingly bad over longer context.

boroboro4 141 days ago [-]

Out of curiosity I estimated per day context size (of text only!) by multiplying reading speed by number of minutes: 16 * 60 * 300 = 288000 words ~ 288000 tokens.
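The same back-of-the-envelope estimate as a sketch, with the inputs exposed so they can be varied. The 16 waking hours and 300 words per minute come from the figures above. The one-token-per-word ratio is the comment's simplification, and real tokenizers often produce somewhat more than one token per English word. The function name is made up for illustration:

```python
def daily_reading_tokens(waking_hours: float = 16,
                         words_per_minute: float = 300,
                         tokens_per_word: float = 1.0) -> float:
    """Upper-bound estimate of text 'context' read in one day of nonstop reading."""
    minutes = waking_hours * 60
    words = minutes * words_per_minute
    return words * tokens_per_word


print(daily_reading_tokens())                     # one token per word
print(daily_reading_tokens(tokens_per_word=1.3))  # assumed ratio, for comparison
```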

tmalsburg2 141 days ago [-]

Isn't this exactly the point of this model? No need to memorize everything (which makes transformers expensive), just keep the relevant info. SSMs are essentially recurrent models.

og_kalu 141 days ago [-]

You can't always know what will be "relevant info" in the future. Even humans can't do this but whenever that's an issue, we just go back and re-read, re-watch etc.

None of these modern recurrent architectures have a way to do this.

tmalsburg2 141 days ago [-]

How often do you go back and rewatch earlier parts of a movie? I hardly ever do this. In the cinema, theater, or when listening to the radio it’s simply impossible and it still works.

og_kalu 141 days ago [-]

You are mentioning avenues that are largely for entertainment. Sure, you might not go back to re-attend for those. If you will be tested or are doing research, are you really looking at a large source only once?

tmalsburg2 137 days ago [-]

It’s so easy to come up with serious non-entertainment examples, I’m sure you don’t need my help finding them.

roger_ 142 days ago [-]

Never got how mamba models work in multiple dimensions and non-causally.

joshjob42 142 days ago [-]

For some reason this link isn't loading, but it's on https://archive.ph/Ks0xt

OldSystemsFart 140 days ago [-]

Bamba in Italian slang is cocaine, just to tell you

aantix 142 days ago [-]

Where's the code?

beklein 142 days ago [-]

I could find these two resources:

Hugging Face: https://huggingface.co/collections/ibm-ai-platform/bamba-674...

GitHub: https://github.com/foundation-model-stack/bamba

gitroom 141 days ago [-]

the name bamba is killing me lol, all i can see is the snack now

antirez 142 days ago [-]

Dear IBM name pickers: "Bamba", in Italian, means cocaine.

alex7o 142 days ago [-]

It's just a mamba (https://github.com/state-spaces/mamba) but with a transformer. Idk where the B comes from.

_davide_ 142 days ago [-]

When I read the title 'IBM crossed a transformer with an SSM and got ‘Bamba’' I laughed so hard I woke up my kid

iddan 142 days ago [-]

And in Hebrew it's the name of a snack made of peanut-butter-flavored puffed maize https://en.wikipedia.org/wiki/Bamba_(snack)

kridsdale1 142 days ago [-]

I imported these to America to feed my infant. Data shows the prevalence of peanut allergies lines up with when AAP guidelines started recommending that babies do NOT eat peanut. Israel never went along with this and thus has the lowest rates of allergies in the world.

arijun 142 days ago [-]

I think the difference in allergy rates between UK and Israeli Ashkenazi Jews (10x higher in UK Jews!) [1] is strong evidence for that.

Also, they sell Bamba at Trader Joe’s now.

[1] https://www.jacionline.org/article/S0091-6749(08)01698-9/ful...

cycomanic 142 days ago [-]

Latest research does strongly suggest that introducing small amounts of common allergens (peanuts, shellfish, milk products...) as early as possible does significantly reduce risk for allergies later. Many early childhood organisations already recommend this. Official health recommendations are often slow to catch up (often for good reasons), but introducing peanuts etc. early is already officially recommended in quite a few countries (Australia, NZ, Sweden for example, AFAIK). Not all health professionals are always up to date either though.

itayd 141 days ago [-]

You actually don't need to self import these. Usually Safeway (is it only a west coast thing?) always have these stocked in the Kosher section.

bonzini 142 days ago [-]

As an Italian who has tried (only) the Israeli Bamba, I can certify that it is pretty addictive.

amitport 142 days ago [-]

Maybe?

https://en.m.wikipedia.org/wiki/Bamba_(snack)

;)

akovaski 142 days ago [-]

https://en.wikipedia.org/wiki/La_Bamba_(song)

dantastic 142 days ago [-]

Or (where I'm from) a school cafeteria:

https://www.thelocal.se/20221125/swedish-word-of-the-day-bam...

ofrzeta 142 days ago [-]

Spot on. From the linked blog post "The refrain of La Bamba, the Mexican folk song that Ritchie Valens made famous, goes: Para bailar La Bamba/Se necesita una poca de Gracia. "

rdtsc 142 days ago [-]

So someone can get fired for picking IBM after all! Or get a bonus, depending on the organization...

fb03 142 days ago [-]

and in Portuguese, it means "flimsy". What a great name.

folgoris 142 days ago [-]

A very funny and friendly way to say "cocaine" among Italians. I'm struggling to read it seriously.

rzzzt 142 days ago [-]

Para bailar La Bamba / Se necesita una poca de gracia

dismalaf 142 days ago [-]

Seems like a good fit.

vienzo 142 days ago [-]

And in Lithuanian it's a navel

lenerdenator 142 days ago [-]

about time they did something to liven things up at big blue

francasso 142 days ago [-]

SSMs never stop

beanjuiceII 142 days ago [-]

i mean that sounds good to me

samanator 142 days ago [-]

Yummy
