I once had this kind of body recovery/stress level measuring thingy on me for a few days, and a doctor would then analyze my health and such. I was under some stress those days and (according to the measurements) I wasn't recovering properly even during the nights. But then there was this one long, flat, deep-green curve in the middle of my work day. I checked my VCS to see what I was doing during that period: I was optimizing.
I've since noticed this many times. Optimizing is like meditation to me. It's very mechanical (measure), with a sprinkle of creative work (once you know what is slow, it's quite obvious how to make it faster, but just challenging enough to be engaging), and it has a very nice tight feedback loop: Something is slow. I make a change. Now it's fast. Next.
Optimizing is my happy place.
devnull3 5 hours ago [-]
I think optimization triggers fundamental instincts in humans:
1. Tracking, i.e. navigating through the jungle when hunting, or finding where the bottleneck is.
2. The thrill of the hunt. The joy after finding the hotspot or the bottleneck. It makes your day.
3. The actual hunt, i.e. shooting an arrow/spear, or applying a clever fix.
4. Bragging about the hunt, i.e. writing a blog post or giving a tech talk on how difficult and awesome the hunt was.
Also optimizing a critical path has multiple takers in the org:
1. Product Managers are happy due to improved user experience.
2. Management is happy as in some cases it actually saves a lot of money.
3. Your boss is happy because he/she gets to score points as "team achievement" when evaluation is around the corner.
4. Engineers get to do nerdy things which otherwise would not find traction in the org.
All in all it's a win-win-win-* situation.
AaronAPU 17 hours ago [-]
I spent 10 years straight doing C++ and assembly optimization. My work is still fun these days but that was probably the most enjoyable work of my career in terms of the actual day to day coding.
Code cleanup in general is the same for me, but it’s really hard to justify putting much time into that when running your own company solo.
jasonthorsness 17 hours ago [-]
What tools did you use to assess the results of your changes?
AaronAPU 16 hours ago [-]
The routines were individually benchmarked using some custom tools (iterate repeatedly and use statistical analysis to converge on an estimate). Always compared against a plain C reference implementation.
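Not the poster's actual tooling, just a minimal sketch of that general approach (names and iteration counts invented): time the candidate and the plain C reference many times, keep a noise-resistant statistic such as best-of-N, and compare.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    // Run f() repeatedly and keep the minimum wall-clock time; for short,
    // deterministic routines the minimum is a simple, fairly robust statistic.
    template <typename F>
    double best_of(F&& f, int iters = 10000) {
        double best = 1e300;
        for (int i = 0; i < iters; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            f();
            auto t1 = std::chrono::steady_clock::now();
            best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
        }
        return best;
    }

    int main() {
        auto reference = [] { /* plain C reference implementation goes here */ };
        auto optimized = [] { /* candidate SIMD/asm routine goes here */ };
        std::printf("speedup: %.2fx\n", best_of(reference) / best_of(optimized));
    }

A real harness would also pin the thread, feed both routines identical real inputs, verify the outputs match, and look at the spread of samples rather than a single number.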
Then there was a system for benchmarking the software as a whole on a wide variety of architectures, including NUMA. With lots of plots and statistics.
Usually you’d eventually end up at a point where the improvements are below the noise floor or they help on some systems and cause regression on others. The rule was usually “no regressions”
VTune for multithreading optimization. Built a fibers and lockfree system for efficient scheduling.
ecshafer 11 hours ago [-]
That sounds like a pretty good set up with a lot of investment, HFT shop?
phanimahesh 22 minutes ago [-]
Might explain why Factorio is so popular. Not just Factorio but the entire genre of games it spawned.
singhrac 5 hours ago [-]
There’s nothing quite like writing the final squashed commit message and saying “results should be bit-level identical, +25% speedup on average” or the equivalent. Even better if it’s a hot path that matters to the bottom line of the company.
I try to keep a few of these tasks lying around marked “unblocked” in case I need to unwind. It keeps me from doing it right after noticing, which is an extra bonus.
YZF 6 hours ago [-]
I also love optimization work. There can be a lot of creativity when it comes to reinventing how something is done. I've seen an algorithm sped up 10x after a ton of optimization with some insane SIMD techniques.
It seems there's less appreciation for this kind of work these days.
MawKKe 3 hours ago [-]
Was the need for optimization work something that was on your mind the whole time? Like a nagging feeling in the back of the head that the code is not working as well as it _should_ be working? i.e. stress.
And when you finally "fixed" the issue through optimization, your mind allowed itself to let go of these open loops (borrowing the GTD terminology), leading to relaxation?
I also love optimising, especially low level code.
01HNNWZ0MV43FF 19 hours ago [-]
I remember one fun time between jobs when I stayed up till 4 or 5 am optimizing something. It always felt like I was making progress and about to beat the original implementation.
Unfortunately I had to give up when I was still 10 times slower than the reference lol
senderista 18 hours ago [-]
Same here, last time I was between jobs I optimized my defunct startup's database from ~50K TPS to nearly 5M TPS (no durability, if you're wondering), and that was unbelievably rewarding.
zahlman 13 hours ago [-]
> Optimizing is my happy place.
Interesting. For me, it's refactoring.
optymizer 16 hours ago [-]
I see you.
jmull 20 hours ago [-]
> I dislike the “intuition doesn’t work, profile your code” mantra because it seemingly says profiling is a viable replacement for theoretical calculations, which it isn’t.
This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
Measuring/profiling just means observing the system you want to optimize in a systematic way. You certainly won't be very effective at optimizing anything if you don't observe it.
Theoretical calculations mean you've formed a model of what's happening and you're devising a plan to optimize against that model. But of course, a useful model needs to represent the significant aspects of your system (and a good model should exclude most insignificant ones). Failing to observe your system means your model could be bad -- focused on insignificant aspects and missing significant ones -- and you'd never know.
jandrewrogers 19 hours ago [-]
I think many people have seen both sides of this in practice. I've seen engineers follow the profiler into a dead-end because they see nothing that stands out in the profiler or they don't grok the essential nature of the code or system they are profiling. I've seen engineers with deep domain expertise consistently make accurate estimates of how a code change will impact performance without ever using a profiler, because their mental model of the code execution maps to reality with high fidelity.
Profilers fill in gaps in our mental model of code execution but they are not a substitute for it. Computer behavior is largely knowable from first principles, albeit requiring a considerable degree of expertise and detail. For some people in some contexts, the profiler adds little information to what they already understand. I know a few people that do almost all of their optimization work using pencil and paper with great effectiveness and precision. Not something I would recommend for most software engineers but not unreasonable at the limit either.
infogulch 18 hours ago [-]
Optimize according to your mental model; profile to check that your mental model matches reality.
If you optimize infrequently, or haven't profiled code like this recently, or haven't profiled this specific codebase recently, then your mental model is probably due a refresh.
YZF 6 hours ago [-]
In a complex system it isn't always easy to reason about the behavior. Even if you fully understand the CPU, the compiler, the kernel, the disks, etc., the way they interact in a real application isn't easy to hold in your head or to deal with using theoretical tools.
I've spent a lot of time optimizing video codecs and while it is maybe possible to reason about these without a profiler it's much faster to profile. The profiler isn't going to tell you what you need to do though.
kaptainscarlet 17 hours ago [-]
A lot of the optimisation I do largely consists of a series of highly educated guesses, and the resulting fixes are right in 99% of cases.
purplesyringa 18 hours ago [-]
You're right, I could've phrased that better.
Profiling to find suboptimal code is perfectly fine. Then you need to figure out how to fix it. Many people don't understand how performance optimization works, so they blindly add caching, improve constant time by invoking more low-level methods, etc. This obviously doesn't work, yet intuitively (to those people, anyway) it should produce good results.
That's why the mantra exists: don't trust your intuition, don't believe it when it says these changes improve performance, instead measure that performance and only apply changes that work. This is also perfectly fine, but this is a double-edged sword, and I've seen people go too far in this direction.
For example, they refuse to make any changes that don't immediately improve performance according to the profiler. If they modify one computation and performance decreases, they abandon this path altogether. They treat optimization as a game with a dense fog of war, refusing to apply deductive reasoning (and, of course, intuition) to changes that, according to the profiler at least, are not immediately rewarding.
mnahkies 16 hours ago [-]
I think there's a related problem where profiling/measurements can be made poorly and not reflect the real world.
Eg: indexing or partitioning a database table may appear to make things slower if you don't have both a representative amount of data and representative query patterns when you're measuring the change.
You should still measure your changes, but sometimes you need to be careful about measuring them in the right way, and possibly simulating a future context (eg: more scale) before drawing a conclusion.
Intuition about how the context will evolve, and what effect that might have on the tradeoffs of different approaches, is helpful.
Capricorn2481 18 hours ago [-]
Sounds like a tricky balancing act. There are things that are extremely difficult to "game out." CPUs are very complicated. There are optimizations that seem like they could be cache friendly in theory, but aren't in practice.
necovek 17 hours ago [-]
"Intuitively" literally means without having to learn something.
Adding caches or switching to lower-level calls is definitely something learned, and I wouldn't call it "intuitive".
What I think you are referring to is that sometimes, simply reading and understanding the code can tell you where the problem really is — still, my experience is that you want to measure the before and after to at least identify the general area you should be looking at, and more often than not, I could figure out what needs optimizing and how without having to get detailed profiles.
I did end up being surprised a few times, but it was mostly due to external code like buggy library implementations that didn't do the things they promised (eg. async library really synchronizing everything with a badly placed mutex).
At the same time, it's wrong to focus on a single case or a single profile (executing one code path in one set of circumstances), but what you are describing simply sounds like bad engineering — the fact that you can misuse the tool does not make the tool bad, it's still the engineer wielding it who's at fault.
scratcheee 16 hours ago [-]
And yet their statement makes perfect sense to me.
Caching and lower level calls are generic solutions that work everywhere, but are also generally the last and worst way to optimise (thus why they need such careful analysis since they so often have the opposite effect).
Better is to optimise the algorithms, where actual profiling is a lesser factor. Not a zero factor of course, as a rule of thumb it’s probably still wise to test your improvements, but if you manage to delete an n^2 loop then you really don’t need a profiler to tell you that you’ve made things better.
necovek 8 hours ago [-]
Which is exactly what I said: I am only arguing against their use of "intuitively".
titzer 20 hours ago [-]
No amount of measuring and squeezing--not even years of it--is a substitute for high-level thinking. And vice versa.
Imagine:
    function F() {
      for (i = 0; i < 10; i++) {
        A();
        B();
        C();
      }
    }
If we profile this code, we might find out, e.g. B takes the majority of the time--let's say 90%. So you spend hours, days, weeks, making B 2X faster. Great. Now you removed 45% of execution time. But the loop in the outer function F is just a few instructions, it is not "hot"--it won't show up in profiles except for ones that capture stacks.
If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed.
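A sketch of that situation (names hypothetical): nothing inside the loop depends on i, so the big win is deleting the repetition, not shaving cycles off B.

    void A() { /* expensive */ }
    void B() { /* the 90% hotspot */ }
    void C() { /* expensive */ }

    void F_before() {
        for (int i = 0; i < 10; i++) {
            A(); B(); C();   // the same results recomputed 10 times
        }
    }

    void F_after() {
        A(); B(); C();       // 10x less work than any micro-optimization of B
    }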
There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.
> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.
hinkley 18 hours ago [-]
Small functions need special attention not just because they show up as leaf nodes everywhere but also because they are difficult for profilers to account properly. You get two functions listed as each taking 4% of CPU time, and one could easily be taking up twice as much compute as the other. The sort of memory pressure that small functions can generate can end up scapegoating a big function that uses a large fraction of memory because it gets stuck with cold cache or lots of GC pressure from the piddly functions fouling the nest.
One of my best examples of this, I had a function reported as still taking 10% of cumulative run time, after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times and this function was getting called twice as much as it should. I found two sibling methods asking the same question and rearranged them so the answer came as an argument and nixed the duplicate call. I reran the benchmark and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
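Roughly the shape of that fix, with hypothetical names: instead of two siblings each asking the same question, answer it once in the caller and pass it down.

    int expensive_answer() { return 42; }        // the call that showed up twice

    void sibling_a(int answer) { /* uses answer instead of recomputing it */ }
    void sibling_b(int answer) { /* ditto */ }

    void parent() {
        const int answer = expensive_answer();   // computed once, not once per sibling
        sibling_a(answer);
        sibling_b(answer);
    }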
The worst memory pressure I ever fixed I saw a 10x improvement by removing one duplicate call. Now, there was a quadratic part of that call but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.
dmurray 19 hours ago [-]
> it won't show up in profiles except for ones that capture stacks
I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.
cogman10 14 hours ago [-]
The default usage of perf does this. There's also a few profilers I know of that will show the functions taking the most time.
IMO, those are (generally) nowhere near as useful as a flame/icicle graph.
Not saying they are never useful; Sometimes people do really dumb things in 1 function. However, the actual performance bottleneck often lives at least a few levels up the stack.
YZF 6 hours ago [-]
Which is why the defaults for perf always drive me crazy. You want to see the entire call tree with the cumulative and exclusive time spent in all the functions.
saagarjha 5 hours ago [-]
I’m honestly curious why the defaults are the way they are. I have basically never found them to be what I want. Surely the perf people aren’t doing something completely different than I am?
AtNightWeCode 18 hours ago [-]
Agree with this, but it's not what I concluded from the OP. Architectural decisions made at the start are where most optimization should happen. I remember from school some kids did this super-optimized loop and the teacher said: do you really have to do that same calculation on every iteration?
But in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things do not matter until they do. So measuring is the way to go, I believe.
hinkley 18 hours ago [-]
Measuring is also useless once someone has introduced bottom up caching.
There’s so much noise at that point that even people who would usually catch problems start to miss them.
The usual response to this is, "well, you can turn caching off to do profiling," but that's incorrect, because once people know they can get a value from the cache they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? Now you find it's being called ten times.
And the usual response to that is, "well, it's free now so who cares?" Except it's not free. Every cache miss now either costs you multiple times over or requires much more complex cache bookkeeping, which is more overhead, and every hit resets the MRU data on that entry, making it more likely that other elements get evicted.
For instance in NodeJS concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that’s not actually reachable. I’ve had to fix that a few times. Hundreds of megabytes in one case because it kept an entire request handler in scope.
hinkley 19 hours ago [-]
I think the confusion sneaks in because "measuring" is something you do several times and each iteration has a different connotation.
Once you “find” a candidate change you measure it to see if what you did made things worse and you put it back if it did, or maybe you try it in combination with other changes to see if its value is complementary.
But people fuck up all the time reading the initial telemetry, which is often where I come in. I get tired of hearing people say, “we’ve done all we can, look how flat this chart is,” and hand someone my beer. You won’t find all of the good candidates in the percentage of run time list. That’s not all the data that’s there, and not every change that works needs to even be supported by the initial data. It only needs to be supported by the delta afterward.
klysm 20 hours ago [-]
I think what he's saying here is you can't skip the basic math step to arrive at good performance. Staring at profiling results will lead you to a local minimum.
cogman10 20 hours ago [-]
Profiling doesn't mean you don't do the math. Profiling is simply there to show you that "hey, this is where the problems actually are".
You do profiling because it's WAY too easy to get obsessed about theoretical problems when a simple measurement will show you the actual problems.
You do the math on the actual problem location, not a method with O(n!) which only gets called with n=3.
You still have to look at the entire call stack when profiling (which means thinking about the overarching algorithm).
9rx 19 hours ago [-]
Well, he's saying that intuition does work... But does it really?
If a problem area is so intuitively obvious, why would you introduce the problem in the first place? In reality, performance optimizations are usually needed where you least expect them. Which means that you can't get there intuitively. Hence, the suggestion of using profiling to help track down where the problem is instead.
mjmahone17 19 hours ago [-]
Most code I’ve seen is written in an environment where people can’t afford to “intuit”: thinking through the repercussions of specific language or algorithm choices comes second to solving a problem “well enough”. Building an intuition takes learning how you’ve built something poorly, and for a specific problem can take hours to gain context and walk yourself through many paths.
Bring in a performance expert and their intuition can quickly identify many performance improvements. The tough part for the business is when and where do you pay for the experience of performant code, and when is good enough to leave alone?
mystified5016 19 hours ago [-]
Do you want to claim you've never written quick and ugly code to get something working to come back and fix it up later?
Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.
If you're fretting about optimization at every stage of development, you're really doing it wrong. This is precisely the type of early optimization that everyone and their dog will tell you is bad. You should focus first on making your program work and laying out the logic and data flows. Optimization does not matter until the program is almost finished.
cogman10 14 hours ago [-]
> Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.
Most of the times I've seen this, the faster algorithm is literally just a dictionary with a well-defined key.
I honestly do not understand why that's not the first solution most devs think of and why n^2 seems to dominate.
As an example, I've seen code like this quite a bit
    result = []
    for (item: items) {
      for (item2: items) {
        if (item != item2 && item.foo == item2.foo) {
          result.add(item)
        }
      }
    }
Easily replaced and massively easier to read as
    result = []
    itemFoo = HashSet()
    for (item: items) {
      if (itemFoo.contains(item.foo))
        result.add(item)
      else
        itemFoo.add(item.foo)
    }
The mindset of re-looking for values in a collection you are already iterating through is just foreign to me. The first solution for something like this in my mind is always utilizing dictionaries.
YZF 6 hours ago [-]
For many people performance optimization really starts with your second version.
I'm not even sure what this is supposed to do exactly but the O(N^2) is just bad/sloppy code. Optimization is usually more about getting the right algorithms to execute faster. At least that's how I think about it.
9rx 17 hours ago [-]
Purposefully introducing underperforming code as a tradeoff to meet some other goal (faster delivery, for example) is something else entirely. It doesn't require intuition or profiling – at most requiring only memory of the choice. A fun aside, but not related to what we're talking about.
tough 19 hours ago [-]
One could even say optimization does not matter until the program is already running.
patrick451 17 hours ago [-]
This only works if you don't need real time performance. For anything that does need to run in realtime, you have to be damn sure that whatever your initial approach is can be optimized to run at the required sample rate. Otherwise, you will be ripping out the entire design and starting over. You might as well optimize first.
bqmjjx0kac 19 hours ago [-]
FYI, the author is a woman.
absolutelastone 19 hours ago [-]
I think the person you are responding to is criticizing the original "mantra" quote not the author's criticism of it(?)
jmull 20 hours ago [-]
Yeah, that's why I think it's nonsense.
Measuring doesn't mean don't think. Measuring and thinking are two different things. You need to do them both to optimize effectively.
Joel_Mckay 20 hours ago [-]
Or optimizing a scheduling noop loop dozens of times given local encapsulation obscures global functional context.
The fact remains most projects that do small trivial modular prototypes first will ultimately know which paths are viable before painting themselves into a corner algorithmically.
Best of luck =3
tmoertel 19 hours ago [-]
> > I dislike the “intuition doesn’t work, profile your code” mantra because it seemingly says profiling is a viable replacement for theoretical calculations, which it isn’t.
> This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
I believe the mantra “intuition doesn’t work, profile your code” is understood to mean “don't rely on your intuition alone; profile your work.” It's a warning that when it comes to performance optimization, intuition is often unreliable, so you need to supplement your mental model with actual data if you want to avoid big mistakes.
I don't know why the author of the blog post is interpreting the mantra literally, as if it claims intuition is obsoleted by profiling.
kccqzy 19 hours ago [-]
I think theoretical calculations could mean a detailed calculation of the number of times a certain function or operation is called. We all know big-O time complexity from tech interviews, but it is usually very hand-wavy and not precise enough. You can usually come up with a more precise formula, though it can get messy with recurrences if your algorithm involves recursion. You probably need a computer algebra system. Spending an afternoon doing these symbolic calculations might give you better intuition of why the profiling produced such a measurement.
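As a toy example of that kind of calculation (mine, not from the thread): a divide-and-conquer routine that does n units of work per level satisfies

    T(n) = 2*T(n/2) + n,   T(1) = 0
    =>  T(n) = n * log2(n)   (for n a power of two)

which gives an exact operation count to compare against what the profiler reports, rather than just "O(n log n)".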
fabian2k 19 hours ago [-]
While profiling and measuring is very important if you want to optimize performance, there are a lot of things you can do without any profiling. In many situations the consequences of each approach are well known or easily reasoned about. Most of the time it's simply "do less work" = "more performance", or avoiding obvious and well-known patterns like N+1 database queries.
But it's also very easy to mislead yourself that way; many "optimizations" might do much less than you think. So you should avoid implementing more complex or harder-to-understand code just because you think it is faster, but otherwise I'd certainly try to write faster code by default in areas I know well enough to judge that.
AtNightWeCode 19 hours ago [-]
Agree. A classic example is compilers that let you choose between optimizing for speed or binary size. But a smaller sized binary is sometimes faster.
tough 18 hours ago [-]
Why not both?
touisteur 18 hours ago [-]
One way to improve performance is to unroll loops and inline code. Unfortunately this increases code size and puts pressure on the instruction cache, making a program sometimes slower. It's probably a lot harder to balance these out in the compiler than to just... sometimes try.
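For illustration (not from the comment above): a manually 4x-unrolled sum trades larger code for fewer loop branches and independent accumulators; whether it actually wins depends on the surrounding instruction-cache pressure.

    // Assumes n is a multiple of 4; the four accumulators also break the
    // serial dependency chain, so the additions can overlap in the pipeline.
    long long sum_unrolled4(const int* a, long long n) {
        long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (long long i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }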
AtNightWeCode 18 hours ago [-]
Impossible. The speed option may do things like loop unrolling, function inlining, and today even far more complicated things than that, and therefore creates larger binaries.
somethingsome 2 hours ago [-]
I teach my students to make a tree of the promising optimizations and sub-optimizations (it can be rough), keep track of each interesting optimization, then implement the branch that seems the most promising, and at each step use the profiler heavily: memory bound, compute bound, latency bound, number of flops, cache accesses, etc.
After narrowing down the most promising leaf, think about another branch and whether some of the optimizations already studied could lead to nice results there too.
With time they learn beforehand which branches are more promising and which optimizations work well for their problem.
I guess this could be called branch and bound with memoization instead of brute force aha.
Note: we write code in CUDA.
hansvm 20 hours ago [-]
> too many cases to analyze
One of my early-career successes was just creating a framework for generating every permutation of perf optimizations for every (log-scaled -- clz is very fast) input size and checking which was best, dropping the results into a lookup table of function pointers to branch on. The university had a large supply of heterogeneous computers, replete with all the normal problems like being able to double floating-point addition throughput on Haswell CPUs by abusing the fmadd instruction, so I made a framework (probably closer to a DSL) for encoding your algorithms in a way that you could analyze perf tradeoffs at compile time and tune your result for the given computer. It's kind of like what ATLAS does for some linear algebra tasks.
Such practices are almost never optimal, but they're pretty easy to implement, and the results are near-optimal for almost all inputs. In the tradeoff between human and computer performance, I think it's a nice option.
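A minimal sketch of that dispatch idea (all names hypothetical, and the real framework did far more): bucket inputs by log2 of their size and jump through a table of per-bucket winners chosen by offline benchmarking on the target machine.

    #include <array>
    #include <bit>       // std::bit_width (C++20), essentially 64 - clz
    #include <cstddef>
    #include <cstdint>

    using Kernel = uint64_t (*)(const uint64_t*, size_t);

    // Placeholders: in the real setup each bucket points at a differently
    // tuned variant (scalar, unrolled, SIMD, ...).
    uint64_t sum_small(const uint64_t* p, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }
    uint64_t sum_medium(const uint64_t* p, size_t n) { return sum_small(p, n); }
    uint64_t sum_large(const uint64_t* p, size_t n) { return sum_small(p, n); }

    // One slot per log2-sized bucket, filled from benchmark results.
    const std::array<Kernel, 65> kDispatch = [] {
        std::array<Kernel, 65> t{};
        for (int b = 0; b <= 64; ++b)
            t[b] = b < 10 ? sum_small : b < 20 ? sum_medium : sum_large;
        return t;
    }();

    uint64_t sum(const uint64_t* p, size_t n) {
        return kDispatch[std::bit_width(n)](p, n);   // log-scaled bucket lookup
    }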
ttd 21 hours ago [-]
Good article that I agree with mostly. One interesting note is this:
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
This is true, but can be seen as a failure of language and tooling. For example, Halide [1] pioneered (AFAIK) the concept of separating algorithm from implementation at the language level. This separation lets you express the algorithm once, and then "schedule" it by specifying parallelism, vectorization, etc. You can provide multiple schedules for one algorithm, which allows you to specialize / make different choices depending on varying factors.
It's a really interesting concept, though maybe limited in practice to DSLs. I'm not sure a general purpose language would be a good fit for this model, but then, for general purpose programs written in general purpose languages, perf optimization at the level TFA discusses is frequently limited to just specific hot sections. Those hot sections could be extracted out into specialized components written in such a DSL.
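For readers who haven't seen it, the separation looks roughly like this (reconstructed from memory of the canonical blur example in the Halide docs, so treat the details as a sketch rather than exact API): the algorithm is stated once, and only the schedule underneath it changes when retuning for a different target.

    #include "Halide.h"
    using namespace Halide;

    void build_blur(ImageParam input) {
        Func blur_x("blur_x"), blur_y("blur_y");
        Var x("x"), y("y"), xi("xi"), yi("yi");

        // The algorithm: a separable 3x3 box blur, written once.
        blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
        blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

        // One possible schedule; another machine gets different tiling,
        // vectorization and parallelism without touching the algorithm above.
        blur_y.tile(x, y, xi, yi, 256, 32)
              .vectorize(xi, 8)
              .parallel(y);
        blur_x.compute_at(blur_y, x)
              .vectorize(x, 8);
    }

Here `input` would be declared elsewhere, e.g. something like `ImageParam input(UInt(16), 2)`.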
> Halide [1] pioneered (AFAIK) the concept of separating algorithm from implementation at the language level.
you don't need to go all the way to Halide to do what the article is claiming isn't possible - you can do it just by including a "micro-kernel" in your library and having the code branch to that impl (depending on something at runtime) instead of whatever the C code compiled down to. this is done every single day in every single GPU lib (famously cublas ships with hundreds/thousands of such ukernels for gemms depending on shapes).
purplesyringa 18 hours ago [-]
I was going for something different: I don't want to choose a different implementation in runtime, I want the compiler to see through my code and apply constant propagation -- not just for constant inputs, but inputs with known properties, like `n < 1000` or `(n & 7) == 0`. I want it to also learn facts about the output values, e.g. that my `isqrt(n) -> m` function always returns `m` such that `m^2 <= n`. None of this is possible with runtime selection because runtime selection was never the point.
Validark 15 hours ago [-]
A lot of people have ways of accomplishing this, but my way is using compile-time execution in Zig (I know at least D, C++, and Terra have their own versions of this feature). You can specify a parameter as `comptime` and then do different things based on whatever conditions you want. You can also execute a lot of code at compile-time, including your sqrt check.
E.g. I wrote a `pextComptime` function, which will compile to just a `pext` instruction on machines that have a fast implementation, otherwise it will try to figure out if it can use a few clever tricks to emit just a couple of instructions, but if those aren't applicable it will fallback on a naïve technique.
Your suggestions introduce, in effect, a hypothetical `if` statement, only one branch of which is taken. I can change the condition arbitrarily, but ultimately it's still going to be either one or the other.
I want the `if` to take both branches at once. I want the compiler to assume that both branches trigger the exact same side effects and return the same results. I want it to try both approaches and determine the better one depending on the environment, e.g. the number of free registers, (lack of) inlining, facts statically known about the input -- all those things that you can't write a condition for on the source level.
Think about it this way. A standard compiler like LLVM contains passes which rewrite the program in order. If something has been rewritten, it will never be rolled back, except if another pass performs a separate rewrite that explicitly does that. In contrast, e-graphs-based compilers like Cranelift maintain an equivalence graph that represents all possible lowerings, and after the whole graph is built, an algorithm finds a single optimal lowering.
Existing solutions make me choose immediately without knowing all the context. The solution I'd like to see would delay the choice until lowering.
alion02 3 hours ago [-]
I recently ran down approximately the same rabbit hole when trying to figure out what to do about x86 treating addition and bitwise OR differently. There's https://llvm.org/docs/LangRef.html#id171, but it can't generally be synthesized in Rust. So I went on a short-lived quest:
Oh, yes, I understand now. I've thought to myself before it would be nice if I could have implementation 1 go into variable x. And implementation 2 go into variable y. Then I do `assert(x == y)` and a compiler like Cranelift should know it only needs to pick one of them.
I'm glad to know that's the design of Cranelift, since that's how I would think it should be done, although I haven't written a middle or backend for a compiler yet.
thrtythreeforty 14 hours ago [-]
> e-graphs-based compilers like Cranelift maintain an equivalence graph that represents all possible lowerings, and after the whole graph is built, an algorithm finds a single optimal lowering
Do you have a good entrypoint reference for learning about how this works? This (and the associated mention in the article) is the first time I've heard of this approach.
panic 11 hours ago [-]
Another cool thing you could do is fuzz test a version of the code that actually does take both branches (in separate runs with fresh memory etc.) and aborts if they give different results.
almostgotcaught 12 hours ago [-]
> I want the `if` to take both branches at once.
this is called symbolic execution - as i have already told you, many compilers do this in the form of sccp and scev. you don't need silly things like egraphs for this.
> If something has been rewritten, it will never be rolled back
this is patently false - not only do passes get rolled back to catch/correct errors but there is a fixed-point iteration system (at least in MLIR) that will apply passes as long as they are "profitable".
purplesyringa 3 hours ago [-]
I don't see how symbolic execution is relevant to this. Yes, symbolic execution does "check both paths" for some definition of "check", but ultimately I as a programmer still need to write a condition, and that condition is on the source level, so it can't access information on e.g. register pressure, which is what I would like to comptime-branch on.
> not only do passes get rolled back to catch/correct errors but there is a fixed-point iteration system (at least in MLIR) that will apply passes as long as they are "profitable"
Huh? I'm not arguing against fixed-point iteration, that's perfectly fine because it's still a unidirectional process. What do you mean by "passes get rolled back to catch/correct errors" though? Certain rewrites can certainly be not performed in the first place if they pessimize the code, but that's not what I'm talking about.
If there's pass 1 that chooses between rewrite A and B, pass 2 that chooses between rewrite C or D, and pass 3 choosing between E or F, in a typical compiler these choices would be made one by one, mostly greedily. An e-graph style approach allows all of those eight combinations to be tried out without necessarily leading to a combinatorial explosion.
almostgotcaught 15 hours ago [-]
i have no idea what that has to do with what op quoted from your article:
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
this is manifestly obviously possible (as i've said).
what you're talking about is something completely different that goes by many names and uses many techniques (symbex, conex, sccp, scev, blah blah blah). many of these things are implemented in eg LLVM.
ttd 19 hours ago [-]
Ah ok, I see what you mean (and likely the sibling comment too w.r.t. the gcc feature). Yes that is a fair point - though it still has the substantial downside of maintaining many different implementations of any given algorithm.
This is useful but not equivalent. Using this type of tooling you still have to write the algorithm itself in N versions. Changing the algorithm then requires changing all N implementations. This contrasts with the Halide approach where the algorithm is written once, and then schedules can be modified without worrying that you are changing the algorithm itself.
willvarfar 21 hours ago [-]
I think it is worth making a distinction between "micro" (what the blogpost is about) and "macro", or "tactical" and "strategic", optimisations.
Strategic optimisation is often basically free if you have domain expertise. It's that easy to know that the business wants outcome x and that algorithm y is the right choice, etc., if it's all internal thought processes. Whereas if you don't know enough then you're likely to make very expensive-to-undo decisions.
jandrewrogers 21 hours ago [-]
I often refer to those as architectural optimizations. Even some of these tend to be sensitive to the details of the operating environment.
jmward01 20 hours ago [-]
I have almost always found that simple code runs faster than complex code. I think this is because optimization is likely an NP problem and like all NP problems, the best algorithm we have for solving it is divide and conquer. The core thing about D&C is that you divide until you reach a level that you can actually find the optimum answer within the resources given but accept that by dividing the problem you will likely have some high level inefficiencies. This means simple code, code you understand all pieces of, can actually be locally optimum.
When I see people try to optimize code I often see them fall into that trap of optimizing too large/complex a problem which leads to them not actually being able to find a true local optimum and, often, making far slower code than had they just tried for much simpler code. This likely NP behavior runs rampant in software development where we often think we can design some elaborate process to design things and when things fail it was because people failed to follow the process and not because the problem was NP.
We all love building machines which is why we likely do this, but unless you know something the rest of the world doesn't then D&C, and admitting you are only looking for a local optimum, is the only algorithm we have for attacking NP problems. (Shameless plug here for smaller, far more autonomous teams instead of monolithic dev shops with one big shared feature list)
YZF 6 hours ago [-]
If you look at some SIMD accelerated algorithms they're anything but simple. They involve clever ways of breaking up the work into streams and clever ways of computing those streams including conditionals without breaking the data flow.
They are very much not the simplest possible implementation of the algorithm and they run an order of magnitude faster for some algorithms.
purplesyringa 3 hours ago [-]
I do still see their point. SIMD is very rarely applicable, so you could argue that SIMD can only be applied to simple enough code. The idea is: write stupid code because it'll be easier to optimize on a micro level, e.g. with SIMD.
IsTom 2 hours ago [-]
> I think this is because optimization is likely an NP
Optimization of computer programs is a fundamentally uncomputable affair due to Rice's theorem – in the general case you can't check if two programs are equivalent.
tmoertel 18 hours ago [-]
I think that there is a different reason that an emphasis on simple code often results in faster systems. When you write simple code, you spend less time writing code. Therefore, you have more time left to invest in optimizing the very small subset of your overall system that actually matters. You didn't burn engineering resources for speed where it didn't matter.
hinkley 18 hours ago [-]
Simplifying complex code often exposes optimization opportunities. Kernighan’s Law applies to performance as well as debugging.
tmoertel 18 hours ago [-]
Indeed!
hinkley 17 hours ago [-]
If you can’t make it faster, make it dumber.
jmward01 18 hours ago [-]
I'd argue that is a big part of the point I am making. If you take too big of a bite, the time it takes to build it optimally goes up in an NP manner. If the bites are the right size then it balances the time/resources you have compared to all the other bites you make, to get a locally optimal answer given all resource constraints. Long story short, cutting a problem into manageable pieces is a solid strategy. I will add one thing though, and that is that most people think they have cut things into manageable pieces but in reality they have left them too intertwined and they aren't really independent pieces. For divide and conquer to actually work, the pieces must have clearly defined, and very limited, communication.
munificent 14 hours ago [-]
> When you write simple code, you spend less time writing code.
I have found that many engineers write complex code faster than simple code.
You're given requirements like: "the program should do W when the user does A, X when the user does B, Y when the user does C, and Z when the user does D." And a naive programmer will happily trot off and write a pile of code for each of those cases, often with a lot of redundancy between them.
It takes more time and judgement to analyze those cases, see what they have in common, and distill a simpler underlying model for the behavior that encompasses all of the requirements.
n4r9 2 hours ago [-]
Definitely true. "Give me a week, and I can implement this 1000-LOC feature. Give me a month and I can do it in 100 LOC."
compiler-guy 18 hours ago [-]
From the article:
"Even Apple’s LLVM fork lacks scheduling annotations for Apple Silicon. How am I supposed to write efficient code when Apple doesn’t bother to tune their own compiler?"
In addition to its public fork of LLVM, Apple also keeps a very private fork where (one assumes) they keep all the really juicy stuff.
I agree that it is frustrating not to have the data, but don't confuse "I don't have the data" with "no one has the data".
Validark 15 hours ago [-]
"Register pressure is even worse because that is only a problem because of the ISA, not the microarchitecture."
I'm not so sure. How many cycles would you expect this code to take?
According to Agner Fog, these 3 instructions have a latency of 15 cycles on an AMD Zen 1 processor. On Zen 2, the latency is 2 cycles. This is because the CPU was given the ability to assign a register to `dword [rsi]`, overcoming the limit of 16 registers.
This optimization is subject to problems: obviously pointer aliasing will sometimes cause the CPU to make the wrong assumption, creating a situation not entirely unlike a branch mispredict.
There are constraints imposed by the micro-architecture for this feature. For you and I, a big one is it only works with general purpose registers. But is there a reason it couldn't or shouldn't be done for vectors? It seems like a micro-arch issue to me. Perhaps in a few years or in a lot of years, we'll have a CPU that can do this optimization for vectors.
purplesyringa 15 hours ago [-]
I learn something new every day! Thanks for mentioning this. For other readers: Agner Fog documents this in 22.18 Mirroring memory operands.
I've known that similar optimizations exist, namely store-to-load forwarding, but I didn't know that AMD has experimented with mapping in-flight writes straight into the register file. Sounds like they've abandoned this approach, though, and Zen 3 doesn't feature this, supposedly because it's expensive to implement. So for all intents and purposes, it doesn't exist anymore, and it probably won't be brought back in the same fashion.
I do still think this is something better solved by ISA changes. Doing this on the uarch level will either be flaky or more costly. It is absolutely possible, but only with tradeoffs that may not be acceptable. The APX extension doubles the number of GPRs and improves orthogonality, so there's at least work in that direction on the ISA level, and I think that's what we're realistically going to use soon.
hinkley 19 hours ago [-]
It’s like dieting. Everyone wants to hear your one weird trick and zones out when you tell them it’s hard work. Yes, mechanical sympathy is a thing but usually pulling off a series of wins isn’t luck or particularly bad work by your peers, it’s being methodical and persistent. It’s about being stubborn as fuck, and not taking no for an answer.
karmakaze 19 hours ago [-]
I agree with many of the comments here, including some of the ones that are in disagreement. The difference is context: in most situations the starting point is far from optimal and any of N better choices is a good improvement.
That doesn't seem to be what this post is talking about. It seems to be talking about well-worn areas, trying to improve the state of the art. An example that illustrates it for me is DeepMind's AlphaTensor finding a better way to multiply matrices[0] in 2022. It wasn't a brute-force solution, but the scale of it makes it appear so.
> On 4x4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5x5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4x4 algorithm, and separately tweaked Deepmind's 96-step 5x5 algorithm down to 95 steps in mod 2 arithmetic and to 97[24] in normal arithmetic.[25] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
This to me shows that direct profiling and observation wouldn't have led to the optimization. Improvements needed a sort-of but not actually brute-force effort of many people trying, but also being clever with their attempts.
An interesting aspect is data dependencies. If your next statement reuses data you just computed, that can cause pipeline bubbles, as that result you want to use just isn't available yet. I dived into that topic for a video about relative performance of old PCs I just published today.
jandrewrogers 21 hours ago [-]
Yes, there is non-obvious structure in some algorithms solely for the purpose of turning a single logical stream of dependent instructions into multiple concurrent streams of dependent instructions running through the same pipeline. The caveat of doing this, of course, is that it typically increases register pressure.
PaulKeeble 21 hours ago [-]
When it comes to micro-optimisations, the issue is partly that our usual tools of algorithm analysis and our hardware intuition are very far apart. Random memory access is very slow compared to linear access, some branches are considerably worse than others, and it's actually quite hard to predict how much you can improve the low-level details before you start, especially for big changes. Our tools can show us where time is being lost to inefficiencies but can't help us predict how changes will improve things.
rurban 1 hours ago [-]
perf? I value -fopt-info more. The optimizer does crazy things all the time, and this way you find out what exactly is optimized, de-optimized, or needs work.
mannyv 20 hours ago [-]
Optimization is hard work because you need to understand how things work to be effective.
It's really just debugging and troubleshooting, but with a different goal in mind.
pavelstoev 8 hours ago [-]
Optimizing AI performance is like peeling an onion — every time you remove one bottleneck, another layer appears underneath. What looks like a compute problem turns out to be a memory bottleneck, which then turns out to be a scheduling issue, which reveals a parallelism mismatch… and so on.
It’s a process of continuous uncovering, and unless you have visibility across the whole stack — from kernel to cluster — you’ll spend all your time slicing through surface layers with lots of tears being shed.
Fortunately, there are software automation solutions to this.
saagarjha 5 hours ago [-]
They’re not very good, unfortunately.
enescakir 21 hours ago [-]
The hardest bugs are the ones that only show up after you “optimize.”
hinkley 18 hours ago [-]
The ones that only happen when you remove the debug statements are super fun.
raluk 21 hours ago [-]
For low level benchmarking papi https://icl.utk.edu/papi/ is great tool. It provides access for counters like cache misses, branch misses and number of cycles.
loeg 19 hours ago [-]
This is like, tip of the spear stuff, and that's very cool. But in most of the software world, I would argue, performance optimization is knocking out extremely low-hanging, obvious fruit -- it's relatively obvious what's slow, and why, and just picking any of N better approaches is good enough to eliminate the problem.
MoonGhost 15 hours ago [-]
In big projects it's not that obvious. I've seen people optimizing applications on their desktops when those applications are intended for servers. It's nice for progress reporting though.
interactivecode 19 hours ago [-]
While actually measuring and profiling is great, in my experience most webapps can fix so many low-hanging performance issues by shaping the API to match how it's used in the client. It removes so much mapping and combining of data all over.
inetknght 17 hours ago [-]
Performance optimization isn't a brute-force task. It just (currently) requires a lot of skill, and it's hindered by terrible documentation; performance in software is only a brute-force task because 99% of software doesn't tell you what its performance impacts are.
In C++, you can achieve performance using the often-denigrated standard template library, if you would only pay attention to the documented performance requirements for given function calls. But even there it's not the panacea that it could be, because it often provides an amortized cost while handwaving away the access pattern (for example: std::unordered_map is great in algorithmic cost theory and terrible in memory access patterns).
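To make the std::unordered_map point concrete (illustrative only, not a benchmark): both of these membership tests look fine on paper, but the second probes contiguous memory while the first chases a pointer per bucket node, and on real hardware that difference often dominates.

    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    bool in_hash(const std::unordered_map<uint64_t, int>& m, uint64_t key) {
        return m.find(key) != m.end();   // amortized O(1), node-based storage
    }

    bool in_sorted(const std::vector<uint64_t>& v, uint64_t key) {
        return std::binary_search(v.begin(), v.end(), key);   // O(log n), contiguous
    }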
What's the algorithmic cost (big-O) of this function call? What's the memory footprint of that function call? Worse: is that documented big-O estimated across a contiguous dataset, discontiguous paged datasets, or does it have to dereference pointers? What's the network cost of a given call? It's hard to know if you don't know that `write()` to a socket could incur 40 or 400 bytes across a network, and don't know whether that network is 1mbps, 10mbps, 1gbit, or localhost, etc; and with how much latency.
For example, when I hand-rolled some x86_64 SIMD instructions to analyze some DNA, I found the Intel Intrinsics Guide [0] 1000% helpful because many of the instructions detailed what sort of performance to expect on specific processor architectures (or, at least, in general). If you read the Intel Instruction Set Reference [1], a lot of similar performance information can be found. The performance I achieved was approximately the theoretical bottleneck between CPU and main RAM; getting better performance would have required a complete algorithm change which would have been Ph.D paper publishing worthy.
Of course, sometimes even low level hardware can have performance-killing bugs.
I don't think we're in disagreement. You have to consider big-O cost, memory footprint, exact numbers, know what performance to expect from various abstractions, etc. -- and then you need to choose between multiple alternatives. The first half of this process is absolutely skill-based, but I'm arguing that when you're trying to push performance to its limit, the second half unavoidably becomes expensive and brute-forcy.
For example: do you compress data sent over the network? What level of compression do you use? Changing the data format can affect the optimal compression level, and vice versa; using a higher compression level means you can keep the underlying data simpler. For example, you can replace deserialization with zero-cost casts. But that might mean you spend more memory. Do you have that much memory? If you do, would giving that memory to the database for use as cache be better? And so on.
The individual choices are simple, but they compound and affect the overall performance in unpredictable ways. The only way to be sure you aren't missing something obvious is to check all, or at least most combinations.
inetknght 12 hours ago [-]
> I don't think we're in disagreement.
Very likely we agree on many points but perhaps disagree at the outcome :)
So to expand & reply...
> do you compression data sent over the network?
Depends on size of data, complexity of data, and performance of your system, and performance of the receiving system.
The choice you make might be correct for one set of variables but there might be a better choice for a different set of variables. If the receiving system is overloaded, then decompression might add to that. If the sending system is overloaded, then compressing might add to that.
But compression over the network is often a moot point since most networks should be encrypted whereas compression can be used to defeat encryption (see compression-oracle-attacks such as BREACH [0] and CRIME [1]).
> using a higher compression level means you can keep the underlying data simpler
Why? Moreover, I would never rely on the data compressing well as a reason to keep the underlying data simpler (but perhaps larger).
> would giving that memory to the database for use as cache be better?
As in, shared-memory? Or do you mean to implement your own database?
> The individual choices are simple, but they compound and affect the overall performance in unpredictable ways.
I disagree about unpredictability. I've rarely found some effect to be unpredictable. Generally when something is unpredictable it actually represented something I didn't understand.
> The only way to be sure you aren't missing something obvious is to check all, or at least most combinations.
Would you check all combinations of inputs for a given continuous function? No, of course not, that would be effectively impossible.
The engineer should identify the edge cases and the corner cases and ensure that the product works within acceptable behavior for the systems being targeted. There's a saying: "you only care about performance of the things you benchmark, and be careful what you benchmark". So if you don't have any performance benchmarks then you don't care about performance. If you do have performance benchmarks, then use them in the systems you intend to deploy to.
Ultimately, it takes a seasoned engineer to understand a whole system and especially to understand what a given change might incur in terms of performance to that whole system. If a seasoned engineer can do it then AI is just around the corner to do it automatically. That's not a brute-force task. We just don't have a whole lot of seasoned engineers; but we do have a ton of engineers with deep domain-specific knowledge though, and those domain-specific knowledge engineers often would brute-force some aspect they don't understand. They can usually fix it in a patch, after all.
Performance is a trade-off between various aspects of various systems. That doesn't make it a brute-force task. The trade-off decisions can be made with the right data. The right data can be found either: via brute force with benchmarks, or with proper documentation about the system. The system might tell you about its performance if you ask it (eg, the system is self-documenting), but you also have to beware that the performance can change -- and neither benchmark (brute force) nor documentation will tell you unless you look for it (re-run benchmarks or re-query documentation).
Misdicorl 19 hours ago [-]
I'll throw my hat in the ring for "disagree". Few kinds of work have such clear and unambiguous results as optimization work (it's now X% faster!). Few kinds of work have such incredible and detailed tools to guide your hand in finding where to invest your effort (look at this hot loop!).
The fact that sometimes optimization work is tricky or requires some pre-thinking, or is even gasp counter-intuitive is such a hilarious way to say "this is hard". That's just table stakes starting points for so-so-so much work.
Edit: Heck, even deciding whether to prioritize optimization work or feature work is usually a harder problem than the actual optimization work itself.
saagarjha 5 hours ago [-]
Performance optimizations are usually tradeoffs. Which ones you make affect the outcome.
bddicken 8 hours ago [-]
This was a great read - thanks!
nostrademons 16 hours ago [-]
This is coming from the perspective of a performance engineer whose day job is squeezing every last bit of performance out of system libraries and low-level code. This is important work, and it can pay very well if you get one of the few positions in it. But for an application developer whose primary day job is cranking out features and then spending 10% of the time at the end optimizing them, the conclusion (and the headline) very much does not hold.
For them, the systematic way to optimize goes: profile your code, and then apply domain knowledge of the product to optimize the hotspots with these common techniques:
Don't do repeated work. (If you have an expensive invariant calculation in a loop or function call, move it out, to a higher level of the program where it can be done once.)
Save your work. (Caching and memoization.)
Do less work. (Alter the product requirements to use less computationally-intensive techniques in cases where users won't notice the difference.)
Do work when the user isn't looking. (Move computations to background tasks, apply concurrency, perform async requests.)
If all else fails, call in a performance engineer like the author to micro-optimize your building blocks.
You can often get speedups of 3-6 orders of magnitude by applying these techniques, simply because the original code is so brain-dead. Performance engineers like the author tend to work on code that has already been tightly optimized, and so there is less low-hanging fruit to pick.
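A minimal sketch of the first two techniques in practice; the names here (`expensive_rate_for`, regions) are hypothetical, not from any particular codebase:

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical stand-in for an expensive call (a config parse, a regex compile, a DB round trip...).
    static int expensive_rate_for(const std::string& region) {
        return static_cast<int>(region.size()) * 7;  // pretend this is slow
    }

    // Before: the expensive lookup runs once per item, even though regions repeat.
    static int total_before(const std::vector<std::string>& regions) {
        int total = 0;
        for (const auto& r : regions)
            total += expensive_rate_for(r);          // repeated work
        return total;
    }

    // After: "save your work" -- memoize per distinct region so each rate is computed at most once.
    static int total_after(const std::vector<std::string>& regions) {
        std::unordered_map<std::string, int> cache;
        int total = 0;
        for (const auto& r : regions) {
            auto it = cache.find(r);
            if (it == cache.end())
                it = cache.emplace(r, expensive_rate_for(r)).first;
            total += it->second;
        }
        return total;
    }

Nothing clever, but on real product code this kind of change is frequently where the first big multiplier comes from.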
oqtvs 20 hours ago [-]
Question coming from the article: what would be better tooling instead of profilers and MCA?
purplesyringa 18 hours ago [-]
I'd love to use a tool that shows the state of every CPU component at each point in time. Performance counters demonstrate global behavior, while what actually matters during optimization is local behavior. I'd like to be able to inspect pipeline stalls and conditions that led to these situations, I'd like to get an estimate on the efficiency of port allocation, I'd like to be able to compare the rate of memory accesses vs computation and get exact numbers, e.g. "you can access 20% more data over the bus without adding CPU stalls".
Ygg2 6 hours ago [-]
> Any developer can see that the following two snippets are (supposed to be) equivalent
Being how I fall under any developer, I'm going to bite the bullet and ask:
How are they supposed to be equivalent? One is a HashSet, the other is a boolean? And what is `c`?
saagarjha 5 hours ago [-]
c is an arbitrary object. The first snippet puts two numbers into the set and checks whether c is present, which is only true if c is equal to one of the two numbers.
Ygg2 4 hours ago [-]
I see. Luckily my mobile phone cut `.contains(&c)` off perfectly. And you only see the scrollbars when you click the text.
trhway 19 hours ago [-]
>Performance optimization is hard because it’s fundamentally a brute-force task, and there’s nothing you can do about it.
Fundamentally disagree. First it is building a mental model of what happens (a kind of analysis stage), then comparing it against a mental model of how it should or could work, or producing a more efficient algorithm/way of accomplishing the target task.
When people try to brute-force it -- let's try this, or this -- without having the model in mind, that is frequently a waste. Even when/if it produces some improvement, there is no understanding of, or guarantee about, which use cases the improvement covers, whether it regresses other use cases, or whether it will still be there after we push those new features/fixes/etc.
purplesyringa 18 hours ago [-]
The problem is that way too often, the model simply doesn't capture enough complexity to be applicable. This happens rarely during high-level optimization but is very common during microoptimization. You can build a model, and it will give you good enough results, but you won't be able to extract those last bits of performance you need to surpass SOTA.
Mawr 18 hours ago [-]
Sure, but that's a different statement. "Micro performance optimizations are fundamentally brute force" - ok.
ForOldHack 19 hours ago [-]
The word is 'Grock': you have to Grock the performance to optimize it.
My father had a PhD in Operations Research/Industrial Engineering.
hinkley 18 hours ago [-]
> Grock
Grock was a famous Swiss clown and once the highest paid entertainer in Europe.
The non-clown word you’re looking for is grok and Robert Heinlein coined it in 1961.
It seems there's less appreciation for this kind of work these days.
And when you finally "fixed" the issue through optimization, your mind allowed itself to let go of these open loops (borrowing the GTD terminology), leading to relaxation?
https://en.wikipedia.org/wiki/Biofeedback
Unfortunately I had to give up when I was still 10 times slower than the reference lol
Interesting. For me, it's refactoring.
This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
Measuring/profiling just means observing the system you want to optimize in a systematic way. You certainly won't be very effective at optimizing anything if you don't observe it.
Theoretical calculation means you've formed a model of what's happening and you're devising a plan to optimize against that model. But of course, a useful model needs to represent the significant aspects of your system (and a good model should exclude most insignificant ones). Failing to observe your system means your model could be bad -- focused on insignificant aspects and missing significant ones -- and you'd never know.
Profilers fill in gaps in our mental model of code execution but they are not a substitute for it. Computer behavior is largely knowable from first principles, albeit requiring a considerable degree of expertise and detail. For some people in some contexts, the profiler adds little information to what they already understand. I know a few people that do almost all of their optimization work using pencil and paper with great effectiveness and precision. Not something I would recommend for most software engineers but not unreasonable at the limit either.
If you optimize infrequently, or haven't profiled code like this recently, or haven't profiled this specific codebase recently, then your mental model is probably due a refresh.
I've spent a lot of time optimizing video codecs and while it is maybe possible to reason about these without a profiler it's much faster to profile. The profiler isn't going to tell you what you need to do though.
Profiling to find suboptimal code is perfectly fine. Then you need to figure out how to fix it. Many people don't understand how performance optimization works, so they blindly add caching, improve constant time by invoking more low-level methods, etc. This obviously doesn't work, yet intuitively (to those people, anyway) it should produce good results.
That's why the mantra exists: don't trust your intuition, don't believe it when it says these changes improve performance, instead measure that performance and only apply changes that work. This is also perfectly fine, but this is a double-edged sword, and I've seen people go too far in this direction.
For example, they refuse to make any changes that don't immediately improve performance according to the profiler. If they modify one computation and performance decreases, they abandon that path altogether. They treat optimization as a game with a dense fog of war, and they refuse to use deductive reasoning and, yes, intuition to apply changes that, according to the profiler at least, are not immediately rewarding.
Eg: indexing or partitioning a database table may appear to make things slower if you don't have both a representative amount of data and representative query patterns when you're measuring the change.
You should still measure your changes, but sometimes you need to be careful about measuring them in the right way, and possibly simulating a future context (eg: more scale) before drawing a conclusion.
Intuition about how the context will evolve, and what effect that might have on the tradeoffs of different approaches, is helpful.
Adding caches or switching to lower-level calls is definitely something learned, and I wouldn't call it "intuitive".
What I think you are referring to is that sometimes, simply reading and understanding the code can tell you where the problem really is — still, my experience is that you want to measure the before and after to at least identify the general area you should be looking at, and more often than not, I could figure out what needs optimizing and how without having to get detailed profiles.
I did end up being surprised a few times, but it was mostly due to external code like buggy library implementations that didn't do the things they promised (eg. async library really synchronizing everything with a badly placed mutex).
At the same time, it's wrong to focus on a single case or a single profile (executing one code path in one set of circumstances), but what you are describing simply sounds like bad engineering — the fact that you can misuse the tool does not make the tool bad, it's still the engineer wielding it who's at fault.
Caching and lower level calls are generic solutions that work everywhere, but are also generally the last and worst way to optimise (thus why they need such careful analysis since they so often have the opposite effect).
Better is to optimise the algorithms, where actual profiling is a lesser factor. Not a zero factor of course, as a rule of thumb it’s probably still wise to test your improvements, but if you manage to delete an n^2 loop then you really don’t need a profiler to tell you that you’ve made things better.
Imagine:

    function F() {
      for (i = 0; i < 10; i++) {
        A(); B(); C();
      }
    }
If we profile this code, we might find out, e.g. B takes the majority of the time--let's say 90%. So you spend hours, days, weeks, making B 2X faster. Great. Now you removed 45% of execution time. But the loop in the outer function F is just a few instructions, it is not "hot"--it won't show up in profiles except for ones that capture stacks.
If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed.
There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.
> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.
One of my best examples of this, I had a function reported as still taking 10% of cumulative run time, after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times and this function was getting called twice as much as it should. I found two sibling methods asking the same question and rearranged them so the answer came as an argument and nixed the duplicate call. I reran the benchmark and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
The worst memory pressure I ever fixed I saw a 10x improvement by removing one duplicate call. Now, there was a quadratic part of that call but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.
I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.
IMO, those are (generally) nowhere near as useful as a flame/icicle graph.
Not saying they are never useful; Sometimes people do really dumb things in 1 function. However, the actual performance bottleneck often lives at least a few levels up the stack.
But in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things do not matter until they do. So measuring is the way to go, I believe.
There’s so much noise at that point that even people who would usually catch problems start to miss them.
The usual response to this is, "well, you can turn caching off to do profiling", but that's incorrect, because once people know they can get a value from the cache they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? You find now that it's being called ten times.
And the usual response to that is, “well it’s free now so who cares?” Except it’s not free. Every cache miss now either costs you multiple, or much more complex cache bookkeeping which is more overhead, and every hit resets the MRU data on that entry making it more likely that other elements get evicted.
For instance in NodeJS concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that’s not actually reachable. I’ve had to fix that a few times. Hundreds of megabytes in one case because it kept an entire request handler in scope.
Once you “find” a candidate change you measure it to see if what you did made things worse and you put it back if it did, or maybe you try it in combination with other changes to see if its value is complementary.
But people fuck up all the time reading the initial telemetry, which is often where I come in. I get tired of hearing people say, “we’ve done all we can, look how flat this chart is,” and hand someone my beer. You won’t find all of the good candidates in the percentage of run time list. That’s not all the data that’s there, and not every change that works needs to even be supported by the initial data. It only needs to be supported by the delta afterward.
You do profiling because it's WAY too easy to get obsessed about theoretical problems when a simple measurement will show you the actual problems.
You do the math on the actual problem location, not a method with O(n!) which only gets called with n=3.
You still have to look at the entire call stack when profiling (which means thinking about the overarching algorithm).
If a problem area is so intuitively obvious, why would you introduce the problem in the first place? In reality, performance optimizations are usually needed where you least expect them. Which means that you can't get there intuitively. Hence, the suggestion of using profiling to help track down where the problem is instead.
Bring in a performance expert and their intuition can quickly identify many performance improvements. The tough part for the business is when and where do you pay for the experience of performant code, and when is good enough to leave alone?
Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.
If you're fretting about optimization at every stage of development, you're really doing it wrong. This is precisely the type of early optimization that everyone and their dog will tell you is bad. You should focus first on making your program work and laying out the logic and data flows. Optimization does not matter until the program is almost finished.
Most of the times I've seen this, the faster algorithm is literally just a dictionary with a well-defined key.
I honestly do not understand why that's not the first solution most devs think of and why n^2 seems to dominate.
As an example, I've seen code like this quite a bit (sketched below with hypothetical types):
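Something like the following, with made-up `Customer`/`Order` types standing in for whatever the real data was:

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Customer { std::string id; std::string name; };
    struct Order    { std::string customer_id; double amount; };

    // The pattern: for every order, re-scan the whole customer list. O(orders * customers).
    const Customer* find_customer_slow(const std::vector<Customer>& customers, const Order& o) {
        for (const auto& c : customers)
            if (c.id == o.customer_id) return &c;
        return nullptr;
    }

    // The replacement: build a dictionary keyed by id once, then each lookup is O(1) on average.
    // (Pointers stay valid as long as `customers` isn't resized.)
    std::unordered_map<std::string, const Customer*>
    index_by_id(const std::vector<Customer>& customers) {
        std::unordered_map<std::string, const Customer*> by_id;
        by_id.reserve(customers.size());
        for (const auto& c : customers) by_id[c.id] = &c;
        return by_id;
    }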
Easily replaced, and massively easier to read, with the keyed-lookup version above. The mindset of re-looking for values in a collection you are already iterating through is just foreign to me. The first solution for something like this in my mind is always utilizing dictionaries.
Measuring doesn't mean don't think. Measuring and thinking are two different things. You need to do them both to optimize effectively.
The fact remains that most projects which do small, trivial, modular prototypes first will ultimately know which paths are viable before painting themselves into a corner algorithmically.
Best of luck =3
> This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?
I believe the mantra “intuition doesn’t work, profile your code” is understood to mean “don't rely on your intuition alone; profile your work.” It's a warning that when it comes to performance optimization, intuition is often unreliable, so you need to supplement your mental model with actual data if you want to avoid big mistakes.
I don't know why the author of the blog post is interpreting the mantra literally, as if it claims intuition is obsoleted by profiling.
But it's also very easy to mislead yourself that way; many "optimizations" might do much less than you think. So you should avoid implementing more complex or harder-to-understand code just because you think it is faster, but otherwise I'd certainly try to write faster code by default in areas I know well enough to judge.
After narrowing the smallest leaf, think about another branch and if some optimizations studied could lead to nice results there too.
With time, they learn beforehand which branch is more promising and which optimizations work well for their problem.
I guess this could be called branch and bound with memoization instead of brute force aha.
Note: we write code in cuda
One of my early-career successes was just creating a framework for generating every permutation of perf optimizations for every (log-scaled -- clz is very fast) input size and checking which was best, dropping the results into a lookup table of function pointers to branch on. The university had a large supply of heterogeneous computers, replete with all the normal problems like being able to double floating-point addition throughput on Haswell CPUs by abusing the fmadd instruction, so I made a framework (probably closer to a DSL) for encoding your algorithms in a way that you could analyze perf tradeoffs at compile time and tune your result for the given computer. It's kind of like what ATLAS does for some linear algebra tasks.
Such practices are almost never optimal, but they're pretty easy to implement, and the results are near-optimal for almost all inputs. In the tradeoff between human and computer performance, I think it's a nice option.
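A heavily simplified sketch of that shape of solution: made-up kernels, a table indexed by log2 of the input size, and the winners hard-coded here where the real framework would have benchmarked them per machine:

    #include <array>
    #include <bit>
    #include <cstddef>
    #include <cstdint>

    using Kernel = uint64_t (*)(const uint32_t*, size_t);

    // Two hypothetical implementations of the same routine with different tradeoffs.
    static uint64_t sum_scalar(const uint32_t* p, size_t n) {
        uint64_t s = 0;
        for (size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }
    static uint64_t sum_unrolled(const uint32_t* p, size_t n) {
        uint64_t a = 0, b = 0;
        size_t i = 0;
        for (; i + 2 <= n; i += 2) { a += p[i]; b += p[i + 1]; }
        if (i < n) a += p[i];
        return a + b;
    }

    // One slot per log2(size) bucket, filled by benchmarking each candidate on the
    // target machine and keeping the winner per bucket (pretend that happened here).
    static const std::array<Kernel, 64> dispatch = [] {
        std::array<Kernel, 64> t{};
        for (auto& k : t) k = sum_scalar;
        for (int b = 10; b < 64; ++b) t[b] = sum_unrolled;  // "benchmark said so"
        return t;
    }();

    uint64_t sum(const uint32_t* p, size_t n) {
        int bucket = n ? static_cast<int>(std::bit_width(n)) - 1 : 0;  // ~log2(n); a single clz, so dispatch is cheap
        return dispatch[bucket](p, n);
    }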
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
This is true, but can be seen as a failure of language and tooling. For example, Halide [1] pioneered (AFAIK) the concept of separating algorithm from implementation at the language level. This separation lets you express the algorithm once, and then "schedule" it by specifying parallelism, vectorization, etc. You can provide multiple schedules for one algorithm, which allows you to specialize / make different choices depending on varying factors.
It's a really interesting concept, though maybe limited in practice to DSLs. I'm not sure a general purpose language would be a good fit for this model, but then, for general purpose programs written in general purpose languages, perf optimization at the level TFA discusses is frequently limited to just specific hot sections. Those hot sections could be extracted out into specialized components written in such a DSL.
1 - https://halide-lang.org/
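For a flavor of what that separation looks like, a rough sketch along the lines of Halide's introductory tutorials (exact API details vary by version, so treat this as illustrative rather than authoritative):

    #include "Halide.h"

    int main() {
        Halide::Func f;
        Halide::Var x, y, xi, yi;

        // The algorithm, stated once.
        f(x, y) = x + y;

        // One possible schedule: tile, vectorize the inner x, parallelize over rows.
        // A different target could get a different schedule without touching the algorithm.
        f.tile(x, y, xi, yi, 64, 64)
         .vectorize(xi, 8)
         .parallel(y);

        Halide::Buffer<int32_t> out = f.realize({512, 512});
        return 0;
    }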
you don't need to go all the way to Halide to do what the article is claiming isn't possible - you can do it just by including a "micro-kernel" in your library and having the code branch to that impl (depending on something at runtime) instead of whatever the C code compiled down to. this is done every single day in every single GPU lib (famously cublas ships with hundreds/thousands of such ukernels for gemms depending on shapes).
E.g. I wrote a `pextComptime` function, which will compile to just a `pext` instruction on machines that have a fast implementation, otherwise it will try to figure out if it can use a few clever tricks to emit just a couple of instructions, but if those aren't applicable it will fallback on a naïve technique.
https://github.com/Validark/Accelerated-Zig-Parser/blob/8782...
Your suggestions introduce, in effect, a hypothetical `if` statement, only one branch of which is taken. I can change the condition arbitrarily, but ultimately it's still going to be either one or the other.
I want the `if` to take both branches at once. I want the compiler to assume that both branches trigger the exact same side effects and return the same results. I want it to try both approaches and determine the better one depending on the environment, e.g. the number of free registers, (lack of) inlining, facts statically known about the input -- all those things that you can't write a condition for on the source level.
Think about it this way. A standard compiler like LLVM contains passes which rewrite the program in order. If something has been rewritten, it will never be rolled back, except if another pass performs a separate rewrite that explicitly undoes it. In contrast, e-graphs-based compilers like Cranelift maintain an equivalence graph that represents all possible lowerings, and after the whole graph is built, an algorithm finds a single optimal lowering.
Existing solutions make me choose immediately without knowing all the context. The solution I'd like to see would delay the choice until lowering.
- https://internals.rust-lang.org/t/expose-llvms-or-disjoint-i...
- https://github.com/rust-lang/libs-team/issues/373
- https://github.com/rust-lang/rust/pull/124601
Which ultimately culminated in an opinion that should sound familiar - https://github.com/rust-lang/rust/pull/124601#issuecomment-2....
I'm glad to know that's the design of Cranelift, since that's how I would think it should be done, although I haven't written a middle or backend for a compiler yet.
Do you have a good entrypoint reference for learning about how this works? This (and the associated mention in the article) is the first time I've heard of this approach.
this is called symbolic execution - as i have already told you, many compilers do this in the form of sccp and scev. you don't need silly things like egraphs for this.
> If something has been rewritten, it will never be rolled back
this is patently false - not only do passes get rolled back to catch/correct errors but there is a fixed-point iteration system (at least in MLIR) that will apply passes as long as they are "profitable".
> not only do passes get rolled back to catch/correct errors but there is a fixed-point iteration system (at least in MLIR) that will apply passes as long as they are "profitable"
Huh? I'm not arguing against fixed-point iteration, that's perfectly fine because it's still a unidirectional process. What do you mean by "passes get rolled back to catch/correct errors" though? Certain rewrites can certainly be not performed in the first place if they pessimize the code, but that's not what I'm talking about.
If there's pass 1 that chooses between rewrite A and B, pass 2 that chooses between rewrite C and D, and pass 3 choosing between E and F, then in a typical compiler these choices would be made one by one, mostly greedily. An e-graph style approach allows all eight combinations to be tried out without necessarily leading to a combinatorial explosion.
> There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases.
this is manifestly obviously possible (as i've said).
what you're talking about is something completely different goes by many names and uses many techniques (symbex, conex, sccp, scev, blah blah blah). many of these things are implemented in eg LLVM.
https://gcc.gnu.org/onlinedocs/gcc/Function-Multiversioning....
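For reference, the mechanism behind that link lets you write one function, list several targets, and have the loader pick the best clone for the CPU at hand. A minimal sketch (the dot-product body is just illustrative; requires GCC or a recent Clang on a target with ifunc support):

    #include <cstddef>
    #include <cstdio>

    // GCC emits one clone per listed target plus a resolver that picks
    // the best match for the CPU the binary actually runs on.
    __attribute__((target_clones("avx2", "sse4.2", "default")))
    long dot(const int* a, const int* b, size_t n) {
        long acc = 0;
        for (size_t i = 0; i < n; ++i) acc += static_cast<long>(a[i]) * b[i];
        return acc;
    }

    int main() {
        int a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
        std::printf("%ld\n", dot(a, b, 4));
    }

Note this selects per-CPU at load time rather than per-call-site, so it only partially covers the "general case vs. special case" split being discussed here.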
Strategic optimisation is often basically free if you have domain expertise. It's that easy to know that the business wants outcome x and that algorithm y is the right choice, etc., if it's all internal thought processes. Whereas if you don't know enough, you're likely to make decisions that are very expensive to undo.
They are very much not the simplest possible implementation of the algorithm and they run an order of magnitude faster for some algorithms.
Optimization of computer programs is a fundamentally uncomputable affair due to Rice's theorem – in the general case you can't check if two programs are equivalent.
I have found that many engineers write complex code faster than simple code.
You're given requirements like: "the program should do W when the user does A, X when the user does B, Y when the user does C, and Z when the user does D." And a naive programmer will happily trot off and write a pile of code for each of those cases, often with a lot of redundancy between them.
It takes more time and judgement to analyze those cases, see what they have in common, and distill a simpler underlying model for the behavior that encompasses all of the requirements.
"Even Apple’s LLVM fork lacks scheduling annotations for Apple Silicon. How am I supposed to write efficient code when Apple doesn’t bother to tune their own compiler?"
In addition to its public fork of LLVM, Apple also keeps a very private fork where (one assumes) they keep all the really juicy stuff.
I agree that it is frustrating not to have the data, but don't confuse "I don't have the data" with "no one has the data".
I'm not so sure. How many cycles would you expect this code to take?
According to Agner Fog, these 3 instructions have a latency of 15 cycles on an AMD Zen 1 processor. On Zen 2, the latency is 2 cycles. This is because the CPU was given the ability to assign a register to `dword [rsi]`, overcoming the limit of 16 registers. This optimization is subject to problems: pointer aliasing will obviously cause the CPU to make the wrong assumption at times, producing a situation not entirely unlike a branch mispredict.
There are constraints imposed by the micro-architecture for this feature. For you and I, a big one is it only works with general purpose registers. But is there a reason it couldn't or shouldn't be done for vectors? It seems like a micro-arch issue to me. Perhaps in a few years or in a lot of years, we'll have a CPU that can do this optimization for vectors.
I've known that similar optimizations exist, namely store-to-load forwarding, but I didn't know that AMD has experimented with mapping in-flight writes straight into the register file. Sounds like they've abandoned this approach, though, and Zen 3 doesn't feature this, supposedly because it's expensive to implement. So for all intents and purposes, it doesn't exist anymore, and it probably won't be brought back in the same fashion.
I do still think this is something better solved by ISA changes. Doing this on the uarch level will either be flaky or more costly. It is absolutely possible, but only with tradeoffs that may not be acceptable. The APX extension doubles the number of GPRs and improves orthogonality, so there's at least work in that direction on the ISA level, and I think that's what we're realistically going to use soon.
That doesn't seem to be what this post it talking about. It seems to talking about well worn areas trying to improve the state of the art. An example that illustrates it for me is DeepMind's AlphaTensor finding a better way to multiply matrices[0] in 2022. It wasn't a brute-force solution, but the scale of it makes it appear so.
> On 4x4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5x5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4x4 algorithm, and separately tweaked Deepmind's 96-step 5x5 algorithm down to 95 steps in mod 2 arithmetic and to 97[24] in normal arithmetic.[25] Some algorithms were completely new: for example, (4, 5, 5) was improved to 76 steps from a baseline of 80 in both normal and mod 2 arithmetic.
This to me shows that direct profiling and observation wouldn't have led to the optimization. Improvements needed a sort-of but not actually brute-force effort of many people trying, but also being clever with their attempts.
[0] https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...
It's really just debugging and troubleshooting, but with a different goal in mind.
It’s a process of continuous uncovering, and unless you have visibility across the whole stack — from kernel to cluster — you’ll spend all your time slicing through surface layers with lots of tears being shed.
Fortunately, there are software automation solutions to this.
In my experience, most webapps can fix so many low-hanging performance issues by shaping the API to match how it's used in the client. It can remove so much mapping and combining of data all over.
In C++, you can achieve performance using the often-denigrated standard template library, if you would only pay attention to the documented performance requirements of the function calls you make. But even there it's not the panacea it could be, because the documentation often states an amortized cost while hand-waving away the access pattern (for example: std::unordered_map is great in algorithmic cost theory and terrible in memory access patterns).
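As a concrete illustration of that last point, compare two containers with similar on-paper lookup cost but very different access patterns (types and numbers here are hypothetical; measure on your own workload):

    #include <algorithm>
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Similar asymptotic story on paper -- O(1) vs O(log n) lookups -- but very different
    // memory behavior: unordered_map chases a pointer per bucket node, while a sorted
    // vector of pairs does a binary search over one contiguous allocation.
    using Map  = std::unordered_map<uint64_t, uint64_t>;
    using Flat = std::vector<std::pair<uint64_t, uint64_t>>;  // kept sorted by key

    uint64_t lookup_map(const Map& m, uint64_t key) {
        auto it = m.find(key);
        return it != m.end() ? it->second : 0;
    }

    uint64_t lookup_flat(const Flat& v, uint64_t key) {
        auto it = std::lower_bound(v.begin(), v.end(), key,
                                   [](const auto& p, uint64_t k) { return p.first < k; });
        return (it != v.end() && it->first == key) ? it->second : 0;
    }

Which one wins depends on sizes, hit rates, and what else is fighting for cache, which is exactly the kind of thing the big-O line in the documentation won't tell you.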
What's the algorithmic cost (big-O) of this function call? What's the memory footprint of that function call? Worse: is that documented big-O estimated across a contiguous dataset, discontiguous paged datasets, or does it have to dereference pointers? What's the network cost of a given call? It's hard to know if you don't know that `write()` to a socket could incur 40 or 400 bytes across a network, and don't know whether that network is 1mbps, 10mbps, 1gbit, or localhost, etc; and with how much latency.
For example, when I hand-rolled some x86_64 SIMD instructions to analyze some DNA, I found the Intel Intrinsics Guide [0] 1000% helpful because many of the instructions detailed what sort of performance to expect on specific processor architectures (or, at least, in general). If you read the Intel Instruction Set Reference [1], a lot of similar performance information can be found. The performance I achieved was approximately the theoretical bottleneck between CPU and main RAM; getting better performance would have required a complete algorithm change which would have been Ph.D paper publishing worthy.
Of course, sometimes even low level hardware can have performance-killing bugs.
[0]: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
[1]: https://www.intel.com/content/www/us/en/developer/articles/t...