The rules are: 1) scalars always broadcast, 2) if one array has fewer dimensions, left-pad its shape with 1s, and 3) starting from the right, check dimension compatibility, where compatibility means the dimensions are equal or one of them is 1. Example: np.ones((2,3,1)) * np.ones((1,4)) has shape (2,3,4).
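For instance, checking all three rules at the REPL (shapes chosen arbitrarily):

    import numpy as np

    a = np.ones((2, 3, 1))           # shape (2, 3, 1)
    b = np.ones((1, 4))              # rule 2: left-padded to (1, 1, 4)
    # rule 3, right to left: 1 vs 4 -> 4, 3 vs 1 -> 3, 2 vs 1 -> 2
    print((a * b).shape)             # (2, 3, 4)

    c = np.ones((2, 3)) * 5          # rule 1: the scalar broadcasts everywhere
    # np.ones((2, 3)) * np.ones((3, 2))  # 3 vs 2 mismatch -> ValueError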
Once your dimensions are correct, it's a lot easier to reason your way through a problem, similar to how basic dimensional analysis in physics can verify your answer makes some sense.
(I would disable broadcasting if I could, since it has caused way too many silent bugs in my experience. JAX can do this, but I don't feel like learning another library just for that.)
Once I understood broadcasting, it was a lot easier to practice vectorizing basic algorithms.
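For example, turning an element-by-element loop into a single array expression (made-up data):

    import numpy as np

    xs = np.random.rand(1_000_000)

    # Loop version: sum of squares, one element at a time.
    total = 0.0
    for x in xs:
        total += x * x

    # Vectorized version: same result (up to float rounding), one expression.
    total_vec = np.sum(xs ** 2)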
After taking the time to work through that doc and ponder some real-world examples, I went from being very confused by broadcasting to employing intermediate broadcasting techniques in a matter of weeks. Writing out your array dimensions in the same style as their examples (either in a text file or on a notepad) is the key technique IMO:
Image  (3d array): 256 x 256 x 3
Scale  (1d array):             3
Result (3d array): 256 x 256 x 3
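In code, that writeup corresponds to something like this (the scale values are made up):

    import numpy as np

    image = np.random.rand(256, 256, 3)   # Image  (3d array): 256 x 256 x 3
    scale = np.array([0.9, 1.0, 1.1])     # Scale  (1d array):             3
    result = image * scale                # Result (3d array): 256 x 256 x 3
    assert result.shape == (256, 256, 3)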
And of course with practice you can do it in your head. That said, yes, you definitely should at least make an attempt to clarify your broadcasting logic if you want to be able to read your own scripts a month from now, let alone write maintainable production code.
Unfortunately, there is far too much existing code, and Python is not type-safe.
The others are just messy shit. Like, you've got np.abs but no arr.abs, np.unique but no arr.unique. But then you do have arr.mean.
Sometimes the argument is named index, sometimes indices; sometimes a function accepts a list or a tuple, sometimes only a tuple.
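A quick illustration of the function-vs-method inconsistency:

    import numpy as np

    arr = np.array([-2, -1, 1, 2])
    arr.mean()        # exists as a method
    np.abs(arr)       # exists as a function...
    # arr.abs()       # ...but not as a method: AttributeError
    np.unique(arr)    # same story: there is no arr.unique()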
I guess it might be hard to achieve a similar feature in Python without metaprogramming.
I've been writing my own low-level numeric routines lately, so I'm not up-to-date on the latest news, but there have been a few ideas floating around over the last few years about naming your axes and defining operations in terms of those names [0,1,2,3]. That sort of thing looks promising to me, and one of those projects might be a better conceptual fit for you.
[0] https://nlp.seas.harvard.edu/NamedTensor
[1] https://pypi.org/project/named-arrays/
I particularly recommend checking out xarray. It has made my numpy-ish code like 90% shorter and it makes it trivial to juggle six+ dimensional arrays. If your data is on a grid (not shaped like a table/dataframe), I see no downsides to using xarray instead of bare numpy.
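A rough sketch of what that buys you (the dimension names and data here are made up):

    import numpy as np
    import xarray as xr

    # Temperature-style data on a (time, lat, lon) grid.
    temp = xr.DataArray(
        np.random.rand(4, 90, 180),
        dims=("time", "lat", "lon"),
    )

    # Axes are addressed by name, so there is no axis-number bookkeeping:
    time_mean = temp.mean(dim="time")          # dims: (lat, lon)
    profile = temp.mean(dim=("time", "lon"))   # dims: (lat,)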
I wish ChatGPT had been around when I learned C. It would sure have saved the programmers in my neighboring offices a lot of grief.
Implicit type casting is considered a mistake in most programming languages; if I were to redesign numpy from scratch I would make all broadcasting explicit.
My solution to these problems is asserting an array's shape often. Does anybody know if there's a tool like mypy or valgrind, but one that checks mismatched array shapes rather than types or memory leaks?
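For reference, the kind of runtime check I mean, as a hand-rolled helper rather than any library API:

    import numpy as np

    def assert_shape(arr, expected):
        # None in `expected` matches any size along that axis.
        assert arr.ndim == len(expected), (arr.shape, expected)
        for got, want in zip(arr.shape, expected):
            assert want is None or got == want, (arr.shape, expected)

    batch = np.zeros((32, 256, 256, 3))
    assert_shape(batch, (None, 256, 256, 3))   # any batch size of RGB images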
Pandas has a .pipe(fn) method, but without lazy evaluation to enable R's symbol-capturing magic, the syntax is pretty clunky and not particularly useful. The closest approximation is method chaining, which at least is more consistently available in Pandas than in Numpy.
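What the clunkiness looks like in practice (toy frame and column names made up):

    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

    def add_ratio(frame):
        # The frame has to be named explicitly inside the function;
        # there is no dplyr-style capture of bare column symbols.
        return frame.assign(ratio=frame["y"] / frame["x"])

    out = (
        df.pipe(add_ratio)
          .query("ratio > 2")
          .rename(columns={"ratio": "y_over_x"})
    )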
If you're talking about dplyr "verbs" then no, there's nothing quite like that in Python, but it's much less necessary in Pandas or Polars than in R, because the set of standard tools for working with data frames in the bear libraries is much richer than in the R standard library.
This was a footgun due to C long being 32-bit on Win64. Glad that they changed it.
I am much more of an "upgrade when there is an X.1 release" kind of guy, so hats off to those who will bravely be testing this version on my behalf.
One interesting new feature, though, is the support for string routines: https://numpy.org/devdocs/reference/routines.strings.html#mo...
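If I'm reading those docs right, the routines live in the new np.strings namespace and are ufunc-based, so they vectorize over whole arrays:

    import numpy as np

    names = np.array(["alice", "bob", "carol"])
    np.strings.upper(names)     # array(['ALICE', 'BOB', 'CAROL'], dtype='<U5')
    np.strings.str_len(names)   # array([5, 3, 5])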
Sounds almost like they're building a language inside a language.
That's not true? The Python string implementation is very optimized and probably has performance similar to C.
Strings are immutable, so there's no efficient truncation, concatenation, or modification of any kind; you're always reallocating.
There's no native support for a view of a string, so operations like iterating over windows or ranges either have to allocate or have to throw away all the string abstractions.
By nature of how the interpreter stores objects, strings will always carry an extra level of indirection compared to what you can do in a language like C.
Python strings have multiple potential underlying representations, and thus carry some overhead for managing those representations without exposing the details to user code.
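Two of those costs are easy to see directly (a minimal sketch):

    s = "abcdef"
    t = s[1:4]    # no string views: slicing copies; t cannot alias s's buffer
    u = s + "!"   # immutability: concatenation allocates a brand-new object
    # Every "modification" of s actually produces a new string object.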
They're not building a language. They're carefully adding a newly-in-demand feature to a mature, already-built language.
> arange’s start argument is positional-only
You may want to check out CuPy.
Even the new string dtype, I expect, would go unnoticed by half of users or more: they won't be using it (because Numpy historically had only fixed-length strings and generally poor support for them), so they won't even think to try it. Pandas, meanwhile, has had a proper string dtype for a while, so anyone interested in doing serious work on strings in data frames / arrays would presumably be using Pandas anyway.
Most of the breaking changes are in long-deprecated oddball functions that I literally have never seen used in the wild, and in the internal parts that will be a headache for library developers.
The only change that a casual user might actually notice is the change in repr(np.float64(3.0)), from "3.0" to "np.float64(3.0)".
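Easy to check after upgrading:

    import numpy as np

    x = np.float64(3.0)
    repr(x)   # NumPy 2.x: 'np.float64(3.0)'  (NumPy 1.x: '3.0')
    str(x)    # still '3.0', so print() output is unchanged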
Let me do `pip install numpy2` and not have to worry about whether or not some other library in my project requires numpy<2.
From a project point of view, there are some pretty strong contra-indicators in the last 20 years of language development that make this plan suspect, or at least pretty scary — both Perl and Python had extremely rocky transitions around major versions; Perl’s ultimately failing and Python’s ultimately taking like 10 years. At least. I think the last time I needed Python 2 for something was a few months ago, and before that it had been a year or so. I’ve never needed Perl 6, but if I did I would be forced to read a lot of history while I downloaded and figured out which, if any, Perl 5 modules I’m looking for got ported.
I’d imagine the numpy devs probably don’t have the resources to support what would certainly become two competing forks that each have communities with their own needs.
raku has good package compatibility via Inline::Perl5 and Inline::Python and FFI to languages like Rust and Zig
among the many downsides of the transition, one upside is that raku is a clean sheet of paper and has some interesting new work for example in LLM support
I have started work on a new raku module called Dan::Polars and would welcome contributions from Numpy/Pandas folks with a vision of how to improve the APIs and abstractions … it’s a good place to make a real contribution, help make something new and better and get to grips with some raku and some rust.
just connect via https://github.com/librasteve/raku-Dan-Polars if you are interested and would like to know more
One huge pain point for me in perl 5 was just how incredibly slow CPAN was compared to `go get`, like two orders of magnitude slower. I remember putting up with those times in the ‘90s because package management was a kind of miracle over FTP sites, but it’s a big ask in today’s world.
What’s raku’s story here, out of curiosity?
the raku package manager (zef) comes bundled with the rakudo compiler; I use https://rakubrew.org
https://raku.land is a directory of raku packages
I would say that zef is very good (it avoids the frustrations of Python package managers like pip and conda). Like perl before it, raku was designed with packages and installers in mind, with a concern for a healthy ecosystem.
for example, all modules carry versioning (via the META6.json payload descriptor), and the module version descriptor is a built-in language type https://docs.raku.org/type/Version that does stuff like this:
    say v1.0.1 ~~ v1.*.1;   # OUTPUT: «True»

and this:

    zef install Dan::Pandas:ver<4.2.3>:auth<github:jane>:api<1>

and this:

    use Dan::Pandas:ver<4.3.2+>:auth<github:jane>:api<1>;
(of course, authors must authenticate to upload modules)

My point, or at least the point I had in mind, was that the social and the technical go together in a lot of subtle and sometimes surprising ways; in this case, I'd bet the idea of a second package name a) is a bad one, because it's likely to create differing community expectations about whether or not it's okay to keep using the 1.0 package, and b) would let people feel okay not upgrading for a while / demanding security and bug-fix point releases on 1.x longer than if the package itself just updated its major version.
So at least the migration path for Python modules is clear: upgrade to be numpy 2 compatible, wait for critical mass, then start adding numpy 2 features. Sounds way better than the python2 -> python3 migration, for example.
However, the fact that I had to look at a third-party page to find this out is IMHO a big documentation problem. It should be plastered across all announcements, the documentation, and the migration page: "there is a common subset of numpy 1 and 2 for Python code, so you can upgrade now, no need to wait for full adoption".
Is this not common knowledge? Also, pip install? Or do you mean some requirements file?
Makes it look like they pressed publish before filling in their template, or is this on purpose?
Most people are simply unaware of them, which is why we get stuff like pandas on top of everything.
Maybe try Pixi? [1] Python programming enjoyability really increased for me after using Pixi for dependencies, VSCode+Pylance [2] for editing, and Ruff [3] for formatting.
Pixi can install both python and dependencies _per project_. Then, I add this to .vscode/settings.json:
{
  "python.analysis.typeCheckingMode": "strict",
  "python.defaultInterpreterPath": ".pixi/envs/default/bin/python3"
}
and I'm all set!
> It is the result of 11 months of development since the last feature release and is the work of 212 contributors spread over 1078 pull requests
instead of:
> It is the result of X months of development since the last feature release by Y contributors