Most of this is not about Python, it's about matplotlib. If you want the admittedly very thoughtful design of ggplot in Python, use plotnine.
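For reference, a minimal plotnine sketch (assuming the plotnine and palmerpenguins packages, and the penguins data used elsewhere in this thread):

from plotnine import ggplot, aes, geom_point
from palmerpenguins import load_penguins

penguins = load_penguins()
plot = (
    ggplot(penguins, aes(x='flipper_length_mm', y='body_mass_g', color='species'))
    + geom_point()  # same grammar-of-graphics layering as R's ggplot2
)
plot.save('penguins.png')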
> I would consider the R code to be slightly easier to read (notice how many quotes and brackets the Python code needs)
This isn’t about Python, it’s about the tidyverse. The reason you can use this simpler syntax in R is because its non-standard evaluation allows packages to extend the syntax in a way Python does not expose: http://adv-r.had.co.nz/Computing-on-the-language.html
Oh god no, do people write R like that, pipes at the end? Elixir-style pipe operators at the beginning are the way.
And if you really wanted to "improve" readability by confusing arguments/functions/vars just to omit quotes, Python can do that; you'll just need a wrapper object and getattr hacks to get from `my_magic_strings.foo` -> `'foo'`. As for the brackets... ok, that's a legitimate improvement, but again not language related; it's library API design for function sigs.
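A minimal sketch of that getattr hack (the `MagicStrings`/`cols` names are invented for illustration):

class MagicStrings:
    # Attribute access on this sentinel returns the attribute's own
    # name as a string, so cols.foo stands in for the string 'foo'.
    def __getattr__(self, name: str) -> str:
        return name

cols = MagicStrings()
assert cols.species == 'species'
assert cols.body_mass_g == 'body_mass_g'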
They are of course now abandoning this idea.
The irony here: We are talking about data science. 98% of "data science" Python projects start by creating a virtual env and adding Pandas and NumPy, which have numerous (really: squillions of) dependencies outside the standard library.
pandas==2.3.3
├── numpy [required: >=1.22.4, installed: 2.2.6]
├── python-dateutil [required: >=2.8.2, installed: 2.9.0.post0]
│ └── six [required: >=1.5, installed: 1.17.0]
├── pytz [required: >=2020.1, installed: 2025.2]
└── tzdata [required: >=2022.7, installed: 2025.2]
> its non-standard evaluation allows packages to extend the syntax in a way Python does not expose
Well this is a fundamental difference between Python and R.
groups = {}
for row in filtered:
    key = (row['species'], row['island'])
    if key not in groups:
        groups[key] = []
    groups[key].append(row['body_mass_g'])
can be rewritten as:

groups = collections.defaultdict(list)
for row in filtered:
    groups[(row['species'], row['island'])].append(row['body_mass_g'])
and

variance = sum((x - mean) ** 2 for x in values) / (n - 1)
std_dev = math.sqrt(variance)

as:

std_dev = statistics.stdev(values)

It's also funny that one would write their own standard deviation function and include Bessel's correction. Usually if I'm manually re-implementing a standard deviation function it's because I'm afraid the implementors blindly applied the correction without considering whether or not it's actually meaningful for the given analysis. At the very least, the correct name for what's implemented there should really be `sample_std_dev`.
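For what it's worth, the stdlib has both estimators, which makes the naming point concrete (toy values, just for illustration):

from statistics import stdev, pstdev

values = [3750, 3800, 3250, 3450]
sample_sd = stdev(values)       # Bessel's correction: divides by n - 1
population_sd = pstdev(values)  # no correction: divides by n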
In the first instance, the original code is readable and tells me exactly what's what. In your example, you're sacrificing readability for being clever.
Clear code (even if verbose) is better than being clever.
defaultdict is ubiquitous in modern python, and is far from a complicated concept to grasp.
The difference between the examples is so trivial I'm not really sure why the parent comment felt compelled to complain.
If you step back, it's kind of weird that there's no mainstream programming language that has tables as first class citizens. Instead, we're stuck learning multiple APIs (polars, pandas) which are effectively programming languages for tables.
R is perhaps the closest, because it has data.frame as a 'first class citizen', but most people don't seem to use it, and use e.g. tibbles from dplyr instead.
The root cause seems to be that we still haven't figured out the best language to use to manipulate tabular data yet (i.e. the way of expressing this). It feels like there's been some convergence on some common ideas. Polars is kind of similar to dplyr. But no standard, except perhaps SQL.
FWIW, I agree that Python is not great, but I think it's also true R is not great. I don't agree with the specific comparisons in the piece.
What tools are easily available in a language, by default, shape the pretty path, and by extension, the entire feel of the language. An example that we've largely come around on is key-value stores. Today, they're table stakes for a standard library. Go back to the '90s, and the most popular languages at best treated them as second-class citizens, more like imported objects than something fundamental like arrays. Sure, you can implement a hash map in any language, or import someone else's implementation, but oftentimes you'll instead end up with nightmarish, hopefully-synchronized arrays, because those are built-in, and ready at hand.
> There's a number of structures that I think are missing in our major programming languages. Tables are one. Matrices are another.
I disagree. Most programmers will go their entire career and never need a matrix data structure. Sure, they will use libraries that use matrices, but never use them directly themselves. It seems fine that matrices are not a separate data type in most modern programming languages.

And all of those programmers are either using specialized languages (suffering problems when they want to turn their program into a shitty web app, for example), or committing crimes against syntax like
rotation_matrix.matmul(vectorized_cat)
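Worth noting that Python did grow a dedicated infix operator for exactly this complaint (PEP 465); a small numpy sketch, reusing the names above for illustration:

import numpy as np

rotation_matrix = np.array([[0.0, -1.0],
                            [1.0,  0.0]])  # rotate 90 degrees
vectorized_cat = np.array([1.0, 0.0])

rotated = rotation_matrix @ vectorized_cat  # instead of .matmul(...)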
Everyone in R uses data.frame because tibble (and data.table) inherits from data.frame. This means that "first class" (base R) functions work directly on tibble/data.table. It also makes it trivial to convert between tibble, data.table, and data.frames.
I suspect that in the fullness of time, mainstream languages will eventually fully incorporate tabular programming in much the same way they have slowly absorbed a variety of idioms traditionally seen as part of functional programming, like map/filter/reduce on collections.
[0] https://en.wikipedia.org/wiki/Q_(programming_language_from_K...
[1] https://ryelang.org/blog/posts/comparing_tables_to_python/
But ultimately data analysis is going beyond Python and R into the realm of Stan and PyMC3, probabilistic programming languages. It’s because we want to do nested integrals and those software ecosystems provide the best way to do it (among other probabilistic programming languages). They allow us to understand complex situations and make good / valuable decisions.
As soon as you start doing things like joins, it gets complicated, but in theory you could do something like an API of an ORM to do most things. Using just operators, you quickly run into the fact that you have to overload (abuse) operators or write a new language with different operator semantics:

orders * customers | (customers.id == orders.customer_id) | (orders.amount > Decimal('10.00'))

Where * means cross product/outer join and | means filter. Once you add an ordering operator, a group by, etc., you basically get SQL with extra steps. But it would be nice to have it built in so talking to a database would be a bit more native.
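A toy sketch of that operator-overloading idea (the `Table` class is invented here, and the predicates are plain lambdas rather than the column-expression syntax sketched above):

from decimal import Decimal

class Table:
    def __init__(self, rows):
        self.rows = list(rows)

    def __mul__(self, other):     # '*' as cross product
        return Table({**a, **b} for a in self.rows for b in other.rows)

    def __or__(self, predicate):  # '|' as filter
        return Table(r for r in self.rows if predicate(r))

orders = Table([{'customer_id': 1, 'amount': Decimal('12.50')},
                {'customer_id': 2, 'amount': Decimal('3.00')}])
customers = Table([{'id': 1, 'name': 'Ada'},
                   {'id': 2, 'name': 'Bob'}])

joined = (orders * customers
          | (lambda r: r['id'] == r['customer_id'])
          | (lambda r: r['amount'] > Decimal('10.00')))
# joined.rows -> [{'customer_id': 1, 'amount': Decimal('12.50'), 'id': 1, 'name': 'Ada'}]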
You're forgetting R's data.table, https://cran.r-project.org/web/packages/data.table/vignettes..., which is amazing. Tibbles only won because they fought the docs/onboarding battle better, and dplyr ended up getting industry buy-in.
But you can have the best of both worlds with https://dtplyr.tidyverse.org/, using data.table's performance improvements with dplyr syntax.
Relevant to the author's point, Python is pretty poor for this kind of thing. Pandas is a perf mess. Polars, duckdb, dask etc, are fine perhaps for production data pipelines but quite verbose and persnickety for rapid iteration. If you put a gun to my head and told me to find some nuggets of insight in some massive flat files, I would ask for an RStudio cloud instance + data.table hosted on a VM with 256GB+ of RAM.
most procs use tables as both input and output, and you better hope the tables have the correct columns.
you want a loop? you either get an implicit loop over rows in a table, write something using syscalls on each row in a table, or you're writing macros (all text).
Simplifying a lot, R is heavily inspired by Scheme, with some lazy evaluation added on top. Julia is another take at the design space first explored by Dylan.
My problem with APL is 1.) the syntax is less amazing at other more mundane stuff, and 2.) the only production worthy versions are all commercial. I'm not creating something that requires me to pay for a development license as well as distribution royalties.
The languages devs use are largely Algol derived. Algol is a language that was used to express algorithms, which were largely abstractions over Turing machines, which are based around an infinite 1D tape of memory. This model of 1D memory was built into early computers, and early operating systems and early languages. We call it "mechanical sympathy".
Meanwhile, other languages at the same time were invented that weren't tied so closely to the machine, but were more for the purpose of doing science and math. They didn't care as much about this 1D view of the world. Early languages like Fortran and Matlab had notions of 2D data matrices because math and science had notions of 2D data matrices. Languages like C were happy to support these things by using an array of pointers because that mapped nicely to their data model.
The same thing can be said for 1-based and 0-based indexing -- languages like Matlab, R, and Excel are 1-based because that's how people index tables; whereas languages like C and Java are 0-based because that's how people index memory.
Nitpicking aside, a nice library for doing “table stuff” without “the whole ass big table framework” would be nice.
It’s not hard to roll this stuff by hand, but again, a nicer way wouldn’t be bad.
What is a paragraph but an array of sentences? What is a sentence but an array of words? What's a word but an array of letters? You can do this all the way down. Eventually you need to assign meaning to things, and when you do, it helps to know what the thing actually is, specifically, because an array of structs can be many things that aren't a table.
Matlab has them, in fact it has multiple competing concepts of it.
The author's priorities are sensible, and indeed with that set of priorities, it makes sense to end up near R. However, they're not universal among data scientists. I've been a data scientist for eight years, and have found that this kind of plotting and dataframe wrangling is only part of the work. I find there is usually also some file juggling, parsing, and what the author calls "logistics". And R is terrible at logistics. It's also bad at writing maintainable software.
If you care more about logistics and maintenance, your conclusion is pushed towards Python - which still does okay in the dataframes department. If you're ALSO frequently concerned about speed, you're pushed towards Julia.
None of these are wrong priorities. I wish Julia was better at being R, but it isn't, and it's very hard to be both R and useful for general programming.
Edit: Oh, and I should mention: I also teach and supervise students, and I KEEP seeing students use pandas to solve non-table problems, like trying to represent a graph as a dataframe. Apparently some people are heavily drawn to use dataframes for everything - if you're one of those people, reevaluate your tools, but also, R is probably for you.
Except it's not. Data science in Python pretty much requires you to use numpy. So his example of mean/variance code is a dumb comparison. Numpy has mean and variance functions built in for arrays.
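For instance (toy values; ddof=1 reproduces the sample SD the hand-rolled version computes):

import numpy as np

values = np.array([3750.0, 3800.0, 3250.0, 3450.0])
mean = values.mean()
std_dev = values.std(ddof=1)  # Bessel-corrected, like the hand-rolled code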
Even when using raw Python in his example, some syntax can be condensed quite a bit:

groups = defaultdict(list)
[groups[(row['species'], row['island'])].append(row['body_mass_g']) for row in filtered]
It takes the same amount of mental effort to learn Python/numpy as it does with R. The difference is, the former allows you to integrate your code into any other application.
Even outside of Numpy, the stdlib has the statistics package, which provides mean, variance, population/sample standard deviation, and other statistics functions for normal iterables. The attempt to make Python out-of-the-box code look bad was either deliberately constructed to exaggerate the problems complained of, or was the product of a very convenient ignorance of the applicable parts of Python and its stdlib.
Of course there's a bunch of loops and things; you're exposing what has to happen in both R and Python under the hood of all those packages.
It's pretty clear the post is focused on the context of work being done in an academic research lab. In that context I think most of the points are pretty valid, but most of the real-world benefit I've experienced from using Python is being able to work more closely with engineering (even on non-Python teams).
I shipped R code to a production environment once over my career and it felt incredibly fragile.
R is great for EDA, but really doesn't work well for iteratively building larger software projects. R has a great package system, but it's not so great when you need abstraction in between.
If you're doing data science all day, you should learn R, even if it's so weird at first (for somebody coming from a C-style language) that it seems way harder; R is made for the way statisticians work and think, not the way computer programmers work and think. If you're doing data science all day, you should start thinking and working like a statistician and working in R, and the fact that it seems to bend your mind is probably at least in part good, because a statistician needs to think differently than a programmer.
I work in python, though, almost all of the time.
BTW AI is not helping and in fact is leading to a generation of scientists who know how to write prompts, but do not understand the code those prompts generate or have the ability to peer review it.
The early tooling was also pretty dependent on Vim or Emacs. Maybe it's all easier now with VSCode or something like that.
The tooling story is also very solid - I use Emacs, but many of my friends and colleagues use IntelliJ, Vim, Sublime and VSCode, and some of them migrated to it from Atom.
Clojure, unlike traditional Lisps with their lists, is based on a composable, unified abstraction for its collections: they are lazy by default and are literal, readable data structures; they are far easier to introspect and not so "opaque" compared to anything - not just CL (even Python); and they are superb for dealing with heterogeneous data. Clojure's cohesive data manipulation story is something Common Lisp's lists-and-symbols just can't match.
While I agree with you in principle, this also leads to what I call the "VB Effect". Back in the day VB was taught at every school as part of the standard curriculum. This made every kid a 'computer whizz'. I have had to fix many a legacy codebase that was started by someone's nephew the whizz kid.
[Data Preparation] --> [Data Analysis] --> [Result Preparation]
Neither Python nor R does a good job at all of these.
The original article seems to focus on challenges in using Python for data preparation/processing, mostly pointing out challenges with Pandas and "raw" Python code for data processing.
This could be solved by switching to something like duckdb and SQL to process data.
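As a sketch of that route (assuming the penguins data from the article sits in a penguins.csv; DuckDB's stddev is the sample estimator):

import duckdb

result = duckdb.sql("""
    SELECT species, island,
           avg(body_mass_g)    AS body_weight_mean,
           stddev(body_mass_g) AS body_weight_sd
    FROM 'penguins.csv'
    WHERE body_mass_g IS NOT NULL
    GROUP BY species, island
""").df()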
As far as data analysis, both Python and R have their own niches, depending on field. Similarly, there are other specialized languages (e.g., SAS, Matlab) that are still used for domain-specific applications.
I personally find result preparation somewhat difficult in both Python and R. Stargazer is ok for exporting regression tables but it's not really that great. Graphing is probably better in R within the ggplot universe (I'm aware of the python port).
> Python is pretty good for deep learning. There’s a reason PyTorch is the industry standard. When I’m talking about data science here, I’m specifically excluding deep learning.
I've written very little deep learning code over my career, but made very frequent use of the GPU and differentiable programming for non-deep-learning tasks. In general, Python makes it much easier to write quantitative programs that exploit the hardware, and you have a lot more options when your problem doesn't fit into RAM.
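As one illustration of that point (a generic sketch, not the commenter's actual workflow), jax will differentiate ordinary numeric code and run it unchanged on CPU or GPU:

import jax
import jax.numpy as jnp

def loss(theta, x, y):
    # Plain least squares - no neural network involved.
    return jnp.mean((x @ theta - y) ** 2)

grad_loss = jax.grad(loss)  # gradient w.r.t. theta, usable in any optimizer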
> I have been running a research lab in computational biology for over two decades.
I've been working nearly exclusively in industry for these two decades, and a major reason I find Python just better is that it's much, much easier to interface with other parts of engineering when you're using a truly general-purpose PL. I've actually never worked for a pure Python shop, but it's generally much easier to get production ML/DS solutions into prod when working with Python.
> Data science as I define it here involves a lot of interactive exploration of data and quick one-off analyses or experiments
This reiterates the previous difference. In my experience I would call this "step one" in all my DS-related work. The first step is to understand the problem and de-risk. But the vast majority of code and work is related to delivering a scalable product.
You can say that's not part of "data science", but if you did you'd have a hard time finding a job on most of the teams I've worked on.
All that said, my R vs Python experience has boiled down to: If your end result is a PDF report, R is superior. If your end result is shipping a product, then Python is superior. And my experience has been that, outside of university labs, there aren't a lot of jobs out there for DS folks who only want to deliver PDFs.
If I want to wrangle, explore, or visualise data I’ll always reach for R.
If I want to build ML/DL models or work with LLM’s I will usually reach for Python.
Often in the same document - nowadays this is very easy with Quarto.
Julia allows embedding both R and Python code, and has some very nice tools for drilling down into datasets:
It is the first language I've seen in decades that reduces entire paradigms into single-character syntax, often outperforming both C and Numpy. =3
Not a "smear", but rather a well known limitation of the language. Perhaps your environment context works differently than mine.
It is bizarre people get emotionally invested in something so trivial and mundane. Julia is at v1.12.2 so YMMV, but Queryverse is a lot of fun =3
The other thing is that a lot of R’s strengths are really the tidyverse’s. Some of that is to R’s credit as an extensible language that enables a skilled API designer to really shine of course, but I think there’s no reason Python the language couldn’t have similar libraries. In fact it has, in plotnine. (I haven’t tried Polars yet but it does at least seem to have a more consistent API.)
If by data science you mean loading data to memory and running canned routines for regression, classification and other problems, then Python is great and mostly calls C/FORTRAN binaries under the hood, so Python itself has relatively little overhead.
As annoying as it is to admit it, python is a great language for data science almost strictly because it has so many people doing data science with it. The popularity is, itself, a benefit.
1. Is easy to read
2. Was easy to extend in languages that people who work with scientific data happen to like.
When I did my masters we hacked around in the numpy source and contributed here and there while doing astrophysics.
Stuff existed in Java and R, but we had learned C in the first semester and python was easier to read, and unlike MATLAB, numpy did not need a license.
When data science came into the picture, the field was full of physicists that had done similar things. They brought their tools as did others.
My take (and my own experience) is that python won because the rest of the team knows it. I prefer R but our web developers don't know it, and it's way better for me to write code that the rest of our team can review, extend, and maintain.
> Either way, I’ll not discuss it further here. I’ll also not consider proprietary languages such as Matlab or Mathematica, or fairly obscure languages lacking a wide ecosystem of useful packages, such as Octave.
I feel, to most programming folks R is in the same category. R is to them what Octave is to the author. R is nice, but do they really want to learn a "niche" language, even if it has some better features than Python? Is holding a whole new paradigm, syntax, and library ecosystem in your head worth it?
That the author avoided saying Python was a bad language outright speaks a great deal of its suitability. Well, that, and the fact that the majority of data science in practice uses it.
R is kind of a super-specialized language. Python is much more general purpose.
R failed to evolve, let's be honest. Python won via jupyter - I see this used ALL the time in universities. R is used too, but mostly for statistics related courses only, give or take.
Perhaps R is better for its niche, but Python has more momentum and thus dominates R. That's simply the reality of the situation. It is like a bulldozer moving forward, at a fast speed.
> I say “This is great, but could you quickly plot the data in this other way?”
Ok so ... he would have to adjust R code too, right? And finding good info on that is simply harder. He says he has experience with universities. Well, I do too, and my experience is that people are WAY better with python than with R. You simply see that more students will drop out from R than from python. That's also simply the reality of the situation.
> They appear to be sufficiently cumbersome or confusing that requests that I think should be trivial frequently are not.
I am sure the reverse also applies. Pick some python library, do something awesome, then tell the R students to do the same. I bet he will have the same problems.
> So many times, I felt that things that would be just a few lines of simple R code turned out to be quite a bit longer and fairly convoluted.
Ok, so here he is trolling. Flat out - I said it.
I wrote a LOT of python and quite a bit of R. There is no way in life that the R code is more succinct than the python code for about 90% of the use cases out there. Sorry, that's simply not the case. R is more verbose.
> Here is the relevant code in R, using the tidyverse approach:
penguins |>
  filter(!is.na(body_mass_g)) |>
  group_by(species, island) |>
  summarize(
    body_weight_mean = mean(body_mass_g),
    body_weight_sd = sd(body_mass_g)
  )
This is like Perl. They also don't adapt. R is going to lose ground.

This professor just hasn't realised that he is slowly becoming a fossil himself, by being unable to see that x is better than y.
Now, is Python a SUCCESSFUL language? Very.
Python, the language itself, might not be a great language for data science. BUT the author can use Pandas or Polars or another data-science-related library/framework in Python to get the job done that they were trying to write in R. I could read both their R and Pandas code snippets and understand them equally.
This article reads just like, "Hey, I'm cooking everything by making all ingredients from scratch and see how difficult it is!".
Recently I am seeing that Python is heavily pushed for all data science related things. Sometimes objectively Python may not be the best option especially for stats. It is hard to change something after it becomes the "norm" regardless of its usability.
A better stdlib-only version would be:

from palmerpenguins import load_penguins
import math
from itertools import groupby
from statistics import fmean, stdev

penguins = load_penguins()

# Convert DataFrame to list of dictionaries
penguins_list = penguins.to_dict('records')

# Create key function for grouping/sorting by species/island
def key_func(x):
    return x['species'], x['island']

# Filter out rows where body_mass_g is missing and sort by species and island
filtered = sorted(
    (row for row in penguins_list if not math.isnan(row['body_mass_g'])),
    key=key_func
)

# Group by species and island
groups = groupby(filtered, key=key_func)

# Calculate mean and standard deviation for each group
results = []
for (species, island), group in groups:
    values = [row['body_mass_g'] for row in group]
    mean_value = fmean(values)
    sd_value = stdev(values, xbar=mean_value)
    results.append({
        'species': species,
        'island': island,
        'body_weight_mean': mean_value,
        'body_weight_sd': sd_value
    })

It also helps that in R any function can completely change how its arguments are evaluated, allowing the tidyverse packages to do things like evaluate arguments in the context of a data frame or add a pipe operator as a new language feature. This is a very dangerous feature to put in the hands of statisticians, but it allows more syntactic innovation than is possible in Python.
Julia and Nim [1] are dynamic and static approaches (respectively) to 1 language systems. They both have both user-defined operators and macros. Personally, I find the surface syntax of Julia rather distasteful and I also don't live in PLang REPLs / emacs all day long. Of course, neither Julia nor Nim are impractical enough to make calling C/Fortran all that hard, but the communities do tend to implement in the new language without much prompting.
Best part is, write a --help, and you can load them into LLMs as tools to help the LLMs figure it out for you.
Fight me.
I use mlr, sqlite, rye, souffle, and goawk in the shell scripts, and visidata to interactively review the intermediate files.
Personally I've found polars has solved most of the "ugly" problems that I had with pandas. It's way faster, has an ergonomic API, seamless pandas interop and amazing support for custom extensions. We have to keep in mind Pandas is almost 20 years old now.
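A sketch of what that API looks like on the thread's running example (assuming a penguins.csv; polars' std defaults to the sample estimator):

import polars as pl

result = (
    pl.read_csv('penguins.csv')
    .drop_nulls('body_mass_g')      # like filter(!is.na(...))
    .group_by('species', 'island')  # polars >= 0.19 naming
    .agg(
        pl.col('body_mass_g').mean().alias('body_weight_mean'),
        pl.col('body_mass_g').std().alias('body_weight_sd'),
    )
)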
I will agree that Shiny is an amazing package, but I would argue it's less important now that LLMs will write most of your code.
It was easy to think about the structures (iterators), it was easy to extend, and it had a good community.
And for that, people start extending it via libraries.
There are plenty more alternatives now.
I can't help to conclude that Python is as good as R because I still have the choice of using Pandas when I need it. What did I get wrong?
also, we didn't define "good".
- A General programming language like Python is good enough for data science but isn't specifically designed for it.
- A language that is specifically designed for Data Science like R is better at Data Science.
Who would have thought?
Python doesn't need to be the best at any one thing; it just has to be serviceable for a lot of things. You can take someone who has expertise in a completely different domain in software (web dev, devops, sysadmin, etc.) and introduce them to the data science domain without making them learn an entirely new language and toolchain.
It's used in data science because it's used in data science.
You need to get the data from somewhere. Do you need to scrape that because Python is okay at scraping? Oh, after its scraped, we looked at it and it's in ObtuseBinaryFormat0.0.LOL.Beta and, what do you know, somebody wrote a converter for that for Python. And we need to clean all the broken entries out of that and Python is decent at that. etc.
The trick is that while Python may or may not be anybody's first choice for a particular task, Python is an okay second or third choice for most tasks.
So, you can learn Python. Or you learn <best language> and <something else>. And if <something else> is Python, was <best language> sufficiently better than Python to be worth spending the time learning?
And it got this unprecedented level of support because right from the start it made its focus clear syntax and (perceived) simplicity.
There is also a sort of cumulative effect from being nice for algorithmic work.
Guido's long-term strategy won over numerous other strong candidates for this role.
1. data scientists aren't programmers, so why do they need a programming language? the tools they should be using don't exist. they'd need programmers to make them, and all we have to offer is... more programming languages.
2. the giant problem at the heart of modern software: the most important feature of a modern programming language is being easy to read and write. this feature is conspicuously absent from most important languages.
they're trapped. they can't do what they need without a programming language but there are only a handful they can possibly use. the real reason python ended up with such good library support is they never really had a choice.
Use whatever you want on your one-off personal projects, but use something friendlier to non-data-science teams if you ever want your model to run directly in a production workflow.
Productionizing R models is quite painful. The normal way is to just rewrite it not in R.
If you write it in R and then rewrite it in C (better: rewrite it in English with the R as helpful annotations, then have someone else rewrite it in C), at least there is some chance you've thought about the abstractions and operations that are actually necessary for your problem.
Worked quite well, but the TS/JS langgraph version is way behind. ReAct agents are just a few lines of code, compared to 50-odd lines for the same thing in JS/TS.
Better to use a different language, even one i'm not familiar with, to be able to maintain a few lines of code vs 50 lines.
I agree that Python is not great at anything specifically, but it is good at almost everything, and that's what makes it great.
It makes it look like Perl on a bad day, or, worse, autogenerated JavaScript.
Why on earth is it so many levels deep in objects?
Python is not a great language.

First, the whitespace requirements are a bad flashback to 1970s Fortran.

Second, it is the language that is least compatible with itself.
The focus of SAS and R was primarily limited to data science-related fields; however, Python is a far more generic programming language, so the number of folks exposed to it is far larger, and thus the hiring pool of those who come in exposed to Python is FAR LARGER than SAS/R's ever was, even when SAS was actively taught/utilized in undergraduate/graduate programs.
As a hiring leader in the Data Science and Engineering space, I have extensive experience with all of these + SQL, among others. Hiring has become much easier to go cross-field/post-secondary experience and find capable folks who can hit the ground running.
Often they'd be doing very simplistic querying and then manipulating via DATA step prior to running whatever modeling and/or reporting PROCs later, rather than pushing it upstream into a far faster native database SQL pull via pass-through.
Back in 2008/2009, I saved 30h+ runtime on a regular report by refactoring everything in SQL via pass-through, as opposed to the data scientists' original code that simply pulled the data down from the external source and manipulated it in DATA step. Moving from 30h to 3m (Oracle backend) freed up an entire FTE to do more than babysit a long-running job 3x a week to multiple times per day.
In the first place, data science is more a label someone put on a bag full of cats, rather than a vast field covered by similarly sized boxes.
At the same time it is an absolute necessity to know if you are doing numerics. What this shows, at least to me, is that it is "good enough" and that the million integrations, examples and pieces of documentation matter more than whether the peculiarities of the language work in favor of its given use case, as long as the shortcomings can be mostly addressed.