Ah Pandas, cute fuzzy animals we all love…but also a Python library that has been the go-to for dealing with tabular data, until now. The problem with Pandas is well-known: toss it a normal(ish)-sized spreadsheet, and it handles it like a champ, but pass it a massive amount of tabular data, and it’s slow as molasses if it even works.
Pandas popularized the DataFrame in Python, and it’s essentially the standard for working with tabular data in the language. I love DataFrames, you love DataFrames, but in a world of LLMs and massive datasets, Pandas looks like it’s going the way of the Dodo, and that’s probably more than okay because the alternatives are pretty darn awesome.
Devs bidding farewell to Pandas isn’t an entirely new phenomenon. For years, people have been complaining that Pandas is too slow to deal with large datasets. Hop onto Reddit, and you’ll find a zillion posts like this one ⬇️
As you can see, this was written two years ago, and over the last couple of years…well, things have changed. LLMs are all the rage, and they need data, lots of it. Try passing a 10GB file to Pandas on a typical laptop: by default it loads the whole thing into memory, so it’s probably not going to work.
Now there are more Pandas alternatives than ever. They’re faster, much faster, and some have even gone so far as to match Pandas method names so you can import the new library as pd and jam away. While I could share 5–10 libraries that you could switch to, I’m all for short and sweet, so I’m going to give you two alternatives, starting with my personal fav because it’s just so crazy fast.
Billing itself as “DataFrames for the new era,” Polars isn’t just fast, it’s rocketship-level, melt-your-face-off fast. As for a quick TL;DR on it, here are a couple of sentences from the crew over there:
Polars is an open-source library for data manipulation, known for being one of the fastest data processing solutions on a single machine. It features a well-structured, typed API that is both expressive and easy to use.
As for exactly how much faster Polars is than anything else, I think this image just about sums it up — a task that takes Pandas 207 seconds, Polars can do in 8 seconds 🫠
Okay, so FireDucks didn’t have an image quite as polished as Polars so I created one for them. How’d I do?
What makes FireDucks pretty special, IMHO, is that it’s a no-frills library built to swap in for Pandas pretty much instantly. What I like about FireDucks, as you can see from the image below, is that you can simply change your import statement from import pandas as pd to import fireducks.pandas as pd and poof, done, instant performance improvement.
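In code, the swap really is just the import line. If you also want the same script to keep running on machines where FireDucks isn’t installed, one option is a small fallback helper. This is a minimal sketch, not something from the FireDucks docs: the helper name load_pandas_api is made up for illustration, and the only assumption about FireDucks itself is the fireducks.pandas module path mentioned above.

```python
import importlib

# The one-line swap described in the article:
#   before: import pandas as pd
#   after:  import fireducks.pandas as pd


def load_pandas_api(candidates=("fireducks.pandas", "pandas")):
    """Return (module, name) for the first importable candidate.

    Hypothetical helper: tries FireDucks first, then falls back to
    plain pandas, so the rest of the script can keep using `pd`.
    """
    for name in candidates:
        try:
            return importlib.import_module(name), name
        except ImportError:
            # Not installed here; try the next candidate.
            continue
    raise ImportError(f"none of these modules could be imported: {candidates}")


# Usage (left commented out so the sketch runs without either library):
# pd, backend = load_pandas_api()
# print(f"using {backend}")
# df = pd.DataFrame({"city": ["Oslo", "Lima"], "sales": [15, 7]})
```

Everything after the import keeps calling pd as before, which is the whole appeal of a drop-in replacement.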
Polars does not integrate as seamlessly, so you will need to go file by file, making code changes to get Polars to play nicely with your codebase. Oh, and FireDucks is completely free, so that’s a nice bonus too.
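To give a feel for the kind of change a Polars migration involves, here’s the same small task written both ways: drop non-positive sales, then total them per city. Treat it as a sketch under assumptions: the function names and sample data are invented for illustration, and the Polars half assumes a recent release where the grouping method is spelled group_by.

```python
def sales_by_city_pandas(data):
    """Pandas version: boolean-mask filter, then groupby/sum."""
    import pandas as pd

    df = pd.DataFrame(data)
    totals = df[df["sales"] > 0].groupby("city")["sales"].sum()
    return totals.to_dict()


def sales_by_city_polars(data):
    """Polars version: expression-based filter, group_by, and agg."""
    import polars as pl

    df = pl.DataFrame(data)
    totals = (
        df.filter(pl.col("sales") > 0)
        .group_by("city")
        .agg(pl.col("sales").sum())
    )
    # Polars has no index, so build the mapping from the two columns.
    return dict(zip(totals["city"].to_list(), totals["sales"].to_list()))


# Hypothetical sample data, just to show the call shape:
# data = {"city": ["Oslo", "Oslo", "Lima", "Lima"], "sales": [10, 5, 7, -2]}
# sales_by_city_pandas(data)
# sales_by_city_polars(data)
```

The logic is identical, but the filtering and aggregation syntax differs enough that a find-and-replace on the import line won’t cut it.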
Regardless of which library you choose, I’m pretty convinced that Pandas will slowly go the way of the Dodo as devs around the world opt for faster, more modern libraries.
If you’ve been on the fence about making the move, I’d say start with something like FireDucks, where you can make one simple code change and immediately see how a new library performs. Then, if you really want to juice up performance, I haven’t come across anything quite as fast as Polars.
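If you want numbers from your own data rather than someone else’s chart, a tiny timing helper is enough to compare a before and after run. This is a rough sketch using only the standard library; the name best_time and the sales.csv file in the usage comment are placeholders, not anything from either project.

```python
import time


def best_time(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Usage (commented out; needs a DataFrame library and a real file):
# import pandas as pd              # first run: baseline
# import fireducks.pandas as pd    # second run: swap the import
#
# def workload():
#     totals = pd.read_csv("sales.csv").groupby("city")["sales"].sum()
#     print(totals)  # consume the result so the work actually happens
#
# print(f"best of 3: {best_time(workload):.2f}s")
```

Some libraries defer work until a result is actually used, so make sure the timed function consumes its output, otherwise the comparison can look better than it really is.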
This is my first article on HackerNoon and I have no idea how to promote this or get more reads, but if you like it, and you know how to spread the word, I’d be incredibly grateful for your support!