2026-01-16
The use of big data in all disciplines is growing:
Usually, we load datasets directly in R (or Stata, or Python, or something else).
This means that we are limited by the amount of RAM on our computer.
We might think that we need a bigger computer to handle more data.
But we can’t necessarily have a bigger computer.
And we don’t necessarily need a bigger computer in the first place.
Multiple tools exist for user-friendly large data handling:
polars is a recent DataFrame library that is available in several languages:
Very consistent and readable syntax.
Built to be very fast and memory-efficient thanks to several mechanisms.
The main mechanism is called lazy evaluation.
The next section is language agnostic: it is based on concepts, not language features.
Eager evaluation: all operations are run line by line, in order, and directly applied to the data. This is the way we’re used to.
Lazy evaluation: operations are only run when we call a specific function at the end of the chain, usually called collect().
### Lazy
```python
# Get the data...
my_data = (
    lazy_data
    # ... and then sort by iso...
    .sort(pl.col("iso"))
    # ... and then keep only Japan...
    .filter(pl.col("country") == "Japan")
    # ... and then compute GDP per cap.
    .with_columns(gdp_per_cap=pl.col("gdp") / pl.col("pop"))
)
# => you don't get results yet!

my_data.collect()  # this is how to get results
```

Polars has both eager and lazy evaluation.
When dealing with large data, it is better to use lazy evaluation:
The code below takes some data, sorts it by a variable, and then filters it based on a condition:
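A minimal sketch of what such code could look like (the dataset, column names, and threshold here are made up):

```python
import polars as pl

# Hypothetical data: one row per country-year
df = pl.DataFrame({
    "country": ["Japan", "France", "Japan", "France"],
    "year": [2019, 2019, 2020, 2020],
    "gdp": [5.1, 2.7, 5.0, 2.6],
})

# Sort first, then filter
result = (
    df
    .sort("gdp")
    .filter(pl.col("year") >= 2020)
)
```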
Do you see what could be improved here?
The problem is the order of operations: sorting data is much slower than filtering it.
Let’s test with a dataset of 50M observations and 10 columns:
There is probably a lot of suboptimal code in our scripts.
Most of the time, it’s fine.
But when it starts to matter, we don’t want to spend time thinking about the best order of operations.
Let polars do this automatically by using lazy evaluation.
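With lazy evaluation, we can even inspect the plan that Polars intends to run. A small sketch (the LazyFrame is made up) using .explain(), which prints the optimized query plan:

```python
import polars as pl

lazy_data = pl.LazyFrame({
    "iso": ["JPN", "FRA"],
    "country": ["Japan", "France"],
    "gdp": [5.0, 2.6],
})

query = (
    lazy_data
    .sort("gdp")
    .filter(pl.col("country") == "Japan")
)

# Print the optimized query plan. Typically the filter is applied
# before the sort, even though we wrote it the other way around.
print(query.explain())
```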
collect() -> check the code -> optimize the code -> execute the code
Examples of optimizations (see the entire list):
Calling collect() doesn’t start computations right away.
First, polars checks the code to ensure there are no schema errors, i.e. it checks that we don't do "forbidden" operations.
For example, pl.col("gdp") > "France" would return the following error before executing the code:
polars.exceptions.ComputeError: cannot compare string with numeric data
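As an illustration (the tiny LazyFrame is made up; the error is the one quoted above):

```python
import polars as pl

lazy_data = pl.LazyFrame({"gdp": [5.0, 2.6], "country": ["Japan", "France"]})

try:
    # Comparing a numeric column with a string is a schema error:
    # it is caught when collect() checks the plan, before doing any real work.
    lazy_data.filter(pl.col("gdp") > "France").collect()
except pl.exceptions.ComputeError as err:
    print(err)  # cannot compare string with numeric data
```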
We’re used to a few file formats: CSV, Excel, .dta. Polars can read most of them (.dta needs an extension).
When possible, use the Parquet format (.parquet).
Pros:
Cons:
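For example (the file names are hypothetical), a one-off conversion from CSV to Parquet, then scanning the Parquet file lazily:

```python
import polars as pl

# One-off conversion: read the CSV and write it back as Parquet
pl.read_csv("data.csv").write_parquet("data.parquet")

# From then on, scan the Parquet file lazily:
# at this point only the schema is read, not the data itself
lazy_data = pl.scan_parquet("data.parquet")
```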
Polars can be used in Python and in R:
- Python: the polars library. This is developed directly by the creators of Polars.
- R: the packages polars and tidypolars.
  - polars is an almost 1-to-1 copy of the Python polars' syntax.
  - tidypolars is a package that implements the tidyverse syntax but uses polars in the background.

Workflow:

- Scan the data lazily (e.g. with scan_parquet()). This only returns the schema of the data: the column names and their types (character, integers, …).
- Chain the operations you need.
- Call collect() at the end of the code to execute it.

Problem: you don't know what the data looks like.
Solution:

- Use eager evaluation to explore a sample, build the data processing scripts, etc.
- Use lazy evaluation to perform the data processing on the entire data (see the sketch below).
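A minimal sketch of this two-step workflow (file name, columns, and operations are made up):

```python
import polars as pl

lazy_data = pl.scan_parquet("data.parquet")

# 1) Eager exploration on a small sample
sample = lazy_data.head(1_000).collect()
print(sample.describe())

# 2) Lazy processing of the entire data
result = (
    lazy_data
    .filter(pl.col("country") == "Japan")
    .with_columns(gdp_per_cap=pl.col("gdp") / pl.col("pop"))
    .collect()
)
```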
To modify variables, we chain Polars expressions:
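For instance (column names made up), several expressions can be chained inside with_columns():

```python
import polars as pl

df = pl.DataFrame({
    "country": [" japan", "france "],
    "gdp": [5.0, 2.6],
    "pop": [125.0, 68.0],
})

df = df.with_columns(
    # chain several string methods on the same expression
    country_clean=pl.col("country").str.strip_chars().str.to_uppercase(),
    # arithmetic between columns is also an expression
    gdp_per_cap=pl.col("gdp") / pl.col("pop"),
)
```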
Maybe you need to use a function that doesn’t exist as-is in Polars or in a plugin.
You have mainly three options:

- Rewrite the operation using Polars expressions.
- Convert the data to another format (e.g. to pandas in Python or to a data.frame in R), and use the function you need.
- Use a map_ function (e.g. .map_elements() or .map_batches()), but this is slower (available only in Python).

Example: let's write a custom function for standardizing numeric variables.
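A minimal sketch of such a function written with Polars expressions (the function and column names are made up):

```python
import polars as pl

def standardize(col: str) -> pl.Expr:
    """Return a Polars expression computing (x - mean(x)) / std(x)."""
    return (pl.col(col) - pl.col(col).mean()) / pl.col(col).std()

df = pl.DataFrame({"gdp": [5.0, 2.6, 3.9], "pop": [125.0, 68.0, 84.0]})
df = df.with_columns(gdp_std=standardize("gdp"))
```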
Streaming is a way to run the code on the data in batches, to avoid using all the memory at the same time.
This is both more memory-efficient and faster.
Using this is extremely simple: instead of calling collect(), we call collect(engine = "streaming").
Tip
The streaming engine will become the default in a few months so you will be able to simply call collect() without the engine argument.
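For instance (file and column names hypothetical):

```python
import polars as pl

result = (
    pl.scan_parquet("data.parquet")
    .group_by("country")
    .agg(pl.col("gdp").mean())
    .collect(engine="streaming")  # run in batches instead of all at once
)
```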
Sometimes, data is just too big for our computer, even after all optimizations.
In this case, collecting the data will crash the Python or R session. Polars is not a panacea.
What are possible strategies for this?
Maybe you can use aggregate values instead (e.g. district-level statistics instead of individual statistics).
Maybe you just want to export the data to another file: use sink_*() (e.g. sink_parquet()) instead of collect() + write_*().
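A sketch of the second strategy (file names and filter hypothetical):

```python
import polars as pl

(
    pl.scan_parquet("data.parquet")
    .filter(pl.col("year") >= 2020)
    # write the result straight to disk instead of collecting it in memory
    .sink_parquet("data_2020s.parquet")
)
```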
Python only (for now)
Polars accepts extensions taking the form of new expressions namespaces.
Just like we have .str.split() for instance, we could have .dist.jarowinkler().
List of Polars plugins (not made by the Polars developers): https://github.com/ddotta/awesome-polars#polars-plugins
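To illustrate the mechanism only (this toy namespace is made up and is not one of the published plugins), Polars lets you register a custom expression namespace:

```python
import polars as pl

@pl.api.register_expr_namespace("toy")
class ToyNamespace:
    def __init__(self, expr: pl.Expr) -> None:
        self._expr = expr

    def double(self) -> pl.Expr:
        # usable as pl.col("x").toy.double(), like .str.split() or .dist.jarowinkler()
        return self._expr * 2

df = pl.DataFrame({"x": [1, 2, 3]})
df = df.with_columns(x2=pl.col("x").toy.double())
```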
While there is demand for a geopolars that would enable GIS operations in Polars, this doesn’t exist for now.
For some time, this was blocked by Polars’ capabilities, but this is now technically possible.
You might expect some movement in geopolars in 2026.
To get more details:
There isn’t one tool to rule them all.
Polars is great in many cases but your experience might vary depending on your type of data or operations you perform.
There are other tools to process large data in R and Python.
- duckplyr: similar to tidypolars but uses DuckDB in the background. duckplyr is supported by Posit (the company behind RStudio and many packages).
- arrow: also has lazy evaluation but fewer optimizations than Polars and DuckDB.
- Spark: never used it myself, but it is available in R and Python.
So far, the focus has been on data processing, but there are also tools to apply econometric methods to large data:
Polars allows us to handle large data even on a laptop.
Use eager evaluation to explore a sample, build the data processing scripts, etc.
Use lazy evaluation to perform the data processing on the entire data.
Which language to use (for Polars)?
In R, tidypolars is available. There exist other tools using the same mechanisms; they are worth exploring!
In R, several packages provide the tidyverse syntax while running those more powerful tools under the hood: tidypolars, duckplyr, arrow, etc.
About 19M observations, 96 variables
5GB CSV, 240MB Parquet (~ 21x smaller)
See the SwissTransfer link I sent you
Keeping a clean Python setup is notoriously hard:
Python relies a lot on virtual environments: each project should have its own environment that contains the Python version and the libraries used in the project.
This is to avoid conflicts between different requirements:
- one project might require polars <= 1.20.0
- another project might require polars >= 1.22.0

If we had only a system-wide installation of polars, we would constantly have to reinstall one or the other.
Having one virtual environment per project allows better project reproducibility.
For the entire setup, I recommend using uv:
Basically 4 commands to remember (once uv is installed):
- uv init my-project (or uv init) creates the basic files required
- uv add to add a library to the environment
- uv sync to update the venv (e.g. after adding a dependency or to restore a project)
- uv run file.py to run a script in the project's virtual environment

The files uv.lock and .python-version are the only things needed to restore a project.
You do not need to share the .venv folder with colleagues: they can just call uv sync and the .venv with the exact same packages will be created.
For more exploratory analysis (i.e. before writing scripts meant to be run on the whole data), we can use Jupyter notebooks.
- Create a demo.ipynb file in the project folder
- Run uv add ipykernel in the terminal

Automatically formatting the code when we save the file is a nice feature (not essential for today, however).
If you want this:
and you should be good to go!
Constraints:
you need R >= 4.3
polars and tidypolars are not on CRAN, so install.packages("polars") is not enough.
renv allows you to create a virtual environment in R.