
Data Manipulation in Clojure Compared to R and Python

Recorded: March 25, 2026, 3 a.m.

Code with Kira

Published 2024-07-18
I spend a lot of time developing and teaching people about Clojure's open source tools for working with data. Almost everybody who wants to use Clojure for this kind of work is coming from another language ecosystem, usually R or Python. Together with Daniel Slutsky, I'm working on formalizing some of the common teachings into a course. Part of that is providing context for people coming from other ecosystems, including "translations" of how to accomplish data science tasks in Clojure.

As part of this development, I wanted to share an early preview in this blog post. The format is inspired by this great blog post I read a while ago comparing R and Polars side by side (where "R" here refers to the tidyverse, an opinionated collection of R libraries for data science, and realistically mostly dplyr specifically). I'm adding Pandas because it's among the most popular dataset manipulation libraries, and of course Clojure, specifically tablecloth, the primary data manipulation library in our ecosystem.

I'll use the same dataset as the original blog post, the Palmer Penguin dataset. For the sake of simplicity, I saved a copy of the dataset as a CSV file and made it available on this website. I will also refer to the data as a "dataset" throughout this post because that's what Clojure people call a tabular, column-major data structure, but it's the same thing that is variously referred to as a dataframe, data table, or just "data" in other languages. I'm also assuming you know how to install the packages required in the given ecosystems, but any necessary imports or requirements are included in the code snippets the first time they appear. Versions of all languages and libraries used in this post are listed at the end. Here we go!

Reading data

Reading data is straightforward in every language, but as a bonus we want to be able to indicate on the fly which values should be interpreted as "missing", whatever that means in the given libraries. In this dataset, the string "NA" means "missing", so we want to tell the dataset constructor this as soon as possible. Here's the comparison of how to accomplish that in various languages:

Tablecloth

(require '[tablecloth.api :as tc])

(def ds
  (tc/dataset "https://codewithkira.com/assets/penguins.csv"))
Note that tablecloth interprets the string "NA" as missing (nil, in Clojure) by default.

R

In reality, in R you would get the dataset from the R package that contains the dataset. This is a fairly common practice in R. In order to compare apples to apples, though, here I'll show how to initialize the dataset from a remote CSV file, using the readr package's read_csv, which is part of the tidyverse:

library(tidyverse)

ds <- read_csv("https://codewithkira.com/assets/penguins.csv",
               na = "NA")

Pandas

import pandas as pd

ds = pd.read_csv("https://codewithkira.com/assets/penguins.csv")
Note that pandas has a fairly long list of values it considers NaN already, so we don't need to specify what missing values look like in our case, since "NA" is already in that list.

Polars

import polars as pl

ds = pl.read_csv("https://codewithkira.com/assets/penguins.csv",
                 null_values="NA")
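To make "interpret this string as missing" concrete, here is a minimal sketch in plain Python with no libraries. The inline CSV text and the `read_rows` helper are hypothetical stand-ins for illustration, not the real penguins file or any library's API. All it does is map a marker string to None while reading; unlike the real libraries, it does no type inference, so every other value stays a string.

```python
import csv
import io

# Hypothetical three-row excerpt in the same shape as the penguins CSV.
SAMPLE_CSV = """species,body_mass_g,sex
Adelie,3750,male
Adelie,NA,NA
Gentoo,5000,female
"""

def read_rows(text, na_values=("NA",)):
    """Parse CSV text into a list of dicts, mapping missing markers to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (None if value in na_values else value) for name, value in row.items()}
        for row in reader
    ]

rows = read_rows(SAMPLE_CSV)
```

Passing a different `na_values` tuple plays the same role as `na = "NA"` in readr or `null_values="NA"` in Polars.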
Basic commands to explore the dataset

The first thing people usually want to do with their dataset is see it and poke around a bit. Below is a comparison of how to accomplish basic data exploration tasks using each library.

| Operation | tablecloth | dplyr |
| --- | --- | --- |
| see first 10 rows | `(tc/head ds 10)` | `head(ds, 10)` |
| see all column names | `(tc/column-names ds)` | `colnames(ds)` |
| select column | `(tc/select-columns ds "year")` | `select(ds, year)` |
| select multiple columns | `(tc/select-columns ds ["year" "sex"])` | `select(ds, year, sex)` |
| select rows | `(tc/select-rows ds #(> (% "year") 2008))` | `filter(ds, year > 2008)` |
| sort column | `(tc/order-by ds "year")` | `arrange(ds, year)` |

| Operation | pandas | polars |
| --- | --- | --- |
| see first 10 rows | `ds.head(10)` | `ds.head(10)` |
| see all column names | `ds.columns` | `ds.columns` |
| select column | `ds[["year"]]` | `ds.select(pl.col("year"))` |
| select multiple columns | `ds[["year", "sex"]]` | `ds.select(pl.col("year", "sex"))` |
| select rows | `ds[ds["year"] > 2008]` | `ds.filter(pl.col("year") > 2008)` |
| sort column | `ds.sort_values("year")` | `ds.sort("year")` |

Note there are some differences in how different libraries sort missing values. For example, in tablecloth and polars they are placed at the beginning (so they're at the top when a column is sorted in ascending order and last when descending), but dplyr and pandas place them last (regardless of whether ascending or descending order is specified).

As you can see, these commands are all pretty similar, with the exception of selecting rows in tablecloth. The #(...) form is a short-hand syntax for writing an anonymous function in Clojure, which is how rows are selected. Clojure is a functional language, so its functions are "first-class", which basically just means they are passed around as arguments willy-nilly, all over the place, all the time. In this case, the second argument to tablecloth's select-rows function is a predicate (a function that returns a boolean) that takes as its argument a dataset row as a map of column names to values. Don't worry, though, tablecloth doesn't process your entire dataset row-wise. Under the hood datasets are highly optimized to perform column-wise operations as fast as possible.

Here's an example of what it looks like to string a couple of these basic dataset exploration operations together, for example in this case to get the bill_length_mm of all penguins with body_mass_g below 3800:

Tablecloth

(-> ds
    ;; keep rows where body_mass_g is present and below 3800
    (tc/select-rows #(and (% "body_mass_g")
                          (< (% "body_mass_g") 3800)))
    (tc/select-columns "bill_length_mm"))
Note that in tablecloth we have to explicitly omit rows where the value we're filtering by is missing, unlike in other libraries. This is because tablecloth actually uses nil (as opposed to a library-specific construct) to indicate a missing value, and in Clojure nil is not treated as comparable to numbers. If we were to try to compare nil to a number, we would get an exception telling us that we're trying to compare incomparable types. Clojure is fundamentally dynamically typed in that it only does type checking at runtime and bindings can refer to values of any type, but it is also strongly typed, as we see here, in the sense that it explicitly avoids implicit type coercion. For example, deciding whether 0 is greater or less than nil requires some assumptions, and these are intentionally not baked into the core of Clojure or into tablecloth as a library, as is the case in some other languages and libraries.

This example also introduces Clojure's "thread-first" macro. The -> arrow is like R's |> operator or the unix pipe, effectively passing the output of each function in the chain as input to the next. It comes in very handy for data processing code like this.

Here is the equivalent operation in the other libraries:

dplyr

ds |>
  filter(body_mass_g < 3800) |>
  select(bill_length_mm)

Pandas

ds[ds["body_mass_g"] < 3800]["bill_length_mm"]

Polars

ds.filter(pl.col("body_mass_g") < 3800).select(pl.col("bill_length_mm"))
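If the idea of a row predicate over a map of column names to values is new, here is a rough sketch of it in plain Python, with None standing in for nil. The three rows and the `light` function are made up for illustration. Python 3 happens to behave much like the Clojure case described above: ordering None against a number raises a TypeError rather than guessing, so the predicate needs the same explicit guard.

```python
# Hypothetical rows: each row maps column names to values,
# with None standing in for a missing value (like nil in Clojure).
rows = [
    {"bill_length_mm": 39.1, "body_mass_g": 3750},
    {"bill_length_mm": None, "body_mass_g": None},
    {"bill_length_mm": 46.1, "body_mass_g": 4500},
]

def light(row):
    """Row predicate: True when body mass is present and below 3800."""
    mass = row["body_mass_g"]
    return mass is not None and mass < 3800

# Filter rows with the predicate, then keep a single column.
bill_lengths = [row["bill_length_mm"] for row in rows if light(row)]

# Without the guard, comparing a missing value to a number fails loudly.
try:
    None < 3800
    comparison_failed = False
except TypeError:
    comparison_failed = True
```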
More advanced filtering and selecting

Here is what some more complicated data wrangling looks like across the libraries.

Select all columns except for one

| Library | Code |
| --- | --- |
| tablecloth | `(tc/select-columns ds (complement #{"year"}))` |
| dplyr | `select(ds, -year)` |
| pandas | `ds.drop(columns=["year"])` |
| polars | `ds.select(pl.exclude("year"))` |

Another property of functional languages in general, and especially Clojure, is that they really take advantage of the fact that a lot of things are functions that you might not be used to treating like functions. They also leverage function composition to simply combine multiple functions into a single operation.

For example, a set (indicated with the #{} syntax in Clojure) is a special function that returns a boolean indicating whether the given argument is a member of the set or not. And complement is a function in clojure.core that effectively inverts the function given to it, so combined (complement #{"year"}) means "every value that is not in the set #{"year"}", which we can then use as our predicate column selector function to filter out certain columns.

Select all columns that start with a given string

| Library | Code |
| --- | --- |
| tablecloth | `(require '[clojure.string :as str])` `(tc/select-columns ds #(str/starts-with? % "bill"))` |
| dplyr | `select(ds, starts_with("bill"))` |
| pandas | `ds.filter(regex="^bill")` |
| polars | `import polars.selectors as cs; ds.select(cs.starts_with("bill"))` |

Select only numeric columns

| Library | Code |
| --- | --- |
| tablecloth | `(tc/select-columns ds :type/numerical)` |
| dplyr | `select(ds, where(is.numeric))` |
| pandas | `ds.select_dtypes(include='number')` |
| polars | `ds.select(cs.numeric())` |

Here :type/numerical is a magic keyword that tablecloth knows about and can accept as a column selector. This list of magic keywords that tablecloth knows about is not (yet) documented anywhere, but it is available in the source code.

Filter rows for range of values

| Library | Code |
| --- | --- |
| tablecloth | `(tc/select-rows ds #(< 3500 (% "body_mass_g" 0) 4000))` |
| dplyr | `filter(ds, between(body_mass_g, 3500, 4000))` |
| pandas | `ds[ds["body_mass_g"].between(3500, 4000)]` |
| polars | `ds.filter(pl.col("body_mass_g").is_between(3500, 4000))` |

Note here we handle the missing values in the body_mass_g column differently than above, by specifying a default value for the map lookup. We're explicitly telling tablecloth to treat missing values as 0 in this case, which can then be compared to other numbers. This is probably the better way to handle this case, but the method above works, too, plus it gave me the opportunity to soapbox about Clojure types for a moment.

Reshaping the dataset

Tablecloth

(tc/pivot->longer ds
                  ["bill_length_mm" "bill_depth_mm"
                   "flipper_length_mm" "body_mass_g"]
                  {:target-columns "measurement" :value-column-name "value"})

dplyr

ds |>
  pivot_longer(cols = c(bill_length_mm, bill_depth_mm,
                        flipper_length_mm, body_mass_g),
               names_to = "measurement",
               values_to = "value")

Pandas

pd.melt(
    ds,
    id_vars=ds.columns.drop(["bill_length_mm", "bill_depth_mm",
                             "flipper_length_mm", "body_mass_g"]),
    var_name="measurement",
    value_name="value"
)

Polars

ds.unpivot(
    index=set(ds.columns) - set(["bill_length_mm",
                                 "bill_depth_mm",
                                 "flipper_length_mm",
                                 "body_mass_g"]),
    variable_name="measurement",
    value_name="value")
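All four snippets above do the same wide-to-long reshaping: each measurement column becomes its own row, labelled by the name of the column it came from. As a rough illustration of that idea only, here is a plain-Python sketch; the `pivot_longer` helper and the two sample rows are hypothetical and are not how any of these libraries implement it.

```python
def pivot_longer(rows, value_columns, names_to="measurement", values_to="value"):
    """Turn each wide row into one long row per column in value_columns."""
    long_rows = []
    for row in rows:
        # Columns that are not being pivoted are repeated on every output row.
        ids = {k: v for k, v in row.items() if k not in value_columns}
        for column in value_columns:
            long_rows.append({**ids, names_to: column, values_to: row[column]})
    return long_rows

# Hypothetical two-row dataset with two measurement columns.
wide = [
    {"species": "Adelie", "bill_length_mm": 39.1, "body_mass_g": 3750},
    {"species": "Gentoo", "bill_length_mm": 46.1, "body_mass_g": 4500},
]
long = pivot_longer(wide, ["bill_length_mm", "body_mass_g"])
```

Two wide rows with two measurement columns become four long rows, each carrying the untouched species column along.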
Creating and renaming columns

Adding columns based on some other existing columns

There are many reasons you might want to add columns, and often new columns are combinations of other ones. Here's how you'd generate a new column based on the values in some other columns in each library:

| Library | Code |
| --- | --- |
| tablecloth | `(require '[tablecloth.column.api :as tcc])` `(tc/add-columns ds {"ratio" (tcc// (ds "bill_length_mm") (ds "flipper_length_mm"))})` |
| dplyr | `mutate(ds, ratio = bill_length_mm / flipper_length_mm)` |
| pandas | `ds["ratio"] = ds["bill_length_mm"] / ds["flipper_length_mm"]` |
| polars | `ds.with_columns((pl.col("bill_length_mm") / pl.col("flipper_length_mm")).alias("ratio"))` |

Note that this is where the wheels start to come off if you're not working in a functional way with immutable data structures. Clojure data structures (including tablecloth datasets) are immutable, which is not the case in Pandas. The Pandas code above mutates the dataset in place, so as soon as you do any mutating operations like these, you now have to keep mental track of the state of your dataset, which can quickly lead to high cognitive overhead and lots of incidental complexity.

Renaming columns

| Library | Code |
| --- | --- |
| tablecloth | `(tc/rename-columns ds {"bill_length_mm" "bill_length"})` |
| dplyr | `rename(ds, bill_length = bill_length_mm)` |
| pandas | `ds.rename(columns={"bill_length_mm": "bill_length"})` |
| polars | `ds.rename({"bill_length_mm": "bill_length"})` |

Unlike the column assignment above, the Pandas rename shown here returns a new DataFrame and leaves ds untouched unless you pass inplace=True, so be careful to keep track of which Pandas operations mutate and which don't. Also, manually specifying every column name transformation you want to do is one way to accomplish the task, but that can be tedious if you want to apply the same transformation to every column name, which is fairly common.

Transforming column names

Here's how you would upper case all column names:

| Library | Code |
| --- | --- |
| tablecloth | `(tc/rename-columns ds :all str/upper-case)` |
| dplyr | `rename_with(ds, toupper)` |
| pandas | `ds.columns = ds.columns.str.upper()` |
| polars | `ds.select(pl.all().name.to_uppercase())` |

Like the other libraries, tablecloth's rename-columns accepts both types of arguments: a simple mapping of old -> new column names, or any column selector and any transformation function. For example, removing the units from each column name would look like this in each language:

Tablecloth

(tc/rename-columns ds #".+_(mm|g)" #(str/replace % #"(.+)_(mm|g)" "$1"))

dplyr

rename_with(ds, ~ str_replace(.x, "^(.+)_(mm|g)$", "\\1"))

Pandas

import re

ds.rename(columns=lambda x: re.sub(r"(.+)_(mm|g)$", r"\1", x))

Polars

ds = ds.rename({col: col.replace("_mm", "").replace("_g", "") for col in ds.columns})

Grouping and aggregating

Grouping behaves somewhat unconventionally in tablecloth. Datasets can be grouped by a single column name or a sequence of column names like in other libraries, but grouping can also be done using any arbitrary function. Grouping in tablecloth also returns a new dataset, similar to dplyr, rather than an abstract intermediate object (as in pandas and polars). Grouped datasets have three columns: the name of the group, the group id, and a column containing a new dataset of the grouped data. Once a dataset is grouped, the group values can be aggregated in a variety of ways. Here are a few examples, with comparisons between libraries:

Summarizing counts

To get the count of each penguin by species:

Tablecloth

(-> ds
    (tc/group-by ["species"])
    (tc/aggregate {"count" tc/row-count}))

dplyr

ds |>
  group_by(species) |>
  summarise(count = n())

Pandas

ds.groupby("species").agg(count=("species", "count"))

Polars

ds.group_by("species").agg(pl.count().alias("count"))
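To illustrate the point that grouping can use either a column name or any arbitrary function, here is a small plain-Python sketch. The `group_by` helper and the sample rows are hypothetical, and the result is a plain dict of group name to rows rather than tablecloth's three-column grouped dataset; it only shows the idea of a group key computed per row, followed by a count per group.

```python
from collections import defaultdict

def group_by(rows, key):
    """Group rows by a column name or by any function of a row."""
    key_fn = key if callable(key) else (lambda row: row[key])
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)

# Hypothetical rows for illustration.
rows = [
    {"species": "Adelie", "year": 2007},
    {"species": "Adelie", "year": 2009},
    {"species": "Gentoo", "year": 2009},
]

# Group by a column name, then count the rows in each group.
counts = {name: len(group) for name, group in group_by(rows, "species").items()}

# Group by an arbitrary function of the row instead of a column.
by_era = group_by(rows, lambda row: "late" if row["year"] > 2008 else "early")
```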
Find the penguin with the lowest body mass by species

Tablecloth

(-> ds
    (tc/group-by ["species"])
    (tc/aggregate {"lowest_body_mass_g" #(->> (% "body_mass_g")
                                              tcc/drop-missing
                                              (apply tcc/min))}))

dplyr

ds |>
  group_by(species) |>
  summarize(lowest_body_mass_g = min(body_mass_g, na.rm = TRUE))

Pandas

ds.groupby("species").agg(
    lowest_body_mass_g=("body_mass_g", lambda x: x.min(skipna=True))
).reset_index()

Polars

ds.group_by("species").agg(
    pl.col("body_mass_g").min().alias("lowest_body_mass_g")
)
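The common thread in these four versions is that missing values have to be dropped, explicitly or implicitly, before taking the minimum. A library-free Python sketch of that step, using made-up rows with None as the missing marker, might look like this:

```python
# Hypothetical rows; None marks a missing body mass.
rows = [
    {"species": "Adelie", "body_mass_g": 3750},
    {"species": "Adelie", "body_mass_g": None},
    {"species": "Adelie", "body_mass_g": 3250},
    {"species": "Gentoo", "body_mass_g": 5000},
]

lowest = {}
for row in rows:
    mass = row["body_mass_g"]
    if mass is None:
        continue  # drop missing values before comparing
    species = row["species"]
    # Keep the smallest mass seen so far for this species.
    lowest[species] = mass if species not in lowest else min(lowest[species], mass)
```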
Conclusions

As you can see, all of these libraries are perfectly suitable for accomplishing common data manipulation tasks. Choosing a language and library can impact code readability, maintainability, and performance, though, so understanding the differences between available toolkits can help us make better choices.

Clojure's tablecloth emphasizes functional programming concepts and immutability, which can lead to more predictable and re-usable code, at the cost of adopting a potentially new paradigm. Hopefully this comparison serves not only as a translation guide, but also as an intro to the different philosophies underpinning these common data science tools.

Thanks for reading :)

Versions

The code in this post works with the following language and library versions:

| Tool | Version |
| --- | --- |
| MacOS | Sonoma 14.5 |
| JVM | 21.0.2 |
| Clojure | 1.11.1 |
| Tablecloth | 7.021 |
| R | 4.4.1 |
| Tidyverse | 2.0.0 |
| Python | 3.12.3 |
| Pandas | 2.1.4 |
| Polars | 1.1.0 |

Tagged: tools, clojure, scicloj, r, python

Data Manipulation in Clojure Compared to R and Python

This blog post provides a comparative overview of data manipulation techniques across Clojure’s `tablecloth` library, R’s tidyverse (specifically `dplyr`), and Python’s `pandas` and `polars`. Published on the Code with Kira blog, it aims to help people coming to Clojure from other data science ecosystems by offering translations of common tasks. The dataset used for demonstration is the Palmer Penguin dataset, accessible via a provided CSV link. The post highlights key differences in syntax and approach, with particular attention to Clojure’s immutability and functional programming.

**Reading Data**

Each language demonstrates straightforward methods for reading data from a CSV file, including the ability to designate “missing” values, signified as "NA" in the dataset.

* **Tablecloth:** Uses `tc/dataset` to directly interpret "NA" as nil (Clojure's default missing value).
* **R (tidyverse):** Employs `read_csv` from the `readr` package (part of the tidyverse), explicitly specifying "NA" as the missing value indicator.
* **Pandas:** Automatically recognizes "NA" as a missing value due to pre-configured NaN values.
* **Polars:** Utilizes `pl.read_csv` and explicitly defines "NA" as null values.

**Basic Data Exploration**

The post compares common data exploration operations across all libraries, including:

* **Viewing Rows:** `tc/head` (Tablecloth), `head` (dplyr), `ds.head` (Pandas and Polars)
* **Column Names:** `tc/column-names` (Tablecloth), `colnames` (dplyr), `ds.columns` (Pandas and Polars)
* **Selecting Columns:** `tc/select-columns` (Tablecloth), `select` (dplyr), `ds[[...]]` indexing (Pandas), `ds.select` (Polars)
* **Filtering Rows:** Tablecloth's `tc/select-rows` takes a predicate function over each row, while the others use `filter` (dplyr), boolean indexing (Pandas), and `ds.filter` (Polars).
* **Sorting:** `tc/order-by` (Tablecloth), `arrange` (dplyr), `sort_values` (Pandas), `sort` (Polars). The post notes that Tablecloth and Polars place missing values first, while dplyr and Pandas place them last.

**Advanced Operations**

The post then details more complex scenarios, providing comparative examples:

* **Selecting Columns Based on String:** Illustrates how to select columns starting with a specific string using string matching functions.
* **Filtering Rows for Range of Values:** Shows how to filter on a range of values; in Tablecloth, missing values are handled by supplying a default of 0 for the map lookup.
* **Creating and Renaming Columns:** Demonstrates how to create new columns based on existing data and rename columns, noting that the Pandas column assignment mutates the dataset in place.
* **Grouping and Aggregating:** Presents side-by-side comparisons for counting rows and finding the minimum per group. Tablecloth can group by column names or by an arbitrary function and returns a grouped dataset with three columns (group name, group id, and the group's data), which is then passed to `tc/aggregate`; dplyr uses `summarise`, and Pandas and Polars use `agg`.

**Key Takeaways**

The summary highlights key differences in the approach to data manipulation across these languages:

* **Immutability:** Clojure data structures, including `tablecloth` datasets, are immutable, whereas the Pandas column assignment shown in the post mutates the dataset in place.
* **Functional Programming:** Clojure's functional nature, using the arrow macro (`->`) for chaining operations, is emphasized.
* **Syntax Differences:** The post points out similarities and differences in data manipulation syntax across the libraries.
* **Missing Value Handling:** Differences in handling missing values (nil vs. NaN vs. "NA") are noted as a key consideration.

The blog post concludes that all four libraries are suitable for common data manipulation tasks and lists the versions of the tools used.