There are many advantages to coding in JavaScript, but data wrangling probably isn’t near the top of that list. However, there’s good news for those who find JavaScript data wrangling a challenge: The same “grammar-of-data” ideas behind the hugely popular dplyr R package are also available in JavaScript, thanks to the Arquero library.
Arquero, from the University of Washington Interactive Data Lab, is probably best known to users of Observable JavaScript, but it’s available in other ways, too. One of these is Node.js.
This article will show you how to filter JavaScript objects with Arquero, with a few bonus tasks at the end.
Step 1. Load Arquero
Arquero is a standard library in Observable JavaScript and in Quarto, which is how I use it. In that case, no installation is needed. If you are using Arquero in Node, you’ll need to install it with npm install arquero --save. In the browser, use <script src="https://cdn.jsdelivr.net/npm/arquero@latest"></script>.
In Observable, you can load Arquero with import {aq, op} from "@uwdata/arquero". In the browser, Arquero will be loaded as aq. In Node, you can load it with const aq = require('arquero').
The remainder of the code in this tutorial should run as-is in Observable and Quarto. In Node, data loading is asynchronous: functions such as aq.loadCSV() return a promise, so you will need to await the result (or use .then()) before working with the table.
Step 2. Transform your data into an Arquero table
You can turn an existing “regular” JavaScript object into an Arquero table with aq.from(my_object).
Another option is to directly import remote data as an Arquero table with Arquero’s load family of functions, such as aq.loadCSV("myurl.com/mycsvfile.csv") for a CSV file and aq.loadJSON("myjsonurl.com/myjsonfile.json") for a JSON file on the web. There’s more information about table input functions at the Arquero API documentation website.
In order to follow along with the rest of this tutorial, run the code below to import sample data about population changes in U.S. states.
states_table = aq.loadCSV("https://raw.githubusercontent.com/smach/SampleData/master/states.csv")
Arquero tables have a special view() method for use with Observable JavaScript and in Quarto. The states_table.view() command returns something like the output shown in Figure 1.
Observable JavaScript’s Inputs.table(states_table) (which has clickable column headers for sorting) also works to display an Arquero table.
Outside of Observable, you can use states_table.print() to print the table to the console.
Step 3. Filter rows
Arquero tables have a lot of built-in methods for data wrangling and analysis, including filtering rows for specific conditions with filter().
A note to R users: Arquero’s filter() syntax isn’t quite as simple as dplyr’s filter(Region == 'RegionName'). Because this is JavaScript and most functions are not vectorized, you need to create an anonymous function with d => and then run another function inside of it, usually a function from op (imported above with arquero). Even if you are accustomed to a language other than JavaScript, once you are familiar with this construction, it’s fairly easy to use.
The usual syntax is:
filter(d => op.opfunction(d.columnname, 'argument'))
In this example, the op function I want is op.equal(), which (as the name implies) tests for equality. So, the Arquero code for only states in the Northeast region of the United States would be:
states_table
.filter(d => op.equal(d.Region, 'Northeast'))
You can tack on .view() at the end to see the results.
A note on the filter() syntax: The code inside filter() is an Arquero table expression. “At first glance table expressions look like normal JavaScript functions … but hold on!” the Arquero API reference explains. “Under the hood, Arquero takes a set of function definitions, maps them to strings, then parses, rewrites, and compiles them to efficiently manage data internally.”
What does that mean for you? In addition to the usual JavaScript function syntax, you can also use special table expression syntax such as filter("d => op.equal(d.Region, 'Northeast')") or filter("equal(d.Region, 'Northeast')"). Check out the API reference if you think one of these versions might be more appealing or useful.
This also means that you can’t use just any type of JavaScript function within filter() and other Arquero verbs. For example, for loops are not allowed unless wrapped by an escape() “expression helper.” Check out the Arquero API reference to learn more.
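To get a feel for the “maps them to strings” step, here is a plain JavaScript sketch. It is not Arquero’s actual implementation, just an illustration that a function’s source text is available at runtime, and that this text contains variable names but not their values. That hints at why code leaning on outside variables or arbitrary JavaScript needs helpers such as escape(). The region and predicate names are hypothetical.

```javascript
// Plain JavaScript illustration; this is NOT Arquero's internal code.
// `region` and `predicate` are hypothetical names for this sketch.
const region = 'Northeast';
const predicate = d => d.Region === region;

// Function.prototype.toString() returns the function's source text.
const sourceText = predicate.toString();
console.log(sourceText);

// The text mentions the variable name `region` but does not carry its value.
const textHasValue = sourceText.includes('Northeast');
console.log(textHasValue);
```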
A note to Python users: Arquero’s filter() is designed for subsetting rows only, not either rows or columns, as seen with the pandas DataFrame.filter() method. (We’ll get to columns next.)
Filters can be more complex than a single test, with negative or multiple conditions. For example, if you want “one-word state names in the West region,” you’d look for state names that don’t include a space and Region equals West. One way to accomplish that is !op.includes(d.State, ' ') && op.equal(d.Region, 'West') inside the filter(d =>) anonymous function:
states_table
.filter(d => !op.includes(d.State, ' ') &&
op.equal(d.Region, 'West'))
To search and filter by regular expression instead of equality, use op.match() instead of op.equal().
Step 4. Select columns
Selecting only certain columns is similar to dplyr’s select(). In fact it’s even easier, since you don’t need to turn the selection into an array; the argument is just comma-separated column names inside select():
states_table
.select('State', 'State Code', 'Region', 'Division', 'Pop_2020')
You can rename columns while selecting them, using the syntax select({ OldName1: 'NewName1', OldName2: 'NewName2' }). Here’s an example:
states_table
.select({ State: 'State', 'State Code': 'Abbr', Region: 'Region',
Division: 'Division', Pop_2020: 'Pop' })
Step 5. Create an array of unique values in a table column
It can be useful to get one column’s unique values as a vanilla JavaScript array, for tasks such as populating an input dropdown list. Arquero has several functions to accomplish this:
dedupe() gets unique values.
orderby() sorts results.
array() turns data from one Arquero table column into a conventional JavaScript array.
Here’s one way to create a sorted array of unique Region names from states_table:
region_array = states_table
.select('Region')
.dedupe()
.orderby('Region')
.array('Region')
Since this new object is a JavaScript array, Arquero methods won’t work on it anymore, but conventional array methods will. Here’s an example:
'The regions are ' + region_array.join(', ')
This code gets the following output:
"The regions are , Midwest, Northeast, South, West"
The first comma in that string appears because there’s a null value in the array. If you’d like to remove blank values like null, you can use the Arquero op.compact() function on the results:
region_array2 = op.compact(states_table
.select('Region')
.dedupe()
.orderby('Region')
.array('Region')
)
Another option is to use vanilla JavaScript’s filter() to remove null values from an array of text strings. Note that the following vanilla JavaScript filter() function for one-dimensional JavaScript arrays is not the same as Arquero’s filter() for two-dimensional Arquero tables:
region_array3 = states_table
.select('Region')
.dedupe()
.orderby('Region')
.array('Region')
.filter(n => n)
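The plain JavaScript sketch below shows why the null produced a leading comma and what filter(n => n) actually removes. The regions and mixed arrays are hypothetical stand-ins typed by hand. One caveat worth knowing: filter(n => n) keeps only truthy values, so it is a good fit for arrays of text strings but would also drop values such as 0 or an empty string.

```javascript
// Hypothetical stand-in for the array returned by .array('Region').
const regions = [null, 'Midwest', 'Northeast', 'South', 'West'];

// join() renders null as an empty string, which causes the leading comma.
const joined = regions.join(', ');
console.log(joined);

// filter(n => n) keeps only truthy values, so the null is dropped.
const cleaned = regions.filter(n => n);
console.log(cleaned.join(', '));

// The same test would also drop other falsy values such as '' and 0.
const mixed = ['', 0, null, 'West'];
const truthyOnly = mixed.filter(n => n);

// A stricter test removes only null and undefined.
const nonNull = mixed.filter(n => n != null);
console.log(truthyOnly, nonNull);
```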
Observable JavaScript users, including those using Quarto, can also employ the md function to add styling to the string, such as bold text with **. So, this code
md`The regions are **${region_array2.join(', ')}**.`
produces the following output:
The regions are Midwest, Northeast, South, West
As an aside, note that the Intl.ListFormat() JavaScript object makes it easy to add “and” before the last item in a comma-separated array-to-string. So, the code
my_formatter = new Intl.ListFormat('en', { style: 'long', type: 'conjunction' });
my_formatter.format(region_array3)
produces the output:
"Midwest, Northeast, South, and West"
There’s lots more to Arquero
Filtering, selecting, de-duping, and creating arrays barely scratch the surface of what Arquero can do. The library has verbs for data reshaping, merging, aggregating, and more, as well as op functions for calculations and analysis like mean, median, quantile, rankings, lag, and lead. Check out Introducing Arquero for an overview of more capabilities. Also see An Illustrated Guide to Arquero Verbs and the Arquero API documentation for a full list, or visit the Data Wrangler Observable notebook for an interactive application showing what Arquero can do.
For more on Observable JavaScript and Quarto, don’t miss A beginner’s guide to using Observable JavaScript, R, and Python with Quarto and Learn Observable JavaScript with Observable notebooks.
Copyright © 2022 IDG Communications, Inc.