- About This Project
- Exploratory Analysis
- Loading Snapshot Files into a Local DuckDB Database
- Using dbt and dbt-duckdb
- Using MotherDuck
- Next Steps
- Read More
First, a little background:
OurResearch came to the rescue with OpenAlex after Microsoft announced it would end its support for Microsoft Academic Graph (MAG) at the end of 2021.
In a short amount of time, OpenAlex started compiling data related to academic publications (e.g. works, authors, publications, institutions, etc.). They stood up an API and documentation and made this data available to the community at no charge.
In addition, they began releasing regular data snapshots in s3 for those who have needs beyond what the API provides. The snapshots are released monthly, and the August 2023 author snapshot included improved author disambiguation, which can be a thorny problem to solve.
More recently, they've started using a method for classifying the research topics of a particular work, developed in partnership with CWTS at Leiden University. This is a move away from the concepts that existed in the former data model, which were inherited from MAG.
DuckDB is an in-process analytical database that's come on the scene in the past few years and looks like it could change a lot of workflows. It's fast and designed to be embedded, but you can choose to persist a database to disk if you want to.
dbt-duckdb is a dbt adapter built for use with DuckDB (and, recently, MotherDuck) so we can:
- create data sources from all the most recent OpenAlex snapshot files, and
- build a dbt project on top of them, allowing us to make use of all the great things about dbt, like testing, macros, and packages created by that community.
MotherDuck is a serverless instance of DuckDB in the cloud.
I was granted access to the beta a little while ago (though they've since announced they're welcoming sign-ups from anyone), and the timing has been perfect to play around with this and see what I can do.
Importantly: OpenAlex and MotherDuck both use s3 region us-east-1, so the data won't be moving between regions. We can explore the OpenAlex snapshots directly from s3 using a local instance of duckdb and dbt-duckdb too; we'll just need to set s3_region in either to access the snapshot data files.
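For example, in a local DuckDB session the relevant settings look something like this (a minimal sketch; the OpenAlex bucket is public, so no credentials are needed):
install httpfs;
load httpfs;
set s3_region = 'us-east-1';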
For our exploratory analysis, we can follow the OpenAlex documentation to download all the snapshot files to our machine. Alternatively, we can use DuckDB to query one or more of those files directly without downloading them first.
For now, we'll query just one of the files using the file path from the most recent manifest, and once we're more familiar with the data model, we'll grab all the files that make up the snapshot.
We'll start by using DESCRIBE to take a look at the schema of one of the author snapshot files.
describe (select * from read_ndjson_auto('s3://openalex/data/authors/updated_date=2023-02-24/part_000.gz'))
Results
┌──────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├──────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼─────────┤
│ id │ VARCHAR │ YES │ │ │ │
│ orcid │ VARCHAR │ YES │ │ │ │
│ display_name │ VARCHAR │ YES │ │ │ │
│ display_name_alter… │ VARCHAR[] │ YES │ │ │ │
│ works_count │ BIGINT │ YES │ │ │ │
│ cited_by_count │ BIGINT │ YES │ │ │ │
│ most_cited_work │ VARCHAR │ YES │ │ │ │
│ summary_stats │ STRUCT("2yr_mean_citedness" DOUBLE, h_index BIGINT, i10_index BIGINT, oa_percent DOUBLE, works_count BIGINT, cited_by_count BIGINT, "2yr_… │ YES │ │ │ │
│ affiliations │ STRUCT(institution STRUCT(id VARCHAR, ror VARCHAR, display_name VARCHAR, country_code VARCHAR, "type" VARCHAR, lineage VARCHAR[]), "years… │ YES │ │ │ │
│ ids │ STRUCT(openalex VARCHAR, orcid VARCHAR, scopus VARCHAR) │ YES │ │ │ │
│ last_known_institu… │ STRUCT(id VARCHAR, ror VARCHAR, display_name VARCHAR, country_code VARCHAR, "type" VARCHAR, lineage VARCHAR[]) │ YES │ │ │ │
│ last_known_institu… │ STRUCT(id VARCHAR, ror VARCHAR, display_name VARCHAR, country_code VARCHAR, "type" VARCHAR, lineage VARCHAR[])[] │ YES │ │ │ │
│ counts_by_year │ STRUCT("year" BIGINT, works_count BIGINT, oa_works_count BIGINT, cited_by_count BIGINT)[] │ YES │ │ │ │
│ x_concepts │ STRUCT(id VARCHAR, wikidata VARCHAR, display_name VARCHAR, "level" BIGINT, score DOUBLE)[] │ YES │ │ │ │
│ works_api_url │ VARCHAR │ YES │ │ │ │
│ updated_date │ DATE │ YES │ │ │ │
│ created_date │ DATE │ YES │ │ │ │
│ updated │ VARCHAR │ YES │ │ │ │
├──────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴─────────┤
│ 18 rows 6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
We can see the parsed data fields and how DuckDB will infer the data types when using the read_ndjson_auto function. While inferring the data types is an option, I've found it more useful to be as explicit as possible when parsing JSON source files for this project, especially for some of the nested fields.
This OpenAlex snapshot documentation page makes the distinction between loading data into a cloud data warehouse and loading it into a relational database. When working with nested data structures in DuckDB, we can choose to load the data as-is into the database and parse the JSON later, or we can unnest the JSON as we're reading it to speed up later queries and improve efficiency.
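As a rough sketch of the load-first approach (the local file path and table name here are just for illustration), the raw JSON can be landed as-is and the nested fields pulled apart in a later query:
create table raw_authors_asis as
select * from read_ndjson_auto('data/authors/updated_date=2023-02-24/part_000.gz');

select id, ids.orcid as orcid, last_known_institution.display_name as institution
from raw_authors_asis;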
To be as explicit as possible (as well as to avoid loading large amounts of abstract_inverted_index data from the works snapshot files specifically), I've chosen to define columns as a parameter to include with the read_ndjson function, along with the data types of those columns. The data types can be found by using DESCRIBE as above, or through a method similar to the one documented here.
Whether the data types are defined as the file is being read initially, or later from a raw table loaded into the database as-is, it's most efficient to parse the JSON once and only once.
At this point, we can download all the snapshot files and store them locally so we don't need to keep pulling them from the s3 bucket. There are a few different ways to do this, but following the OpenAlex documentation and using the aws cli is probably simplest for now.
aws s3 ls --summarize --human-readable --no-sign-request --recursive "s3://openalex/data/authors/"
aws s3 sync --delete "s3://openalex/data/authors/" "data/authors/" --no-sign-request
We can make use of DuckDB's COPY or CREATE TABLE commands to insert these downloaded files into a persisted .duckdb database that we store locally, which looks like this when using the DuckDB CLI:
duckdb open_alex_authors.duckdb
select count(*)
from read_ndjson(
'~/open_alex_authors/february_2024/data/authors/*/*.gz'
);
Query Result
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 92840725 │
└──────────────┘
create table authors as
from read_ndjson(
'~/open_alex_snapshot/february_2024_snapshot/data/authors/*/*.gz',
columns = {
id: 'VARCHAR',
orcid: 'VARCHAR',
display_name: 'VARCHAR',
display_name_alternatives: 'VARCHAR[]',
works_count: 'BIGINT',
cited_by_count: 'BIGINT',
most_cited_work: 'VARCHAR',
ids: 'STRUCT(openalex VARCHAR, orcid VARCHAR, scopus VARCHAR)',
last_known_institution: 'STRUCT(id VARCHAR, ror VARCHAR, display_name VARCHAR, country_code VARCHAR, type VARCHAR, lineage VARCHAR[])',
counts_by_year: 'STRUCT(year VARCHAR, works_count BIGINT, oa_works_count BIGINT, cited_by_count BIGINT)[]',
works_api_url: 'VARCHAR',
updated_date: 'DATE',
created_date: 'DATE',
updated: 'VARCHAR'
},
compression='gzip'
);
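A quick sanity check on the new table (just a sketch) might look at how the rows break down by updated_date:
select updated_date, count(*) as author_count
from authors
group by updated_date
order by updated_date desc
limit 5;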
With dbt-duckdb, we can work using a persisted .duckdb database file, using memory, or using MotherDuck, which we'll get to. In our profiles.yml, we can create a new profile with type: duckdb and configure any other settings or extensions we'll use.
open_alex_snapshot:
target: dev
outputs:
dev:
type: duckdb
path: 'open_alex_snapshot.duckdb'
threads: 1
extensions:
- httpfs
We'll define the local copy of our OpenAlex snapshot files as various raw sources, but we could also use the external_location meta option with the relevant read_ndjson parameters if we're reading files directly from s3 without loading them into a local database. Here's what the raw_authors source would look like in sources.yml:
version: 2
sources:
- name: open_alex_snapshot
schema: main
description: "Latest OpenAlex data snapshot"
# meta:
# external_location: "read_json_auto('s3://openalex/data/authors/updated_date=2023-08-15/*.gz', format='newline_delimited', compression='gzip')"
tables:
- name: raw_authors
meta:
external_location: >
read_ndjson('~/open_alex_snapshot/february_2024_snapshot/data/authors/*/*.gz',
columns = {
id: 'VARCHAR',
orcid: 'VARCHAR',
display_name: 'VARCHAR',
display_name_alternatives: 'VARCHAR[]',
works_count: 'BIGINT',
cited_by_count: 'BIGINT',
most_cited_work: 'VARCHAR',
ids: 'STRUCT(openalex VARCHAR, orcid VARCHAR, scopus VARCHAR)',
last_known_institution: 'STRUCT(id VARCHAR, ror VARCHAR, display_name VARCHAR, country_code VARCHAR, type VARCHAR, lineage VARCHAR[])',
counts_by_year: 'STRUCT(year VARCHAR, works_count BIGINT, oa_works_count BIGINT, cited_by_count BIGINT)[]',
works_api_url: 'VARCHAR',
updated_date: 'DATE',
created_date: 'DATE',
updated: 'VARCHAR'
})
Once the sources are defined, we can use the OpenAlex Postgres schema diagram (with a few modifications) to detail the models that will be created in our dbt project. I've set up a few of these as incremental models relying on the updated_date, but I haven't seen a great improvement in overall processing from snapshot to snapshot as a result.
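As a rough sketch of one of those incremental models (the file name and unique key here are assumptions, not the project's actual code), the updated_date can serve as the incremental filter:
-- models/staging/stg_authors.sql
{{ config(materialized='incremental', unique_key='id') }}

select
    id,
    orcid,
    display_name,
    works_count,
    cited_by_count,
    updated_date
from {{ source('open_alex_snapshot', 'raw_authors') }}

{% if is_incremental() %}
  -- only pick up rows newer than what's already in the table
  where updated_date > (select max(updated_date) from {{ this }})
{% endif %}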
From these models, we can begin to output files (using dbt-duckdb's functionality to write to external files), export the data to parquet or another compressed format, or begin to use a BI tool like evidence.dev to generate reports analyzing specific parts of the most recent OpenAlex snapshot.
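For example, exporting a finished table to parquet from DuckDB is a one-liner (the output file name is just an example):
copy authors to 'authors.parquet' (format parquet, compression zstd);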
At this point, we'll connect to a MotherDuck account and create the database there from the one that already exists locally:
.open md:
CREATE OR REPLACE DATABASE open_alex_snapshot FROM 'open_alex_snapshot.duckdb';
Or, if only the snapshot table exists in MotherDuck, we can run our dbt-duckdb models against that source by adding a MotherDuck target (i.e. the prod target) to the profiles.yml file.
open_alex_snapshot:
target: dev
outputs:
dev:
type: duckdb
path: 'open_alex_snapshot.duckdb'
threads: 24
extensions:
- httpfs
prod:
type: duckdb
path: md:open_alex_snapshot
Depending on whether you're looking to sync files locally at all, you could decide to query and store the OpenAlex files directly from s3. Because both OpenAlex and MotherDuck live in s3 region us-east-1, you could use the OpenAlex s3 snapshot location as a source to run your dbt-duckdb models and create a MotherDuck database share or generate output files without relying on local storage at all.
I haven't tested that much, and the cold storage fees may mean that keeping multiple full snapshots would grow prohibitively expensive over time for a hobby project. But that could be mitigated by only persisting certain aggregations from each snapshot, or by keeping some data local, since MotherDuck will run hybrid queries from the CLI that combine local storage with what exists in the cloud.
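As a rough sketch of what such a hybrid query might look like (the local_top_authors table is hypothetical, and the sketch assumes the cloud database is named open_alex_snapshot):
attach 'md:open_alex_snapshot';

select a.display_name, a.cited_by_count
from open_alex_snapshot.main.authors as a
join local_top_authors as l using (id);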
Once the database is created in MotherDuck, we can also choose to share the database, including the base snapshot table and other models we created from it.
CREATE SHARE open_alex_snapshot_share FROM open_alex_snapshot
Then others can attach and query the database share:
ATTACH 'md:_share/open_alex_authors/5d0ef4a6-8f80-4c74-b821-08fda756ca2d'
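Once attached, the share can be queried read-only. As a sketch, assuming the attached database shows up under the name open_alex_authors (you can also give it an explicit alias with ATTACH ... AS):
select count(*) from open_alex_authors.main.authors;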
From here, since different disciplines are cited at different rates, it'd make sense to start parsing the topics fields to explore which authors and which institutions are most highly cited within a given topic, so we can begin to track how those in the top percentiles change over time from snapshot to snapshot.
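As a starting point, something like this sketch could rank authors within each concept (it assumes the x_concepts field from the snapshot has been loaded into the authors table; the newer topics fields would be unnested the same way):
with author_concepts as (
    select
        id,
        display_name as author_name,
        cited_by_count,
        unnest(x_concepts) as concept
    from authors
)
select
    concept.display_name as concept_name,
    author_name,
    cited_by_count,
    rank() over (partition by concept.display_name order by cited_by_count desc) as concept_rank
from author_concepts
qualify concept_rank <= 100;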