Description
The proposition
Dataset management / catalogues
Organizations have a lot of data living in a lot of different services and places. Various apps, databases, folders, files... Where do we keep track of this? Often, the answer is that this knowledge is tacit. Larger orgs have documents that track things like this, but that often still means something like "an Excel sheet listing a bunch of apps that we use". However, we're increasingly dependent on querying and linking these datasets to do things. Apps like Zapier link datasets to actions.
Truly large organizations like governments sometimes maintain a place for their public datasets. Standards like DCAT and tools like CKAN help in this process.
LLM integration
We're currently adding MCP / Assistant functionality to AtomicServer (see #1055 and #951). This allows LLMs to call various functions, which not only includes controlling AtomicServer, but also querying external services.
I think there is a big opportunity here for LLMs and MCP. Many datasets should include MCP servers in the near future, which would give AI models a means to search through them. This would make LLMs far more capable, as they could use all the information that an organization has put in its dataset catalogue. For end-users, this would mean chatbots that know everything there is to know within that organization.
Data importing & converting
AtomicServer wouldn't just be a place to keep the datasets and their API access; it would also be a place where you store the data itself. Importing CSVs to Atomic (#924) would allow users to make their existing datasets way better:
- Browseable (in GUI) & linkable (every row its own URL)
- RESTful JSON API
- Queryable (full text + search)
However, we'll have to compete with very powerful (ETL) tools like Snowflake.
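To make the "every row its own URL" idea concrete, here is a minimal Python sketch of what a CSV import could do. It is not the importer from #924: the function name `csv_to_resources`, the `id_column` parameter, the `@id` key and the example.com base URL are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import quote


def csv_to_resources(csv_text: str, base_url: str, id_column: str) -> list[dict]:
    """Turn each CSV row into a dict that carries its own URL.

    Hypothetical helper: assumes one column holds a unique value per row.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    resources = []
    for row in reader:
        # Derive the row URL from the unique column, percent-encoding the value.
        subject = f"{base_url.rstrip('/')}/{quote(row[id_column], safe='')}"
        resources.append({"@id": subject, **row})
    return resources


sample = "id,name,city\n1,Town hall,Utrecht\n2,Library,Zwolle\n"
rows = csv_to_resources(sample, "https://example.com/buildings", "id")
print(rows[0]["@id"])  # https://example.com/buildings/1
```

Once every row has a stable URL, the browsing, JSON API and search features listed above can all address rows individually.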
Linking with GIS / GEO data
The dataset management domain is of particular importance to governments. Since they govern things in the physical world, often across a rather large piece of land, GIS plays an important role. Many datasets are presented as WFS / WMS endpoints, which means they can be browsed on an interactive map.
We could even add lat and lon props to AtomicServer and query them by showing these resources directly on the map, although that has some gotchas: #278
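As a rough illustration of what querying by lat and lon could mean, the Python sketch below filters resources by the bounding box of the visible map area. The `GeoResource` class and `in_bounding_box` function are hypothetical, and the sketch deliberately ignores boxes that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass
class GeoResource:
    subject: str  # URL of the resource
    lat: float    # latitude in decimal degrees
    lon: float    # longitude in decimal degrees


def in_bounding_box(resources, south, west, north, east):
    """Return the resources whose coordinates fall inside the box.

    Assumes west <= east, so boxes crossing the antimeridian are not handled.
    """
    return [
        r for r in resources
        if south <= r.lat <= north and west <= r.lon <= east
    ]


places = [
    GeoResource("https://example.com/places/utrecht", 52.09, 5.12),
    GeoResource("https://example.com/places/paris", 48.86, 2.35),
]
visible = in_bounding_box(places, south=50.7, west=3.3, north=53.6, east=7.3)
print([r.subject for r in visible])
```

A linear scan like this is fine for a sketch; a real index over many resources would likely need a spatial data structure.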
Other reasons why AtomicServer is a good match for data catalogues
- Easily self-hostable, open source
- Powerful front-end customization through e.g. SvelteKit template
- Built-in full-text search for finding datasets
The Market
There are some open source dataset management tools:
- CKAN (example), focused on open data sets for governments. Lots of metadata per dataset and powerful search.
- Apache Atlas, focused on auditing and (big) data flows for enterprise
And some proprietary ones:
The implementation
So, how do we actually build this into AtomicServer?
Add DataSet Class + View
This view should show:
- Owner
- Description
- License
- AccessURL
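For reference, the four fields above could be modelled roughly as follows. This is only a Python sketch of the shape of the data, not AtomicServer's actual class definition; the snake_case field names and all example values are made up.

```python
from dataclasses import dataclass, asdict


@dataclass
class DataSet:
    owner: str
    description: str
    license: str
    access_url: str


# Example values are invented for illustration only.
example = DataSet(
    owner="Municipality of Example",
    description="Locations of public buildings",
    license="CC-BY-4.0",
    access_url="https://example.com/buildings.csv",
)
print(asdict(example))
```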
But what about things that are specific to certain types of datasets? For example, a RESTful JSON endpoint might have a link to a swagger / OpenAPI file. Let's explore some options:
- Include all properties in the DataSet Class. This could lead to a very complicated class, but it keeps all data in one resource, which could be nice, I suppose.
- Use multi-class, and have specific SQLTable and JSONAPI classes in addition to DataSet. This would still keep all data in one Resource, but it would lead to ambiguity on which view to select.
- Have a single dataViews property from the DataSet. This would link to a bunch of related DataViews. A SQLTable DataSet might link to a SQLDataView resource that contains some info about the schema or access authorization or whatever.
- Have specific properties for certain DataViews. So a DataSet could have an SQLinfo property. This would make the DataSet class way bigger, but it would also remove ambiguity about what the relationship is between the DataSet and the DataView.
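To compare the options, it may help to see the dataViews variant spelled out. The Python sketch below assumes that DataViews are separate resources referenced by URL; the field names (`table_name`, `columns`), the `resolve_views` helper and the in-memory `store` dict are hypothetical stand-ins for real resources and a real lookup.

```python
from dataclasses import dataclass, field


@dataclass
class SQLDataView:
    subject: str          # URL of this DataView resource
    table_name: str
    columns: list[str]    # minimal stand-in for schema info


@dataclass
class DataSet:
    subject: str
    description: str
    # Links to related DataView resources, stored as URLs.
    data_views: list[str] = field(default_factory=list)


def resolve_views(dataset, store):
    """Look up every linked DataView by URL, skipping links that are unknown."""
    return [store[url] for url in dataset.data_views if url in store]


view = SQLDataView("https://example.com/views/1", "buildings", ["id", "name"])
dataset = DataSet(
    subject="https://example.com/datasets/1",
    description="Public buildings",
    data_views=[view.subject, "https://example.com/views/missing"],
)
store = {view.subject: view}
print(resolve_views(dataset, store))
```

The DataSet class stays small in this variant, at the cost of an extra lookup per view and of having to handle links that no longer resolve.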
DataViewers
Some types of data can be presented in a browsable way to end-users:
- GeoJSON / WFS / WMS in a map view
- CSV in a table
- SQL in a set of tables
- RESTful JSON API in Swagger / OpenAPI viewer
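One simple way to pick a viewer is a lookup table from data format to viewer, sketched below in Python. The format keys, the viewer names and the "metadata only" fallback are assumptions for illustration, not an existing AtomicServer API.

```python
# Hypothetical mapping from a dataset's format to the viewer that presents it.
VIEWERS = {
    "geojson": "map",
    "wfs": "map",
    "wms": "map",
    "csv": "table",
    "sql": "table set",
    "openapi": "api explorer",
}


def pick_viewer(data_format: str) -> str:
    """Return the viewer for a format, or a plain metadata view if unknown."""
    return VIEWERS.get(data_format.strip().lower(), "metadata only")


print(pick_viewer("GeoJSON"))  # map
print(pick_viewer("parquet"))  # metadata only
```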