AtomicServer as dataset management tool #1072

Open
@joepio

The proposition

Dataset management / catalogues

Organizations have a lot of data living in a lot of different services and places. Various apps, databases, folders, files... Where do we keep track of this? Oftentimes, the answer is that this knowledge is tacit. Larger orgs have documents that track things like this, but that often still means something like "an Excel sheet listing a bunch of apps that we use". Meanwhile, we're increasingly dependent on querying and linking these datasets to get things done. Apps like Zapier link datasets to actions.

Truly large organizations like governments sometimes maintain a place for their public datasets. Standards like DCAT and tools like CKAN help in this process.

LLM integration

We're currently adding MCP / Assistant functionality to AtomicServer (see #1055 and #951). This allows LLMs to call various functions, which not only includes controlling AtomicServer, but also querying external services.

I think there is a big opportunity here for LLMs and MCP. I think many datasets should include MCP servers in the near future, which would give AI models a way to search through them. This would make LLMs far more capable, as they could use all the information that an organization has put in its dataset catalogue. For end-users, this would mean chatbots that just know everything there is to know within that organization.

Data importing & converting

AtomicServer wouldn't just be a place to keep the datasets and their API access, it would also be a place where you store the data itself. Importing CSVs to atomic #924 would allow users to make their existing datasets way better:

  • Browseable (in GUI) & linkable (every row its own URL)
  • RESTful JSON API
  • Queryable (full text + search)
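
As a rough illustration of "every row its own URL", here is a minimal sketch of turning CSV text into linkable resources. The `RowResource` shape, the `csvToResources` helper and the `/rows/<n>` URL scheme are assumptions made up for this example, not AtomicServer's actual import API, and the parsing is deliberately naive (no quoted fields).

```typescript
// Hypothetical shape for an imported row; not AtomicServer's real data model.
interface RowResource {
  subject: string; // URL that identifies this row
  propvals: Record<string, string>; // column name -> cell value
}

// Naive CSV handling: assumes no quoted fields, embedded commas or newlines.
function csvToResources(csv: string, baseUrl: string): RowResource[] {
  const [headerLine = "", ...rows] = csv.trim().split(/\r?\n/);
  const header = headerLine.split(",").map((name) => name.trim());

  return rows.map((line, index) => {
    const cells = line.split(",").map((cell) => cell.trim());
    const propvals: Record<string, string> = {};
    header.forEach((name, i) => {
      propvals[name] = cells[i] ?? "";
    });
    // Assumed URL scheme: one URL per row, numbered from 1.
    return { subject: `${baseUrl}/rows/${index + 1}`, propvals };
  });
}

const people = csvToResources(
  "name,city\nAda,London\nLinus,Helsinki",
  "https://example.com/datasets/people",
);
console.log(people.map((row) => row.subject));
```

A real importer would also need to map columns to Properties with datatypes, which this sketch leaves out.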

However, we'll have to compete with very powerful (ETL) tools like Snowflake.

Linking with GIS / GEO data

The dataset management domain is of particular importance to governments. Since they govern things in the physical world, often across a rather large piece of land, GIS plays an important role. Many datasets are presented as WFS / WMS endpoints, which means they can be browsed on an interactive map.

We could even add lat / lon props to AtomicServer and query them by showing these resources directly on the map, although that has some gotchas: #278
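
To make the map idea a bit more concrete, here is a small sketch of filtering resources by the bounding box of a visible map area. The `GeoResource` and `BoundingBox` types and the `inBoundingBox` function are hypothetical names for this example; the gotchas in #278 are not addressed here, apart from boxes that cross the antimeridian.

```typescript
// Hypothetical resource with lat / lon props (decimal degrees, WGS 84 assumed).
interface GeoResource {
  subject: string;
  lat: number;
  lon: number;
}

// Visible map area. When west > east, the box crosses the antimeridian.
interface BoundingBox {
  south: number;
  west: number;
  north: number;
  east: number;
}

function inBoundingBox(resource: GeoResource, box: BoundingBox): boolean {
  const latOk = resource.lat >= box.south && resource.lat <= box.north;
  const lonOk =
    box.west <= box.east
      ? resource.lon >= box.west && resource.lon <= box.east
      : resource.lon >= box.west || resource.lon <= box.east;
  return latOk && lonOk;
}

const resources: GeoResource[] = [
  { subject: "https://example.com/amsterdam", lat: 52.37, lon: 4.9 },
  { subject: "https://example.com/tokyo", lat: 35.68, lon: 139.69 },
];
// Rough box around the Netherlands, for illustration only.
const netherlands: BoundingBox = { south: 50.7, west: 3.3, north: 53.6, east: 7.3 };
const visible = resources.filter((r) => inBoundingBox(r, netherlands));
console.log(visible.map((r) => r.subject));
```

A linear filter like this only works for small collections; a server-side query would likely need some kind of spatial index.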

Other reasons why AtomicServer is a good match for data catalogues

  • Easily self-hostable, open source
  • Powerful front-end customization through e.g. SvelteKit template
  • Built-in full-text search for finding datasets

The Market

There are some open source dataset management tools:

  • CKAN (example), focused on open data sets for governments. Lots of metadata per dataset and powerful search.
  • Apache Atlas, focused on auditing and (big) data flows for enterprise

And some proprietary ones:

The implementation

So, how do we actually build this into AtomicServer?

Add DataSet Class + View

This view should show:

  • Owner
  • Description
  • License
  • AccessURL
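
As a starting point for discussion, the four fields above could be sketched as a plain TypeScript type with a small completeness check. The property names and the `missingFields` helper are assumptions for this example; the real thing would be an Atomic Data Class with Properties rather than a TypeScript interface.

```typescript
// Hypothetical mirror of the proposed DataSet Class.
interface DataSet {
  owner: string;
  description: string;
  license: string;
  accessURL: string;
}

const REQUIRED_FIELDS = ["owner", "description", "license", "accessURL"] as const;

// Returns the names of fields that are absent or blank, so a view can flag them.
function missingFields(candidate: Partial<DataSet>): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = candidate[field];
    return value === undefined || value.trim() === "";
  });
}

const draft: Partial<DataSet> = {
  owner: "Data team",
  description: "Monthly visitor counts per location",
};
console.log(missingFields(draft));
```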

But what about things that are specific to certain types of datasets? For example, a RESTful JSON endpoint might have a link to a swagger / OpenAPI file. Let's explore some options:

  • Include all properties in the DataSet Class. This could lead to a very complicated class, indeed. But it keeps all data in one resource, which could be nice, I suppose.
  • Use multi-class, and have specific SQLTable and JSONAPI classes in addition to DataSet. This would still keep all data in one Resource, but it would lead to ambiguity on which view to select.
  • Have a single dataViews property from the DataSet. This would link to a bunch of related DataViews. A SQLTable DataSet might link to a SQLDataView resource that contains some info about the schema or access authorization or whatever.
  • Have specific properties for certain DataViews. So a DataSet could have an SQLinfo property. This would make the DataSet class way bigger, but it would also remove ambiguity about what the relationship is between the DataSet and the DataView.
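
To compare the options, here is a sketch of the third one (a single dataViews property) modelled as a tagged union. All names here (`DataView`, `kind`, `schemaDescription`, `openApiUrl`, `describeView`) are made up for illustration. The point is that each linked DataView carries its own type, so picking a view per DataView is unambiguous, at the cost of extra resources to fetch.

```typescript
// Hypothetical DataView variants; each knows what kind of access it describes.
type DataView =
  | { kind: "sql"; subject: string; schemaDescription: string }
  | { kind: "jsonApi"; subject: string; openApiUrl: string };

// Hypothetical DataSet that only links to its DataViews.
interface DataSetWithViews {
  name: string;
  dataViews: DataView[];
}

// The `kind` tag selects the rendering, so there is no guessing between views.
function describeView(view: DataView): string {
  switch (view.kind) {
    case "sql":
      return `SQL table: ${view.schemaDescription}`;
    case "jsonApi":
      return `JSON API documented at ${view.openApiUrl}`;
  }
}

const permits: DataSetWithViews = {
  name: "Building permits",
  dataViews: [
    { kind: "sql", subject: "https://example.com/views/1", schemaDescription: "permits(id, address, issued_on)" },
    { kind: "jsonApi", subject: "https://example.com/views/2", openApiUrl: "https://example.com/openapi.json" },
  ],
};
console.log(permits.dataViews.map(describeView));
```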

DataViewers

Some types of data can be presented in a browsable way to end-users:

  • GeoJSON / WFS / WMS in a map view
  • CSV in a table
  • SQL in a set of tables
  • RESTful JSON API in Swagger / OpenAPI viewer
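
The list above amounts to a lookup from data format to viewer. A minimal sketch, where the format keys, the `Viewer` names and `pickViewer` are all assumptions for this example:

```typescript
// Hypothetical viewer identifiers; the front-end would map these to components.
type Viewer = "map" | "table" | "tableSet" | "apiDocs";

const VIEWER_BY_FORMAT = new Map<string, Viewer>([
  ["geojson", "map"],
  ["wfs", "map"],
  ["wms", "map"],
  ["csv", "table"],
  ["sql", "tableSet"],
  ["openapi", "apiDocs"],
]);

// Returns undefined for unknown formats, so the caller can fall back to
// a plain metadata view instead of a browsable one.
function pickViewer(format: string): Viewer | undefined {
  return VIEWER_BY_FORMAT.get(format.toLowerCase());
}

console.log(pickViewer("GeoJSON"), pickViewer("csv"), pickViewer("parquet"));
```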
