AtomicServer as dataset management tool #1072

# The proposition

## Dataset management / catalogues

Organizations have a lot of data living in a lot of different services and places: various apps, databases, folders, files... Where do we keep track of this? Often, the answer is that this knowledge is tacit. Larger orgs have documents that track things like this, but that still often means something like "an Excel sheet listing a bunch of apps that we use". However, we're increasingly dependent on querying and linking these datasets to get things done. Apps like Zapier link datasets to actions.

Truly large organizations like governments sometimes maintain a place for their public datasets. Standards like DCAT and tools like CKAN help in this process.

## LLM integration

We're currently adding MCP / Assistant functionality to AtomicServer (see #1055 and #951). This allows LLMs to call various functions, which not only includes controlling AtomicServer, but also querying external services.

I think there is a big opportunity here for LLMs and MCP. I think many datasets should include MCP servers in the near future, so that AI models can search through them. This would give incredible powers to LLMs, as they could use _all_ information that an organization has put in its dataset catalogue. For end-users, this would mean chatbots that just know _everything_ there is to know within that organization.
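
To make the idea a bit more concrete, here is a minimal sketch of what a catalogue search tool could look like from the model's side. The `name` / `description` / `inputSchema` shape follows how MCP describes tools, but the tool name, the in-memory `CATALOGUE` and the `handle_search` function are all made up for illustration; none of this is AtomicServer code, and no MCP SDK is used.

```python
# Hypothetical in-memory catalogue; a real one would live in AtomicServer.
CATALOGUE = [
    {"title": "Public lighting", "description": "Locations of public lamp posts"},
    {"title": "Tree register", "description": "All trees maintained by the city"},
]

# Tool descriptor in the shape MCP uses (name, description, JSON Schema input).
# The tool name and its single `query` argument are assumptions.
SEARCH_TOOL = {
    "name": "search_datasets",
    "description": "Search the dataset catalogue by keyword.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}


def handle_search(arguments: dict) -> list[dict]:
    """Return catalogue entries whose title or description contains the query."""
    query = arguments["query"].lower()
    return [
        entry
        for entry in CATALOGUE
        if query in entry["title"].lower() or query in entry["description"].lower()
    ]


print(handle_search({"query": "tree"}))
```

The point is only that a dataset catalogue maps quite naturally onto a single search tool; how results are ranked and what is returned per dataset is left open here.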

## Data importing & converting

AtomicServer wouldn't just be a place to keep the datasets and their API access; it would also be a place where you store the data itself. Importing CSVs into AtomicServer (#924) would allow users to make their existing datasets way better:

- Browseable (in GUI) & linkable (every row its own URL)
- RESTful JSON API
- Queryable (full text + search)
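
As a rough sketch of the "every row its own URL" idea, the snippet below turns CSV rows into JSON-serialisable resources. It is not the importer from #924: the `csv_to_resources` function, the `@id` key used for the row URL, the index-based URL scheme and the `example.com` base URL are assumptions for illustration.

```python
import csv
import io
import json


def csv_to_resources(csv_text: str, base_url: str) -> list[dict]:
    """Turn each CSV row into a dict that carries its own URL under "@id"."""
    reader = csv.DictReader(io.StringIO(csv_text))
    base = base_url.rstrip("/")
    resources = []
    for index, row in enumerate(reader):
        # Assumption: rows are addressed by position; a real importer would
        # probably prefer a stable key column if one exists.
        resources.append({**row, "@id": f"{base}/{index}"})
    return resources


sample = "name,city\nAda,London\nLinus,Helsinki\n"
resources = csv_to_resources(sample, "https://example.com/datasets/people/")
print(json.dumps(resources, indent=2))
```

Once every row is a resource with a URL, the GUI, the JSON API and search can treat it like any other resource.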

However, we'll have to compete with very powerful (ETL) tools like Snowflake.

## Linking with GIS / GEO data

The dataset management domain is of particular importance to governments. Since they govern things in the physical world, often across a rather large piece of land, GIS plays an important role. Many datasets are presented as WFS / WMS endpoints, which means they can be browsed on an interactive map.

We could even add `lat` and `lon` props to AtomicServer and query them by showing these resources directly on the map, although that has some gotchas: #278
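
For a feel of what querying on `lat` / `lon` could mean, here is a minimal sketch of a bounding-box filter, the kind of query a map viewport would issue. The `Place` class and `in_bounding_box` function are hypothetical, the filter is a plain linear scan rather than an index, and it deliberately ignores boxes that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass
class Place:
    subject: str  # URL of the resource (hypothetical example values below)
    lat: float
    lon: float


def in_bounding_box(places, south, west, north, east):
    """Return the places inside the box. Assumes west <= east (no antimeridian)."""
    return [
        p for p in places
        if south <= p.lat <= north and west <= p.lon <= east
    ]


places = [
    Place("https://example.com/amsterdam", 52.37, 4.90),
    Place("https://example.com/utrecht", 52.09, 5.12),
    Place("https://example.com/paris", 48.86, 2.35),
]

# Rough box around the Netherlands.
visible = in_bounding_box(places, south=50.7, west=3.3, north=53.6, east=7.3)
print([p.subject for p in visible])
```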

## Other reasons why AtomicServer is a good match for data catalogues

- Easily self-hostable, open source
- Powerful front-end customization through, for example, the SvelteKit template
- Built-in full-text search for finding datasets

# The Market

There are some open source dataset management tools:

- [CKAN](https://ckan.org/) ([example](https://data.gov.au/dataset/ds-ga-cffba00f-d106-0af4-e044-00144fdd4fa6/details?q=light)), focused on open data sets for governments. Lots of metadata per dataset and powerful search.
- [Apache Atlas](https://atlas.apache.org/#/), focused on auditing and (big) data flows for enterprise

And some proprietary ones:




# The implementation

So, how do we actually build this into AtomicServer?

## Add DataSet Class + View

This view should show:

- Owner
- Description
- License
- AccessURL

But what about things that are specific to certain types of datasets? For example, a RESTful JSON endpoint might have a link to a Swagger / OpenAPI file. Let's explore some options:

- Include all properties in the DataSet Class. This could lead to a _very_ complicated class, indeed. But it keeps all data in one resource, which could be nice, I suppose.
- Use multi-class, and have specific `SQLTable` and `JSONAPI` classes in addition to `DataSet`. This would still keep all data in one Resource, but it would lead to ambiguity on which view to select.
- Have _a single_ `dataViews` property from the `DataSet`. This would link to a bunch of related DataViews. A SQLTable DataSet might link to a `SQLDataView` resource that contains some info about the schema or access authorization or whatever.
- Have _specific_ properties for certain `DataView`s. So a `DataSet` could have an `SQLinfo` property. This would make the `DataSet` class way bigger, but it would also remove ambiguity about what the relationship is between the `DataSet` and the `DataView`.
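
To compare the options it may help to see one of them written out. Below is a minimal sketch of the single `dataViews` property variant, using plain Python dataclasses instead of real Atomic Data classes. The field names mirror the list above; `DataView.kind`, `DataView.details`, the `store` dict and `resolve_views` are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataView:
    subject: str
    kind: str  # hypothetical label, e.g. "sql" or "json-api"
    details: dict = field(default_factory=dict)


@dataclass
class DataSet:
    subject: str
    owner: str
    description: str
    license: str
    access_url: str
    # The single `dataViews` property: links to related DataView resources.
    data_views: list[str] = field(default_factory=list)


def resolve_views(dataset: DataSet, store: dict) -> list[DataView]:
    """Look up the linked DataViews, skipping links that are not in the store."""
    return [store[s] for s in dataset.data_views if s in store]


sql_view = DataView("https://example.com/views/1", "sql", {"table": "trees"})
store = {sql_view.subject: sql_view}
dataset = DataSet(
    subject="https://example.com/datasets/trees",
    owner="Parks department",
    description="All trees maintained by the city",
    license="CC0",
    access_url="https://example.com/trees.sqlite",
    data_views=[sql_view.subject],
)
print([v.kind for v in resolve_views(dataset, store)])
```

In this shape the `DataSet` class stays small, at the cost of an extra lookup per view and a less explicit relationship between the dataset and each view, which is the trade-off the last two options describe.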

## DataViewers

Some types of data can be presented in a browsable way to end-users:

- GeoJSON / WFS / WMS in a map view
- CSV in a table
- [SQL in a set of tables](https://inloop.github.io/sqlite-viewer/)
- RESTful JSON API in Swagger / OpenAPI viewer
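
One simple way to wire this up is a lookup from data format to viewer. The sketch below assumes each dataset carries some format label; the labels, the viewer names and the `download` fallback are all hypothetical.

```python
# Hypothetical mapping from a dataset's format label to a front-end viewer.
VIEWERS = {
    "geojson": "map",
    "wfs": "map",
    "wms": "map",
    "csv": "table",
    "sql": "tables",
    "openapi": "api-docs",
}


def pick_viewer(data_format: str) -> str:
    """Pick a viewer for a format, falling back to a plain download link."""
    return VIEWERS.get(data_format.strip().lower(), "download")


for fmt in ["GeoJSON", "csv", "parquet"]:
    print(fmt, "->", pick_viewer(fmt))
```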
