Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

?sort=colname~numeric to sort by by column cast to real #894

Open
simonw opened this issue Jul 15, 2020 · 21 comments
Open

?sort=colname~numeric to sort by by column cast to real #894

simonw opened this issue Jul 15, 2020 · 21 comments

Comments

@simonw
Copy link
Owner

simonw commented Jul 15, 2020

If a text column actually contains numbers, being able to "sort by column, treated as numeric" would be really useful.

Probably depends on column actions enabled by #690

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

This needs querystring parameter design. Some options:

  • ?_sort_clause=cast(mycol+as+integer) and ?_sort_clause_desc=cast(mycol+as+integer) - allowing any expression. This would need to be disabled if arbitrary SQL was turned off, similar to the restrictions on ?_where=.
  • ?_sort_numeric=mycol and ?_sort_numeric_desc=mycol - this would cast to real which would work on integer values as well. It could be allowed even when arbitrary SQL was disabled.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

I could even let plugins define new sort types. Imagine a plugin that enables this:

?_sort_date_desc=mycol - where it knows how to handle specific date formats, or even uses dateutil.parser.parse.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Sorting by date when the column has a junk date format in it is such a column need it should maybe ship in Datasette by default - though I've been trying to avoid adding heavy dependencies like dateutil if I can get away with it.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

This does feel like a weird plugin hook just because there aren't really THAT many different use-cases that plugins could solve. The ones I can think of are:

  • Sort numeric
  • Sort by parsed date
  • Sort by FTS rank

Could this work if I just allow _sort_clause=?

One possible solution for the no-arbitrary-SQL case: users can define sort orders in metadata.json/yml. So if you want to enable sort-by-distance without enabling arbitrary SQL you could add something like this:

databases:
  mydb:
    tables:
      museums:
        sort_clause: bm25(fts)

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

There's something interesting about figuring out which sort options should be offered in the column actions menu.

Two options:

  • Try to detect if a text column is all-integers or all-floats, either by scanning the entire table (if it's small enough) or by scanning the first X rows, where X might be the size of the first page or maybe the first 1,000 or similar.
  • Could also let users define this in metadata.yml for the table.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

The simplest solution would be for Python code to scan all of the visible values on the current page and show the column action for "sort by this numeric" based purely on that. I already do that in the JavaScript for "are there any blank values in the first page?" here:

/* Show notBlank option if not selected AND at least one visible blank value */
var tdsForThisColumn = Array.from(
th.closest('table').querySelectorAll('td.' + th.className)
);
if (
params.get(`${column}__notblank`) != '1' &&
tdsForThisColumn.filter(el => el.innerText.trim() == '').length
) {
notBlank.style.display = 'block';
notBlank.setAttribute('href', notBlankUrl(column));
} else {
notBlank.style.display = 'none';
}

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

I think the first version of this feature involves implementing ?_sort_numeric=col and ?_sort_numeric_desc=col plus the JavaScript to detect if those values should be shown in the column actions menu.

One question: how to reflect that this is happening in the current sort UI. This menu here for example:

fixtures__compound_three_primary_keys__1_001_rows_where_sorted_by_pk3_descending

And this interface: how should it indicate that a text is currently sorted numerically v.s. sorted alphabetically, and allow the user to switch from one to the other?

fixtures__sortable__201_rows_where_sorted_by_sortable_with_nulls_descending

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

For the "Sort by X" select menu case... I could automatically expand that menu to contain extra options for "Sort numerically by X" for each TEXT column in the table. That's a pretty good option.

For the action cog menu, I can add the extra options to the cog menu - and rely on the fact that the title of the page will say "Sorted numerically by colname descending".

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

This is enough of a design to build a working prototype.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

The Sort by <select> menu needs a rethink:

<select name="_sort" id="sort_by">
<option value="">Sort...</option>
{% for column in display_columns %}
{% if column.sortable %}
<option value="{{ column.name }}"{% if column.name == sort or column.name == sort_desc %} selected{% endif %}>Sort by {{ column.name }}</option>
{% endif %}
{% endfor %}
</select>

If it's going to include sort by numeric options it needs a different format - since ?_sort=colname needs to also support ?_sort=numeric_colname - but shouldn't clash with a weird column already named numeric_colname.

Can I come up with a value syntax for this that is guaranteed not to clash with a weirdly named existing column?

I think so. I could use {"type": "numeric", "column": "mycolumn"} as the value - then if a column has a name that is itself valid JSON (weird but possible) the _sort= value would be {"type": "default", "column": "{\"type\": \"numeric\", \"name\": \"column\"}"}.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Simpler option: ?_sort= column values look like this:

  • mycolumn - for sort by column
  • mycolumn$numeric - for sort by column after cast to float
  • mycolumn$morename$default - for the edge case where the column name itself contains a $ symbol

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Even better solution: use URL encoding in the parameter details. This is consistent with how ?_next= tokens work, e.g. ?_next=0.291861560261786%2Ce%2Cj.

So the format can be:

  • mycolumn
  • urlencoded-mycolumn$castname

For most columns this will look like: ?_sort=score$numeric

For columns with a $ in their name it will be ?_sort=score%24hasdollar$numeric

Problem: both $ and , are usually URL encoded anyway. I need a character which isn't encoded by default, so that I can use its encoded form to show it is part of the column name and its un-encoded form to split the cast indicator.

_ is a candidate here - not encoded by default, but can be encoded as %5F.

The other unreserved non-alphanumeric characters are -, ., _, ~.

Of these, ~ is least likely to show up in a column name. So I'll use that.

  • mycolumn
  • mycolumn~numeric
  • mycolumn%7Ewith%7Etildes~numeric

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Urgh this isn't going to work. %7E~%7E gets decoded as ~~~ so I wouldn't be able to tell the difference.

I could use double-percentage-encoding here instead. I feel like there's a simpler solution that I'm missing (and that may well be in use within Datasette already, I'm not doing great thinking this morning).

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Much easier solution: if the suffix is ~numeric then treat it as the column name sorted numerically. If the suffix is missing OR the suffix is ~default, sort without casting. Only add the ~default suffix if the column name itself contains at least one ~ symbol.

Using ~ because it doesn't need to be URL-encoded.

@simonw simonw changed the title Feature: sort by column cast to integer ?sort=colname~numeric to sort by by column cast to integer Oct 15, 2020
@simonw simonw changed the title ?sort=colname~numeric to sort by by column cast to integer ?sort=colname~numeric to sort by by column cast to real Oct 15, 2020
@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Relevant code:

sortable_columns = await self.sortable_columns_for_table(
database, table, use_rowid
)
# Allow for custom sort order
sort = special_args.get("_sort")
sort_desc = special_args.get("_sort_desc")
if not sort and not sort_desc:
sort = table_metadata.get("sort")
sort_desc = table_metadata.get("sort_desc")
if sort and sort_desc:
raise DatasetteError("Cannot use _sort and _sort_desc at the same time")
if sort:
if sort not in sortable_columns:
raise DatasetteError("Cannot sort table by {}".format(sort))
order_by = escape_sqlite(sort)
if sort_desc:
if sort_desc not in sortable_columns:
raise DatasetteError("Cannot sort table by {}".format(sort_desc))
order_by = "{} desc".format(escape_sqlite(sort_desc))

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Something to watch out for: "" empty strings cast to 0.0:

select cast("100" as real), "100", cast(null as real), cast("" as real)

cast("100" as real) "100" cast(null as real) cast("" as real)
100.0 100   0.0

https://latest.datasette.io/fixtures?sql=select+cast%28%22100%22+as+real%29%2C+%22100%22%2C+cast%28null+as+real%29%2C+cast%28%22%22+as+real%29

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

cast(nullif(colname, '') as real) can fix this - it will treat '' the same as null.

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Also need to rethink this template logic that decides if to show a column as sorted or not:

{% if column.name == sort %}
<a href="{{ path_with_replaced_args(request, {'_sort_desc': column.name, '_sort': None, '_next': None}) }}" rel="nofollow">{{ column.name }}&nbsp;▼</a>
{% else %}
<a href="{{ path_with_replaced_args(request, {'_sort': column.name, '_sort_desc': None, '_next': None}) }}" rel="nofollow">{{ column.name }}{% if column.name == sort_desc %}&nbsp;▲{% endif %}</a>
{% endif %}

@simonw
Copy link
Owner Author

simonw commented Oct 15, 2020

Prototype so far:

diff --git a/datasette/views/table.py b/datasette/views/table.py
index ea11a51..d61f8bd 100644
--- a/datasette/views/table.py
+++ b/datasette/views/table.py
@@ -497,17 +497,32 @@ class TableView(RowTableShared):
         if sort and sort_desc:
             raise DatasetteError("Cannot use _sort and _sort_desc at the same time")
 
+        def parse_sort(sort):
+            if "~" in sort:
+                if sort.endswith("~default"):
+                    col = sort.rsplit("~", 1)[0]
+                    return col, escape_sqlite(col)
+                elif sort.endswith("~numeric"):
+                    col = sort.rsplit("~", 1)[0]
+                    return col, "cast(nullif({}, '') as real)".format(escape_sqlite(col))
+                else:
+                    return sort, escape_sqlite(sort)
+            else:
+                return sort, escape_sqlite(sort)
+
         if sort:
-            if sort not in sortable_columns:
-                raise DatasetteError("Cannot sort table by {}".format(sort))
+            sort_column, sort_clause = parse_sort(sort)
+            if sort_column not in sortable_columns:
+                raise DatasetteError("Cannot sort table by {}".format(sort_column))
 
-            order_by = escape_sqlite(sort)
+            order_by = sort_clause
 
         if sort_desc:
-            if sort_desc not in sortable_columns:
-                raise DatasetteError("Cannot sort table by {}".format(sort_desc))
+            sort_column, sort_clause = parse_sort(sort_desc)
+            if sort_column not in sortable_columns:
+                raise DatasetteError("Cannot sort table by {}".format(sort_column))
 
-            order_by = "{} desc".format(escape_sqlite(sort_desc))
+            order_by = "{} desc".format(sort_clause)
 
         from_sql = "from {table_name} {where}".format(
             table_name=escape_sqlite(table),

@simonw simonw added this to the 0.51 milestone Oct 23, 2020
@simonw simonw added the medium label Oct 29, 2020
@simonw simonw modified the milestones: 0.51, Datasette 0.52 Oct 29, 2020
@simonw simonw modified the milestones: Datasette 0.52, Datasette Next Nov 28, 2020
@simonw simonw removed this from the Datasette Next milestone Dec 18, 2020
@simonw
Copy link
Owner Author

simonw commented Aug 20, 2021

Maybe ?_sort_numeric=col and ?_sort_numeric_desc=col would be better here.

@simonw
Copy link
Owner Author

simonw commented Aug 20, 2021

I could add these sorting links to the cog menu for any TEXT columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant