Skip to content

Update Table.GenerateByPage to handle empty tables being returned #484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

bgribaudo
Copy link
Contributor

Per Table.GenerateByPage's Helper Functions description, this method should call getNextPage until that callback returns null, then will combine all returned pages into a single table.

However, when the existing Table.GenerateByPage returns an empty table, its actual behavior differs from the above description (see below examples). If the callback returns an empty table, Table.GenerateByPage may output 0 rows or an extra, all-null row--neither of which seems desirable or in alignment with the above description.

This PR addresses these behavior anomalies. (Note: It would be good for someone with internal Power Query knowledge to validate this PR to ensure that it is done in a way that aligns with PQ's lazy paradigm and that it does not trigger extra streaming.)

Demo of Existing's Behavior Anomalies

let
    // Existing version of this method
    Table.GenerateByPage = 
        (getNextPage as function) as table =>
            let        
                listOfPages = List.Generate(
                    () => getNextPage(null),            // get the first page of data
                    (lastPage) => lastPage <> null,     // stop when the function returns null
                    (lastPage) => getNextPage(lastPage) // pass the previous page to the next function call
                ),
                // concatenate the pages together
                tableOfPages = Table.FromList(listOfPages, Splitter.SplitByNothing(), {"Column1"}),
                firstRow = tableOfPages{0}?
            in
                // if we didn't get back any pages of data, return an empty table
                // otherwise set the table type based on the columns of the first page
                if (firstRow = null) then
                    Table.FromRows({})
                // check for empty first table
                else if (Table.IsEmpty(firstRow[Column1])) then
                    firstRow[Column1]
                else
                    Value.ReplaceType(
                        Table.ExpandTableColumn(tableOfPages, "Column1", Table.ColumnNames(firstRow[Column1])),
                        Value.Type(firstRow[Column1])
                    )
    ,
    PageFunction = (previous) => 
        let
            /*
            // Actual: In this scenario, 2 rows are output, the 2nd holding a null column value because Response2 contained 0 rows.
            // Expected: Only 1 row (from Response1) should have been returned.
            Response1 = #table({"A"},{{1}}),
            Response2 = #table({"A"},{})
            */
            // Actual: In this scenario, 0 rows are output. Response1 contained no rows so Response2's data is skipped
            // Expected: 1 row returned (row from Response2)
            Response1 = #table({"A"},{}),
            Response2 = #table({"A"},{{1}})
        in
            if previous = null then Response1 
            else if previous = Response1 then Response2
            else null,
    Result = Table.GenerateByPage(PageFunction)
in
    Result

@CurtHagenlocher
Copy link
Contributor

We specifically can't use Table.Combine, as that will effectively break the streaming behavior for most cases.

FWIW, we somehow ended up with an older implementation as the "boilerplate" for this function. A slightly newer definition -- albeit one which still doesn't fix the "empty row" problem -- looks like this:

Table.GenerateByPage = (
    getNextPage as function,
    optional tableType as type
) as table =>
    let
        listOfPages = List.Generate(
            () => getNextPage(null),
            (lastPage) => lastPage <> null,
            (lastPage) => getNextPage(lastPage)
        ),
        tableOfPages = Table.FromList(listOfPages, Splitter.SplitByNothing(), {"Column1"}),
        firstRow = tableOfPages{0}?,
        keys = if tableType = null then Table.ColumnNames(firstRow[Column1])
            else Record.FieldNames(Type.RecordFields(Type.TableRow(tableType))),
        appliedType = if tableType = null then Value.Type(firstRow[Column1]) else tableType
    in
        if tableType = null and firstRow = null then
            Table.FromRows({})
        else
            Value.ReplaceType(
                Table.ExpandTableColumn(tableOfPages, "Column1", keys),
                appliedType
            )

What's better about it is that when you know the table type, it avoids the duplicate request for the first page.

@bgribaudo
Copy link
Contributor Author

@CurtHagenlocher,

Thanks for that revised logic! I was planning to look into the effects of the first-row-get-type logic today--but the revision you provided should save me that effort. :-)

Is switching from Table.Combine to filtering listOfPages using Table.IsEmpty any better from the streaming perspective (see PR revision made a few minutes ago)?

Without this change, `try Table.GenerateByPage(each error "bad", type table[A=any])` won't catch the `error "bad"` raised here (which simulates the first data page fetch attempt raising an error).
@bgribaudo
Copy link
Contributor Author

@CurtHagenlocher, do you like either of the last two commits ("Adjusting error propagation" or "Reverting last change to avoid triggering a row fetch")?

@brigadir13
Copy link

Per Table.GenerateByPage's Helper Functions description, this method should call `` until that callback returns null, then will combine all returned pages into a single table.

However, when the existing Table.GenerateByPage returns an empty table, its actual differs from the above description (see below examples). If the callback returns an empty table, Table.GenerateByPage may output 0 rows or an extra, all-null row--neither of which seems desirable or in alignment with the above description.

This PR addresses these anomalies. (Note: It would be good for someone with internal Power Query knowledge to validate this PR to ensure that it is done in a way that aligns with lazy paradigm and that it does not trigger extra streaming.)

Demo of Anomalies

let
    // Existing version of this method
    Table.GenerateByPage = 
        ( as function) as table =>
            let        
                = List.Generate(
                    () => (null),            // get the first page of data
                    () => <> null,     // stop when the function returns null
                    () => () // pass the previous page to the next function call
                ),
                // concatenate the pages together
                = Table.FromList(, Splitter.SplitByNothing(), {""}),
                = {0}?
            in
                // if we didn't get back any pages of data, return an empty table
                // otherwise set the table type based on the columns of the first page
                if ( = null) then
                    Table.FromRows({})
                // check for empty first table
                else if (Table.IsEmpty([])) then
                    []
                else
                    Value.ReplaceType(
                        Table.ExpandTableColumn(, "", Table.ColumnNames([])),
                        Value.Type([])
                    )
    ,
    = (previous) => 
        let
            /*
            // Actual: In this scenario, 2 rows are output, the 2nd holding a null column value because contained 0 rows.
            // Expected: Only 1 row (from ) should have been returned.
            = #table({"A"},{{1}}),
            = #table({"A"},{})
            */
            // Actual: In this scenario, 0 rows are output. contained no rows so data is skipped
            // Expected: 1 row returned (row from )
            = #table({"A"},{}),
            = #table({"A"},{{1}})
        in
            if previous = null then 
            else if previous = then 
            else null,
    Result = Table.GenerateByPage(@# )
in
    Result
  • po

@CurtHagenlocher
Copy link
Contributor

The problem with filtering each page using Table.IsEmpty as a generic solution is that it might be an expensive check. Consider the scenario where each page is produced by a SQL query. It's possible that adding Table.IsEmpty could force that query to be run twice per page. This could be mitigated by buffering the page first, but that has drawbacks as well.

On the whole, I think that M doesn't have the right primitives to let this function be written generically and efficiently inside M. At one point I was looking to add this as a standard library function in C#, but that was over ten years ago and I no longer remember why I got pushback. I think when we run into a situation internally where "empty pages" is a potential issue, we're usually able to add a filter after the Table.GenerateByPage to filter out the empty rows. But that, too, is not generic because it depends on the fact that "real" rows won't ever be empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants