Allow accessing the entire row of selected values in gr.DataFrame #9128

Merged
abidlabs merged 21 commits into main from df-original-index
Aug 20, 2024

Conversation

abidlabs
Member

@abidlabs abidlabs commented Aug 15, 2024

Closes: #7601
Closes: #7127

This PR adds a .row_value attribute to gr.SelectData for gr.DataFrame.

Example:

import gradio as gr
import pandas as pd

with gr.Blocks() as demo:

    sample_data = {
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3']
                   }
    df = pd.DataFrame(sample_data)
    df_widget = gr.Dataframe(df, interactive=False)

    def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):
        print("index", evt.index)
        print("value", evt.value)
        print("row_value", evt.row_value)
        return

    df_widget.select(df_select_callback, inputs=[df_widget])

if __name__ == "__main__":
    demo.launch()

@gradio-pr-bot
Collaborator

gradio-pr-bot commented Aug 15, 2024

🪼 branch checks and previews

Name Status URL
Spaces ready! Spaces preview
Website ready! Website preview
Storybook ready! Storybook preview
🦄 Changes detected! Details

Install Gradio from this PR

pip install https://gradio-pypi-previews.s3.amazonaws.com/83f33c80b5d4144aaba33f20340c84368da2d5d5/gradio-4.41.0-py3-none-any.whl

Install Gradio Python Client from this PR

pip install "gradio-client @ git+https://github.com/gradio-app/gradio@83f33c80b5d4144aaba33f20340c84368da2d5d5#subdirectory=client/python"

Install Gradio JS Client from this PR

npm install https://gradio-npm-previews.s3.amazonaws.com/83f33c80b5d4144aaba33f20340c84368da2d5d5/gradio-client-1.5.0.tgz

@gradio-pr-bot
Collaborator

gradio-pr-bot commented Aug 15, 2024

🦄 change detected

This Pull Request includes changes to the following packages.

Package Version
@gradio/dataframe minor
@gradio/utils minor
gradio minor
  • Maintainers can select this checkbox to manually select packages to update.

With the following changelog entry.

Allow accessing the entire row of selected values in gr.DataFrame

Maintainers or the PR author can modify the PR title to modify this entry.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@dwipper

dwipper commented Aug 15, 2024

@abidlabs Based on my testing, works great! Thanks!

@freddyaboulton
Collaborator

freddyaboulton commented Aug 16, 2024

@abidlabs - seeing this weird behavior when sorting more than once - I think the original_indices are being refreshed on every sort so the original_index is not the actual original index.

https://www.loom.com/share/f8e799ef89c24e578979cf4e7703ccd9?sid=45e1e768-a361-4991-bad4-667684df436b

I would be more in favor of passing the entire row value as value in the select event. That seems more useful when you don't have access to the original dataframe (e.g. it was created in an event). Of course, that's breaking, but we could do it for 5.0?

Code

import gradio as gr
import pandas as pd
from pathlib import Path

abs_path = Path(__file__).parent.absolute()

df = pd.read_json(str(abs_path / "assets/leaderboard_data.json"))
invisible_df = df.copy()

COLS = [
    "T",
    "Model",
    "Average ⬆️",
    "ARC",
    "HellaSwag",
    "MMLU",
    "TruthfulQA",
    "Winogrande",
    "GSM8K",
    "Type",
    "Architecture",
    "Precision",
    "Merged",
    "Hub License",
    "#Params (B)",
    "Hub ❤️",
    "Model sha",
    "model_name_for_query",
]
ON_LOAD_COLS = [
    "T",
    "Model",
    "Average ⬆️",
    "ARC",
    "HellaSwag",
    "MMLU",
    "TruthfulQA",
    "Winogrande",
    "GSM8K",
    "model_name_for_query",
]
TYPES = [
    "str",
    "markdown",
    "number",
    "number",
    "number",
    "number",
    "number",
    "number",
    "number",
    "str",
    "str",
    "str",
    "str",
    "bool",
    "str",
    "number",
    "number",
    "bool",
    "str",
    "bool",
    "bool",
    "str",
]
NUMERIC_INTERVALS = {
    "?": pd.Interval(-1, 0, closed="right"),
    "~1.5": pd.Interval(0, 2, closed="right"),
    "~3": pd.Interval(2, 4, closed="right"),
    "~7": pd.Interval(4, 9, closed="right"),
    "~13": pd.Interval(9, 20, closed="right"),
    "~35": pd.Interval(20, 45, closed="right"),
    "~60": pd.Interval(45, 70, closed="right"),
    "70+": pd.Interval(70, 10000, closed="right"),
}
MODEL_TYPE = [str(s) for s in df["T"].unique()]
Precision = [str(s) for s in df["Precision"].unique()]

# Searching and filtering
def update_table(
    hidden_df: pd.DataFrame,
    columns: list,
    type_query: list,
    precision_query: str,
    size_query: list,
    query: str,
):
    filtered_df = filter_models(hidden_df, type_query, size_query, precision_query)  # type: ignore
    filtered_df = filter_queries(query, filtered_df)
    df = select_columns(filtered_df, columns)
    return df

def search_table(df: pd.DataFrame, query: str) -> pd.DataFrame:
    return df[(df["model_name_for_query"].str.contains(query, case=False))]  # type: ignore

def select_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # We use COLS to maintain sorting
    filtered_df = df[[c for c in COLS if c in df.columns and c in columns]]
    return filtered_df  # type: ignore

def filter_queries(query: str, filtered_df: pd.DataFrame) -> pd.DataFrame:
    final_df = []
    if query != "":
        queries = [q.strip() for q in query.split(";")]
        for _q in queries:
            _q = _q.strip()
            if _q != "":
                temp_filtered_df = search_table(filtered_df, _q)
                if len(temp_filtered_df) > 0:
                    final_df.append(temp_filtered_df)
        if len(final_df) > 0:
            filtered_df = pd.concat(final_df)
            filtered_df = filtered_df.drop_duplicates(  # type: ignore
                subset=["Model", "Precision", "Model sha"]
            )

    return filtered_df

def filter_models(
    df: pd.DataFrame,
    type_query: list,
    size_query: list,
    precision_query: list,
) -> pd.DataFrame:
    # Show all models
    filtered_df = df

    type_emoji = [t[0] for t in type_query]
    filtered_df = filtered_df.loc[df["T"].isin(type_emoji)]
    filtered_df = filtered_df.loc[df["Precision"].isin(precision_query + ["None"])]

    numeric_interval = pd.IntervalIndex(
        sorted([NUMERIC_INTERVALS[s] for s in size_query])  # type: ignore
    )
    params_column = pd.to_numeric(df["#Params (B)"], errors="coerce")
    mask = params_column.apply(lambda x: any(numeric_interval.contains(x)))  # type: ignore
    filtered_df = filtered_df.loc[mask]

    return filtered_df

demo = gr.Blocks(css=str(abs_path / "assets/leaderboard_data.json"))
with demo:
    gr.Markdown("""Test Space of the LLM Leaderboard""", elem_classes="markdown-text")

    with gr.Tabs(elem_classes="tab-buttons") as tabs:
        with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
            with gr.Row():
                with gr.Column():
                    with gr.Row():
                        search_bar = gr.Textbox(
                            placeholder=" 🔍 Search for your model (separate multiple queries with `;`) and press ENTER...",
                            show_label=False,
                            elem_id="search-bar",
                        )
                    with gr.Row():
                        shown_columns = gr.CheckboxGroup(
                            choices=COLS,
                            value=ON_LOAD_COLS,
                            label="Select columns to show",
                            elem_id="column-select",
                            interactive=True,
                        )
                with gr.Column(min_width=320):
                    filter_columns_type = gr.CheckboxGroup(
                        label="Model types",
                        choices=MODEL_TYPE,
                        value=MODEL_TYPE,
                        interactive=True,
                        elem_id="filter-columns-type",
                    )
                    filter_columns_precision = gr.CheckboxGroup(
                        label="Precision",
                        choices=Precision,
                        value=Precision,
                        interactive=True,
                        elem_id="filter-columns-precision",
                    )
                    filter_columns_size = gr.CheckboxGroup(
                        label="Model sizes (in billions of parameters)",
                        choices=list(NUMERIC_INTERVALS.keys()),
                        value=list(NUMERIC_INTERVALS.keys()),
                        interactive=True,
                        elem_id="filter-columns-size",
                    )
                    selected_data = gr.Json()

            leaderboard_table = gr.components.Dataframe(
                value=df[ON_LOAD_COLS],  # type: ignore
                headers=ON_LOAD_COLS,
                datatype=TYPES,
                elem_id="leaderboard-table",
                interactive=False,
                visible=True,
            )

            # Dummy leaderboard for handling the case when the user uses backspace key
            hidden_leaderboard_table_for_search = gr.components.Dataframe(
                value=invisible_df[COLS],  # type: ignore
                headers=COLS,
                datatype=TYPES,
                visible=False,
            )
            search_bar.submit(
                update_table,
                [
                    hidden_leaderboard_table_for_search,
                    shown_columns,
                    filter_columns_type,
                    filter_columns_precision,
                    filter_columns_size,
                    search_bar,
                ],
                leaderboard_table,
            )
            for selector in [
                shown_columns,
                filter_columns_type,
                filter_columns_precision,
                filter_columns_size,
            ]:
                selector.change(
                    update_table,
                    [
                        hidden_leaderboard_table_for_search,
                        shown_columns,
                        filter_columns_type,
                        filter_columns_precision,
                        filter_columns_size,
                        search_bar,
                    ],
                    leaderboard_table,
                    queue=True,
                )
            def select_data(data: gr.SelectData):
                return {"index": data.index, "original_index": data.original_index,
                        "model_name": df.iloc[data.original_index[0]]['model_name_for_query'],
                        }
            leaderboard_table.select(select_data, None, selected_data)


if __name__ == "__main__":
    demo.launch()

@abidlabs
Member Author

I would be more in favor of passing the entire row value as value in the select event. Seems like that would be more useful in the event you don't have access to the original dataframe (it was created in an event). Of course, that's breaking but we could do for 5.0?

Can you explain what you mean by not having access to the original dataframe? You can always get the value of the dataframe by passing it in as an input component.

I like the idea of passing in the original indices over passing in a row in case we later introduce functionality that allows re-ordering the columns as well

@freddyaboulton
Collaborator

You can always get the value of the dataframe by passing it in as an input component.

Yea that's true!

@abidlabs
Member Author

@abidlabs - seeing this weird behavior when sorting more than once - I think the original_indices are being refreshed on every sort so the original_index is not the actual original index.

I'll check to see what the issue is, thanks
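One way the reported symptom could arise, sketched in plain Python rather than taken from the PR's frontend code: if each sort is computed against the previously sorted view and the resulting positions are stored as the "original" indices, they stop pointing into the initial data after the second sort. The `rows` data and the `sort_order` helper are hypothetical and exist only for illustration.

```python
def sort_order(rows, key, reverse=False):
    """Return the positions of `rows` in sorted order (a permutation of 0..n-1)."""
    return sorted(range(len(rows)), key=lambda i: rows[i][key], reverse=reverse)


# Hypothetical data in its initial (unsorted) order.
rows = [
    {"model": "a", "score": 3},
    {"model": "b", "score": 1},
    {"model": "c", "score": 2},
]

# First sort: positions refer to the initial data, so they are true original indices.
first = sort_order(rows, "score")
view = [rows[i] for i in first]  # what the UI shows after the first sort

# Second sort computed against the already-sorted view: the positions now
# refer to `view`, not to `rows`, so treating them as original indices is wrong.
stale = sort_order(view, "model", reverse=True)

# Mapping the view positions back through the first permutation recovers
# indices into the initial data.
recovered = [first[i] for i in stale]

print("top row shown in UI:", view[stale[0]]["model"])
print("row found via stale index:", rows[stale[0]]["model"])
print("row found via recovered index:", rows[recovered[0]]["model"])
```

Whether the component actually tracked indices this way is not confirmed in the thread; the sketch only shows why indices refreshed on every sort would no longer be original indices.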

@pngwn
Member

pngwn commented Aug 17, 2024

Can you explain what you mean by not having access to the original dataframe? You can always get the value of the dataframe by passing it in as an input component.

This does generally create an enormous payload tho, far from ideal imo.

@abidlabs
Member Author

In cases where a user doesn't have access to the original dataframe, yes, but otherwise it's just sending the tuple of indices. If we were to send the entire row, it'd be a bigger payload than just sending the tuple in all cases, and it might not be future-proof in case we introduce mechanisms to reorder columns at some point.
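To put rough numbers on the payload trade-off discussed here, the following sketch compares the serialized size of three options: the index tuple alone, the index plus the selected row, and the index plus the whole table (as when the dataframe is passed as an input). The dictionary keys and table contents are made up for illustration and are not Gradio's actual wire format.

```python
import json

# Hypothetical table: 1000 rows by 10 columns of short strings.
n_rows, n_cols = 1000, 10
table = [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]
selected = (42, 3)  # (row, column) of the clicked cell

# Option 1: only the index tuple.
index_payload = json.dumps({"index": selected})

# Option 2: the index plus the values of the selected row.
row_payload = json.dumps({"index": selected, "row_value": table[selected[0]]})

# Option 3: the index plus the entire table.
full_payload = json.dumps({"index": selected, "data": table})

for name, payload in [
    ("index only", index_payload),
    ("index + row", row_payload),
    ("index + full table", full_payload),
]:
    print(f"{name}: {len(payload)} characters")
```

The row payload grows with the number of columns, while the full-table payload grows with rows times columns, which is why the second case matters most for tall tables.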

@abidlabs abidlabs marked this pull request as draft August 17, 2024 21:29
@abidlabs
Member Author

abidlabs commented Aug 19, 2024

After thinking about it some more, I think that @freddyaboulton's suggestion of sending the updated row values makes more sense than sending the original index. The reason is that the concept of "original index" is not well-defined for interactive dataframes (e.g. if a user has modified the value of a cell, or has inserted/deleted rows/columns, what should the original index reflect?), whereas a row_value and col_value should be well-defined in all cases.

@dwipper if you have access to the sorted row_value and col_value, will that work for your use case?

@dwipper

dwipper commented Aug 20, 2024

@abidlabs It's interesting. In 4.41 it appears the index_num = evt.index[0] is returning the original index value. This would have worked for my use case when I created a shadow DF that had the original DF, and was looking up a unique reference id in that shadow DF. Since that didn't work, I moved the reference id into a column of the DF, and now the record_id=df.iat[index_num, 6] is returning the original index vs the DF record that was clicked....did this change in some recent release?

@abidlabs
Member Author

hmm @dwipper I can't think of any recent change to the dataframe that would have caused this.

@dwipper

dwipper commented Aug 20, 2024

@abidlabs To clarify, on the df.select event, index_num = evt.index[0] provides the sorted row index. But when I use df.iat[index_num, 2] to try to get a cell value in that row of the sorted df, the lookup seems to reference the unsorted index values and returns the cell value from the unsorted row. I'm thinking this isn't the expected/desired behavior?

Running this modification of your code above in 4.41 will show you the issue:

import gradio as gr
import pandas as pd

with gr.Blocks() as demo:
    sample_data = {
                    'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3']
                   }
    df = pd.DataFrame(sample_data)
    df_widget = gr.Dataframe(df, interactive=False)

    def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):
        print(evt.index)
        index_num = evt.index[0]
        print(df.iat[index_num, 2])
        return

    df_widget.select(df_select_callback, inputs=[df_widget])

if __name__ == "__main__":
    demo.launch()

@dwipper

dwipper commented Aug 20, 2024

@abidlabs My bigger picture use case is this:

I have a list of data in a database table. The dataframe shows a subset of the columns in the table. When the user clicks on a row in the dataframe, I need to go back to the table and get additional fields to display in the UI for the row the user clicked on. To do that, I have a hidden record_id column in the dataframe to get back to the record in the table. Due to the above behavior of df.iat, if the dataframe is sorted, the incorrect record_id value is returned, so the wrong record and data from the table are returned and displayed in the UI.

So at least in my use case, the way df.iat works is the issue. Looking at the Pandas docs, I can't see how to access the sorted dataframe. If there were a function evt.selected_row that returned all the cells in the selected row as a list, that would work for my use case.
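The use case above can be sketched without any positional lookup, assuming the select event exposes the clicked row as a list (what the PR description calls `evt.row_value`). The table schema, the `RECORD_ID_COL` position, and the `fetch_details` helper are hypothetical; an in-memory SQLite database stands in for the real table, and the list `clicked_row` stands in for the event's row value.

```python
import sqlite3

# Hypothetical position of the hidden record_id column within the displayed row.
RECORD_ID_COL = 2


def fetch_details(conn, row_value):
    """Look up extra fields for the clicked row using its record_id cell."""
    # Cell values may arrive as strings, so convert before querying (assumption).
    record_id = int(row_value[RECORD_ID_COL])
    cur = conn.execute(
        "SELECT name, notes FROM records WHERE id = ?", (record_id,)
    )
    return cur.fetchone()


# Stand-in for the real database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT, notes TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(10, "alpha", "first record"), (20, "beta", "second record")],
)

# In a Gradio select handler this list would come from evt.row_value.
clicked_row = ["beta", "B1", 20]
details = fetch_details(conn, clicked_row)
print(details)
```

Because the record_id travels with the clicked row itself, the lookup does not depend on whether the table is currently sorted in the UI.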

@abidlabs abidlabs changed the title Allow accessing the original index in gr.DataFrame Allow accessing the entire row and column of selected value in gr.DataFrame Aug 20, 2024
@abidlabs abidlabs marked this pull request as ready for review August 20, 2024 19:11
Collaborator

@freddyaboulton freddyaboulton left a comment

This works well @abidlabs ! Two comments:

  • Should we send the current headers in a separate key? In the event columns are reordered.
  • Tables are typically taller than they are wide. I'm a bit worried sending the entire column over is a bit too much data. Perhaps that can be made opt-in?

This fixes the given issues so will approve

@abidlabs
Member Author

Should we send the current headers in a separate key? In the event columns are reordered.
Tables are typically taller than they are wide. I'm a bit worried sending the entire column over is a bit too much data. Perhaps that can be made opt-in?

Good points @freddyaboulton, for now I'll only send row_value and we can think about how to handle column values if/when we introduce mechanisms for changing columns. We might need to modify .select() to add a flag or something else

@abidlabs abidlabs changed the title Allow accessing the entire row and column of selected value in gr.DataFrame Allow accessing the entire row of selected values in gr.DataFrame Aug 20, 2024
@abidlabs abidlabs enabled auto-merge (squash) August 20, 2024 21:03
@abidlabs abidlabs merged commit 747013b into main Aug 20, 2024
21 checks passed
@abidlabs abidlabs deleted the df-original-index branch August 20, 2024 21:08
@abidlabs
Member Author

Thanks for the feedback everyone!

@dwipper

dwipper commented Aug 22, 2024

@abidlabs The new function works great!

Based on some related testing, I figured out that the root issue here is that when the user sorts the gr.Dataframe by clicking on the column header arrow, that sort isn't reflected in the gr.Dataframe value. So when the gr.Dataframe value is passed to a function, it has the original index order, not the sorted order shown in the UI. This is apparently why, as mentioned above, the df.iat lookup uses the original index order, not the sorted index order.

Is this a bug or the expected behavior?

@abidlabs
Member Author

I see, yes that is expected behavior. Sorting a DataFrame only changes the "view" in the UI, but it doesn't actually change the underlying value, so when you access the value in your Python function, you'll get the original dataframe
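The distinction between the sorted "view" and the underlying value can be illustrated with plain lists. This is only a model of the behavior described in this comment, not Gradio's implementation: `display_order` is a hypothetical stand-in for the order the UI shows after the user sorts column A in descending order.

```python
# Underlying value, in the order the Python function receives it.
rows = [
    ["A0", "B0", "C0"],
    ["A1", "B1", "C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

# The UI sorts by column A, descending; only the display order changes.
display_order = sorted(range(len(rows)), key=lambda i: rows[i][0], reverse=True)

# The user clicks the top row of the sorted table.
clicked_display_index = 0

# What the user actually sees in that position.
displayed_row = rows[display_order[clicked_display_index]]

# What a positional lookup on the unsorted value returns for the same index.
looked_up_row = rows[clicked_display_index]

print("displayed:", displayed_row)
print("looked up:", looked_up_row)
```

The two rows differ whenever the display order is not the identity, which is the mismatch reported earlier in the thread.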

@dwipper

dwipper commented Aug 22, 2024

@abidlabs Thanks for the clarification. Any way you can think around that, i.e. getting the sorted DF? Does a gr.Dataset() function differently?

While getting the row the user clicked on is really helpful, in my app, I want the user to be able to scroll through the list with a VCR control (see following image). It works fine on the unsorted list, but if the user sorts the list, the scrolling doesn't work properly since it's based on the original index.

[Image: screenshot attached in the original comment]
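The thread does not give an answer to this question. One possible workaround, sketched here purely as an assumption, is to do the sorting in Python so that the application owns the display order, and to step through that order by position. The `records` data and the `step` helper are hypothetical.

```python
def step(position, delta, n):
    """Move `position` by `delta` and clamp the result to the range 0..n-1."""
    return max(0, min(n - 1, position + delta))


# Hypothetical records as they come from the database.
records = [
    {"id": 10, "name": "gamma"},
    {"id": 20, "name": "alpha"},
    {"id": 30, "name": "beta"},
]

# Sort in Python, so the order used for navigation matches what is displayed.
ordered = sorted(records, key=lambda r: r["name"])

# "Next" button: advance one position within the sorted order.
pos = 0
pos = step(pos, +1, len(ordered))
current = ordered[pos]
print(current)
```

With this approach the table would be re-rendered from `ordered` after each sort, so the previous/next controls and the visible rows stay in agreement.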

JonSingleton added a commit to JonSingleton/StyleTTS-WebUI that referenced this pull request Oct 12, 2024
-Fixed bug that arises from the unexpected behavior of gradio dataframes not passing an updated index when sorting by clicking column headers (gradio-app/gradio#9128)
-Implemented a "send to generation tab" button from the history page.
Development

Successfully merging this pull request may close these issues.

gr.Dataframe not sending sorted dataframe to callback
Entire Row Data in Dataframe Select Event
5 participants