Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large schema, Low-cardinality columns and Enum type #20

Open
ansarizafar opened this issue Sep 10, 2024 · 2 comments
Open

Large schema, Low-cardinality columns and Enum type #20

ansarizafar opened this issue Sep 10, 2024 · 2 comments

Comments

@ansarizafar
Copy link

Adding schema of all tables in prompt can exceed LLM context limit hence not viable. How PremSQL handles schema with large number of tables? Is there a strategy to only add schema of relevant tables in prompt?

Low-cardinality refers to columns with few unique values. Low-cardinality column values are typically status flags, Boolean values, or major classifications such as gender and state code. An example of a data table column with low-cardinality would be a CUSTOMER table with a column named NEW_CUSTOMER. Enumerated types (ENUM) are widely used in structured attributes to restrict the values of a text type to carry only one or more of a pre-defined set of permitted values.

How Low-cardinality columns are handled and how PremSQL ensures mapping ENUM attribute values to a permitted one?

@Anindyadeep
Copy link
Member

Adding schema of all tables in prompt can exceed LLM context limit hence not viable. How PremSQL handles schema with large number of tables? Is there a strategy to only add schema of relevant tables in prompt?

Not right now, because the context size is model dependent if you are using Prem-1B-SQL then the context size is 2048 and if you are using chatGPT then context size is very huge. However, internally we will be coming up with better strategies to which tables to choose / efficient schema selection in the later releases.

Low-cardinality refers to columns with few unique values. Low-cardinality column values are typically status flags, Boolean values, or major classifications such as gender and state code. An example of a data table column with low-cardinality would be a CUSTOMER table with a column named NEW_CUSTOMER. Enumerated types (ENUM) are widely used in structured attributes to restrict the values of a text type to carry only one or more of a pre-defined set of permitted values.

Could you give me an example of this?

To summarize, in this current release, we are very much researching on the modelling side of things, not very much on the core library. However in next releases, most of the focus would be on PremSQL pipelines, where the focus would be into providing mechanisms like better schema selection (if number of tables are large), using embedding models inside PremSQL for semantic based searches (if required), function calling etc.

If you want to contribute to this project let us know. We can discuss.

@ansarizafar
Copy link
Author

ansarizafar commented Sep 11, 2024

I am very much interested in text2SQL and doing research these days. There are many libs that can generate simple SQL but I am yet to find a scalable and reliable text2SQL solution that can generate complex SQL queries. In my experience the issues I have mentioned above are a big reason for wrong SQL generation. Please check this project/paper it looks promising. https://github.com/stanford-oval/suql

For large number of tables, We can store the schema of each table and gloden queries in an embedded vector/graph database (eg. Kuzudb). Perform a vector search and find relevant golden queries based on user's question, send user question and found golden queries to LLM and Ask LLM to generate SQL for user question. use generated SQL to find relevant table schema with vector search, send the selected table schemas to LLM and ask to revise the generated SQL.

I would love to contribute but I am not a Python developer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants