Fixes issue #1557: Changes to support switching to BigQuery. #1568
Conversation
The queries are in Postgres syntax, so they need some tweaking into BigQuery format; that is likely the reason it's not working. I also feel the submission query might perform a bit better with BigQuery and may not need to create a temp table (needs some experimentation).
I put this up as a draft right now because @zqian wanted to see it. No need to review any code here as it's really just a work in progress. I'm planning on keeping the queries in Postgres syntax and calling them using EXTERNAL_QUERY. I think the reason it's currently not working is that parameter replacement doesn't work for EXTERNAL_QUERY in BigQuery. I've only spent about 1 day on this. If we rewrote it to use the tables in BigQuery via context_store_entity and context_store_keymap it would incur a query cost. I'm not sure what that cost would be.
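For context, a rough sketch of what calling one of these Postgres queries through EXTERNAL_QUERY from the Python client might look like (the connection ID and table name are placeholders, not the real ones):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical Cloud SQL connection ID and table name.
# The inner Postgres SQL is handed to EXTERNAL_QUERY as a plain string literal,
# which is why BigQuery named parameters (e.g. @course_id) don't appear to be
# substituted inside it; the value has to be baked into the string instead.
course_id = 12345
sql = f"""
SELECT *
FROM EXTERNAL_QUERY(
  "my-project.us.context_store_connection",
  "SELECT id, name FROM course WHERE id = {course_id}"
)
"""
for row in client.query(sql).result():
    print(row)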
From what I understood, the BigQuery cost for running Context Store queries shouldn't be much. The reason, as I understood from the Unizin folks, is that the storage for Context Store data is not as big as the BigQuery event store, so we don't have the query processing cost we have with the event store.
IMHO, if we are switching to the BigQuery backend, the queries should also be in BigQuery format; otherwise we are still dealing with 2 different systems (Postgres and BigQuery). But I don't fully understand why parameter replacement doesn't work, so I'm just saying what I'm thinking. The last time I did this kind of update I was dealing with the Postgres-to-BigQuery connector, and it was not fun to update.
I added some more thoughts on this in Slack. I felt this was possibly the easiest way for a first pass. If we wanted to push this forward and do a rewrite to 100% BigQuery, we might decide to use the Canvas (CD2) tables instead of the context_store. It just feels better to only be going to one source for everything, even if BigQuery is mostly just a "proxy" for the Postgres data.
Thanks for the feedback here. Because of complications around named parameters and the temporary table, I've decided to take your suggestion and explore rewriting this directly in BigQuery. Currently it runs as far as the assignment table, so it's getting there!
Yes, all the tables other than submission should be simple: just a syntax change and probably some changes to the way the variables are queried. For the submission query, with BigQuery we have to check whether we want to go the temp table route; I feel BigQuery might handle that well. But the only way to test is in MyLA Beta since it has a lot of courses.
Doesn't seem like a temp table is necessary anymore. We can just create it with a WITH clause and a CTE, which BigQuery now seems to treat as a fast temporary result. The temp table was only necessary back in Postgres.
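Roughly, the approach looks like this (dataset, table, and column names are made up for illustration; the CTE plays the role the temp table used to play in the Postgres version of the submission query):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset/table/column names for the rewritten submission query.
query = """
WITH submission_base AS (
  SELECT assignment_id, user_id, score
  FROM `my-project.context_store.submission`
  WHERE course_id = @course_id
)
SELECT assignment_id, COUNT(*) AS submission_count, AVG(score) AS avg_score
FROM submission_base
GROUP BY assignment_id
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("course_id", "INT64", 12345)]
)
rows = client.query(query, job_config=job_config).result()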
The queries currently aren't working but it feels like it might be making progress!
context_store_keymap Currently gets as far as assignment table.
Adding in tbyte calculation to the new bq method
I will review this today or tomorrow
dashboard/common/db_util.py
Outdated
@@ -181,7 +182,7 @@ def get_last_cronjob_run() -> Union[datetime, None]:
         return None


-def get_canvas_data_date() -> Union[datetime, None]:
+def get_canvas_data_date() -> Union[datetime.datetime, None]:
     if not settings.DATABASES.get('DATA_WAREHOUSE', {}).get('IS_UNIZIN'):
Does this line need an update after the removal of the 'DATA_WAREHOUSE' setting?
Yes, this should probably also just be removed, as it's useless code now and will always return the default of {}. I also see this in cron.py.
logger.info("************ total status=" + status + "\n")
if settings.LRS_IS_BIGQUERY:
    total_tbytes_billed = self.total_bytes_billed / 1024 / 1024 / 1024 / 1024
    # $6.25 per TB as of Feb 2024 https://cloud.google.com/bigquery/pricing
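As a worked example with assumed numbers, the calculation above converts bytes billed into an estimated dollar cost like this:

# Assumed figure for illustration: a job that billed ~3 TB.
BQ_PRICE_PER_TB = 6.25  # $ per TB billed (on-demand rate, Feb 2024)

total_bytes_billed = 3_298_534_883_328
total_tbytes_billed = total_bytes_billed / 1024 / 1024 / 1024 / 1024  # = 3.0
estimated_cost = total_tbytes_billed * BQ_PRICE_PER_TB                # = 18.75
print(f"{total_tbytes_billed:.2f} TB billed, approx ${estimated_cost:.2f}")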
We need to watch the cost. I have 20 MyLA courses in localhost; the initial run with this PR cost $18, and a subsequent run cost $0.10.
I only have 5 courses and it cost $24; the next run cost $0.10. I would expect it to cost less.
// This is the total row count
resources_access: 2115
resources: 248
submissions: 1759
assignments: 95
user: 100
course: 5
The cost is likely high because you have courses without a start date, without a term (or with the Default Term/No Term), or with an old term in your database. It tries to pull from the event store from that date onward on an empty DB.
It might also be a bug in this code with retrieving the dates for terms, which wouldn't be good. I was looking at my database for course 556393 and there was no start or end time in the local DB, even though it has Fall 2022 as the term on the course. I'll have to look into that, so I wouldn't run this until then.
The event store is getting huge now, so doing a full query scans multiple TB of data.
I think we should have a safety case in here so it doesn't retrieve more than X months of data from the event_store unless explicitly directed to. Generally there won't ever be a case to populate old course data. Querying a year is still a high cost, but would likely be about 1/5 the cost of a full run.
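A minimal sketch of that kind of guard, assuming a hypothetical EVENT_STORE_MAX_MONTHS setting and a nullable course start date pulled from the local DB:

from datetime import datetime, timedelta
from typing import Optional
from zoneinfo import ZoneInfo

# Hypothetical setting: never pull more than this many months from the event store.
EVENT_STORE_MAX_MONTHS = 6

def clamp_event_store_start(course_start: Optional[datetime]) -> datetime:
    """Clamp the start date for the event store query so a course with no
    start date (or a very old term) can't trigger a multi-TB scan."""
    earliest_allowed = datetime.now(tz=ZoneInfo("UTC")) - timedelta(days=30 * EVENT_STORE_MAX_MONTHS)
    if course_start is None or course_start < earliest_allowed:
        return earliest_allowed
    return course_start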
I think we should create an issue if that is the case and fix the term start/end time issue.
I feel this PR should be run from MyLA Beta to see how long it takes to run, since MyLA Beta has about 50+ courses.
@@ -1,7 +1,7 @@
 from datetime import datetime
 import logging
 from collections import namedtuple
-from typing import Any, Dict, List, Union
+from typing import Any, Dict, List, Optional, Union
 from zoneinfo import ZoneInfo
There is a redundant import, from sqlalchemy.orm import sessionmaker; this could be removed.
dashboard/cron.py
Outdated
def setup_bigquery(self):
    # Instantiates a client
    self.bigquery_client = bigquery.Client()
    # self.bigquery_client = bigquery.Client.from_service_account_json(settings.BIGQUERY_SERVICE_ACCOUNT_JSON)
Remove this line since it is commented-out code.
I remember I had this in there because I was thinking of switching the way we configure BigQuery to use an explicit JSON value in the hjson rather than the bq_cred.json file. This is probably a separate issue, but it's something I noticed is supported in the newer version of the library.
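For reference, a rough sketch of what that could look like, assuming a hypothetical setting that holds the service account JSON as a string in the hjson:

import json
from google.cloud import bigquery
from django.conf import settings

# Hypothetical setting name; the JSON credentials live in the hjson/ENV
# instead of a bq_cred.json file on disk.
service_account_info = json.loads(settings.BIGQUERY_SERVICE_ACCOUNT_INFO)
bigquery_client = bigquery.Client.from_service_account_info(service_account_info)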
The PR seems to be working and fetching results as expected. I have some minor comments. But I feel this PR should be run against MyLA Beta courses to see how it performs, especially with the submission query.
Do you mean to use the course set that's on the MyLA Beta server? That could be done locally by just exporting the course table. It should run about the same locally or on that server, and it uses the same data set.
I don't think I care whether it's done locally or on the Beta server. If you run the courses from MyLA Beta or Test (Test has more courses) and see how the cron performs in the same way as current prod, then that should be good.
That makes sense. I was only testing it locally on a few courses, but even that ran significantly faster now. I should be able to try the full list of courses.
You should only test this with a current-term course for now, as otherwise it will pull down too much data on an empty database.
Nice work @jonespm for pulling this off in a short amount of time.
Please check the Codacy messages. Otherwise, the PR is good to go!
Thanks, I didn't notice the checks. They were valid complaints about unused imports, so it should pass now.
I'm not sure if I'll have time to run this on the test environment today, so I'm planning on waiting until Monday to merge and test it out then.
Cron job runs successfully on test Jenkins.
The queries currently aren't working but it feels like it might be making progress!
metadata
user
assignment_groups
assignment
assignment_weight
term
course
resource
submission
timestamp without time zone to keep dates correct