
Feature audit based on logs history #3267

Open
1 of 4 tasks
balajialg opened this issue Feb 15, 2022 · 14 comments
Assignees
Labels
enhancement Issues around improving existing functionality

Comments

@balajialg
Contributor

balajialg commented Feb 15, 2022

Summary

Thanks to @yuvipanda's nudge, I started working on the feature matrix document to segment our instructors based on the types of features they use. My initial understanding is that we can classify instructors into three archetypes: those with foundational, intermediary, or complex use cases. I also spent some time in the doc mapping which features correspond to which user archetypes. I am open to the team's input on whether the classification makes sense.

I would also like the team's input on whether it is possible to retrieve usage metrics for a particular feature (thanks @ryanlovett for nudging me to think in this direction). Just like the Python popularity dashboard, which tracks the usage of Python libraries, is it possible for us to track the features most used by our instructors? We could use this information during semester onboarding to tailor the feature demo to their prior usage.

Feature List

  • Jupyter Classic Notebook
  • JupyterLab
  • RetroLab
  • RStudio
  • Jupyter R Kernel
  • R Dashboarding (Shiny)
  • nbgitpuller extensions for Chrome/Mozilla
    File Management
  • Real-time file sharing using Syncthing
  • Real-Time Collaboration
  • File Archiving
  • Shared Volumes
    User Management
  • Admin Access
  • Secure GitHub authentication
    Application
  • Linux Desktop environment
  • Syncthing
  • Persistent Storage
    • Postgres DB
    • SQLite
    3rd-party Libraries
  • Otter-Grader
  • Installing lab extensions (JupyterLab, RetroLab)
  • Creating custom kernels (Conda environments, etc.)
    High-Performance Computing
  • Dask-based clusters
  • Auto-scaling via calendar

User Stories

  • As an infrastructure admin, I would like to know the data about the number of users (and if possible their hub names, course names) using a specific datahub feature so that I can use that information to classify whether their usage is foundational, intermediary, or advanced.

Tasks

@balajialg balajialg self-assigned this Feb 15, 2022
@balajialg balajialg added the enhancement Issues around improving existing functionality label Feb 15, 2022
@ryanlovett
Collaborator

I think the logs within the Google Cloud console should show the frequency of URLs like /tree, /rstudio, /lab, etc. They would also include gitpuller, syncthing, and desktop URLs. Basically, any feature that appears in the URL can be tracked.

However, course info is not in the URL, except in cases where it happens to be part of a git repo name. The latter doesn't follow any strict format, though; instructors can name their repos however they like.
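Once log lines are exported, those URL fragments could be tallied with a short script. A sketch (the path patterns and helper names here are assumptions, not the actual proxy routes; adjust them to whatever the real logs contain):

```python
import re
from collections import Counter

# Map URL path fragments to the features they suggest.
# NOTE: these patterns are illustrative guesses based on the routes
# mentioned above, not verified proxy paths.
FEATURE_PATTERNS = {
    "classic_notebook": re.compile(r"/tree\b"),
    "jupyterlab": re.compile(r"/lab\b"),
    "rstudio": re.compile(r"/rstudio\b"),
    "desktop": re.compile(r"/desktop\b"),
    "gitpuller": re.compile(r"/hub/user-redirect/git-pull"),
}

def tally_features(log_lines):
    """Count how many log lines mention each feature's URL fragment."""
    counts = Counter()
    for line in log_lines:
        for feature, pattern in FEATURE_PATTERNS.items():
            if pattern.search(line):
                counts[feature] += 1
    return counts

# Synthetic example lines:
lines = [
    "GET /user/alice/lab 200",
    "GET /user/bob/rstudio 200",
    "GET /user/carol/lab/workspaces 200",
]
print(tally_features(lines))
```

Grouping by user or hub instead of just counting lines would need the username/namespace parsed out of each entry as well.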

@balajialg
Contributor Author

balajialg commented Feb 18, 2022

@ryanlovett Amazing to know that feature-related metrics can be tracked. I understand the complexity of retrieving course-related information. I will follow up on ways to retrieve feature-related information in the Google Cloud console.

@balajialg
Contributor Author

balajialg commented Mar 4, 2022

Next Steps from March Sprint Planning Meeting:

  • @balajialg to formulate questions for the feature usage!

  • @balajialg to work with @felder to get required data for different feature usage related questions

@balajialg balajialg assigned felder and unassigned balajialg Mar 4, 2022
@balajialg
Contributor Author

balajialg commented Mar 7, 2022

@felder - When you have time, can you please let me know whether we can get analytics data to answer the questions below? It would help build a narrative around Datahub's value proposition.

  • How many instructors use Jupyter Classic Notebook, JLab, and Retro Lab?
  • How many instructors use R Studio, Jupyter R Kernel, and R Shiny?
  • How many instructors use Admin functionality and Shared volume?
  • How many instructors use Otter Grader/Gofer grader for auto-grading?
  • How many instructors use JupyterLab plugins? What types of plugins are most used?
  • How many instructors use persistent storage? (Postgres DB, SQLite DB, etc..)

If instructor-level data is not available, how would you like these questions to be framed so that we can get data that are closely relevant to the question being asked?

@balajialg
Contributor Author

balajialg commented Mar 12, 2022

@felder Sharing context from our Slack conversation: instructor- and course-specific data cannot be retrieved from the GCP logs that are currently stored. We would need to figure out another mechanism to fetch that data (possibly nbgitpuller links).

I will set up some time with you the week after to figure out the near-term scope of the data to be retrieved and options we can explore to answer the highlighted questions in the longer run.

@balajialg
Contributor Author

balajialg commented Apr 7, 2022

Qualitative Insights based on preliminary log analysis from the last 30 days:

  • R Kernel: Datahub and Biology hub users actively use the R kernel
  • Lab: Data 100 hub users use JupyterLab extensively
  • Remote Desktop: Astro and EECS hub users use remote desktop environments regularly
  • Shared Directory: Datahub, Data 100, Astro, and Biology hub users use the shared directory
  • Shared read-write: Data 100 hub users use a shared read-write directory. More manual exploration is required across other hubs.
  • Conda Environment: the Conda environment created for the Genomics class taught by Priya Moorjani has not been used this semester
  • Admin Functionality: the admin feature is actively used by Datahub, Astro, Data 8, Public Health, EECS, Data 100, Dlab, workshop, ISchool, Highschool, Julia, Data 102, Stat 159, Biology, R, Prob 140, and Stat 20 hub users
  • Syncthing: does not see much usage across any of the hubs
  • Shiny: no usage across any of the hubs

@ryanlovett
Collaborator

@balajialg Interesting, thanks! Do you know what admin is being used for? Is it just to view the list of people, or is it being used to stop/start servers too?

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@ryanlovett I searched

`resource.type="k8s_container" resource.labels.cluster_name="fall-2019" resource.labels.namespace_name="-prod" resource.labels.container_name="notebook" textPayload="oauth2/authorize?"`

in the Logs Explorer to see how many users actively click the "access server" option to open other users' hub instances. From that, it appears that almost all of the hub users noted in the above comment access other users' instances. Let me know if querying "oauth2/authorize?" in the text payload is the right way to detect users clicking the access-server option.

Here is the link to the gcp log explorer with the search query

@yuvipanda
Contributor

@balajialg I'm not sure, but oauth2/authorize may also be used each time a user logs in, regardless of whether it's used with admin access or not.

Another way is to look at just the hub logs and look for uses of the admin panel there by URL, with resource.labels.container_name="hub"
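For example, building on the query shared earlier in this thread, a Logs Explorer filter along these lines could surface hub-container requests to the admin panel (the cluster name is copied from the earlier query; treat all values as placeholders for the actual deployment):

```
resource.type="k8s_container"
resource.labels.cluster_name="fall-2019"
resource.labels.container_name="hub"
textPayload=~"/hub/admin"
```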

@yuvipanda
Contributor

The other point is that these are logs from only the last 30 days, so we can't make inferences about longer-term usage patterns. We can start saving the nginx logs, though, to make that possible.

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@yuvipanda Completely agree with you! I am treating the above points as potential hypotheses to test against long-term log data for possible trends. My other hypothesis is that, except for a few variations, this data should correlate highly with the long-term data (considering this is a snapshot from mid-semester). But I could be completely wrong about this.

Searching the hub logs, I am seeing entries for all hubs that I am not able to make sense of. Should I interpret these logs as the admin access feature being widely used by instructors/GSIs across hubs over the past month, or are some of these entries configuration-based and not the result of a user action? Check the log results here

@yuvipanda
Contributor

@balajialg What log lines do you get when you access the hub admin page yourself? Basically, we need to look at that and derive regexes and filters from that info. Some post-processing may also be needed.

Everything with wp-admin or webadmin or similar is bots trying to exploit our hub on the chance it's a WordPress instance or a similar piece of software with known vulnerabilities :D

@yuvipanda
Contributor

I think a basic process should be to:

  1. Do the thing you're trying to measure
  2. See what logs show up, if any.
  3. Be very careful in vetting the hypothesis that what you see in (2) always shows up when you do (1) but at no other time. This can be a bit difficult but definitely doable.
  4. Document the process as we go along so we don't lose track.
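A toy sketch of step 3, with entirely synthetic timestamps and a hypothetical helper name: record when you performed the action yourself, then split the matching log entries into those inside a known action window and those outside it.

```python
from datetime import datetime, timedelta

def vet_signature(log_entries, action_windows):
    """Split (timestamp, line) log entries into those inside a known
    action window (expected) and those outside (potential false
    positives for the hypothesis)."""
    inside, outside = [], []
    for ts, line in log_entries:
        if any(start <= ts <= end for start, end in action_windows):
            inside.append(line)
        else:
            outside.append(line)
    return inside, outside

# Synthetic example: we opened /hub/admin ourselves between 10:00 and 10:05.
t0 = datetime(2022, 4, 8, 10, 0)
windows = [(t0, t0 + timedelta(minutes=5))]
entries = [
    (t0 + timedelta(minutes=1), "GET /hub/admin 200"),  # during our test
    (t0 + timedelta(hours=2), "GET /hub/admin 200"),    # unexplained
]
inside, outside = vet_signature(entries, windows)
# Any entries in `outside` mean the signature also fires at other
# times, so the hypothesis from step (3) needs refining.
```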

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@yuvipanda I looked at the hub logs with the textPayload=~"\s/hub/admin\s" search query. The resulting logs are similar to the network logs produced when a user accesses the admin portal in Datahub. It requires a bit of post-processing to draw meaningful insights, but I can see that logs come from multiple hubs like Datahub, Data 8, Data 100, etc.

Thanks for detailing the process! I am spending a lot of time fine-tuning the search query (learning regex on the side) to ensure that the results correspond only to the action being measured. It is time-intensive.

Is the nginx log structure similar to the current logs, or would it require fine-tuning the search query once more based on the resulting logs?
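For the post-processing step, something like this sketch could count /hub/admin hits per hub from a JSON export of the log entries (the field layout mimics a Cloud Logging JSON export; the entries and function name here are synthetic illustrations, not real data):

```python
import re
from collections import Counter

# Regex mirroring the textPayload search used above.
ADMIN_RE = re.compile(r"\s/hub/admin(\s|/|$)")

def admin_hits_per_hub(entries):
    """Count log entries mentioning /hub/admin, grouped by the hub's
    Kubernetes namespace."""
    counts = Counter()
    for entry in entries:
        if ADMIN_RE.search(entry.get("textPayload", "")):
            hub = entry["resource"]["labels"].get("namespace_name", "unknown")
            counts[hub] += 1
    return counts

# Synthetic entries; real exports carry many more fields.
entries = [
    {"textPayload": "GET /hub/admin 200",
     "resource": {"labels": {"namespace_name": "datahub-prod"}}},
    {"textPayload": "GET /hub/home 200",
     "resource": {"labels": {"namespace_name": "datahub-prod"}}},
    {"textPayload": "GET /hub/admin 200",
     "resource": {"labels": {"namespace_name": "data100-prod"}}},
]
print(admin_hits_per_hub(entries))
```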

@balajialg balajialg changed the title Feasibility of Feature Popularity Dashboard (Similar to Python Popularity Dashboard) Feature audit based on logs history Jul 22, 2022