
Feature audit based on logs history #3267

Open
1 of 4 tasks
balajialg opened this issue Feb 15, 2022 · 14 comments
Assignees
Labels
enhancement Issues around improving existing functionality

Comments

@balajialg
Contributor

balajialg commented Feb 15, 2022

Summary

Thanks to @yuvipanda's nudge, I started working on the feature matrix document to segment our instructors based on the types of features they use. My initial understanding is that we can classify instructors into three archetypes: those with foundational, intermediary, or complex use cases. I also spent some time in the doc mapping which features correspond to which user archetypes. I am open to the team's input on whether the classification makes sense.

I would also like the team's input on whether it is possible to retrieve usage metrics for a particular feature (thanks @ryanlovett for nudging me to think in this direction). Just like the Python popularity dashboard, which tracks the usage of Python libraries, is it possible for us to track the features most used by our instructors? We could use this information during semester onboarding to tailor the feature demo to their prior usage.

Feature List

  • Jupyter Classic Notebook
  • JupyterLab
  • RetroLab
  • RStudio
  • Jupyter R Kernel
  • R Dashboarding (Shiny)
  • nbgitpuller extensions for Chrome/Mozilla
    File Management
  • Real-time file sharing using Syncthing
  • Real-Time Collaboration
  • File Archiving
  • Shared Volumes
    User Management
  • Admin Access
  • Secure GitHub authentication
    Application
  • Linux Desktop environment
  • Syncthing
  • Persistent Storage
    • Postgres DB
    • SQLite
    3rd-party Libraries
  • Otter-Grader
  • Installing lab extensions (JupyterLab, RetroLab)
  • Creating custom kernels (Conda environments, etc.)
    High-Performance Computing
  • Dask-based clusters
  • Auto-scaling via calendar

User Stories

  • As an infrastructure admin, I would like to know the data about the number of users (and if possible their hub names, course names) using a specific datahub feature so that I can use that information to classify whether their usage is foundational, intermediary, or advanced.

Tasks

@balajialg balajialg self-assigned this Feb 15, 2022
@balajialg balajialg added the enhancement Issues around improving existing functionality label Feb 15, 2022
@ryanlovett
Collaborator

I think the logs within the Google Cloud console should show the frequency of URLs like /tree, /rstudio, /lab, etc. They would also include gitpuller, syncthing, and desktop URLs. Basically, any feature that appears in the URL can be tracked.

However, course info is not in the URL, except in cases where it happens to be part of a git repo name. The latter doesn't follow any strict format, though; instructors can name their repos however they like.
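Once log lines are exported, those URL fragments could be tallied with a short script. A sketch (the path patterns and helper names here are assumptions, not the actual proxy routes; adjust them to whatever the real logs contain):

```python
import re
from collections import Counter

# Map URL path fragments to the features they suggest.
# NOTE: these patterns are illustrative guesses based on the routes
# mentioned above, not verified proxy paths.
FEATURE_PATTERNS = {
    "classic_notebook": re.compile(r"/tree\b"),
    "jupyterlab": re.compile(r"/lab\b"),
    "rstudio": re.compile(r"/rstudio\b"),
    "desktop": re.compile(r"/desktop\b"),
    "gitpuller": re.compile(r"/hub/user-redirect/git-pull"),
}

def tally_features(log_lines):
    """Count how many log lines mention each feature's URL fragment."""
    counts = Counter()
    for line in log_lines:
        for feature, pattern in FEATURE_PATTERNS.items():
            if pattern.search(line):
                counts[feature] += 1
    return counts

# Synthetic example lines:
lines = [
    "GET /user/alice/lab 200",
    "GET /user/bob/rstudio 200",
    "GET /user/carol/lab/workspaces 200",
]
print(tally_features(lines))
```

Grouping by user or hub instead of just counting lines would need the username/namespace parsed out of each entry as well.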

@balajialg
Contributor Author

balajialg commented Feb 18, 2022

@ryanlovett Amazing to know that feature-related metrics can be tracked. I understand the complexity of retrieving course-related information. I will follow up on ways to retrieve feature-related information in the Google Cloud console.

@balajialg
Contributor Author

balajialg commented Mar 4, 2022

Next Steps from March Sprint Planning Meeting:

  • @balajialg to formulate questions for the feature usage!

  • @balajialg to work with @felder to get required data for different feature usage related questions

@balajialg balajialg assigned felder and unassigned balajialg Mar 4, 2022
@balajialg
Contributor Author

balajialg commented Mar 7, 2022

@felder - When you have time, can you please let me know whether we can get analytics data to answer the questions below? It would help build a narrative around Datahub's value proposition.

  • How many instructors use Jupyter Classic Notebook, JLab, and Retro Lab?
  • How many instructors use R Studio, Jupyter R Kernel, and R Shiny?
  • How many instructors use Admin functionality and Shared volume?
  • How many instructors use Otter Grader/Gofer grader for auto-grading?
  • How many instructors use JupyterLab plugins? What types of plugins are most used?
  • How many instructors use persistent storage? (Postgres DB, SQLite DB, etc..)

If instructor-level data is not available, how would you like these questions to be framed so that we can get data that are closely relevant to the question being asked?

@balajialg
Contributor Author

balajialg commented Mar 12, 2022

@felder Sharing context from our Slack conversation: instructor- and course-specific data cannot be retrieved from the GCP logs that are currently stored. We would need to figure out another mechanism to fetch that data (possibly nbgitpuller links).

I will set up some time with you the week after to figure out the near-term scope of the data to be retrieved and options we can explore to answer the highlighted questions in the longer run.

@balajialg
Contributor Author

balajialg commented Apr 7, 2022

Qualitative Insights based on preliminary log analysis from the last 30 days:

  • R Kernel: Datahub and Biology hub users actively use the R kernel
  • Lab: Data 100 hub users use JupyterLab extensively
  • Remote Desktop: Astro and EECS hub users use remote desktop environments regularly
  • Shared Directory: Datahub, Data 100, Astro, and Biology hub users use the shared directory
  • Shared read-write: Data 100 hub users use a shared read-write directory. More manual exploration is required across other hubs.
  • Conda Environment: the Conda environment created for the Genomics class taught by Priya Moorjani has not been used this semester
  • Admin Functionality: the admin feature is actively used by Datahub, Astro, Data 8, Public Health, EECS, Data 100, Dlab, workshop, ISchool, Highschool, Julia, Data 102, Stat 159, Biology, R, Prob 140, and Stat 20 hub users
  • Syncthing: does not see much usage across any of the hubs
  • Shiny: no usage across any of the hubs

@ryanlovett
Collaborator

@balajialg Interesting, thanks! Do you know what admin is being used for? Is it just to view the list of people, or is it being used to stop/start servers too?

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@ryanlovett I searched

`resource.type="k8s_container" resource.labels.cluster_name="fall-2019" resource.labels.namespace_name="-prod" resource.labels.container_name="notebook" textPayload="oauth2/authorize?"`

in the Logs Explorer to see how many users actively click the "access server" option to open other users' hub instances. From that, it appears that almost all of the hub users noted in the above comment access other users' instances. Let me know if querying "oauth2/authorize?" in the text payload is the right way to detect users clicking the access-server option.

Here is the link to the gcp log explorer with the search query

@yuvipanda
Contributor

@balajialg I'm not sure, but oauth2/authorize may also be used each time a user logs in, regardless of whether it's used with admin access or not.

Another way is to look at just the hub logs and look for uses of the admin panel there by URL, with resource.labels.container_name="hub"
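For example, building on the query shared earlier in this thread, a Logs Explorer filter along these lines could surface hub-container requests to the admin panel (the cluster name is copied from the earlier query; treat all values as placeholders for the actual deployment):

```
resource.type="k8s_container"
resource.labels.cluster_name="fall-2019"
resource.labels.container_name="hub"
textPayload=~"/hub/admin"
```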

@yuvipanda
Contributor

The other point is that these are logs from only the last 30 days, so we can't make inferences about longer-term usage patterns. We can start saving the nginx logs, though, to make that possible.

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@yuvipanda Completely agree with you! I am treating the above points as potential hypotheses to test against long-term log data for possible trends. My other hypothesis is that, except for a few variations, this data should correlate highly with the long-term data (considering this is a snapshot from mid-semester). But I could be completely wrong about this.

Searching the hub logs, I am seeing entries for all hubs that I am not able to make sense of. Should I interpret these logs as the admin access feature being widely used by instructors/GSIs across hubs over the past month, or are some of these entries configuration-based and not the result of a user action? Check the log results here

@yuvipanda
Contributor

@balajialg What log lines do you get when you access the hub admin page yourself? Basically, we need to look at that and derive regexes and filters from that info. Some post-processing may also be needed.

Everything with wp-admin or webadmin or similar is bots trying to exploit our hub on the chance it's a WordPress instance or a similar piece of software with known vulnerabilities :D

@yuvipanda
Contributor

I think a basic process should be to:

  1. Do the thing you're trying to measure
  2. See what logs show up, if any.
  3. Be very careful in vetting the hypothesis that what you see in (2) always shows up when you do (1) but at no other time. This can be a bit difficult but definitely doable.
  4. Document the process as we go along so we don't lose track.
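A toy sketch of step 3, with entirely synthetic timestamps and a hypothetical helper name: record when you performed the action yourself, then split the matching log entries into those inside a known action window and those outside it.

```python
from datetime import datetime, timedelta

def vet_signature(log_entries, action_windows):
    """Split (timestamp, line) log entries into those inside a known
    action window (expected) and those outside (potential false
    positives for the hypothesis)."""
    inside, outside = [], []
    for ts, line in log_entries:
        if any(start <= ts <= end for start, end in action_windows):
            inside.append(line)
        else:
            outside.append(line)
    return inside, outside

# Synthetic example: we opened /hub/admin ourselves between 10:00 and 10:05.
t0 = datetime(2022, 4, 8, 10, 0)
windows = [(t0, t0 + timedelta(minutes=5))]
entries = [
    (t0 + timedelta(minutes=1), "GET /hub/admin 200"),  # during our test
    (t0 + timedelta(hours=2), "GET /hub/admin 200"),    # unexplained
]
inside, outside = vet_signature(entries, windows)
# Any entries in `outside` mean the signature also fires at other
# times, so the hypothesis from step (3) needs refining.
```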

@balajialg
Contributor Author

balajialg commented Apr 8, 2022

@yuvipanda I looked at the hub logs with the textPayload=~"\s/hub/admin\s" search query. The resulting logs are similar to the network logs produced when a user accesses the admin portal in Datahub. It requires a bit of post-processing to draw meaningful insights, but I can see that logs come from multiple hubs like Datahub, Data 8, Data 100, etc.

Thanks for detailing the process! I am spending a lot of time fine-tuning the search query (learning regex on the side) to ensure that the results correspond only to the action being measured. It is time-intensive.

Is the nginx log structure similar to the current logs, or would it require fine-tuning the search query once more based on the resulting logs?
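For the post-processing step, something like this sketch could count /hub/admin hits per hub from a JSON export of the log entries (the field layout mimics a Cloud Logging JSON export; the entries and function name here are synthetic illustrations, not real data):

```python
import re
from collections import Counter

# Regex mirroring the textPayload search used above.
ADMIN_RE = re.compile(r"\s/hub/admin(\s|/|$)")

def admin_hits_per_hub(entries):
    """Count log entries mentioning /hub/admin, grouped by the hub's
    Kubernetes namespace."""
    counts = Counter()
    for entry in entries:
        if ADMIN_RE.search(entry.get("textPayload", "")):
            hub = entry["resource"]["labels"].get("namespace_name", "unknown")
            counts[hub] += 1
    return counts

# Synthetic entries; real exports carry many more fields.
entries = [
    {"textPayload": "GET /hub/admin 200",
     "resource": {"labels": {"namespace_name": "datahub-prod"}}},
    {"textPayload": "GET /hub/home 200",
     "resource": {"labels": {"namespace_name": "datahub-prod"}}},
    {"textPayload": "GET /hub/admin 200",
     "resource": {"labels": {"namespace_name": "data100-prod"}}},
]
print(admin_hits_per_hub(entries))
```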

@balajialg balajialg changed the title Feasibility of Feature Popularity Dashboard (Similar to Python Popularity Dashboard) Feature audit based on logs history Jul 22, 2022