Introduce search across all of HexDocs #1811
Btw, I have a dump of the database already, in case someone wants to use it for a proof of concept. Just ping me elsewhere and I will send a link. We should also skip any …
@josevalim 🙋‍♂️ I'd like to compare the dump with the data I've scraped. Also, would it be possible to get access to Fastly logs to calculate co-occurrence metrics for packages? I think it can add one more useful dimension to order the search results by.
Getting access to logs is probably difficult, but the Hex team may accept a PR that adds this computation. I cannot answer for them though, so you will have to ask. :)
You could look at the dependency graph and weigh by downloads and get a crude measurement of it.
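That crude dependency-graph estimate could be sketched like this; the package lists, download counts, and geometric-mean weighting below are all invented for illustration, not taken from Hex:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: each project's dependency list plus
# per-package download counts (names and numbers invented).
projects = [
    {"deps": ["ecto", "phoenix", "jason"]},
    {"deps": ["ecto", "jason"]},
    {"deps": ["phoenix", "plug"]},
]
downloads = {"ecto": 900, "phoenix": 800, "jason": 950, "plug": 700}

def cooccurrence(projects, downloads):
    """Crude co-occurrence score: packages that appear together in the
    same dependency list, weighted by the geometric mean of downloads."""
    scores = defaultdict(float)
    for project in projects:
        for a, b in combinations(sorted(set(project["deps"])), 2):
            weight = (downloads.get(a, 0) * downloads.get(b, 0)) ** 0.5
            scores[(a, b)] += weight
    return dict(scores)

scores = cooccurrence(projects, downloads)
# ("ecto", "jason") appears in two projects, so it outranks pairs seen once.
print(max(scores, key=scores.get))  # ('ecto', 'jason')
```

The geometric mean is just one plausible weighting; raw co-download counts from logs would of course be more direct, which is why the question about Fastly came up.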
The code actually only grabs new packages and re-indexes them, since Hex can sort by updated_at. So you could run that daily and it would take seconds. One of the reasons it's so slow is that the JSON containing the indexable items …
@ruslandoga nice idea with the SQLite C function. I did it the lazy way with SQL and it's not too slow: https://github.com/jeregrine/hex-search/blob/main/lib/hex_docs_search/hex.ex#L50
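For readers unfamiliar with the "lazy way with SQL": a docs index in SQLite usually means a full-text virtual table. This is a minimal sketch (the schema and rows are made up here, and it assumes your SQLite build includes the FTS5 extension, which CPython's bundled SQLite normally does):

```python
import sqlite3

# Minimal sketch of a docs search index using SQLite's FTS5 extension.
# Schema and sample rows are invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(package, ref, title, body)")
db.executemany(
    "INSERT INTO docs VALUES (?, ?, ?, ?)",
    [
        ("ecto", "Ecto.Repo.html", "Ecto.Repo",
         "A repository maps to an underlying data store"),
        ("phoenix", "Phoenix.Router.html", "Phoenix.Router",
         "Defines a Phoenix router"),
    ],
)
# bm25() returns a relevance score; lower means more relevant in SQLite.
rows = db.execute(
    "SELECT package, title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("repository",),
).fetchall()
print(rows)  # [('ecto', 'Ecto.Repo')]
```

FTS5 handles tokenization and ranking in the database itself, which is why plain SQL can stay "not too slow" without a custom C function.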
Can you please elaborate? I'm not seeing how it would be able to estimate the co-occurrence metric.
@jeregrine oh, so you skip downloading the whole docs tar?
Didn't even know it was downloadable. :-) But yeah, I don't do that. It might be faster, at a cost of more disk/memory usage. ¯\_(ツ)_/¯
Actually, the more I think about it, nvm. It's messy.
In the current design, would this require packages to update their …
The new search functionality (assets/js) would only be present in the new …
👋 hey everyone, just checking in. Is this in progress? If so, any way I can assist? If not, I may be able to help get it off the ground :)
There is a delay because we are also investigating if it makes sense to add embeddings to the docs, so we can also use it to provide context for LLMs (such as OpenAI). I will try to post more information soon. :)
Sounds good! Thanks for your hard work. Not trying to hurry. I'm happy to wait, just want to assist if possible/warranted.
That's really good to know. I will reach out once we have an action plan, unless you are also happy to get involved in the "figure it out" process and write some JS too? :)
Yeah, I'd be very happy to be involved in any way. Cross-package search is a major win for the Ash ecosystem, and is absolutely worth me spending my time on.
I see 4 search planes:
Please empower the user.
I am WIP-ing 'pinned repos' in ex_doc. It's the most versatile. It's just the JSON version of this file: both search_data.js and search_data.json will include the package info like this. That would allow the UI to ingest the search_data.json files of the pinned repos and display the info like this. And we need to change the UI a bit, but that idea was already sketched up in this post. Pinned repos can be stored in …
It's not a big change to ex_doc. And ofc we need to keep caching and versioning as it is now in search_results_72517.js
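The ingestion step described above could look roughly like this. Note the payload shape and field names (`items`, `ref`, `title`, `doc`) are assumptions for illustration, not ExDoc's actual search_data format:

```python
import json

# Hypothetical search_data.json payloads for two pinned packages.
# The real ExDoc format differs; field names here are assumptions.
ash_data = json.loads('{"items": [{"ref": "Ash.html", "title": "Ash", "doc": "..."}]}')
phoenix_data = json.loads('{"items": [{"ref": "Phoenix.html", "title": "Phoenix", "doc": "..."}]}')

def merge_pinned(pinned):
    """Merge the search entries of every pinned package into one list,
    tagging each entry with the package it came from so the UI can
    display and filter results per package."""
    merged = []
    for package, data in pinned.items():
        for item in data["items"]:
            merged.append({**item, "package": package})
    return merged

index = merge_pinned({"ash": ash_data, "phoenix": phoenix_data})
print([e["package"] for e in index])  # ['ash', 'phoenix']
```

The merged list is what a client-side search library would then index, which is exactly where the storage-size concerns raised below come in.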
We explored this, but sometimes those files can be really large, and building an index of all of them in realtime would become very expensive. Often the resulting index was so large that it would blow up local storage, which would cause us to index them every time, making it worse.
@josevalim, I am not sure that you read my comment here: #1811 (comment). Here it is again:
I am addressing solution 4, pinned repos. In local storage we just store the list like this:
It's the user who decides which repos they want to 'pin'. The Ash search index is 104KB and it's cached in the browser cache, so for Ash framework users it will be a few bytes in local storage. Please correct me if I am missing something, as I am WIP-ing this.
Here is the architecture and the UI I propose for search: 1 - repo (the current repo) => an ex_doc feature, offline and online search. So ex_doc handles searches 1, 3, and 4. We have to have one UI.
I am just WIP-ing 3 and 4. So with 1, 3, and 4 I can do some Ash and Phoenix coding on the plane @zachdaniel ;)
A complication to discuss later: you can pin online repos and/or local repos (if they are on your HD), like mix deps can have local and remote packages. Sounds complex but can be simple.
Right. But you can imagine a new user would also want to pin Elixir itself, and we know for a fact the Elixir index was too big to cache (so we added compression). Ecto and Phoenix are also on the larger side. So I wonder if those three alone would not be enough to blow up session storage space?
Local cache is 10MB. The Elixir search index is 2MB. On the plane it's not a problem, we're loading from disk. Online we might have a cache miss; it's life :) Then the browser hits the CDN. If you want to cap everything to 10MB you can, and make it like an Amazon Kindle: tell the user they don't have more storage, with a UI like this:
I am saying let's empower the user. The persona is a dev, so it's OK if the UX is a bit technical. ---This is tangent and maybe crazy---- We could also ideate a Chrome extension UX. Don't we need one for Phoenix? A level of gamification is to track the most pinned repos, like GitHub (forks/stars). It can create another …
Interesting... However, I think we have to be a bit less optimistic. We still need session storage for other indexes. For example, imagine you fill your index with 9MB without Elixir. Now, without additional space, if you try to search the Elixir docs, it will go through the slow path and rebuild the index every time. So maybe 7MB of custom search max.

And you are right about empowering the user... but should we realistically expect users to craft their own search engines? Projects like Elixir, Phoenix, and Ash need the search to work across several repos out of the box, and I would focus on that instead. The good news is that I am quite sure your ideas could be fully explored as a separate project!

PS: in any case, I don't think this solves the airplane case either. You have the search contents but not the rendered pages themselves. You could try to rebuild them from the index, but not all information is available.
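One way to reconcile a hard cap with user-pinned indexes is least-recently-used eviction. A minimal sketch, assuming the ~10MB budget and the index sizes mentioned in this thread (the cache class and "big_pkg" are invented for illustration):

```python
from collections import OrderedDict

class IndexCache:
    """Sketch of a capped cache for per-package search indexes,
    evicting the least recently used package when over budget."""
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.entries = OrderedDict()  # package -> index size in bytes

    def put(self, package, size):
        # Re-inserting moves the package to the most-recent position.
        self.entries.pop(package, None)
        self.entries[package] = size
        # Evict oldest entries until we are back under the budget.
        while sum(self.entries.values()) > self.limit:
            self.entries.popitem(last=False)

cache = IndexCache(limit_bytes=10_000_000)  # ~10MB browser budget
cache.put("elixir", 2_000_000)              # 2MB Elixir index
cache.put("ash", 104_000)                   # 104KB Ash index
cache.put("big_pkg", 9_000_000)             # pushes the total over 10MB
print(list(cache.entries))  # ['ash', 'big_pkg'] -- elixir was evicted
```

This illustrates the trade-off in the exchange above: eviction keeps storage bounded, but an evicted package (here Elixir) falls back to the slow rebuild path on its next search.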
For the plane use case, when I am working within a project, all I need is within my_ash_project/deps. Ex_doc should reach there. Think of my_ash_project/deps as a cache for hexdocs.
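The "deps as a cache" idea amounts to enumerating which packages are already available offline. A toy sketch (the directory layout is a stand-in created on the fly, not a real Mix project):

```python
from pathlib import Path
import tempfile

# Build a toy stand-in for a Mix project's deps/ directory.
root = Path(tempfile.mkdtemp())
for dep in ["ash", "ecto", "phoenix"]:
    (root / "deps" / dep).mkdir(parents=True)

def local_packages(project_root):
    """Return the dependency names available offline under deps/,
    i.e. the candidates a unified local search could index."""
    deps = Path(project_root) / "deps"
    return sorted(p.name for p in deps.iterdir() if p.is_dir())

print(local_packages(root))  # ['ash', 'ecto', 'phoenix']
```

In a real tool, each of those directories would then be checked for buildable docs before indexing.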
I understand you want to do 2, but till then, ex_doc or a fork of it can do 1, 3, and 4. I would use it locally. My understanding is that you allow different documentation tools, so for me it's not either ex_doc or hexdocs search; it's both of them. If you decide to enforce a certain builder on hexdocs, I'll respect that. And I can use ex_doc_multirepo_search as a local book on my computer. I love to have a physical copy on my disk. Ex_doc is great and we can make it better.
Thanks
I see, that definitely feels out of scope for ExDoc then. :) I recommend exploring this on your own, something that builds the docs in the deps folder and creates a unified search interface. Bonus points if it works both online and offline. Meanwhile, let's please refocus this issue on its original description. Thank you! |
@josevalim, in that case I would suggest moving the hexdocs search feature as you envisioned it to the hexdocs repo. Here are my arguments:
I suggest we figure out the technical design/architecture of the search functionality. We have 2 products/services (ex_doc, hexdocs). For UX I would suggest the Apple approach: one UX across physically separate, complementary devices. One search experience through ex_doc and hexdocs; the user will not notice the discontinuity.
That's historically how we have implemented features in Hexdocs that are used by ExDoc, and that's most likely how we plan to implement this one too: Hexdocs provides a generic interface for others to hook into and ExDoc simply acts as one of the clients.
It all depends if the Hexdocs team wants to maintain a search service. If not, then a third service will consume Hexdocs packages (Hexdocs then works as "storage") and ExDoc then talks to said service. The feature is listed here because most of the work will be done by the ExDoc team anyway. |
Interesting. An ex_doc with a plugin architecture would be cool (embedding the search form and search results), so that ex_doc wouldn't have a code dependency on hexdocs. And integrating different search engines (including Google) would be super easy and free. And one day, AI search within the ex_doc UX. You wouldn't have to know the ex_doc code base to implement a search plugin.
👋 I'm interested in working on this and would love to collaborate with anyone else currently involved! I'll start by revisiting the SQLite approaches and checking if there are better options available now (Typesense, Meilisearch, etc.).
Hi @ruslandoga! At the moment, we are thinking about going with Postgres. We will compute our own text embeddings using machine learning models and store them with pgvector. What are your thoughts?
👋 @josevalim oh right, I forgot about your comment above on wanting to add semantic search... Sorry! I should probably reread this thread. With SQLite I kept the embeddings in a BLOB, loaded them all into memory on startup, and used https://github.com/elixir-nx/hnswlib as the index. That was too complicated and a bit resource-intensive; pgvector would likely make it much simpler and more efficient :) But I was rather wondering about the basic global search, like a global autocomplete. Is that still planned? Would Postgres be used for that as well?
Yes, the goal would be to use PG for that as well.
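For intuition on the pgvector/hnswlib discussion: both rank results by vector distance, typically cosine distance. Here it is computed brute-force in plain Python over toy embeddings (the doc names and vectors are invented; pgvector and hnswlib add storage and approximate indexing on top of this same computation):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means identical direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy embeddings; in practice these come from a text-embedding model.
docs = {
    "Ecto.Repo": [0.9, 0.1, 0.0],
    "Phoenix.Router": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # embedding of a query like "database repository"
best = min(docs, key=lambda name: cosine_distance(query, docs[name]))
print(best)  # Ecto.Repo
```

Brute force is fine for small corpora; an HNSW or IVF index only becomes necessary when the number of embedded entries makes an exhaustive scan too slow.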
The goal of this feature is to provide search and autocompletion across packages. We will add a new configuration, called related_deps, which is a list of package names we find related. We will improve both autocomplete and search to use this.
To power this feature, we will build a new service that does both autocompletion and search, based on a SQLite3 database. We have proofs of concept from:
The SQLite3 database can be built weekly and it currently takes about 1 hour. It should also include the entries for Elixir (and Erlang once they migrate to ExDoc). We can make it a live service later too (by keeping our own database, perhaps PG, and then dumping it daily). There is an open question whether we want to host the SQLite3 builder on Hex.pm.
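The incremental approach mentioned earlier in the thread (only re-indexing packages whose updated_at is newer than the last run, since Hex can sort by it) could be sketched as follows; the package records here are invented stand-ins for what the Hex API returns:

```python
from datetime import datetime

# Invented stand-ins for package metadata returned by the Hex API.
packages = [
    {"name": "ecto", "updated_at": "2024-01-10T00:00:00Z"},
    {"name": "ash", "updated_at": "2024-01-20T00:00:00Z"},
]

def needs_reindex(packages, last_run):
    """Return the names of packages updated since the last build,
    i.e. the only ones a daily incremental run must re-index."""
    cutoff = datetime.fromisoformat(last_run.replace("Z", "+00:00"))
    return [
        p["name"]
        for p in packages
        if datetime.fromisoformat(p["updated_at"].replace("Z", "+00:00")) > cutoff
    ]

print(needs_reindex(packages, "2024-01-15T00:00:00Z"))  # ['ash']
```

This is what would turn the weekly hour-long full build into a daily job that takes seconds.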