feat(webscraper): add implementation from masa-oracle#12
Conversation
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #12 +/- ##
==========================================
+ Coverage 56.20% 57.35% +1.14%
==========================================
Files 9 6 -3
Lines 290 340 +50
==========================================
+ Hits 163 195 +32
- Misses 107 124 +17
- Partials 20 21 +1 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
mcamou
left a comment
There was a problem hiding this comment.
Approving this considering that the "WIP" status mentioned will be addressed in a separate PR. Otherwise ignore the approval :)
| Error string `json:"error"` | ||
| Data interface{} `json:"data"` | ||
| Error string `json:"error"` | ||
| Data []byte `json:"data"` |
There was a problem hiding this comment.
Why this change? We're unmarshalling twice (once to a []byte, and from there to the actual data)
There was a problem hiding this comment.
the rationale behind this change it's that when using interface{} it forces the caller to guess the underlying type: while this is not an urgent issue it made look the code quite ugly to see, and eventually added more complexity. I found out the switch to increase readibility and avoided an extra step.
when using interfaces the result was a marshal of an interface, that had two consequences:
- The receiver would have to unmarshal it to known kind, so who populate the result could inadvertitely nest interfaces and make it un-easy to unmarshal
- If the result let's say was a string (like most of the cases) the API would always return something like
"data"including the", becauseinterface{}was marshaled to a string type
| type Section struct { | ||
| Title string `json:"title"` // Title is the heading text of the section. | ||
| Paragraphs []string `json:"paragraphs"` // Paragraphs contains all the text content of the section. | ||
| Images []string `json:"images"` // Images storing base64 - maybe!!? |
There was a problem hiding this comment.
We will probably need more than just the base64 contents, at the very least a MIME type
There was a problem hiding this comment.
oh yes, I think there is much room for improvement here. At this stage I'm not touching any of the scraper code. I've actually just copy-pasted this part from the existing implementation: https://github.com/masa-finance/masa-oracle/blob/main/pkg/scrapers/web/web.go
| // } | ||
| // logrus.WithField("result", string(res)).Info("Scraping completed") | ||
| // }() | ||
| func scrapeWeb(uri []string, depth int) ([]byte, error) { |
There was a problem hiding this comment.
Very small nit: shouldn't depth be a uint?
There was a problem hiding this comment.
yep makes totally sense 👍 just note I've copied the existing code as-is and wasn't my intention to change it at all
Signed-off-by: mudler <mudler@localai.io>
feat(webscraper): add implementation from masa-oracle
This PR adds the webscraper code from masa-oracle, adapted to run as a separate job.
WIP - want to refactor out the golang client around a job result so it's easier to consume as API, basically addressing the comment in the code.