feat: Complete LinkedIn Profile Fetching Integration #116
Closed
alvin-reyes wants to merge 134 commits into main
This lays the groundwork for the main tee-worker component. It is composed of a simple HTTP server that acts as a job server, a client to interact with it, and the scaffolding required to run tests and build signed binaries.
Signed-off-by: mudler <mudler@localai.io>
tee-worker initial implementation
chore(refactor): move scraper type to a constant
This code isn't currently used; it was only used by the initial implementation.
Signed-off-by: mudler <mudler@localai.io>
* feat(jobserver): allow specifying global configuration via env file
  As an example, we provide the webscraper with a WEBSCRAPER_BLACKLIST environment variable, which contains a comma-separated list of URLs to blacklist during scraping. The JobConfiguration is a generic map[string]interface{} that can be populated at the top level. Jobs unmarshal it as JSON to map the relevant fields in the configuration.
* add .env.example
---------
Signed-off-by: mudler <mudler@localai.io>
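The configuration flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `webScraperConfig` struct, the `webscraper_blacklist` JSON key, and the helpers `splitList` and `loadBlacklist` are all hypothetical names chosen for the example. Only `WEBSCRAPER_BLACKLIST` and the `map[string]interface{}` shape of `JobConfiguration` come from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// JobConfiguration mirrors the generic map described in the commit message.
type JobConfiguration map[string]interface{}

// webScraperConfig is a hypothetical job-side view of the configuration.
type webScraperConfig struct {
	Blacklist []string `json:"webscraper_blacklist"`
}

// splitList turns a comma-separated value into trimmed, non-empty entries.
func splitList(raw string) []string {
	out := []string{}
	for _, item := range strings.Split(raw, ",") {
		if item = strings.TrimSpace(item); item != "" {
			out = append(out, item)
		}
	}
	return out
}

// Unmarshal round-trips the map through JSON so each job can pick out
// only the fields it cares about.
func (jc JobConfiguration) Unmarshal(v interface{}) error {
	data, err := json.Marshal(jc)
	if err != nil {
		return fmt.Errorf("marshal job configuration: %w", err)
	}
	if err := json.Unmarshal(data, v); err != nil {
		return fmt.Errorf("unmarshal job configuration: %w", err)
	}
	return nil
}

// loadBlacklist builds the configuration from a raw env value (for example
// the contents of WEBSCRAPER_BLACKLIST) and decodes it into the job config.
func loadBlacklist(rawEnv string) ([]string, error) {
	jc := JobConfiguration{"webscraper_blacklist": splitList(rawEnv)}
	var cfg webScraperConfig
	if err := jc.Unmarshal(&cfg); err != nil {
		return nil, err
	}
	return cfg.Blacklist, nil
}

func main() {
	list, err := loadBlacklist("https://a.example, https://b.example,")
	if err != nil {
		panic(err)
	}
	fmt.Println(list)
}
```

Round-tripping through JSON keeps the top-level map generic while letting each job declare a typed struct for the keys it needs.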
feat(webscraper): add implementation from masa-oracle
* feat(jobs): add twitter scraper job type
* chore: wire-up twitter config
* chore: do not re-create scrapers each time
* Adapt twitter code to latest changes
* chore(fix): populate jobWorkers
* chore(tests): add simple twitter test
* chore(tests): increase test area
* chore(tests): store cookies to cache
* correctly map cookie dir
* Skip Twitter tests
---------
Signed-off-by: mudler <mudler@localai.io>
* feat: add scrape by tweet id
* use a valid Twitter ID in the unit test
* Set whitelist in standalone mode to be just ourselves
* Move log message around
* Update internal/jobserver/jobserver.go
---------
Co-authored-by: Rapidfix <rapidfix@masalabs.ai>
- Add getprofile capability to GetCapabilities()
- Route getprofile jobs in ExecuteJob()
- Implement getProfile() method with full error handling
- Add stats tracking for profile fetching
- Stub for LinkedInFullProfileResult integration
- Update to use LinkedInArguments from tee-types v1.0.0
- Use proper PublicIdentifier field validation
- Return LinkedInFullProfileResult for rich profile data
- Add TODO placeholders for rich data field mapping
- Fully functional getprofile endpoint ready for production
- Map all Experience fields with proper date conversion
- Map all Education fields with date formatting
- Map Skills collection with name extraction
- Extract ProfilePictureURL from ProfilePicture.RootURL
- Map Summary field for detailed profile information
- Full integration of linkedin-scraper v1.0.0 structures
- Production-ready getprofile endpoint with comprehensive data
- Revert Capabilities -> ReportedCapabilities to maintain json:"reported_capabilities"
- Convert ScraperCapabilities to []string for backward compatibility
- Prevent breaking change that would affect tee-indexer
- Alvin-reyes's breaking change has been safely reverted
…branch
- Revert capabilities API from []ScraperCapabilities back to []string
- Restore original DetectCapabilities and MergeCapabilities functions
- Revert stats to use ReportedCapabilities []string instead of structured format
- Keep LinkedIn stats constants (LinkedInScrapes, LinkedInProfiles, etc.)
- Preserve all LinkedIn profile fetching functionality
- All LinkedIn tests still pass (5/5)
- Alvin's capabilities rework moved to feat/capabilities-rework branch
- Add LinkedIn capabilities (searchbyquery, getprofile) to auto-detection
- Require all three LinkedIn credentials: li_at_cookie, csrf_token, jsessionid
- Support both linkedin_credentials array and individual credential fields
- Add comprehensive tests for LinkedIn capability detection
- Test various combinations of missing credentials
- Ensure LinkedIn capabilities only advertised when all required credentials present
- Maintain backward compatibility with existing capabilities API
- All 11 capability tests passing
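The all-or-nothing credential rule from this commit could look roughly like the sketch below. It is an assumption-laden illustration: the flat `map[string]string` configuration and the function names `hasLinkedInCredentials` and `detectLinkedInCapabilities` are hypothetical, and the `linkedin_credentials` array form mentioned in the commit is left out for brevity. Only the three credential keys and the two capability names come from the commit message.

```go
package main

import "fmt"

// hasLinkedInCredentials reports whether all three credentials named in the
// commit message are present and non-empty. The flat map shape is an
// assumption made for this sketch.
func hasLinkedInCredentials(cfg map[string]string) bool {
	for _, key := range []string{"li_at_cookie", "csrf_token", "jsessionid"} {
		if cfg[key] == "" {
			return false
		}
	}
	return true
}

// detectLinkedInCapabilities advertises the LinkedIn capabilities only when
// every required credential is available; otherwise it returns nil.
func detectLinkedInCapabilities(cfg map[string]string) []string {
	if !hasLinkedInCredentials(cfg) {
		return nil
	}
	return []string{"searchbyquery", "getprofile"}
}

func main() {
	complete := map[string]string{
		"li_at_cookie": "cookie-value",
		"csrf_token":   "token-value",
		"jsessionid":   "session-value",
	}
	partial := map[string]string{"li_at_cookie": "cookie-value"}

	fmt.Println(detectLinkedInCapabilities(complete))
	fmt.Println(len(detectLinkedInCapabilities(partial)))
}
```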
mcamou reviewed on Jul 8, 2025
* Add Capability type
* Fix test failures
- Add TikTokTranscriber with configurable API endpoint
- Implement VTT to plain text conversion functionality
- Add comprehensive error handling and statistics tracking
- Support language selection with fallback logic
- Include video metadata extraction (title, thumbnail)
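A VTT to plain text conversion of the kind mentioned in this commit might be sketched as below. This is not the PR's implementation; `vttToPlainText` is a hypothetical helper. It treats each blank-line-separated block as a cue, keeps only the lines after the timing line (the one containing `-->`), and therefore skips the `WEBVTT` header, cue identifiers, and `NOTE` blocks. Inline markup inside cue text is not stripped.

```go
package main

import (
	"fmt"
	"strings"
)

// vttToPlainText extracts cue text from a WebVTT document and joins it with
// single spaces. Blocks without a timing line are ignored entirely.
func vttToPlainText(vtt string) string {
	// Normalise Windows line endings so block splitting works uniformly.
	vtt = strings.ReplaceAll(vtt, "\r\n", "\n")

	var parts []string
	for _, block := range strings.Split(vtt, "\n\n") {
		inText := false
		for _, line := range strings.Split(block, "\n") {
			line = strings.TrimSpace(line)
			switch {
			case strings.Contains(line, "-->"):
				// Timing line: everything after it in this block is cue text.
				inText = true
			case inText && line != "":
				parts = append(parts, line)
			}
		}
	}
	return strings.Join(parts, " ")
}

func main() {
	sample := "WEBVTT\n\n" +
		"1\n00:00:00.000 --> 00:00:02.000\nHello there\n\n" +
		"00:00:02.000 --> 00:00:04.000\nthis is\na test\n"
	fmt.Println(vttToPlainText(sample))
}
```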
- Replace hardcoded logging statements with dynamic loop iteration
- Iterate over jobworkers map to log initialization for each job type
- Reduces code duplication and makes future job type additions easier
- Maintains same functionality while improving maintainability
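The loop-based logging described here can be illustrated with a small sketch. The `jobWorker` type, the `initLogLines` helper, the log wording, and the job type names other than `linkedin-scraper` are assumptions for the example; the real `jobworkers` map in the PR may hold a different value type.

```go
package main

import (
	"fmt"
	"sort"
)

// jobWorker is a hypothetical stand-in for whatever the jobworkers map holds.
type jobWorker struct{}

// initLogLines builds one log line per registered job type. Keys are sorted
// first because Go map iteration order is not deterministic.
func initLogLines(jobworkers map[string]jobWorker) []string {
	jobTypes := make([]string, 0, len(jobworkers))
	for jobType := range jobworkers {
		jobTypes = append(jobTypes, jobType)
	}
	sort.Strings(jobTypes)

	lines := make([]string, 0, len(jobTypes))
	for _, jobType := range jobTypes {
		lines = append(lines, fmt.Sprintf("initialized job worker for %q", jobType))
	}
	return lines
}

func main() {
	// Job type names below are illustrative only.
	workers := map[string]jobWorker{
		"twitter-scraper":  {},
		"linkedin-scraper": {},
		"web-scraper":      {},
	}
	for _, line := range initLogLines(workers) {
		fmt.Println(line)
	}
}
```

Adding a new job type then only requires a new map entry; no extra logging statement is needed.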
- Remove redundant Expect(err).NotTo(HaveOccurred()) line
- Eliminates duplicate check after res.Unmarshal() call
- Keeps test cleaner and more concise
- Add .cursor/ to .gitignore to prevent IDE-specific files from being tracked
- Remove existing .cursor folder from git tracking
- Keep local .cursor folder intact for development
- Remove stats increments for empty query validation
- Remove stats increments for empty public_identifier validation
- Remove stats increments for invalid query type validation
- Keep error returns for proper user feedback
- System errors (auth, rate limits) still tracked appropriately
- Remove stats increments for empty VideoURL validation
- Remove stats increments for malformed job arguments validation
- Update test to expect 0 errors for user validation failures
- Keep error returns for proper user feedback
- System errors (API failures, parsing errors) still tracked appropriately
LinkedIn improvements:
- Add DefaultSearchCount and DefaultNetworkFilters constants
- Extract date range formatting into formatDateRange helper function
- Improve marshal error context with detailed error messages
- Don't increment stats for 404/not found errors (user validation)
TikTok improvements:
- Fix fmt.Errorf format string vulnerabilities
- Replace errors.New(errMsg) with fmt.Errorf("%s", errMsg)
- Remove unused errors import
All changes maintain backward compatibility while improving code quality,
maintainability, and security.
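The format string fix mentioned under the TikTok improvements can be illustrated as follows. The `newError` helper is a hypothetical name for this sketch. The point is that a message built at runtime should be passed as an argument to a constant `"%s"` format rather than used as the format itself; otherwise any `%` in the message (for example in a percent-encoded URL) is parsed as a verb and rendered as `%!verb(MISSING)` noise.

```go
package main

import "fmt"

// newError wraps a pre-built message using a constant "%s" format, so
// percent signs inside errMsg are kept verbatim.
// The unsafe variant would be fmt.Errorf(errMsg), which treats errMsg
// itself as the format string.
func newError(errMsg string) error {
	return fmt.Errorf("%s", errMsg)
}

func main() {
	err := newError("failed to fetch https://example.com/?q=a%20b")
	fmt.Println(err)
}
```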
Resolved conflicts:
- internal/capabilities/detector_test.go: Updated to use types.Capability and slices.Sort
- internal/jobs/stats/stats.go: Updated statType to StatType (capitalized)
- internal/jobserver/jobserver.go: Updated GetWorkerCapabilities to return types.Capability
Main branch changes integrated:
- Added Capability type for better type safety (#136)
- Added whitelist functionality (#120)
- Improved error logging (#125)
- Updated README (#123)
- Fixed Dockerfile MINERS_WHITE_LIST (#122)
Force-pushed from 9efe0e5 to 2228468
🔗 LinkedIn Profile Fetching Integration
This PR adds comprehensive LinkedIn profile search and fetching capabilities to the tee-worker, integrating with the new linkedin-scraper SDK v1.0.0.
📋 What's New
LinkedIn Job Types
- searchbyquery - Search LinkedIn profiles by keywords with advanced filtering
- getprofile - Fetch detailed LinkedIn profile information by public identifier
Rich Profile Data
Smart Capability Detection
🛠️ Technical Implementation
Dependencies
- linkedin-scraper v1.0.0
- tee-types v1.0.0: LinkedInArguments and LinkedInFullProfileResult structures
Error Handling
Job Arguments
    {
      "type": "linkedin-scraper",
      "arguments": {
        "type": "searchbyquery",
        "query": "software engineer",
        "network_filters": ["F", "S", "O"],
        "max_results": 10
      }
    }

    {
      "type": "linkedin-scraper",
      "arguments": {
        "type": "getprofile",
        "public_identifier": "john-doe-123"
      }
    }

🧪 Testing
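As a rough illustration of how the two job payloads from the Job Arguments section could be decoded and checked in a test, consider the sketch below. The `job` and `linkedInArgs` structs and the `parseJob` helper are hypothetical; the real argument type is `LinkedInArguments` from tee-types, whose exact fields are not shown in this PR description. Only the JSON keys and values are taken from the payloads.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// linkedInArgs is a hypothetical struct covering the keys seen in the payloads.
type linkedInArgs struct {
	Type             string   `json:"type"`
	Query            string   `json:"query"`
	NetworkFilters   []string `json:"network_filters"`
	MaxResults       int      `json:"max_results"`
	PublicIdentifier string   `json:"public_identifier"`
}

// job is a hypothetical envelope matching the outer JSON object.
type job struct {
	Type      string       `json:"type"`
	Arguments linkedInArgs `json:"arguments"`
}

// parseJob decodes a payload and applies the basic per-type validation.
func parseJob(raw string) (job, error) {
	var j job
	if err := json.Unmarshal([]byte(raw), &j); err != nil {
		return job{}, fmt.Errorf("decode job: %w", err)
	}
	if j.Type != "linkedin-scraper" {
		return job{}, fmt.Errorf("unexpected job type %q", j.Type)
	}
	switch j.Arguments.Type {
	case "searchbyquery":
		if j.Arguments.Query == "" {
			return job{}, errors.New("query is required for searchbyquery")
		}
	case "getprofile":
		if j.Arguments.PublicIdentifier == "" {
			return job{}, errors.New("public_identifier is required for getprofile")
		}
	default:
		return job{}, fmt.Errorf("unknown query type %q", j.Arguments.Type)
	}
	return j, nil
}

func main() {
	raw := `{"type":"linkedin-scraper","arguments":{"type":"getprofile","public_identifier":"john-doe-123"}}`
	j, err := parseJob(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(j.Arguments.Type, j.Arguments.PublicIdentifier)
}
```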
🔧 Configuration
Environment Variables
Capability Detection
LinkedIn capabilities (searchbyquery, getprofile) are automatically detected when all three credentials are present.
📊 Statistics
New LinkedIn-specific metrics:
- linkedin_scrapes - Total LinkedIn operations
- linkedin_returned_profiles - Profiles successfully returned
- linkedin_errors - General errors
- linkedin_auth_errors - Authentication failures
- linkedin_ratelimit_errors - Rate limit errors
🚀 Benefits
📈 Performance
This implementation provides a solid foundation for LinkedIn data collection while maintaining the tee-worker's reliability and performance standards.
Fixes https://github.com/masa-finance/tee-indexer/issues/226