Skip to content

feat: Complete LinkedIn Profile Fetching Integration#116

Closed
alvin-reyes wants to merge 134 commits intomainfrom
feat/linkedin-scraper
Closed

feat: Complete LinkedIn Profile Fetching Integration#116
alvin-reyes wants to merge 134 commits intomainfrom
feat/linkedin-scraper

Conversation

@alvin-reyes
Copy link
Contributor

@alvin-reyes alvin-reyes commented Jun 22, 2025

🔗 LinkedIn Profile Fetching Integration

This PR adds comprehensive LinkedIn profile search and fetching capabilities to the tee-worker, integrating with the new linkedin-scraper SDK v1.0.0.

📋 What's New

LinkedIn Job Types

  • searchbyquery - Search LinkedIn profiles by keywords with advanced filtering
  • getprofile - Fetch detailed LinkedIn profile information by public identifier

Rich Profile Data

  • Complete profile information - Name, headline, location, summary
  • Work experience - Full employment history with dates and descriptions
  • Education history - Schools, degrees, and academic background
  • Skills - Professional skills and competencies
  • Profile pictures - High-quality profile image URLs

Smart Capability Detection

  • Auto-detection - LinkedIn capabilities automatically detected when credentials are present
  • Credential validation - Requires all three LinkedIn credentials (li_at_cookie, csrf_token, jsessionid)
  • Graceful fallback - Workers operate normally without LinkedIn credentials

🛠️ Technical Implementation

Dependencies

  • Updated to linkedin-scraper v1.0.0
  • Updated to tee-types v1.0.0
  • Uses new LinkedInArguments and LinkedInFullProfileResult structures

Error Handling

  • Authentication errors - Proper handling of expired/invalid credentials
  • Rate limiting - Graceful handling of LinkedIn API limits
  • Not found errors - Clean error messages for invalid profiles
  • Stats tracking - Comprehensive metrics for all operation types

Job Arguments

{
  "type": "linkedin-scraper",
  "arguments": {
    "type": "searchbyquery",
    "query": "software engineer",
    "network_filters": ["F", "S", "O"],
    "max_results": 10
  }
}
{
  "type": "linkedin-scraper", 
  "arguments": {
    "type": "getprofile",
    "public_identifier": "john-doe-123"
  }
}

🧪 Testing

  • 5/5 LinkedIn tests passing with real API integration
  • Comprehensive test coverage for both search and profile fetching
  • Error scenario testing - Invalid credentials, timeouts, not found cases
  • Integration tests - End-to-end workflow validation

🔧 Configuration

Environment Variables

LINKEDIN_LI_AT_COOKIE=your_li_at_cookie
LINKEDIN_CSRF_TOKEN=your_csrf_token  
LINKEDIN_JSESSIONID=your_jsessionid

Capability Detection

LinkedIn capabilities (searchbyquery, getprofile) are automatically detected when all three credentials are present.

📊 Statistics

New LinkedIn-specific metrics:

  • linkedin_scrapes - Total LinkedIn operations
  • linkedin_returned_profiles - Profiles successfully returned
  • linkedin_errors - General errors
  • linkedin_auth_errors - Authentication failures
  • linkedin_ratelimit_errors - Rate limit errors

🚀 Benefits

  • Expands data collection - Access to professional LinkedIn profiles
  • High-quality data - Comprehensive profile information
  • Reliable operation - Robust error handling and credential validation
  • Scalable architecture - Follows existing job patterns and conventions
  • No breaking changes - Fully backward compatible

📈 Performance

  • Efficient profile fetching - Optimized API calls
  • Timeout handling - Configurable timeouts for all operations
  • Memory efficient - Streams large profile datasets
  • Stats tracking - Real-time monitoring of operation success rates

This implementation provides a solid foundation for LinkedIn data collection while maintaining the tee-worker's reliability and performance standards.

Fixes https://github.com/masa-finance/tee-indexer/issues/226

mudler and others added 30 commits October 23, 2024 19:51
This puts the ground of the main tee-worker component.

It is composed of a simple http server which acts as a job server, a
client to interact with it, and the scaffolding required to run tests
and build signed binaries.

Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
tee-worker initial implementation
Signed-off-by: mudler <mudler@localai.io>
chore(refactor): move scraper type to a constant
Signed-off-by: mudler <mudler@localai.io>
This code isn't currently used, was used with the initial
implementation

Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
)

* feat(jobserver): allow to specify global configuration via env file

As an example we provide to the webscraper a WEBSCRAPER_BLACKLIST
environment variable which contains a comma separated list of url to
blacklist during scraping.

The JobConfiguration is a generic map[string]interface{} that can be
populated top-level. It gets unmarshalled as JSON by the jobs to map the
relevant fields in the configuration.

Signed-off-by: mudler <mudler@localai.io>

* add .env.example

Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
feat(webscraper): add implementation from masa-oracle
* feat(jobs): add twitter scraper job type

Signed-off-by: mudler <mudler@localai.io>

* chore: wire-up twitter config

Signed-off-by: mudler <mudler@localai.io>

* chore: do not re-create scrapers each time

Signed-off-by: mudler <mudler@localai.io>

* Adapt twitter code to latest changes

Signed-off-by: mudler <mudler@localai.io>

* chore(fix): populate jobWorkers

Signed-off-by: mudler <mudler@localai.io>

* chore(tests): add simple twitter test

Signed-off-by: mudler <mudler@localai.io>

* chore(tests): increase test area

Signed-off-by: mudler <mudler@localai.io>

* chore(tests): store cookies to cache

Signed-off-by: mudler <mudler@localai.io>

* correctly map cookie dir

Signed-off-by: mudler <mudler@localai.io>

* Skip Twitter tests

Signed-off-by: mudler <mudler@localai.io>

---------

Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
Signed-off-by: mudler <mudler@localai.io>
* feat: add scrape by tweet id

* use valid twitter id on the ut
Signed-off-by: mudler <mudler@localai.io>
rapidfix and others added 8 commits July 1, 2025 12:43
* Set whitelist in standalone mode to be just ourselves

* Move log message around

* Update internal/jobserver/jobserver.go

Co-authored-by: Rapidfix <rapidfix@masalabs.ai>

---------

Co-authored-by: Rapidfix <rapidfix@masalabs.ai>
- Add getprofile capability to GetCapabilities()
- Route getprofile jobs in ExecuteJob()
- Implement getProfile() method with full error handling
- Add stats tracking for profile fetching
- Stub for LinkedInFullProfileResult integration
- Update to use LinkedInArguments from tee-types v1.0.0
- Use proper PublicIdentifier field validation
- Return LinkedInFullProfileResult for rich profile data
- Add TODO placeholders for rich data field mapping
- Fully functional getprofile endpoint ready for production
- Map all Experience fields with proper date conversion
- Map all Education fields with date formatting
- Map Skills collection with name extraction
- Extract ProfilePictureURL from ProfilePicture.RootURL
- Map Summary field for detailed profile information
- Full integration of linkedin-scraper v1.0.0 structures
- Production-ready getprofile endpoint with comprehensive data
@teslashibe teslashibe changed the title feat: linkedin scraper implementation - tee-worker feat: Complete LinkedIn Profile Fetching Integration Jul 2, 2025
@teslashibe teslashibe self-assigned this Jul 2, 2025
@teslashibe teslashibe requested a review from rapidfix July 2, 2025 23:46
- Revert Capabilities -> ReportedCapabilities to maintain json:"reported_capabilities"
- Convert ScraperCapabilities to []string for backward compatibility
- Prevent breaking change that would affect tee-indexer
- Alvin-reyes's breaking change has been safely reverted
…branch

- Revert capabilities API from []ScraperCapabilities back to []string
- Restore original DetectCapabilities and MergeCapabilities functions
- Revert stats to use ReportedCapabilities []string instead of structured format
- Keep LinkedIn stats constants (LinkedInScrapes, LinkedInProfiles, etc.)
- Preserve all LinkedIn profile fetching functionality
- All LinkedIn tests still pass (5/5)
- Alvin's capabilities rework moved to feat/capabilities-rework branch
- Add LinkedIn capabilities (searchbyquery, getprofile) to auto-detection
- Require all three LinkedIn credentials: li_at_cookie, csrf_token, jsessionid
- Support both linkedin_credentials array and individual credential fields
- Add comprehensive tests for LinkedIn capability detection
- Test various combinations of missing credentials
- Ensure LinkedIn capabilities only advertised when all required credentials present
- Maintain backward compatibility with existing capabilities API
- All 11 capability tests passing
@teslashibe teslashibe requested review from mcamou and mudler July 7, 2025 20:24
mcamou and others added 9 commits July 9, 2025 19:19
* Add Capability type

* Fix test failures
- Add TikTokTranscriber with configurable API endpoint
- Implement VTT to plain text conversion functionality
- Add comprehensive error handling and statistics tracking
- Support language selection with fallback logic
- Include video metadata extraction (title, thumbnail)
- Replace hardcoded logging statements with dynamic loop iteration
- Iterate over jobworkers map to log initialization for each job type
- Reduces code duplication and makes future job type additions easier
- Maintains same functionality while improving maintainability
- Remove redundant Expect(err).NotTo(HaveOccurred()) line
- Eliminates duplicate check after res.Unmarshal() call
- Keeps test cleaner and more concise
- Add .cursor/ to .gitignore to prevent IDE-specific files from being tracked
- Remove existing .cursor folder from git tracking
- Keep local .cursor folder intact for development
- Remove stats increments for empty query validation
- Remove stats increments for empty public_identifier validation
- Remove stats increments for invalid query type validation
- Keep error returns for proper user feedback
- System errors (auth, rate limits) still tracked appropriately
- Remove stats increments for empty VideoURL validation
- Remove stats increments for malformed job arguments validation
- Update test to expect 0 errors for user validation failures
- Keep error returns for proper user feedback
- System errors (API failures, parsing errors) still tracked appropriately
LinkedIn improvements:
- Add DefaultSearchCount and DefaultNetworkFilters constants
- Extract date range formatting into formatDateRange helper function
- Improve marshal error context with detailed error messages
- Don't increment stats for 404/not found errors (user validation)

TikTok improvements:
- Fix fmt.Errorf format string vulnerabilities
- Replace errors.New(errMsg) with fmt.Errorf('%s', errMsg)
- Remove unused errors import

All changes maintain backward compatibility while improving code quality,
maintainability, and security.
Resolved conflicts:
- internal/capabilities/detector_test.go: Updated to use types.Capability and slices.Sort
- internal/jobs/stats/stats.go: Updated statType to StatType (capitalized)
- internal/jobserver/jobserver.go: Updated GetWorkerCapabilities to return types.Capability

Main branch changes integrated:
- Added Capability type for better type safety (#136)
- Added whitelist functionality (#120)
- Improved error logging (#125)
- Updated README (#123)
- Fixed Dockerfile MINERS_WHITE_LIST (#122)
@teslashibe teslashibe closed this Jul 14, 2025
@teslashibe teslashibe force-pushed the feat/linkedin-scraper branch from 9efe0e5 to 2228468 Compare July 14, 2025 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants