
Instagram Scraper

A simple, serverless Instagram scraper that fetches public Instagram posts, including images as base64-encoded strings, without an API key. It is designed entirely for edge deployment and runs on platforms such as Vercel, Supabase, and Deno Deploy, so you can integrate your Instagram content into your broader content strategy.

Features

  • Repurpose Instagram Content: Automatically fetch Instagram posts to use in your blogs, websites, etc.
  • No API Key Needed: Skip the hassle of authentication—just scrape public data directly.
  • Cross-Origin Workaround: Convert Instagram thumbnails to base64, allowing you to use images across domains.
  • Incremental Loading: Keep your content fresh by pulling in new posts daily without overwhelming Instagram's servers.
  • Serverless and Cost-Effective: Deploy on platforms such as Vercel, Supabase, or Deno Deploy to run this scraper as a microservice with minimal overhead.
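
The cross-origin workaround listed above comes down to downloading the image bytes and inlining them as a base64 string. The following is a minimal sketch of that idea, not the repository's actual implementation: `toDataUri` and `fetchAsDataUri` are hypothetical names, and the `image/jpeg` fallback is an assumption. It relies only on `fetch`, `btoa`, and `Uint8Array`, which are available in Node.js 20 and in V8-isolate runtimes.

```javascript
// Hypothetical helper: encode raw image bytes as a base64 data URI.
function toDataUri(bytes, mimeType = "image/jpeg") {
  let binary = "";
  const chunkSize = 0x8000; // convert in chunks to stay under argument-count limits
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Hypothetical usage: download an image and return it as a data URI.
async function fetchAsDataUri(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Image request failed: ${response.status}`);
  }
  // Assumption: fall back to JPEG when the server sends no Content-Type header.
  const mimeType = response.headers.get("content-type") ?? "image/jpeg";
  return toDataUri(new Uint8Array(await response.arrayBuffer()), mimeType);
}
```

A data URI can be placed directly in an `img` element's `src`, so the browser never has to request the image from another origin.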

Technology

  • Edge Deployment: Written with 100% edge deployment in mind, using V8 isolates to run efficiently at the edge.
  • Simple Microservice: Built as a microservice following the Single Responsibility Principle (SRP), ensuring focused and efficient functionality.
  • Compliance First: Designed to be used responsibly without violating Instagram’s terms and conditions.

Getting Started

Prerequisites

  • Node.js (version 20.x or later)
  • A platform for deployment (e.g., Vercel)

Installation

  1. Clone the Repository:

    git clone https://github.com/senthilsweb/instagram-scraper.git
    cd instagram-scraper
  2. Install Dependencies:

    npm install
  3. Set Up Environment Variables:

    This project does not require specific environment variables, so no additional setup is necessary.

  4. Run Locally:

    To run the scraper locally, use:

    npm run dev

    This will start the server on http://localhost:3000.

API Endpoint

POST /api/instagram/scrape

Retrieves Instagram posts for a given profile and size limit.

  • URL: /api/instagram/scrape
  • Method: POST
  • Content-Type: application/json

Request Payload

{
  "profile_id": "your-instagram-profile-id",
  "first": 10
}

  • profile_id: The Instagram profile ID you want to scrape.
  • first: (Optional) Number of posts to fetch. Defaults to 10 if not provided.
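
As an illustration of the payload above, here is one way a client might call the endpoint with `fetch`. It is a sketch under assumptions: `BASE_URL` assumes the local development server from the Run Locally step, and `buildScrapeRequest` and `scrapePosts` are hypothetical helper names.

```javascript
// Assumption: the scraper is running locally on the address from "Run Locally".
const BASE_URL = "http://localhost:3000";

// Build the fetch arguments for POST /api/instagram/scrape.
function buildScrapeRequest(profileId, first = 10) {
  return {
    url: `${BASE_URL}/api/instagram/scrape`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ profile_id: profileId, first }),
    },
  };
}

// Send the request and return the parsed JSON body, or throw on an error status.
async function scrapePosts(profileId, first) {
  const { url, init } = buildScrapeRequest(profileId, first);
  const response = await fetch(url, init);
  const data = await response.json();
  if (!response.ok) {
    throw new Error(data.message ?? `Request failed: ${response.status}`);
  }
  return data;
}
```

Calling `scrapePosts("your-instagram-profile-id", 5)` would request five posts; omitting the second argument sends `first: 10`, matching the documented default.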

Response

Successful Response

If the request is successful, the API returns a JSON object containing the fetched Instagram posts.

{
  "first": 10,
  "total": 100,
  "result": [
    {
      "id": 0,
      "text": "Your Instagram post caption",
      "thumbnail_src": "https://instagram.com/your-thumbnail-url.jpg",
      "display_url": "https://instagram.com/your-image-url.jpg",
      "shortcode": "shortcode123",
      "base64": "...",
      "created_at": "01-Jan-2024"
    }
    // More posts...
  ]
}
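
To show how the fields in this response might be consumed, the sketch below maps one post to the values a page template would need. Several things here are assumptions rather than documented behavior: that `created_at` always follows the `DD-Mon-YYYY` shape of the sample, and that `base64` holds either a complete data URI or a bare base64 JPEG payload. The helper names are hypothetical.

```javascript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

// Assumption: created_at uses the DD-Mon-YYYY shape from the sample response.
function parseCreatedAt(createdAt) {
  const [day, month, year] = createdAt.split("-");
  const monthIndex = MONTHS.indexOf(month);
  if (monthIndex === -1) {
    throw new Error(`Unrecognized month in created_at: ${createdAt}`);
  }
  return new Date(Date.UTC(Number(year), monthIndex, Number(day)));
}

// Assumption: base64 is either a full data URI or a bare base64 JPEG payload.
function imageSrc(post) {
  return post.base64.startsWith("data:")
    ? post.base64
    : `data:image/jpeg;base64,${post.base64}`;
}

// Map one entry of "result" to the values a template would render.
function toViewModel(post) {
  return {
    caption: post.text,
    src: imageSrc(post),
    postedAt: parseCreatedAt(post.created_at),
    link: `https://www.instagram.com/p/${post.shortcode}/`,
  };
}
```

Applying `toViewModel` to each entry of `result` (for example with `data.result.map(toViewModel)`) gives a list that can be rendered without further requests to Instagram for the images.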

Validation Error Response

If the profile_id is missing or invalid, the API returns a validation error response with a 400 status code.

{
  "error": "Profile ID is required and cannot be empty.",
  "message": "Please provide a valid Instagram profile ID."
}
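
The validation rule described above could be implemented along the following lines. This is a sketch, not the project's actual handler code: `validateScrapePayload` is a hypothetical name, and treating a non-integer or non-positive `first` as "not provided" is an assumption. The error strings are the ones shown in the sample response.

```javascript
// Hypothetical validator for the request payload.
function validateScrapePayload(body) {
  const profileId =
    typeof body?.profile_id === "string" ? body.profile_id.trim() : "";

  if (profileId === "") {
    return {
      ok: false,
      status: 400,
      body: {
        error: "Profile ID is required and cannot be empty.",
        message: "Please provide a valid Instagram profile ID.",
      },
    };
  }

  // Assumption: any value that is not a positive integer falls back to the default of 10.
  const first =
    Number.isInteger(body.first) && body.first > 0 ? body.first : 10;

  return { ok: true, value: { profile_id: profileId, first } };
}
```

Returning a plain result object keeps the check independent of any particular framework, so the same function can run in a Node.js route handler or in an edge function.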

Deployment

Deploy the scraper on platforms like Vercel for serverless and cost-effective operation:

Deploy on Vercel:

  1. Connect Your GitHub Repository to Vercel:

    • Follow the steps in Vercel’s dashboard to connect your GitHub repository.
  2. Set Environment Variables:

    • Although this project doesn’t require specific environment variables, ensure any relevant settings for your deployment environment are configured in Vercel.
  3. Deploy Your Project:

    • Vercel deploys your project with one click, or automatically through its continuous integration with GitHub.

Warning: Vercel Function Time Limitation

Vercel imposes a time limitation on serverless functions, typically up to 10 seconds for free-tier users. If your request to scrape Instagram takes longer than this, the function may time out.

Workarounds for Vercel Timeout:

  • Optimize Requests: Reduce the number of posts (first) requested to ensure the function completes within the time limit.
  • Use Incremental Loading: Fetch posts incrementally (e.g., daily) to avoid large, time-consuming requests.
  • Alternative Deployment: Consider deploying the scraper on platforms with higher timeout limits (e.g., AWS Lambda, Google Cloud Functions) if you anticipate needing to scrape large amounts of data.
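
The incremental-loading workaround above can be sketched as a small merge step: request only a few recent posts on each run and add the ones you have not stored yet. The function name `mergeNewPosts` is hypothetical, and the sketch assumes that `shortcode` uniquely identifies a post and that the API returns the newest posts first.

```javascript
// Hypothetical helper: combine previously stored posts with a fresh, small batch.
// Assumption: shortcode is unique per post.
function mergeNewPosts(stored, fetched) {
  const seen = new Set(stored.map((post) => post.shortcode));
  const fresh = [];
  for (const post of fetched) {
    if (!seen.has(post.shortcode)) {
      seen.add(post.shortcode); // also guards against duplicates within the batch
      fresh.push(post);
    }
  }
  // Assumption: the API returns newest posts first, so new posts go in front.
  return [...fresh, ...stored];
}
```

Run daily with a small `first` value, this keeps each request short enough to fit within the function time limit while the stored collection grows over time.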

Compliance and Usage

  • Respect Instagram’s Terms of Use: Use this tool responsibly to avoid account restrictions or legal issues.
  • Avoid Excessive Requests: Making too many requests in a short period may result in your IP being blocked by Instagram.
  • Obtain Necessary Permissions: Ensure you have the right to use and repurpose the Instagram content you scrape.
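
One way to avoid excessive requests is to enforce a minimum gap between calls on the client side. The sketch below is illustrative only: `createThrottle` and `throttledCall` are hypothetical names, and a suitable interval is something you would have to choose yourself, since the source does not state a safe request rate.

```javascript
// Hypothetical throttle: reserves time slots at least minIntervalMs apart.
// The clock is injectable so the logic can be exercised without real waiting.
function createThrottle(minIntervalMs, now = Date.now) {
  let nextSlot = 0;
  // Returns how many milliseconds the caller should wait before its request.
  return function reserve() {
    const current = now();
    const startAt = Math.max(current, nextSlot);
    nextSlot = startAt + minIntervalMs;
    return startAt - current;
  };
}

// Wait for the reserved slot, then run the task (for example, a scrape request).
async function throttledCall(reserve, task) {
  const waitMs = reserve();
  if (waitMs > 0) {
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  return task();
}
```

For example, `const reserve = createThrottle(60000)` followed by `throttledCall(reserve, () => fetch(url, init))` would space requests at least one minute apart within a single process; it does not coordinate across multiple serverless instances.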

Disclaimer: Please use this tool responsibly and ensure compliance with Instagram's Terms of Use. Excessive or inappropriate use may lead to account restrictions or legal consequences. Additionally, it is strongly recommended not to run this code from public serverless platforms like Vercel for production, as it may lead to temporary or permanent blocks by Instagram. Instead, consider running it on-premises using Docker or a similar setup to better control the environment and reduce the risk of IP blocks.