TechTrendStat

Technology Trends Statistician is your go-to tool for real-time insights into the ever-changing technology landscape. It combines web scraping and data analysis to track the latest technology trends in development job descriptions.

The visualized data collected by this project can be viewed at https://akrekhovetskyi.github.io/. The source code of the website can be found at https://github.com/AKrekhovetskyi/AKrekhovetskyi.github.io.

Features

  • Scraping jobs from Djinni by several specialization categories (e.g. Python, Java, DevOps).
  • Proxy and user agent rotation.
  • Ability to work with local and cloud MongoDB, as well as with regular CSV files.
  • Using Pydantic models instead of standard Scrapy items for better data validation (see the sketch after this list).
  • Database collections to simplify connection to MongoDB.
  • Two pipelines (Mongo and CSV).
  • Data wrangling: clean up text and extract technology statistics via a CLI.
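
Below is a minimal sketch of what a Pydantic-based vacancy item might look like. The field names here are hypothetical; the project's actual models live in the scraping package and may differ.

from datetime import datetime

from pydantic import BaseModel, HttpUrl


class Vacancy(BaseModel):
    """Hypothetical vacancy item; the real models may differ."""

    title: str
    category: str  # e.g. "Python"
    description: str
    url: HttpUrl
    published_at: datetime

Unlike plain Scrapy items, a Pydantic model validates and coerces every field at creation time, so malformed vacancies fail fast instead of propagating into the pipelines.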

Linux Installation

NOTE: Python version >3.9 is required.

Clone the repository:

git clone --recurse-submodules https://github.com/AKrekhovetskyi/tech-trend-stat.git
cd tech-trend-stat

Install uv manager and set the PYTHONPATH environment variable:

# Install uv with the official standalone installer.
curl -LsSf https://astral.sh/uv/install.sh | sh
export UV_ENV_FILE=.env
export PYTHONPATH="$(pwd):$(pwd)/techtrendanalysis"

Create a copy of the file .env.sample and set the required variables:

cp .env.sample .env
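
For reference, here is a hypothetical excerpt of what the resulting .env might contain. The variable names below are assumptions; the authoritative list is in .env.sample.

# Hypothetical names; the real ones are listed in .env.sample
MONGODB_URI=mongodb://localhost:27017
MONGODB_DATABASE=techtrendstat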

Getting Started

If you decide to work with MongoDB, there is a tutorial on installing it locally in a Docker container, as well as instructions on creating a cluster in the cloud.

Once the database has been successfully installed, run the following command to scrape vacancies with the Scrapy spider and the Mongo pipeline:

uv run scrapy crawl djinni -a categories="Python"
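
For context, the Mongo pipeline follows the standard Scrapy item-pipeline pattern. A minimal sketch with pymongo is shown below; the class, environment variable, and collection names are illustrative, not the project's actual code.

import os

import pymongo


class MongoPipeline:
    """Illustrative Scrapy item pipeline that stores vacancies in MongoDB."""

    def open_spider(self, spider):
        # Hypothetical variable names; the project reads its own MONGODB_* settings.
        self.client = pymongo.MongoClient(os.environ["MONGODB_URI"])
        self.collection = self.client[os.environ["MONGODB_DATABASE"]]["vacancies"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Pydantic items serialize to plain dicts before insertion (Pydantic v2 API).
        self.collection.insert_one(item.model_dump())
        return item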

To scrape the vacancies into a CSV file instead, comment out all the MONGODB_* environment variables and run the same crawl command.

You can replace "Python" with any other category, or a stack of categories separated by " | ". See the available specializations (categories) on the Djinni website.
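
For example, to scrape several categories in one run:

uv run scrapy crawl djinni -a categories="Python | Java | DevOps"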

To extract statistics from job descriptions, first install the required spaCy model:

uv run spacy download en_core_web_md
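
To give a sense of why the model is needed, here is a small, illustrative example of using en_core_web_md to pull known technology tokens out of a job description. This is only a sketch under assumed names; it is not the wrangler's actual logic.

import spacy

nlp = spacy.load("en_core_web_md")

description = "We are looking for a Python developer with Django and PostgreSQL experience."
known_technologies = {"python", "django", "postgresql"}  # illustrative lexicon

doc = nlp(description)
# Keep tokens whose lowercased text matches the lexicon.
found = {token.text for token in doc if token.lower_ in known_technologies}
print(found)  # {'Python', 'Django', 'PostgreSQL'} (order may vary)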

To run the wrangler, use its CLI:

uv run python -m techtrendanalysis.wrangler --help

Data Analysis

To see the visualization of the extracted statistics, please head over to the analysis file and follow the instructions given there.

Here is an example of a visualized result (figure: Python technology statistics).

Contribution

Install the pre-commit script and hooks:

pre-commit install
pre-commit install-hooks

Run tests after any modifications:

uv run coverage run -m pytest --show-capture=stdout --showlocals -vv -s -rA tests/