BeatriceVec is a high-performance Python package for generating 600-dimensional word embeddings, optimized with Cython for speed. It requires no third-party numerical libraries, relying only on the Python standard library and Cython. Word embeddings are vector representations of words that capture semantic relationships and meaning, making them well suited to natural language processing (NLP) tasks such as word similarity, text classification, and information retrieval.
With BeatriceVec, you can transform textual data into meaningful vector representations locally, without internet access. Its embeddings capture nuanced semantic relationships between words, empowering algorithms to understand context and similarities—perfect for applications like sentiment analysis, language translation, and recommendation systems.
- High Dimensionality: Utilizes 600 dimensions to encode complex word relationships and fine-grained distinctions, enhancing performance in downstream NLP tasks.
- Cython Optimization: Compiled to C for faster training and vector operations compared to pure Python implementations.
- Standalone: No dependencies beyond Cython, making it lightweight and easy to deploy.
- User-Friendly API: Simple interface for training custom embeddings on your own text corpora, tailored to your specific domain.
- Local Processing: Create and query embeddings offline, ideal for secure or resource-constrained environments.
BeatriceVec is a valuable tool for developers and researchers exploring word embeddings, offering flexibility, performance, and ease of use for text analysis, information retrieval, and language understanding projects.
The package is available as a source distribution (`.tar.gz`) or a pre-built wheel (`.whl`) in the `dist/` folder. Download them from here.
Prerequisites:
- Python 3.6 or later
- Cython (installed automatically with the package)
From Wheel (Recommended for Speed):
```bash
pip install dist/beatricevec-1.0.1-cp312-cp312-linux_x86_64.whl
```
Note: The wheel is platform-specific (e.g., `linux_x86_64` for Linux, `win_amd64` for Windows). Use the `.tar.gz` if your platform differs.
From Source Distribution:
```bash
pip install dist/beatricevec-1.0.1.tar.gz
```
Requires a C compiler (e.g., gcc on Linux/macOS, Visual Studio on Windows) to compile the Cython code during installation.
Manual Build (For Development):
```bash
git clone https://github.com/foscraft/beatrice-project.git
cd beatrice-project
pip install cython
python setup.py build_ext --inplace
```
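After an in-place build, you can verify that the compiled extension imports cleanly. This is a minimal smoke test run from the repository root; it assumes nothing beyond the `BeatriceVec` class shown in the usage example below.

```python
# Minimal smoke test after `python setup.py build_ext --inplace`.
# Run from the repository root so the built extension is on the path.
from beatricevec import BeatriceVec

print(BeatriceVec)  # importing without errors confirms the Cython build worked
```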
```python
from beatricevec import BeatriceVec

# Example corpus
corpus = [
    "Learning strategies for post-literacy and continuing education in Kenya",
    "Natural language processing with BeatriceVec is fast and efficient",
    "Word embeddings capture semantic relationships",
]

# Initialize and train the model
embedder = BeatriceVec(corpus)
embedder.build_vocab()
embedder.initialize_word_vectors()
embedder.train()

# Get embeddings
embeddings = embedder.get_embeddings()

# Print embeddings for each word
for embedding in embeddings:
    print(embedding[:10])  # Print first 10 dimensions for brevity
```
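Because embeddings are plain Python lists, similarity queries need no numerical libraries. The sketch below computes cosine similarity between two word vectors via `get_embedding`; the chosen words and the assumption that the tokenizer splits the corpus on whitespace are illustrative, not guaranteed by the API.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors, in pure Python."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "language" and "processing" appear in the example corpus; whether they
# survive tokenization unchanged is an assumption about the tokenizer.
vec_a = embedder.get_embedding("language")
vec_b = embedder.get_embedding("processing")
if vec_a is not None and vec_b is not None:
    print(f"similarity(language, processing) = {cosine_similarity(vec_a, vec_b):.4f}")
```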
- `build_vocab()`: Constructs the vocabulary from the input corpus.
- `initialize_word_vectors()`: Initializes word vectors with random values between -1 and 1.
- `train()`: Trains the model using a Word2Vec-inspired algorithm, optimized with Cython.
- `update_vector(vector: list, context_vector: list)`: Updates a target vector using gradient descent (internal method).
- `get_embeddings() -> list`: Returns a list of 600-dimensional embeddings for all words in the vocabulary.
- `get_embedding(word: str) -> list`: Retrieves the 600-dimensional embedding for a specific word, or `None` if the word is not in the vocabulary.
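For intuition, here is a minimal sketch of the kind of gradient-descent step `update_vector` describes: nudging a target vector toward its context vector. The actual method is internal, implemented in Cython, and its exact objective is not documented, so treat this purely as an illustration.

```python
def update_vector_sketch(vector: list, context_vector: list,
                         learning_rate: float = 0.01) -> list:
    # One gradient-descent step on a squared-distance objective:
    # move each component of the target vector a small step toward
    # the corresponding component of the context vector.
    return [v + learning_rate * (c - v) for v, c in zip(vector, context_vector)]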
- `dimension`: 600 (fixed; rich representation space)
- `context_size`: 2 (default window size for context words)
- `learning_rate`: 0.01 (default gradient descent step size)
- `num_epochs`: 10 (default number of training iterations)
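If these hyperparameters are exposed as constructor keyword arguments (the usage example above passes only `corpus`, so the argument names here are an assumption), overriding the defaults might look like this:

```python
# Hypothetical: assumes the documented hyperparameters are accepted as
# keyword arguments by the BeatriceVec constructor.
embedder = BeatriceVec(
    corpus,
    context_size=2,      # window of context words on each side of the target
    learning_rate=0.01,  # gradient descent step size
    num_epochs=10,       # passes over the corpus during training
)
```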
BeatriceVec is released under the Apache 2.0 License.
Contributions are welcome! See `CONTRIBUTING.md` for guidelines on how to contribute to this project.
- Built with Cython for performance without external numerical libraries.
- Compatible with Python 3.6+.
- Source and wheel distributions available in the `dist/` folder.
Explore the power of high-dimensional word embeddings with BeatriceVec and enhance your NLP projects today!