minhash-lsh-algorithm

Here are 18 public repositories matching this topic...

Cheng-Lin-Li / Spark

Python 2.7 code and learning notes for Spark 2.1.1

spark map-reduce minhash tf-idf kmeans als cosine-similarity python27 kmeans-clustering minhash-lsh-algorithm apriori-algorithm alternating-least-squares uv-decomposition savasere-omiecinski-and-navathe apriori-son

Updated Aug 21, 2018
Python

tmpsrcrepo / benchmark_minhash_lsh

Insight Data Engineering fellow project

algorithm spark batch spark-streaming text-processing jaccard-similarity minhash-lsh-algorithm

Updated Nov 14, 2016
Python

kazemnejad / text_similarity_search

An easy-to-use script for fast similarity search over textual data (and embedding space) with GPU and multi-core support.

indexing minhash-lsh-algorithm faiss text-simi

Updated Aug 26, 2019
Python

micts / jss

Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing

python numpy minhash locality-sensitive-hashing jaccard-similarity minhash-lsh-algorithm jaccard-distance jaccard-index jaccard-similarity-estimation

Updated May 21, 2020
Python

shubhamwaghe / Scalable-Data-Mining

Scalable Data Mining - Assignment submissions

scala apache-spark minhash-lsh-algorithm hadoop-mapreduce

Updated Dec 11, 2017
Python

soulintzis / Multidimensional-Data-Structures

python bloom-filters data-structures cosine-similarity minhash-lsh-algorithm multidimensional membership-queries bplus-tree

Updated Sep 19, 2020
Python

rkapsalis / Range-and-similarity-queries

Implementation of a B+ Tree for range and exact match queries and of the LSH algorithm for finding similar documents as measured by Jaccard Similarity.

python lsh jaccard-similarity minhash-lsh-algorithm bplustree

Updated Feb 19, 2021
Python

mandychumt / YelpRecommendationSystem

Recommendation systems for Yelp (collaborative filtering & content-based)

collaborative-filtering recommendation-system frequent-itemset-mining datamining minhash-lsh-algorithm yelp-dataset content-based-recommendation tfidf-text-analysis

Updated Mar 28, 2020
Python

Chaimaaorg / entity-match-platform

A production-ready data pipeline for entity matching, web data enrichment, and analytics visualization using Apache Airflow, PySpark, and Docker.

python airflow pyspark minhash-lsh-algorithm minhash-similarity

Updated Nov 16, 2025
Python

KenObata / distributed-curator

Partition-aware MinHash LSH deduplication library for large-scale text data curation on Apache Spark.

spark deduplication minhash-lsh-algorithm common-crawl data-curation near-duplicate llm

Updated May 16, 2026
Python

cwuu / DataMining-LearningFromLargeDataSet-Task1

ETH Zurich Fall 2017

locality-sensitive-hashing mapreduce datamining minhash-lsh-algorithm large-dataset

Updated Apr 12, 2018
Python

xadityax / Locality-Sensitive-Hashing-DNA-Seqs

Implementing Locality Sensitive Hashing for DNA Sequences.

locality-sensitive-hashing dna-sequences minhash-lsh-algorithm shingling lsh-algorithm

Updated Nov 29, 2020
Python

wmjg-alt / nlp-systems-review

NLP lessons with implementations of Information Retrieval algorithms, Data Sampling techniques, Clustering methods, Semantic Search, and more.

python search nlp search-engine data-science information-retrieval data-engineering data-analysis vector-space-model lessons tfidf semantic-search active-learning minhash-lsh-algorithm system-design clustering-methods sampling-methods

Updated Feb 27, 2026
Python

sebSR / text-processing

Text similarity (Jaccard similarity, MinHash, LSH)

minhash text-processing jaccard-similarity minhash-lsh-algorithm

Updated Feb 21, 2021
Python

RNimantha / User-Entity-Resolution-Deduplication

Scalable PySpark entity resolution pipeline for deduplicating brand-level customer records using embeddings, MinHash blocking, semantic matching, and Delta Lake master-data regeneration.

record-linkage entity-resolution embeddings pyspark databricks minhash-lsh-algorithm delta-lake customer-deduplication

Updated May 11, 2026
Python

saifkhancse / Bangla-Text-Analysis

A Streamlit app for Bangla text analysis (EDA, ANN/LSH similarity search, clustering) powered by PySpark.

python nlp text-mining pyspark recommender-system text-processing minhash-lsh-algorithm lhs streamlit

Updated Sep 13, 2025
Python

aloobun / minhash_exp

Deduplication: MinHash with LSH

dataset deduplication minhash-lsh-algorithm

Updated Oct 8, 2023
Python

AiDinho / LocallySensitiveHashing

lsh minhash-lsh-algorithm

Updated Nov 17, 2017
Python
