Python 2.7 code and learning notes for Spark 2.1.1
Updated Aug 21, 2018 - Python
Insight Data Engineering fellow project
An easy-to-use script for fast similarity search in textual data (and embedding space) with GPU and multi-core support.
Fast Jaccard similarity search for abstract sets (documents, products, users, etc.) using MinHashing and Locality Sensitive Hashing
Scalable Data Mining - Assignment submissions
Implementation of a B+ Tree for range and exact match queries and of the LSH algorithm for finding similar documents as measured by Jaccard Similarity.
Recommendation systems for Yelp (collaborative filtering & content-based)
A production-ready data pipeline for entity matching, web data enrichment, and analytics visualization using Apache Airflow, PySpark, and Docker.
Partition-aware MinHash LSH deduplication library for large-scale text data curation on Apache Spark.
ETH Zurich Fall 2017
Implementing Locality Sensitive Hashing for DNA Sequences.
NLP lessons with implementations of Information Retrieval algorithms, Data Sampling techniques, Clustering methods, Semantic Search, and more.
Text similarity (Jaccard similarity, MinHash, LSH)
Scalable PySpark entity resolution pipeline for deduplicating brand-level customer records using embeddings, MinHash blocking, semantic matching, and Delta Lake master-data regeneration.
A Streamlit app for Bangla text analysis (EDA, ANN/LSH similarity search, clustering) powered by PySpark.
Deduplication: MinHash with LSH