Open source project for data preparation of LLM application builders
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
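The "implicit data parallelism" Spark provides means the programmer writes an ordinary map/reduce over a dataset and the framework handles partitioning and parallel execution. A stdlib-only sketch of that idea (not Spark's actual API, which in PySpark would look like `sc.parallelize(data).map(f).reduce(g)`):

```python
# Conceptual sketch of implicit data parallelism: the caller supplies only
# a map function and a reduce function; partitioning and parallel execution
# are handled inside the helper, not by the caller.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_map_reduce(data, map_fn, reduce_fn, partitions=4):
    # Split the input into roughly equal partitions.
    chunks = [data[i::partitions] for i in range(partitions)]
    # Map each partition in parallel; the caller never manages threads.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(lambda c: [map_fn(x) for x in c], chunks))
    # Reduce all partial results into a single value.
    flat = [x for part in partials for x in part]
    return reduce(reduce_fn, flat)

total = parallel_map_reduce(range(10), lambda x: x * x, lambda a, b: a + b)
print(total)  # sum of squares 0..9 = 285
```

Spark adds fault tolerance on top of this model by tracking the lineage of each partition so lost work can be recomputed.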
Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing
Simple and Distributed Machine Learning
Personal code snippets for learning new programming skills
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
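The core idea behind probabilistic data linkage is to score candidate record pairs by per-field similarity and treat high-scoring pairs as likely matches. A minimal stdlib sketch of that idea (real linkage tools use a trained Fellegi-Sunter model and push the comparisons down into a SQL backend; this is only the concept):

```python
# Minimal sketch of probabilistic record linkage: compare two records
# field by field with a string-similarity measure and average the results
# into a single match score in [0, 1].
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    # Similarity ratio in [0, 1]; lowercased so "SMITH" matches "Smith".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict, fields) -> float:
    # Average the per-field similarities into one overall score.
    return sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

a = {"name": "Jon Smith", "city": "London"}
b = {"name": "John Smith", "city": "london"}
print(match_score(a, b, ["name", "city"]) > 0.8)  # True: likely the same person
```

A production linker also needs blocking rules to avoid comparing every pair of records, which is where the "scalable" part of the description comes in.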
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.
CLI tool for giving '.csv' files a schema and casting them to '.parquet'
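"Giving a '.csv' file a schema" amounts to declaring a type per column and casting every cell accordingly. A stdlib-only sketch of that casting step (writing the result to '.parquet' would need a library such as pyarrow, which is omitted here to keep the example self-contained):

```python
# Sketch of schema-casting for CSV data: each column name maps to a Python
# type, and every cell is passed through its column's type constructor.
import csv
import io

schema = {"id": int, "price": float, "name": str}  # declared column types

raw = "id,price,name\n1,9.99,widget\n2,4.50,gadget\n"

def cast_rows(text: str, schema: dict) -> list[dict]:
    reader = csv.DictReader(io.StringIO(text))
    # Apply the schema's type to every cell, row by row.
    return [{col: schema[col](val) for col, val in row.items()} for row in reader]

rows = cast_rows(raw, schema)
print(rows[0])  # {'id': 1, 'price': 9.99, 'name': 'widget'}
```

Typed rows like these are exactly what a columnar writer needs, since Parquet stores a strongly typed schema alongside the data.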
Big Data Applications
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
SageWorks: An easy-to-use Python API for creating and deploying AWS SageMaker Models
Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.
Apache Spark was created by Matei Zaharia and released May 26, 2014.