Open source project for data preparation of LLM application builders
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
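The "implicit data parallelism" Spark provides means the programmer writes an ordinary map/reduce over a dataset and the framework handles partitioning and parallel execution. A stdlib-only sketch of that idea (not Spark's actual API, which in PySpark would look like `sc.parallelize(data).map(f).reduce(g)`):

```python
# Conceptual sketch of implicit data parallelism: the caller supplies only
# a map function and a reduce function; partitioning and parallel execution
# are handled inside the helper, not by the caller.
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def parallel_map_reduce(data, map_fn, reduce_fn, partitions=4):
    # Split the input into roughly equal partitions.
    chunks = [data[i::partitions] for i in range(partitions)]
    # Map each partition in parallel; the caller never manages threads.
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        partials = list(pool.map(lambda c: [map_fn(x) for x in c], chunks))
    # Reduce all partial results into a single value.
    flat = [x for part in partials for x in part]
    return reduce(reduce_fn, flat)

total = parallel_map_reduce(range(10), lambda x: x * x, lambda a, b: a + b)
print(total)  # sum of squares 0..9 = 285
```

Spark adds fault tolerance on top of this model by tracking the lineage of each partition so lost work can be recomputed.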
Cal Poly Fall 2024 CSC 369 Intro to Distributed Computing
Simple and Distributed Machine Learning
Personal code snippets for learning new programming skills
YTsaurus is a scalable and fault-tolerant open-source big data platform.
Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
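The core idea behind probabilistic data linkage is to score candidate record pairs by per-field similarity and treat high-scoring pairs as likely matches. A minimal stdlib sketch of that idea (real linkage tools use a trained Fellegi-Sunter model and push the comparisons down into a SQL backend; this is only the concept):

```python
# Minimal sketch of probabilistic record linkage: compare two records
# field by field with a string-similarity measure and average the results
# into a single match score in [0, 1].
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    # Similarity ratio in [0, 1]; lowercased so "SMITH" matches "Smith".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict, fields) -> float:
    # Average the per-field similarities into one overall score.
    return sum(field_similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)

a = {"name": "Jon Smith", "city": "London"}
b = {"name": "John Smith", "city": "london"}
print(match_score(a, b, ["name", "city"]) > 0.8)  # True: likely the same person
```

A production linker also needs blocking rules to avoid comparing every pair of records, which is where the "scalable" part of the description comes in.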
Extensible Python SDK for developing Flyte tasks and workflows. Simple to get started with and learn, and highly extensible.
CLI tool for giving '.csv' files a schema and casting them to '.parquet'
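"Giving a '.csv' file a schema" amounts to declaring a type per column and casting every cell accordingly. A stdlib-only sketch of that casting step (writing the result to '.parquet' would need a library such as pyarrow, which is omitted here to keep the example self-contained):

```python
# Sketch of schema-casting for CSV data: each column name maps to a Python
# type, and every cell is passed through its column's type constructor.
import csv
import io

schema = {"id": int, "price": float, "name": str}  # declared column types

raw = "id,price,name\n1,9.99,widget\n2,4.50,gadget\n"

def cast_rows(text: str, schema: dict) -> list[dict]:
    reader = csv.DictReader(io.StringIO(text))
    # Apply the schema's type to every cell, row by row.
    return [{col: schema[col](val) for col, val in row.items()} for row in reader]

rows = cast_rows(raw, schema)
print(rows[0])  # {'id': 1, 'price': 9.99, 'name': 'widget'}
```

Typed rows like these are exactly what a columnar writer needs, since Parquet stores a strongly typed schema alongside the data.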
Big Data Applications
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
SageWorks: An easy-to-use Python API for creating and deploying AWS SageMaker Models
Server for the ListenBrainz project, including the front-end (JavaScript/React) code that it serves and all of the data processing components that LB uses.
Apache Spark was created by Matei Zaharia and released May 26, 2014.