This is part of a project undertaken for the master's course at the Department of Informatics and Telecommunications at the University of Athens. Details about the course, along with its description, can be found here.
Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
For more information, see the main repository on which this project is based, available at this location.
The course deals with contemporary issues in the principles and systems of Big Data management. The topics we will examine are:
- The Map-Reduce programming model and systems such as Hadoop, HBase, and others.
- The HDFS file storage system; the Spark and TensorFlow systems.
- Message and streaming systems (e.g. Kafka and Samza).
- Key-value stores.
- Similarity detection techniques (similarity search, locality-sensitive hashing).
- Link analysis techniques for large graphs (PageRank, Hubs & Authorities); clustering; recommendation systems.
- Computational advertising issues.
- The course includes the presentation and study of research topics as well as their practical application.