Skip to content

BOUHAMOUM/SC-DBSCAN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SC-DBSCAN: a schema extraction algorithm for large RDF datasets

SC-DBSCAN is a scalable density-based clustering algorithm that operates on entities in large RDF datasets. SC-DBSCAN builds a schema describing the entities of a dataset by discovering their classes. SC-DBSCAN is designed to address the scalability issue of density based clustering algorithms. It can cluster large RDF datasets and provides a clustering result of a quality as good as the original DBSCAN algorithm.

SC-DBSCAN is implemented in Scala and using the Apache Spark framework.

Building the project

Maven is used to build the project. The Maven wrapper tool allows to build the project without a local maven install. Due to some constraints imposed by the Scala compiler, a JDK 8 is needed.

On Linux

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
./mvnw package -DskipTests

On Windows

set JAVA_HOME=C:\path\to\jdk8
mvnw.cmd package -DskipTests

Running the algorithm

The main class is david/sc_dbscan/Main.scala.

spark-submit --class david.sc_dbscan.Main \\
             target/sc_dbscan-1.0-jar-with-dependencies.jar \\
             --eps X.X --coef Y --cap C  --mpts Y dataset

WHERE:
  --eps 	: the similarity threshold epsilon (between 0 and 1)
  --coef 	: a boolean that defines whether it clusters patterns or entities
  --cap 	: the maximum capacity of a computing node (in number of entities)
  --mpts 	: the density thresholg minPts
  dataset   : the path to the dataset

For Example

spark-submit --class david.sc_dbscan.Main \\
             target/sc_dbscan-1.0-jar-with-dependencies.jar \\
             --eps 0.8 --coef false --cap 2000  --mpts 3 DataSets/T2800L10N10000

About

SC-DBSCAN is a scalable and deterministic density-based clustering algorithm inspired from DBSCAN.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 2

  •  
  •  

Languages