LSC & HSC
Light and heavy semantic clustering (LSC and HSC) are two approaches for clustering any type of semantically annotated document.

About


The SC project was initiated at the LGI2P research center during the PhD of Nicolas Fiorini. The main motivation behind this work is to provide a hierarchical clustering technique that relies on the semantic annotations associated with documents. Such clustering is generic, as those documents can be texts, videos or even gene sequences.

Semantic clustering, as we propose it, aims to build a hierarchy of semantically labeled clusters. First, the evaluation (using this benchmark) shows that such clustering is more reliable than classical hierarchical approaches. Second, clusters often need to be labeled after the documents are clustered, so that one can understand which groups have been formed. Because documents are clustered according to their semantic annotations, we propose to use those annotations to label the tree nodes as well.
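
To make the idea of a labeled cluster hierarchy concrete, here is a minimal Java sketch. It is not the LSC or HSC algorithm: it assumes each document is annotated with a flat set of concept names, uses plain Jaccard overlap as a stand-in similarity, and labels each merged cluster with the concepts shared by its two children. The class, method and concept names are all hypothetical and do not come from the project's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a naive agglomerative clustering over annotation sets.
class AnnotationClusteringSketch {

    /** A tree node: leaves carry a document name, inner nodes carry only a label. */
    static final class Node {
        final String name;          // document name for a leaf, null for an inner node
        final Set<String> label;    // concepts describing this node
        final List<Node> children;  // empty for a leaf

        Node(String name, Set<String> label, List<Node> children) {
            this.name = name;
            this.label = label;
            this.children = children;
        }
    }

    /** Builds a leaf from a document name and its annotations. */
    static Node leaf(String name, String... concepts) {
        return new Node(name, new TreeSet<>(Arrays.asList(concepts)), List.of());
    }

    /** Jaccard overlap of two concept sets (assumed stand-in similarity). */
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> union = new TreeSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0; // two unannotated nodes: treat as dissimilar
        }
        Set<String> shared = new TreeSet<>(a);
        shared.retainAll(b);
        return (double) shared.size() / union.size();
    }

    /** Repeatedly merges the two most similar nodes until one root remains. */
    static Node cluster(List<Node> leaves) {
        if (leaves.isEmpty()) {
            throw new IllegalArgumentException("at least one document is required");
        }
        List<Node> active = new ArrayList<>(leaves);
        while (active.size() > 1) {
            int bestI = 0;
            int bestJ = 1;
            double best = -1.0;
            for (int i = 0; i < active.size(); i++) {
                for (int j = i + 1; j < active.size(); j++) {
                    double s = jaccard(active.get(i).label, active.get(j).label);
                    if (s > best) {
                        best = s;
                        bestI = i;
                        bestJ = j;
                    }
                }
            }
            Node a = active.get(bestI);
            Node b = active.get(bestJ);
            // Label the new cluster with the annotations both children share.
            Set<String> label = new TreeSet<>(a.label);
            label.retainAll(b.label);
            Node merged = new Node(null, label, List.of(a, b));
            active.remove(bestJ); // remove the higher index first
            active.remove(bestI);
            active.add(merged);
        }
        return active.get(0);
    }

    public static void main(String[] args) {
        Node root = cluster(List.of(
                leaf("d1", "gene", "protein"),
                leaf("d2", "gene", "protein", "cell"),
                leaf("d3", "video", "music")));
        System.out.println("root label: " + root.label);
        for (Node child : root.children) {
            System.out.println("child " + child.name + " label: " + child.label);
        }
    }
}
```

In this toy setting the label of a cluster is simply the intersection of its children's annotations, so it becomes empty as soon as unrelated documents are merged. A semantic approach would instead rely on the structure of the annotation vocabulary, which this sketch deliberately leaves out.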

This website provides resources regarding the project, such as a GitHub project containing a minimal working example adapted to the benchmark, some download links and an overview of our results.

Source code


The project source code is available on GitHub. Please consider it a prototype/beta, as it is not optimized yet (and HSC's complexity is quite high).

The project is developed in Java as a Maven project. Feel free to report bugs. The project contains the benchmark data and the results we obtained on this benchmark.


Results


experts
LSC-pp
LSC-nopp
baseline-pp
baseline-nopp

Downloads


Here are some useful links regarding hierarchical semantic clustering: first, the benchmark that allowed us to evaluate the approach; second, an archive containing all our results (but not their evaluation scores).

Contacts


Feel free to contact the team who initiated the project if you have any requests or suggestions concerning it, or if you want to collaborate with us. This project results from a collaboration between the École des mines d'Alès and Montpellier SupAgro.