Unsupervised Topical Organization of Documents using Corpus-based Text Analysis

LAUR Repository

dc.contributor.author Sarkissian, Sarkis
dc.contributor.author Tekli, Joe
dc.date.accessioned 2024-11-08T08:23:24Z
dc.date.available 2024-11-08T08:23:24Z
dc.date.copyright 2021 en_US
dc.date.issued 2021-11-09
dc.identifier.isbn 9781450383141 en_US
dc.identifier.uri http://hdl.handle.net/10725/16285
dc.description.abstract This study aims at automating the topical keyword organization of a set of documents in an input text corpus. It is conducted in the context of a larger project investigating efficient unsupervised learning techniques to automatically extract relevant classes and their keyword descriptions from a set of United Nations (UN) documents, and to use the latter to produce reference corpora for classifying future UN documents. We assume that the reference classes are unknown in advance, and thus suggest an unsupervised clustering approach which accepts as input a collection of unstructured text documents, and produces as output groups of similar documents describing similar topics. The input document feature vectors are augmented with term co-occurrence and relatedness scores produced from a distributional thesaurus built on the same (or a related) corpus. The augmented feature vectors are then run through a hierarchical clustering process to identify groups of similar documents, which serve as candidates for topical organization and keyword extraction. Experiments on a manually labelled dataset of documents classified against the UN's Sustainable Development Goals (SDGs) confirm the quality and potential of the approach. en_US
dc.description.sponsorship ACM en_US
dc.description.sponsorship SIGAPP en_US
dc.language.iso en en_US
dc.publisher The Association for Computing Machinery en_US
dc.subject Big data -- Congresses en_US
dc.subject Computer security -- Congresses en_US
dc.subject Database management -- Congresses en_US
dc.title Unsupervised Topical Organization of Documents using Corpus-based Text Analysis en_US
dc.type Conference Paper / Proceeding en_US
dc.author.school SOE en_US
dc.author.idnumber 201306321 en_US
dc.author.department Electrical and Computer Engineering en_US
dc.publication.place New York, NY en_US
dc.description.bibliographiccitations Includes bibliographical references en_US
dc.identifier.doi https://doi.org/10.1145/3444757.3485078 en_US
dc.identifier.ctation Sarkissian, S., & Tekli, J. (2021, November). Unsupervised topical organization of documents using corpus-based text analysis. In Proceedings of 2021 13th International Conference on Management of Digital EcoSystems (MEDES 2021), (pp. 87-94). New York: ACM. en_US
dc.author.email joe.tekli@lau.edu.lb en_US
dc.conference.date 1-3 November, 2021 en_US
dc.conference.pages 87-94 en_US
dc.conference.place Tunisia (Virtual event) en_US
dc.conference.title MEDES '21: Proceedings of the 13th International Conference on Management of Digital EcoSystems en_US
dc.identifier.tou http://libraries.lau.edu.lb/research/laur/terms-of-use/articles.php en_US
dc.identifier.url https://dl.acm.org/doi/abs/10.1145/3444757.3485078 en_US
dc.orcid.id https://orcid.org/0000-0003-3441-7974 en_US
dc.publication.date 2021 en_US
dc.author.affiliation Lebanese American University en_US
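
The pipeline summarized in the abstract (feature vectors augmented with co-occurrence scores, then hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give their weighting scheme, thesaurus construction, or linkage criterion. The sketch assumes raw term frequencies, document-level co-occurrence counts as a stand-in for the distributional thesaurus, cosine similarity, and average-linkage agglomerative clustering. All function names and the `weight` parameter are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def tokenize(text):
    # Lowercase, split on whitespace, keep purely alphabetic tokens.
    return [w for w in text.lower().split() if w.isalpha()]


def cooccurrence(docs_tokens):
    # Count how many documents contain each unordered pair of distinct terms.
    # Assumption: this stands in for the paper's distributional thesaurus.
    counts = Counter()
    for tokens in docs_tokens:
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return counts


def augment(tf, cooc, weight=0.5):
    # Start from the document's own term frequencies, then add weighted
    # mass to every term that co-occurs with one of the document's terms.
    vec = {t: float(n) for t, n in tf.items()}
    for (a, b), c in cooc.items():
        if a in tf:
            vec[b] = vec.get(b, 0.0) + weight * c
        if b in tf:
            vec[a] = vec.get(a, 0.0) + weight * c
    return vec


def cosine(u, v):
    dot = sum(x * v.get(t, 0.0) for t, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors, k):
    # Average-linkage agglomerative clustering: repeatedly merge the two
    # clusters with the highest mean pairwise cosine similarity.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [cosine(vectors[a], vectors[b])
                    for a in clusters[i] for b in clusters[j]]
            score = sum(sims) / len(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def organize(docs, k):
    # Full sketch: tokenize, build co-occurrence counts, augment, cluster.
    docs_tokens = [tokenize(d) for d in docs]
    cooc = cooccurrence(docs_tokens)
    vectors = [augment(Counter(t), cooc) for t in docs_tokens]
    return cluster(vectors, k)


# Toy corpus (invented for illustration, not the paper's SDG dataset).
docs = [
    "clean water sanitation access",
    "water sanitation rural access",
    "renewable energy solar grid",
    "solar energy grid storage",
]
print(organize(docs, k=2))
```

Stopping at a fixed number of clusters `k` is a simplification for the example; since the abstract assumes the reference classes are unknown in advance, a real system would more plausibly cut the hierarchy using a similarity threshold or a cluster-quality criterion.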