
dc.contributor.advisor: Sang, Nguyen Thi Thanh
dc.contributor.author: Huy, Tran Tien
dc.date.accessioned: 2018-08-30T08:26:58Z
dc.date.available: 2018-08-30T08:26:58Z
dc.date.issued: 2017
dc.identifier.other: 022003763
dc.identifier.uri: http://keep.hcmiu.edu.vn:8080/handle/123456789/2746
dc.description.abstract: The aim of this thesis is to apply the Vector Space Model and web mining algorithms to extract meaningful information from a given web page. The first step is to crawl the page and discard its redundancy and noise. The algorithm then extracts the main text of the page. Next, Latent Semantic Indexing (LSI) is applied to check that the extracted text is actually related to the title of the current page. LSI converts the words and sentences of the text into a Vector Space Model, the TF-IDF matrix, in which each element of a vector is a weighted number. This model is then truncated with Singular Value Decomposition (SVD) to create a subspace in which only meaningful words remain. The similarity between the text and the title is then calculated with the cosine similarity measure. Data pre-processing, an important step that cannot be skipped before the data is handled, is also examined in detail. Finally, the other resources used in this thesis are introduced. [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: International University - HCMC [en_US]
dc.subject: Web sites; Informative blocks [en_US]
dc.title: Extraction of informative blocks from website [en_US]
dc.type: Thesis [en_US]
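
The pipeline described in the abstract (TF-IDF weighting, SVD truncation, cosine similarity against the page title) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `tokenize`, `tfidf_matrix` and `lsi_similarities`, the smoothed IDF formula, the regex tokenizer and the default `k=2` are all assumptions made for illustration, and NumPy is assumed to be available.

```python
import re
import numpy as np


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs):
    """Build a (terms x documents) TF-IDF matrix from raw strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    n_docs = len(docs)

    # Raw term counts: one row per term, one column per document.
    tf = np.zeros((len(vocab), n_docs))
    for j, doc in enumerate(tokenized):
        for t in doc:
            tf[index[t], j] += 1

    # Smoothed inverse document frequency (an assumed variant).
    df = np.count_nonzero(tf, axis=1)
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    return tf * idf[:, None]


def lsi_similarities(title, blocks, k=2):
    """Cosine similarity of each text block to the title in a k-dim LSI space."""
    A = tfidf_matrix([title] + blocks)

    # Thin SVD, then keep only the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))

    # Document coordinates in the latent space: rows of (diag(s_k) @ Vt_k).T
    coords = (s[:k, None] * Vt[:k, :]).T
    title_vec = coords[0]

    sims = []
    for vec in coords[1:]:
        denom = np.linalg.norm(title_vec) * np.linalg.norm(vec)
        sims.append(float(title_vec @ vec / denom) if denom > 0 else 0.0)
    return sims


if __name__ == "__main__":
    scores = lsi_similarities(
        "Extraction of informative blocks from website",
        [
            "Informative blocks are extracted from each website page.",
            "Sign in to accept cookies and subscribe to our newsletter.",
        ],
    )
    print(scores)
```

Blocks whose score falls below some threshold would be treated as unrelated to the title; the abstract does not state a threshold, so choosing one is left open here.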

