Extraction of informative blocks from website

dc.contributor.advisor	Sang, Nguyen Thi Thanh
dc.contributor.author	Huy, Tran Tien
dc.date.accessioned	2018-08-30T08:26:58Z
dc.date.available	2018-08-30T08:26:58Z
dc.date.issued	2017
dc.identifier.other	022003763
dc.identifier.uri	http://keep.hcmiu.edu.vn:8080/handle/123456789/2746
dc.description.abstract	The aim of this thesis is to apply the Vector Space Model concept and a Web Mining algorithm to extract meaningful information from a given web page. The first step is to crawl the web and remove all redundancy and noise from the web page. The algorithm then extracts the main text of that web page. Next, the Latent Semantic Indexing (LSI) algorithm is applied to make sure that all of the extracted text is actually related to the title of the current page. LSI converts the words and sentences of the text into a Vector Space Model representation, the TF-IDF matrix, in which each element of a vector is a weighted number. This model is then truncated with Singular Value Decomposition (SVD) to create a subspace in which only meaningful words remain. The similarity between the text and the title can then be calculated using the cosine similarity measure. Data pre-processing, an important step that cannot be skipped before the data is handled, is also examined in detail. Finally, the other resources used in this thesis are introduced.	en_US
dc.language.iso	en_US	en_US
dc.publisher	International University - HCMC	en_US
dc.subject	Web sites; Informative blocks	en_US
dc.title	Extraction of informative blocks from website	en_US
dc.type	Thesis	en_US
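
The scoring idea in the abstract (weight terms with TF-IDF, then compare each text block with the page title by cosine similarity) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the smoothed IDF formula, the tokenizer, and the sample strings are all assumptions made for illustration, and the SVD truncation step of LSI is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumption: relative term frequency times a smoothed IDF,
    log((1 + N) / (1 + df)) + 1. The thesis may use another weighting.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty text
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_blocks(title, blocks):
    """Score each block against the title and sort by descending similarity."""
    vectors = tfidf_vectors([title] + list(blocks))
    title_vec = vectors[0]
    scored = [(cosine_similarity(title_vec, vec), block)
              for vec, block in zip(vectors[1:], blocks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the thesis.
    title = "Extraction of informative blocks from website"
    blocks = [
        "Informative blocks hold the main text of a website.",
        "Sign in, share this page, or subscribe to our newsletter.",
    ]
    for score, block in rank_blocks(title, blocks):
        print(f"{score:.3f}  {block}")
```

In a fuller LSI pipeline the TF-IDF matrix would first be reduced with a truncated SVD (for example via `numpy.linalg.svd`) and the cosine similarity would be computed in the reduced subspace; a block that shares no terms with the title scores 0.0 in this simplified version.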

Files in this item

Name: 022003763 - Huy, Tran Tien.pdf
Size: 3.462Mb
Format: PDF

This item appears in the following Collection(s)

Bachelor Thesis - Computer Science and Engineering
