Publication year: [2024]
Paper information |
|
Paper title (Korean) |
[Vol.19, No.1] A Study on Constructing a Focused Web Crawler for Producing Web Corpus |
|
Authors |
Nam-Oh Kang, Jae-Ho Kim |
|
Abstract |
Since the web was introduced as an Internet service for information sharing, a vast amount of data has been published on it. Accordingly, various attempts have been made to build focused web crawlers for constructing large-scale corpora from the web. A focused web crawler analyzes the pages it retrieves to extract the requested information, then extracts and visits the URLs most relevant to the pages the user wants, which makes information retrieval more effective. This lets natural language researchers search for, collect, and manage sentences that use specific words or phrases on the web, making focused web crawlers well suited to building large-scale web corpora that meet specific conditions. In this study, we examine how the method of crawling URLs and the procedure for prioritizing the URLs to be crawled affect performance when constructing a focused web crawler for building a web corpus. Based on this examination, we present a method for building a focused web crawler for web corpus generation that aims to improve performance. To evaluate the proposed system, we constructed corpora for several terms. The experimental results showed that the corpus construction algorithm proposed in this paper improves on existing methods. |
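The URL prioritization described in the abstract can be illustrated with a minimal best-first sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes a hypothetical `relevance` score (the fraction of words on a page that match the target terms), assumes each outgoing link inherits the score of the page it was found on, and replaces network access with an in-memory `TOY_WEB` dictionary so the example runs offline.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical stand-in for the web: url -> (page text, outgoing links).
TOY_WEB: Dict[str, Tuple[str, List[str]]] = {
    "seed": ("corpus links page", ["a", "b"]),
    "a": ("cooking recipes and food", ["c"]),
    "b": ("corpus corpus linguistics", ["d"]),
    "c": ("more food", []),
    "d": ("web corpus crawler", []),
}


def relevance(text: str, terms: List[str]) -> float:
    """Assumed scoring rule: fraction of words in `text` that match a target term."""
    words = text.lower().split()
    if not words:
        return 0.0
    wanted = {t.lower() for t in terms}
    return sum(w in wanted for w in words) / len(words)


def focused_crawl(
    seed: str,
    fetch: Callable[[str], Tuple[str, List[str]]],
    terms: List[str],
    max_pages: int = 10,
) -> List[str]:
    """Visit pages best-first, returning URLs in the order they were crawled."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier: List[Tuple[float, str]] = [(-1.0, seed)]
    seen: Set[str] = {seed}
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        visited.append(url)
        score = relevance(text, terms)
        for link in links:
            if link not in seen:
                seen.add(link)
                # Assumption: a link is as promising as the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return visited


order = focused_crawl("seed", TOY_WEB.__getitem__, ["corpus"], max_pages=4)
print(order)
```

With a budget of four pages, the link found on the on-topic page `b` is crawled ahead of the link found earlier on the off-topic page `a`, so `c` is never fetched. A real crawler would replace `TOY_WEB.__getitem__` with an HTTP fetcher and link extractor, and would likely use a stronger relevance model.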
|