Similarity Measures of Web Repositories constructed by Web-Scrapping from Specific Web Resource
IT Skills Show & International Conference on Advancements in Computing Resources, (SSICACR-2017) 15 and 16 February 2017, Alagappa University, Karaikudi, Tamil Nadu, India. International Journal of Computer Science (IJCS) Published by SK Research Group of Companies (SKRGC)
Abstract
Information extraction (IE) systems apply machine learning to the extraction task. These systems differ in how they characterize the IE problem and in the style of text they handle. The most important tasks in information extraction from the web are understanding webpage structure and organization, because many web sites contain large collections of pages displayed using a common template or layout, which makes it increasingly difficult to discover relevant data about a specific topic. Extracting data from such template pages has become an important issue as the number of web pages available on the Internet grows day by day. Tools and protocols to extract this information are now in demand, as researchers and web surfers want to discover new knowledge at an ever-increasing rate. A web crawler, also known as a robot or a spider, is a system for the bulk downloading of web pages, whereas the goal of a focused crawler is to seek pages that are relevant to a pre-defined set of topics from a specific web resource. Rather than collecting and indexing all accessible web documents in order to answer every ad-hoc query, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. Since all search engines are fed data by crawlers, improving crawler performance is critical. Because the volume of data is huge, general-purpose crawlers are no longer practical, so there is a need for a domain-specific crawler that builds on existing algorithms. Focusing the crawl in this way yields considerable savings in hardware and network resources and helps keep the crawl more up to date. This paper proposes a novel framework called SWNLP, which enables bidirectional integration of page structure understanding and text understanding in an iterative manner. We apply the proposed framework to a judgments information system to extract the text of judgments and compute similarity measures over the extracted text.
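The focused-crawling idea described in the abstract can be illustrated with a minimal sketch. It is not the SWNLP framework itself: the in-memory page graph `PAGES`, the function names, the relevance threshold, and the choice of bag-of-words cosine similarity as the relevance measure are all assumptions made for illustration, since the abstract does not name a specific similarity measure.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def focused_crawl(pages, seed, topic, threshold=0.1, limit=10):
    """Best-first crawl over a hypothetical in-memory page graph.

    pages maps url -> (page text, list of outgoing links).
    Links found on more relevant pages are visited first, and
    pages scoring below the threshold are not expanded.
    """
    frontier = [(-1.0, seed)]  # max-heap via negated priority
    seen = {seed}
    relevant = []
    while frontier and len(relevant) < limit:
        _, url = heapq.heappop(frontier)
        text, links = pages.get(url, ("", []))
        score = cosine_similarity(text, topic)
        if score < threshold:
            continue  # irrelevant region: do not follow its links
        relevant.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


# Toy stand-in for a web resource (assumed data, not from the paper).
PAGES = {
    "seed": ("court judgment index", ["a", "b"]),
    "a": ("supreme court judgment on appeal", ["c"]),
    "b": ("cricket scores and sports news", ["d"]),
    "c": ("high court judgment text", []),
    "d": ("court judgment archive", []),
}

result = focused_crawl(PAGES, "seed", "court judgment")
visited = [url for url, _ in result]
```

In this toy graph the page `d` is on topic but is never fetched, because it is reachable only through the off-topic page `b`. That is the trade-off a focused crawler accepts in exchange for skipping irrelevant regions.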
Keywords
Web Crawler, Text Extraction, Structured Web Data, Deep Web.