Distributed Hypertext Resource Discovery Through Examples.

Soumen Chakrabarti, Martin van den Berg, Byron Dom: Distributed Hypertext Resource Discovery Through Examples. VLDB 1999: 375-386

@inproceedings{DBLP:conf/vldb/ChakrabartiBD99,
  author    = {Soumen Chakrabarti and
               Martin van den Berg and
               Byron Dom},
  editor    = {Malcolm P. Atkinson and
               Maria E. Orlowska and
               Patrick Valduriez and
               Stanley B. Zdonik and
               Michael L. Brodie},
  title     = {Distributed Hypertext Resource Discovery Through Examples},
  booktitle = {VLDB'99, Proceedings of 25th International Conference on Very
               Large Data Bases, September 7-10, 1999, Edinburgh, Scotland,
               UK},
  publisher = {Morgan Kaufmann},
  year      = {1999},
  isbn      = {1-55860-615-7},
  pages     = {375--386},
  ee        = {db/conf/vldb/ChakrabartiBD99.html},
  crossref  = {DBLP:conf/vldb/99},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Abstract

We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, meta-data, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead, we exploit two properties: pages tend to cite pages on related topics, and, given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the user's interest, and (2) a distiller that evaluates a page as an access point for a large neighborhood of relevant pages. Our implementation uses IBM's Universal Database, not only for robust data storage, but also for integrating the computations of the classifier and distiller into the database. This results in a significant increase in I/O efficiency: a factor of ten for the classifier and a factor of three for the distiller. In addition, ad-hoc SQL queries can be used to monitor the crawler and dynamically change crawling strategies. We report on experiments to establish that our system is efficient, effective, and robust.
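
As an illustration only, the example question in the abstract could be phrased as a SQL join over a relational store of pages and links. The sketch below is not the paper's schema or implementation: the `page` and `link` tables, their columns, the topic labels, the example URLs, and the helper names `count_topic_links` and `build_demo` are all assumptions made for this example, and SQLite (via Python's standard `sqlite3` module) stands in for the database system named in the abstract.

```python
import sqlite3

# Hypothetical schema (an assumption, not taken from the paper): one row per
# crawled page with a topic label such as a classifier might assign, and one
# row per hyperlink with the ISO date on which the link was observed.
SCHEMA = """
CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT NOT NULL);
CREATE TABLE link (src  TEXT NOT NULL REFERENCES page(url),
                   dst  TEXT NOT NULL REFERENCES page(url),
                   seen TEXT NOT NULL);
"""


def count_topic_links(conn, src_topic, dst_topic, since):
    """Count links from pages labeled src_topic to pages labeled dst_topic
    that were observed on or after `since` (an ISO 'YYYY-MM-DD' string)."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link
        JOIN page AS s ON s.url = link.src
        JOIN page AS d ON d.url = link.dst
        WHERE s.topic = ? AND d.topic = ? AND link.seen >= ?
        """,
        (src_topic, dst_topic, since),
    ).fetchone()
    return row[0]


def build_demo():
    """Create an in-memory database with a few made-up pages and links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO page VALUES (?, ?)", [
        ("http://example.org/epa", "environmental protection"),
        ("http://example.org/oil", "oil and natural gas"),
        ("http://example.org/gas", "oil and natural gas"),
        ("http://example.org/misc", "other"),
    ])
    conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
        # Matches both topics and falls inside the date window.
        ("http://example.org/epa", "http://example.org/oil", "1999-03-01"),
        # Right topics, but observed before the window starts.
        ("http://example.org/epa", "http://example.org/gas", "1997-01-15"),
        # Target page has a different topic.
        ("http://example.org/epa", "http://example.org/misc", "1999-04-01"),
        # Source page has a different topic.
        ("http://example.org/misc", "http://example.org/oil", "1999-05-01"),
    ])
    return conn


if __name__ == "__main__":
    demo = build_demo()
    print(count_topic_links(demo, "environmental protection",
                            "oil and natural gas", "1998-09-01"))
```

Storing dates as ISO strings keeps the "over the last year" filter a plain string comparison; a production schema would more likely use a native date type.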

Copyright © 1999 by the VLDB Endowment. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by the permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Printed Edition

Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (Eds.): VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK. Morgan Kaufmann 1999, ISBN 1-55860-615-7

References

[1]: Chidanand Apté, Fred Damerau, Sholom M. Weiss: Automated Learning of Decision Rules for Text Categorization. ACM Trans. Inf. Syst. 12(3): 233-251(1994)
[2]: ...
[3]: Krishna Bharat, Andrei Z. Broder: A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines. Computer Networks 30(1-7): 379-388(1998)
[4]: Krishna Bharat, Monika Rauch Henzinger: Improved Algorithms for Topic Distillation in a Hyperlinked Environment. SIGIR 1998: 104-111
[5]: Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks 30(1-7): 107-117(1998)
[6]: Soumen Chakrabarti, Byron Dom, Rakesh Agrawal, Prabhakar Raghavan: Scalable Feature Selection, Classification and Signature Generation for Organizing Large Text Databases into Hierarchical Topic Taxonomies. VLDB J. 7(3): 163-178(1998)
[7]: Soumen Chakrabarti, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, Jon M. Kleinberg: Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Computer Networks 30(1-7): 65-74(1998)
[8]: Soumen Chakrabarti, Byron Dom, Piotr Indyk: Enhanced Hypertext Categorization Using Hyperlinks. SIGMOD Conference 1998: 307-318
[9]: ...
[10]: ...
[11]: Donald D. Chamberlin: A Complete Guide to DB2 Universal Database. Morgan Kaufmann 1998, ISBN 1-55860-482-0
[12]: ...
[13]: Junghoo Cho, Hector Garcia-Molina, Lawrence Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172(1998)
[14]: William W. Cohen: Fast Effective Rule Induction. ICML 1995: 115-123
[15]: ...
[16]: Paul De Bra, R. D. J. Post: Information Retrieval in the World-Wide Web: Making Client-Based Searching Feasible. Computer Networks and ISDN Systems 27(2): 183-192(1994)
[17]: Susan T. Dumais, John C. Platt, David Heckerman, Mehran Sahami: Inductive Learning Algorithms and Representations for Text Categorization. CIKM 1998: 148-155
[18]: Roy Goldman, Narayanan Shivakumar, Suresh Venkatasubramanian, Hector Garcia-Molina: Proximity Search in Databases. VLDB 1998: 26-37
[19]: Joachim Hammer, Hector Garcia-Molina, Kelly Ireland, Yannis Papakonstantinou, Jeffrey D. Ullman, Jennifer Widom: Information Translation, Mediation, and Mosaic-Based Browsing in the TSIMMIS System. SIGMOD Conference 1995: 483
[20]: Thorsten Joachims, Dayne Freitag, Tom M. Mitchell: Web Watcher: A Tour Guide for the World Wide Web. IJCAI (1) 1997: 770-777
[21]: ...
[22]: Thomas Kistler, Hannes Marais: WebL - A Programming Language for the Web. Computer Networks 30(1-7): 259-270(1998)
[23]: Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. SODA 1998: 668-677
[24]: David Konopnicki, Oded Shmueli: Information Gathering in the World-Wide Web: The W3QL Query Language and the W3QS System. ACM Trans. Database Syst. 23(4): 369-410(1998)
[25]: ...
[26/27]: Alberto O. Mendelzon, Tova Milo: Formal Models of Web Queries. PODS 1997: 134-143
[28]: ...
[29]: Wayne Niblack, Xiaoming Zhu, James L. Hafner, Thomas M. Breuel, Dulce B. Ponceleon, Dragutin Petkovic, Myron Flickner, Eli Upfal, Sigfredo I. Nin, Sanghoon Sull, Byron Dom, Boon-Lock Yeo, Savitha Srinivasan, Dan Zivkovic, Mike Penner: Updates to the QBIC System. Storage and Retrieval for Image and Video Databases (SPIE) 1998: 150-161
[30]: ...
[31]: Jacques Savoy: An Extended Vector-Processing Scheme for Searching Information in Hypertext Systems. Inf. Process. Manage. 32(2): 155-170(1996)
[32]: Loren G. Terveen, William C. Hill: Finding and Visualizing Inter-Site Clan Graphs. CHI 1998: 448-455
[33]: ...