Integror

Integror is a department-internal project focused on different levels of data and information integration on the Web. Our interests include

  • integration of structured and semi-structured data from information-intensive Web and Deep Web sources,
  • as well as intuitive and semi-automatic visual integration of unstructured content blocks.

The integration task rests on two innovative and robust formalisms: the first addresses content within a page, and the second describes navigational paths.

The first formalism is based on relative XPath addressing and is visibly more robust than state-of-the-art schemes based on absolute XPath. Applying it to the information integration task made it possible to create myPortal, an intuitive, user-friendly and robust application for building personalized portals from logical content blocks extracted from pre-defined pages. With myPortal, two mouse clicks are enough to create a content block extraction rule; the extracted blocks can then be composed into an integrated information view (a personalized Web portal). The method has proved highly resilient to changes. Several publications (including demonstrations at the VLDB and WWW conferences) describe myPortal in more detail.
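The difference between position-based and reference-point-based addressing can be illustrated with a minimal sketch. It is not the myPortal formalism itself, only an approximation using the limited XPath subset of Python's standard library; the sample pages, the "Weather" heading used as a reference point, and the helper name extract are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before a redesign.
PAGE_V1 = """<html><body>
<div><h2>News</h2><p>Headline A</p></div>
<div><h2>Weather</h2><p>Sunny, 21 C</p></div>
</body></html>"""

# The same hypothetical page after a new block was inserted at the top.
PAGE_V2 = """<html><body>
<div><h2>Ads</h2><p>Buy now</p></div>
<div><h2>News</h2><p>Headline B</p></div>
<div><h2>Weather</h2><p>Rainy, 15 C</p></div>
</body></html>"""

# Position-based path from the document root (absolute-style addressing).
ABSOLUTE = "./body/div[2]/p"
# Path anchored on a stable reference point: the block's own heading text.
RELATIVE = ".//div[h2='Weather']/p"


def extract(page, path):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else None


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name, "absolute:", extract(page, ABSOLUTE))
        print(name, "relative:", extract(page, RELATIVE))
```

On the first page both paths select the weather block. After the redesign the position-based path silently returns the news block, while the path anchored on the heading still finds the intended content.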

The second formalism, based on an FSA (finite-state automaton) description of user navigation and on the pumping lemma, together with relative XPath gave rise to the DWDI (Deep Web Data Integration) application. It allows integration of data from semi-structured and structured Web sources that are accessed through navigation (browsing) or through forms. With DWDI, a recorded user navigation path can be used to create a description of the navigation pattern of a Web or Deep Web source, and relative XPath is used to describe the location of data blocks on the page.
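As a rough illustration of turning a recorded navigation path into an automaton, the sketch below folds consecutive repeats of the same action into a self-loop, so that a step seen twice in the recording may occur any number of times. How DWDI actually builds its automata is not described here; the folding rule, the action names and the function names are all assumptions chosen for the example.

```python
from typing import Dict, List, Tuple

Transitions = Dict[Tuple[int, str], int]


def build_automaton(recorded: List[str]) -> Tuple[Transitions, int]:
    """Build a deterministic automaton from one recorded action sequence.

    Assumption for this sketch: an action repeated consecutively in the
    recording becomes a self-loop, i.e. it may repeat one or more times.
    Returns the transition table and the accepting state.
    """
    transitions: Transitions = {}
    state = 0
    previous = None
    for action in recorded:
        if action == previous:
            # Repeated step: stay in the same state (loop).
            transitions[(state, action)] = state
        else:
            # New step: move forward to a fresh state.
            transitions[(state, action)] = state + 1
            state += 1
        previous = action
    return transitions, state


def accepts(transitions: Transitions, accepting: int, path: List[str]) -> bool:
    """Check whether a navigation path follows the learned pattern."""
    state = 0
    for action in path:
        key = (state, action)
        if key not in transitions:
            return False
        state = transitions[key]
    return state == accepting


if __name__ == "__main__":
    # Hypothetical recording: submit a search form, page through the
    # results twice, then open a detail page.
    recording = ["submit_form", "next_page", "next_page", "open_detail"]
    table, final = build_automaton(recording)
    print(accepts(table, final, ["submit_form"] + ["next_page"] * 5 + ["open_detail"]))
    print(accepts(table, final, ["open_detail"]))
```

The learned pattern accepts any number of result pages between the form submission and the detail page, but rejects paths that skip the form or end before reaching a detail page.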

Current research aims to create mechanisms for more robust and adaptive addressing of different types of Web objects in a dynamic environment (e.g. one in which the structures of Web sites and Web pages change). The directions include:

  • use of visual characteristics-based and 2D addressing of content blocks in Web pages,
  • automatic detection of relative XPath reference points,
  • handling of conflicts between multiple addresses of the same Web object,
  • as well as work on the ability to operate in the presence of technical problems (non-standards-compliant code, 404 errors, etc.).

The project's research also includes surveys on the applications and business models of enhanced Web object addressing, on the nature and topicality of Deep Web sources, and on visual and navigational ways of presenting database content on the Web. Future plans also include using the experience of the F-Webs project to implement QoS-based schemes for evaluating and selecting Web sources.