Project description

The Internet has become an economical medium for publishing and distributing information on a large scale. Internet publishing techniques can be distinguished by the control they give clients over the origin and quality of the published information, by their precision in reaching only those clients interested in that information, and by their "publication lag", i.e. the time necessary for the information to reach these clients. For example, "spam"-based publishing is generally uncontrolled and imprecise, but has a short publication lag. News forums improve precision but need to be moderated in order to guarantee the origin and quality of the published information. The origin and quality of web page information are guaranteed by the address (URL) of the producing web site, but web pages suffer from a substantial publication lag due to the low refresh rate of web search engines.

In order to reduce the time needed for information published on a web site to reach interested users, more and more web sites apply web syndication techniques to publish their contents. These techniques consist in publishing new information in the form of web feeds or blogs, to which interested users actively subscribe. They reduce the publication lag of web information and allow users to create a personal information space for observing the evolution of well-defined information sources.
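To make the mechanism concrete, the sketch below shows a minimal RSS 2.0 feed and how a subscribing client extracts its items. It is written in Python using the standard xml.etree.ElementTree module; the feed contents are invented for illustration, only the element structure is standard RSS 2.0.

    import xml.etree.ElementTree as ET

    # A minimal RSS 2.0 document as a web site might publish it (the
    # contents are invented; the element structure is standard RSS 2.0).
    FEED = """<?xml version="1.0"?>
    <rss version="2.0">
      <channel>
        <title>Example News</title>
        <link>http://example.org/</link>
        <item>
          <title>First posting</title>
          <link>http://example.org/posts/1</link>
          <pubDate>Tue, 04 Apr 2006 10:00:00 GMT</pubDate>
        </item>
      </channel>
    </rss>"""

    # A subscriber periodically fetches this document and reads the new items.
    channel = ET.fromstring(FEED).find("channel")
    for item in channel.findall("item"):
        print(item.findtext("pubDate"), "-", item.findtext("title"))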

Whereas web content syndication can be considered a new and efficient way of sharing information on the web, it also suffers from well-known problems related to the scale of the web. The number of web feeds and blogs is constantly growing, which creates new issues in feed management and feed aggregation. Specialized web syndication portals like Blastfeed.com, Plazoo.com and Technorati.com try to solve some of these problems by collecting and aggregating web feed data. One goal of these portals is to index feed data (similarly to search engines for standard web resources) using efficient refresh algorithms that reduce the publication lag mentioned above. For example, the number of feeds indexed by http://technorati.com/ doubles approximately every six months; it reached 36 million feeds in April 2006, with about 50,000 postings observed per hour (http://technorati.com/weblog/2006/04/96.html).
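The refresh algorithms themselves are not specified here; as a purely illustrative baseline (the feed names and figures below are invented), a portal with a fixed polling budget could allocate polls in proportion to each feed's observed posting rate, which directly bounds the expected lag between posting and indexing:

    # Hypothetical posting rates (items/hour) estimated from past fetches.
    posting_rate = {"feedA": 40.0, "feedB": 8.0, "feedC": 2.0}

    TOTAL_POLLS_PER_HOUR = 100  # the portal's global refresh budget

    # Simple policy: split the polling budget proportionally to the posting
    # rates, so busy feeds are refreshed more often.  A feed polled p times
    # per hour has an expected lag of 1/(2p) hours, i.e. 30/p minutes.
    total_rate = sum(posting_rate.values())
    for feed, rate in sorted(posting_rate.items()):
        polls = TOTAL_POLLS_PER_HOUR * rate / total_rate
        print(f"{feed}: {polls:.0f} polls/hour, expected lag {30 / polls:.1f} min")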

The goal of the ROSES project is to apply and evaluate modern data management technology in the context of web syndication. The proposed approach is based on the observation that web content syndication can be considered a large-scale distributed XML data management problem:

  1. The two main web feed formats are RSS and Atom, and both use XML as their publishing syntax. We intend to exploit and adapt existing XML data management technology, such as XML data warehouses and standard XML query languages (XQuery/XPath), for defining and implementing advanced syndication services (publish, filter, aggregate).
  2. Web content syndication consists in observing and aggregating large volumes of evolving distributed XML data. Existing web syndication portals and interfaces are based on a centralized architecture and must be able to support high refresh and aggregation workloads. In this project we intend to apply and extend existing query evaluation and optimization techniques for distributed data in the context of web syndication. In particular, we will study the case of a distributed P2P syndication infrastructure.
  3. Currently proposed RSS feed aggregation services are still very limited and essentially consist in keyword-based filtering, concatenation and time-stamp based reordering of several feeds (see the sketch after this list). One goal of the project is to propose new, more advanced aggregation services based on XML data integration techniques.
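As a reference point for item 3, the following Python sketch implements exactly this baseline (keyword filtering, concatenation and time-stamp based reordering) over RSS 2.0 documents; the function names are ours, and the path expression ".//item" illustrates the kind of XPath query mentioned in item 1.

    import xml.etree.ElementTree as ET
    from email.utils import parsedate_to_datetime

    def items(feed_xml):
        """Extract (date, title, link) triples from one RSS 2.0 document,
        using the XPath subset supported by ElementTree."""
        return [(parsedate_to_datetime(i.findtext("pubDate")),
                 i.findtext("title"),
                 i.findtext("link"))
                for i in ET.fromstring(feed_xml).findall(".//item")]

    def aggregate(feeds, keyword):
        """Baseline aggregation: concatenate several feeds, keep only the
        items whose title contains the keyword, and reorder the result by
        time stamp (newest first)."""
        merged = [it for f in feeds for it in items(f)]
        matching = [it for it in merged if keyword.lower() in it[1].lower()]
        return sorted(matching, key=lambda it: it[0], reverse=True)

Anything beyond this baseline, such as joining items across feeds, eliminating duplicates or restructuring their content, requires the XML data integration techniques targeted by the project.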