The Internet has become an economical medium for publishing and distributing information on a large scale. Internet publishing techniques can be distinguished by the client's control over the origin and quality of the published information, their precision in reaching only clients interested in that information, and their “publication lag”, i.e. the time needed to reach these clients. For example, “spam”-based publishing is generally uncontrolled and imprecise, and has a short publication lag. News forums improve precision but need to be moderated in order to guarantee the origin and the quality of the published information. The origin and quality of web page information are guaranteed by the address (URL) of the producing web site, but web pages suffer from a significant publication lag due to the low refresh rate of web search engines.
In order to reduce the time needed for information published on a web site to reach interested users, more and more web sites apply web syndication techniques to publish their content. These techniques consist in publishing new information in the form of web feeds or blogs to interested users who actively subscribe to them. They reduce the publication lag of web information and allow users to create a personal information space that tracks the evolution of well-defined information sources.
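To make the subscription model concrete, the following is a minimal sketch of how a subscriber might detect new entries in a feed. It assumes an RSS 2.0 style document in which each item carries a `guid` element; the sample feed, the function name `new_items` and the in-memory set of seen identifiers are illustrative assumptions, not part of any system described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example site</title>
    <item><guid>post-1</guid><title>First post</title></item>
    <item><guid>post-2</guid><title>Second post</title></item>
    <item><guid>post-3</guid><title>Third post</title></item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_guids):
    """Return (guid, title) pairs for items not seen before.

    seen_guids is updated in place, so polling the same feed content
    again yields no further items.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        guid = item.findtext("guid")
        # Items without a guid are skipped in this simplified sketch.
        if guid is not None and guid not in seen_guids:
            fresh.append((guid, item.findtext("title", default="")))
            seen_guids.add(guid)
    return fresh


if __name__ == "__main__":
    seen = {"post-1"}  # the subscriber has already read the first post
    for guid, title in new_items(SAMPLE_FEED, seen):
        print(guid, title)
```

In such a scheme the publication lag seen by the subscriber is bounded by the interval between two polls of the feed, rather than by the refresh rate of a search engine.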
Although web content syndication can be considered a new and efficient way of sharing information on the web, it also suffers from well-known problems related to the large scale of the web. The number of web feeds and blogs is constantly growing, which creates new issues in feed management and feed aggregation. Specialized web syndication portals like Blastfeed.com, Plazoo.com and Technorati.com try to solve some of these problems by collecting and aggregating web feed data. One goal of these portals is to index feed data (as search engines do for standard web resources) based on efficient refresh algorithms, in order to reduce the publication lag mentioned above. For example, the number of feeds indexed by http://technorati.com/ doubles approximately every six months; it reached 36*10^6 feeds in April 2006, with about 50*10^3 postings per hour (http://technorati.com/weblog/2006/04/96.html).
The goal of the ROSES project is to apply and evaluate modern data management technology in the context of web syndication. The proposed approach is based on the observation that web content syndication can be considered a large-scale distributed XML data management problem: