Bases de Données / Databases

Challenges and issues

The ROSES project aims to define a set of web resource syndication services and tools for locating, integrating, querying and composing RSS feeds distributed on the Web. We distinguish between two kinds of services :

Feed Management Services :
1. catalogue
2. acquisition, storage and refresh
3. basic filtering (keyword) and notification
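
As an illustration of basic keyword filtering, the following minimal sketch selects the items of a feed whose title contains a given keyword. It assumes a namespace-free RSS 2.0 layout (`rss/channel/item`); the sample document, the `example.org` links and the function name `filter_items` are hypothetical and not part of the ROSES design. It relies only on the limited XPath subset offered by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed (assumed RSS 2.0 layout, no namespaces).
RSS_SAMPLE = """
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>New XQuery engine released</title>
      <link>http://example.org/xquery</link>
    </item>
    <item>
      <title>Weather report</title>
      <link>http://example.org/weather</link>
    </item>
  </channel>
</rss>
"""


def filter_items(rss_text, keyword):
    """Return (title, link) pairs for items whose title contains the keyword."""
    root = ET.fromstring(rss_text)
    keyword = keyword.lower()
    hits = []
    # "./channel/item" is a simple path expression relative to the <rss> root.
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="")
        if keyword in title.lower():
            hits.append((title, item.findtext("link", default="")))
    return hits


matching = filter_items(RSS_SAMPLE, "xquery")
```

A notification service could then forward the entries of `matching` to the subscribers who registered the keyword; how subscriptions are stored and delivered is left open here.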

Feed Composition Services :
advanced filtering and querying (temporal, structured, multi-channel)
feed aggregation and data integration views
ranking and top-k queries

Technical and scientific challenges

Although RSS documents can be considered a special kind of XML-RDF document that can be queried with any existing XML (XQuery, XPath) or RDF (SPARQL) query language¹⁾, the combination of RSS syndication, XML query processing, and distributed data and query processing creates new technical and scientific challenges that we intend to tackle in this project :

RSS data model and algebra : RSS feeds are ordered sequences of XML documents encoding a flow of time-stamped messages called items. Each item generally (but not necessarily) annotates an “external” web resource identified by a URL. From this point of view, aggregating RSS feeds corresponds to querying (virtually infinite) sequences of items that are temporarily available in specific XML documents identified by a feed address. An important objective of this project is to define a formal RSS data model and algebra with precise semantics in terms of operations on time-stamped XML document sequences.
RSS feed management : We will study the definition and implementation of basic RSS feed management services (store, refresh, filter, notify) based on the RSS data model and on existing technology for storing and querying (XPath/XQuery) XML documents. In particular, we intend to evaluate and extend an existing RSS aggregation system (Blastfeed) built on top of an XML web data warehouse (Xyleme).
Distribution and optimization : RSS syndication is generally implemented as a traditional two- or three-level client/server architecture in which RSS feeds are aggregated directly by the user client or indirectly by an intermediate web portal. While this kind of architecture might be sufficient for many use cases, we believe that RSS syndication “naturally” fits into a completely distributed architecture connecting clients, feed producers and feed aggregators. Distribution brings many well-known advantages (resource sharing and load balancing, high availability through replication, …) to RSS syndication applications, provided that it is combined with efficient data replication and query evaluation strategies. One challenge in this project will be to study various optimization problems related to the distributed storage and aggregation of RSS feeds. The proposed techniques will build on existing data replication, load balancing and query optimization techniques for distributed XML data. In particular, we will study these problems in the context of a P2P architecture, where each peer might play the role of a client, feed producer and feed aggregator.
Dynamic feed aggregation and generation : Feed aggregation consists in choosing and merging RSS feeds. This process might be guided by specific user interests (profile), local data and ranking scores (credibility, relevance, importance). Based on a well-defined model and query language for simultaneously querying XML data and RSS feeds, it is possible to define new, powerful RSS feed aggregation and data management services. In particular, it is possible to define dynamic personalized RSS/XML views that filter and compose existing feeds and external data. For example, personalized feed aggregation might consist in joining incoming RSS feeds with user profiles stored in the form of simple XML documents. The same mechanism can be used to generate new enriched feeds from existing feeds and external data. Feed-specific relevance and popularity scores make it possible to rank feeds and their items and to apply top-k query processing algorithms.
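
To make the notions of time-stamped item sequences, profile-based aggregation and top-k ranking more concrete, here is a minimal sketch in Python. It is not the ROSES data model or algebra, which remain to be defined: the `Item` class, the three operators, the keyword-set profile and the per-feed scores are all assumptions chosen for illustration, and the feeds are assumed to be finite and already sorted by publication time.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Item:
    """A time-stamped message belonging to a feed (hypothetical model)."""
    feed: str
    published: datetime
    title: str
    link: str = ""


def merge_feeds(*feeds: Iterable[Item]) -> Iterator[Item]:
    """Merge feeds, each sorted by time, into one time-ordered sequence."""
    return heapq.merge(*feeds, key=lambda item: item.published)


def matches_profile(item: Item, keywords: Iterable[str]) -> bool:
    """Keep an item if its title shares at least one word with the profile."""
    words = set(item.title.lower().split())
    return bool(words & {k.lower() for k in keywords})


def top_k(items: Iterable[Item], k: int, feed_scores: Dict[str, float]) -> List[Item]:
    """Return the k best items, ranked by feed score, then by recency."""
    return heapq.nlargest(
        k, items,
        key=lambda item: (feed_scores.get(item.feed, 0.0), item.published),
    )


# Hypothetical sample data: two small feeds, a user profile, feed scores.
news = [
    Item("news", datetime(2008, 1, 1, 9), "XML query engines compared"),
    Item("news", datetime(2008, 1, 1, 12), "Football results"),
]
blog = [
    Item("blog", datetime(2008, 1, 1, 10), "Distributed RSS aggregation"),
    Item("blog", datetime(2008, 1, 1, 11), "Holiday photos"),
]
profile = {"xml", "rss"}
scores = {"news": 0.4, "blog": 0.9}

merged = list(merge_feeds(news, blog))
selected = [item for item in merged if matches_profile(item, profile)]
best = top_k(selected, 1, scores)
```

In this sketch the profile is a plain set of keywords; in the setting described above it would instead be an XML document joined with the incoming feeds by a declarative query, and the feeds would be unbounded streams rather than in-memory lists.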

Summary of key issues

This project addresses the following technical and scientific key issues combining XML query processing, distribution and information flows :

XML-RSS data model and algebra
Declarative (query-based) RSS feed aggregation
Distributed XML query processing
Data replication and load balancing
XML-RSS syndication views
RSS feed and item ranking models and algorithms

Economic benefits and issues

Like search engines, which already play an important role in the modern information society, web syndication is gaining more and more economic importance. One explanation for the success of web content syndication is the observation that a large share of the information published on the web consists of time-stamped, uniquely identified chunks of data with metadata (news stories, uploaded photos, events, podcasts, wiki changes, source code changes, bug reports). The possibility to create, observe and aggregate well-defined information channels on the web makes it possible to reduce the distance (cost, time, effort) between information producers and information consumers at web scale :

Media companies (TV, radio, press) use web syndication to publish their content to their clients, who can build a personalized media space by choosing and aggregating topic-specific feeds according to their interests.
Electronic commerce applications use web syndication to link products to potential clients who “actively” choose to be informed about changes to existing products and the appearance of new products in the catalogue²⁾.
Electronic auction systems like eBay allow clients to be informed about the bidding process for objects they are interested in.
More generally, RSS syndication can be used for observing, filtering and aggregating “external” web information relevant to a specific economic domain (technology watch, economic intelligence).

The ROSES project aims at developing a flexible and efficient web syndication model for building this kind of application. Flexibility and efficiency are achieved by a high-level syndication model based on declarative languages and distributed XML data management technology.

Contribution with respect to the ANR call for projects

This proposal addresses several priorities and objectives mentioned in the MDCO programme of the ANR call for projects. The main objective is to develop a web information management infrastructure combining distributed XML data management and RSS web resource syndication. The project takes into account several important dimensions of web information :

Distribution and volume : The increasing number of web sites and web resources

RSS syndication is used for reducing the “publication lag” of web resources. RSS feeds are XML documents distributed all over the web, and the number of RSS feeds is growing every day.

Flexibility :

The main research topics of this project are mentioned in “Axe 2 : Algorithmes pour le traitement massif de données” (Axis 2: Algorithms for massive data processing, page 8) :

XML, blogs, discussion threads
distributed data (web, P2P)
distributed query processing and optimization
data streams (taking evolution into account)
time scales

Expected results and contributions

The main expected results and contributions are :

a general web syndication infrastructure for describing different centralized and distributed syndication scenarios
a formal XML-based RSS feed model and algebra
a programming API and environment providing basic feed management services
distributed query evaluation, data replication and load balancing algorithms for RSS feed data
a high-level view language for dynamic RSS feed generation, enrichment and personalisation
new efficient filtering and ranking techniques for RSS feed data

Summary

The following figure summarizes the technical and scientific contribution of the proposed project and its integration with existing technology. The right part of the figure (gray) shows the simplest form of RSS-based web resource syndication: a feed is an evolving XML document downloaded by a specialized user interface (RSS reader). The rest of the figure shows the architectures we will study in our proposal.

¹⁾

Formally speaking, RSS feeds follow the RDF model for semantic web graphs, but the XML representation of “RSS graphs” can naturally be queried with any XML query language (no complex semantic path expressions are needed).

²⁾

A “selling” argument for RSS feeds is that they can be considered a controlled, unobtrusive alternative to “electronic advertisement” : instead of filling up the mailboxes of many uninterested users (who very rapidly get upset by the received spam), RSS follows the publish/subscribe principle, where users actively express their interest in some information from some well-defined source without being “drowned” by useless messages.
