The ROSES project aims to define a set of web resource syndication services and tools for
locating, integrating, querying and composing RSS feeds distributed on the Web. We distinguish between two kinds of services:
Feed Management Services:
catalogue
acquisition, storage and refresh
basic filtering (keyword) and notification
Feed Composition Services:
advanced filtering and querying (temporal, structured, multi-channel)
feed aggregation and data integration views
ranking and top-k queries
Although RSS documents can be considered a special kind of XML-RDF document that can be queried with any
existing XML (XQuery, XPath) or RDF (SPARQL) query language, the combination of RSS syndication, XML query processing and distributed data and query processing creates new technical and scientific challenges that we intend to tackle in this project:
RSS data model and algebra:
RSS feeds are ordered sequences of XML documents encoding a flow of time-stamped messages called items. Each item generally (but not necessarily) annotates an “external” web resource identified by a URL. From this point of view, aggregating RSS feeds corresponds to querying (virtually infinite) sequences of items that are temporarily available in specific XML documents identified by a feed address. An important objective of this project is the definition of a formal RSS data model and algebra with precise semantics in terms of operations on time-stamped XML document sequences.
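To make the idea of an algebra over time-stamped item sequences more concrete, the following minimal sketch shows three candidate operators in Python. It is an illustration only, not the model the project will define: the `Item` schema and the operator names `select`, `window` and `union` are assumptions introduced here. Feeds are represented as iterables in ascending timestamp order, so the operators also work lazily on unbounded sequences.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from heapq import merge
from typing import Callable, Iterable, Iterator, Optional


@dataclass(frozen=True)
class Item:
    """One time-stamped message of a feed (hypothetical minimal schema)."""
    published: datetime
    title: str
    link: Optional[str] = None  # the annotated "external" resource, if any


# Assumption: a feed is an iterable of items in ascending timestamp order.
Feed = Iterable[Item]


def select(feed: Feed, predicate: Callable[[Item], bool]) -> Iterator[Item]:
    """Selection: keep the items satisfying the predicate, preserving order."""
    return (item for item in feed if predicate(item))


def window(feed: Feed, start: datetime, end: datetime) -> Iterator[Item]:
    """Temporal window: keep the items with start <= published < end."""
    return select(feed, lambda item: start <= item.published < end)


def union(*feeds: Feed) -> Iterator[Item]:
    """Order-preserving union: lazily merge time-ordered feeds into one."""
    return merge(*feeds, key=lambda item: item.published)


def day(n: int) -> datetime:
    """Helper for the example data below."""
    return datetime(2007, 1, n, tzinfo=timezone.utc)


news = [Item(day(1), "XML news", "http://example.org/a"), Item(day(3), "RSS news")]
blog = [Item(day(2), "P2P post")]

merged = list(union(news, blog))
recent = list(window(union(news, blog), day(2), day(4)))
```

Because every operator takes feeds and returns a feed, the operators compose freely, which is the property one would expect from an algebra of this kind.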
RSS feed management: We will study the definition and implementation of basic RSS feed management services (store, refresh, filter, notify) based on the RSS data model and on existing technology for storing and querying (XPath/XQuery) XML documents. In particular, we intend to evaluate and extend an existing RSS aggregation system (Blastfeed) built on top of an XML web data warehouse (Xyleme).
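The four basic services (store, refresh, filter, notify) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the Blastfeed or Xyleme implementation: `FeedStore`, `refresh_and_notify` and the sample document are hypothetical, storage is an in-memory dictionary, and the feed is assumed to be RSS 2.0 with a `guid` element on each item. Item lookup uses the limited XPath subset supported by `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><guid>1</guid><title>XQuery tutorial</title></item>
  <item><guid>2</guid><title>Weather report</title></item>
</channel></rss>"""


class FeedStore:
    """In-memory stand-in for a persistent item store (hypothetical class)."""

    def __init__(self) -> None:
        self.items: Dict[str, str] = {}  # guid -> title

    def refresh(self, rss_text: str) -> List[str]:
        """Parse a feed document, store unseen items, return their guids."""
        root = ET.fromstring(rss_text)
        new_guids = []
        for item in root.iterfind("./channel/item"):  # ElementTree XPath subset
            guid = item.findtext("guid")
            title = item.findtext("title", default="")
            if guid is not None and guid not in self.items:
                self.items[guid] = title
                new_guids.append(guid)
        return new_guids


def refresh_and_notify(store: FeedStore, rss_text: str, keyword: str,
                       notify: Callable[[str], None]) -> None:
    """Refresh the store, then notify about new items matching the keyword."""
    for guid in store.refresh(rss_text):
        title = store.items[guid]
        if keyword.lower() in title.lower():  # basic keyword filter
            notify(title)


store = FeedStore()
notified: List[str] = []
refresh_and_notify(store, SAMPLE_RSS, "xquery", notified.append)
```

Refreshing the same document a second time returns no new items, because the store remembers the guids it has already seen.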
Distribution and optimization:
RSS syndication is generally implemented as a traditional two- or three-tier client/server architecture in which RSS feeds are aggregated directly by the user client or indirectly by an intermediate web portal. Although this kind of architecture might be sufficient for many use cases, we believe that RSS syndication “naturally” fits a completely distributed architecture connecting clients, feed producers and feed aggregators. Distribution brings many well-known advantages (resource sharing and load balancing, high availability through replication, …) to RSS syndication applications, provided it is combined with efficient data replication and query evaluation strategies. One challenge in this project will be to study various optimization problems related to the distributed storage and aggregation of RSS feeds. The proposed techniques will be based on existing data replication, load balancing and query optimization techniques for distributed XML data. In particular, we will study these problems in the context of a P2P architecture, where each peer might play the role of a client, a feed producer and a feed aggregator.
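As one example of an existing replication and load balancing technique that could serve as a starting point, the sketch below assigns each feed to a fixed number of peers using rendezvous (highest-random-weight) hashing. This is not a technique the proposal commits to; the function name `replica_peers`, the peer identifiers and the feed URL are assumptions made for illustration.

```python
import hashlib
from typing import List


def replica_peers(feed_url: str, peers: List[str], replicas: int = 2) -> List[str]:
    """Pick the peers that store a feed, using rendezvous hashing.

    Every peer gets a pseudo-random weight derived from (peer, feed_url);
    the peers with the highest weights hold the replicas.
    """
    def weight(peer: str) -> int:
        digest = hashlib.sha256(f"{peer}|{feed_url}".encode("utf-8")).digest()
        return int.from_bytes(digest, "big")

    return sorted(peers, key=weight, reverse=True)[:replicas]


peers = ["peer-a", "peer-b", "peer-c", "peer-d"]
url = "http://example.org/feed.rss"
chosen = replica_peers(url, peers)
```

Any peer can compute the placement locally from the peer list, and when a peer that does not hold the feed leaves the network, the placement of that feed is unchanged.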
Dynamic feed aggregation and generation: Feed aggregation consists of choosing and merging RSS feeds. This process might be guided by specific user interests (profile), local data and ranking scores (credibility, relevance, importance). Based on a well-defined model and query language for simultaneously querying XML data and RSS feeds, it is possible to define new, powerful RSS feed aggregation and data management services. In particular, it is possible to define dynamic personalized RSS/XML views that filter and compose existing feeds and external data. For example, personalized feed aggregation might consist of joining incoming RSS feeds with user profiles stored in the form of simple XML documents. The same mechanism can be used to generate new enriched feeds from existing feeds and external data. Feed-specific relevance and popularity scores make it possible to rank feeds and their items and to apply top-k query processing algorithms.
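The join between incoming items and a user profile, followed by a top-k selection, can be sketched as follows. The scoring function is a deliberately naive assumption (number of profile keywords found in the title, multiplied by a per-feed score); `FeedItem`, `personalized_top_k` and the sample data are hypothetical, and a real system would read profiles from XML documents rather than from a Python set.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class FeedItem:
    feed: str   # identifier of the source feed
    title: str


def personalized_top_k(items: Iterable[FeedItem], profile_keywords: Set[str],
                       feed_scores: Dict[str, float],
                       k: int) -> List[Tuple[float, FeedItem]]:
    """Join items with a keyword profile and return the k best-scored ones."""
    def score(item: FeedItem) -> float:
        words = set(item.title.lower().split())
        relevance = len(words & profile_keywords)  # profile/item join
        return relevance * feed_scores.get(item.feed, 0.0)

    scored = [(score(item), item) for item in items]
    matching = [(s, item) for s, item in scored if s > 0]
    # nlargest avoids sorting the whole input when k is small.
    return heapq.nlargest(k, matching, key=lambda pair: pair[0])


items = [
    FeedItem("a", "XML query processing"),
    FeedItem("b", "XML news"),
    FeedItem("a", "Football results"),
]
profile = {"xml", "query"}
feed_scores = {"a": 0.5, "b": 0.9}

top = personalized_top_k(items, profile, feed_scores, k=2)
```

Items that match no profile keyword are dropped before ranking, so the result can contain fewer than k items.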
This project addresses the following key technical and scientific issues combining XML query processing, distribution and information flows:
XML-RSS data model and algebra
Declarative (query-based) RSS feed aggregation
Distributed XML query processing
Data replication and load balancing
XML-RSS syndication views
RSS feed and item ranking models and algorithms
Like search engines, which already play an important role in the modern information society, web syndication is becoming increasingly important economically. One explanation for the success of web content syndication is that a large amount of the information published on the web consists of time-stamped, uniquely identified chunks of data with metadata (news stories, uploaded photos, events, podcasts, wiki changes, source code changes, bug reports). The ability to create, observe and aggregate well-defined information channels on the web reduces the distance (cost, time, effort) between information producers and information consumers at web scale:
Media companies (TV, radio, press) use web syndication to publish their content to their clients, who can build a personalized media space by choosing and aggregating topic-specific feeds according to their interests.
Electronic commerce applications use web syndication to link products to potential clients who “actively” choose to be informed about changes to existing products and the appearance of new products in the catalogue.
Electronic auction systems like eBay allow clients to be informed about the bidding process for objects they are interested in.
More generally, RSS syndication can be used for observing, filtering and aggregating “external” web information related to a specific economic domain (technological and economic monitoring).
The ROSES project aims to develop a flexible and efficient web syndication model for building this kind of application.
Flexibility and efficiency are achieved through a high-level syndication model based on declarative languages and distributed XML data management technology.
This proposal addresses several priorities and objectives mentioned in the MDCO programme of the ANR call for projects.
The main objective is to develop a web information management infrastructure combining distributed XML data management and RSS web resource syndication. The project takes into account several important dimensions of web information:
Distribution and volume: The number of web sites and web resources is increasing. RSS syndication is used to reduce the “publication lag” of web resources. RSS feeds are XML documents distributed all over the web, and the number of RSS feeds is growing every day.
Flexibility:
The main research topics concerning this project are mentioned in “Axe 2 : Algorithmes pour le traitement massif de données” (Axis 2: Algorithms for massive data processing, page 8):
XML, blogs, discussion threads
distributed data (web, P2P)
distributed query processing and optimization
data streams (taking evolution into account)
temporal scales
The main expected results and contributions are:
a general web syndication infrastructure for describing different centralized and distributed syndication scenarios
a formal XML-based RSS feed model and algebra
a programming API and environment providing basic feed management services
distributed query evaluation, data replication and load balancing algorithms for RSS feed data
high-level view language for dynamic RSS feed generation, enrichment and personalisation
new efficient filtering and ranking techniques for RSS feed data
The following figure summarizes the technical and scientific contribution of the proposed project and its integration with existing technology. The right part of the figure (gray) shows the simplest form of RSS-based web resource syndication: a feed is an evolving XML document downloaded by a specialized user interface (RSS reader). The rest of the figure shows the architectures we will study in our proposal.