Bases de Données / Databases

Site Web de l'équipe BD du LIP6 / LIP6 DB Web Site

PhD subject

Supervisors: Bernd AMANN and Dan VODISLAV

This PhD thesis is proposed in the context of the Wisdom PPF (Programme Pluriformation) and will be supervised by Bernd AMANN (Professor at LIP6) and Dan VODISLAV (Assistant Professor at CNAM). The requested funding would cover three years of PhD work, for a total budget of 90 kE [check].

The subject mainly addresses the topics of work packages 2 and 5, i.e., the problem of modeling RSS feeds as an extension of XML data with temporal and dynamic features, and the problem of building tools and applications based on this model.

Dan: participation in other WPs

Subject: Models and applications for data syndication on the web

Context

To reduce the time needed for information published on a web site to reach interested users, more and more web sites use web syndication techniques to publish their content. These techniques consist in publishing new information in the form of web feeds or blogs to users who actively subscribe to them. They reduce the publication lag of web information and allow users to build a personal information space that tracks the evolution of well-defined information sources.

Although web content syndication can be seen as a new, efficient way of sharing information on the web, it also suffers from well-known problems related to the scale of the web. The number of web feeds and blogs is constantly growing, which creates new issues in feed management and feed aggregation. Specialized web syndication portals such as Blastfeed.com, Plazoo.com and Technorati.com try to solve some of these problems by collecting and aggregating web feed data. One goal of these portals is to index feed data (much as search engines index standard web resources), using efficient refresh algorithms to reduce the publication lag mentioned above.

We propose to handle the new issues in RSS syndication by considering web content syndication as a large-scale distributed XML data management problem:

  • All RSS formats use XML as their publishing syntax. Existing XML data management technologies (XML data warehouses, XML query languages such as XQuery/XPath) can therefore be adapted to define and implement advanced syndication services (publish, filter, aggregate).
  • XML data integration techniques can be used to define new, advanced aggregation services. The services currently offered are still very limited: they essentially consist in keyword-based filtering, concatenation and timestamp-based reordering of several feeds.
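To make the baseline concrete, the three basic services named above (keyword-based filtering, concatenation and timestamp-based reordering) can be sketched with standard XML tooling. This is only an illustration under stated assumptions: the two inline RSS 2.0 feeds, the function names `items` and `aggregate`, and the choice of Python with its standard library are all hypothetical and not part of the proposal.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Two tiny, made-up RSS 2.0 feeds used only for illustration.
FEED_A = """<rss version="2.0"><channel><title>A</title>
<item><title>XML views released</title><pubDate>Mon, 05 Mar 2007 10:00:00 GMT</pubDate></item>
<item><title>Weather report</title><pubDate>Mon, 05 Mar 2007 12:00:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>B</title>
<item><title>New XML query engine</title><pubDate>Mon, 05 Mar 2007 11:00:00 GMT</pubDate></item>
</channel></rss>"""


def items(feed_xml):
    """Return the <item> elements of one feed, selected with an XPath-like path."""
    root = ET.fromstring(feed_xml)
    return root.findall("./channel/item")


def aggregate(feeds, keyword):
    """Concatenate feeds, keep items whose title contains the keyword,
    and return the titles ordered from newest to oldest."""
    merged = [item for feed in feeds for item in items(feed)]  # concatenation
    kept = [
        item for item in merged
        if keyword.lower() in item.findtext("title", "").lower()  # keyword filter
    ]
    kept.sort(
        key=lambda item: parsedate_to_datetime(item.findtext("pubDate")),
        reverse=True,  # timestamp-based reordering, newest first
    )
    return [item.findtext("title") for item in kept]


result = aggregate([FEED_A, FEED_B], "xml")
print(result)
```

Everything here works on item titles and dates only. The integration-based services envisaged in the proposal would go beyond this kind of per-item filtering.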

Goals and roadmap

The first goal is to define a formal XML-RSS data model and algebra combining the semantics of RSS, XML and RDF. In particular, the model should be able to represent:

  • hierarchically structured XML contents (labeled ordered trees)
  • graph structured (RDF) metadata
  • data streams including temporal properties and relationships
  • annotation links between RSS/RDF metadata and the annotated resources.
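As a rough illustration of how these four ingredients might sit together in one structure, here is a minimal Python sketch. It is not the formal model the thesis aims to define: the class names `Triple` and `FeedItem`, the helpers `window` and `precedes`, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Triple:
    """One RDF-style statement (graph-structured metadata)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class FeedItem:
    """Hypothetical stream element combining the four aspects listed above."""
    content: ET.Element                    # hierarchical XML content (labeled ordered tree)
    published: datetime                    # temporal property of the stream element
    metadata: List[Triple] = field(default_factory=list)  # RDF metadata
    annotates: Optional[str] = None        # link to the annotated resource (a URI)


def window(stream, start, end):
    """Temporal selection: items published in the half-open interval [start, end)."""
    return [item for item in stream if start <= item.published < end]


def precedes(a, b):
    """A simple temporal relationship between two stream elements."""
    return a.published < b.published


item1 = FeedItem(
    content=ET.fromstring("<item><title>First post</title></item>"),
    published=datetime(2007, 3, 5, 10, 0),
    metadata=[Triple("urn:example:item1", "dc:creator", "Alice")],
    annotates="http://example.org/page1",
)
item2 = FeedItem(
    content=ET.fromstring("<item><title>Second post</title></item>"),
    published=datetime(2007, 3, 5, 12, 0),
)
stream = [item1, item2]
morning = window(stream, datetime(2007, 3, 5, 9, 0), datetime(2007, 3, 5, 11, 0))
```

A list of timestamped items is the simplest possible stand-in for a stream. A real model would also have to define ordering, updates and deletion of items.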

The starting point will be existing work on XML algebras [ZPR02, JLS+01, FFM+00], RDF algebras [FHVB02] and RDF query languages [KAC+02], from which we will define a new algebra that also takes into account the temporal properties and relationships of RSS metadata streams. This algebra will be the basis for the definition of an XML-RSS query language, as an extension of XQuery.

The second goal is the definition of a framework for creating applications based on the XML-RSS data model. For instance, we consider applications that produce new RSS feeds and XML data by transforming input RSS feeds and XML data. In this context we consider two main issues:

  • The definition of a declarative, high-level model for defining an application as a view over its inputs (feeds and data). This model will be inspired by declarative models for both XML view definition ([VCC+06, PPV05]) and dynamic data ([AAC+99, MAA+05]), in the specific context of RSS feeds. A set of algorithms will be defined to translate a declarative view specification into an XML-RSS algebraic plan or query.
  • The implementation of this application building framework, based on the above models and algorithms.
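The idea of translating a declarative view into an algebraic plan can be illustrated with a toy example. The view syntax (a dictionary with `from`, `contains` and `order_by` keys), the operators `Source`, `Union`, `Select` and `SortBy`, and the functions `compile_view` and `evaluate` are all hypothetical. They stand in for the XML-RSS algebra and view model that the thesis would actually have to design.

```python
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical algebra operators; each one is a node of the plan tree.
@dataclass
class Source:
    name: str


@dataclass
class Union:
    inputs: List[object]


@dataclass
class Select:
    predicate: Callable[[dict], bool]
    child: object


@dataclass
class SortBy:
    key: str
    child: object


def compile_view(view):
    """Translate a declarative view specification into an operator tree."""
    plan = Union([Source(name) for name in view["from"]])
    if "contains" in view:
        keyword = view["contains"].lower()
        plan = Select(lambda item: keyword in item["title"].lower(), plan)
    if "order_by" in view:
        plan = SortBy(view["order_by"], plan)
    return plan


def evaluate(plan, feeds):
    """Evaluate a plan bottom-up over a dictionary of named input feeds."""
    if isinstance(plan, Source):
        return list(feeds[plan.name])
    if isinstance(plan, Union):
        return [item for child in plan.inputs for item in evaluate(child, feeds)]
    if isinstance(plan, Select):
        return [item for item in evaluate(plan.child, feeds) if plan.predicate(item)]
    if isinstance(plan, SortBy):
        return sorted(evaluate(plan.child, feeds), key=lambda item: item[plan.key])
    raise TypeError(f"unknown operator: {plan!r}")


# Made-up input feeds, reduced to dictionaries for brevity.
feeds = {
    "a": [{"title": "XML views", "published": 2}, {"title": "Weather", "published": 3}],
    "b": [{"title": "RSS and XML", "published": 1}],
}
view = {"from": ["a", "b"], "contains": "xml", "order_by": "published"}
plan = compile_view(view)
output = [item["title"] for item in evaluate(plan, feeds)]
```

Separating `compile_view` from `evaluate` mirrors the split between the declarative model and the algebra. A plan built this way could in principle be rewritten or optimized before it is run.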

References

[AAC+99] S. Abiteboul, B. Amann, S. Cluet, A. Eyal, L. Mignet, and T. Milo. Active views for electronic commerce. VLDB 1999.

[FFM+00] P. Fankhauser, M. Fernández, A. Malhotra, M. Rys, J. Siméon, P. Wadler. The XML Query Algebra. W3C Working Draft, 04 December 2000.

[FHVB02] F. Frasincar, G. Houben, R. Vdovjak, P. Barna, RAL: an Algebra for Querying RDF, WISE’02

[JLS+01] H. V. Jagadish, Laks V. S. Lakshmanan, Divesh Srivastava, et al. TAX: A Tree Algebra for XML. DBPL'01.

[KAC+02] Gregory Karvounarakis, Sofia Alexaki, Vassilis Christophides, Dimitris Plexousakis, Michel Scholl: RQL: a declarative query language for RDF. WWW 2002: 592-603

[MAA+05] T. Milo, S. Abiteboul, B. Amann, O. Benjelloun, and F. Dang Ngoc. Exchanging intensional XML data. In SIGMOD, 2003. An extended version of this article has been published in ACM Transactions on Database Systems 30(1).

[PPV05] M. Petropoulos, Y. Papakonstantinou, V. Vassalos. Graphical query interfaces for semistructured data: the QURSED system. ACM Transactions on Internet Technology (TOIT), v.5 n.2, p.390-438, May 2005.

[VCC+06] D. Vodislav, S. Cluet, G. Corona and I. Sebei. Views for simplifying access to heterogeneous XML data. In CoopIS, pp. 72-90, Springer, 2006.

[Yah07] Yahoo! pipes, http://pipes.yahoo.com

[ZPR02] X. Zhang, B. Pielech, E.A. Rundensteiner. Honey, I shrunk the XQuery!: an XML algebra optimization approach. WIDM 2002.

roses/sujet_these.txt · Last modified: 30/03/2015 15:31 (external edit)