Bases de Données - Databases

Website of the LIP6 Database (BD) team



Web 2.0 technologies have transformed the Web from a publishing-only environment into a vibrant information place where yesterday's end users have become today's content generators. Web syndication formats such as RSS and Atom have emerged as a popular means for the timely delivery of frequently updated Web content. Information publishers provide brief summaries of the content they deliver on the Web, called information items, while information consumers subscribe to a number of RSS/Atom feeds (i.e., streams) and are informed about newly published items. Almost every personal weblog, news portal, or discussion forum now employs RSS/Atom feeds to complement traditional pull-oriented searching and browsing of web pages with push-oriented delivery of web content. Note also that social media applications such as Twitter and Facebook rely on RSS to notify users about newly available posts from their preferred friends (or followees).

Unfortunately, preliminary work on the statistical characteristics of RSS/Atom does not provide a precise and up-to-date characterization of feeds' behavior and content that could be used effectively for tuning the refresh policies of RSS aggregators, benchmarking the scalability and performance of continuous RSS monitoring and filtering mechanisms, or evaluating various RSS item mining, recommendation, enrichment and archiving techniques.

We have extracted statistics and presented the first large-scale analysis of three complementary RSS/Atom parameters: (a) feeds' publishing activity; (b) items' structure and length; (c) the vocabularies employed by their textual content. Our empirical study relies on a testbed that keeps growing on the website. The original snapshot, acquired over an 8-month period, contains 10,794,285 items belonging to 8,155 productive feeds (out of the 12,611 harvested ones), and it is available online. The main conclusions drawn from our experiments are:

  1. A small share of RSS/Atom feeds (17%) produces almost all of the items (97%) in our testbed. Most productive feeds (i.e., those with more than 1 item per hour) exhibit regular behavior without publishing bursts. As expected, micro-blogging feeds originating from social media are more productive than those from personal blogs, while press sources lie in between. No major variation in feed activity was observed during the 8-month period of our study. The aggregated publishing rate among all feeds of our testbed was measured at 3.59 items per day.
  2. The most popular RSS/Atom structural elements are title and description, and the average item length is 52 terms. RSS/Atom items are thus longer than average advertisement bids (4-5 terms) or tweets (15 terms at most), but shorter than the original blog posts (200-250 terms) or Web pages (450-500 terms excluding tags). In addition, we did not observe any significant re-publishing of items across different feeds: only 0.41% of items were identified as duplicates among feeds hosted on different sites.
  3. Unlike previous studies, we observed that the language employed by RSS/Atom textual elements is subject to many kinds of imperfections and errors, such as the use of special-purpose terminology, person and place names, URLs and email addresses, as well as typos and mistakes. We measured a total of 1,537,730 terms, of which only a small fraction (around 4%) is found in the WordNet dictionary. In this respect, we provide a formal characterization of vocabulary growth using Heaps' law, and of term occurrences using a stretched exponential distribution, reported for the first time in the literature. Finally, we studied the temporal variation of term rankings in the vocabulary. Surprisingly enough, the average displacement per term rank follows a Gaussian distribution: the ranks of the most and least frequent terms remain almost stable during the measured period.
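
To illustrate the vocabulary-growth characterization mentioned in the third conclusion, here is a minimal sketch of fitting Heaps' law in its usual form, V(n) = K · n^β, where V(n) is the number of distinct terms seen after n term occurrences. The function names `vocabulary_growth` and `fit_heaps` are hypothetical, and the log-log least-squares fit is an assumption made for illustration. This page does not state the fitting procedure or the fitted parameters used in the study.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: number of distinct terms after the first n tokens."""
    seen = set()
    curve = []
    for n, term in enumerate(tokens, start=1):
        seen.add(term)
        curve.append((n, len(seen)))
    return curve


def fit_heaps(curve):
    """Fit V(n) = K * n**beta by ordinary least squares on log-log values.

    Assumption: a plain log-log regression, not necessarily the method
    used to obtain the results reported on this page.
    Returns the pair (K, beta).
    """
    xs = [math.log(n) for n, _ in curve]
    ys = [math.log(v) for _, v in curve]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

In practice, `tokens` would be the stream of terms extracted from item titles and descriptions in publication order. A β well below 1 indicates that new terms appear more and more slowly as the collection grows.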

This page gives an excerpt of that information. The statistics website provides all the material behind these conclusions, organized into the acquired data, the schema of the acquisition database, and the generation scripts.

The following figure describes, for each feed, the variation of the publication rate over time (measured by the Gini coefficient), which could help predict feed behaviour when acquiring multiple feeds:

Each point corresponds to a feed (publication rate vs. Gini coefficient). We can distinguish three kinds of behaviour: slow (less than 1 item per day), moderate (between 1 and 10 items per day), and productive (more than 10 items per day). These correspond to different types of variation in productivity.
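
As a rough sketch of how the two axes of such a figure could be computed for one feed, the code below derives a Gini coefficient from a list of per-day item counts and assigns the feed to one of the three activity classes. The names `gini` and `activity_class` are hypothetical. The daily binning and the handling of the exact boundaries (rates of exactly 1 or 10 items per day are counted as moderate) are assumptions, since this page does not specify them.

```python
def gini(counts):
    """Gini coefficient of non-negative per-period item counts.

    0.0 means perfectly regular publishing (same count every period);
    values close to 1 mean items are concentrated in a few bursts.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-based formula on the ascending-sorted values.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def activity_class(daily_counts):
    """Classify a feed by its average number of items per day.

    Assumption: boundary rates of exactly 1 or 10 items/day are 'moderate'.
    """
    rate = sum(daily_counts) / len(daily_counts)
    if rate < 1:
        return "slow"
    if rate <= 10:
        return "moderate"
    return "productive"
```

For example, a feed publishing 5 items every day has a Gini coefficient of 0 and is classed as moderate, while a feed publishing 20 items on one day out of four has a coefficient of 0.75.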

roses/resultats.txt · Last modified: 2015/03/30 15:31 (external edit)