Web 2.0 technologies have transformed the Web from a publishing-only environment into a vibrant information place where yesterday's end users have become today's content generators. Web syndication formats such as RSS and Atom have emerged as a popular means for timely delivery of frequently updated Web content. Information publishers provide brief summaries of the content they deliver on the Web, called information items, while information consumers subscribe to a number of RSS/Atom feeds (i.e., streams) and are notified of newly published items. Almost every personal weblog, news portal, or discussion forum now employs RSS/Atom feeds to complement traditional pull-oriented searching and browsing of web pages with push-oriented delivery of web content. Note also that social media applications such as Twitter and Facebook rely on RSS to notify users about newly available posts from their preferred friends (or followees). Unfortunately, preliminary work on the statistical characteristics of RSS/Atom does not provide a precise and up-to-date characterization of feeds' behavior and content that could be effectively used for tuning the refresh policies of RSS aggregators, benchmarking the scalability and performance of RSS continuous monitoring and filtering mechanisms, or evaluating various RSS item mining, recommendation, enrichment, and archiving techniques. We have extracted statistics and present the first large-scale analysis of three complementary RSS/Atom parameters: (a) feeds' publishing activity; (b) items' structure and length; (c) the vocabularies employed by their textual content. Our empirical study relies on a testbed that keeps growing on the website; the original dataset, acquired over an 8-month period, contains 10,794,285 items belonging to 8,155 productive feeds (out of the 12,611 harvested ones) and is available online. The main conclusions drawn from our experiments are:
Here is an extract of this information. The statistics website provides all the information that led to those conclusions, organized into three parts: the acquired data, the acquisition database schema, and the generation scripts.
Here is a figure that describes the variation of each feed's publication rate over time (measured by the Gini coefficient), which could help predict feed behaviour when acquiring multiple feeds:
Each plotted point corresponds to a feed (publication rate vs. Gini coefficient). We can distinguish three kinds of behaviour: slow (fewer than 1 item per day), moderate (between 1 and 10 items per day), and productive (more than 10 items per day). These classes correspond to different types of variation in productivity.
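To make the two axes of the figure concrete, here is a minimal Python sketch of how a feed's Gini coefficient and activity class could be computed. It assumes each feed is represented by a list of daily item counts; that representation, the function names `gini` and `activity_class`, and the choice of the standard sorted-values Gini formula are illustrative assumptions, not the scripts used for the study. Only the three rate thresholds come from the text above.

```python
def gini(counts):
    """Gini coefficient of non-negative values.

    0.0 means items are spread evenly across days; values close to 1.0
    mean publication is concentrated in a few days (bursty feed).
    Assumption: `counts` holds the number of items published per day.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # no data or no items: treat as perfectly even
    # Standard formula on ascending values x_1..x_n:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def activity_class(counts):
    """Classify a feed by its average publication rate (items per day).

    Thresholds follow the text: slow (< 1), moderate (1 to 10),
    productive (> 10). Boundary handling at exactly 1 and 10 is an
    assumption.
    """
    if not counts:
        raise ValueError("at least one day of observations is required")
    rate = sum(counts) / len(counts)
    if rate < 1:
        return "slow"
    if rate <= 10:
        return "moderate"
    return "productive"


if __name__ == "__main__":
    # Hypothetical feed: silent for three days, then a burst of 20 items.
    daily_counts = [0, 0, 0, 20]
    print(activity_class(daily_counts), gini(daily_counts))
```

A feed that publishes the same number of items every day yields a coefficient of 0, while a feed whose items all arrive on a single day out of n yields (n - 1) / n, so the coefficient captures burstiness independently of the average rate.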