The web produces continuous streams of text items published as RSS news, tweets, blog posts, etc. Users can subscribe to these streams by defining queries which continuously filter and rank the most recent information items. A major challenge is then to efficiently process millions of such subscription queries over high-rate input streams. In the context of the ANR project ROSES (2008-2012) on RSS feed aggregation and filtering, we worked on multi-query optimisation (PhD of J. Creus), on efficient refresh strategies for dynamic RSS feeds (PhD of R. Horincar, in collaboration with the MLIA team), and on continuous top-k query processing (PhD of N. Vouzoukidou, in collaboration with ICS-FORTH, Crete).
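To make the processing model concrete, the following toy Python sketch (an illustration with hypothetical matching and scoring functions, not the ROSES algorithms) maintains a small top-k heap per subscription query and updates it as stream items arrive:

```python
import heapq

class Subscription:
    def __init__(self, keywords, k=3):
        self.keywords = set(keywords)
        self.k = k
        self.topk = []                      # min-heap of (score, item)

    def matches(self, item_terms):
        return bool(self.keywords & item_terms)

    def score(self, item_terms):
        # toy relevance score: fraction of query keywords covered by the item
        return len(self.keywords & item_terms) / len(self.keywords)

    def update(self, item, item_terms):
        s = self.score(item_terms)
        if len(self.topk) < self.k:
            heapq.heappush(self.topk, (s, item))
        elif s > self.topk[0][0]:
            heapq.heapreplace(self.topk, (s, item))

def process_stream(stream, subscriptions):
    # naive per-item loop over all subscriptions
    for item in stream:
        terms = set(item.lower().split())
        for sub in subscriptions:
            if sub.matches(terms):
                sub.update(item, terms)

subs = [Subscription(["rss", "aggregation"]), Subscription(["ranking", "stream"])]
process_stream(["RSS aggregation made simple", "Top-k ranking over a stream"], subs)
for sub in subs:
    print(sorted(sub.topk, reverse=True))
```

The multi-query optimisation problem mentioned above consists precisely in sharing work between the many subscription queries instead of evaluating them one by one, as this naive sketch does.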
The Web is continuously changing, and building representative archives for the future raises many challenging data processing problems, including efficient harvesting, storage and indexing of Web resources. Our work on these challenges started within the ANR Cartec project (before 2012) and was pursued in the context of the European research project SCAPE (2011-2014) and an ongoing collaboration with the Central University of Venezuela (UCV), Caracas. The originality of our contributions lies in a Web page segmentation and ranking model which identifies and ranks semantically coherent blocks within a web page. Based on this model, highly-ranked blocks can be given priority by a page refresh strategy or considered more relevant within an information retrieval task. Three PhD theses took place in this context, on (1) semantic web page refresh strategies for web archives, (2) Web archive querying and text indexing (pruning), and (3) Web page segmentation and migration to the HTML5 format (collaboration with UCV). A collaboration with the MLIA team (PhD thesis of M. Law) allowed us to explore advanced machine learning and image analysis techniques to improve the accuracy of our page change estimation model.
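As a rough illustration of how block ranks can drive a refresh strategy (the names and scoring below are hypothetical, not the model of the theses), a page can be scheduled for re-crawling according to the importance-weighted change observed in its blocks:

```python
def page_refresh_priority(blocks):
    """blocks: list of dicts with a block 'rank' (importance) and an observed
    'change_rate' estimated from previous crawls."""
    total_rank = sum(b["rank"] for b in blocks) or 1.0
    return sum(b["rank"] * b["change_rate"] for b in blocks) / total_rank

page = [
    {"id": "main-article", "rank": 0.9, "change_rate": 0.8},
    {"id": "navigation",   "rank": 0.2, "change_rate": 0.1},
    {"id": "ads",          "rank": 0.1, "change_rate": 0.9},
]
# pages whose important blocks change often are refreshed first
print(page_refresh_priority(page))
```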
Distributed transactions in large data clusters generate a high control and synchronization overhead which is a major obstacle to scalability. To reduce this overhead, we focus on user-centric applications where (1) the data fragment attached to each user defines the basic access unit, (2) transactions mostly access the data of two users (message exchange), and (3) the access frequency (popularity) is biased and fluctuates over time. To achieve optimal performance, we propose to move user data to a single node where the transaction can be executed locally. Then, under the assumption that users interact within social circles, we detect data groups (or bundles) and adapt data placement and cluster resources gradually according to their interactions. We have studied a related problem in the context of the GBIF (Global Biodiversity Information Facility) community and designed advanced querying facilities for world-wide federations of autonomous databases storing descriptions of observed natural species. To overcome the limited query capabilities of GBIF database interfaces, we proposed a distributed middleware providing higher-level structured access to the GBIF data. Our main contributions are a decentralized algorithm for parallel query processing and a cost-based placement and replication strategy keeping node usage (storage and query processing) below a given upper bound. This work is part of two PhD theses (I. Gueye and N. Bame) in collaboration with the University Cheikh Anta Diop (UCAD) in Senegal.
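A minimal sketch of the data placement idea, under the simplifying assumption that user bundles and their loads are already known (the bundle detection and cost model of the theses are not shown), greedily assigns each bundle to the least-loaded node within a capacity bound, so that transactions between users of the same bundle execute locally:

```python
def place_bundles(bundles, nodes, capacity):
    """bundles: {bundle_id: load}; nodes: list of node ids; capacity: max load per node."""
    load = {n: 0.0 for n in nodes}
    placement = {}
    # place the heaviest bundles first, each on the currently least-loaded node
    for bundle_id, bundle_load in sorted(bundles.items(), key=lambda kv: -kv[1]):
        node = min(nodes, key=lambda n: load[n])
        if load[node] + bundle_load > capacity:
            raise RuntimeError("no node can host bundle %s" % bundle_id)
        placement[bundle_id] = node
        load[node] += bundle_load
    return placement

print(place_bundles({"b1": 0.5, "b2": 0.3, "b3": 0.4}, ["n1", "n2"], capacity=1.0))
```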
RDF has become a standard format for publishing and integrating data and knowledge on the web. The resulting Linked Open Data cloud is continuously growing, and implementing efficient SPARQL query processors is a challenging problem. To achieve scalability, we propose to exploit the MapReduce data parallelism implemented in the Apache Spark platform. For this, we defined a cost model for choosing between two standard distributed join operators (partitioned join and broadcast join) implemented on top of the different Spark data layers. Our experimental results show that hybrid join plans combining these two operators significantly improve performance.
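The following PySpark fragment sketches this plan choice; the row-count threshold and the use of `count()` as a size estimate are simplifying assumptions standing in for the actual cost model:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("hybrid-join-sketch").getOrCreate()

def join_with_cost_choice(left, right, key, broadcast_threshold_rows=1_000_000):
    # crude cardinality estimate; a real cost model would use statistics
    right_size = right.count()
    if right_size <= broadcast_threshold_rows:
        # broadcast join: the small relation is shipped to every worker, no shuffle of `left`
        return left.join(broadcast(right), key)
    # partitioned join: both sides are shuffled (repartitioned) on the join key
    return left.join(right, key)

triples = spark.createDataFrame([("s1", "knows", "s2")], ["s", "p", "o"])
labels  = spark.createDataFrame([("s2", "Alice")], ["s", "label"])
print(join_with_cost_choice(triples, labels, "s").collect())
```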
Recommendation methods aim to predict, for a given user, the rating or preference she would give to an “item”, which can be an action, a document, a product or even another user. These predictions can be derived from different kinds of input data, such as the past activities, social links and preferences of the user and of other “similar” users (collaborative filtering), and/or the explicit description of the items and the preferences of the users (content-based filtering).
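As a minimal, self-contained illustration of collaborative filtering (toy data and plain SGD, not one of the methods described below), latent user and item factors can be learned from the observed ratings and used to predict the missing ones:

```python
import numpy as np

# (user, item) -> observed rating; the goal is to predict the missing pairs
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
n_users, n_items, k = 3, 2, 2
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors

for _ in range(200):                            # plain SGD on squared error
    for (u, i), r in ratings.items():
        err = r - P[u] @ Q[i]
        P[u], Q[i] = P[u] + 0.05 * err * Q[i], Q[i] + 0.05 * err * P[u]

print(P[2] @ Q[0])   # predicted rating of user 2 for the unseen item 0
```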
We consider the problem of discovering valuable content publishers in micro-blogging systems by providing efficient, topological and contextual user recommendations. Building on the idea that topology-based measures are good indicators for estimating user similarity, we propose personalized user recommendation scores which capture both the topological proximity and connectivity of publishers along with their topic authorities. The size of the underlying social graphs strongly influences score computation costs, especially with graph exploration operators. In order to speed up the recommendation process, we propose approximate algorithms based on landmarks that are selected according to several strategies. Within this context, we also developed a novel block-based, workload-aware edge partitioning strategy for distributed user similarity computation. Our method relies on a block approach for capturing local communities and takes advantage of the topological properties of real-world large graphs to provide a balanced edge distribution and reduce the communication costs of random-walk-based computations. This work was done in collaboration with the Cédric-CNAM laboratory (PhDs of R. Dahimene, Q. Grossetti and Y. Li) and was partially supported by the PEPS INS2I/INSMI 2015 FaSciDo “FAIT” (Finding Account of Interest on Twitter).
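The landmark idea can be sketched as follows (landmark selection by degree and the triangle-inequality bound are simplifications of the strategies studied in the theses): distances from a few landmark nodes are precomputed, and the proximity of two users is then approximated through the landmarks instead of exploring the whole graph at query time.

```python
import networkx as nx

G = nx.karate_club_graph()
# pick a few high-degree nodes as landmarks and precompute distances from them
landmarks = sorted(G.nodes, key=G.degree, reverse=True)[:3]
dist_from = {l: nx.single_source_shortest_path_length(G, l) for l in landmarks}

def approx_distance(u, v):
    # triangle-inequality upper bound: d(u,v) <= min over landmarks of d(u,l) + d(l,v)
    return min(dist_from[l][u] + dist_from[l][v] for l in landmarks)

print(approx_distance(0, 33), nx.shortest_path_length(G, 0, 33))
```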
In this work, we consider matrix factorization algorithms in highly dynamic contexts with continuous rating streams, where it is difficult to keep the recommendation model up to date even with powerful parallelization techniques. Our solution reduces the lag-induced loss of recommendation quality by introducing biases which track deviations in user behavior. These biases are continuously updated with the new ratings in order to maintain the quality of recommendations at a high level for a longer time. We also propose an algorithm that takes into account the popularity of the tags and the opinions of the user’s neighborhood. Unlike common nearest-neighbors approaches, which rely on a fixed number of neighbors, we propose a heuristic network traversal bound that enables on-the-fly recommendation computation with limited cost. We also propose a new method for improving the accuracy of top-k recommendation algorithms which dynamically estimates and adjusts candidate item lists to optimize the global recommendation accuracy. Finally, we propose a new geographical POI (Point of Interest) recommendation method which takes the delay and distance between POIs into account, and thereby captures sets of POIs worth visiting together better than existing methods based on matrix factorization. This work is done in collaboration with LTCI-Télécom ParisTech (PhDs of M. Gueye and J.-B. Griesner).
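A minimal sketch of the bias-tracking idea (the update rule below is a hypothetical simplification, not the exact algorithm of the thesis): between two expensive re-factorizations, incoming ratings only adjust user and item biases on top of the stale latent factors.

```python
import numpy as np

class StreamingRecommender:
    def __init__(self, P, Q, lr=0.02):
        self.P, self.Q = P, Q                 # latent factors from the last batch training
        self.bu = np.zeros(P.shape[0])        # user biases, updated continuously
        self.bi = np.zeros(Q.shape[0])        # item biases, updated continuously
        self.lr = lr

    def predict(self, u, i):
        return self.P[u] @ self.Q[i] + self.bu[u] + self.bi[i]

    def on_new_rating(self, u, i, r):
        # cheap incremental correction that tracks the deviation from the stale model
        err = r - self.predict(u, i)
        self.bu[u] += self.lr * err
        self.bi[i] += self.lr * err

rng = np.random.default_rng(1)
rec = StreamingRecommender(rng.normal(size=(3, 2)), rng.normal(size=(4, 2)))
rec.on_new_rating(0, 1, 4.5)
print(rec.predict(0, 1))
```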
In the context of the ARESOS CNRS Mastodons project, we started to explore the use of MapReduce-based algorithms and platforms (Hadoop and Spark) for complex text and graph mining workflows. In collaboration with the Institut des Systèmes Complexes and IRISA (Université de Rennes), the goal is to design and implement scalable algorithms for building phylomemetic networks describing the evolution of science in the form of temporal topic graphs. Scalability is achieved by combining data parallelism and incremental computation for topic extraction and topic alignment. After some preliminary results, this work is being continued in the ANR project EPIQUE, which started in January 2017 under our coordination.
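A toy sketch of the topic alignment step (the similarity measure and threshold are assumptions made for illustration): topics extracted from consecutive time windows are linked when their term sets overlap sufficiently, producing the edges of the temporal topic graph.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def align(topics_t, topics_t1, threshold=0.3):
    # link each topic of window t to the topics of window t+1 it likely evolved into
    return [(i, j, round(jaccard(a, b), 2))
            for i, a in enumerate(topics_t)
            for j, b in enumerate(topics_t1)
            if jaccard(a, b) >= threshold]

t0 = [{"neural", "network", "learning"}, {"database", "query"}]
t1 = [{"deep", "neural", "learning"}, {"query", "optimization", "database"}]
print(align(t0, t1))   # edges of the phylomemetic (topic-evolution) graph
```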
The topic of machine learning and information access is mainly represented by B. Piwowarski, who enriched the team's competencies for better understanding data science related problems and applications, and fosters the collaboration with the MLIA team around the theme of learning to project complex objects (e.g. text or graphs) into a continuous latent space for information access tasks: node classification, information retrieval, or word evolution. The quantum probability formalism, which is a way to represent objects in a continuous space, has also been exploited for summarization.
In January 2014, we started a new activity on smart IoT applications as part of the two-year exploratory project EBITA. The goal of EBITA was to explore business and research opportunities between Fraunhofer (Germany) and UPMC concerning semantic data technologies developed in the DAPA department and the IOSB Fraunhofer Institute in Karlsruhe. During the project (01/2015-12/2016), the team temporarily hosted the project leader (V. Tippmann) and two research engineers. EBITA also allowed us to finance an ongoing PhD on semantic IoT data enrichment and integration (F. Hannou). Taking the data generated by thousands of sensors on the Jussieu campus as a starting point, the goal is to develop new data preparation and integration methods for combining raw sensor data with contextual information (room occupation, energy-consuming devices, outside temperature, …) to enable higher-level semantic data exploration and analysis.
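A hypothetical example of such an enrichment step (the schema and data are invented for illustration): raw energy readings are joined with room occupation context, so that situations such as consumption in unoccupied rooms become directly queryable.

```python
import pandas as pd

readings = pd.DataFrame({
    "room": ["26-00/101", "26-00/101", "25-26/105"],
    "hour": [8, 14, 14],
    "kwh":  [1.2, 3.4, 0.4],
})
context = pd.DataFrame({
    "room": ["26-00/101", "25-26/105"],
    "occupied_hours": [[9, 10, 11], [14, 15]],
})

# enrich each raw reading with the occupation context of its room
enriched = readings.merge(context, on="room")
enriched["occupied"] = enriched.apply(lambda r: r["hour"] in r["occupied_hours"], axis=1)
print(enriched[["room", "hour", "kwh", "occupied"]])
```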
Data formats like XML or JSON are more and more widely adopted because of their flexibility for encoding semi-structured data. Compared to the relational model, where data must strictly conform to a predefined prescriptive schema, JSON imposes no a priori schema, simply because the structure of the data is unknown in advance or might change later. However, the absence of a schema also complicates the interaction with the data. First, formulating sound queries without a concise description of the underlying data is more difficult. The lack of a schema also prevents query processors from performing static analysis to check, for example, whether a query uses nonexistent attributes. Finally, it prevents accelerating query processing with well-known optimization techniques like wildcard expansion and data projection. We introduce a MapReduce-based inference algorithm for extracting descriptive schemas from large JSON datasets. The extracted schemas concisely and precisely capture the structure of the input data set. This work is done in collaboration with researchers from Paris-Dauphine University, the University of Pisa and the University of Basilicata, Italy.
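The following toy Python sketch illustrates the map/reduce structure of such an inference (the real type language and merge rules are more precise): each record is mapped to a structural type skeleton, and skeletons are then merged pairwise, recording unions and missing attributes.

```python
from functools import reduce

def infer_type(value):
    # "map" phase: turn a JSON value into a structural type skeleton
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        return [merge_all(infer_type(v) for v in value)] if value else ["Any"]
    return type(value).__name__

def merge(a, b):
    # "reduce" phase: merge two skeletons, recording optional attributes and type unions
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge(a.get(k, "missing"), b.get(k, "missing"))
                for k in set(a) | set(b)}
    return a if a == b else f"({a} | {b})"

def merge_all(types):
    return reduce(merge, types)

records = [{"name": "a", "tags": ["x"]}, {"name": "b", "age": 3}]
print(merge_all(map(infer_type, records)))
```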