Bases de Données - Databases

Website of the LIP6 Database team


Topic Extraction and Alignment for Large Scientific Document Collections

PhD Proposal at LIP6-UPMC, Paris


This thesis is financed by the EPIQUE ANR project and takes place in the Database research team of the LIP6 laboratory in Paris. The goal is to develop new tools for exploring large scientific document collections (Web of Science, Medline, …) and for building interactive topic evolution maps (or “phylomemies” [1]) that represent the evolution of science. These tools are based on efficient algorithms and data structures implemented on top of recent big-data infrastructures like Apache Spark.

A topic evolution map represents the evolution of science as a set of topics over a sequence of time periods, where topics from different periods can be aligned through specific evolution links. For example, data-related research topics have evolved rapidly over the last 25 years: new research topics have appeared (NoSQL, Big Data, MapReduce Data Processing, Data Science, Deep Learning), often by replacing, splitting or combining previous research topics (semi-structured data, parallel DBMS, machine learning, neural networks). Building such maps is a complex task involving a variety of data processing steps (see more details in the EPIQUE project description).

This thesis mainly deals with two steps of the EPIQUE workflow:

  • Topic extraction step: The first step is to extract semantic topic structures from large, complex real-world document collections in different application domains (science, social web, news). A large spectrum of topic extraction models and algorithms already exists, based on graph clustering, matrix factorization, probabilistic models (LDA) and other techniques. However, existing topic models and algorithms do not scale well to collections of this size, and a first challenge will be to define and adapt scalable data mining solutions based on new data structures and recent parallel data processing frameworks [2,3,4,5].
  • Topic alignment step: The second step consists of exploring the evolution of science by aligning semantic topic structures from different time periods. This alignment is based on a topic evolution model representing different semantic evolution steps (birth, split, join, death, …) for topics from different time periods [1]. The goal is to propose a formal topic evolution model based on existing work on scientific evolution and to implement efficient algorithms for the temporal alignment of the semantic topic structures generated by step 1.
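
To make the alignment step more concrete, here is a minimal single-machine sketch in Python. It is only an illustration under strong assumptions, not the model or the algorithms the thesis is expected to produce: each topic is assumed to be a plain set of terms, two topics from consecutive periods are linked when their Jaccard similarity reaches an arbitrary threshold (0.3), and the evolution events are read off the number of links per topic. The function names, the topic identifiers and the term sets are all hypothetical toy values, and the sketch does not address scalability.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two term sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(prev_topics, next_topics, threshold=0.3):
    """Link topics of two consecutive periods whose similarity reaches the threshold."""
    links = []
    for prev_id, prev_terms in prev_topics.items():
        for next_id, next_terms in next_topics.items():
            if jaccard(prev_terms, next_terms) >= threshold:
                links.append((prev_id, next_id))
    return links


def classify(prev_topics, next_topics, links):
    """Derive evolution events from the links (topic ids assumed unique across periods)."""
    successors = defaultdict(list)
    predecessors = defaultdict(list)
    for prev_id, next_id in links:
        successors[prev_id].append(next_id)
        predecessors[next_id].append(prev_id)

    events = {}
    for prev_id in prev_topics:
        if len(successors[prev_id]) == 0:
            events[prev_id] = "death"   # no topic continues it
        elif len(successors[prev_id]) > 1:
            events[prev_id] = "split"   # continued by several topics
    for next_id in next_topics:
        if len(predecessors[next_id]) == 0:
            events[next_id] = "birth"   # no earlier topic leads to it
        elif len(predecessors[next_id]) > 1:
            events[next_id] = "join"    # several earlier topics merge into it
    return events


# Toy term sets, invented for illustration only.
PREV = {
    "p1_parallel_dbms": {"parallel", "query", "dbms", "partitioning"},
    "p1_machine_learning": {"learning", "classification", "training"},
    "p1_neural_networks": {"neural", "network", "training"},
    "p1_semi_structured": {"xml", "semi-structured", "schema"},
}
NEXT = {
    "p2_mapreduce": {"parallel", "partitioning", "mapreduce"},
    "p2_nosql": {"query", "dbms", "nosql"},
    "p2_deep_learning": {"learning", "training", "neural", "network", "deep"},
    "p2_data_science": {"data", "science", "statistics"},
}

if __name__ == "__main__":
    links = align(PREV, NEXT)
    for topic, event in classify(PREV, NEXT, links).items():
        print(topic, event)
```

The nested loop compares every pair of topics, which is quadratic in the number of topics per period. A formal evolution model would also need to cover more than two periods and topics that simply continue with a single link, which this sketch leaves unlabeled.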


The first outcome of this thesis will be innovative tools for the reconstruction and exploration of multi-scale dynamics in complete real-world scientific corpora and for obtaining new insights into the evolution of complex human-generated knowledge and information. The second outcome will be new large-scale data processing solutions for implementing advanced text and graph mining algorithms. In particular, our goal is to provide generic low-level solutions which can be customized, independently of the higher-level mining algorithms, with respect to specific cost models and hardware constraints (memory, CPU, network).

START DATE : February 1st 2018

DURATION : 36 months

LOCATION : LIP6-UPMC, Paris, France

SALARY : about 1700 euros (gross per month)



Applicants must:

  • hold a Master's degree in Computer Science
  • have strong analytical and programming skills (Java, Scala, Python)
  • have a high capacity to understand new concepts and to work independently
  • have good expertise in database-related topics (distributed databases, query optimisation, big data platforms)
  • have excellent written and oral communication skills in English (French is a plus)

Applicants must send the following documents to all contacts listed below:

  • an application letter in English or French
  • their CV
  • their university grade transcripts for the last two years
  • a copy of their most recent diploma
  • recommendation letters (optional)

DEADLINE : December 15th 2017


  • Bernd Amann:
  • Hubert Naacke:


The LIP6 Laboratory of Computer Sciences, Paris, with a staff of 470 people including 170 permanent researchers, 250 PhD students, postdocs, engineers and administrative employees, is today one of the most important centers of Computer Science in France. LIP6 is part of the Université Pierre et Marie Curie and, as a CNRS research unit (UMR 7606), is also linked to the INS2I (Institut des sciences de l'information et de leurs interactions). The LIP6 laboratory is composed of 20 research teams structured into 7 departments which cover a wide spectrum of computer science domains: scientific computing, decision making, optimization problems in artificial intelligence and operational research, databases and machine learning, networks and systems, systems on chips, complex systems.


[1] Chavalarias D, Cointet J-P (2013) Phylomemetic Patterns in Science Evolution-The Rise and Fall of Scientific Fields. PLoS ONE 8(2): e54847. doi:10.1371/journal.pone.0054847

[2] A. Rajaraman, J. Leskovec, J. D. Ullman. Mining of Massive Datasets, 2013

[3] Kyuseok Shim. MapReduce Algorithms for Big Data Analysis, Tutorial (VLDB12, SWCW13)

[4] S. Yang et al. Efficient Dense Structure Mining using MapReduce, IEEE International Conference on Data Mining Workshops, 2009

[5] S. Papadimitriou and J. Sun. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. In IEEE Intl Conf. on Data Mining (ICDM), 2008

[6] R. M. C. McCreadie, C. Macdonald, and L. Ounis. On single-pass indexing with MapReduce. In ACM Conf. on Research and development in information retrieval (SIGIR), 2009

[7] D. Fried, S. G. Kobourov: Maps of Computer Science. PacificVis 2014: 113-120

[8] M. I. Hossain, S. G. Kobourov: Research Topics Map: RTopMap, 2017

[9] J. D. Ullman, Designing good MapReduce algorithms, XRDS: Crossroads, The ACM Magazine for Students 19(1), 2012

site/offres/2018/theses/epique1.txt · Last modified: 2017/11/23 15:53 by amann