
Topic Extraction and Alignment for Large Scientific Document Collections

PhD Proposal at LIP6-UPMC, Paris

DESCRIPTION

This thesis is financed by the EPIQUE ANR project (http://www-bd.lip6.fr/wiki/site/recherche/projets/epique) and takes place in the Database research team (http://www-bd.lip6.fr) of the LIP6 laboratory (http://www.lip6.fr) in Paris. The goal is to develop new tools for exploring large scientific document collections (Web of Science, Medline, …) and for building interactive topic evolution maps (or “phylomemies” [1]) that represent the evolution of science. These tools are based on efficient algorithms and data structures implemented on top of recent big-data infrastructures like Apache Spark.

A topic evolution map represents the evolution of science as a set of topics over a sequence of time periods, where topics from different periods can be aligned through specific evolution links. For example, data-related research topics have evolved rapidly over the last 25 years: new research topics have appeared (NoSQL, Big Data, MapReduce data processing, Data Science, Deep Learning), often by replacing, splitting or combining previous research topics (semi-structured data, parallel DBMS, machine learning, neural networks). Building such maps is a complex task that involves a variety of data processing steps (see the EPIQUE project description for more details).
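The notion of a topic evolution map can be illustrated with a minimal sketch. It assumes that each topic is described by a set of terms and that two topics from consecutive periods are linked when the Jaccard similarity of their term sets reaches a threshold. The similarity measure, the threshold value, the names `Topic`, `jaccard` and `align`, and the toy term sets are all assumptions made for this illustration; the proposal does not prescribe them.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Topic:
    period: str        # label of the time period (hypothetical)
    label: str         # human-readable topic name
    terms: frozenset   # terms characterising the topic (toy data)


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(earlier, later, threshold=0.2):
    """Return (source, target, similarity) links between two consecutive periods.

    The threshold is an illustrative assumption, not a value from the project.
    """
    links = []
    for src, dst in product(earlier, later):
        sim = jaccard(src.terms, dst.terms)
        if sim >= threshold:
            links.append((src.label, dst.label, sim))
    return links


# Toy topics for two consecutive periods (invented for illustration only).
period_1 = [
    Topic("P1", "parallel DBMS",
          frozenset({"parallel", "query", "dbms", "partitioning"})),
    Topic("P1", "neural networks",
          frozenset({"neural", "network", "training", "backpropagation"})),
]
period_2 = [
    Topic("P2", "MapReduce data processing",
          frozenset({"mapreduce", "parallel", "partitioning", "cluster"})),
    Topic("P2", "deep learning",
          frozenset({"neural", "network", "training", "deep", "gpu"})),
]

links = align(period_1, period_2)
for src, dst, sim in links:
    print(f"{src} -> {dst} (similarity {sim:.2f})")
```

This sketch compares every pair of topics, which is quadratic in the number of topics per period. On collections of the size targeted here, the pairwise comparison would presumably need to be distributed or pruned, which is the kind of problem the large-scale processing part of the thesis addresses.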

This thesis mainly deals with two steps of the EPIQUE workflow:

EXPECTED RESULTS

The first outcome of this thesis will be new tools for reconstructing and exploring multi-scale dynamics in complete real-world scientific corpora and for obtaining new insights into the evolution of complex human-generated knowledge and information. The second outcome will be new large-scale data processing solutions for implementing advanced text and graph mining algorithms. In particular, our goal is to provide generic low-level solutions that can be customized, independently of the higher-level mining algorithms, with respect to specific cost models and hardware constraints (memory, CPU, network).

START DATE: February 1st, 2018

DURATION: 36 months

LOCATION: LIP6-UPMC, http://www.lip6.fr, Paris, France

SALARY: about 1700 euros gross per month

APPLICATION PROCEDURE

Applicants:

Applicants will have to send and attach:

DEADLINE: December 15th, 2017

CONTACTS

LIP6

The LIP6 Laboratory of Computer Science, Paris (http://www.lip6.fr), with a staff of 470 people including 170 permanent researchers, 250 PhD students, postdocs, engineers and administrative employees, is today one of the most important centers of computer science in France. LIP6 is part of the Université Pierre et Marie Curie and, as a department of CNRS (UMR 7606), it is also linked to the INS2I (Institut des sciences de l'information et de leurs interactions). The LIP6 laboratory is composed of 20 research teams structured into 7 departments, which cover a wide spectrum of computer science domains: scientific computing, decision making, optimization problems in artificial intelligence and operational research, databases and machine learning, networks and systems, systems on chips, and complex systems.

BIBLIOGRAPHY

[1] Chavalarias D, Cointet J-P (2013) Phylomemetic Patterns in Science Evolution-The Rise and Fall of Scientific Fields. PLoS ONE 8(2): e54847. doi:10.1371/journal.pone.0054847

[2] A. Rajaraman, J. Leskovec, J. D. Ullman. Mining of Massive Datasets, 2013, http://infolab.stanford.edu/~ullman/mmds/book.pdf

[3] Kyuseok Shim. MapReduce Algorithms for Big Data Analysis, Tutorial (VLDB12, SWCW13)

[4] S. Yang et al. Efficient Dense Structure Mining using MapReduce, IEEE International Conference on Data Mining Workshops, 2009

[5] S. Papadimitriou and J. Sun. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. In IEEE Intl Conf. on Data Mining (ICDM), 2008

[6] R. M. C. McCreadie, C. Macdonald, and L. Ounis. On single-pass indexing with MapReduce. In ACM Conf. on Research and development in information retrieval (SIGIR), 2009

[7] D. Fried, S. G. Kobourov: Maps of Computer Science. PacificVis 2014: 113-120, http://mocs.cs.arizona.edu, 2014

[8] M. I. Hossain, S. G. Kobourov: Research Topics Map: RTopMap. http://rtopmap.arl.arizona.edu, 2017

[9] J. D. Ullman. Designing good MapReduce algorithms, XRDS: Crossroads, The ACM Magazine for Students 19(1), 2012