Tutorials

Title: Blocking Techniques for Web-Scale Entity Resolution

George Papadakis

Institute for the Management of Information Systems - Athena Research Center, gpapadis@imis.athena-innovation.gr

Themis Palpanas

Paris Descartes University, themis@mi.parisdescartes.fr

Abstract

Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large body of research has focused on improving its efficiency so that it can be applied to Web Data collections, which are inherently voluminous and highly heterogeneous. The most common approach for this purpose is blocking, which clusters similar entities into blocks so that the pair-wise comparisons are restricted to the entities contained within each block.
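The blocking idea described above can be illustrated with a minimal sketch. It assumes a simple schema-agnostic scheme in which every token of every attribute value acts as a block key; the function names and the sample entities are hypothetical and are not taken from the tutorial material.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Place each entity in one block per token found in its attribute values."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block holding a single entity yields no comparison, so drop it.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical entities with heterogeneous attribute names.
entities = {
    "e1": {"name": "John Smith", "city": "Athens"},
    "e2": {"fullName": "J. Smith", "location": "Athens"},
    "e3": {"name": "Maria Rossi", "city": "Trento"},
    "e4": {"label": "Rossi Maria"},
}

blocks = token_blocking(entities)
pairs = candidate_pairs(blocks)
brute_force = len(entities) * (len(entities) - 1) // 2

print(f"{len(pairs)} comparisons after blocking, {brute_force} without")
```

Because block keys come from values rather than attribute names, the sketch does not depend on the entities sharing a schema. Only the pairs that share a block are compared.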

In this tutorial, we elaborate on blocking techniques, starting from the early, schema-based ones that were crafted for database integration. We highlight the challenges posed by today's heterogeneous, noisy, voluminous Web Data and explain why they render the early blocking methods inapplicable. We continue with the presentation of the latest blocking methods that are crafted for Web-scale data. We also explain how their efficiency can be improved by meta-blocking and parallelization techniques.

We conclude with a hands-on session that demonstrates the relative performance of several state-of-the-art techniques and enables the participants of the tutorial to put into practice all the topics discussed in the theory.

Bio

George Papadakis is a Postdoctoral Researcher at the Institute for the Management of Information Systems, Athena Research Center. Before that he worked as a researcher at NCSR "Demokritos", the L3S Research Center and the Institute of Communications and Computer Systems. He holds a Diploma in Computer Engineering from the National Technical University of Athens and a PhD from the Leibniz University of Hanover on "Blocking Techniques for efficient Entity Resolution over large, highly heterogeneous Information Spaces". In addition to entity resolution, his research focuses on web data mining and has received the best paper award at ACM Hypertext 2011.

Themis Palpanas is a professor of computer science at the Paris Descartes University, France. Before that he was a professor at the University of Trento, Italy, and he has worked as a researcher at the IBM T.J. Watson Research Center and the University of California at Riverside, and has been a visiting researcher at Microsoft Research, IBM Almaden Research Center, and the National University of Singapore. He is the author of eight US patents, three of which are part of commercial products. He has received three best paper awards and was General Chair for VLDB 2013. Professor Palpanas has been working in the field of Entity Resolution for the last 5 years, publishing relevant methods in major journals and conferences.



Title: Community Detection and Evaluation in Social and Information Networks

Christos Giatsidis

Ecole Polytechnique, France, http://www.lix.polytechnique.fr/~giatsidis

Fragkiskos D. Malliaros

Ecole Polytechnique, France, http://www.lix.polytechnique.fr/~fmalliaros

Michalis Vazirgiannis

Ecole Polytechnique, France, http://www.lix.polytechnique.fr/~mvazirg and Athens University of Economics and Business, Greece

Abstract

Graphs (or networks) constitute a dominant data structure and appear in essentially all forms of information (e.g., social and information networks, technological networks and networks from the areas of biology and neuroscience). A cornerstone issue in the analysis of such graphs is the detection and evaluation of communities (or clusters), which bear multiple and diverse semantics. Typically, the communities correspond to groups of nodes that tend to be highly similar, sharing common features, while nodes of different communities show low similarity. Detecting and evaluating the community structure of real-world graphs constitutes an essential task in several areas, with many important applications. For example, communities in a social network (e.g., Facebook, Twitter) correspond to individuals with increased social ties (e.g., friendship relationships, common interests). The goal of this tutorial is to present community detection and evaluation techniques as mining tools for real graphs. We present a thorough review of graph clustering and community detection methods, demonstrating their basic methodological principles. Special attention is given to the degeneracy (k-cores and extensions) approach for community evaluation, along with several case studies on real-world networks.
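As a rough illustration of the degeneracy notion mentioned above, the following sketch computes the core number of every node by repeatedly peeling off a node of minimum remaining degree. It is a plain, unoptimized version of the standard k-core decomposition and does not cover the extensions presented in the tutorial; the function name and the toy graph are assumptions made for this example.

```python
def core_numbers(adjacency):
    """Return the core number of each node of an undirected graph.

    `adjacency` maps every node to the set of its neighbours.
    """
    degree = {node: len(neighbours) for node, neighbours in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        # Peel a node of minimum remaining degree.
        node = min(remaining, key=degree.__getitem__)
        # The core number never decreases along the peeling order.
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for neighbour in adjacency[node]:
            if neighbour in remaining:
                degree[neighbour] -= 1
    return core


# Hypothetical toy graph: a triangle (a, b, c) with a pendant node d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cores = core_numbers(graph)
print(cores)
```

In this toy graph the triangle forms the 2-core, while the pendant node belongs only to the 1-core. Densely connected groups surface as the nodes with the highest core numbers.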

Bio

Christos Giatsidis is currently a Post-doctoral researcher in the Computer Science Laboratory at Ecole Polytechnique in France. He received his Diploma in Computer Science from the Athens Univ. of Economics & Business, Greece in 2009 and his PhD from Ecole Polytechnique, under the supervision of Prof. Michalis Vazirgiannis. In 2014 he received a "thesis prize" for his thesis entitled "Graph Mining and Community Detection with Degeneracy". He has experience in both the research and industrial domains. He has published seven refereed articles in international journals and conferences in the areas of data/web mining and social network analysis. His research interests include data/graph mining and algorithms for big data management.

Fragkiskos D. Malliaros is a Ph.D. candidate at Ecole Polytechnique in France, working under the supervision of Prof. Michalis Vazirgiannis. He received his Diploma and his M.Sc. degree from the University of Patras, Greece in 2009 and 2011 respectively. He is the recipient of the 2012 Google European Doctoral Fellowship in Graph Mining. During the summer of 2014, he will be a research intern at the Palo Alto Research Center (PARC), working on anomaly detection in social networks. He has also published six refereed articles in international journals and conferences. His research interests span the broad areas of data mining, algorithmic data analysis and data management, with a focus on mining and analysis of large, time-evolving graphs.

Michalis Vazirgiannis is a Professor at Ecole Polytechnique, France and at AUEB, Greece and the leader of the Data Science and Mining (DaSciM) team. He has worked as a researcher at several institutions: NTUA, GMD-IPSI (currently Fraunhofer - IPSI), Germany, Fern-Universitaet Hagen, project VERSO (later GEMO) at INRIA/Paris, IBM India Research Laboratory and Max-Planck-Institut für Informatik (Saarbruecken, Germany). He held a Marie Curie Intra-European fellowship (2006-2007) in the area of "P2P Web Search", hosted by INRIA FUTURS in Orsay, Paris. He is currently working in the area of Data Science for Big Data, aiming at harnessing the potential of machine learning algorithms for large scale data sets including text and graphs. More specifically, his current work is on graph degeneracy for large scale graph mining, graph based text retrieval, learning models from time series data, and text mining for the web (i.e., advertising, news streams).



Title: Extensions on Map-Reduce

Himanshu Gupta

IBM Research, India

L Venkata Subramaniam

IBM Research, India

Sriram Raghavan

IBM Research, India

Abstract

Map-Reduce has emerged as a popular framework for building distributed and large-scale analytics applications. This is mainly due to the salient features the framework provides, such as scalability, fault-tolerance and ease of programming. However, despite its merits and success, the map-reduce framework has performance limitations for miscellaneous analytical tasks. This tutorial will present an overview of various systems and algorithms which have extended the map-reduce framework to address these limitations and improve its performance. The tutorial will run in four parts. The tutorial will start with an introduction of the map-reduce framework along with its strengths and limitations. The first part will look at systems which focus on processing relational data and on providing indexing support on map-reduce. The second part will discuss the systems providing support for incremental, iterative and recurring queries. The third part will present an overview of systems which improve the performance of the map-reduce framework in a variety of ways, such as skew-management, data-placement and reusing the results of a computation. The fourth part will finally look at various initiatives within IBM to improve the capabilities of its big-data product, IBM BigInsights.
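For readers unfamiliar with the programming model that these systems extend, the following single-process sketch mimics the map, shuffle and reduce phases on a word-count task. It only illustrates the model under simplifying assumptions (no distribution, no fault tolerance); it is not Hadoop code, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and stream the emitted (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values collected for it."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every word of the input line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the partial counts of a single word."""
    return sum(values)


lines = ["map reduce map", "reduce shuffle map"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

Several of the limitations named in the abstract show up even in this toy: an iterative job would rerun the whole pipeline on every iteration, and a single very frequent key would send most of the work to one reducer call.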

Bio

Himanshu Gupta is currently working as a technical staff member at IBM Research, India. His research interests include information integration, big data, Hadoop and map-reduce processing, and data management. He has extensive experience in Hadoop and map-reduce processing, has published multiple papers in premier database conferences in this space and has contributed to various IBM big-data projects and initiatives. He holds a BTech and an MS in Computer Science from the Indian Institute of Technology, Kanpur and the Indian Institute of Technology, Delhi respectively.


L Venkata Subramaniam is a Senior Technical Staff Member (STSM) and manages the Data-Fusion & Big-Data Solutions group at IBM Research India. He received his PhD from the Indian Institute of Technology, Delhi in 1999. His research focuses on unstructured information management, statistical natural language processing, noisy text analytics, text and data mining, information theory, and speech and image processing. He often teaches and supervises student theses at IIT Delhi on these topics. He co-founded the AND (Analytics for Noisy Unstructured Text Data) workshop series and also co-chaired the first four workshops, 2007-2010. He was guest co-editor of two special issues on Noisy Text Analytics in the International Journal of Document Analysis and Recognition in 2007 and 2009.

Sriram Raghavan is a Senior Technical Staff Member (STSM) and Senior Manager of the Information & Analytics Department at IBM Research - India. In his current role, he leads a team of researchers building the next generation of IBM's platforms for big data and cognitive applications. His team focuses on research directions at the intersection of data management, text analytics/NLP, machine learning, and distributed systems. Prior to joining IBM Research - India in 2010, Sriram spent eight years at IBM's Almaden Research Center as a Research Staff Member and Manager of the Search and Analytics research group. Sriram is an alumnus of the Indian Institute of Technology, Madras and Stanford University.



Title: Similarity Search: Navigating the choices for Similarity Operators

Deepak P

IBM Research, India, deepak.s.p@in.ibm.com

Prasad M. Deshpande

IBM Research, India, prasdesh@in.ibm.com

Abstract

With the growing variety of entities that have a presence on the web, increasingly sophisticated data representation and indexing mechanisms are being devised to retrieve the entities relevant to a query. Though relatively less discussed, another dimension of retrieval that has recorded tremendous progress over the years is the development of mechanisms to enhance expressivity in specifying information needs; this has been driven by advances in research on similarity operators. In this tutorial, we focus on the vocabulary of similarity operators, which has grown from the set of just two operators, top-k and skyline search, that existed in the early 2000s. Today, there are efficient algorithms to process complicated needs such as finding the top-k customers for a product, wherein the customers are sorted based on the rank of the chosen product in their preference list. Arguably due to the complexity in the specification of new operators such as the above, uptake of such similarity operators has been low, even though the emergence of complex entities such as social media profiles warrants a significant expansion in query expressivity. In this tutorial, we systematically survey the set of similarity operators and the mechanisms to process them effectively. We believe that the importance of similarity search operators is immense in an era when the web is populated with increasingly complex objects spanning the entire spectrum, though this is most pronounced in the social and e-commerce web.
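The two classical operators named above, top-k and skyline, can be contrasted with a small sketch. It assumes objects are numeric tuples in which larger values are better on every dimension; the weights, the sample points and the function names are hypothetical, and the quadratic skyline loop is chosen for clarity rather than efficiency.

```python
import heapq


def top_k(objects, score, k):
    """Return the k objects with the highest value of a user-supplied scoring function."""
    return heapq.nlargest(k, objects, key=score)


def dominates(a, b):
    """True if a is at least as good as b on every dimension and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def weighted_score(point):
    """Hypothetical preference: the first attribute matters more than the second."""
    return 0.7 * point[0] + 0.3 * point[1]


points = [(9, 2), (7, 7), (3, 9), (6, 6), (2, 3)]
best_two = top_k(points, weighted_score, 2)
frontier = skyline(points)
print(best_two, frontier)
```

Top-k needs an explicit scoring function, so its answer changes with the weights. The skyline needs no weights and returns every object that could be the best under some monotone preference.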

Bio

Deepak P is a researcher in the Information Management Group at IBM Research - India, Bangalore. He obtained his B.Tech degree from Cochin University, India followed by M.Tech and PhD degrees from IIT Madras, India, all in Computer Science. His current research interests include Similarity Search, Spatiotemporal Data Analytics, Graph Mining, Information Retrieval and Machine Learning. He has authored over 20 papers in reputed conferences and has filed several patent applications with the USPTO, including two issued patents. He has been working in the area of similarity search since 2008; he co-chaired the 2011 EDBT Workshop on New Trends in Similarity Search. He is a senior member of the ACM and IEEE.



Prasad M Deshpande is a Senior Technical Staff Member at IBM Research - India and Manager of the Watson Foundations - Platforms and Infrastructure group. His areas of expertise lie in data management, specifically data integration, OLAP, data mining and text analytics. He received a B.Tech in Computer Science and Engineering from IIT, Bombay and an M.S. and Ph.D. in Database systems from the University of Wisconsin, Madison. He is an ACM Distinguished Scientist and a member of the IBM Academy of Technology. His current focus is on data discovery and curation for big data platforms, data integration and machine data analytics. He worked at several companies, including IBM Almaden Research Center, prior to joining IBM Research - India in 2005. He has more than 40 publications in reputed conferences and journals and 11 patents issued. He has served on the Program Committee of many conferences and has been the Industry Chair for COMAD 2009 and COMAD 2013, and PC Co-Chair for COMAD 2011, ACM Compute 2010 and the 2011 EDBT Workshop on New Trends in Similarity Search.