Aristotle University of Thessaloniki - Department of Informatics - M.Sc. Program

"Technologies for Big Data Management and Analytics"







Apostolos N. Papadopoulos, Assistant Professor ( papadopo@csd.auth.gr )
Anastasios Gounaris, Assistant Professor ( gounaria@csd.auth.gr )
Data Engineering Lab, Dept of Informatics, Aristotle University of Thessaloniki
Course starts 8 October 2015, 14:00. Room 3, Ethnikis Antistasis

INTRODUCTION
Welcome to Technologies for Big Data Management and Analytics. In this course, you will learn fundamental as well as advanced concepts related to the efficient processing and analysis of Big Data. We focus both on the state-of-the-art technologies in use and on the underlying algorithmic concepts, so that by the end of the course you will know the main tools and techniques for processing and analyzing massive amounts of data. So, open your mind and get ready for this exciting journey!
REQUIREMENTS
To be able to attend the lectures, enrolled students must bring their laptops to class during hands-on lectures. The course is based on open-source software that should be installed on Ubuntu 14.04 LTS. For your convenience, we have prepared a Virtual Machine containing all necessary software, which will be distributed to you in class. To use it, you must have VirtualBox installed on your machine. You can find VirtualBox at the following link: https://www.virtualbox.org/wiki/Downloads. It therefore does not matter which operating system you are working on (Linux, MacOS, Windows): as long as you have VirtualBox installed, you can attend the course without problems. If you prefer to install all the software yourself, we can provide some assistance as well.
In addition to this website, you may use the course page on the e-learning platform of Aristotle University. You may self-enroll in the class by following the link https://elearning.auth.gr/course/view.php?id=7951

The software that you should install includes the following (the version to be used in the course is given in parentheses):
PROJECTS
Students must form groups of two (depending on the number of enrolled students). Each team should carry out two assignments: i) a theoretical one (much like a survey), where the team covers a specific topic related to big data processing and analytics, and ii) a more practical one, where the team must implement an algorithm using the Spark engine and perform a series of experiments. The projects account for 70% of the final grade. The remaining 30% comes from the final examination (end of semester).
ANNOUNCEMENTS
19/10/2015: The time of the lectures has changed to 14:00 - 16:30 on Thursdays. The room will be Room 3 (lab).

LECTURES AND DATES DESCRIPTION RESOURCES
Lecture 1 - 8 October 2015
Instructor: A. Gounaris
Introduction to big data management and analytics. Motivation. Applications that require big data management. What is big data (volume, velocity, veracity, etc.). The complete Hadoop ecosystem. Hadoop and MapReduce. Basics of the MapReduce model of computation. Simple applications. Spark. slides,
useful book: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
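As a rough illustration of the MapReduce model of computation introduced in this lecture, the sketch below simulates the map, shuffle and reduce phases of word count in plain Python. It is not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and everything runs in a single process purely to show how data flows between the phases.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical toy input, standing in for the lines of a file stored in HDFS.
counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; here the three phases are ordinary function calls.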
Lecture 2 - 15 October 2015
Instructor: A. Gounaris
HANDS-ON. Using HDFS and Hadoop. Copying files between the local file system and HDFS. Executing word count using Java. Writing MapReduce programs in Java. slides, lab resources
Lecture 3 - 22 October 2015
Instructor: A. Gounaris
Introduction to NoSQL. Types of NoSQL databases (key-value stores, column-family stores, document stores, graph stores). Focus on HBase and Cassandra: architecture, main characteristics and applications. slides, Scalable Datastores by R. Cattell
Lecture 4 - 29 October 2015
Instructor: A. Gounaris
HANDS-ON. Creating and using tables in HBase. Executing joins in HBase. Template Java file for hands-on

Lecture 5 - 5 November 2015
Instructor: A. Papadopoulos
The Scala programming language. The Spark engine. Fundamental concepts (RDDs, transformations, actions, broadcast variables, accumulators, etc.). Spark SQL, MLlib, Spark Streaming and GraphX. Spark and HDFS.
PROJECT ASSIGNMENT PRESENTATION
resources
Lecture 6 - 12 November 2015
Instructor: A. Papadopoulos
HANDS-ON. Spark standalone cluster configuration. Running spark-shell, using spark-submit, application development using IntelliJ IDEA. More on Scala. Implementing an Information Retrieval application using Spark and Scala: i) simple keyword-based search, ii) search using tf-idf. resources
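To give an idea of the two search variants named above, here is a minimal centralized sketch in plain Python (the lecture itself uses Spark and Scala). It assumes whitespace tokenization, raw term counts as tf, and idf = log(N / df); these choices, the function names and the toy documents are all assumptions made for illustration, not the lab's actual code.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term frequencies and a per-term idf table."""
    tfs = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts
    # the number of documents that contain each term.
    df = Counter(term for tf in tfs for term in tf)
    idf = {term: math.log(n / count) for term, count in df.items()}
    return tfs, idf


def keyword_search(query, tfs):
    """i) Keyword search: ids of documents containing every query term."""
    terms = tokenize(query)
    return [i for i, tf in enumerate(tfs) if all(t in tf for t in terms)]


def tfidf_search(query, tfs, idf):
    """ii) tf-idf search: rank documents by summed tf * idf of query terms."""
    terms = tokenize(query)
    scores = []
    for doc_id, tf in enumerate(tfs):
        score = sum(tf[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection.
docs = [
    "spark runs on hadoop clusters",
    "scala is the language of spark",
    "hbase stores big tables",
]
tfs, idf = build_index(docs)
matches = keyword_search("spark", tfs)
ranking = tfidf_search("spark scala", tfs, idf)
print(matches, ranking)
```

Keyword search only filters documents, while tf-idf also orders them: a rare term such as "scala" contributes more to the score than a term that appears in most documents.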
Lecture 7 - 19 November 2015
Instructor: A. Gounaris
Special topics in Apache Spark. Tuning, tips, tricks and traps. Experiences from running jobs in a large cluster. "Spark Deployment and Performance Evaluation on the MareNostrum Supercomputer"
"Making Sense of Performance in Data Analytics Frameworks"
Lecture 8 - 26 November 2015
Instructor: A. Papadopoulos
HANDS-ON. Graph processing and analytics. Applications: i) simple graph processing (e.g., degree sequence), ii) triangle counting, iii) more ... resources
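The two graph applications listed above can be sketched on a single machine as follows. This is plain Python rather than Spark or GraphX, it assumes an undirected simple graph given as an edge list with comparable vertex ids, and the function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_sequence(adj):
    """Return the vertex degrees sorted in non-increasing order."""
    return sorted((len(nbrs) for nbrs in adj.values()), reverse=True)


def count_triangles(adj):
    """Count each triangle once, at its smallest vertex id."""
    total = 0
    for u in adj:
        # Only look at neighbours with a larger id than u, so a triangle
        # {u, v, w} is examined only from its minimum vertex.
        higher = [v for v in adj[u] if v > u]
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                total += 1
    return total


# Hypothetical toy graph: two triangles sharing vertex 3.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
adj = build_adjacency(edges)
degrees = degree_sequence(adj)
triangles = count_triangles(adj)
print(degrees, triangles)
```

Ordering vertices so that each triangle is counted from one endpoint only is also a common way to avoid duplicate counting in distributed implementations.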
Lecture 9 - 3 December 2015
Instructor: A. Papadopoulos
HANDS-ON. Text and Graph Analytics in Python. Using Python in a centralized environment as well as in Spark (with PySpark). resources
Lecture 10 - 10 December 2015
Instructor: 
Special Topic: Distributed Core Decomposition in Spark. resources
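For readers unfamiliar with core decomposition, the sketch below computes core numbers with the classic centralized peeling procedure in plain Python. It is not the distributed Spark algorithm covered in this lecture; it only shows what the output of a core decomposition looks like. The function name and the toy graph are hypothetical, and the simple minimum search makes it quadratic in the number of vertices.

```python
from collections import defaultdict


def core_numbers(edges):
    """Core number of every vertex, by repeatedly peeling a min-degree vertex."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(degree)
    core = {}
    k = 0
    while remaining:
        # Pick a remaining vertex of minimum current degree.
        v = min(remaining, key=degree.get)
        # The core level never decreases as peeling proceeds.
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        # Removing v lowers the degree of its remaining neighbours.
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Hypothetical toy graph: a triangle (1, 2, 3) with a pendant vertex 4.
core = core_numbers([(1, 2), (2, 3), (1, 3), (3, 4)])
print(core)
```

Vertices of the triangle end up in the 2-core, while the pendant vertex belongs only to the 1-core.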
Lecture 11 - 17 December 2015
Instructor: ???

resources
XMAS VACATION

Lecture 12 - 21 January 2016 Presentation of students' projects OR Data Camp. resources
Lecture 13 - 28 January 2016 Presentation of students' projects OR Data Camp. resources
EXAMS (date TBA)