INTRODUCTION Welcome to Technologies for Big Data Management and Analytics. In this course, you will learn fundamental as well as advanced concepts related to the efficient processing and analysis of Big Data. We focus both on the state-of-the-art technologies in use and on algorithmic concepts. By attending this course you will learn the main tools and techniques for processing and analyzing massive amounts of data. So, open your mind and get ready for this exciting journey! |
REQUIREMENTS To be able to attend the lectures, enrolled students must bring their laptops to class during hands-on lectures. The course is based on open-source software that should be installed on Ubuntu 14.04 LTS. For your convenience, a Virtual Machine containing all necessary software has been created for you. The Virtual Machine will be distributed to you in class. In order to use it you must have VirtualBox installed on your machine. You can find VirtualBox at the following link: https://www.virtualbox.org/wiki/Downloads. Therefore, it doesn't matter which operating system you are working on (Linux, macOS, Windows). As long as you have VirtualBox installed, you can attend the course without problems. In case you want to install all the software yourself, we can provide some assistance as well. In addition to this website, you may use the course page hosted on the e-learning platform of Aristotle University. You may self-enroll in the class by following the link https://elearning.auth.gr/course/view.php?id=7951. The software that you should install includes the following (in parentheses, we mention the version to be used in the course): |
PROJECTS Students must form groups of two (depending on the number of enrolled students). Each team should carry out two assignments: i) a theoretical one (much like a survey), where teams should cover a specific topic related to big data processing and analytics, and ii) a more practical one, where teams must implement an algorithm using the Spark engine and perform a series of experiments. The projects account for 70% of the final grade. The remaining 30% comes from the final examination (end of semester). |
ANNOUNCEMENTS 19/10/2015: The time of the lectures has changed to 14:00 - 16:30 on Thursdays. The room will be Room 3 (lab). |
LECTURES AND DATES | DESCRIPTION | RESOURCES |
Lecture 1 - 8 October 2015
Instructor: A. Gounaris |
Introduction to big data management and analytics. Motivation. Applications that require big data management. What is big data (volume, velocity, veracity, etc.). The complete Hadoop ecosystem. Hadoop and MapReduce. Basics of the MapReduce model of computation. Simple applications. Spark. | slides, useful book: Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer |
Lecture 2 - 15 October 2015
Instructor: A. Gounaris |
HANDS-ON. Using HDFS and Hadoop. Copying files between the local file system and HDFS, in both directions. Executing word count using Java. Writing MapReduce programs in Java. | slides, lab resources |
Lecture 3 - 22 October 2015
Instructor: A. Gounaris |
Introduction to NoSQL. Types of NoSQL databases (key-value stores, column-family stores, document stores, graph stores). Focus on HBase and Cassandra: architecture, main characteristics and applications. | slides, Scalable Datastores by R. Cattell |
Lecture 4 - 29 October 2015
Instructor: A. Gounaris |
HANDS-ON. Creating and using tables in HBase. Executing joins in HBase. | template Java file for hands-on |
Lecture 5 - 5 November 2015
Instructor: A. Papadopoulos |
The Scala programming language. The Spark engine. Fundamental concepts (RDDs, transformations, actions, broadcast variables, accumulators, etc.). SQL, MLlib, Streaming and GraphX. Spark and HDFS. PROJECT ASSIGNMENT PRESENTATION | resources |
Lecture 6 - 12 November 2015
Instructor: A. Papadopoulos |
HANDS-ON. Spark standalone cluster configuration. Running spark-shell, using spark-submit, application development using IntelliJ IDEA. More on Scala. Implementing an Information Retrieval application using Spark and Scala: i) simple keyword-based search, ii) search using tf-idf. | resources |
Lecture 7 - 19 November 2015
Instructor: A. Gounaris |
Special topics in Apache Spark. Tuning, tips, tricks and traps. Experiences from running jobs in a large cluster. | "Spark Deployment and Performance Evaluation on the MareNostrum Supercomputer", "Making Sense of Performance in Data Analytics Frameworks" |
Lecture 8 - 26 November 2015
Instructor: A. Papadopoulos |
HANDS-ON. Graph processing and analytics. Applications: i) simple graph processing (e.g., degree sequence), ii) triangle counting, iii) more ... | resources |
Lecture 9 - 3 December 2015
Instructor: A. Papadopoulos |
HANDS-ON. Text and Graph Analytics in Python. Using Python in a centralized environment as well as in Spark (with PySpark). | resources |
Lecture 10 - 10 December 2015
Instructor: |
Special Topic: Distributed Core Decomposition in Spark. | resources |
Lecture 11 - 17 December 2015
Instructor: ??? |
|
resources |
XMAS VACATION | |
|
Lecture 12 - 21 January 2016 | Presentation of students' projects OR Data Camp. | resources |
Lecture 13 - 28 January 2016 | Presentation of students' projects OR Data Camp. | resources |
EXAMS (date TBA) | |
|