Big data and the web: algorithms for data intensive scalable computing

De Francisci Morales, Gianmarco (2012) Big data and the web: algorithms for data intensive scalable computing. Advisor: Lucchese, Dr. Claudio. Coadvisor: Baraglia, Dr. Ranieri. pp. 170. [IMT PhD Thesis]

De Francisci_phdthesis.pdf - Published Version
Available under License Creative Commons Attribution No Derivatives.


This thesis explores the problem of large-scale Web mining by using Data Intensive Scalable Computing (DISC) systems. Web mining aims to extract useful information and models from data on the Web, the largest repository ever created. DISC systems are an emerging technology for processing huge datasets in parallel on large computer clusters. Challenges arise from both themes of research. The Web is heterogeneous: data lives in various formats that are best modeled in different ways. Effectively extracting information requires careful design of algorithms for specific categories of data. The Web is huge, but DISC systems offer a platform for building scalable solutions. However, they provide restricted computing primitives for the sake of performance. Efficiently harnessing the power of parallelism offered by DISC systems involves rethinking traditional algorithms. This thesis tackles three classical problems in Web mining. First, we propose a novel solution to finding similar items in a bag of Web pages. Second, we consider how to effectively distribute content from Web 2.0 to users via graph matching. Third, we show how to harness the streams from the real-time Web to suggest news articles. Our main contribution lies in rethinking these problems in the context of massive-scale Web mining, and in designing efficient MapReduce and streaming algorithms to solve them on DISC systems.

Item Type: IMT PhD Thesis
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
PhD Course: Computer Science and Engineering
Identification Number: 10.6092/imtlucca/e-theses/34
NBN Number: urn:nbn:it:imtlucca-27070
Date Deposited: 10 Jul 2012 14:32
URI: http://e-theses.imtlucca.it/id/eprint/34
