Паралелизациjа на алгоритми за предобработка и класификациjа на податоци

Здравевски, Ефтим

Please use this identifier to cite or link to this item: http://hdl.handle.net/20.500.12188/17580

Title:	Паралелизациjа на алгоритми за предобработка и класификациjа на податоци
Other Titles:	Parallelization of algorithms for data processing and classification
Authors:	Здравевски, Ефтим
Keywords:	data preprocessing, feature engineering, feature extraction, feature selection, classification algorithms, parallelization, distributed computing, computer cluster
Issue Date:	2017
Publisher:	FINKI, UKIM, Skopje
Source:	Здравевски, Ефтим (2017). Паралелизациjа на алгоритми за предобработка и класификациjа на податоци. Doctoral dissertation. Skopje: FINKI, UKIM.
Abstract:	Companies and organizations routinely collect data of various types. For many of them, success depends directly on their ability to analyze the collected data. Machine learning offers powerful tools for automatic analysis of large amounts of data. Before machine learning algorithms can be applied, the business and the data need to be understood, the data needs to be preprocessed, feature engineering needs to be performed, and relevant features need to be selected. These phases take up a lot of time because they depend heavily on the nature of the data and the data collection process. One of the main tasks in preprocessing is the generation of features that describe various aspects of the business problem and facilitate the building of high-quality machine learning models. Feature generation needs to take into account the data types (i.e. numeric, nominal or text) and the way the data was generated (i.e. event-driven generation or constant-rate collection). Feature selection identifies the most relevant features and enables machine learning algorithms to focus on the aspects of the data that are most useful for analysis and prediction, which in turn improves predictive performance and lowers the computational and memory complexity of the algorithms. In recent years, companies have been collecting high-volume data at high velocity. IBM estimates that over 90% of the world's data was generated in the last couple of years. The growth in data volume is attributed to ubiquitous mobile devices and sensors; software logs; user interaction with web sites, applications and multimedia; social networks, etc. Depending on the context, data utilization faces at least some of the following challenges: collection, analysis, processing, searching, visualisation, transfer, security, privacy, report generation, decision making, providing recommendations and decision support, etc. 
When the data is so big that these challenges cannot be easily overcome with traditional algorithms and methods, it is popularly called Big Data. In this context, the implementation and execution of algorithms for data preprocessing, feature engineering and building prediction models become even more complex. This PhD thesis proposes novel techniques for feature transformation of nominal and time series data. It also proposes and analyzes hybrid methods for feature selection which significantly reduce the feature selection time and, more importantly, lower the time and memory requirements for building prediction models. By utilizing generic architectures for parallel and distributed processing, it proposes implementations of several feature selection algorithms that offer high scalability. The parallelization methods do not require any specialized hardware and are suitable for a variety of domains, while providing scalability that can cope with data growth. Owing to the generic approaches proposed in this study, the methods and results can be utilized in many business applications and scientific projects. Namely, many businesses collect data in a variety of data types, which can be appropriately processed with the proposed methods so that their predictive potential is used optimally. Then, by applying the proposed methods for feature engineering, which include feature extraction and selection, predictive models could be created in a shorter time while being more robust. The parallel implementation of algorithms mitigates many of the Big Data challenges and provides scalability. As a result, data growth trends could be easily followed without the need for changes in the architecture of applications or reimplementation of their functionalities. This can be a huge benefit for companies because Big Data can be analyzed with sophisticated algorithms in real time. 
Consequently, it can bring companies competitive advantages by improving their services while increasing the number of users and their loyalty. From the users’ perspective, the quality of the services they consume will be improved. Moreover, novel groundbreaking services that were not previously possible could be offered to users.
Description:	Doctoral dissertation defended in 2017 at the Faculty of Computer Science and Engineering in Skopje, under the supervision of Prof. Dr. Andrea Kulakov.
URI:	http://hdl.handle.net/20.500.12188/17580
Appears in Collections:	UKIM 02: Dissertations from the Doctoral School / Дисертации од Докторската школа

Files in This Item:

File	Description	Size	Format
S-EftimZdravevski2017.pdf		7.84 MB	Adobe PDF
