An Advanced Lakehouse with Intelligent Data Ingestion and Storage Management System

Authors

Harby, Ahmed A.

Date

2025-04-03

Type

thesis

Language

eng

Keyword

Lakehouse, Intelligent system, Vectorization, Smart Ingestion

Abstract

The Lakehouse (LH) is a cloud data storage service that combines the strengths of data warehouses (DWs) and data lakes (DLs). DLs handle large volumes of diverse data but often lack robust analytics, metadata, and query tools. DWs, conversely, excel at analytics and decision support but struggle with dynamic data ingestion and organization. The LH aims to bridge these gaps with a unified system that intelligently manages structured, semi-structured, and unstructured data. Our research enhances the LH system by incorporating advanced knowledge extraction, storage, and metadata management. We use big data management techniques and machine learning models to learn and store critical data patterns, reducing storage needs while keeping the data reproducible for future analysis. This approach simplifies data management, improves decision-making capabilities, and helps prevent data swamps. To address a key shortfall in current LH models, we automate the data ingestion process, replacing inefficient, error-prone manual methods in order to increase efficiency, reduce costs, and shorten time to insight. We also leverage AI-powered tools to extract meaningful information from large datasets, tackling the 5V challenges of big data: Velocity, Volume, Variety, Veracity, and Value. A distinctive contribution of our research is the conversion of unstructured raw data into semi-structured formats such as numerical vectors, which are easier to analyze. This transformation is essential for linking diverse data types (text, images, and videos) in analytical processing. The vector-based approach enhances metadata management and enables advanced search techniques for faster record retrieval. It also improves data quality, governance, and lineage tracking, and it ensures consistency across the LH ecosystem by eliminating duplicates.
Our validation results show a 62.6% storage reduction while maintaining high retrieval accuracy, balancing efficiency and precision. Vectorization and profiling further improve ingestion performance by 62% over the baseline, demonstrating the effectiveness of these enhancements on large-scale datasets. These findings support the scalability and future-readiness of our approach to optimizing storage and processing efficiency.
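
As a rough illustration of the vector-based ideas in the abstract (duplicate elimination at ingestion and similarity search over stored records), the following is a minimal sketch and not the thesis's implementation. It assumes a toy feature-hashing embedding for text and a cosine-similarity threshold; the names `embed`, `VectorStore`, `DIM`, and the 0.95 threshold are all hypothetical choices made for this example.

```python
import hashlib
import math
from collections import Counter

DIM = 256  # assumed dimensionality of the toy embedding (hypothetical choice)


def embed(text: str, dim: int = DIM) -> list[float]:
    """Toy embedding: hash each token into one of `dim` buckets, then L2-normalize.

    This stands in for whatever learned vectorizer a real system would use.
    """
    vec = [0.0] * dim
    for token, count in Counter(text.lower().split()).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += count
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors that are already L2-normalized."""
    return sum(x * y for x, y in zip(a, b))


class VectorStore:
    """In-memory store that skips near-duplicate records at ingestion time."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # assumed similarity cutoff for "duplicate"
        self.records: list[tuple[str, list[float]]] = []

    def ingest(self, record_id: str, text: str) -> str:
        """Store the record, or return the id of an existing near-duplicate."""
        vec = embed(text)
        for existing_id, existing_vec in self.records:
            if cosine(vec, existing_vec) >= self.threshold:
                return existing_id  # duplicate: keep only the earlier copy
        self.records.append((record_id, vec))
        return record_id

    def search(self, query: str, k: int = 3) -> list[str]:
        """Return the ids of the k stored records most similar to the query."""
        qv = embed(query)
        ranked = sorted(self.records, key=lambda r: cosine(qv, r[1]), reverse=True)
        return [rid for rid, _ in ranked[:k]]
```

In this sketch, calling `ingest` twice with the same text under different ids stores a single vector and returns the first id both times, which is one simple way storage could shrink when duplicates are common. The linear scan is only for clarity; a system at scale would presumably use an approximate nearest-neighbor index instead.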

License

Queen's University's Thesis/Dissertation Non-Exclusive License for Deposit to QSpace and Library and Archives Canada
ProQuest PhD and Master's Theses International Dissemination Agreement
Intellectual Property Guidelines at Queen's University
Copying and Preserving Your Thesis
This publication is made available by the authority of the copyright owner solely for the purpose of private study and research and may not be copied or reproduced except as permitted by the copyright laws without written authority from the copyright owner.
Attribution 4.0 International
