Week 1 - Continuation
Article Outline:
Everything about Cloud
Apache Spark - Just the bare basics
Diving into D's - Db, Data Warehouse, and Data Lake
Data Pipelines: Making Data Move
Everything about Cloud
Before we jump into the cloud craziness, let's rewind a bit. Back in the day, on-premise setups, or "on-prem," ruled the scene. This involved companies managing private data centers within their own facilities.
When people talk about the cloud today, they generally mean public cloud services, where a third-party provider offers infrastructure on a pay-as-you-go basis. Familiar names in this space include AWS, GCP, and Azure. However, there are other models, including:
Private Cloud: cloud-style infrastructure dedicated to a single organization, whether hosted in its own data center or by a provider.
Hybrid Cloud: a combination of public and private cloud, with workloads distributed across both.
Apache Spark
Apache Spark is a versatile, in-memory computing engine originally developed in Scala. In many cases, it serves as a faster alternative to MapReduce. Spark stands out as a plug-and-play computing engine: it can work with any storage system or resource manager. This flexibility contrasts with MapReduce, which is tied specifically to HDFS and YARN.
Moreover, Apache Spark offers APIs in several languages, including Python, Scala, Java, and R.
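To make this concrete, here is a minimal PySpark sketch of the DataFrame API. It assumes PySpark is installed and runs locally; the file name and column names (sales.csv, region, amount) are hypothetical placeholders, not part of any real dataset.

# Minimal PySpark sketch: read a CSV into a DataFrame and aggregate it in memory.
# Assumes pyspark is installed; "sales.csv" and its columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build (or reuse) a local SparkSession -- the entry point to Spark's DataFrame API.
spark = SparkSession.builder.appName("spark-basics").master("local[*]").getOrCreate()

# Schema is inferred here for brevity; in production you would declare it explicitly.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Group and aggregate -- Spark plans the work lazily and executes it on .show().
revenue_by_region = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_amount"))
)
revenue_by_region.show()

spark.stop()

The same logic could be written in Scala, Java, or R against the same engine, which is the point of Spark's multi-language support.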
Diving into D's - Db, Data Warehouse, and Data Lake
Database (Db):
Databases store transactional data that is typically structured and capable of forming relationships. They are ideal for storing recent data used in day-to-day operations, and they usually follow a schema-on-write approach.
Data Warehouse (DWH):
Data is loaded into a Data Warehouse through Extract, Transform, Load (ETL) operations. DWHs can store large volumes of data, often terabytes or petabytes, and are preferred for analytical workloads, following a schema-on-write approach (e.g., Teradata).
Data Lake:
Data loading in a Data Lake follows an Extract, Load, Transform (ELT) process: data is stored first in its raw form and transformed when it is read. Cost-effective and flexible, Data Lakes employ a schema-on-read approach (e.g., Amazon S3, GCS).
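To illustrate schema-on-read, here is a small PySpark sketch. The lake path and field names are hypothetical; in practice the path would point to an object store such as S3 or GCS, with the appropriate connector configured.

# Sketch of schema-on-read: raw files sit in the lake untouched, and a schema is
# applied only when the data is read. Paths and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# In a real setup this would be an object-store path such as s3a://... or gs://...
raw_path = "datalake/raw/events/"

# The schema lives in the reading code, not in the storage layer.
event_schema = StructType([
    StructField("event_id",    StringType(),    nullable=False),
    StructField("event_type",  StringType(),    nullable=True),
    StructField("occurred_at", TimestampType(), nullable=True),
])

# ELT in miniature: the data was loaded raw; interpretation happens here, at read time.
events = spark.read.schema(event_schema).json(raw_path)
events.groupBy("event_type").count().show()

Contrast this with a warehouse, where the schema is enforced when the data is written, and any row that does not fit is rejected up front.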
Data Pipelines: Making Data Move
A data pipeline is a sequence of steps that moves data from a source to a destination. Key components of a typical data pipeline include the following (a short sketch of these stages appears after the list):
Data Sources: Systems or applications where data originates, such as databases, APIs, files, logs, social media platforms, and IoT devices.
Data Ingestion: Retrieve data from various sources, including databases, APIs, files, or streaming platforms.
Data Transformation: Cleaning, validating, and transforming data into a standardized format suitable for analysis or storage.
Data Loading: Loading transformed data into a target destination, which could be a data warehouse, database, or cloud storage platform.
Orchestration: Managing the overall workflow and scheduling to ensure proper execution and coordination of data processing tasks.
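Here is a toy end-to-end sketch of the ingestion, transformation, and loading stages in plain Python. The API endpoint, table name, and SQLite file are hypothetical placeholders; a real pipeline would target a proper warehouse and add error handling and retries.

# Toy sketch of the ingestion -> transformation -> loading stages in plain Python.
# The API URL, table name, and database file are hypothetical placeholders.
import sqlite3
import requests

def ingest(url: str) -> list[dict]:
    """Pull raw records from a source system (here, a hypothetical REST API)."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

def transform(records: list[dict]) -> list[tuple]:
    """Clean and standardize the raw records into rows ready for loading."""
    return [
        (r["id"], r["name"].strip().lower(), float(r["amount"]))
        for r in records
        if r.get("amount") is not None  # drop rows that fail a basic validity check
    ]

def load(rows: list[tuple], db_path: str = "analytics.db") -> None:
    """Load transformed rows into the target store (a local SQLite file for this sketch)."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, name TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

if __name__ == "__main__":
    raw = ingest("https://api.example.com/orders")  # hypothetical endpoint
    load(transform(raw))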
Common technologies for building data pipelines include:
Extract, Transform, Load (ETL) Tools
Apache Kafka
Apache Airflow (see the orchestration sketch after this list)
Cloud-Based Services (AWS Glue, Azure Data Factory, GCP Dataflow)
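Of these, Apache Airflow is a popular choice for the orchestration layer. Below is a minimal DAG sketch, assuming Airflow 2.x; the DAG id, schedule, and task callables are illustrative placeholders rather than a definitive setup.

# Minimal Airflow 2.x DAG sketch for orchestrating the toy pipeline above.
# DAG id, schedule, and task bodies are illustrative placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("pull raw data from the source")

def transform():
    print("clean and standardize the data")

def load():
    print("write the result to the warehouse")

with DAG(
    dag_id="toy_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # run once per day
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Orchestration: enforce ingest -> transform -> load ordering.
    ingest_task >> transform_task >> load_task

The scheduler takes care of running the tasks in order, retrying failures, and recording each run, which is exactly the coordination role described under Orchestration above.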
Further Readings:
The resources I consulted are credited in the sections above and contributed valuable insights to the content presented.
Image Credits: I do not claim credit for the image; all acknowledgment and appreciation go to the original creator.
If you found this article helpful or learned something new, please show your support by liking it and following me for updates on future posts.
Till next time, happy coding!