Introduction to ETL, ETL vs. data pipelines, and what they look like when we process big data. The challenges, complications, and things to consider when architecting a big data system.
Stream processing vs. batch processing, and how we can combine both using the Lambda architecture.
Learn more:
aka.ms/data-guide
aka.ms/stream-processing
aka.ms/building-blocks
aka.ms/start-with-the-cloud
14. Data Challenges - Collection of data
Unstructured: Audio, video, images. Meaningless without adding some structure.
Semi-Structured: JSON, XML, sensor data, social media, device data, web logs. Flexible data model structure.
Structured: CSV, Columnar Storage (Parquet, ORC). Strict data model structure.
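To make the difference between a strict and a flexible data model concrete, here is a minimal sketch in Python. The device readings, field names, and helper functions are hypothetical and only illustrate the idea: a CSV reader can rely on every column being present, while a JSON reader has to tolerate missing or nested fields.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
CSV_TEXT = "device_id,temperature\nd1,21.5\nd2,19.0\n"

# Semi-structured: records share some keys, but fields may be missing or nested.
JSON_LINES = [
    '{"device_id": "d1", "temperature": 21.5, "tags": ["lab"]}',
    '{"device_id": "d3", "location": {"room": "B2"}}',
]


def read_structured(text):
    """Parse CSV rows; each row is expected to carry every column."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "device_id": row["device_id"],
            "temperature": float(row["temperature"]),
        })
    return rows


def read_semi_structured(lines):
    """Parse JSON records, defaulting absent fields to None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            "device_id": record["device_id"],
            "temperature": record.get("temperature"),
        })
    return rows


structured_rows = read_structured(CSV_TEXT)
semi_rows = read_semi_structured(JSON_LINES)
```

Unstructured data such as audio or images has no equivalent reader here: it needs an extra step (for example metadata extraction) before it can be queried at all.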
@adipolak
31. What did we learn today:
• ETL vs Data Pipelines
• Big Data and Data challenges
• Data Engineers
• Streaming vs Batch processing
• Architectures
• Tools
Relational databases (RDBMS) work with structured data. Non-relational databases (NoSQL) work with semi-structured data.
Relational and non-relational are data models, describing how data is organized. Structured, semi-structured, and unstructured are data types.
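As a rough illustration of the two data models, the sketch below uses Python's built-in sqlite3 module for the relational side and a plain dictionary as a stand-in for a document store. The `users` table, the `documents` dictionary, and the sample records are assumptions made for this example, not part of the talk.

```python
import sqlite3

# Relational model: the schema is declared up front and enforced on write.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")


def try_insert_without_name(connection):
    """Return True if a row lacking the required 'name' column is accepted."""
    try:
        connection.execute("INSERT INTO users (id) VALUES (2)")
        return True
    except sqlite3.IntegrityError:
        # The NOT NULL constraint rejects the incomplete row.
        return False


accepted = try_insert_without_name(conn)
row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

# Non-relational (document) model, simulated with a dict:
# each document may carry a different set of fields.
documents = {
    "u1": {"name": "Ada"},
    "u2": {"email": "grace@example.com", "roles": ["admin"]},
}
field_sets = {key: sorted(doc) for key, doc in documents.items()}
```

The relational table refuses the incomplete row, while the document collection stores records with different shapes side by side and leaves validation to the application.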
Variety: It can be structured, semi-structured, or unstructured
Velocity: It can be streaming, near real-time or batch
Volume: It can range from 1 GB to 1 ZB
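The velocity dimension is where batch and stream processing meet. The talk abstract mentions combining both with a Lambda architecture; the following is a minimal in-memory sketch of that idea, assuming hypothetical page-view events and function names. A real system would use distributed storage and a stream processor rather than Python lists.

```python
from collections import Counter


def batch_view(master_events):
    """Batch layer: recompute counts from the full historical dataset."""
    return Counter(event["page"] for event in master_events)


class SpeedLayer:
    """Speed layer: update counts incrementally as each event arrives."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        self.view[event["page"]] += 1


def serve(batch, realtime):
    """Serving layer: merge the batch and real-time views for a query."""
    return batch + realtime


# Hypothetical events: 'historical' is already covered by the last batch run,
# 'recent' arrived after it and is only visible to the speed layer.
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
recent = [{"page": "/docs"}, {"page": "/pricing"}]

speed = SpeedLayer()
for event in recent:
    speed.on_event(event)

merged = serve(batch_view(historical), speed.view)
```

The batch view is accurate but stale, the speed view is fresh but partial, and the merged result covers both. Once the next batch run absorbs the recent events, the speed layer's counts for them would be discarded.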
Apache Hadoop-based analytics - Hadoop is an open source platform for distributed storage and distributed processing of large data sets. Its ecosystem provides data storage, processing, data access, security, governance, and operations. You need a good grasp of tools such as MapReduce, Hadoop, and HBase.
Knowledge of SQL - you need strong knowledge of SQL and of relational database management systems (RDBMS) to manage data.
Data Warehousing - learning how to construct and use a data warehouse is a must. Data warehousing helps you aggregate data from one or more sources to compare and analyse it for better business decisions.
Data Architecture - you need knowledge of building complex database systems for companies. The term also covers the processes that address data at rest, data in motion, and data sets, and how they relate to data-dependent applications and processes.
Coding skills - you need good coding skills in languages such as Python, Java, or Perl.
Machine Learning - although machine learning is usually considered part of data science, some understanding of how to put data to use through statistical analysis and data modelling is a huge advantage. Knowing machine learning can therefore be the cherry on the cake.
Operating systems - extensive knowledge of operating systems such as Linux, UNIX, and Solaris can be very helpful, since most of the tools run on these systems.