© 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Building Your Data Lake on AWS
Luke Anderson
Business Development, AWS
What to expect from the session
1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future
Rethink how to become a data-driven business
• Business outcomes
• Experimentation
• Agile and timely
Traditionally, Analytics looked like this (Duplication & Sprawl)
[Diagram labels: raw data; ETL (twice); storage arrays; databases; data warehouse (structured data, SQL); advanced analytics (Hadoop, Spark, NoSQL)]
Defining the AWS data lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
AWS Data Lake Components
Any analytic workload, any scale, at the lowest possible cost
Insights: QuickSight | SageMaker | Comprehend
Analytics: Redshift + Spectrum (DW) | EMR (Big data processing) | Athena (Interactive) | Elasticsearch Service, Kinesis Data Analytics (Real-time)
Data Lake: Glue (ETL & Data Catalog) | S3/Glacier (Storage)
Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
Reasons to choose Amazon S3 for data lake
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
Reducing Data Lake Costs
Optimize costs with data tiering
Hot → Cold: HDFS | Amazon S3 Standard | Amazon S3 Infrequent Access | Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and cold data in Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
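The tiering above can be expressed as an S3 lifecycle rule. The following is a minimal sketch, not a recommended policy: the `logs/` prefix, the 30 and 90 day thresholds, and the bucket name in the comment are illustrative assumptions that do not come from the slides. The helper `build_tiering_rule` is hypothetical and only builds the configuration; applying it needs credentials and a real bucket.

```python
# Sketch: an S3 lifecycle rule that tiers objects from Standard to
# Infrequent Access and then to Glacier. Prefix and day counts are
# illustrative assumptions, not values from the slides.

def build_tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Return a lifecycle configuration dict in the shape boto3 expects."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = build_tiering_rule("logs/")

# To apply it (requires credentials; the bucket name is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

S3 Analytics storage class analysis can then inform whether the chosen thresholds match actual access patterns.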
Process data in place…
Amazon Athena | Amazon Redshift Spectrum | Amazon EMR | AWS Glue
Amazon S3
Amazon EMR: Decouple compute & storage
Highly distributed processing frameworks such as Hadoop/Spark
Compress datasets
Columnar file formats
Aggregate small files
S3distcp “group-by” clause
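As a rough illustration of aggregating small files, the sketch below builds an S3DistCp command line using the `--groupBy`, `--targetSize`, and `--outputCodec` options, and previews locally which keys would land in the same output file. The bucket paths and the date-based pattern are hypothetical, and `preview_groups` only approximates the tool's grouping by concatenating the regex capture groups.

```python
import re
import shlex
from collections import defaultdict


def s3distcp_args(src, dest, group_by, target_size_mib=128, codec="gzip"):
    """Build the argument list for an s3-dist-cp run on an EMR cluster."""
    return [
        "s3-dist-cp",
        "--src", src,
        "--dest", dest,
        "--groupBy", group_by,                # regex with at least one capture group
        "--targetSize", str(target_size_mib), # approximate output size in MiB
        "--outputCodec", codec,               # compress the aggregated files
    ]


def preview_groups(keys, group_by):
    """Approximate the grouping: keys with equal capture groups are combined."""
    pattern = re.compile(group_by)
    groups = defaultdict(list)
    for key in keys:
        match = pattern.fullmatch(key)
        if match:
            groups["".join(match.groups())].append(key)
    return dict(groups)


# Hypothetical paths and pattern: group log files by calendar day.
pattern = r".*(2018-\d\d-\d\d).*"
args = s3distcp_args("s3://example-bucket/raw/", "s3://example-bucket/daily/", pattern)
print(" ".join(shlex.quote(a) for a in args))

groups = preview_groups(
    ["raw/2018-01-01-a.log", "raw/2018-01-01-b.log", "raw/2018-01-02-a.log"],
    pattern,
)
```

Fewer, larger objects generally reduce per-request overhead for the engines that read them.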
Amazon Redshift Spectrum: Exabyte-scale query-in-place
Structured data w/ joins
Multiple on-demand clusters: scale concurrency
Columnar file formats
Data partitioning
Better query performance with predicate pushdown
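To see why data partitioning helps, here is a small sketch of partition pruning over object keys. It assumes a Hive-style `name=value` key layout, which the slides do not specify, and the keys themselves are made up. A query engine that knows the partition columns can skip whole prefixes before reading any data.

```python
def partition_of(key):
    """Parse Hive-style 'name=value' path segments from an object key."""
    parts = {}
    for segment in key.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested filter."""
    return [
        key for key in keys
        if all(partition_of(key).get(name) == value for name, value in wanted.items())
    ]


# Hypothetical layout: one prefix per year and month.
keys = [
    "sales/year=2018/month=01/part-0.parquet",
    "sales/year=2018/month=02/part-0.parquet",
    "sales/year=2017/month=01/part-0.parquet",
]
to_scan = prune(keys, year="2018", month="01")
```

Only one of the three objects is left to scan, and columnar formats then narrow the read further to the needed columns.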
Amazon Athena: Query without ETL
Serverless service
Schema on read
Compress datasets
Columnar file formats
Optimize file sizes
Optimize querying (Presto backend)
Query Data in Glacier (Coming)
Today: All of these tools… retrieve a lot of data they don’t need and do the heavy lifting
Today: You need to… restore the entire object from Amazon Glacier to Amazon S3 and then use it.
[Diagram: Amazon Glacier, Amazon S3]
Select
Amazon S3 Select and Amazon Glacier Select
Select subset of data from an object based on a SQL expression
Motivation Behind S3 Select
GET all the data from S3 objects, and my application will filter the data that I need
Redshift Spectrum Example:
• Beta customer: Run 50,000 queries
• Amount of data fetched from S3: 6 PBs
• Amount of data used in Redshift: 650 TB
Data needed from S3: 10%
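A quick check of the figures quoted above, assuming decimal units (1 PB = 1,000 TB), which the slide does not state:

```python
# Figures quoted on the slide; the unit conversion is an assumption.
TB_PER_PB = 1000

fetched_tb = 6 * TB_PER_PB   # data fetched from S3
used_tb = 650                # data actually used in Redshift

fraction_used = used_tb / fetched_tb
print(f"{fraction_used:.1%} of the fetched data was needed")
```

That is roughly one tenth, so about nine tenths of the bytes transferred were discarded by the query engine after the fact.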
Amazon S3 Select
SELECT a filtered set of data from within an object using standard SQL Statements
• First content-aware API within Amazon S3
• Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system
• SQL Statement operates on a per-object basis—not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI and Presto Connector—others to follow
• Who will use it?
• Amazon Redshift Spectrum, Amazon Athena, Presto and other custom Query engines
• Everyone doing log mining
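A minimal sketch of calling S3 Select through the Python SDK (boto3), assuming a headerless CSV object. The helper name `select_rows` and the bucket, key, and query in the usage comment are hypothetical. The call returns an event stream, so the record chunks are joined before decoding because a row may be split across events.

```python
def select_rows(s3_client, bucket, key, expression):
    """Run one S3 Select query against one object and return the result rows."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    for event in response["Payload"]:
        # Other event types (Stats, Progress, End) carry no row data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8").splitlines()


# Usage (requires credentials; names are hypothetical):
# import boto3
# rows = select_rows(boto3.client("s3"), "example-bucket", "logs/part-0.csv",
#                    "SELECT s._1 FROM S3Object s WHERE s._4 = 'x'")
```

Because the statement runs per object, scanning a prefix means issuing one such request per key, much like a series of GET requests.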
Amazon S3 Select
Input
Format: delimited text (CSV, TSV), JSON …
Compression: GZIP …
Output
Format: delimited text (CSV, TSV), JSON …
Clauses   Data types               Operators          Functions
Select    String                   Conditional        String
From      Integer, Float, Decimal  Math               Cast
Where     Timestamp                Logical            Math
          Boolean                  String (Like, ||)  Aggregate
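The input and output formats listed here map onto the `InputSerialization` and `OutputSerialization` parameters of the boto3 `select_object_content` call. The helpers below are a hypothetical convenience layer, sketched for the CSV, TSV, and line-delimited JSON cases only.

```python
def input_serialization(fmt, compression="NONE", delimiter=","):
    """Describe how the stored object is encoded (format plus compression)."""
    if fmt == "CSV":
        body = {"CSV": {"FieldDelimiter": delimiter}}
    elif fmt == "JSON":
        body = {"JSON": {"Type": "LINES"}}  # one JSON document per line
    else:
        raise ValueError(f"unsupported input format: {fmt}")
    body["CompressionType"] = compression
    return body


def output_serialization(fmt):
    """Describe how the selected records should be returned."""
    if fmt == "CSV":
        return {"CSV": {}}
    if fmt == "JSON":
        return {"JSON": {"RecordDelimiter": "\n"}}
    raise ValueError(f"unsupported output format: {fmt}")


# A gzip-compressed TSV object, with results returned as JSON lines.
tsv_in = input_serialization("CSV", compression="GZIP", delimiter="\t")
json_out = output_serialization("JSON")
```

TSV is handled as CSV with a tab field delimiter, and input and output formats can differ.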
Amazon S3 Select: Simple pattern matches
…get-object …object… | awk -F',' '{ if ($4 == "x") print $1 }'
...select-object …object… "SELECT o._1 FROM S3Object o WHERE o._4 = 'x'…"
Amazon S3 Select: Serverless applications
[Diagram: Amazon S3, Lambda trigger, AWS Lambda, S3 Select, Amazon SNS]
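One way to read this diagram is: an object upload triggers a Lambda function, the function uses S3 Select to pull only the rows of interest, and the result is published to an SNS topic. That flow is an assumption, since the slide only names the components. The sketch below follows it, with a hypothetical topic ARN and function name, and takes the clients as parameters so it can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-topic"  # hypothetical


def process_event(event, s3_client, sns_client, expression, topic_arn=TOPIC_ARN):
    """For each uploaded object, select matching rows and publish them to SNS."""
    published = 0
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = s3_client.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression=expression,
            InputSerialization={"CSV": {}},
            OutputSerialization={"CSV": {}},
        )
        data = b"".join(
            e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
        )
        if data:
            # Large results would need splitting; SNS limits message size.
            sns_client.publish(TopicArn=topic_arn, Message=data.decode("utf-8"))
            published += 1
    return published


# In Lambda, a handler could wrap it like this (requires boto3 and permissions):
# def handler(event, context):
#     import boto3
#     return process_event(event, boto3.client("s3"), boto3.client("sns"),
#                          "SELECT s._1 FROM S3Object s WHERE s._4 = 'x'")
```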
Amazon S3 Select: Serverless MapReduce
Before: 200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….
After: 95 seconds and 2.8 cents
# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key,
        ExpressionType='SQL',
        Expression="SELECT SUBSTRING(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}})
    # Results arrive as an event stream under 'Payload'
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload']
                        if 'Records' in event).decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            ….
2X Faster at 1/5 of the cost
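Working the quoted timings and costs through gives a sense of how the headline was rounded. This is plain arithmetic on the slide's own numbers:

```python
before_seconds, before_cents = 200, 11.2
after_seconds, after_cents = 95, 2.8

speedup = before_seconds / after_seconds   # how many times faster
cost_ratio = after_cents / before_cents    # fraction of the original cost

print(f"{speedup:.1f}x faster at {cost_ratio:.0%} of the cost")
```

The quoted figures come out at about 2.1 times faster and about a quarter of the original cost.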
Demo – S3 Select Timing
Amazon S3 Select with Presto
Works with your existing Hive Metastore
Automatically converts predicates into S3 Select requests
Amazon S3
S3 Select
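To illustrate the idea of converting predicates into S3 Select requests, the sketch below turns a column projection and a few simple comparisons into an S3 Select SQL string. It is not how the Presto connector is implemented. The function names, the positional-column convention, and the supported operators are all assumptions made for the example.

```python
ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">="}


def _comparison(position, op, value):
    """Render one predicate; CSV fields are text, so numbers need a CAST."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    column = f"s._{position}"
    if isinstance(value, bool):
        raise TypeError("boolean predicates are not handled in this sketch")
    if isinstance(value, int):
        return f"CAST({column} AS INT) {op} {value}"
    if isinstance(value, float):
        return f"CAST({column} AS FLOAT) {op} {value}"
    escaped = str(value).replace("'", "''")  # SQL-style quote escaping
    return f"{column} {op} '{escaped}'"


def to_s3_select(columns, predicates):
    """columns: 1-based positions to project; predicates: (position, op, value)."""
    select_list = ", ".join(f"s._{c}" for c in columns) or "*"
    sql = f"SELECT {select_list} FROM S3Object s"
    clauses = [_comparison(*p) for p in predicates]
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


query = to_s3_select([1], [(4, "=", "x")])
```

Pushing the projection and filter into the request means the engine receives only the matching columns and rows instead of whole objects.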
Amazon S3 Select: Accelerating big data
[Charts: before and after]
5X Faster with 1/40 of the CPU
Using Amazon Glacier Select
How Amazon Glacier Select Works
Delivering Results Faster
Optimizing data lake performance
• Aggregate small files: EMR S3distcp, Amazon Kinesis Firehose
• S3 Select: big data cheaper, faster (up to 400% faster)
• Data formats: columnar formats
• EMRFS consistent view: Amazon S3, Amazon DynamoDB
Amazon Kinesis—Real Time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams (New): Capture, process, and store video streams for analytics
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL
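As a small example of loading a data stream into the lake, the sketch below sends one JSON event to a Kinesis Data Firehose delivery stream with boto3's `put_record`. The helper name, stream name, and event fields are hypothetical. A trailing newline is added because Firehose concatenates records when it writes them to the destination.

```python
import json


def send_event(firehose_client, stream_name, event):
    """Serialize one event as a JSON line and hand it to Firehose."""
    data = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    response = firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": data},
    )
    return response["RecordId"]


# Usage (requires credentials; names are hypothetical):
# import boto3
# send_event(boto3.client("firehose"), "example-delivery-stream",
#            {"user": "u1", "action": "click"})
```

Firehose buffers incoming records before delivery, which also helps avoid the small-file problem mentioned earlier.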
Data preparation accounts for ~80% of the work
[Chart categories: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other]
AWS Glue: Serverless Data Catalog & ETL service
Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
• Automatically discovers data and stores schema
• Data searchable, and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
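The discover-then-transform sequence can be driven from the Python SDK. The sketch below starts a Glue crawler, polls until it returns to the `READY` state, and then starts an ETL job. The crawler and job names are hypothetical and must already exist. The polling loop is a simplification that assumes the crawler has left `READY` by the time of the first check.

```python
import time


def crawl_then_run(glue_client, crawler_name, job_name,
                   poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Refresh the Data Catalog with a crawler, then start an ETL job run."""
    glue_client.start_crawler(Name=crawler_name)
    for _ in range(max_polls):
        state = glue_client.get_crawler(Name=crawler_name)["Crawler"]["State"]
        if state == "READY":  # crawler finished and is idle again
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"crawler {crawler_name} did not finish in time")
    return glue_client.start_job_run(JobName=job_name)["JobRunId"]


# Usage (requires credentials; names are hypothetical):
# import boto3
# run_id = crawl_then_run(boto3.client("glue"), "example-crawler", "example-etl-job")
```

In practice Glue triggers or schedules can replace this kind of client-side orchestration.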
Amazon SageMaker (GA)
The quickest and easiest way to get ML models from idea to production
• End-to-End Machine Learning Platform
• Zero setup
• Flexible Model Training
• Pay by the second
Planning for the Future
[Diagram: data pipeline in four stages: Collect, Store, Analyze, Visualize.
Data types: transactional data, stream data, file data, search data.
Sources: web apps, mobile apps (iOS, Android), IoT applications, logging (Logstash).
Services shown: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon ES, Amazon S3, Amazon Glacier, Apache Kafka, Amazon Kinesis, Amazon Redshift, Impala, Pig, Amazon ML, AWS Lambda, Amazon Elastic MapReduce, Amazon QuickSight.
Store categories: SQL, NoSQL, cache, search, file storage, stream storage.
Analyze categories: batch, interactive, streaming and stream processing, ML, ETL.
Temperature and speed labels: hot, warm, cold, fast, slow.
Consumers: analysis & visualization, notebooks, predictions, apps & APIs, IDE.]
Evolve As Needed!
AWS Training Offer
Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis
Who should attend
• Enterprise solutions architects
• Data scientists
• Big Data solutions architects
• Data analysts
Learning path
• Free AWS digital training: Foundational knowledge
• Certified Cloud Practitioner
• Associate-level Certification
• Free AWS digital training: Big Data Technology Fundamentals
• Big Data on AWS (3-day Classroom Training)
• AWS Certified Big Data - Specialty
Visit www.aws.training to find out more.
We hope you found it interesting! A kind reminder to complete the survey.
Let us know what you thought of today’s event and how we can improve
the event experience for you in the future.
Thank You For Attending
AWS Data Driven Decisions Webinar Series.
aws-apac-marketing@amazon.com
twitter.com/AWSCloud
facebook.com/AmazonWebServices
youtube.com/user/AmazonWebServices
slideshare.net/AmazonWebServices
twitch.tv/aws

Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf

  • 1. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Building Your Data Lake on AWS Luke Anderson Business Development, AWS
  • 2. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to expect from the session 1. Defining the Data Lake 2. Reducing Costs 3. Increasing Performance 4. Planning for the Future
  • 3. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rethink how to become a data-driven business • Business outcomes • Experimentation • Agile and timely
  • 4. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Traditionally, Analytics looked like this (Duplication & Sprawl) Hadoop Spark NoSQL Storage Arrays Databases Data Warehouse Structured Data SQL Raw Data ETL Advanced Analytics ETL
  • 5. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Defining the AWS data lake Data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets Key data lake attributes • Decoupled storage and compute • Rapid ingest and transformation • Secure multi-tenancy • Query in place • Schema on read
  • 6. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS Data Lake Components Any analytic workload, any scale, at the lowest possible cost Insights Analytics Data Lake Data Movement QuickSight SageMaker Glue (ETL & Data Catalog) S3/Glacier (Storage) Redshift +Spectrum EMR Athena Elasticsearch service Kinesis Data Analytics Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams Real-time Comprehend DW Big data processing Interactive
  • 7. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Unmatched durability, availability, and scalability Best security, compliance, and audit capability Object-level control at any scale Business insight into your data Twice as many partner integrations Most ways to bring data in Reasons to choose Amazon S3 for data lake
  • 8. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Reducing Data Lake Costs
  • 9. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Optimize costs with data tiering Hot Cold Amazon S3 standard Amazon S3— infrequent access Amazon Glacier HDFS  Use EMR/Hadoop with local HDFS for hottest data sets  Store cooler data in S3 and cold in Glacier to reduce costs  Use S3 Analytics to optimize tiering strategy S3 Analytics
  • 10. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Process data in place… Amazon Athena Amazon Redshift Spectrum Amazon EMR AWS Glue Amazon S3
  • 11. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon EMR: Decouple compute & storage Highly distributed processing frameworks such as Hadoop/Spark Compress datasets Columnar file formats Aggregate small files S3distcp “group-by” clause
  • 12. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift Spectrum: Exabyte Scale query-in-place Structured data w/ joins Multiple on-demand clusters-scale concurrency Columnar file formats Data partitioning Better query performance with predicate pushdown
  • 13. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Athena: Query without ETL Serverless service Schema on read Compress datasets Columnar file formats Optimize file sizes Optimize querying (Presto backend) Query Data in Glacier (Coming)
  • 14. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Today: All of these tools… retrieve a lot of data they don’t need and do the heavy lifting
  • 15. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Today: You need to…. entire object from Amazon Glacier to Amazon S3 and then use it. Amazon S3 Amazon Glacier
  • 16. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Select Amazon S3 Select and Amazon Glacier Select Select subset of data from an object based on a SQL expression
  • 17. © 2018 Amazon Web Services, Inc. or its Affiliates. All rights reserved. Motivation Behind S3 Select GET all the data from S3 objects, and my application will filter the data that I need Redshift Spectrum Example: • Beta customer: Run 50,000 queries • Amount of data fetched from S3: 6 PBs • Amount of data used in Redshift: 650 TB Data needed from S3: 10%
  • 18. Amazon S3 Select: SELECT a filtered set of data from within an object using standard SQL statements
      • First content-aware API within Amazon S3
      • Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system
      • SQL statement operates on a per-object basis, not across a group of objects
      • Works and scales like GET requests
      • Accessible via SDK (Java, Python), AWS CLI, and Presto connector; others to follow
      • Who will use it?
          • Amazon Redshift Spectrum, Amazon Athena, Presto, and other custom query engines
          • Everyone doing log mining
  • 19. Amazon S3 Select. Input format: delimited text (CSV, TSV), JSON … Compression: GZIP … Output format: delimited text (CSV, TSV), JSON … Clauses: Select, From, Where. Data types: String; Integer, Float, Decimal; Timestamp; Boolean. Operators: Conditional, Math, Logical, String (Like, ||). Functions: String, Cast, Math, Aggregate.
  • 20. Amazon S3 Select: Simple pattern matches. Without S3 Select: …get-object …object… | awk -F '{ if($4=="x") print $1}' ... With S3 Select: ...select-object …object… 'SELECT o._1 WHERE o._4 == "x"…'
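The slide abbreviates both commands. As a rough sketch of the S3 Select side, the function below builds the keyword arguments that boto3's `select_object_content` expects for "return column 1 where column 4 equals a value" on a headerless delimited file. The bucket, key, and delimiter are hypothetical, and the SQL uses a single `=` with a single-quoted literal, which is the form the S3 Select SQL reference documents.

```python
def build_select_request(bucket, key, column, match_column, value, delimiter=","):
    """Keyword arguments for s3_client.select_object_content(**request).

    Columns are 1-based positions, addressed as _1, _2, ... because the
    input is assumed to have no header row.
    """
    for position in (column, match_column):
        if not isinstance(position, int) or position < 1:
            raise ValueError("column positions must be positive integers")
    literal = value.replace("'", "''")  # escape quotes inside the SQL literal
    expression = (
        f"SELECT s._{column} FROM S3Object s "
        f"WHERE s._{match_column} = '{literal}'"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "NONE", "FieldDelimiter": delimiter}
        },
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical bucket and object names.
request = build_select_request("example-bucket", "logs/access.csv", 1, 4, "x")
```

With a real client, `s3_client.select_object_content(**request)` sends the filter to S3 so that only the matching first-column values come back over the network.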
  • 21. Amazon S3 Select: Serverless applications. [Diagram components: Amazon S3, S3 Select, Lambda trigger, AWS Lambda, Amazon SNS]
  • 22. Amazon S3 Select: Serverless MapReduce. 2X faster at 1/5 of the cost.
    Before: 200 seconds and 11.2 cents
        # Download and process all keys
        for key in src_keys:
            response = s3_client.get_object(Bucket=src_bucket, Key=key)
            contents = response['Body'].read()
            for line in contents.split('\n')[:-1]:
                line_count += 1
                try:
                    data = line.split(',')
                    srcIp = data[0][:8]
                    ….
    After: 95 seconds and 2.8 cents
        # Select IP Address and Keys
        for key in src_keys:
            response = s3_client.select_object_content(Bucket=src_bucket, Key=key,
                Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
            contents = response['Body'].read()
            for line in contents:
                line_count += 1
                try:
                    ….
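The "After" snippet on this slide is abbreviated. In boto3, `select_object_content` also needs `ExpressionType` and the input and output serialization settings, and it returns an event stream under `Payload` rather than a readable `Body`. The sketch below shows one way to consume that stream, assuming headerless CSV objects; the function names and the counting logic are illustrative and are not the code behind the timings quoted on the slide.

```python
import csv
import io
from collections import Counter

EXPRESSION = "SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM S3Object AS obj"


def select_rows(s3_client, bucket, key):
    """Run the SELECT against one object and return the parsed CSV rows."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=EXPRESSION,
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    for event in response["Payload"]:
        # Records events carry result bytes; a row may be split across
        # two events, so join everything before parsing.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    text = b"".join(chunks).decode("utf-8")
    return [row for row in csv.reader(io.StringIO(text)) if row]


def count_by_ip_prefix(s3_client, bucket, keys):
    """Count rows per 8-character source IP prefix across all keys."""
    counts = Counter()
    for key in keys:
        for row in select_rows(s3_client, bucket, key):
            counts[row[0]] += 1
    return counts
```

Passing the client in as a parameter keeps the functions easy to exercise with a stub; in a Lambda function it would be `boto3.client("s3")`.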
  • 23. Demo – S3 Select Timing
  • 24. Amazon S3 Select with Presto. Works with your existing Hive Metastore. Automatically converts predicates into S3 Select requests.
  • 25. Amazon S3 Select: Accelerating big data. [Before/after comparison chart] 5X faster with 1/40 of the CPU.
  • 26. Using Amazon Glacier Select
  • 27. How Amazon Glacier Select Works
  • 28. Delivering Results Faster
  • 29. Optimizing data lake performance. Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose. S3 Select: big data cheaper, faster (up to 400% faster). Data formats: columnar formats. EMRFS consistent view: Amazon S3, Amazon DynamoDB.
  • 30. Amazon Kinesis: Real time. Easily collect, process, and analyze video and data streams in real time.
      • Kinesis Video Streams (new): capture, process, and store video streams for analytics
      • Kinesis Data Streams: build custom applications that analyze data streams
      • Kinesis Data Firehose: load data streams into AWS data stores
      • Kinesis Data Analytics: analyze data streams with SQL
  • 31. Data preparation accounts for ~80% of the work. [Chart categories: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other]
  • 32. AWS Glue: Serverless data catalog & ETL service.
      • Data Catalog: discover data and extract schema. Automatically discovers data and stores schema. Data searchable and available for ETL.
      • ETL job authoring: auto-generates customizable ETL code in Python and Spark. Generates customizable code.
      • Serverless: schedules and runs your ETL jobs.
  • 33. Amazon SageMaker (GA): The quickest and easiest way to get ML models from idea to production. End-to-end machine learning platform. Zero setup. Flexible model training. Pay by the second.
  • 34. Planning for the Future
  • 35. [Architecture diagram: transactional, file, search, and stream data flow through the Collect, Store, Analyze, and Visualize stages, with AWS services at each stage grouped by data temperature (hot, warm, cold) and speed (fast, slow).] Evolve As Needed!
  • 36. AWS Training Offer. Make your data-driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
      • Implement core AWS Big Data services according to best practices
      • Design and maintain Big Data
      • Leverage tools to automate data analysis
    Who should attend:
      • Enterprise solutions architects
      • Data scientists
      • Big Data solutions architects
      • Data analysts
    Learning path items:
      • Free AWS digital training: Foundational knowledge
      • Free AWS digital training: Big Data Technology Fundamentals
      • Big Data on AWS – 3-day Classroom Training
      • Certified Cloud Practitioner
      • Associate-level Certification
      • AWS Certified Big Data - Specialty
    Visit www.aws.training to find out more.
  • 37. Thank you for attending the AWS Data Driven Decisions Webinar Series. We hope you found it interesting! A kind reminder to complete the survey. Let us know what you thought of today’s event and how we can improve the event experience for you in the future. Contact: aws-apac-marketing@amazon.com, twitter.com/AWSCloud, facebook.com/AmazonWebServices, youtube.com/user/AmazonWebServices, slideshare.net/AmazonWebServices, twitch.tv/aws