As the leading cloud provider, Amazon Web Services offers numerous tools for a variety of applications. The sheer number of offerings can be overwhelming, and it is not always clear which tools are worth using. The following outlines a number of tools that may be relevant to data scientists and explains how they can be useful. This is by no means an exhaustive list, but it should provide some insight into the essential tools available.
Data Storage
There are a number of scalable storage options for data science needs, including data lake and data warehousing services. Data scientists typically require storage options that go beyond the capabilities of Amazon Simple Storage Service (S3). For data that is not accessed frequently, cold storage, such as that provided by Glacier, is a more cost-efficient option. It is also important to back up the data of your EC2 instances, which you can do using Amazon EBS snapshots, though this differs from the archiving capabilities of Glacier.
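To make the cold-storage option concrete, here is a minimal sketch of an S3 lifecycle rule that moves objects to Glacier after a set number of days. The prefix and the 90-day threshold are assumptions chosen for illustration, and the client passed to `apply_lifecycle` is assumed to be a boto3 S3 client.

```python
def glacier_lifecycle_rule(prefix, days_until_archive=90):
    """Build one S3 lifecycle rule that moves objects under `prefix` to Glacier."""
    # Derive a readable rule ID from the prefix, e.g. "raw/2019/" -> "archive-raw-2019".
    label = prefix.strip("/").replace("/", "-") or "all"
    return {
        "ID": f"archive-{label}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": days_until_archive, "StorageClass": "GLACIER"}],
    }


def apply_lifecycle(s3_client, bucket, rules):
    """Attach the rules to `bucket`; `s3_client` is assumed to be boto3.client("s3")."""
    return s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )


# Hypothetical prefix; nothing is sent to AWS until apply_lifecycle is called.
rule = glacier_lifecycle_rule("raw/2019/", days_until_archive=90)
print(rule["ID"], rule["Transitions"])
```

Building the rule as a plain dictionary keeps it easy to review before it is applied to a real bucket.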
For data warehousing, you can use Amazon Redshift, which can run complex queries against structured and semi-structured data. To help them manage and search for data, analysts and data scientists can use AWS Glue, which automatically creates a unified catalog of all data in the data lake, with metadata attached to make it discoverable.
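As a rough illustration of how the Glue catalog gets populated, the sketch below builds the request for a crawler that scans S3 paths and writes table metadata into a catalog database. The crawler name, role ARN, database name and bucket path are all hypothetical, and the client is assumed to be a boto3 Glue client.

```python
def crawler_request(name, role_arn, database, s3_paths):
    """Build the keyword arguments for Glue's create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        # One S3 target per path the crawler should scan.
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }


def create_and_start_crawler(glue_client, request):
    """Create the crawler and run it once; `glue_client` is assumed to be boto3.client("glue")."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical names and ARN, for illustration only.
request = crawler_request(
    name="datalake-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="datalake_catalog",
    s3_paths=["s3://example-datalake/raw/"],
)
print(request["Targets"])
```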
Machine Learning with Amazon SageMaker
Amazon SageMaker is the core AWS machine learning platform for developers and data scientists, and it runs on Elastic Compute Cloud (EC2) instances. It is a fully managed service that allows you to organize your data, build a machine learning model and scale your operations. The model you train on your data can then be used to generate predictions and inform your actions. Applications of machine learning range from computer vision, speech recognition and translation to analytics, forecasting and the provision of recommendations.
To prepare your model, you can choose an algorithm from the AWS Marketplace, train it and tweak it for optimization. The most popular choices are machine learning frameworks such as TensorFlow, PyTorch and Keras. SageMaker can configure and optimize these frameworks automatically, or you can modify them yourself. You can also introduce your own algorithm by building it in a Docker container.
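For the bring-your-own-container route, a training job is essentially a description of where the image, the data and the output live. The following is a minimal sketch of such a request, assuming a boto3 SageMaker client; the image URI, role ARN, bucket paths, instance type and hyperparameters are hypothetical placeholders.

```python
def training_job_request(job_name, image_uri, role_arn, train_s3_uri,
                         output_s3_uri, hyperparameters=None):
    """Build the keyword arguments for SageMaker's create_training_job call."""
    return {
        "TrainingJobName": job_name,
        # The Docker image that contains your algorithm.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Instance settings are assumptions; size them for your workload.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 30,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # The API expects hyperparameter values as strings.
        "HyperParameters": {k: str(v) for k, v in (hyperparameters or {}).items()},
    }


def start_training(sagemaker_client, request):
    """Submit the job; `sagemaker_client` is assumed to be boto3.client("sagemaker")."""
    return sagemaker_client.create_training_job(**request)


# Hypothetical values, for illustration only.
job = training_job_request(
    job_name="churn-model-001",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role_arn="arn:aws:iam::123456789012:role/SageMakerRole",
    train_s3_uri="s3://example-ml/train/",
    output_s3_uri="s3://example-ml/output/",
    hyperparameters={"epochs": 10, "learning_rate": 0.01},
)
print(job["HyperParameters"])
```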
To visualize your data and build your machine learning model, you can use a Jupyter notebook. While it is possible to build a notebook from scratch, it is simpler to use a pre-built notebook, which you can use as-is or tweak to suit your needs.
Automated Machine Learning with H2O Driverless AI
H2O Driverless AI, from H2O.ai, is an artificial intelligence platform that allows you to take advantage of machine learning even if your experience is limited. You can use it alongside Amazon’s machine learning capabilities for fast and accurate data analysis, enabling you to make better-informed business decisions and improving outcomes such as your sales conversion rate.
AWS provides a number of pre-trained AI services that don’t require specialized skills. In addition, you can take advantage of Amazon Machine Images (AMIs) that are specially designed for deep learning. EC2 instances launched from these images come with deep learning frameworks pre-installed, which saves time and effort. This is also a good place to start if you want to acquire new skills.
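One way to locate such a deep learning AMI programmatically is to list Amazon-owned images and pick the newest match. This is only a sketch: the name pattern "Deep Learning AMI*" is an assumption about how the images are named, and the client is assumed to be a boto3 EC2 client.

```python
def newest_image(images):
    """Return the most recently created image description, or None if the list is empty."""
    # CreationDate is an ISO 8601 string, so string comparison orders by time.
    return max(images, key=lambda image: image["CreationDate"], default=None)


def find_deep_learning_ami(ec2_client, name_pattern="Deep Learning AMI*"):
    """Look up Amazon-owned images matching `name_pattern` (an assumed naming pattern)."""
    response = ec2_client.describe_images(
        Owners=["amazon"],
        Filters=[
            {"Name": "name", "Values": [name_pattern]},
            {"Name": "state", "Values": ["available"]},
        ],
    )
    return newest_image(response["Images"])


# Made-up image records, standing in for a describe_images response.
sample = [
    {"ImageId": "ami-aaa", "CreationDate": "2019-03-01T10:00:00.000Z"},
    {"ImageId": "ami-bbb", "CreationDate": "2019-06-15T08:30:00.000Z"},
]
print(newest_image(sample)["ImageId"])
```

The returned image ID can then be passed to an EC2 launch call with the instance type of your choice.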
Analytics
To make use of the raw data stored in a data lake, you need to be able to analyze it. AWS provides a number of analytics services, including:
- Amazon Athena: facilitates interactive analysis of data in S3. It is fast, serverless and works using standard SQL queries.
- Amazon Elastic MapReduce (EMR): processes big data using Spark and Hadoop. It is a managed service and provides managed notebooks for data science and data engineering applications.
- Amazon Kinesis: allows you to easily aggregate and process streaming data in real time, so you can perform analytics as the data arrives in your data lake. Uses include website clickstreams, application logs, and telemetry data from IoT devices.
- Amazon Elasticsearch Service: allows you to manage data for operational analytics. The APIs are easy to use and the service is fast, scalable and highly available.
- Amazon QuickSight: provides visualizations for your analytics, with dashboards that you can access remotely from a browser or mobile device.
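To give a feel for how the Athena option in the list above works in practice, here is a minimal sketch that submits a SQL query and polls until it finishes. The table, database and results location are hypothetical, and the client is assumed to be a boto3 Athena client.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Submit `sql` to Athena and wait until it reaches a final state.

    Returns (query_id, state), where state is SUCCEEDED, FAILED or CANCELLED.
    `athena_client` is assumed to be boto3.client("athena").
    """
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Hypothetical table name; standard SQL, as Athena expects.
EXAMPLE_SQL = "SELECT page, COUNT(*) AS views FROM clickstream GROUP BY page LIMIT 10"
```

A call might look like `run_athena_query(client, EXAMPLE_SQL, "datalake_catalog", "s3://example-results/athena/")`, after which the rows can be fetched with the client's `get_query_results` method.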
Take Advantage of Third-Party Tools
There is a fairly long list of third-party tools that integrate with AWS. However, I’ve selected a few that I think are particularly relevant for data scientists or those interested in AI and machine learning projects:
- DataScience.com: a centralized platform that allows data science teams to collaborate easily. Provides data visualization, analysis sharing and performance tracking capabilities.
- Splice Machine: allows you to manage operational processes for predictive applications. Supports machine learning-based analytical processing, with the ability to improve over time.
- Trifacta: works with Amazon Redshift and S3, facilitating the preparation of a variety of data types, which could otherwise be time-consuming. It is designed for small to medium-sized datasets that don’t require additional computing power.
- KNIME: an analytics platform to help manage predictive analytics projects. It has over 1000 modules, as well as community support and integrated tools.
Conclusion
With all the tools available on AWS, it can be tough to choose which service you want to use. The tools mentioned above leverage capabilities such as AI to provide data science functions, and can help you significantly reduce the time it takes to manage tasks such as predictive analytics. It’s always a good idea to do some further research before committing to any specific tool, but this list has hopefully given you an idea of where to look.
Note: This is a guest post, and the opinions in this article are those of the guest writer. If you have any issues with any of the articles posted at www.marktechpost.com, please contact asif@marktechpost.com
Gilad David Maayan is a technology writer who has worked with over 150 technology companies including SAP, Samsung NEXT, NetApp and Imperva, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership.