Data Science Solutions
Last updated on 2017-02-07
About the Book
The fields of data science, big data, machine learning, and artificial intelligence are exciting and complex at the same time. The demand-supply equation favours the job seeker. The market size is in billions of dollars. Venture funding, merger, and acquisition activity is very high in this field. Enterprises are investing significantly in building data science capabilities.
Bundled along with the huge opportunity comes a fair degree of hype and complexity. Data science is a multi-disciplinary field involving statistics, mathematics, behavioural psychology, and Cloud computing technologies, among many other specialised areas of study. Data science is also rapidly growing, with new tools, technologies, algorithms, datasets, and use cases appearing all the time. For a beginner in this field, the learning curve can be fairly daunting. This is where this book helps.
The Data Science Solutions book provides a repeatable, robust, and reliable framework for applying the right-fit workflows, strategies, tools, APIs, and domain knowledge to your data science projects.
This book takes a solutions-focused approach to data science. Each chapter meets an end-to-end objective of solving a data science workflow or technology requirement. At the end of each chapter you either complete a data science tools pipeline or write a fully functional coding project meeting your data science workflow requirements.
Seven stages of data science solutions workflow
Every chapter in this book works through one or more of these seven stages of the data science solutions workflow.
Question. Problem. Solution.
Before starting a data science project we must ask relevant questions specific to our project domain and datasets. We may answer or solve these during the course of our project. Think of these questions-solutions as the key requirements for our data science project. Here are some `templates` that can be used to frame questions for our data science projects.
- Can we classify an entity based on given features if our data science model is trained on a certain number of samples with similar features related to specific classes?
- Do the samples, in a given dataset, cluster in specific classes based on similar or correlated features?
- Can our machine learning model recognise and classify new inputs based on prior training on a sample of similar inputs?
- Can we analyse the sentiment of a given sample?
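As an illustration, the first question template maps directly to a supervised classification task. This minimal sketch (not taken from the book's projects) assumes scikit-learn is installed and uses its bundled Iris dataset purely as a stand-in for a project dataset:

```python
# Can we classify an entity based on given features after training
# on labelled samples? A minimal supervised classification sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# Train on the sample features, then score on unseen samples.
model = DecisionTreeClassifier().fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Framing the question this way up front tells us what the dataset must contain: labelled samples with features comparable to the entities we want to classify.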
Acquire. Search. Create. Catalog.
This stage involves data acquisition strategies including searching for datasets on popular data sources or internally within your organisation. We may also create a dataset based on external or internal data sources.
The acquire stage may feed back into the question stage, refining our problem and solution definitions based on the constraints and characteristics of the acquired datasets.
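A minimal sketch of the acquire stage, assuming pandas is available. The inline CSV text here is an invented stand-in for a file downloaded from a public data source or exported from an internal system:

```python
import io
import pandas as pd

# Stand-in for an acquired CSV file; in a real project this would be
# read from a file path or a URL using the same pd.read_csv call.
raw_csv = io.StringIO(
    "tool,category,stars\n"
    "pandas,wrangling,30000\n"
    "d3,visualisation,90000\n")

df = pd.read_csv(raw_csv)
print(df.shape)  # (samples, features)
```

Cataloguing what we acquired, such as how many samples and features each dataset offers, makes it easier to decide whether the dataset can answer the questions from the previous stage.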
Wrangle. Prepare. Cleanse.
The data wrangle phase prepares and cleanses our datasets for our project goals. This workflow stage starts by importing a dataset, exploring the dataset for its features and available samples, preparing the dataset using appropriate data types and data structures, and optionally cleansing the dataset to create model training and solution testing samples.
The wrangle stage may circle back to the acquire stage to identify complementary datasets to combine and complete the existing dataset.
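The import-explore-prepare-cleanse sequence described above can be sketched in a few lines of pandas. The dataset here is invented for illustration:

```python
import pandas as pd

# Import: a small raw dataset with a missing value to cleanse.
df = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "stars": ["120", "340", None],
    "active": ["yes", "no", "yes"]})

df.info()                                 # explore features and samples
df["stars"] = pd.to_numeric(df["stars"])  # prepare: numeric data type
df["active"] = df["active"] == "yes"      # prepare: boolean data type
df = df.dropna(subset=["stars"])          # cleanse: drop incomplete samples
print(df.dtypes)
```

Each step mirrors a stage decision: which columns need type conversion, and whether incomplete samples should be dropped or filled.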
Analyse. Patterns. Explore.
The analyse phase explores the given datasets to determine patterns, correlations, classifications, and the nature of the dataset. This helps determine the choice of model algorithms and strategies that may work best on the dataset.
The analyse stage may also visualise the dataset to determine such patterns.
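A minimal sketch of pattern analysis with pandas, using synthetic data where one feature is constructed to correlate strongly with another:

```python
import numpy as np
import pandas as pd

# Synthetic dataset: y is built from x plus a little noise,
# so the two features should be strongly correlated.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x,
                   "y": 2 * x + rng.normal(scale=0.1, size=100)})

print(df.describe())  # distribution of each feature
print(df.corr())      # pairwise correlations between features
```

A strong correlation like this one would suggest, for example, that a simple linear model is a reasonable first choice in the model stage.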
Model. Predict. Solve.
The model stage uses prediction and solution algorithms to train on a given dataset and applies this training to solve a given problem.
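The train-then-predict pattern can be sketched with scikit-learn's linear regression on a tiny invented dataset that roughly follows y = 2x:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: features X and targets y, roughly y = 2x.
X = np.array([[1], [2], [3], [4]])
y = np.array([2.0, 4.1, 5.9, 8.0])

# Train on the given dataset, then apply the training to a new input.
model = LinearRegression().fit(X, y)
pred = model.predict(np.array([[5]]))
print(pred)
```

The same fit-predict interface applies across scikit-learn's estimators, so swapping algorithms during this stage is inexpensive.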
Visualise. Report. Present.
The visualisation stage can help the data wrangling, analysis, and modelling stages. Data can be visualised using charts and plots suiting the characteristics of the dataset and the desired results.
The visualisation stage may also provide the inputs for the supply stage.
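A minimal sketch of the visualisation stage using matplotlib and pandas. The dataset and file name are invented for illustration, and the off-screen backend stands in for whatever rendering target a project uses:

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render off-screen, e.g. for reports
import matplotlib.pyplot as plt
import pandas as pd

# A bar chart suits this small categorical dataset.
df = pd.DataFrame({"stage": ["acquire", "wrangle", "analyse", "model"],
                   "hours": [3, 5, 4, 6]})
ax = df.plot.bar(x="stage", y="hours", legend=False)
ax.set_ylabel("effort (hours)")
plt.savefig("workflow_effort.png")

saved = Path("workflow_effort.png").exists()
```

Matching the plot type to the dataset, bars for categories, scatter for correlations, lines for time series, is the core decision at this stage.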
Supply. Products. Services.
Once we are ready to monetise our data science solution or derive further return on investment from our projects, we need to think about distribution and the data supply chain. This stage circles back to the acquisition stage: in fact, when we acquire data, we are drawing on someone else's data supply chain.
Learning path for data science
In this book we accomplish several learning goals covering the multi-disciplinary field of data science.
Open Source Technologies. Open source technologies offer solutions across the data science stack. These include technologies like D3 for visualising data, Hadoop and Spark for distributed data processing, and many others.
Enterprise Tools. Commercial products offering enterprise-scale, power-user solutions across the data science stack. These include products like Tableau for exploring data, Trifacta for wrangling data, and Neo4j for graph databases, among others. It is important to learn about these products, while also knowing the editions and alternatives available in the open source stack.
Data Science Algorithms. Algorithms to prepare and analyse datasets for data science projects. APIs offering algorithms for computer vision, speech recognition, and natural language processing are available from leading Cloud providers like Google as paid services. Using open source libraries like TensorFlow, Theano, and Python's scikit-learn, you can go further and develop your own data science models.
Cloud Platform Development. Developing full-stack frontend apps and backend APIs to deliver data science solutions and data products.
Data Science Thinking. These learning goals are inter-dependent. An open source technology may be integrated within an enterprise product. A Cloud platform may deliver data science algorithms as APIs. The Cloud platform itself may be built using several open source technologies.
We will evolve Data Science Thinking over the course of this book using strategies and workflows that connect the learning goals into a unified, high-performance project pipeline.
Data Science Solutions
- Seven stages of data science solutions workflow
- Learning path for data science
- Data science strategies
- Technology stack for data science
- Starting a data science project
- Anaconda Python environment manager
- Jupyter Python Notebook
- Importing dependencies
- Acquire and wrangle data
- Analyse and visualize
- Model and predict
Data Acquisition Workflow
- Benefits of data acquisition workflow
- Data Science Thinking - Data Framing
- Public datasets for data science
- Open Data
- Prepared data sources
- APIs for data acquisition
- Data acquisition workflow
Python Data Scraping
- Import dependencies
- Choosing the data source
- Deriving data structure
- Creating a new categorical feature
- Extracting feature based on multiple rules
- Extracting new numerical feature
- Creating new text feature
- Scraping HTML
- Creating new data structure
- Analyzing samples
- Create CSV data files
- Second pass wrangling
OpenRefine Data Wrangling
- Text filter samples
- Custom facet on text length
- Create new feature based on existing
- OpenRefine undo/redo history
- Clustering features
- Filter on expression
- Facet to extraction
- Export CSV file
Neo4j Graph Database
- Benefits of graph databases
- Install Neo4j community edition
- Atom editor support for Cypher
- Cypher Graph Query Language
- Cypher naming conventions
- Neo4j browser commands
- Creating Data Science Graph
- Preparing data for import
- Constraints and indexes
- Import CSV samples as nodes
- Create relationships
- Profiling graph queries
- Create a new node
- Transform property into a relationship
- Relating Awesome List
- Category relationships
- Graph design patterns
Data Science Graph
- Graph Refactor
- Questions by graph exploration
- Refactoring the entire database
- Question and acquire workflows
- Analyze and visualize workflows
- Wrangling workflow
- Model and supply workflows
Data Science Competitions
- Workflow stages
- Question and problem definition
- Workflow goals
- Import dependencies for competition project
- Acquire data
- Analyze by describing data
- Assumptions based on data analysis
- Analyze by pivoting features
- Analyze by visualizing data
- Wrangle data
- Model, predict and solve
Amazon Cloud DynamoDB
- Setup Python for Amazon Cloud
- Setup DynamoDB local database
- Create table using Python SDK
- Boto3 CRUD and low-level operations
- Primary, partition and sort keys
- Provisioned throughput capacity
- Deploy DynamoDB to the Cloud
- Importing data into DynamoDB
- Where do we go from here
Google Cloud Firebase
- Creating Firebase account
- Import data into Firebase
Cloud Data API
- Development strategies for Cloud data API
- About API design and terminology
- Atom Python editor setup
- Google App Engine Launcher
- Google App Engine project
- App configuration using app.yaml
- Data Model definition in models.py
- Implement API in cloud_data.py
- Development staging with App Engine Launcher
- Deploy API on Google App Engine
- Monitor Cloud app