Data Science Solutions

Machine Learning, Python, Neo4j, Kaggle, Cloud Platforms

About the Book

The field of data science, big data, machine learning, and artificial intelligence is exciting and complex at the same time. The demand-supply equation favours the job seeker. The market size is in billions of dollars. Venture funding, merger, and acquisition activity is very high in this field. Enterprises are investing significantly in building data science capabilities.

Bundled along with the huge opportunity comes a fair degree of hype and complexity. Data science is a multi-disciplinary field involving statistics, mathematics, behavioural psychology, and cloud computing technologies, among many other specialised areas of study. Data science is also evolving rapidly, with new tools, technologies, algorithms, datasets, and use cases. For a beginner in this field, the learning curve can be fairly daunting. This is where this book helps.

The Data Science Solutions book provides a repeatable, robust, and reliable framework to apply the right-fit workflows, strategies, tools, APIs, and domain knowledge to your data science projects.

This book takes a solutions-focused approach to data science. Each chapter meets an end-to-end objective of solving a data science workflow or technology requirement. By the end of each chapter you either complete a data science tools pipeline or write a fully functional coding project that meets your data science workflow requirements.

Seven stages of data science solutions workflow

Every chapter in this book will go through one or more of these seven stages of data science solutions workflow.

Question. Problem. Solution.

Before starting a data science project we must ask relevant questions specific to our project domain and datasets. We may answer or solve these during the course of our project. Think of these question-solution pairs as the key requirements for our data science project. Here are some `templates` that can be used to frame questions for our data science projects.

  • Can we classify an entity based on given features if our data science model is trained on a certain number of samples with similar features related to specific classes?
  • Do the samples, in a given dataset, cluster in specific classes based on similar or correlated features?
  • Can our machine learning model recognise and classify new inputs based on prior training on a sample of similar inputs?
  • Can we analyse the sentiment of a given sample?
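
The first template above, for example, corresponds to a supervised classification task. As a minimal sketch, assuming an invented toy fruit dataset (the feature names, values, and labels here are illustrative, not from the book), a nearest-neighbour rule classifies a new entity from labelled training samples:

```python
import math

# Toy training samples: (weight in grams, surface smoothness 0-1) -> class label.
# These samples are invented purely for illustration.
training_samples = [
    ((150, 0.9), "apple"),
    ((170, 0.8), "apple"),
    ((120, 0.3), "orange"),
    ((140, 0.2), "orange"),
]

def classify(features):
    """Assign the class of the nearest training sample (1-NN)."""
    _, label = min(training_samples,
                   key=lambda sample: math.dist(sample[0], features))
    return label

print(classify((160, 0.85)))  # a heavy, smooth fruit -> "apple"
```

Later chapters use libraries like scikit-learn for this; the hand-rolled 1-NN rule here only illustrates the shape of the classification question.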

Acquire. Search. Create. Catalog.

This stage involves data acquisition strategies including searching for datasets on popular data sources or internally within your organisation. We may also create a dataset based on external or internal data sources.

The acquire stage may feed back into the question stage, refining our problem and solution definition based on the constraints and characteristics of the acquired datasets.

Wrangle. Prepare. Cleanse.

The data wrangle stage prepares and cleanses our datasets for our project goals. This workflow stage starts by importing a dataset, exploring it for its features and available samples, preparing the dataset using appropriate data types and data structures, and optionally cleansing the dataset to create model training and solution testing samples.

The wrangle stage may circle back to the acquire stage to identify complementary datasets to combine and complete the existing dataset.
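
As a minimal sketch of this stage in pandas (the inline CSV and column names are invented for illustration), we import a dataset, explore it, prepare an appropriate data type, and cleanse missing samples:

```python
import io

import pandas as pd

# A tiny inline dataset standing in for an acquired CSV file (invented for illustration).
raw_csv = io.StringIO(
    "name,age,city\n"
    "Alice,34,London\n"
    "Bob,,Paris\n"
    "Carol,29,Delhi\n"
)

# Import the dataset.
df = pd.read_csv(raw_csv)

# Explore features and samples.
print(df.shape)   # (3, 3): three samples, three features
print(df.dtypes)

# Prepare: use a nullable integer type for age instead of the default float.
df["age"] = df["age"].astype("Int64")

# Cleanse: drop samples with missing values before model training.
clean = df.dropna()
print(clean.shape)  # (2, 3)
```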

Analyse. Patterns. Explore.

The analyse stage explores the given datasets to determine patterns, correlations, classifications, and the nature of the dataset. This helps determine the choice of model algorithms and strategies that may work best on the dataset.

The analyse stage may also visualise the dataset to reveal such patterns.
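
As a small sketch of this stage (the columns and figures are invented for illustration), pandas can surface such correlations directly:

```python
import pandas as pd

# Invented samples: hours of study versus exam score.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 55, 61, 70, 74],
})

# A correlation matrix reveals linearly related features, which in turn
# suggests which model algorithms may work best.
corr = df.corr()
print(corr.loc["hours_studied", "exam_score"])  # close to 1.0: strongly correlated
```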

Model. Predict. Solve.

The model stage uses prediction and solution algorithms to train on a given dataset and apply that training to solve a given problem.
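
A minimal sketch of this train-then-predict loop (the data points are invented for illustration): fit a least-squares line to training samples, then apply the fitted model to an unseen input.

```python
# Training samples (invented): inputs xs and observed outputs ys, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Train: closed-form least-squares solution for simple linear regression.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Predict: apply the trained model to solve for an unseen input.
def predict(x):
    return slope * x + intercept

print(predict(5.0))  # extrapolates the learned trend to a new input
```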

Visualise. Report. Present.

The visualisation stage can support the data wrangling, analysis, and modeling stages. Data can be visualised using charts and plots suited to the characteristics of the dataset and the desired results.

The visualisation stage may also provide the inputs for the supply stage.
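
As a minimal sketch (the revenue figures are invented, and matplotlib is assumed here as the plotting library; the book also covers tools like D3 and Tableau for this stage):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Invented samples: monthly revenue figures.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 150, 170]

# A bar chart suits a small categorical comparison like this.
fig, ax = plt.subplots()
ax.bar(months, revenue)
ax.set_xlabel("Month")
ax.set_ylabel("Revenue")
ax.set_title("Monthly revenue")
fig.savefig("revenue.png")
```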

Supply. Products. Services.

Once we are ready to monetise our data science solution, or to derive further return on investment from our projects, we need to think about distribution and the data supply chain. This stage circles back to the acquisition stage: when we acquire data, we are in fact consuming someone else's data supply chain.

Learning path for data science

In this book we accomplish several learning goals covering the multi-disciplinary field of data science.

Open Source Technologies. Open source technologies offer solutions across the data science stack. These include technologies like D3 for visualising data, Hadoop, Spark, and many others.

Enterprise Tools. Commercial products offering enterprise-scale, power-user solutions across the data science stack. These include products like Tableau for exploring data, Trifacta for wrangling data, and Neo4j for graph databases, among others. It is important to learn about these products while knowing the editions and alternatives available in the open source stack.

Data Science Algorithms. Algorithms to prepare and analyse datasets for data science projects. APIs offering algorithms for computer vision, speech recognition, and natural language processing are available as paid services from leading cloud providers like Google. Using open source libraries like TensorFlow, Theano, and Python's scikit-learn, you can go further and develop your own data science models.

Cloud Platform Development. Developing full-stack frontend apps and backend APIs to deliver data science solutions and data products.

Data Science Thinking. These learning goals are interdependent. An open source technology may be integrated within an enterprise product. A cloud platform may deliver data science algorithms as APIs. The cloud platform itself may be built using several open source technologies.

We will develop Data Science Thinking over the course of this book using strategies and workflows that connect these learning goals into a unified, high-performance project pipeline.

About the Author

Manav Sehgal

Manav Sehgal is a builder, author, and inventor specializing in product management, software engineering, cloud, and data science with more than 15 years of experience at Amazon (AWS India), Xerox PARC, HCL Technologies, and Daily Mail Group.

During his career he has also built, mentored, and led technology and product management for six startups with successful exits, including Rightster (Video Advertising), Map of Medicine (Healthcare), Cytura (Media Services), Infonetmap (E-commerce), Edynamics (Digital Marketing), and AgRisk (Agriculture Analytics). Manav is an AWS Certified Solutions Architect Associate (2019).

Daily Mail Group (RMS) sponsored him for an Executive MBA module in Leading Innovative Change at UC Berkeley, Haas School of Business (2015). He completed CMMI certification from SEI CMU (2006). He studied Computer Engineering at Delhi College of Engineering while employed at Xerox PARC (1995-99). He also earned a distinction in the Lean Management program conducted by AOTS, JMAM Japan (1994-95).

Manav is the author of five books, covering data science solutions, React rapid development, e-commerce platform development, Java component technology, and JavaScript object-oriented programming. He is a popular GitHub contributor, active on AWS, Google, and Facebook open source projects. Manav has written one of the top five voted (as of July 2019) data science tutorials on Kaggle, the largest online data science community.

Table of Contents

  • Legal
    • Copyright
    • Trademarks
  • Data Science Solutions
    • Seven stages of data science solutions workflow
    • Learning path for data science
    • Data science strategies
    • Technology stack for data science
    • Starting a data science project
    • Anaconda Python environment manager
    • Jupyter Python Notebook
    • Importing dependencies
    • Acquire and wrangle data
    • Analyse and visualize
    • Model and predict
  • Data Acquisition Workflow
    • Benefits of data acquisition workflow
    • Data Science Thinking - Data Framing
    • Public datasets for data science
    • Open Data
    • Prepared data sources
    • APIs for data acquisition
    • Data acquisition workflow
  • Python Data Scraping
    • Import dependencies
    • Choosing the data source
    • Deriving data structure
    • Creating a new categorical feature
    • Extracting feature based on multiple rules
    • Extracting new numerical feature
    • Creating new text feature
    • Scraping HTML
    • Creating new data structure
    • Analyzing samples
    • Create CSV data files
    • Second pass wrangling
  • OpenRefine Data Wrangling
    • Text filter samples
    • Custom facet on text length
    • Create new feature based on existing
    • OpenRefine undo/redo history
    • Clustering features
    • Filter on expression
    • Facet to extraction
    • Export CSV file
  • Neo4j Graph Database
    • Benefits of graph databases
    • Install Neo4j community edition
    • Atom editor support for Cypher
    • Cypher Graph Query Language
    • Cypher naming conventions
    • Neo4j browser commands
    • Creating Data Science Graph
    • Preparing data for import
    • Constraints and indexes
    • Import CSV samples as nodes
    • Create relationships
    • Profiling graph queries
    • Create a new node
    • Transform property into a relationship
    • Relating Awesome List
    • Category relationships
    • Graph design patterns
  • Data Science Graph
    • Graph Refactor
    • Questions by graph exploration
    • Refactoring the entire database
    • Question and acquire workflows
    • Analyze and visualize workflows
    • Wrangling workflow
    • Model and supply workflows
  • Data Science Competitions
    • Workflow stages
    • Question and problem definition
    • Workflow goals
    • Import dependencies for competition project
    • Acquire data
    • Analyze by describing data
    • Assumptions based on data analysis
    • Analyze by pivoting features
    • Analyze by visualizing data
    • Wrangle data
    • Model, predict and solve
  • Amazon Cloud DynamoDB
    • Setup Python for Amazon Cloud
    • Setup DynamoDB local database
    • Create table using Python SDK
    • Boto3 CRUD and low-level operations
    • Primary, partition and sort keys
    • Provisioned throughput capacity
    • Deploy DynamoDB to the Cloud
    • Importing data into DynamoDB
    • Where do we go from here
  • Google Cloud Firebase
    • Creating Firebase account
    • Import data into Firebase
  • Cloud Data API
    • Development strategies for Cloud data API
    • About API design and terminology
    • Atom Python editor setup
    • Google App Engine Launcher
    • Google App Engine project
    • App configuration using app.yaml
    • Data Model definition in
    • Implement API in
    • Development staging with App Engine Launcher
    • Deploy API on Google App Engine
    • Monitor Cloud app
