About the Book
The field of data science, big data, machine learning, and artificial intelligence is exciting and complex at the same time. The demand-supply equation favours the job seeker. The market size runs into billions of dollars. Venture funding, merger, and acquisition activity in this field is high. Enterprises are investing significantly in building data science capabilities.
Bundled along with the huge opportunity comes a fair degree of hype and complexity. Data science is a multi-disciplinary field involving statistics, mathematics, behavioural psychology, and Cloud computing technologies, among many other specialised areas of study. The field is also growing rapidly, with new tools, technologies, algorithms, datasets, and use cases. For a beginner, the learning curve can be fairly daunting. This is where this book helps.
This book provides a repeatable, robust, and reliable framework to apply the right-fit workflows, strategies, tools, APIs, and domain knowledge to your data science projects.
This book takes a solutions-focused approach to data science. Each chapter meets an end-to-end objective of solving for a data science workflow or technology requirement. By the end of each chapter you will have either completed a data science tools pipeline or written a fully functional coding project meeting your data science workflow requirements.
Seven stages of data science solutions workflow
Every chapter in this book goes through one or more of these seven stages of the data science solutions workflow.
Question. Problem. Solution.
Before starting a data science project, we must ask relevant questions specific to our project domain and datasets. We may answer or solve these during the course of the project. Think of these question-solution pairs as the key requirements for our data science project. Here are some `templates` that can be used to frame questions for our data science projects.
- Can we classify an entity based on given features if our data science model is trained on a certain number of samples with similar features related to specific classes?
- Do the samples, in a given dataset, cluster in specific classes based on similar or correlated features?
- Can our machine learning model recognise and classify new inputs based on prior training on a sample of similar inputs?
- Can we analyse the sentiment of a given text sample?
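The first template above can be made concrete with a minimal sketch: train a classifier on labelled samples, then classify a new entity from its features. This example uses Scikit-learn (a library covered later in the book) with its bundled Iris dataset purely for illustration; it is not one of the book's projects.

```python
# Question template: can we classify an entity based on given features
# if our model is trained on labelled samples of similar entities?
# Iris dataset and k-nearest-neighbours are illustrative choices only.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier(n_neighbors=3)
model.fit(iris.data, iris.target)                    # train on labelled samples
prediction = model.predict([[5.1, 3.5, 1.4, 0.2]])   # classify a new entity
print(iris.target_names[prediction[0]])
```

Framing the question this way also surfaces the requirements: labelled samples, comparable features, and a choice of algorithm.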
Acquire. Search. Create. Catalog.
This stage involves data acquisition strategies, including searching for datasets on popular data sources or internally within your organisation. We may also create a dataset based on external or internal data sources.
The acquire stage may feed back to the question stage, refining our problem and solution definitions based on the constraints and characteristics of the acquired datasets.
Wrangle. Prepare. Cleanse.
The data wrangle stage prepares and cleanses our datasets for our project goals. This workflow stage starts by importing a dataset, exploring it for its features and available samples, preparing it using appropriate data types and data structures, and optionally cleansing it to create model training and solution testing samples.
The wrangle stage may circle back to the acquire stage to identify complementary datasets to combine and complete the existing dataset.
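The import-explore-prepare-cleanse sequence described above can be sketched with pandas. This is a minimal illustration on a tiny in-memory dataset of assumed columns; a real project would import from a CSV file, database, or API.

```python
# Wrangle stage sketch: import, explore, prepare, cleanse.
# The two-column dataset here is hypothetical, for illustration only.
import pandas as pd

raw = pd.DataFrame({
    "age": ["34", "28", None, "45"],          # numbers stored as text
    "city": ["London", "Delhi", "Delhi", None],
})
# Explore: how many samples and features, and where are values missing?
print(raw.shape)
print(raw.isna().sum())
# Prepare: convert features to appropriate data types
raw["age"] = pd.to_numeric(raw["age"])
# Cleanse: drop samples with missing values before training
clean = raw.dropna().reset_index(drop=True)
print(clean)
```

Exploring before preparing matters: the `age` column looks numeric but arrives as text, which would silently break most model algorithms.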
Analyse. Patterns. Explore.
The analyse stage explores the given datasets to determine patterns, correlations, classifications, and the nature of the dataset. This helps determine the choice of model algorithms and strategies that may work best on the dataset.
The analyse stage may also visualise the dataset to determine such patterns.
Model. Predict. Solve.
The model stage uses prediction and solution algorithms to train on a given dataset, then applies this training to solve a given problem.
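The train-then-solve loop can be sketched in a few lines of Scikit-learn: hold back part of the dataset, train on the rest, and measure how well the trained model solves for the unseen samples. The dataset and algorithm below are illustrative choices, not the book's prescribed ones.

```python
# Model stage sketch: train on part of a dataset, then predict (solve)
# on held-out samples. Iris and a decision tree are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)              # train
accuracy = model.score(X_test, y_test)   # solve on unseen samples
print(round(accuracy, 2))
```

Holding out a test set is what turns "the model fits the data" into "the model solves the problem" for samples it has never seen.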
Visualise. Report. Present.
The visualisation stage can support the data wrangling, analysis, and modelling stages. Data can be visualised using charts and plots suited to the characteristics of the dataset and the desired results.
The visualisation stage may also provide the inputs for the supply stage.
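Choosing a plot to suit the dataset, as described above, can be sketched with matplotlib: for a single numeric feature, a histogram shows its distribution at a glance. The sample data and output filename here are hypothetical.

```python
# Visualise stage sketch: a histogram suits a single numeric feature.
# Saved to a file for reports rather than shown interactively.
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic measurements, for illustration only
samples = np.random.default_rng(1).normal(loc=50, scale=10, size=500)

fig, ax = plt.subplots()
ax.hist(samples, bins=20)
ax.set_xlabel("measurement")
ax.set_ylabel("count")
ax.set_title("Distribution of a sample feature")
fig.savefig("feature_distribution.png")
```

The same saved figure can flow straight into a report or presentation, which is how this stage feeds the supply stage.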
Supply. Products. Services.
Once we are ready to monetise our data science solution or derive further return on investment from our projects, we need to think about distribution and the data supply chain. This stage circles back to the acquisition stage: in effect, we are acquiring data from someone else's data supply chain.
Learning path for data science
In this book we accomplish several learning goals covering the multi-disciplinary field of data science.
Open Source Technologies. Open source technologies offer solutions across the data science stack. These include technologies like D3 for visualising data, Hadoop, Spark, and many others.
Enterprise Tools. Commercial products offering enterprise-scale, power-user solutions across the data science stack. These include products like Tableau for exploring data, Trifacta for wrangling data, and Neo4j for graph databases, among others. It is important to learn about these products while knowing the editions and alternatives available in the open source stack.
Data Science Algorithms. Algorithms to prepare and analyse datasets for data science projects. APIs offering algorithms for computer vision, speech recognition, and natural language processing are available as paid services from leading Cloud providers like Google. Using open source libraries like TensorFlow, Theano, and Python's Scikit-learn, you can go further and develop your own data science models.
Cloud Platform Development. Developing full-stack frontend apps and backend APIs to deliver data science solutions and data products.
Data Science Thinking. These learning goals are inter-dependent. An open source technology may be integrated within an enterprise product. A Cloud platform may deliver data science algorithms as APIs. The Cloud platform itself may be built using several open source technologies.
We will evolve Data Science Thinking over the course of this book, using strategies and workflows that connect these learning goals into a unified, high-performance project pipeline.
About the Author
Manav Sehgal is a builder, author, and inventor specializing in product management, software engineering, cloud, and data science with more than 15 years of experience at Amazon (AWS India), Xerox PARC, HCL Technologies, and Daily Mail Group.
During his career he has also built, mentored, and led technology and product management for six startups with successful exits, including Rightster (Video Advertising), Map of Medicine (Healthcare), Cytura (Media Services), Infonetmap (E-commerce), Edynamics (Digital Marketing), and AgRisk (Agriculture Analytics). Manav is an AWS Certified Solutions Architect Associate (2019).
Daily Mail Group (RMS) sponsored him for an Executive MBA module in Leading Innovative Change at UC Berkeley's Haas School of Business (2015). He completed CMMI certification from SEI CMU (2006). He studied Computer Engineering at Delhi College of Engineering while employed at Xerox PARC (1995-99). He also earned a distinction in the Lean Management programme conducted by AOTS, JMAM Japan (1994-95).