Glossary
This appendix presents popular terms related to the data world, with a brief explanation of each, that you can use as a reference when looking for new ideas to implement or searching for further references.
The terms are grouped by the chapter they relate to, although not all of them were directly mentioned there. With the basic overview the chapters provide, you have the foundation to understand what they mean and then dig deeper into the terms you think could be useful in your organization.
What is Data
| Type | Definition | Example |
|---|---|---|
| Integer | A whole number, positive or negative. | 12, 9473, 8345262 |
| Floating point | A real number that can have as many decimal digits as needed, while its size in the computer's representation doesn't have to change. | 40.2, 3.14159 |
| Character | An alphabetical character (from any alphabet), a special character, a space, or even a digit. Characters are represented with previously defined character sets, so no calculations are performed on them. | A, π, & |
| Array | A group of data points of a single data type, like an array of integers or an array of strings. Data in any position of an array can be defined, accessed, and replaced by its index number. | [1, 1, 2, 3, 5, 8], ['the', 'cow', 'jumped', 'over', 'the', 'moon'] |
| String | Strings are arrays of characters, used to store words, sentences, or other groups of characters in a sequence. | 'Diddle', 'cat', 'And the dish ran away with the spoon.' |
| Boolean | Booleans are represented by either 0 or 1, but they're not the same as integers; they are representations of the False and True states, respectively. | 0, 1, False, True |
| Date and Time | Date and time, in computing standards, are represented as YYYY-MM-DD (year, month, and day) and HH:MM:SS (hour, minute, and second), respectively. | 2017-10-14T22:11:20 |
| Geolocation | Geolocation data identifies an electronic device’s real-world location, like a GPS, smartphone, or other sensors, gathered by some kind of network connection. It is composed of two decimal numbers representing the latitude and longitude of a location. | 38.0000,-97.0000 |
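For reference, the types above can be sketched with Python's built-in types (Python is used here purely for illustration; the values mirror the examples in the table):

```python
from datetime import datetime

integer_value = 9473                      # Integer: a whole number
floating_value = 3.14159                  # Floating point: a real number
character_value = "A"                     # Character (Python represents it as a length-1 string)
array_value = [1, 1, 2, 3, 5, 8]          # Array: data accessed by index number
string_value = "And the dish ran away with the spoon."  # String: an array of characters
boolean_value = True                      # Boolean: a True/False state
timestamp = datetime.fromisoformat("2017-10-14T22:11:20")  # Date and time
geolocation = (38.0000, -97.0000)         # Geolocation: (latitude, longitude)

print(array_value[4])   # any position can be accessed by its index -> 5
print(timestamp.year)   # -> 2017
```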
How Data is Born
| Term | Definition |
|---|---|
| User-generated content | Also known as user-created content, this is any form of content—text, images, videos, etc.—proactively posted by users on digital platforms. The platform can use this type of content to promote interactions and populate its pages, and companies can use it to promote products and run marketing actions. |
| Application-generated Data | Application data is data necessary for an application to run and is a result of software events. Every application we use daily generates data, whether from its users, from calculations based on existing data, or from logs and settings definitions. |
| Logs | Logs are the records generated by processing relevant events in a computer system. These records can serve purposes like restoring the system to a previous state or following a sequence of states. |
| Hardware-generated Data | Hardware-generated data is any data that couldn't be collected using software alone, like data coming from sensors. One of the prominent examples of hardware-generated data comes from the many kinds of sensors used in the Internet of Things. |
| Internet of Things (IoT) | IoT is described as a network of physical objects—the “things”—with embedded systems that capture analog data, transform it into digital data, and send it to other objects and software over the network. |
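The Logs entry can be made concrete with a minimal sketch: a function that turns a relevant system event into a timestamped log record (the line format and field names are assumptions for the example, not any particular logging standard):

```python
from datetime import datetime, timezone

def log_event(level, message, now=None):
    """Return one log line for a relevant system event."""
    # Use the supplied timestamp (handy for testing) or the current UTC time.
    ts = (now or datetime.now(timezone.utc)).isoformat(timespec="seconds")
    return f"{ts} [{level}] {message}"

line = log_event("INFO", "user 42 logged in",
                 now=datetime(2017, 10, 14, 22, 11, 20, tzinfo=timezone.utc))
print(line)  # 2017-10-14T22:11:20+00:00 [INFO] user 42 logged in
```

A sequence of such records is what lets a system replay or audit its own history, as described above.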
How Data is Stored
| Term | Definition |
|---|---|
| Cloud file sharing | Cloud file sharing refers to a range of cloud services that allow people to store and synchronize documents, photos, videos, and other files in the cloud — and share them with other people. These services also allow a single owner to share and synchronize data among multiple devices—notebooks, desktops, smartphones, mobile and web applications. Google Drive and Dropbox are examples of this type of service. |
| Object Storage | Object storage (or object store) is a technology that allows you to store any kind of file or document (object), normally in the cloud. Object stores are well suited to storing large volumes of multi-structured data and are often used to support data lakes. Amazon S3 is an example of a well-known cloud-based object store service. |
| Column-store Database Management System | A column-store database management system is a DBMS that indexes each column of a table, storing the column indexes together on disk — in contrast to traditional relational DBMSs using row-store, where data is stored in rows and indexes are optional. In addition, most column-store DBMSs include optimization techniques (such as compression and tokenization) that compress the data further, using less storage and increasing input/output (I/O) performance. |
| Data Lake | A data lake is a collection of raw instances of various data assets in addition to the original data sources. The purpose of a data lake is to present an unrefined view of data. Its use for analytics is mostly reserved for highly skilled analysts, helping them explore data refinement and analysis techniques independently of any system of record (such as a data mart or data warehouse). |
| Data Warehouse | A data warehouse is a storage architecture designed to hold data extracted from various sources, like transaction systems, operational data stores, and external sources. The warehouse then combines that data in an aggregated, summary form suitable for data analysis and reporting for predefined business needs. |
| Database Management System (DBMS) | A database management system (DBMS) is a product used for the storage and organization of data that typically has defined formats and structures. DBMSs are categorized by their fundamental structures: Relational Database Management Systems (which can be divided into row-store and column-store), Document Store, Key-Value, Graph, Time Series, and others. |
| Online Transaction Processing (OLTP) | Online transaction processing (OLTP) is a mode of database processing that is characterized by short transactions recording events and normally requires high availability and consistent, fast response times. This category of applications requires that a service request be answered within a predictable period that approaches “real-time.” |
| Relational Database Management System (RDBMS) | A database management system (DBMS) that incorporates the relational-data model, generally including a Structured Query Language (SQL) interface to access and manage the data. An RDBMS is a DBMS in which the database is organized and accessed according to the relationships between data items. Relationships between data items are expressed using tables. |
| Private Cloud Database Platform as a Service (dbPaaS) | Private cloud database platform as a service (dbPaaS) offerings bring the self-service and scalability of public cloud dbPaaS to a private cloud infrastructure, without external exposure. They can be deployed and managed as part of an existing private cloud management framework, with benefits similar to a public cloud — a DBMS or data store engineered as a scalable and elastic service, ideally with subscription or chargeback pricing models. |
| Document Store DBMS | Document store DBMSs contain objects stored in a hierarchical format. Documents contained in these DBMSs typically lack a predefined formal schema and do not have references to other documents within the collection. Documents are commonly self-described with JSON or XML. |
| Graph DBMS | Graph DBMSs represent relationships among entities and support complex network traversal operations that aren't easy to perform at scale with traditional relational database management systems (RDBMSs), although graph features are being tested by leading RDBMSs in various ways. Most graph DBMSs use basic graph theory suitable for general-purpose uses, such as processing the complex many-to-many connections found in social networks. |
| Key-Value DBMS | Key-value DBMSs map keys and their correspondent values with functions that make access fast and scalable. They keep data as a binary object, where it is added and read, but there are no “fields” to update — the entire value, other than the key, must be updated if changes are to be made. Key-value DBMSs support rapid scaling for simple data collections by automating “sharding” — splitting and distributing data across nodes in a massively parallel environment. |
| Public Cloud Computing | Public Cloud Computing is a style of computing where scalable and elastic IT-enabled resources are provided as a service to external customers using the internet. Using public cloud services generates economies of scale and sharing of resources that can reduce costs and increase technology choices. |
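A minimal sketch of the "sharding" idea behind key-value DBMSs, assuming a fixed set of hypothetical nodes: each key is hashed to pick the node that stores its value, and the whole value is written or read at once, with no individual fields to update.

```python
import hashlib

# Hypothetical cluster nodes (names invented for the example).
NODES = ["node-0", "node-1", "node-2"]
store = {node: {} for node in NODES}  # one in-memory "shard" per node

def shard_for(key: str) -> str:
    """Hash the key to deterministically pick the node that owns it."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def put(key, value):
    store[shard_for(key)][key] = value  # the entire value is written at once

def get(key):
    return store[shard_for(key)].get(key)

put("user:42", b'{"name": "Ada"}')
print(get("user:42"))  # the value is an opaque binary object
```

Because the hash alone decides placement, adding more nodes lets the store distribute keys across a massively parallel environment, which is what makes key-value DBMSs scale so easily for simple data collections.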
How Data is Analyzed
| Term | Definition |
|---|---|
| Advanced Analytics | Advanced analytics is the autonomous or semi-autonomous examination of data using techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations. Among analytic techniques are data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, and sentiment analysis. |
| Analytic Applications | Analytic applications are packaged BI capabilities for a particular domain or business problem. Traditional BI tools often lack the “packaging” required to facilitate adoption among most employees. This packaging can mean features like predefined integration with other business applications or visualization templates. |
| Analytics | Analytics is used to describe statistical and mathematical data analysis that clusters, segments, scores, and predicts scenarios. Analytics has gained traction and popularity among business users for its possible business applications and capacities in decision making. |
| Artificial Intelligence | Artificial intelligence (AI) is technology that appears to emulate human learning by coming to its own conclusions based on the patterns present in the data. Activities performed by AI can appear to understand complex content, engage in natural dialogs with people, enhance human cognitive performance (also known as cognitive computing), or replace people in the execution of non-routine tasks. Applications include autonomous vehicles, automatic speech recognition and generation, and detecting novel concepts and abstractions. |
| Business Analytics | Business analytics is a discipline that comprises solutions used to build analysis models and simulations to create scenarios, understand realities, and predict future states. Business analytics includes capabilities of several disciplines, like Statistics, Advanced Analytics, Data Visualization, and Machine Learning, adapted to work with data from specific domains and delivered as applications suitable for a business user. |
| Business Intelligence | Business intelligence (BI) is an umbrella term that includes the applications, infrastructure, tools, techniques, and best practices that enable analysis and access of information to improve and optimize decisions and performance. |
| Cloud Analytics and BI | Analytics and BI platform as a service — aka cloud ABI — delivers analytics capabilities and tools as a service. Those services can be databases, data integration and preparation tools, and Business Intelligence and visualization tools. Solutions are often architected with integrated information management and business analytics stacks. |
| Customer Analytics | Customer analytics is the use of data to understand the composition, needs, and satisfaction of the customer. This data can also be fed to Advanced Analytics algorithms to find clusters, predict behavior, determine trends, and other predictions using Machine Learning techniques. |
| Data Exploration | Data exploration is one of the first steps of an analysis, in which the analyst uses visualization and aggregation techniques to understand the characteristics of a dataset and assess what potential value can be extracted from it. |
| Data Scientist | The data scientist role is critical for organizations looking to extract insight from information assets for "big data" initiatives. It requires a broad combination of skills, like math and statistics, computer science, data visualization, communication, and understanding of the business domain; these capabilities are frequently fulfilled by a team. Data scientists' main activities are to choose the best model for the data, train it on existing data, perform predictions with test datasets, compare the predicted results with observed results, and tune the model's parameters to bring predicted and observed results as close as possible. This process is called the data science pipeline. |
| Data Storytelling | Data storytelling combines interactive data visualization with narrative techniques to deliver insights in compelling, easily assimilated forms. Analytic data stories are intended to prompt discussion and drive collaborative decision making, while journalistic or reportage style data stories aim to inform or educate. Both commonly link data and time or events via a narrative story arc. |
| Descriptive Analysis | Descriptive analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”, generally using traditional business intelligence (BI) and visualizations such as pie charts, bar charts, line graphs, or tables, and presented to stakeholders using data storytelling techniques to create a timeline of events. |
| Diagnostic Analysis | Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question “Why did it happen?”. It is characterized by techniques such as drill-down, data discovery, data mining, and the discovery of correlations between different characteristics and attributes of data. |
| Logical Data Warehouse | The logical data warehouse (LDW) is a technique that centralizes access to data in a single interface, accessing the data in its original storage. It is now a best-practice analytics data management architecture and design, combining the strengths of traditional repository warehouses with alternative data management and access strategies. |
| Machine Learning | Machine learning comprises many technologies, such as deep learning, neural networks, and natural language processing. These are used in unsupervised and supervised learning, guided by lessons from existing information, to comprehend context and correlations and to make predictions about new entries of data. |
| Predictive Analytics | Predictive analytics is a form of advanced analytics that examines data or content to answer the question, “What is going to happen?” or, more precisely, “What is likely to happen?” It is characterized by techniques such as regression analysis, forecasting, multivariate statistics, pattern matching, predictive modeling, and other Machine Learning techniques. |
| Prescriptive Analytics | Prescriptive analytics techniques are employed to answer the question “What should we do?”. Common examples of prescriptive analytics are a combination of predictive analytics and rules, heuristics, and decision analysis methods. The output of a prescriptive analysis is a recommendation or automated action. |
| Product Analytics | Product analytics is a type of specialized business intelligence (BI) application that consumes service reports, product returns, warranties, customer feedback, and data from embedded sensors to help manufacturers evaluate product defects, identify opportunities for product improvements, detect patterns in usage or capacity of products, and link all of these factors to customer experience. Advanced techniques can be used in assessing the customer experience subjectively, like analysis of social media, organic growth, and popularity. |
| Sales Analytics | Sales analytics systems provide all types of analysis (descriptive, diagnostic, predictive, and prescriptive functions) to monitor sales activity and sales execution and to guide vendor strategies and management actions. |
| Self-Service Analytics | Self-service analytics is a form of business intelligence (BI) in which business professionals are enabled and encouraged to perform queries, data exploration, and generate reports on their own. Simple-to-use BI tools often characterize self-service analytics with basic analytic capabilities and an underlying data model that has been simplified or scaled down for ease of understanding and straightforward data access. |
| Social Analytics | Social analytics applications assist organizations in the process of collecting, measuring, analyzing, and interpreting the results of interactions and associations among people, topics, ideas, and other content types on social media. |
| Software Usage Analytics | Software usage analytics is application-generated data that contains detailed tracking and analysis of user interactions within an application. Its analysis, along with data about users' profiles, provides insights used to improve user experience, prioritize feature enhancements, measure user adoption, track compliance, and provide real-time user help. |
| Speech Analytics | Speech analytics, also known as audio mining, analyzes keywords, phonetics, and transcriptions to extract insights from prerecorded and real-time voice streams. Analytics strategies can be used to classify calls, trigger alerts and workflows, and improve customer service performance. Research applications can analyze speeches in political and social scenarios. |
| Text Analytics | Text analytics is the process of deriving analytical insights from textual data sources. Insights can include determining and classifying the subjects of texts, summarizing texts, extracting key entities from texts, and identifying the tone or sentiment of texts. Text analytics can be applied to understanding customer satisfaction, measuring media repercussion, and extracting insights from large batches of documents. |
| Web Analytics | Web analytics refers to specialized analytic applications to analyze user behavior on web pages and improve strategies for user experience, visitor acquisition and actions, and performance of digital marketing and advertising campaigns. |
| Workplace Analytics | Workplace analytics refers to aggregated insight derived from a collection of data sources, tools, and processes to enhance the quality of the digital workplace. The collective insights enable improvements in business value, employee engagement, IT operational performance, and security risk mitigation. |
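As a small illustration of descriptive analytics ("What happened?"), here is an aggregation over a hypothetical sales dataset; all names and figures are invented for the example:

```python
from collections import Counter
from statistics import mean

# A tiny, made-up dataset of sales records.
sales = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": 200.0},
]

# Descriptive questions: how many orders per region, and what was
# the average ticket?
orders_per_region = Counter(row["region"] for row in sales)
average_ticket = mean(row["amount"] for row in sales)

print(orders_per_region)        # Counter({'north': 2, 'south': 1})
print(round(average_ticket, 2)) # 133.33
```

Diagnostic, predictive, and prescriptive analytics build on summaries like these to ask "why?", "what is likely to happen?", and "what should we do?".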
How Data is Managed
| Term | Definition |
|---|---|
| Application Data Management | Application data management (ADM) is a business discipline in which business and IT work together to ensure uniformity, accuracy, stewardship, governance, semantic consistency, and accountability for data. Application data is the consistent set of identifiers and attributes of the data maintained and/or used within an application. |
| Batch processing | Batch processing is the processing of application programs and their data in large quantities, with one batch being completed before the next is started. It is scheduled to run periodically and is used for periodic events like shipping or payroll. It is also a method used to move data from a source to a data warehouse, where the pipeline is executed from time to time with all the data generated since the last batch processed. |
| Stream Processing | Stream processing is computing performed on the fly on data that arrives or is created in real time. Its purpose is stream data integration or stream analytics. Stream processing can be executed: 1. As new data arrives, using event-driven systems; 2. Shortly after it arrives, using real-time, on-demand queries; 3. Long after it has been stored, using on-demand queries on historical data. |
| Content Migration | Content migration refers to the process of consolidating and transferring unstructured content (files, documents, objects), along with related metadata, permissions, certificates, compounded structure, and linked components, stored permanently in one or more content repositories, to a new environment (cloud content services). This process often involves cleansing and archiving outdated files. |
| Data and analytics Governance | Data and analytics governance is the specification of decision rights and an accountability framework to ensure appropriate behavior in the valuation, creation, storage, access, consumption, retention, and disposal of all information assets. It includes the standards, rules, and processes related to dealing with these assets. |
| Data and Analytics Services | Data and analytics services are the consulting, implementation, and managed services for decision support, analytics, and data management capabilities that support an organization’s data-driven activities. These services can go from delivering analytics and BI solutions to data governance and data management solutions and infrastructure management. |
| Data Catalog | A data catalog is a technology used to build and maintain a data inventory through the discovery, organization, and description of datasets that are mostly (but not exclusively) related to the formation of data lakes. It provides context to help data analysts, data engineers, data scientists, managers, and other data consumers (including business users) locate a relevant dataset and understand what it means in order to extract business value. |
| Data Engineering | Data engineering is the practice of making the appropriate data accessible and available to various data users, like analysts and data scientists. The routine activities generally involve building data pipelines, applying extraction, transformation, and loading techniques considering data quality and governance standards to make relevant data available to users who will then extract value from it through business processes. |
| Data Integration | The discipline of data integration comprises the practices, architectural techniques, and tools for achieving the consistent access and delivery of data across the data sources and data structure types in the organization to meet the data consumption requirements of all applications and business processes. |
| Data Ops | Data ops is a methodology oriented to automated processes and activities to improve data quality and reduce the cycle of data preparation while protecting the privacy and access restrictions of data. It is becoming a complementary practice to data analytics and data science to deliver reliable data for insights. |
| Data Preparation | Data preparation is an iterative and agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for data integration, data science, and analytics. Data preparation tools can accelerate time to insight by reducing the complexity of data preparation, finding patterns in their integrated datasets, and sharing their findings for further analysis without the need of getting into the infrastructure. |
| Data Profiling | Data profiling is a technology for discovering and investigating data quality issues such as lack of consistency, accuracy, and completeness. This is accomplished by analyzing data sources and collecting metadata that shows the condition of the data. It enables managers to investigate the origin of data errors. The tools provide statistics about the quality, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats. |
| Data Quality Tools | As data quality is a collection of metrics and processes that ensure data can be trusted to deliver insights, the market for data quality tools has become highly visible in recent years as more organizations understand the impact of poor-quality data and seek solutions for improvement. Data quality tools can assess fields and perform simple predefined modifications to them based on rules. They were traditionally aligned with the cleansing of customer data (names and addresses) in support of CRM-related activities, although the tools have expanded their domains and integrations to assess the quality of data in different sources. |
| Enterprise Information Management (EIM) Program | Enterprise information management (EIM) is an integrative discipline for structuring, describing, and governing data and analytics across organizational and technological boundaries in order to maximize business outcomes. EIM programs usually are led by senior business and information roles such as the chief data officer (CDO), and often start small with specific business-relevant programs but then expand over time. |
| Information Lifecycle Management | Information life cycle management (ILM) is an approach to data and storage management that recognizes that the value of information changes over time and that it must be managed accordingly. ILM seeks to classify data according to its business value and establish policies to migrate, store, and archive data according to their appropriate tier. |
| Master Data Management (MDM) | Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise's official shared master data assets. Master data is the consistent and uniform set of identifiers and attributes that describes the core entities of the enterprise, including customers, leads, suppliers, sites, hierarchies, etc. |
| Metadata | Metadata is information that describes various facets of an information asset to improve its usability throughout its life cycle. Generally speaking, the more valuable the information asset, the more critical it is to manage its metadata, because it is the metadata definition that provides the understanding that unlocks the value of data. Understanding and correctly managing this value is also part of the Information Lifecycle Management discipline. |
| Metadata Management Solutions | Metadata management solutions (MMS) are software applications that manage metadata with capabilities like metadata repositories, business glossary, data lineage, impact analysis, rule management, semantic frameworks, and metadata ingestion and translation from different data sources. |
| Model Management | Model management aims to streamline the prioritization, creation, operationalization, and execution of predictive and deterministic models. It is used by Machine Learning engineers to provide support for running models, and supports version and access control, model performance tracking, scheduled or condition-based recalibration of models, and also serves as a “library” to facilitate end-users’ access and reuse of completed models. |
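The data-profiling statistics mentioned above (completeness and degree of duplication) can be sketched for a single column like this; it is a simplified illustration, not any specific tool's method:

```python
def profile_column(values):
    """Profile one attribute (column): completeness and degree of duplication.

    Assumes a non-empty column; None marks a missing value.
    """
    non_missing = [v for v in values if v is not None]
    # Completeness: share of rows that actually have a value.
    completeness = len(non_missing) / len(values)
    # Duplication: share of non-missing values that repeat an earlier value.
    duplication = (1 - len(set(non_missing)) / len(non_missing)) if non_missing else 0.0
    return {"completeness": completeness, "duplication": duplication}

emails = ["a@x.com", "b@x.com", "a@x.com", None]
print(profile_column(emails))  # completeness: 0.75, duplication ~ 0.33
```

Real data quality and profiling tools report many more metrics (attribute-value ratios, pattern conformance, cross-source comparisons), but they rest on column statistics of exactly this kind.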
How Data is Used in an Organization
| Term | Definition |
|---|---|
| Business Process Analysis Tools | Business process analysis tools are used by business users to document, analyze and streamline complex processes, improve productivity, increase quality, become more agile and effective, and comply with business rules and laws. They enable users and managers to better understand business processes, events, workflows, and data using proven modeling techniques. |
| Decision Management | Decision management is the discipline of designing, building, and maintaining systems that produce structured decisions based on a representation of the decision-making process with inputs, algorithms, and results. Decision-making systems may be implemented with rule engines, optimization algorithms, or other kinds of machine learning algorithms. |
| Information Architecture | Information architecture is the set of requirements, principles, and models that define the current state, the future state, and the guidance necessary to flexibly share and exchange information assets effectively throughout the organization to achieve a data-driven culture. |
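A minimal sketch of the Decision Management entry above, using a tiny rule engine: structured inputs flow through ordered rules that yield a structured decision (the rules, field names, and thresholds are invented for the example):

```python
# Ordered rules: the first condition that matches decides the outcome.
RULES = [
    (lambda order: order["amount"] > 10_000, "manual_review"),
    (lambda order: order["customer_score"] < 0.3, "reject"),
    (lambda order: True, "approve"),  # default rule
]

def decide(order):
    """Produce a structured decision from a structured input."""
    for condition, outcome in RULES:
        if condition(order):
            return outcome

print(decide({"amount": 500, "customer_score": 0.9}))     # approve
print(decide({"amount": 50_000, "customer_score": 0.9}))  # manual_review
```

Production decision-management systems replace the hand-written rules with maintained rule engines, optimization algorithms, or machine learning models, but the shape — inputs, an algorithm, a result — is the same.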