Part II - Data Usage
From here on, this book addresses common possibilities and concerns for business people dealing with data. The next chapters cover a series of analytical techniques and practices currently in the spotlight, close to the state of the art when it comes to taking care of data from a business perspective.
When going through the steps below - especially where they cross with more technical activities - keep in mind the goal of applying these processes: becoming data-driven. The hype over the importance of data is justifiable: data can indeed guide decisions without bias and keep them focused on results. But doing so effectively is not easy and requires knowledge.
Deploying a database and feeding data to a Business Intelligence software without a solid plan might be as damaging as not using data at all. It can lead to wrong assumptions based on inadequate data or the wrong type of analysis.
The following chapters follow a bottom-up approach:
- First, we’ll introduce the data concept closest to business operation: data analytics. We will then go over types of analyses and scratch the surface of what Machine Learning means, so you can start planning what kinds of analyses would help your organization get the results it needs. An overview of what types of software to look for is provided, as well as a high-level approach on how it relates to the databases presented in Chapter 4.
- We then talk about business strategies related to data and analytics. After getting familiar with the technology stack concepts and visualizing how each can impact your business, managing expectations and presenting results is essential to building a robust data culture.
- Then, we go deeper into the stack and talk about the technology team that will operate these tools. Most importantly, how that team will take care of the data, and why this directly impacts the analyses and insights.
- The final chapter lays down how you can start building this team. How to search for new knowledge and why a culture of constant learning should be developed. After all, new data keeps coming in nonstop.
We hope you find value in the following chapters and can build a solid and sustainable data initiative with the information!
Chapter 9 - Wrapping it up
To summarize the entire data process we learned in this book, we will propose a hypothetical scenario of an organization and how it can start using its data to promote business impact.
We will follow the same sequence of content that was presented in this book, starting from Chapter 1 (What is Data) and finishing at Chapter 8 (How to Build a Data Team). While reading this chapter, feel free to pause and go back to the corresponding chapters if you don’t remember a concept that is being mentioned.
Now, let’s dive into the story. If you’re familiar with the TV mockumentary sitcom The Office, you might have heard of the Dunder Mifflin Paper Company, Inc., a company that resells paper from manufacturers to offices and other companies. We will examine how Dunder Mifflin could become data-driven and thrive in the paper business.
First, what data could Dunder Mifflin have?
They have data on all of their paper suppliers: names, emails, addresses, phone numbers, last purchase dates, and each supplier’s status, currently “active” or “inactive”. They also keep a catalog of every type of paper, recording its type, its price, and how many sheets come in a package.
Additionally, they have information about their customers: names, emails, addresses, and date of the last purchase.
They also have data about all of their sales. Every order has an identification number and the name of the salesman responsible for that sale. The order’s contents include the type of paper, the number of packages, and the charged price. Each order also identifies the customer and records the order date, shipping date, and delivery date.
In a more modern system, they could even store the buyer’s address so the GPS data generated by the delivery truck can automatically calculate the best routes and delivery ETAs.
Although the data above seems to span many different kinds of information, notice that all of these fields can be categorized into just a few data types: Integer (the type of paper, the number of packages), Floating-point (the charged price), Character/String (names, emails, addresses), and Boolean (whether the supplier’s status is currently active or inactive).
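As a sketch, these primitive types map directly onto types in most programming languages. In Python, for instance, a record combining the fields above could look like this (field names and values are illustrative, not from any real Dunder Mifflin system):

```python
from dataclasses import dataclass

@dataclass
class SalesRecord:
    """Illustrative record mixing the four primitive data types."""
    customer_name: str     # Character/String
    customer_email: str    # Character/String
    paper_type_id: int     # Integer: a numeric code for the type of paper
    package_count: int     # Integer
    charged_price: float   # Floating-point
    supplier_active: bool  # Boolean: "active" / "inactive" supplier status

record = SalesRecord("Vance Refrigeration", "bob@vance.example",
                     paper_type_id=3, package_count=120,
                     charged_price=24.99, supplier_active=True)
print(type(record.charged_price).__name__)  # float
```

Whatever the system, every field it stores ultimately resolves to one of these few types.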
How can this data become information at Dunder Mifflin?
At first, someone randomly looking at all these fields and types of data separately wouldn’t generate any value for Dunder Mifflin. But the moment a branch manager or a salesman looks at this data combined, it clearly becomes Information: they know how to interpret and act on it and, more importantly, they know how valuable this information could be to Dunder Mifflin’s competitors.
Then, where else could we gather data from?
If the sales system registers the date and time of a sale and stores it, it is application-generated. If the trucks that deliver the supplies for the buyers have sensors installed to register the truck’s speed and how long it took to reach the destination, and at what hour that happened, this is hardware-generated data.
Suppose there is a system that registers support tickets and product reviews submitted by the buyers. In that case, this can be user-generated data and can be used to analyze customer satisfaction.
How does Dunder Mifflin store its data?
The software used to register and manage sales can have its own Relational Database Management System (RDBMS) with the following tables:
- One table containing all the customers and their details with a unique identification (ID) for each customer.
- One table containing all the orders and their details, including the ID of the customer who made the order, along with a unique identifier for each order and each product bought.
- One table containing all the possible discounts for specific price ranges.
- One table with all the salesmen and their details, like region and commission range.
- One table with the relationship salesman-to-customer, where every record stores the ID of a customer and the ID of the salesman who is responsible for that customer.
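The tables above can be sketched as a schema. Here is a minimal, hypothetical version using SQLite; all table and column names are assumptions for illustration, not taken from a real system:

```python
import sqlite3

# In-memory database standing in for the sales system's RDBMS.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name TEXT, email TEXT, address TEXT, last_purchase_date TEXT
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    product TEXT, packages INTEGER, charged_price REAL,
    order_date TEXT, shipping_date TEXT, delivery_date TEXT
);
CREATE TABLE discounts (
    price_range TEXT, discount_percent REAL
);
CREATE TABLE salesmen (
    salesman_id INTEGER PRIMARY KEY,
    name TEXT, region TEXT, commission_range TEXT
);
-- One record per salesman-to-customer responsibility.
CREATE TABLE salesman_customer (
    salesman_id INTEGER REFERENCES salesmen(salesman_id),
    customer_id INTEGER REFERENCES customers(customer_id)
);
""")
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"))
print(tables)
```

Note how the `salesman_customer` table expresses the relationship purely through IDs, which is exactly how an RDBMS links records across tables.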
Dunder Mifflin also uses an independent supply chain software that has its own RDBMS and contains tables like:
- The suppliers, their details, and an ID for each.
- The products available from the suppliers, with name, price, and the supplier’s ID that sells it.
- Each order made by Dunder Mifflin, containing the ID of the supplier it was ordered from and the IDs of the products ordered.
From a data storage perspective, Dunder Mifflin, as a modern paper company, runs its customer and supplier management software as Software-as-a-Service, hosted in the cloud along with the servers that hold its relational databases. Nothing is stored on-premises at headquarters or the branches. Yes, even Michael Scott uses the cloud!
How does Dunder Mifflin analyze its data?
Using all the data from the relational databases to support the customer and supplier systems, Dunder Mifflin built a Data Warehouse using a modern column-based (OLAP) database that unifies all the company’s data to be analyzed. To prepare the data for the analysis, data engineers clean records registered with errors, treat empty fields so they won’t generate incorrect reports, and assure that data is in the right format and range of values.
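The cleaning steps described here can be sketched in a few lines. The following hypothetical example treats empty fields, enforces the right format, and checks the range of values before a record is loaded (field names and rules are illustrative assumptions):

```python
from typing import Optional

def clean_order(raw: dict) -> Optional[dict]:
    """Clean one raw order record before loading it into the warehouse."""
    cleaned = dict(raw)
    # Treat empty fields so they won't generate incorrect reports:
    # a missing package count would skew totals, so default it to zero.
    if not cleaned.get("packages"):
        cleaned["packages"] = 0
    # Assure the right format: prices may arrive as strings from some sources.
    try:
        cleaned["charged_price"] = float(cleaned["charged_price"])
    except (KeyError, TypeError, ValueError):
        return None  # unrecoverable record; route it to an error log instead
    # Assure the right range of values: a negative price is a registration error.
    if cleaned["charged_price"] < 0:
        return None
    return cleaned

raw_orders = [
    {"order_id": 1, "packages": 10, "charged_price": "24.99"},
    {"order_id": 2, "packages": None, "charged_price": "12.50"},  # empty field
    {"order_id": 3, "packages": 5, "charged_price": "-1"},        # bad range
]
loaded = [o for o in (clean_order(r) for r in raw_orders) if o is not None]
print(len(loaded))  # 2
```

In practice, data engineers run rules like these inside the pipeline that feeds the Data Warehouse, so analysts never see the dirty records.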
With a Business Intelligence platform, Dunder Mifflin’s managers can continuously access and monitor the state of the reports, which can be in dashboards containing charts and historical timelines for easier comprehension.
Analysts could perform a descriptive analysis to check customer satisfaction, with metrics like the percentage of orders that receive complaints, which customers place the most expensive orders (so salesmen can pay more attention to their requests), and which orders are still waiting for delivery.
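A descriptive metric like the percentage of orders that receive complaints is a simple aggregation. A sketch with made-up order records:

```python
# Hypothetical order records; in practice these would come from the warehouse.
orders = [
    {"order_id": 1, "customer": "Schrute Farms", "has_complaint": False},
    {"order_id": 2, "customer": "Vance Refrigeration", "has_complaint": True},
    {"order_id": 3, "customer": "Athlead", "has_complaint": False},
    {"order_id": 4, "customer": "Vance Refrigeration", "has_complaint": False},
]
# Share of orders flagged with a complaint, as a percentage.
complaint_rate = 100 * sum(o["has_complaint"] for o in orders) / len(orders)
print(f"{complaint_rate:.1f}% of orders received complaints")  # 25.0%
```

A BI dashboard would compute the same figure continuously and plot it on a historical timeline.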
Predictive analyses can be made to visualize future trends in the paper business:
- Based on purchase history, which customers are likely to cancel their contracts?
- Based on past price changes, which suppliers are likely to increase their prices?
- Based on past orders, what type of paper is likely to be bought by customers in which industries?
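Real predictive models are trained on historical outcomes with Machine Learning techniques. As a deliberately simplified sketch of the first question, a toy rule could flag churn risk from how recently a customer last purchased (the 90-day threshold is an assumption for illustration only):

```python
from datetime import date

def churn_risk(last_purchase: date, today: date, threshold_days: int = 90) -> str:
    """Toy rule: customers silent longer than the threshold are flagged."""
    days_silent = (today - last_purchase).days
    return "likely to cancel" if days_silent > threshold_days else "healthy"

today = date(2024, 6, 1)
print(churn_risk(date(2024, 5, 20), today))  # healthy
print(churn_risk(date(2024, 1, 5), today))   # likely to cancel
```

A trained model would replace the fixed threshold with patterns learned from customers who actually canceled.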
Then, even prescriptive analysis can be performed to increase the number of sales, like:
- What kind of marketing actions will result in the most sales for bigger offices?
- What kind of customer relationship channel (phone, email, in-person visits) will lead to more satisfied customers from smaller offices?
How can a Dunder Mifflin branch convince the executives it needs an infrastructure for these types of analysis?
The corporate team may be hesitant to invest in technology, as technology is the very reason the paper business is shrinking. A good pitch can change the odds for the proposition.
- Start with a known business question: With the rise of digital businesses and software that manages traditional companies’ operations, paper demand decreases every day. What markets do we have to explore in order to grow in the next quarter despite the digital evolution of offices?
- Present evidence with consistent outputs: Dunder Mifflin can generate reports about its sales that show which businesses are growing, which are shrinking, and what kind of marketing or customer relationship actions can make a difference. In this case, it can be useful to present an action that already worked: a branch manager personally spent hours having dinner with a local client and won the order because the client saw value in supporting a local business. By identifying which clients value this trait, managers’ hours could be spent more efficiently with those clients.
- Address possible concerns: The corporate staff might not seriously consider that technology can help a traditional company. Because of that, it can be essential to show the market tendencies: the fact that the company can be out of business in a few years or months if things stay as they are. Investing in technology can cause strategic changes and make administrative work faster and cut costs there, too.
- Create a mission statement: It is essential to convince the corporate office and the teams in the branches to embrace the change and really use the systems, so the analytical processes have input to work with. A mission statement could be: “To support Dunder Mifflin in being the paper company with the best customer relationship, we will focus our efforts on our target customers by offering exclusive support, close contact, and fast shipping at a friendly cost.”
What is the data team going to look like?
The data engineers will have several responsibilities, including:
- Integrating the data sources from different systems and databases to the Data Warehouse and then to the Business Intelligence platform.
- Applying transformation to this data when it is being loaded into the Data Warehouse, so it is standardized for analysis.
- Fixing possible errors in this data, like wrong formats, inconsistencies, and empty values.
- Mapping, providing, and monitoring access to all data sources and the platforms where the data can be accessed.
- Constantly monitoring the performance of the databases and the pipelines to fix errors and identify bottlenecks for improvement.
The team will also need analysts: people who bridge business questions and data to create the required dashboards and reports. Those analysts will be in constant contact with branch management, the supplier relationship manager, the customer relationship manager, the accounting team, and the human resources representative.
These teams will need to know what the data shows about their actions and their work, to then change strategies accordingly.
Analysts will also be in contact with the corporate team, reporting the branch’s results on sales, orders, marketing, and other operational issues. Not only will analysts extract these metrics, but they will also need communication skills to tell the corporate team which actions led to those results and why.
When a good amount of data about orders and suppliers has been registered, the branch can consider bringing in a data scientist to help with predictions. They can build models that forecast sales, customer acquisition, and other metrics that show in what direction the current strategies are taking the company.
With all of these pieces in place, Dunder Mifflin can now stop drifting in a digital economy and start taking action based on data to put its efforts into the right moves to keep itself in business!
We hope this conclusion helped you consolidate all the concepts presented in this book, and we really wish this could be the beginning of a new “data-related life” for you.
Remember that at the end of Chapter 8 (8.4 Learning Path), we proposed a few technical and business books as a learning path for the next steps if you decide to go deeper into the data world.
Finally, we present a glossary with the many terms, concepts, and jargon used within this book that you will frequently hear if you decide to start working more intimately with data. There’s also a brief explanation for each term and concept that you can use as a quick reference guide.
Glossary
This appendix presents popular terms related to the data world and a brief explanation about each that you can use as a reference when looking for new ideas to implement or searching for even more references.
The terms are divided by the chapter they are related to, although not all of them were directly mentioned. But with the basic overview the chapters provide, you have the foundation to understand what they mean, and then search deeper for the terms you think could be useful in your organization.
What is Data
| Type | Definition | Example |
|---|---|---|
| Integer | A round number, positive or negative. | 12, 9473, 8345262 |
| Floating point | A real number that can have as many decimal digits as needed, while its size in the computer’s internal representation doesn’t have to change. | 40.2, 3.4159 |
| Character | An alphabetical character (from any alphabet), a special character, a space, or even a number. Characters are represented with previously defined character sets, so no calculations are performed among them. | A, π, & |
| Array | A group of data points of a single data type, like an array of integers or an array of strings. Data in any position of an array can be defined, accessed, and replaced by its index number. | [1, 1, 2, 3, 5, 8], [‘the’, ‘cow’, ‘jumped’, ‘over’, ‘the’, ‘moon’] |
| String | Strings are arrays of characters, used to store words, sentences, or other groups of characters in a sequence. | ‘Diddle’, ‘cat’, ‘And the dish ran away with the spoon.’ |
| Boolean | Represented by either 0 or 1, but not the same as integers: they are representations of the False and True states, respectively. | 0, 1, False, True |
| Date and Time | Date and time, in computing standards, are represented as YYYY-MM-DD (year, month, and day) and HH:MM:SS (hour, minute, and second), respectively. | 2017-10-14T22:11:20 |
| Geolocation | Geolocation data identifies an electronic device’s real-world location, gathered from a GPS, smartphone, or other sensors through some kind of network connection. It is composed of two decimal numbers representing the latitude and longitude of a location. | 38.0000, -97.0000 |
How Data is Born
| Term | Definition |
|---|---|
| User-generated content | Also known as user-created content, this is any form of content—text, images, videos, etc.—proactively posted by users on digital platforms. Platforms can use this content to promote interactions and populate their pages, and companies can use it to promote products and run marketing campaigns. |
| Application-generated Data | Application data is data necessary for an application to run and is a result of software events. Every application we use daily will generate data: be it from its users, calculations based on existing data, and logs or settings definitions. |
| Logs | Logs are the register generated by processing relevant events in a computer system. This register can serve purposes like reestablishing the system to a previous version or following a sequence of states. |
| Hardware-generated Data | Hardware-generated data is any data that wouldn’t be possible to be collected using only the software itself, like data coming from sensors. One of the prominent examples of hardware-generated data comes from all sorts of sensors used on the Internet of Things. |
| Internet of Things (IoT) | IoT is described as a network of physical objects—the “things”—with embedded systems that capture analog data, transform it into digital data, and send it to other objects and software over the network. |
How Data is Stored
| Term | Definition |
|---|---|
| Cloud file sharing | Cloud file sharing refers to a range of cloud services that allows people to store and synchronize documents, photos, videos, and other files in the cloud — and share them with other people. These services also allow users to share and synchronize data among multiple devices—notebooks, desktops, smartphones, mobile and web applications—for a single owner. Google Drive and Dropbox are some examples of this service. |
| Object Storage | Object storage (or object store) is a technology that allows you to store any kind of file or document (object), normally in the cloud. Object stores are well-suited to storing large volumes of multi-structured data and are often used to support data lakes. Amazon S3 is an example of a well-known cloud-based object store service. |
| Column-store Database Management System | A column-store database management system is a DBMS that indexes each column of a table, storing the column indexes together on disk — contrary to traditional row-store relational DBMSs, where data is stored in rows and indexes are optional. In addition, most column-store DBMSs include optimization techniques (such as compression and tokenization) that compress the data further, using less storage and increasing input/output (I/O) performance. |
| Data Lake | A data lake is a collection of raw instances of various data assets in addition to the original data sources. The purpose of a data lake is to present an unrefined view of data. Its use for analytics is mostly reserved for highly skilled analysts to help them explore their data refinement and analysis techniques independent of any of the system-of-record (such as a data mart or data warehouse). |
| Data Warehouse | A data warehouse is a storage architecture designed to hold data extracted from various sources, like transaction systems, operational data stores, and external sources. The warehouse then combines that data in an aggregated, summary form suitable for data analysis and reporting for predefined business needs. |
| Database Management System (DBMS) | A database management system (DBMS) is a product used for the storage and organization of data that typically has defined formats and structures. DBMSs are categorized by their fundamental structures, like Relational Database Management Systems—that can be divided into row-store and column-store—, Document store, Key-Value, Graph, Time Series, and others. |
| Online Transaction Processing (OLTP) | Online transaction processing (OLTP) is a mode of database processing that is characterized by short transactions recording events and normally requires high availability and consistent, fast response times. This category of applications requires that a service request be answered within a predictable period that approaches “real-time.” |
| Relational Database Management System (RDBMS) | A database management system (DBMS) that incorporates the relational-data model, generally including a Structured Query Language (SQL) interface to access and manage the data. An RDBMS is a DBMS in which the database is organized and accessed according to the relationships between data items. Relationships between data items are expressed using tables. |
| Private Cloud database Platform as a Service | Private cloud database platform as a service (dbPaaS) offerings bring the self-service and scalability of public cloud dbPaaS to a private cloud infrastructure, without external exposure. They can be deployed and managed as part of an existing private cloud management framework, with similar benefits to a public cloud — a DBMS or data store engineered as a scalable and elastic service, ideally with a subscription of chargeback pricing models. |
| Document Store DBMS | Document store DBMSs contain objects stored in a hierarchical format. Documents contained in these DBMSs typically lack a predefined formal schema and do not have references to other documents within the collection. Documents are commonly self-described with JSON or XML. |
| Graph DBMS | Graph DBMSs represent relationships among entities and support complex network traversal operations that aren’t easy to perform at scale with traditional relational database management systems (RDBMSs), although graph features are being tested by leading RDBMSs in various ways. Most graph DBMSs use basic graph theory suitable for general-purpose uses, such as processing the complex many-to-many connections found in social networks. |
| Key-Value DBMS | Key-value DBMSs map keys and their correspondent values with functions that make access fast and scalable. They keep data as a binary object, where it is added and read, but there are no “fields” to update — the entire value, other than the key, must be updated if changes are to be made. Key-value DBMSs support rapid scaling for simple data collections by automating “sharding” — splitting and distributing data across nodes in a massively parallel environment. |
| Public Cloud Computing | Public Cloud Computing is a style of computing where scalable and elastic IT-enabled resources are provided as a service to external customers using the internet. Using public cloud services generates economies of scale and sharing of resources that can reduce costs and increase technology choices. |
How Data is Analyzed
| Term | Definition |
|---|---|
| Advanced Analytics | Advanced analytics is the autonomous or semi-autonomous examination of data using techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations. Among analytic techniques are data/text mining, machine learning, pattern matching, forecasting, visualization, semantic analysis, and sentiment analysis. |
| Analytic Applications | Analytic applications are packaged BI capabilities for a particular domain or business problem. Traditional BI tools often lack the “packaging” required to facilitate adoption among most employees. This packaging can mean features like predefined integration with other business applications or visualization templates. |
| Analytics | Analytics is used to describe statistical and mathematical data analysis that clusters, segments, scores, and predicts scenarios. Analytics has gained traction and popularity among business users for its possible business applications and capacities in decision making. |
| Artificial Intelligence | Artificial intelligence (AI) is technology that appears to emulate human learning by coming to its own conclusions based on the pattern presented in the data dimensions. Activities performed by AI can appear to understand complex content, engage in natural dialogs with people, enhance human cognitive performance (also known as cognitive computing) or replace people on execution of non-routine tasks. Applications include autonomous vehicles, automatic speech recognition and generation, and detecting novel concepts and abstractions. |
| Business Analytics | Business analytics is a discipline that comprises solutions used to build analysis models and simulations to create scenarios, understand realities, and predict future states. Business analytics includes capabilities of several disciplines, like Statistics, Advanced Analytics, Data Visualization, and Machine Learning, adapted to work with data from specific domains and delivered as applications suitable for a business user. |
| Business Intelligence | Business intelligence (BI) is an umbrella term that includes the applications, infrastructure, tools, techniques, and best practices that enable analysis and access of information to improve and optimize decisions and performance. |
| Cloud Analytics and BI | Analytics and BI platform as a service — aka cloud ABI — delivers analytics capabilities and tools as a service. Those services can be databases, data integration and preparation tools, and Business Intelligence and visualization tools. Solutions are often architected with integrated information management and business analytics stacks. |
| Customer Analytics | Customer analytics is the use of data to understand the composition, needs, and satisfaction of the customer. This data can also be fed to Advanced Analytics algorithms to find clusters, predict behavior, determine trends, and other predictions using Machine Learning techniques. |
| Data Exploration | Data exploration is one of the first steps of an analysis, in which the analyst uses visualization and aggregation techniques to understand the characteristics of a dataset and assess what potential value can be extracted from it. |
| Data Scientist | The data scientist role is critical for organizations looking to extract insight from information assets for “big data” initiatives. It requires a broad combination of skills, like math and statistics, computer science, data visualization, communication, and understanding of the business domain; these capacities are frequently fulfilled by a team rather than one person. Data scientists’ activities are mainly to choose the best model for the data, train it on existing data, make predictions on test datasets, compare the predicted results with observed results, and tune the model’s parameters so predicted and observed results are as close as possible. This process is called the data science pipeline. |
| Data Storytelling | Data storytelling combines interactive data visualization with narrative techniques to deliver insights in compelling, easily assimilated forms. Analytic data stories are intended to prompt discussion and drive collaborative decision making, while journalistic or reportage style data stories aim to inform or educate. Both commonly link data and time or events via a narrative story arc. |
| Descriptive Analysis | Descriptive analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”, generally using traditional business intelligence (BI) and visualizations such as pie charts, bar charts, line graphs, or tables, and presented to stakeholders using data storytelling techniques to create a timeline of events. |
| Diagnostic Analysis | Diagnostic analytics is a form of advanced analytics that examines data or content to answer the question “Why did it happen?”. It is characterized by techniques such as drill-down, data discovery, data mining, and the discovery of correlations between different characteristics and attributes of data. |
| Logical Data Warehouse | The logical data warehouse (LDW) is a technique that centralizes access to data in a single interface while the data stays in its original storage. It is now a best-practice analytics data management architecture and design, combining the strengths of traditional repository warehouses with alternative data management and access strategies. |
| Machine Learning | Machine Learning algorithms are composed of many technologies, such as deep learning, neural networks, and natural language processing. They are used in unsupervised and supervised learning, guided by lessons from existing information to comprehend context and correlations and perform predictions about new entries of data. |
| Predictive Analytics | Predictive analytics is a form of advanced analytics that examines data or content to answer the question, “What is going to happen?” or, more precisely, “What is likely to happen?” It is characterized by techniques such as regression analysis, forecasting, multivariate statistics, pattern matching, predictive modeling, and other Machine Learning techniques. |
| Prescriptive Analytics | Prescriptive analytics techniques are employed to answer the question “What should we do?”. Common examples of prescriptive analytics are a combination of predictive analytics and rules, heuristics, and decision analysis methods. The output of a prescriptive analysis is a recommendation or automated action. |
| Product Analytics | Product analytics is a type of specialized business intelligence (BI) application that consumes service reports, product returns, warranties, customer feedback, and data from embedded sensors to help manufacturers evaluate product defects, identify opportunities for product improvements, detect patterns in usage or capacity of products, and link all of these factors to customer experience. Advanced techniques can be used in assessing the customer experience subjectively, like analysis of social media, organic growth, and popularity. |
| Sales Analytics | Sales analytics systems provide all types of analysis (descriptive, diagnostic, predictive, and prescriptive functions) to monitor sales activity and sales execution and to guide vendor strategies and management actions. |
| Self-Service Analytics | Self-service analytics is a form of business intelligence (BI) in which business professionals are enabled and encouraged to perform queries, data exploration, and generate reports on their own. Simple-to-use BI tools often characterize self-service analytics with basic analytic capabilities and an underlying data model that has been simplified or scaled down for ease of understanding and straightforward data access. |
| Social Analytics | Social analytics applications assist organizations in the process of collecting, measuring, analyzing, and interpreting the results of interactions and associations among people, topics, ideas, and other content types on social media. |
| Software Usage Analytics | Software usage analytics is application-generated data that contains detailed tracking and analysis of user interactions within an application. Its analysis, along with data about user’s profiles, provides insights used to improve user experience, prioritize feature enhancement, measure user adoption, track compliance, and provide real-time user help. |
| Speech Analytics | Speech analytics, also known as audio mining, analyzes keywords, phonetics, and transcriptions to extract insights from prerecorded and real-time voice streams. Analytics strategies can be used to classify calls, trigger alerts and workflows, and improve customer service performance. Research applications can analyze speeches in political and social scenarios. |
| Text Analytics | Text analytics is the process of deriving analytical insights from textual data sources. Insights can include determining and classifying the subjects of texts, summarizing texts, extracting key entities from texts, and identifying the tone or sentiment of texts. Text analytics can be applied to understanding customer satisfaction, measuring media repercussion, and extracting insights from large batches of documents. |
| Web Analytics | Web analytics refers to specialized analytic applications to analyze user behavior on web pages and improve strategies for user experience, visitor acquisition and actions, and performance of digital marketing and advertising campaigns. |
| Workplace Analytics | Workplace analytics refers to aggregated insight derived from a collection of data sources, tools, and processes to enhance the quality of the digital workplace. The collective insights enable improvements in business value, employee engagement, IT operational performance, and security risk mitigation. |
How Data is Managed
| Term | Definition |
|---|---|
| Application Data Management | Application data management (ADM) is a business discipline in which business and IT work together to ensure uniformity, accuracy, stewardship, governance, semantic consistency, and accountability for data. Application data is the consistent set of identifiers and attributes of the data maintained and/or used within an application. |
| Batch Processing | Batch processing is the processing of application programs and their data in large groups, with one job completing before the next starts. Batches are scheduled periodically and are used to handle recurring events such as shipping or payroll. Batch processing is also a common method for moving data from a source into a data warehouse, where the pipeline runs on a schedule and processes all the data generated since the last batch. |
| Stream Processing | Stream processing is computation performed on the fly on data as it arrives or is created in real time. Its purpose is streaming data integration or streaming analytics. Stream processing can be executed: 1. As new data arrives, using event-driven systems; 2. Shortly after it arrives, using real-time, on-demand queries; 3. Long after it has been stored, using on-demand queries on historical data. |
| Content Migration | Content migration refers to the process of consolidating and transferring unstructured content (files, documents, objects), along with related metadata, permissions, certificates, compounded structure, and linked components, stored permanently in one or more content repositories, to a new environment (cloud content services). This process often involves cleansing and archiving outdated files. |
| Data and Analytics Governance | Data and analytics governance is the specification of decision rights and an accountability framework to ensure appropriate behavior in the valuation, creation, storage, access, consumption, retention, and disposal of all information assets. It includes the standards, rules, and processes related to dealing with these assets. |
| Data and Analytics Services | Data and analytics services are the consulting, implementation, and managed services for decision support, analytics, and data management capabilities that support an organization’s data-driven activities. These services can range from delivering analytics and BI solutions to data governance, data management solutions, and infrastructure management. |
| Data Catalog | A data catalog is a technology used to build and maintain a data inventory through the discovery, organization, and description of datasets that are mostly (but not exclusively) related to the formation of data lakes. It provides context to help data analysts, data engineers, data scientists, managers, and other data consumers (including business users) locate a relevant dataset and understand what it means in order to extract business value. |
| Data Engineering | Data engineering is the practice of making the appropriate data accessible and available to various data users, like analysts and data scientists. The routine activities generally involve building data pipelines, applying extraction, transformation, and loading techniques considering data quality and governance standards to make relevant data available to users who will then extract value from it through business processes. |
| Data Integration | The discipline of data integration comprises the practices, architectural techniques, and tools for achieving the consistent access and delivery of data across the data sources and data structure types in the organization to meet the data consumption requirements of all applications and business processes. |
| Data Ops | Data ops is a methodology that applies automated processes and activities to improve data quality and shorten the data preparation cycle while protecting data privacy and access restrictions. It is becoming a complementary practice to data analytics and data science, delivering reliable data for insights. |
| Data Preparation | Data preparation is an iterative and agile process for exploring, combining, cleaning, and transforming raw data into curated datasets for data integration, data science, and analytics. Data preparation tools can accelerate time to insight by reducing the complexity of data preparation, finding patterns in integrated datasets, and letting users share their findings for further analysis without requiring direct access to the underlying infrastructure. |
| Data Profiling | Data profiling is a technology for discovering and investigating data quality issues such as lack of consistency, accuracy, and completeness. This is accomplished by analyzing data sources and collecting metadata that shows the condition of the data. It enables managers to investigate the origin of data errors. The tools provide statistics about the quality, such as degree of duplication and ratios of attribute values, both in tabular and graphical formats. |
| Data Quality Tools | Data quality is a collection of metrics and processes that ensure data can be trusted to deliver insights. The market for data quality tools has become highly visible in recent years as more organizations understand the impact of poor-quality data and seek solutions for improvement. Data quality tools can assess fields and perform simple, predefined, rule-based modifications to them. They were traditionally aligned with the cleansing of customer data (names and addresses) in support of CRM-related activities, but have since expanded their domains and integrations to assess the quality of data in many different sources. |
| Enterprise Information Management (EIM) Program | Enterprise information management (EIM) is an integrative discipline for structuring, describing, and governing data and analytics across organizational and technological boundaries in order to maximize business outcomes. EIM programs usually are led by senior business and information roles such as the chief data officer (CDO), and often start small with specific business-relevant programs but then expand over time. |
| Information Lifecycle Management | Information life cycle management (ILM) is an approach to data and storage management that recognizes that the value of information changes over time and that it must be managed accordingly. ILM seeks to classify data according to its business value and to establish policies that migrate, store, and archive data in the appropriate storage tier. |
| Master Data Management (MDM) | Master data management (MDM) is a technology-enabled discipline in which business and IT work together to **ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets**. Master data is the consistent and uniform set of identifiers and attributes that describes the core entities of the enterprise, including customers, leads, suppliers, sites, hierarchies, etc. |
| Metadata | Metadata is information that describes various facets of an information asset to improve its usability throughout its life cycle. Generally speaking, the more valuable the information asset, the more critical it is to manage its metadata, because it is the metadata definition that provides the understanding that unlocks the value of the data. Understanding and correctly managing this value is also part of the information life cycle management discipline. |
| Metadata Management Solutions | Metadata management solutions (MMS) are software applications that manage metadata with capabilities like metadata repositories, business glossary, data lineage, impact analysis, rule management, semantic frameworks, and metadata ingestion and translation from different data sources. |
| Model Management | Model management aims to streamline the prioritization, creation, operationalization, and execution of predictive and deterministic models. It is used by Machine Learning engineers to provide support for running models, and supports version and access control, model performance tracking, scheduled or condition-based recalibration of models, and also serves as a “library” to facilitate end-users’ access and reuse of completed models. |
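The data profiling entry above mentions statistics such as the degree of duplication and ratios of attribute values. As a minimal sketch of that idea, the snippet below computes row counts, duplicate counts, and per-column completeness over a small in-memory dataset; the column names and records are hypothetical, and real profiling tools work across whole data sources with far richer metrics.

```python
# Minimal data-profiling sketch: count exact duplicate rows and measure
# how complete (non-empty) each attribute is. Illustrative only.
def profile(rows: list[dict]) -> dict:
    total = len(rows)
    distinct = len({tuple(sorted(r.items())) for r in rows})
    completeness = {
        col: sum(1 for r in rows if r.get(col)) / total
        for col in rows[0]
    }
    return {"rows": total, "duplicates": total - distinct, "completeness": completeness}

customers = [
    {"name": "Ana",   "email": "ana@example.com"},
    {"name": "Bruno", "email": ""},                 # missing attribute value
    {"name": "Ana",   "email": "ana@example.com"},  # exact duplicate row
]
print(profile(customers))
```

Even a report this simple (3 rows, 1 duplicate, email only two-thirds complete) points a manager toward where data errors originate, which is exactly the role the definition describes.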
How Data is Used in an Organization
| Term | Definition |
|---|---|
| Business Process Analysis Tools | Business process analysis tools are used by business users to document, analyze and streamline complex processes, improve productivity, increase quality, become more agile and effective, and comply with business rules and laws. They enable users and managers to better understand business processes, events, workflows, and data using proven modeling techniques. |
| Decision Management | Decision management is the discipline of designing, building, and maintaining systems that produce structured decisions based on a representation of the decision-making process with inputs, algorithms, and results. Decision-making systems may be implemented with rule engines, optimization algorithms, or other kinds of machine learning algorithms. |
| Information Architecture | Information architecture is the set of requirements, principles, and models that define the current state, the future state, and the guidance necessary to share and exchange information assets flexibly and effectively throughout the organization in pursuit of a data-driven culture. |
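The decision management entry above describes systems that produce structured decisions from inputs, an explicit representation of the decision process, and results. A minimal sketch of one such representation, a rule engine, is shown below; the rules, field names, and thresholds are hypothetical examples, not a real credit policy.

```python
# Minimal decision-management sketch: a rule table evaluated in order,
# where the first matching condition determines the structured outcome.
RULES = [
    (lambda a: a["overdue_invoices"] > 0,        "reject"),
    (lambda a: a["annual_revenue"] >= 1_000_000, "approve"),
    (lambda a: True,                             "manual_review"),  # default rule
]

def decide(applicant: dict) -> str:
    for condition, outcome in RULES:
        if condition(applicant):
            return outcome
    return "manual_review"

print(decide({"overdue_invoices": 0, "annual_revenue": 2_500_000}))  # approve
```

Because the rules live in a plain, ordered table rather than in scattered code, they can be reviewed and changed by business stakeholders, which is the main appeal of rule engines over opaque models for auditable decisions.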