Definitions of Business Intelligence Terminology

  • Accuracy
  • The rate of correct predictions made by a model over a data set. Accuracy is usually estimated by using an independent test set that was not used at any time during the training process.
  • Algorithm
  • A procedure with a finite set of step-by-step instructions for solving a problem.
  • Attribute
  • A label that describes a type of data. The type denotes the legal values of the data represented by the attribute, which is usually stored as a non-identifying column. Attributes can be categorical or continuous.
  • Attribute-based Dimensions
  • A dimension that does not need to be hierarchical. In SQL Server 2005, dimensions can be based on many attributes, each of which can be used for slicing and filtering queries. Attributes in a dimension can be combined into hierarchies regardless of data relationships. Attributes allow support for more traditional relational-type querying. Traditional hierarchical dimensions still exist, as does the parent-child dimension, but new types of dimensions have also been added (Role Playing, Fact, Reference, Data Mining, and Many to Many). For more information on dimensions, see UDM or www.ecwise.com/SQL2005_UDM.aspx.
  • Bias
  • A false association that results from the failure to account for some skewing or influencing factor, or a tendency for the observed results to deviate from the "true" results. Bias distorts results in a particular direction. It may result from incomplete information or invalid collection methods, and may be intentional or unintentional. An example is selection bias, which occurs when data or subjects in a study are chosen in a way that can misleadingly increase or decrease the strength of an association. Choosing experimental and control group subjects from different populations would result in a selection bias.
  • Categorical variable
  • A variable that takes one of a finite number of qualitative values, which differ in "description, label or order" rather than by amount or degree. Ordinal data is categorical data that can be ordered, such as (extra large, large, medium, small). Nominal data is categorical data that is unordered, such as name or color (red, blue, and green).
  • Class conditional independence
  • Describes the assumption that, given the class (the value of the dependent variable), the independent variables do not depend on or affect each other; there is no interaction between the variables, and each variable has no effect on the values of the other variables.
  • Classification
  • A process of placing entities into predefined groupings (called classes) based on the values of the attributes of the entity. Entities are often stored as records, with their attributes being based on one or more column values. In general, classification is a mapping from unlabeled entities to (discrete) classes. A classification process has a defined algorithm (such as naive Bayes, decision tree, or neural net), a set of predefined groups, and a process for interpreting results (including missing values, unknowns, and outliers). It is a supervised learning process.
  • Clustering
  • A process of discovering reasonable groups or partitions with which to organize your entities. It is an unsupervised learning process.
  • Confidence
  • Describes the likelihood of an output variable having some value, given a set of input variable values. Given an input value A and output value B, the confidence is a measure, usually expressed as a percentage, of how likely it is that B occurs when A has occurred.
  • Containers (class, group)
  • A non-technical term used to explain target groups or classes.
  • Continuous variable
  • A quantitative variable that can assume an infinite number of values, corresponding to the points on an interval of the number line. Continuous variables are normally the result of some measurement process, such as precise age, height, temperature, or distance. They are usually represented by real numbers, although integers are often treated as continuous in many real-world problems.
  • Correlation
  • A measure of association between two variables. It measures how strongly the variables are related, or change, with each other. If two variables tend to move up or down together, they are said to be positively correlated; if they tend to move in opposite directions, they are said to be negatively correlated.
  • Coverage
  • The portion of the data on which a model is used to make a prediction; a classifier may not be able to confidently make predictions on the entire set of data.
  • Dimension (referring to OLAP)
  • An attribute or a collection of attributes that describe a property of the data. A dimension can be thought of as the axis of an OLAP Cube and has business significance. Dimensions can be hierarchical or based on attributes. The values of the dimensions of an OLAP cube are used to address the measures located in the cells of the cube. A time dimension could have the attributes Year, Month, and Day.
  • Data Cleansing
  • Improving the quality of data (by correcting incorrect data values, removing incorrect data, adding missing data, mapping semantically equivalent data etc.).
  • Deviation Analysis
  • A process for examining variance from the norm, or from what was planned or expected. A budget depicts what you expect to spend (expenses) and earn (revenue) over a time period. Budget deviation analysis regularly compares what you expected, or planned, to earn and spend with what you actually earned and spent. Credit card fraud analysis is another example that uses deviation analysis.
  • DTS
  • Data Transformation Services is Microsoft's version of ETL, which stands for Extract, Transform, and Load. It is a set of database utilities used to extract information from one database, transform it, and load it into a second database. These tools are particularly useful for aggregating data from different database suppliers and then populating data warehouses and OLAP applications with clean, consistent, integrated, and summarized data.
  • Entity (case, example, instance, record)
  • A data representation of an object of interest, represented as a collection of attributes.
  • Feature
  • A characteristic of what's being represented by the data, specifically an attribute and its data. Two examples: (marriage status, single) and (age, 37). It is also common to see feature used as a synonym for attribute, or to describe a collection of attributes and their data. For example, a name can have two parts (FirstName, John; LastName, Smith).
  • Data mining
  • An iterative, quantitative process of identifying potentially useful, valid, and new patterns in data, usually for the purpose of prediction. Unlike analytical processes that are primarily descriptive, data mining attempts to determine numeric validity and to measure accuracy, bias, confidence, coverage, cost of analysis, probability, relevance, etc.
  • KPI
  • A Key Performance Indicator is a visual representation for building dashboards or reports that can show a measured value, a goal for the value, a status (a range to represent performance from very poor to very good), and a trend (improving, unchanging, getting worse). Signal lights and gauges are commonly used visual representations for KPIs.
  • Likelihood
  • Expressed as either a frequency or a probability. Frequency is a measure of the rate at which events occur over time (e.g. # of sales/year). Probability is a measure of the rate of a possible event expressed as a fraction of the total number of events (e.g. # of sales/sales calls).
  • Linear relationship
  • A relationship where the output (the dependent variable Y) changes at a constant rate with the input (the independent variable X); usually written as Y = a + bX. A linear relationship, if graphed, is a straight line.
  • Model
  • An abstraction that has a form and a process of interpretation that summarizes (or partially summarizes) a set of data to help describe the data or to facilitate prediction.
  • Measures (in OLAP)
  • Strictly speaking, a measure represents measured data, such as sales, but the term is also used in OLAP to represent derived data.
  • Nonlinear relationship
  • A relationship where the output is related to some power or power series of the inputs. A graph of a nonlinear relationship is not a straight line; a curved line is one example.
  • OLAP
  • Stands for On-Line Analytical Processing. An approach to building analytic solutions that enables the analyst to interact with information that has been transformed from normalized data into a form that highlights business dimensions and often has a time-based dimension. OLAP business dimensions are often organized into hierarchies that facilitate drill down to detailed views or roll up to higher-level aggregated values. OLAP engines facilitate the exploration of data along several (predetermined) dimensions. OLAP commonly uses intermediate data structures to store pre-calculated results on multidimensional data, allowing fast computations. ROLAP (relational OLAP) refers to performing OLAP using relational databases. SQL Server 2005 deliberately blurs the line between relational and multidimensional OLAP databases. (See UDM for a more in-depth explanation.)
  • Optimization
  • The selection of a better or best alternative among a number of possible alternatives to maximize or minimize some objective of a system.
  • Prediction
  • A specific statement about the value of a dependent variable based on some model and a set of independent variables.
  • Regression
  • A form of statistical modeling that attempts to evaluate the relationship between one variable (termed the dependent variable) and one or more other variables (termed the independent variables). Linear regression attempts to explain or predict how the value of the dependent variable, Y, is affected by the independent variable, X, with a straight line fit of the data. Assume each data point in the data set can be represented by an (x, y) pair. A straight line fit of all points is then Y = a + bX + u, where u, called the regression residual, is a random variable with a mean of zero. The coefficients a and b are selected so that the sum of the squared residuals is as small as possible. The residual for any single data point represents how far that point deviates from the regression line.
  • Robust
  • An adjective used to describe a model or process which is relatively insensitive to data fluctuations, variable choice, or the way it is carried out by an analyst.
  • Supervised learning
  • A process in which a system's biases and weights are changed in response to feedback. Given inputs similar to those observed during the training phase, the algorithm determines the correct outputs. We perform supervised learning using a training set comprised of selected data (input and target output) to train the system to recognize patterns. A "trained" algorithm is able to predict when a new unknown input will again result in a particular output. Classification is a subset of supervised learning.
  • Unsupervised learning
  • A model that doesn't need to know what the target groupings will be. Its training set does not include the desired output, so the system must self-adjust based on the inputs, outputs, and existing biases built into the model. If you were searching for similarities or patterns in data collected about customers, and you were not aware of the best groupings, you could use an unsupervised learning model (e.g. clustering) to group the customers based on their profiles, with the hope of identifying customer groups.
  • Support
  • Often used to refer to the percentage of times a subset of all possible entities appears across a set of transactions. For example, given a set of transactions T (200 customers' shopping baskets) and a set of items I (40 examples of both pretzels and Coke in a shopping basket), the support for the subset {pretzels, Coke} is (I/T) * 100% = (40/200) * 100% = 20%.
  • Training set
  • A set of data used to teach a model how to identify a pattern P from a set of inputs I1..IN. The inputs (and for supervised learning, the predicted outcomes from the inputs) are included in the training set.
  • Transparency
  • A measure of how easy it is for a person to understand the decisions made by a model.
  • UDM
  • Short for Unified Dimensional Model, a set of cubes and dimensions defined in Microsoft SQL Server 2005. Traditional OLAP was more analytical than operational; it provided faster performance through aggregation, and fast navigation, drill down, and summation through hierarchies. However, traditional OLAP doesn't support the operational richness of real-time ad hoc querying against many attributes, many-to-many relationships, and other RDBMS features. UDM allows the database architect to create a single dimensional model and set of cubes and then select the latency, number of attributes, and model they want to use. Essentially, the architect is selecting a configuration that is pure OLAP, pure relational, or somewhere in between, based on business needs. The somewhere in between is what is new and unique about UDM.
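The Support and Confidence entries above can be made concrete with a small sketch. The basket data is invented to match the glossary's numbers (200 baskets, 40 of them containing both pretzels and Coke). The function names `support` and `confidence`, and the assumption that 50 baskets contain pretzels at all, are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List

Basket = FrozenSet[str]


def support(baskets: List[Basket], items: Basket) -> float:
    """Percentage of baskets that contain every item in `items`."""
    if not baskets:
        raise ValueError("need at least one basket")
    hits = sum(1 for basket in baskets if items <= basket)
    return 100.0 * hits / len(baskets)


def confidence(baskets: List[Basket], given: Basket, then: Basket) -> float:
    """Percentage of baskets containing `given` that also contain `then`."""
    with_given = [basket for basket in baskets if given <= basket]
    if not with_given:
        raise ValueError("no basket contains the given items")
    with_both = sum(1 for basket in with_given if then <= basket)
    return 100.0 * with_both / len(with_given)


# Hypothetical data: 40 baskets with both items, 10 with pretzels only,
# and 150 with neither, for 200 baskets in total.
baskets: List[Basket] = (
    [frozenset({"pretzels", "Coke"})] * 40
    + [frozenset({"pretzels"})] * 10
    + [frozenset({"bread"})] * 150
)

pair_support = support(baskets, frozenset({"pretzels", "Coke"}))
coke_given_pretzels = confidence(
    baskets, frozenset({"pretzels"}), frozenset({"Coke"})
)
print(pair_support)         # 20.0, matching (40/200) * 100%
print(coke_given_pretzels)  # 80.0, since 40 of the 50 pretzel baskets include Coke
```

Support is measured against all transactions, while confidence is measured only against the transactions that already contain the input items, which is why the two figures differ for the same pair.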
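The straight line fit described under Regression can be sketched with the standard closed-form least-squares formulas. The function name `fit_line` and the sample points are hypothetical; `a` and `b` follow the entry's notation Y = a + bX + u, with `a` as the intercept and `b` as the slope.

```python
from typing import List, Tuple


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (a, b) minimizing the sum of squared residuals of y = a + b*x."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b


# Hypothetical (x, y) pairs scattered around the line y = 1 + 2x.
points = [(1.0, 2.9), (2.0, 5.2), (3.0, 6.9), (4.0, 9.0)]
a, b = fit_line(points)

# Residual u for each point: how far the point lies from the fitted line.
residuals = [y - (a + b * x) for x, y in points]
print(a, b)
print(residuals)
```

For a fit that includes an intercept, the residuals sum to zero up to floating-point rounding, which is the sample counterpart of the entry's statement that u has a mean of zero.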
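The Accuracy entry above describes estimating accuracy on an independent test set. The following is a minimal sketch of that calculation. The threshold rule `predict`, the cutoff of 30, and the labeled records are all hypothetical stand-ins for a trained model and real held-out data.

```python
from typing import Callable, List, Tuple


def accuracy(model: Callable[[float], str],
             test_set: List[Tuple[float, str]]) -> float:
    """Fraction of test records whose predicted label equals the true label."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for value, label in test_set if model(value) == label)
    return correct / len(test_set)


def predict(age: float) -> str:
    """Hypothetical 'trained' model: a single threshold on age."""
    return "buyer" if age >= 30 else "non-buyer"


# Hypothetical held-out records (age, true label), not used during training.
test_set = [
    (25, "non-buyer"),
    (42, "buyer"),
    (31, "non-buyer"),  # the model gets this one wrong
    (19, "non-buyer"),
    (55, "buyer"),
]

test_accuracy = accuracy(predict, test_set)
print(test_accuracy)  # 0.8, since 4 of the 5 predictions are correct
```

Keeping the test records separate from the training data matters because a model scored on the records it was trained on will usually look more accurate than it really is.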
Copyright © 2008, EC Wise, Inc. All Rights Reserved.