EC | Wise
SOFTWARE DESIGN & DEVELOPMENT
Definitions of Business Intelligence Terminology
-
Accuracy
-
The rate of correct predictions made by a model over a data set.
Accuracy is usually estimated by using an independent test set that was not used
at any time during the training process.
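As a minimal sketch of the definition above, accuracy can be computed by comparing predictions with the known labels of a held-out test set. The function name and the sample labels are hypothetical and chosen only for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical held-out test labels and the model's predictions for them.
test_labels = ["yes", "no", "yes", "yes", "no"]
predictions = ["yes", "no", "no", "yes", "no"]
print(accuracy(predictions, test_labels))  # 4 of 5 correct -> 0.8
```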
-
Algorithm
-
A procedure with a finite set of step-by-step instructions for solving a problem.
-
Attribute
-
A label that describes a type of data. The type denotes the legal
values of data represented by the attribute, usually represented as a non-identifying
column. Attributes can be categorical or continuous.
-
Attribute-based Dimensions
-
A dimension that
does not need to be hierarchical. In SQL Server 2005, dimensions can be based on many attributes,
each of which can be used for slicing and filtering queries. Attributes in a dimension
can be combined into hierarchies regardless of data relationships. Attributes allow
support for more traditional relational type querying. Traditional hierarchical
dimensions still exist, as does the parent-child dimension, but new types of dimensions
have also been added (Role Playing, Fact, Reference, Data Mining, and Many-to-Many).
For more information on dimensions, see UDM or www.ecwise.com/SQL2005_UDM.aspx.
-
Bias
-
A false association that results from the failure to account
for some skewing or influencing factor, or a tendency for the observed results to
deviate from the "true" results. Bias distorts results in a particular direction.
It may result from incomplete information or invalid collection methods, and may
be intentional or unintentional. An example is selection bias, which occurs when
data or subjects in a study are chosen in a way that can misleadingly increase or
decrease the strength of an association. Choosing experimental and control group
subjects from different populations would result in a selection bias.
-
Categorical variable
-
A qualitative variable that takes one of a finite number of values, which differ
in "description, label or order" rather than by amount or degree.
Ordinal data is categorical data that can be ordered, such as (extra large, large,
medium, small). Nominal data is categorical data that is unordered, such as name
or color (red, blue, and green).
-
Class conditional independence
-
Describes the case, when calculating the dependent variables,
that the independent variables do not depend on or affect each other; there is no
interaction between the variables and each variable has no effect on the values
of other variables.
-
Classification
-
A process of placing entities into predefined groups (called
classes) based on the values of the attributes of the entity. Entities are often
stored as records, with their attributes being based on one or more column values.
In general, classification is a mapping from unlabeled entities to (discrete) classes.
Classification processes have a defined algorithm (such as naive Bayes, decision tree, or
neural net), a set of predefined groups, and a process for interpreting results
(including missing values, unknowns, and outliers). It is a supervised learning process.
-
Clustering
-
A process of discovering reasonable groups or partitions with
which to organize your entities. It is an unsupervised learning process.
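The idea of discovering groups without being told them in advance can be sketched with a tiny one-dimensional k-means routine. K-means is only one of many clustering algorithms and is not named in this glossary; the function name, the seeding rule, and the customer spend figures are assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny 1-D k-means sketch: returns the sorted cluster centers."""
    ordered = sorted(values)
    # Seed the centers with evenly spaced points from the sorted data (an assumption).
    centers = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


# Hypothetical monthly spend per customer; no group labels are supplied.
spend = [12, 15, 14, 80, 85, 90]
print(kmeans_1d(spend, 2))  # one center near 13.7 and one at 85.0
```

The routine is never told which customers are "low" or "high" spenders; the two groups emerge from the data alone.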
-
Confidence
-
Describes the likelihood of an output variable having some value,
given a set of input variable values. Given an input value A and output value B,
the confidence is a measure, usually expressed as a percentage, of how
likely it is that B occurs when A has occurred.
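One common way to estimate confidence from data is the share of cases containing A that also contain B. The sketch below assumes shopping-basket transactions; the function name and the baskets are hypothetical.

```python
def confidence(transactions, a, b):
    """Estimate P(b | a): share of transactions containing a that also contain b."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        raise ValueError(f"no transaction contains {a!r}")
    return sum(1 for t in with_a if b in t) / len(with_a)


# Hypothetical shopping baskets.
baskets = [
    {"pretzels", "Coke"},
    {"pretzels", "Coke", "chips"},
    {"pretzels"},
    {"Coke"},
    {"pretzels", "Coke"},
]
print(f"{confidence(baskets, 'pretzels', 'Coke'):.0%}")  # 3 of 4 pretzel baskets -> 75%
```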
-
Containers (class, group)
-
A non-technical term used to explain target groups or classes.
-
Continuous variable
-
A quantitative variable that can assume an infinite number of
values associated with the numbers on a line interval. Normally continuous variables
are the result of some measurement process, such as precise age, height, temperature,
or distance. Continuous variables are usually represented by real numbers, but integers
are often treated as continuous in many real-world problems.
-
Correlation
-
A measure of association between two variables. It measures how
strongly the variables are related, or change, with each other. If two variables
tend to move up or down together, they are said to be positively correlated; if
they tend to move in opposite directions, they are said to be negatively correlated.
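A widely used measure of this kind is the Pearson correlation coefficient, which ranges from -1 (perfectly negative) to +1 (perfectly positive). The glossary does not name a specific coefficient, so treating Pearson as the measure is an assumption, as are the function name and sample figures.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical figures: sales rise with ad spend, leftover stock falls with it.
ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]
leftover_stock = [10, 8, 6, 4, 2]
print(pearson(ad_spend, sales))           # positively correlated
print(pearson(ad_spend, leftover_stock))  # negatively correlated
```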
-
Coverage
-
The portion of the data on which a model is used to make a prediction; a classifier
may not be able to confidently make predictions on the entire set of data.
-
Dimension (referring to OLAP)
-
An attribute or a collection of attributes that describe a property
of the data. A dimension can be thought of as the axis of an OLAP Cube and has business
significance. Dimensions can be hierarchical or based on attributes. The values
of the dimensions of an OLAP cube are used to address the measures located in the
cells of the cube. A time dimension could have the attributes Year, Month, and Day.
-
Data Cleansing
-
Improving the quality of data (by correcting incorrect data values,
removing incorrect data, adding missing data, mapping semantically equivalent data
etc.).
-
Deviation Analysis
-
A process for examining variance from the norm, or from what
was planned or expected. A budget depicts what you expect to spend (expenses) and
earn (revenue) over a time period. Budget deviation analysis regularly compares
what you expected, or planned, to earn and spend with what you actually spent and
earned. Credit Card Fraud analysis is another example that uses deviation analysis.
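The budget example above can be sketched as a simple comparison of planned and actual figures. The function name and the amounts are hypothetical.

```python
def budget_deviation(planned, actual):
    """Return actual minus planned for each budget line."""
    return {item: actual[item] - planned[item] for item in planned}


# Hypothetical budget for one period.
planned = {"revenue": 1000, "expenses": 700}
actual = {"revenue": 950, "expenses": 780}
for item, delta in budget_deviation(planned, actual).items():
    print(f"{item}: {delta:+d}")
```

A negative revenue deviation or a positive expense deviation marks a line that performed worse than planned.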
-
DTS
-
Data Transformation Services, Microsoft's version of ETL, which
stands for Extract, Transform, and Load. It is a set of database utilities used to
extract information from one database, transform it and load it into a second database.
These tools are particularly useful to aggregate data from different database suppliers
and to then populate data warehouses and OLAP applications with clean, consistent,
integrated and summarized data.
-
Entity (case, example, instance, record)
-
A data representation of an object of interest, represented as
a collection of attributes.
-
Feature
-
A characteristic of what's being represented by the data, specifically
an attribute and its data. Two examples: (marriage status, single) (age, 37). It
is also common usage to see feature used as being synonymous with attribute or to
describe a feature as a collection of attributes and their data. Example, a name
can have two parts (FirstName, John; LastName, Smith).
-
Data mining
-
An iterative, quantitative process of identifying potentially
useful, valid, and new patterns in data, usually for the purpose of prediction.
Unlike analytical processes that are primarily descriptive, data mining is a process
that attempts to determine numeric validity and attempts to measure accuracy, bias,
confidence, coverage, cost of analysis, probability, relevance etc.
-
KPI
-
A Key Performance Indicator is a visual representation
for building dashboards or reports that can show a measured value, a goal for the
value, a status (a range to represent performance from very poor to very good),
and a trend (improving, unchanging, getting worse). Signal lights and gauges are
commonly used visual representations for KPIs.
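The four parts of a KPI (value, goal, status, trend) can be modeled as a small data structure. This is only a sketch: the class name, the status thresholds, and the assumption that a higher value is better are all illustrative choices, not part of any particular dashboard product.

```python
from dataclasses import dataclass


@dataclass
class KPI:
    name: str
    value: float     # current measured value
    goal: float      # target for the value (assumed non-zero)
    previous: float  # value from the prior period, used for the trend

    def status(self):
        # Hypothetical thresholds; assumes a higher value is better.
        ratio = self.value / self.goal
        if ratio >= 1.0:
            return "good"
        if ratio >= 0.8:
            return "fair"
        return "poor"

    def trend(self):
        if self.value > self.previous:
            return "improving"
        if self.value < self.previous:
            return "getting worse"
        return "unchanging"


sales = KPI("Monthly sales", value=90.0, goal=100.0, previous=85.0)
print(sales.status(), sales.trend())  # fair improving
```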
-
Likelihood
-
Expressed as either a frequency or a probability. Frequency is
a measure of the rate at which events occur over time (e.g. # of sales/year). Probability
is a measure of the rate of a possible event expressed as a fraction of the total
number of events (e.g. # of sales/sales calls).
-
Linear relationship
-
A relationship where the output (the dependent variable Y) changes at a constant
rate with the input (the independent variable X); usually written as Y = aX
+ b. A linear relationship, if graphed, is a straight line.
-
Model
-
An abstraction that has a form and a process of interpretation
that summarizes (or partially summarizes) a set of data to help describe the data
or to facilitate prediction.
-
Measures (in OLAP)
-
Strictly speaking it represents measured data, such as sales,
but is also used in OLAP to represent derived data.
-
Non linear relationship
-
A relationship where the output is related to some power or power
series of the inputs. A graph of a non-linear data relationship is not a straight
line; an example is a curved line.
-
OLAP
-
Stands for On-Line Analytical Processing. An approach to building
analytic solutions that enables the analyst to interact with information that has
been transformed from normalized data into a form that highlights business dimensions
and often has a time based dimension. OLAP business dimensions are often organized
into hierarchies that facilitate drill down to detailed views or roll up to higher
level aggregated values. OLAP engines facilitate the exploration of data along several
(predetermined) dimensions. OLAP commonly uses intermediate data structures to store
pre-calculated results on multidimensional data, allowing fast computations. ROLAP
(relational OLAP) refers to performing OLAP using relational databases. SQL Server 2005 deliberately
blurs the line between relational and multi-dimensional OLAP databases. (See UDM
for a more in-depth explanation.)
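Rolling up along a time hierarchy can be illustrated with plain aggregation over a handful of fact rows. This sketch is not how an OLAP engine is implemented; the row layout, the function name, and the figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, month, region, sales).
facts = [
    (2005, 1, "East", 100),
    (2005, 1, "West", 150),
    (2005, 2, "East", 120),
    (2006, 1, "East", 130),
]


def roll_up(rows, key):
    """Sum the sales measure for each value of the chosen dimension key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


by_month = roll_up(facts, lambda r: (r[0], r[1]))  # detailed view (drill down)
by_year = roll_up(facts, lambda r: r[0])           # aggregated view (roll up)
print(by_month)
print(by_year)  # {2005: 370, 2006: 130}
```

An OLAP engine would typically pre-calculate and store such totals so that both views return quickly.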
-
Optimization
-
The selection of a better or best alternative among a number
of possible alternatives to maximize or minimize some objective of a system.
-
Prediction
-
A specific statement about the value of a dependent variable
based on some model and a set of independent variables.
-
Regression
-
A form of statistical modeling that attempts to evaluate the
relationship between one variable (termed the dependent variable) and one or more
other variables (termed the independent variables). Linear regression attempts to
explain or predict how the value of the dependent variable, Y, is affected by the
independent variable, X, with a straight line fit of the data. Assume each data
point in the data set can be represented by an (x, y) pair. Then a straight line
fit of all points is Y = a + bX + u, where u, called the regression residual, is a random
variable with a mean of zero. The coefficients a and b are selected so that the
sum of the squared residuals is as small as possible. The residual for any single
data point represents how far that point deviates from the regression line.
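The least-squares choice of a and b has a closed form for a single independent variable, sketched below using the same symbols as the definition. The function name and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data scattered around the line Y = 1 + 2X.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(f"a = {a:.2f}, b = {b:.2f}")
print("sum of squared residuals:", sum(u * u for u in residuals))
```

With an intercept in the model, the fitted residuals sum to zero (up to rounding), which matches the zero-mean property of u described above.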
-
Robust
-
An adjective used to describe a model or process which is relatively
insensitive to data fluctuations, variable choice, or the way it is carried out
by an analyst.
-
Supervised learning
-
A process in which a system's biases and weights are changed
in response to feedback. Given inputs similar to those observed during the training
phase, the algorithm determines the correct outputs. We perform supervised learning
using a training set comprised of selected data (input and target output) to train
the system to recognize patterns. A "trained" algorithm is able to predict when
a new unknown input will again result in a particular output. Classification is
a subset of supervised learning.
-
Unsupervised learning
-
A process in which the model doesn't need to know what the target groupings will
be. Its training set does not include the desired output, so the system must self-adjust
based on the inputs, outputs and existing biases built into the model. If
you were searching for similarities or patterns in data collected about customers,
and you were not aware of the best groupings, you could use an unsupervised learning
model (e.g. clustering) to group the customers based on their profiles, with the
hope of identifying customer groups.
-
Support
-
Often used to refer to the percentage of transactions in which a subset of all
possible entities appears. For example, let T be the number
of transactions (200 customers' shopping baskets) and I the number that contain the subset (40 baskets
with both pretzels and Coke). Then the support for the subset {pretzels,
Coke} is I/T * 100% = (40/200) * 100% = 20%.
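The same arithmetic can be sketched in code by counting the baskets that contain every item in the subset. The function name is hypothetical, and the contents of the 160 baskets that lack the pair are invented filler so that the totals match the example (200 baskets, 40 with both items).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    if not transactions:
        raise ValueError("need at least one transaction")
    wanted = set(itemset)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)


# 200 baskets, 40 of which contain both pretzels and Coke.
# The other 160 baskets are hypothetical filler.
baskets = (
    [{"pretzels", "Coke"}] * 40
    + [{"pretzels"}] * 60
    + [{"bread"}] * 100
)
print(f"{support(baskets, {'pretzels', 'Coke'}):.0%}")  # 40/200 -> 20%
```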
-
Training set
-
A set of data used to teach a model how to identify a pattern
P from a set of inputs I1..IN. The inputs (and, for supervised learning, the known target
outcomes for the inputs) are included in the training set.
-
Transparency
-
A measure of how easy it is for a person to understand the decisions
made by a model.
-
UDM
-
Short for Unified Dimensional Model, a set of cubes and
dimensions defined in Microsoft SQL Server 2005. Traditional OLAP was more analytical
than operational and provided faster performance through aggregation, along with fast navigation,
drill down, and summation through hierarchies. However, traditional OLAP doesn't support
the operational richness of real-time ad hoc querying against many attributes, many-to-many
relationships, and other RDBMS features. UDM allows the database architect
to create a single dimensional model and set of cubes and then select the latency,
number of attributes, and model they want to use. Essentially, the architect is selecting
a configuration that is pure OLAP, pure relational, or somewhere in between, based
on business needs. The somewhere in between is what is new and unique about UDM.
Copyright © 2008, EC Wise, Inc. All Rights Reserved.