# Before Class Question: June 19

Spend 5 minutes thinking about the following questions and pick one to answer:

1. What is the difference between classification and clustering?

2. What statistical tests (or values) did we use to evaluate a decision tree? What about clustering?

Classification involves assigning predefined labels to groups of data, while clustering lets the software group the data and has labels applied ex post. Classification is subjective while clustering is objective. We use the chi-squared and Gini index tests on decision trees, and sum-of-squares error for clustering.

With classification, there is a set of predefined classes and we want to know which class a new object belongs to. Clustering tries to group a set of objects and find whether there is some relationship between the objects. Basically, classification is supervised learning, while clustering is unsupervised. Gini impurity and information gain are used with decision trees, while sum of squares and the Rand index are used with clustering.
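To make the two decision-tree measures mentioned above concrete, here is a minimal sketch in plain Python. The function names and the toy "yes"/"no" labels are made up for illustration, and this uses the standard textbook definitions (Gini impurity as one minus the sum of squared class proportions, information gain as the drop in entropy after a split), which may differ in detail from what the course software reports.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """1 - sum(p_k^2): 0.0 for a pure node, larger for a mixed node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class proportions, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with a 50/50 class mix, and a split that separates it perfectly.
parent = ["yes", "yes", "yes", "no", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]

print(gini_impurity(parent))            # 0.5
print(gini_impurity(split[0]))          # 0.0
print(information_gain(parent, split))  # 1.0
```

A split that produces purer child nodes lowers the impurity and raises the information gain, which is why a tree-building algorithm prefers it.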

Thinking intuitively, I would assume both statistical tests could be used for both decision trees and clustering. The chi-square test helps with modeling goodness of fit; therefore, in both decision trees and clustering, we want data that best represents the likelihood of the relationship with its group (p-value). The Gini index shows how much inequality there is in the values; therefore, in both decision trees and clustering, we want the data to be as closely related as possible (index close to one).
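As a rough illustration of the chi-square idea, the sketch below computes the Pearson chi-square statistic for a 2x2 table of split branch versus outcome. The table values and the function name are invented for the example, and the p-value shortcut via `erfc` is only valid for one degree of freedom (a 2x2 table); a statistics library would be the usual choice for anything larger.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Return (statistic, p_value) for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if branch and outcome were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Upper-tail probability of chi-square with 1 degree of freedom only.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical split: rows are the two branches, columns are the two outcomes.
strong_split = [[30, 10], [10, 30]]
useless_split = [[20, 20], [20, 20]]

print(chi_square_2x2(strong_split))
print(chi_square_2x2(useless_split))
```

A small p-value suggests the split is related to the outcome rather than separating the data at random.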

In classification, you have a set of pre-labeled data and want to know which class or group each object belongs to. Clustering forms groups from a set of objects according to the relationships between those objects. Clustering can also shrink the size of the data, because it summarizes each group so that it can be treated as a single data point.

We used the chi-squared and Gini index tests for decision trees, and the sum-of-squares error (SSE) for clustering.
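For anyone who wants to see what SSE measures, here is a small sketch assuming the usual within-cluster definition: the sum of squared distances from each point to the centroid of its own cluster. The points and the two groupings are made up for the example.

```python
def sse(clusters):
    """Within-cluster sum of squared errors for a list of clusters of points."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total


# The same four hypothetical points, grouped two different ways.
tight = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (8.0, 9.0)]]
loose = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (8.0, 9.0)]]

print(sse(tight))  # 1.0
print(sse(loose))  # 98.0
```

The lower SSE for the first grouping reflects tighter, more cohesive clusters.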

1) A difference between the two is that classification is manual, where categories are created beforehand for the data; clustering, however, uses the data to figure out where it should go without external specifications.

2) For decision trees we can use the Gini coefficient; for clustering, the sum-of-squares error.

Clustering uses all attributes of the data to find patterns in that data in order to form groups of related elements. Classification, however, is essentially a learning process which uses a model to group data elements. It will automatically classify data after learning the model, whereas in clustering someone may manually place data in groups. For classification, the chi-squared test and Gini index are used for analysis. Clustering uses the sum-of-squares error.

1. Classification is a manual way of finding where someone belongs, and clustering uses the data to find where someone belongs.

2. Decision Tree: Chi-Squared Test & Gini index; Clustering: Sum-of-Squares Error (SSE)

Classification starts from a set of predefined classes and sorts objects into the corresponding class. Clustering groups different objects, but really only separates them by some similar attributes.

With classification you are given pre-labeled data and have to work out which group each item belongs to. Clustering groups different objects, yet differentiates them only by their similar attributes.

Classification is when you have groups that have been set and defined and are using them to find where a data element might belong. In the classification analysis, we decide where the data should be classified.

Clustering uses more artificial intelligence. Here the analyst does not need to figure out where a data element might belong. It is more of an automated process that groups data elements together based on a predefined set of values.

Classification requires more end-user work: the analyst has more to do to classify the data, compared to clustering, where the software does it automatically based on predefined values already assigned to different groups.

Classification uses predefined data labels that help you determine which group to sort the data into. Clustering groups data based on relationships.

Clustering is when you form groups of related data elements by using attributes to find similar patterns in the data, whereas classification is determining which group a data element belongs to based on the data contained within that element. So, the data is pre-labeled before it is placed in a group.
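The contrast several answers describe (labels known in advance versus groups discovered from the data) can be sketched in a few lines of plain Python. Everything here is illustrative: the spending figures and "low"/"high" labels are invented, a nearest-centroid rule stands in for a real classifier such as a decision tree, and the k-means loop uses a naive "first k points" initialization chosen only to keep the example deterministic.

```python
def train_nearest_centroid(points, labels):
    """Supervised: average the points that share each known label."""
    groups = {}
    for x, label in zip(points, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def classify(model, x):
    """Assign a new value to the label whose average is closest."""
    return min(model, key=lambda label: abs(x - model[label]))


def kmeans_1d(points, k, iterations=10):
    """Unsupervised: no labels are given; groups emerge from the data."""
    centroids = list(points[:k])  # naive initialization, an assumption for this sketch
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters


# Hypothetical monthly spending values.
spend = [10.0, 95.0, 12.0, 14.0, 90.0, 100.0]

# Classification: the labels are supplied up front.
labels = ["low", "high", "low", "low", "high", "high"]
model = train_nearest_centroid(spend, labels)
print(classify(model, 20.0))  # low

# Clustering: the same data with no labels; naming the groups is left to the analyst.
print(kmeans_1d(spend, 2))
```

Note that the clustering output is just two unnamed groups; calling them "low" and "high" is an interpretation applied afterward.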

It seems to me that the main difference between clustering and classification is that they are essentially approaching the same problem from different ends.

Classification, as everyone has been saying, ‘learns’ from the data by focusing on select attributes to create and assess patterns. This strikes me as a more ‘bottom-up’ method. It tackles the data as a whole and looks at it as such. Clustering, while relatively similar, is a distinctly different approach.

Clustering takes a more ‘top-down’ approach, breaking the population into more manageable groups. These groups can then be looked at individually to learn about what makes that group similar or different from others.