Thursday, August 31, 2017

Classifying data into predefined categories

Input and output for a classification problem

  • The input to a classification problem is called a feature and the output is called a label
  • The problem statement and the training data are where we spend a large amount of time

Let's talk about 2 types of problems

  Problem statement 1
     Classify a problem instance such as an email, a tweet or a trading day
  • Email: spam or ham
  • Tweet: positive or negative
  • Trading day: up-day or down-day

  Problem statement 2
  • Build the black-box classifier. What happens inside this black box is represented using mathematical rules or equations, and it is called a model

  • Every data point that we see needs to be represented as numerical attributes
  • The algorithms only accept numerical input
  • Even text and images are represented using numbers
      Training Phase
  • We take a large amount of historical data. This is a set of problem instances that are correctly labelled. Ex: emails that are marked as 
  • Each item of training data is a tuple of features and a label
  • The patterns the classifier learns in this phase are called the model
           Note: Not all classifiers follow this
  • If the required output (the label) is known for the training data, it is called supervised learning
      Test Phase
  • Here we actually classify data that we have not seen before
Note: Most of these algorithms are available as pre-built libraries for platforms like Python, R or Spark
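
The training and test phases can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the "model" here is a nearest-centroid rule chosen for brevity, the feature values (suspicious word count, link count) are invented, and the names `train` and `predict` are hypothetical. In practice one of the pre-built libraries would normally be used.

```python
from collections import defaultdict
import math

# Hypothetical training data: each item is a (features, label) tuple.
# The two features are invented: (suspicious word count, link count).
training_data = [
    ((8.0, 5.0), "spam"),
    ((6.0, 7.0), "spam"),
    ((1.0, 0.0), "ham"),
    ((0.0, 1.0), "ham"),
]

def train(examples):
    """Training phase: learn one centroid (mean feature vector) per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        # zip(*rows) iterates column by column over the feature vectors.
        model[label] = tuple(sum(col) / len(rows) for col in zip(*rows))
    return model

def predict(model, features):
    """Test phase: return the label whose centroid is closest to `features`."""
    return min(model, key=lambda label: math.dist(model[label], features))

model = train(training_data)
print(predict(model, (7.0, 4.0)))   # a data point not seen during training
print(predict(model, (0.5, 0.5)))
```

The dictionary returned by `train` plays the role of the model: it is the only thing `predict` needs once training is finished.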

Algorithms for classification problems
  • Naive Bayes
  • Support Vector Machines
  • Decision Trees
  • K-Nearest Neighbors
  • Random Forest
  • Logistic Regression
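
As an illustration of one algorithm from the list, here is a rough sketch of K-Nearest Neighbors in plain Python. The trading-day features (previous day's return, volume change) and the function name `knn_predict` are assumptions made up for this example; a real project would more likely rely on a library implementation.

```python
from collections import Counter
import math

def knn_predict(training_data, features, k=3):
    """Classify `features` by majority vote among the k nearest training points."""
    nearest = sorted(training_data, key=lambda ex: math.dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: (previous day's return %, volume change %) with a label.
training_data = [
    ((1.2, 10.0), "up-day"),
    ((0.8, 5.0), "up-day"),
    ((1.5, 8.0), "up-day"),
    ((-1.0, -4.0), "down-day"),
    ((-0.7, -6.0), "down-day"),
    ((-1.3, -2.0), "down-day"),
]

print(knn_predict(training_data, (1.0, 7.0)))
print(knn_predict(training_data, (-0.9, -5.0)))
```

K-Nearest Neighbors keeps the training data itself and does the work at prediction time, so it has no separate step that builds a compact model.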
Term Frequency Representation
 This is how we take text input and represent it as a set of numeric data
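
A minimal sketch of the idea, assuming simple whitespace tokenization and raw word counts (some definitions divide the counts by the document length instead). The function name `term_frequency_vectors` and the sample sentences are made up for illustration.

```python
from collections import Counter

def term_frequency_vectors(documents):
    """Turn each document into a list of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word across all documents, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["win money now", "meeting at noon", "win win prizes now"]
vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)
for vec in vectors:
    print(vec)
```

Each document becomes a numeric vector of the same length, which is the form a classification algorithm can accept as features.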
