Thursday, August 31, 2017

Classifying data into predefined categories


Input and output for a classification problem


  • The input to a classification problem is a set of features, and the output is called a label
  • Defining the problem statement and preparing the training data is where we spend a large amount of our time

Let's talk about two problem statements

     Problem statement 1
     Define the problem: what is being classified (an email, a tweet or a trading day) and into which categories
  • Email: spam or ham
  • Tweet: positive or negative
  • Trading day: up-day or down-day


     Problem statement 2
  • Build the black-box classifier. What happens inside this black box is represented using mathematical rules or equations, and it is called a model

     Features
  • Every data point needs to be represented as numerical attributes
  • The algorithms only accept numerical input
  • Even text and images are represented as numbers
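
As a rough illustration of turning a data point into numerical attributes, here is a minimal sketch for the email case. The function name and the three features it computes are hypothetical choices made for this example, not something the notes prescribe.

```python
# Hypothetical feature extractor: the three attributes below are
# illustrative assumptions, not a recommended feature set.
def email_to_features(text):
    """Represent an email as a list of numeric attributes."""
    words = text.lower().split()
    return [
        len(words),                   # number of whitespace-separated words
        text.count("!"),              # number of exclamation marks
        1 if "free" in words else 0,  # 1 if the word "free" appears, else 0
    ]

features = email_to_features("Win a FREE prize now!!!")
print(features)  # [5, 3, 1]
```

Whatever attributes are chosen, the point is that the classifier only ever sees the list of numbers, never the raw text.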
     Training
  • We take a large amount of historical data. This is a set of problem instances that are correctly labelled. Ex: emails that are marked as
  • Each item of training data is a tuple of features and a label
  • The patterns the classifier learns in this phase are called the model
           Note: Not all algorithms follow this
  • If the required output (the label) is known for the training data, this is known as supervised learning
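
The training phase can be sketched in a few lines. This example assumes a very simple technique, a nearest-centroid style model that stores the average feature vector of each label; it is only one possible way to "learn patterns", and the data and feature meanings are made up for illustration.

```python
from collections import defaultdict

# Made-up training data: each item is a (features, label) tuple.
# The features are assumed to be [exclamation_marks, contains_free].
training_data = [
    ([5, 1], "spam"),
    ([3, 1], "spam"),
    ([0, 0], "ham"),
    ([1, 0], "ham"),
]

def train(data):
    """Learn a 'model': the average feature vector (centroid) of each label."""
    grouped = defaultdict(list)
    for features, label in data:
        grouped[label].append(features)
    model = {}
    for label, rows in grouped.items():
        # zip(*rows) walks the rows column by column
        model[label] = [sum(column) / len(rows) for column in zip(*rows)]
    return model

model = train(training_data)
print(model)  # {'spam': [4.0, 1.0], 'ham': [0.5, 0.0]}
```

Because every training item carries its correct label, this is a small instance of supervised learning.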
     Test Phase
  • Here we actually classify data that we have not seen before
Note: Most of these algorithms are available as pre-built libraries for platforms like Python, R or Spark
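
To make the test phase concrete, here is a small sketch that classifies held-out examples and measures how many it gets right. It deliberately uses only the Python standard library rather than a pre-built library, and it assumes a k-nearest-neighbours vote as the classifier; the data, the `classify` helper and the choice of k=3 are all assumptions for illustration.

```python
from collections import Counter

# Made-up labelled data: ([feature_1, feature_2], label) tuples.
training_data = [
    ([5, 1], "spam"), ([4, 1], "spam"), ([6, 0], "spam"),
    ([0, 0], "ham"), ([1, 0], "ham"), ([0, 1], "ham"),
]
# Held-out examples that were not used for training.
test_data = [([5, 0], "spam"), ([1, 1], "ham")]

def squared_distance(p, q):
    """Squared Euclidean distance; enough for ranking neighbours."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify(features, data, k=3):
    """Label a point by majority vote among its k nearest training points."""
    nearest = sorted(data, key=lambda item: squared_distance(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

predictions = [classify(features, training_data) for features, _ in test_data]
correct = sum(p == label for p, (_, label) in zip(predictions, test_data))
accuracy = correct / len(test_data)
print(predictions, accuracy)  # ['spam', 'ham'] 1.0
```

Keeping the test examples separate from the training examples is what makes the accuracy figure a fair estimate of behaviour on unseen data.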


Algorithms for classification problems
  • Naive Bayes
  • Support Vector Machines
  • Decision Trees
  • K-Nearest Neighbors
  • Random Forest
  • Logistic Regression
Term Frequency Representation
 This is how we take text input and represent it as a set of numeric data
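
A minimal sketch of the idea, assuming that "term frequency" means raw word counts over a shared vocabulary (some definitions also divide by the document length, which is not done here). The function name and the sample documents are made up for illustration.

```python
from collections import Counter

def term_frequencies(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, sorted so the column order is stable.
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # A Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = term_frequencies(["free prize free", "lunch at noon"])
print(vocabulary)  # ['at', 'free', 'lunch', 'noon', 'prize']
print(vectors)     # [[0, 2, 0, 0, 1], [1, 0, 1, 1, 0]]
```

Each document becomes a row of numbers of the same length, which is the form a classifier can accept as features.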



