People often ask for advice online. Let's see if we can find some new customers by training Labelf to find people asking questions about text classification. We would also like to do some market analysis and see how various classification tasks grow over time, to decide whether our next feature should be images or Named Entity Recognition (classifying words in texts such as names, places, medicinal dosages, organizations, etc.).
Our goal is twofold: finding customers and understanding potential new markets.
Our dataset will consist of posts from various sources on the web. We first do a rough filtering by searching for keywords such as "classification" and "labeling". This gives us a dataset with about 120,000 posts.
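To make the rough filtering step concrete, here is a minimal sketch in Python. It is not how our pipeline is actually implemented; the keyword list, the `rough_filter` helper, and the sample posts are made up for illustration.

```python
import re

# Hypothetical keyword list; the post only names "classification" and "labeling".
KEYWORDS = ("classification", "labeling")

# Match any keyword as a case-insensitive substring of the post.
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in KEYWORDS), re.IGNORECASE
)


def rough_filter(posts):
    """Keep only the posts that mention at least one keyword."""
    return [post for post in posts if KEYWORD_PATTERN.search(post)]


# Made-up sample posts, not taken from the real dataset.
sample_posts = [
    "How do I get started with text classification?",
    "Best pizza places in Stockholm?",
    "Labeling requirements for food classification in the EU",
]

matches = rough_filter(sample_posts)
for post in matches:
    print(post)
```

The first and third sample posts both pass the filter, even though the third has nothing to do with text classification. A keyword match alone cannot tell the two apart.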
Since searching by keywords is useless in practice and an ancient way of doing things, we teach Labelf to sort out all the irrelevant results from our queries.
We start by defining the task: "Could we state that this text is about classification and or labeling?", with the labels "nope, irrelevant" and "yes classification and or labeling".
After about 15 minutes we have labeled 377 examples and reached an accuracy of 90%. Don't worry yet, however: our accuracy should increase once we have about 1,000 labeled examples. We only have 40 items in our test set, so a single item moves the score by 2.5 percentage points; this metric will become more reliable as we progress. Read more about metrics in Labelf here. We can also get a glimpse of how bad a keyword search really is.
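To get a feel for how rough an accuracy figure is on a 40-item test set, here is a small sketch that computes a 95% Wilson score interval. This is a standard statistical formula, not something Labelf reports. It also assumes that the 90% was measured on exactly those 40 items, meaning 36 correct predictions.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion.

    z=1.96 corresponds to a confidence level of roughly 95%.
    """
    p = correct / total
    denominator = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denominator
    half_width = (
        z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denominator
    )
    return center - half_width, center + half_width


# Assumption: 90% accuracy on 40 test items means 36 correct predictions.
low, high = wilson_interval(correct=36, total=40)
print(f"accuracy 90%, plausible range {low:.1%} to {high:.1%}")
```

With these numbers the interval runs from roughly 77% to 96%, so the 90% is best read as a ballpark figure until the test set grows.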
WIP