How is the weather in the Black Forest? - Nearest Neighbour Approaches

Planning a city or hiking trip on the weekend, a barbecue or a romantic picnic in the mountains requires a pretty good weather forecast. What would you do if you wanted to know the weather? Most likely you would just check the forecast for the city in question; these forecasts are available online. But what if your route lies in the mountains, without any city or village for reference? You would then choose a location near your route for which a weather forecast is available and assume you will get the same weather on your route.
Assume you have a sunny forecast for one side of your route and a rainy forecast for the other side. You would probably rely on the forecast closer to the route, but also take the one farther away into account.
This approach is quite reasonable, assuming that the weather is similar at geographically similar locations. The same reasoning is also applied in business and research to forecast key figures like preferences, behaviours and prices. One such method is the Nearest Neighbour Approach, which starts from the assumption that objects "close" to each other behave in the "same" way, so if you have to predict an unknown value, just ask the "nearest" neighbours and use their results.

However, there are some problems with this approach: it might be easy to find neighbours when considering numeric values like age, income or location, but how would you find neighbours for, let's say, preferred music?

As so often, there is no general answer, and different ways to find neighbours exist. Usually, making a prediction from good neighbours consists of two main tasks:
1. Find a measure for the distance between two data records
2. Combine the data of the closest neighbours to make a prediction

Usually each data record consists of a collection of features. For a customer, this could be a so-called Customer ID holding all the relevant data that could possibly influence the target variable; age, income and location are often part of it. For these features it is easy to determine a distance: usually a difference (the absolute difference or the Euclidean distance) is used. Normalization is recommended for better comparison, as the features have different ranges.
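To make the distance idea concrete, here is a minimal sketch in Python. The function names, the choice of min-max scaling as the normalization, and the sample (age, income) values are assumptions made for illustration; other normalizations would work as well.

```python
import math

def min_max_normalize(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def absolute_distance(a, b):
    """Sum of the absolute differences over all features."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Square root of the summed squared differences over all features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical (age, income) records, only for illustration.
customers = [[35, 65000], [24, 25000], [60, 60000]]
scaled = min_max_normalize(customers)

# Without scaling, income would dominate the distance completely.
print(absolute_distance(customers[0], customers[1]))
print(absolute_distance(scaled[0], scaled[1]))
print(euclidean_distance(scaled[0], scaled[1]))
```

After scaling, a difference in age and a difference in income contribute on a comparable scale, which is the point of the normalization step.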

You could then pick the k nearest neighbours, look up the known values of their target variable, and combine these results into a target value. Here business knowledge is required to determine to what extent e.g. age is important, or whether location is more relevant than income.
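The following sketch shows one possible way to combine the k nearest neighbours. It assumes a weighted sum of absolute differences as the distance and a plain average of the neighbours' target values as the combination; both choices, the function name knn_predict and the sample values are illustrative assumptions, not the only option.

```python
def knn_predict(known, query, k, weights):
    """Predict a target value for `query` from its k nearest known records.

    known   -- list of (features, target) pairs with numeric features
    weights -- one weight per feature, expressing how important it is
    """
    def distance(a, b):
        # Weighted sum of absolute feature differences.
        return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

    nearest = sorted(known, key=lambda item: distance(item[0], query))[:k]
    # Combine the neighbours by averaging their known target values.
    return sum(target for _, target in nearest) / len(nearest)

# Hypothetical, already normalized records: ((feature_1, feature_2), target)
known = [((0.0, 0.0), 10.0), ((1.0, 1.0), 20.0), ((0.2, 0.1), 14.0)]

print(knn_predict(known, (0.1, 0.1), k=1, weights=(1, 1)))
print(knn_predict(known, (0.1, 0.1), k=2, weights=(1, 1)))
```

Changing the weights or k changes which neighbours are picked and therefore the prediction, which is where the business knowledge mentioned above comes in.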

The advantages of this approach are that it is usually pretty easy to understand, often reveals additional insights, and adapts whenever the data changes.
However, it can be time-consuming, since to find good neighbours you have to check every single known data record and measure the distance. The prediction can also depend considerably on the value of k, so further analysis is required. And the algorithm is discontinuous, meaning that a new data record can have a huge impact on existing predictions.

To overcome these limitations, different approaches have been established (e.g. reducing the number of data records by choosing "important" neighbours), and nearest neighbour approaches are widely used in different classification and regression problems (e.g. supporting breast cancer detection, estimating house prices).
And by defining the right distance measure you can even find neighbours for music files or detect song titles.

As a closing example, here is a simplified exercise. Consider the following data:

Target: Money spent online for hobbies per year

Gender   Age   Income   Nof Children   Target
Male      35   65,000              1    6,500
Female    24   25,000              0    2,000
Female    60   60,000              4      380
Male      48   45,000              2    2,100
Female    39   60,000              0    6,000
Male      49   75,000              2    7,500
Male      18      800              0      670

Which are your closest neighbours? And how well does the estimate of the 1-nearest-neighbour model fit you?
For the distance, use the absolute differences in age, income and number of children, with the weights age_weight = 7, income_weight = 10, nof_children_weight = 1.
(Note that this is a very simplified example; in reality the money spent on hobbies depends on many more factors, and the variables are not independent of each other.)
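If you want to check your answer, the exercise can be sketched in a few lines of Python. Two things here are assumptions and not part of the exercise: each difference is divided by the range of its feature (so that income does not swamp the other features, following the normalization recommended above), and the person in `me` is a made-up example that you should replace with your own values.

```python
# (age, income, number of children) -> money spent online for hobbies per year
data = [
    ((35, 65000, 1), 6500),
    ((24, 25000, 0), 2000),
    ((60, 60000, 4), 380),
    ((48, 45000, 2), 2100),
    ((39, 60000, 0), 6000),
    ((49, 75000, 2), 7500),
    ((18, 800, 0), 670),
]
weights = (7, 10, 1)  # age_weight, income_weight, nof_children_weight

# Range (max - min) of each feature, used to normalize the differences.
columns = list(zip(*(features for features, _ in data)))
spans = [max(c) - min(c) for c in columns]

def distance(a, b):
    """Weighted sum of range-normalized absolute differences (an assumption)."""
    return sum(w * abs(x - y) / s for w, x, y, s in zip(weights, a, b, spans))

def one_nearest_neighbour(query):
    """Return the features and target of the single closest known record."""
    return min(data, key=lambda row: distance(row[0], query))

me = (40, 55000, 1)  # hypothetical person: age 40, income 55,000, one child
neighbour, estimate = one_nearest_neighbour(me)
print(neighbour, estimate)
```

Without the division by the feature range, the income difference would decide the neighbour almost on its own, regardless of the weights.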

Hopefully you get a good approximation, so your next romantic outdoor picnic in the Black Forest does not look like this:

Modern Data Analysis

Mirko
