# Plotting the Optimal Distance for Data Scientists in Python using the K-Nearest Neighbour K-NN Algorithm

This example uses the K-Nearest Neighbour K-NN to connect several points and plotting those as a graph on a Plotly map. This is a quick and simple example that data scientists can use to illustrate a path(s) for their professional or academic papers.

A brief file structure overview of the repository is provided. The `knn_map.py` is in the root directory. The `data` folder houses a list of target geocoordinates in a csv file. The out map is generated in the `output` folder.

```/
knn_map.py

- / data /
geocoordinates.csv

- / output /
knn_map.jpg```

The `geocoordinates.csv` includes the following target geocoordinates.

`LATITUDE	LONGITUDE48.13485	11.517391348.1348182	11.57710348.1492002	11.559246948.11005	11.5934448.158903	11.5856`

Before jumping into the code the following requirements and packages are needed to run the code:

``````Python 3.10.6
pip3 install numpy
pip3 install scipy
pip3 install pandas
pip3 install plotly
pip3 install kaleido``````

First the packages that were just installed are imported into our file `knn_map.py`

``````import numpy as np
from numpy import sqrt
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go``````

Next a function `generate_knn_grid` is created which returns the K-NN grid. This function takes the dataset of geocoordinates in the data folder as a parameter.

``def generate_knn_grid(df): ``

Inside the function the parameters for the KNN algorithm are defined. N is the number of nearest neighbours or nearest nodes we want to connect. As a standard practice K is set to the square root of the number of nearest neighbours. X stacks the geocoordinate data point into a 2D Array.

``````    # Parameters for the k-nearest neighbor algorithm
N = len(list(df.LATITUDE))
k = round(sqrt(N))
X = np.column_stack((lat, lon))``````

Next the Euclidean distances between the geocoordinates or nodes can be generated. pdist returns a matrix of distance between each pairs of nodes.

``````    # matrix of pairwise Euclidean distances will determine which dector node
is connected to another dector node
distmat = squareform(pdist(X, 'euclidean'))``````

For each node the nearest neighbours are determined using the Euclidean distance matrix.

``````    # select the kNN for each datapoint
neighbors = np.sort(np.argsort(distmat, axis=1)[:, 0:k])``````

Using the this information the geocoordinates belong to connectors or edges between the nodes can be calculated.

``````    # get edge coordinates for the lines connecting the nodes
coordinates = np.zeros((N, k, 2, 2))

for i in np.arange(len(list(df.LATITUDE))):
for j in np.arange(k):
coordinates[i, j, :, 0] = np.array([X[i,:][1], X[neighbors[i, j], :][1]])
coordinates[i, j, :, 1] = np.array([X[i,:][0], X[neighbors[i, j], :][0]])``````

The calculated information is then returned along with N and k.

``return N, k, coordinates``

Next the function for plotting the map is created. It takes the returned information from the pervious function as parameters.

``def plot_map(N, k, coordinates):``

The map figure is created and styled using Plotly. An API access token from Mapbox can be acquired for free after signing up. Mapbox has many beautiful maps from a variety of contributors. These look nice in academic papers. Remember to attribute the map creator. Times New Roman was used and the shape was set to a square. The nodes are added from the dataset.

``````    fig = px.scatter_mapbox(
df,
lat="LATITUDE",
lon="LONGITUDE",
zoom=12.2,
color_discrete_sequence=["rgb(255, 203, 3)"],
title="<span style='font-size: 32px;'><b>K-Nearest Neighbor KNN
Map</b></span>",
opacity=.8,
width=1000,
height=1000,
center=go.layout.mapbox.Center(
lat=48.14,
lon=11.57,
),
size_max=15
)

# Now using Mapbox
fig.update_layout(mapbox_style="light",
mapbox_accesstoken="",
legend=dict(yanchor="top", y=1, xanchor="left",x=0.9),
title=dict(yanchor="top", y=.85, xanchor="left",x=0.085),
font_family="Times New Roman",
font_color="#333333",
title_font_size = 32,
font_size = 18)``````

The node diameters are set to 15.

``````    # diameter of the plots
fig.update_traces(marker={'size': 15})``````

Using the information from the parameters the connectors are generated and added to the map.

``````    # add line connectors
lines = coordinates.reshape((N*k, 2, 2))

i = 0
for row in lines:
fig.add_trace(go.Scattermapbox(lon=[lines[i][0].[0],
lines[i][1][0]],
lat=[lines[i][0][1],
lines[i][1][1]],
mode='lines',
showlegend = False,
line=dict(color='#ffcb03')))
i += 1``````

The line connectors are z-indexed below the nodes.

``````    # line connectors layered below plots
fig.data = fig.data[::-1]  ``````

Finally the map is saved in the output folder and rendered in the browser.

``````    # Save map in output folder
print("Saving image to output folder...");
fig.write_image('output/knn_map.jpg', scale=5)

# Show map in web browser
print("Generating map in browser...");
fig.show()``````

After defining the functions we import our data from the data directory and format it.

``````# Data Import Path
SENSORS_CSV   = 'data/geocoordinates.csv'

# Data Import Path
df = pd.read_csv(SENSORS_CSV)

# Keep only relevant columns
df = df.loc[:, ("LATITUDE", "LONGITUDE")]

# Remove missing geocoordinates
df = df[(df["LATITUDE"] != "NEIN") & (df["LONGITUDE"] != "NEIN")]

# Format the latitudes and longitudes
lon = []; lat = [];

for row in df["LATITUDE"]:
lat.append(float(row))

for row in df["LONGITUDE"]:
lon.append(float(row))``````

Finally we call our two functions to compute and create our KNN map.

``````N, k, coordinates = generate_knn_grid(df)

plot_map(N, k, coordinates)``````

The result is presented below:

K-NN could be used, for instance, to predict how viruses and other pathogens will move across a given region or spatial area by creating a graph from the individual cases. However for traffic-related forecasting, it is preferable to construct the graph directly from the road network using the Dijkstra algorithm.