Data preprocessing

[1]:
import movekit as mkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read data input

[2]:
# Enter path to CSV file
path = "./datasets/fish-5.csv"

# Alternative: enter path to Excel file
# path = "./datasets/fish-5.xlsx"
[3]:
# Read in the file with movekit
data = mkit.read_data(path)
data.head()
[3]:
time animal_id x y
0 1 312 405.29 417.76
1 1 511 369.99 428.78
2 1 607 390.33 405.89
3 1 811 445.15 411.94
4 1 905 366.06 451.76

General preprocessing method

[4]:
# Simple call of the preprocessing method
preprocessed_data = mkit.preprocess(data)
Total number of missing values =  0
time         0
animal_id    0
x            0
y            0
dtype: int64
[5]:
# OPTIONAL: more parameters to control the preprocessing of data

# preprocessed_data = mkit.preprocess(data, dropna=True, interpolation=False, limit=1, limit_direction="forward", inplace=False, method="linear", order=1, date_format=False)

# Parameters
#  data: DataFrame to perform preprocessing on
#  dropna: Optional parameter to drop rows with missing values for 'time' and 'animal_id'
#  interpolation: Optional parameter to perform interpolation (linear by default)
#  limit: Maximum number of consecutive NaNs to fill
#  limit_direction: If limit is specified, consecutive NaNs will be filled in this direction.
#  method: Interpolation technique to use. Default is "linear".
#  order: To be used in case of polynomial and spline interpolation.
#  date_format: Boolean to define whether time is some kind of date format instead of a number.
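
To illustrate what the `interpolation`, `limit`, `limit_direction` and `method` options control, here is a minimal pandas-only sketch. It does not call movekit. It assumes these options behave like their namesakes in `pandas.Series.interpolate`, and the small `track` DataFrame is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical track of one animal with two consecutive missing x values.
track = pd.DataFrame({
    "time": [1, 2, 3, 4, 5],
    "animal_id": [312] * 5,
    "x": [0.0, np.nan, np.nan, 3.0, 4.0],
    "y": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# Linear interpolation that fills at most one consecutive NaN, working forward.
filled = track.copy()
filled["x"] = filled["x"].interpolate(
    method="linear", limit=1, limit_direction="forward"
)
print(filled)
```

With `limit=1`, only the first value of the gap is filled and the second stays NaN, so a larger `limit` is needed to close longer gaps.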

Examples of some sampling methods

For large data sets, it can be more efficient to reduce the amount of data by sampling (systematically or randomly) or by filtering.

[6]:
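# Systematic sampling: keep 2 evenly spaced time steps
downsampled_data = mkit.resample_systematic(preprocessed_data, 2)
# Filtering: keep only the records in the time window from 5 to 6
filtered_data = mkit.filter_dataframe(preprocessed_data, 5, 6)
downsampled_data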
[6]:
time animal_id x y
0 1 312 405.29 417.76
1 1 511 369.99 428.78
2 1 607 390.33 405.89
3 1 811 445.15 411.94
4 1 905 366.06 451.76
5 501 312 106.20 386.81
6 501 511 111.52 422.73
7 501 607 61.26 365.88
8 501 811 71.48 332.31
9 501 905 71.26 365.12
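
The idea behind systematic sampling and time filtering can be sketched in plain pandas. This is not movekit's implementation: the helper names `downsample_systematic` and `filter_time_window`, the half-open time window, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 animals observed at 10 time steps.
df = pd.DataFrame({
    "time": np.repeat(np.arange(1, 11), 2),
    "animal_id": np.tile([312, 511], 10),
    "x": np.arange(20, dtype=float),
    "y": np.arange(20, dtype=float),
})

def downsample_systematic(data, n_steps):
    """Keep n_steps evenly spaced time steps, with all animals at those steps."""
    unique_times = np.sort(data["time"].unique())
    stride = max(1, len(unique_times) // n_steps)
    kept = unique_times[::stride][:n_steps]
    return data[data["time"].isin(kept)].reset_index(drop=True)

def filter_time_window(data, start, stop):
    """Keep records with start <= time < stop (assumed half-open window)."""
    return data[(data["time"] >= start) & (data["time"] < stop)]

small = downsample_systematic(df, 2)
window = filter_time_window(df, 5, 6)
print(small)
print(window)
```

Sampling whole time steps, rather than individual rows, keeps every animal present at each retained step.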

Methods to replace/convert specific values (duplicates, missing values, selected values)

One can replace the coordinate values of a specific mover at specific time steps. This can be a useful way to deal with outliers.

[7]:
arr_index = np.array([1, 3])
replaced_data_groups = mkit.replace_parts_animal_movement(preprocessed_data, 811, arr_index, 100, 90)
replaced_data_groups[replaced_data_groups['animal_id']==811]
[7]:
time animal_id x y
3 1 811 100.00 90.00
8 2 811 445.48 412.26
13 3 811 100.00 90.00
18 4 811 446.03 413.00
23 5 811 446.24 413.42
... ... ... ... ...
4978 996 811 761.31 307.65
4983 997 811 761.56 307.65
4988 998 811 761.86 307.65
4993 999 811 762.12 307.65
4998 1000 811 762.44 307.61

1000 rows × 4 columns
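
A replacement of this kind comes down to a boolean mask over `animal_id` and `time`. The following is a minimal pandas sketch, not movekit's own code: the helper `replace_positions` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical records for two animals at three time steps.
df = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3],
    "animal_id": [811, 905, 811, 905, 811, 905],
    "x": [10.0, 20.0, 11.0, 21.0, 12.0, 22.0],
    "y": [30.0, 40.0, 31.0, 41.0, 32.0, 42.0],
})

def replace_positions(data, animal_id, times, new_x, new_y):
    """Return a copy in which one animal's x/y are overwritten at the given times."""
    out = data.copy()
    mask = (out["animal_id"] == animal_id) & out["time"].isin(times)
    out.loc[mask, "x"] = new_x
    out.loc[mask, "y"] = new_y
    return out

replaced = replace_positions(df, 811, [1, 3], 100, 90)
print(replaced[replaced["animal_id"] == 811])
```

Working on a copy leaves the original DataFrame untouched, which makes it easy to compare the data before and after the replacement.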

In many applications it is useful to normalize the coordinate data before the analysis.

[8]:
normalized_data = mkit.normalize(data)
normalized_data.head()
[8]:
time animal_id x y
0 1 312 0.496639 0.849376
1 1 511 0.446887 0.873750
2 1 607 0.475554 0.823122
3 1 811 0.552817 0.836504
4 1 905 0.441348 0.924578
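
The normalized values above lie between 0 and 1, which is consistent with min-max scaling. Under that assumption (the exact scheme used by movekit is not confirmed here), the operation can be sketched in pandas as follows. The helper `min_max_normalize` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical coordinates for two animals at two time steps.
df = pd.DataFrame({
    "time": [1, 1, 2, 2],
    "animal_id": [312, 511, 312, 511],
    "x": [10.0, 20.0, 30.0, 50.0],
    "y": [0.0, 5.0, 10.0, 20.0],
})

def min_max_normalize(data, columns=("x", "y")):
    """Rescale each coordinate column to [0, 1]; assumes the column is not constant."""
    out = data.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out

normalized = min_max_normalize(df)
print(normalized)
```

Each column is scaled independently, so distances along x and y are no longer directly comparable after normalization.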

There are two methods to get an overview of missing values and duplicate records.

[9]:
# For demonstration, set all x values at time step 3 to NaN (on a copy, so `data` stays intact)
missing_data = data.copy()
missing_data.loc[missing_data['time'] == 3, 'x'] = np.nan

mkit.print_missing(missing_data)
mkit.print_duplicate(missing_data)
Total number of missing values =  5
x            5
time         0
animal_id    0
y            0
dtype: int64
Duplicate rows based on the columns 'animal_id' and 'time' column are:
Empty DataFrame
Columns: [time, animal_id, x, y]
Index: []
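
Equivalent checks can be written directly in pandas. This is a sketch of the idea rather than movekit's implementation, and the toy DataFrame is made up so that it contains both missing values and a duplicate record.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values and one duplicated (animal_id, time) pair.
df = pd.DataFrame({
    "time": [1, 1, 2, 2, 2],
    "animal_id": [312, 511, 312, 511, 511],
    "x": [1.0, np.nan, 3.0, 4.0, 4.5],
    "y": [1.0, 2.0, np.nan, 4.0, 4.5],
})

# Count missing values per column and in total.
missing_per_column = df.isna().sum().sort_values(ascending=False)
total_missing = int(missing_per_column.sum())

# Show every row that shares its (animal_id, time) pair with another row.
duplicates = df[df.duplicated(subset=["animal_id", "time"], keep=False)]

print("Total number of missing values =", total_missing)
print(missing_per_column)
print(duplicates)
```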

Making a pandas DataFrame compatible with movekit

If the data is already stored in a pandas DataFrame, it can easily be made compatible with movekit.

[10]:
#Parameters:
# data: the existing data frame
# dictionary: Key-value pairs of column names. Keys store the old column names. The respective new column names are stored as their values. Values that need to be defined include 'time', 'animal_id', 'x' and 'y'.

wrong_df = pd.DataFrame({'Time':[0,1,2,3],'IDs':['A','B','C','D'],'x-values':[0,1,2,3],'y-values':[5,6,7,8]})
correct_df = mkit.from_dataframe(wrong_df, {'Time': 'time', 'IDs': 'animal_id', 'x-values': 'x', 'y-values': 'y'})
correct_df
[10]:
time animal_id x y
0 0 A 0 5
1 1 B 1 6
2 2 C 2 7
3 3 D 3 8
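
Conceptually, this conversion is a column rename followed by a check that the four required columns exist. The following is a pandas sketch under that assumption. The helper `rename_for_movekit` is hypothetical and not part of movekit.

```python
import pandas as pd

REQUIRED = ("time", "animal_id", "x", "y")

def rename_for_movekit(data, mapping):
    """Rename columns via mapping (old name -> new name) and verify the required ones exist."""
    renamed = data.rename(columns=mapping)
    missing = [col for col in REQUIRED if col not in renamed.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return renamed

wrong_df = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "IDs": ["A", "B", "C", "D"],
    "x-values": [0, 1, 2, 3],
    "y-values": [5, 6, 7, 8],
})
renamed_df = rename_for_movekit(
    wrong_df,
    {"Time": "time", "IDs": "animal_id", "x-values": "x", "y-values": "y"},
)
print(renamed_df)
```

Failing early on an incomplete mapping avoids harder-to-read errors later in the analysis.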

Support for 3D datasets

movekit also supports movement in three dimensions. All function calls remain the same for the user, because movekit recognizes the presence of a third dimension in the data.

Below we show an example of a 3D dataset that can be given to movekit.

[11]:
# create a synthetic 3D dataset by appending a third dimension to the 2D dataset from above
z = np.random.normal(loc=0.0, scale=1.0, size=len(preprocessed_data))
preprocessed_data['z'] = z
preprocessed_data
[11]:
time animal_id x y z
0 1 312 0.496639 0.849376 -0.271515
1 1 511 0.446887 0.873750 0.209787
2 1 607 0.475554 0.823122 -0.518311
3 1 811 0.552817 0.836504 -0.104081
4 1 905 0.441348 0.924578 1.042317
... ... ... ... ... ...
4995 1000 312 0.941539 0.466381 -0.624974
4996 1000 511 0.859231 0.423671 -1.143292
4997 1000 607 0.944062 0.580819 -0.840995
4998 1000 811 1.000000 0.605746 -1.494142
4999 1000 905 0.880865 0.334668 -0.910113

5000 rows × 5 columns
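
How a function can adapt to an optional `z` column is sketched below. This only illustrates the general pattern and is not movekit's internal logic. The helpers `coordinate_columns` and `distance_from_origin` are hypothetical.

```python
import numpy as np
import pandas as pd

def coordinate_columns(data):
    """Return the coordinate columns present: x, y and, if available, z."""
    return ["x", "y", "z"] if "z" in data.columns else ["x", "y"]

def distance_from_origin(data):
    """Euclidean distance from the origin in 2D or 3D, depending on the columns."""
    coords = data[coordinate_columns(data)].to_numpy(dtype=float)
    return np.linalg.norm(coords, axis=1)

# The same call works for a 2D and a 3D DataFrame.
df_2d = pd.DataFrame({"time": [1], "animal_id": [312], "x": [3.0], "y": [4.0]})
df_3d = pd.DataFrame({"time": [1], "animal_id": [312], "x": [1.0], "y": [2.0], "z": [2.0]})
print(distance_from_origin(df_2d))
print(distance_from_origin(df_3d))
```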

Support for geographic coordinates

movekit can project GPS coordinates given as latitude and longitude onto a Cartesian coordinate system.

[12]:
path = "./datasets/geo.csv"

# Read in the file using pandas
geo_data = pd.read_csv(path, sep=';')
geo_data.head()
[12]:
time animal_id latitude longitude
0 1 1 47.691358 9.176731
1 1 2 52.472161 13.402034
2 1 3 47.692101 9.055353
[13]:
# mkit.convert_latlon(data, latitude='latitude', longitude='longitude', replace=True)

#Parameters:
#data: DataFrame with GPS coordinates
#latitude: str. Name of the column where latitude is stored
#longitude: str. Name of the column where longitude is stored
#replace: bool. Flag whether the xy columns should replace the latlon columns
#return: DataFrame after the transformation where latitude is projected into y and longitude is projected into x

projected_data = mkit.convert_latlon(geo_data)
projected_data.head()
[13]:
time animal_id x y
0 1 1 513261.777038 5.282012e+06
1 1 2 391460.276950 5.814756e+06
2 1 3 504153.593963 5.282081e+06
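
The projected values above are consistent with a UTM projection, where each point is mapped into its own 6-degree zone. Assuming that this is what happens (it is not confirmed here), the projection can be sketched in pure Python with the Krüger series on the WGS84 ellipsoid. The function `latlon_to_utm` is hypothetical, and the sketch ignores the special zone rules around Norway and Svalbard.

```python
import math

def latlon_to_utm(lat_deg, lon_deg):
    """Project WGS84 latitude/longitude (degrees) to UTM (easting, northing, zone)."""
    a = 6378137.0                # WGS84 semi-major axis in metres
    f = 1 / 298.257223563        # WGS84 flattening
    k0 = 0.9996                  # UTM scale factor on the central meridian

    zone = int((lon_deg + 180) // 6) + 1
    lon0 = math.radians(zone * 6 - 183)   # central meridian of the zone

    n = f / (2 - f)
    radius = a / (1 + n) * (1 + n**2 / 4 + n**4 / 64)
    alpha = (
        n / 2 - 2 * n**2 / 3 + 5 * n**3 / 16,
        13 * n**2 / 48 - 3 * n**3 / 5,
        61 * n**3 / 240,
    )

    phi = math.radians(lat_deg)
    dlon = math.radians(lon_deg) - lon0
    ecc = 2 * math.sqrt(n) / (1 + n)      # first eccentricity of the ellipsoid
    # Tangent of the conformal latitude.
    t = math.sinh(math.atanh(math.sin(phi)) - ecc * math.atanh(ecc * math.sin(phi)))
    xi = math.atan2(t, math.cos(dlon))
    eta = math.atanh(math.sin(dlon) / math.sqrt(1 + t * t))

    east_sum = sum(c * math.cos(2 * j * xi) * math.sinh(2 * j * eta)
                   for j, c in enumerate(alpha, start=1))
    north_sum = sum(c * math.sin(2 * j * xi) * math.cosh(2 * j * eta)
                    for j, c in enumerate(alpha, start=1))

    easting = 500000.0 + k0 * radius * (eta + east_sum)
    northing = k0 * radius * (xi + north_sum)
    if lat_deg < 0:
        northing += 10_000_000.0          # false northing, southern hemisphere
    return easting, northing, zone

# First two rows of the geo data shown above.
first = latlon_to_utm(47.691358, 9.176731)
second = latlon_to_utm(52.472161, 13.402034)
print(first)
print(second)
```

Because each point uses its own zone, x values of points in different zones are not directly comparable, which matters for animals that are far apart.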

Support for data stored as GeoJSON and JSON

movekit can read data stored in GeoJSON (.geojson) or JSON (.json) files.

[14]:
json_data = mkit.read_geojson('./datasets/fish-4.geojson')
json_data
[14]:
time animal_id x y
0 1 fish1 99.0 0.0
1 1 fish2 120.0 4.0
2 1 fish3 120.0 6.0
3 2 fish1 101.0 1.0
4 2 fish2 200.0 5.0
5 2 fish3 33.0 5.0
6 3 fish1 8.0 8.0
7 3 fish2 125.0 43.0
8 3 fish3 45.0 87.0
9 4 fish1 -44.0 -11.0
10 4 fish2 12.0 5.0
11 4 fish3 11.0 -12.0
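
For readers unfamiliar with GeoJSON, the sketch below shows how point features could be flattened into the `time`, `animal_id`, `x`, `y` layout using only the standard `json` module and pandas. The layout of the file is an assumption: the property names `time` and `animal_id` and the helper `geojson_points_to_frame` are hypothetical, and the file expected by movekit may be structured differently.

```python
import json
import pandas as pd

# Hypothetical GeoJSON with one Point feature per observation.
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [99.0, 0.0]},
     "properties": {"time": 1, "animal_id": "fish1"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [120.0, 4.0]},
     "properties": {"time": 1, "animal_id": "fish2"}}
  ]
}
"""

def geojson_points_to_frame(text):
    """Flatten Point features into a DataFrame with time, animal_id, x and y."""
    collection = json.loads(text)
    rows = []
    for feature in collection["features"]:
        x, y = feature["geometry"]["coordinates"][:2]
        props = feature["properties"]
        rows.append({"time": props["time"], "animal_id": props["animal_id"],
                     "x": x, "y": y})
    return pd.DataFrame(rows, columns=["time", "animal_id", "x", "y"])

frame = geojson_points_to_frame(geojson_text)
print(frame)
```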