Data preprocessing¶
[1]:
import movekit as mkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Read data input¶
[2]:
# Enter path to CSV file
path = "./datasets/fish-5.csv"
# Alternative: enter path to Excel file
# path = "./datasets/fish-5.xlsx"
[3]:
# Read in the file using movekit's read_data
data = mkit.read_data(path)
data.head()
[3]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 1 | 312 | 405.29 | 417.76 |
| 1 | 1 | 511 | 369.99 | 428.78 |
| 2 | 1 | 607 | 390.33 | 405.89 |
| 3 | 1 | 811 | 445.15 | 411.94 |
| 4 | 1 | 905 | 366.06 | 451.76 |
General preprocessing method¶
[4]:
# Simple call of the preprocessing method
preprocessed_data = mkit.preprocess(data)
Total number of missing values = 0
time 0
animal_id 0
x 0
y 0
dtype: int64
[5]:
# OPTIONAL: more parameters to control the preprocessing of data
# preprocessed_data = mkit.preprocess(data, dropna=True, interpolation=False, limit=1, limit_direction="forward", inplace=False, method="linear", order=1, date_format=False)
# Parameters
# data: DataFrame to perform preprocessing on
# dropna: Optional parameter to drop rows with missing values for 'time' and 'animal_id'
# interpolation: Optional parameter to interpolate missing values (linear by default, see 'method')
# limit: Maximum number of consecutive NANs to fill
# limit_direction: If limit is specified, consecutive NaNs will be filled in this direction.
# method: Interpolation technique to use. Default is "linear".
# order: To be used in case of polynomial and spline interpolation.
# date_format: Boolean to define whether time is some kind of date format instead of a number.
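The `limit`, `limit_direction`, `method` and `order` options share their names with arguments of pandas' `Series.interpolate`. As a minimal sketch, assuming `mkit.preprocess` gives them the same meaning (an assumption, not confirmed by this page), the plain-pandas example below shows how `limit=1` with forward direction fills only the first value in each run of consecutive NaNs. The series `s` is made-up illustration data.

```python
import numpy as np
import pandas as pd

# Made-up series: a run of two NaNs followed by a single NaN
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

# Linear interpolation that fills at most one consecutive NaN, moving forward
filled = s.interpolate(method="linear", limit=1, limit_direction="forward")
print(filled.tolist())
```

The second NaN of the first run stays missing because the limit of one fill per gap has already been used.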
Examples of some sampling methods¶
For a large data set, it can be more efficient to reduce its size before the analysis, either by sampling (systematically or randomly) or by filtering the data.
[6]:
downsampled_data = mkit.resample_systematic(preprocessed_data, 2)
filtered_data = mkit.filter_dataframe(preprocessed_data, 5, 6)
downsampled_data
[6]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 1 | 312 | 405.29 | 417.76 |
| 1 | 1 | 511 | 369.99 | 428.78 |
| 2 | 1 | 607 | 390.33 | 405.89 |
| 3 | 1 | 811 | 445.15 | 411.94 |
| 4 | 1 | 905 | 366.06 | 451.76 |
| 5 | 501 | 312 | 106.20 | 386.81 |
| 6 | 501 | 511 | 111.52 | 422.73 |
| 7 | 501 | 607 | 61.26 | 365.88 |
| 8 | 501 | 811 | 71.48 | 332.31 |
| 9 | 501 | 905 | 71.26 | 365.12 |
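To make the difference between systematic sampling, random sampling and filtering concrete, here is a rough plain-pandas sketch on a made-up frame. It does not reproduce movekit's implementation: the step size, the seed, and the assumption that a filter selects an inclusive range of `time` values are all illustrative choices.

```python
import pandas as pd

# Made-up data: 6 time steps for 2 animals
df = pd.DataFrame({
    "time": [t for t in range(1, 7) for _ in range(2)],
    "animal_id": [312, 511] * 6,
    "x": [float(i) for i in range(12)],
    "y": [float(i) * 2 for i in range(12)],
})
times = sorted(df["time"].unique())

# Systematic sampling: keep every third time step (here: 1 and 4)
systematic = df[df["time"].isin(times[::3])]

# Random sampling: keep two randomly chosen time steps (seed fixed for repeatability)
random_times = pd.Series(times).sample(n=2, random_state=0)
random_sample = df[df["time"].isin(random_times)]

# Filtering: keep only rows whose time lies in the inclusive window 5..6
window = df[df["time"].between(5, 6)]
```

In all three cases every animal keeps the same set of time steps, so the reduced data stays aligned across movers.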
Methods to replace/convert specific values (duplicates, missings, selected values)¶
One can replace the coordinate values of a specific mover at specific time steps. This can be a useful way to deal with outliers.
[7]:
arr_index = np.array([1, 3])
replaced_data_groups = mkit.replace_parts_animal_movement(preprocessed_data, 811, arr_index, 100, 90)
replaced_data_groups[replaced_data_groups['animal_id']==811]
[7]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 3 | 1 | 811 | 100.00 | 90.00 |
| 8 | 2 | 811 | 445.48 | 412.26 |
| 13 | 3 | 811 | 100.00 | 90.00 |
| 18 | 4 | 811 | 446.03 | 413.00 |
| 23 | 5 | 811 | 446.24 | 413.42 |
| ... | ... | ... | ... | ... |
| 4978 | 996 | 811 | 761.31 | 307.65 |
| 4983 | 997 | 811 | 761.56 | 307.65 |
| 4988 | 998 | 811 | 761.86 | 307.65 |
| 4993 | 999 | 811 | 762.12 | 307.65 |
| 4998 | 1000 | 811 | 762.44 | 307.61 |
1000 rows × 4 columns
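In the output above, the coordinates of mover 811 were overwritten at times 1 and 3. A comparable replacement can be sketched in plain pandas with a boolean mask. This is only an illustration of the idea on made-up values, not movekit's implementation.

```python
import pandas as pd

# Made-up data for two movers over three time steps
df = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3],
    "animal_id": [312, 811, 312, 811, 312, 811],
    "x": [405.0, 445.0, 406.0, 446.0, 407.0, 447.0],
    "y": [417.0, 411.0, 418.0, 412.0, 419.0, 413.0],
})

# Select mover 811 at time steps 1 and 3
mask = (df["animal_id"] == 811) & df["time"].isin([1, 3])

# Work on a copy so the original frame is left untouched
patched = df.copy()
patched.loc[mask, "x"] = 100.0
patched.loc[mask, "y"] = 90.0
```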
In many applications it is useful to normalize the coordinate data before the analysis.
[8]:
normalized_data = mkit.normalize(data)
normalized_data.head()
[8]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 1 | 312 | 0.496639 | 0.849376 |
| 1 | 1 | 511 | 0.446887 | 0.873750 |
| 2 | 1 | 607 | 0.475554 | 0.823122 |
| 3 | 1 | 811 | 0.552817 | 0.836504 |
| 4 | 1 | 905 | 0.441348 | 0.924578 |
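The normalized values shown above lie between 0 and 1, which is consistent with min-max scaling. Whether `mkit.normalize` uses exactly this formula is an assumption; the sketch below only illustrates min-max scaling itself on made-up data, with `min_max_scale` being a hypothetical helper.

```python
import pandas as pd

# Made-up coordinates
df = pd.DataFrame({"x": [0.0, 50.0, 200.0], "y": [10.0, 20.0, 30.0]})

def min_max_scale(col):
    """Scale a column linearly so its minimum maps to 0 and its maximum to 1."""
    span = col.max() - col.min()  # a constant column would give span == 0
    return (col - col.min()) / span

normalized = df.copy()
normalized[["x", "y"]] = df[["x", "y"]].apply(min_max_scale)
```

Each coordinate column is scaled independently, so the aspect ratio of the original trajectories is generally not preserved.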
There are two methods to get an overview of missing values and duplicate rows in the data.
[9]:
# For demonstration, set all x values at time step 3 to NaN
# (work on a copy so that `data` itself stays unchanged)
missing_data = data.copy()
missing_data.loc[missing_data['time'] == 3, 'x'] = np.nan
mkit.print_missing(missing_data)
mkit.print_duplicate(missing_data)
Total number of missing values = 5
x 5
time 0
animal_id 0
y 0
dtype: int64
Duplicate rows based on the columns 'animal_id' and 'time' column are:
Empty DataFrame
Columns: [time, animal_id, x, y]
Index: []
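The same kind of overview can be obtained directly with pandas. The following is a minimal sketch on made-up data, assuming that duplicates are defined by the combination of `animal_id` and `time` as the printed message above suggests.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing x value and one duplicated (animal_id, time) pair
df = pd.DataFrame({
    "time": [1, 1, 2, 2, 2],
    "animal_id": [312, 511, 312, 511, 511],
    "x": [405.0, np.nan, 406.0, 370.0, 370.0],
    "y": [417.0, 428.0, 418.0, 429.0, 429.0],
})

# Missing values per column and in total
missing_per_column = df.isna().sum()
total_missing = int(missing_per_column.sum())

# All rows that share their (animal_id, time) pair with another row
duplicates = df[df.duplicated(subset=["animal_id", "time"], keep=False)]
```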
Making a pandas DataFrame compatible with movekit¶
If one has the data stored in a pandas DataFrame, one can easily make the DataFrame compatible with movekit.
[10]:
#Parameters:
# data: the existing data frame
# dictionary: Key-value pairs of column names. Keys store the old column names. The respective new column names are stored as their values. Values that need to be defined include 'time', 'animal_id', 'x' and 'y'.
wrong_df = pd.DataFrame({'Time':[0,1,2,3],'IDs':['A','B','C','D'],'x-values':[0,1,2,3],'y-values':[5,6,7,8]})
correct_df = mkit.from_dataframe(wrong_df, {'Time': 'time', 'IDs': 'animal_id', 'x-values': 'x', 'y-values': 'y'})
correct_df
[10]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 0 | A | 0 | 5 |
| 1 | 1 | B | 1 | 6 |
| 2 | 2 | C | 2 | 7 |
| 3 | 3 | D | 3 | 8 |
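Conceptually, this conversion is a column rename followed by a check that the four required columns exist. The sketch below shows that idea in plain pandas; `rename_and_check` and `REQUIRED` are hypothetical names for illustration and are not part of movekit.

```python
import pandas as pd

REQUIRED = ["time", "animal_id", "x", "y"]

def rename_and_check(df, mapping):
    """Rename columns via `mapping` and verify that all required columns exist."""
    renamed = df.rename(columns=mapping)
    missing = [col for col in REQUIRED if col not in renamed.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed

wrong_df = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "IDs": ["A", "B", "C", "D"],
    "x-values": [0, 1, 2, 3],
    "y-values": [5, 6, 7, 8],
})
mapping = {"Time": "time", "IDs": "animal_id", "x-values": "x", "y-values": "y"}
checked_df = rename_and_check(wrong_df, mapping)
```

An incomplete mapping, for example one that omits `'y-values'`, raises a `ValueError` instead of silently producing an unusable frame.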
Support for 3D datasets¶
movekit also supports movement in three dimensions. All function calls remain the same for the user as the presence of a third dimension in the data is recognized by movekit.
Below we show an example of a 3D dataset that can be given to movekit.
[11]:
# create a synthetic 3D dataset by appending a third dimension to the 2D dataset from above
z = np.random.normal(loc=0.0, scale=1.0, size=len(preprocessed_data))
preprocessed_data['z'] = z
preprocessed_data
[11]:
| | time | animal_id | x | y | z |
|---|---|---|---|---|---|
| 0 | 1 | 312 | 0.496639 | 0.849376 | -0.271515 |
| 1 | 1 | 511 | 0.446887 | 0.873750 | 0.209787 |
| 2 | 1 | 607 | 0.475554 | 0.823122 | -0.518311 |
| 3 | 1 | 811 | 0.552817 | 0.836504 | -0.104081 |
| 4 | 1 | 905 | 0.441348 | 0.924578 | 1.042317 |
| ... | ... | ... | ... | ... | ... |
| 4995 | 1000 | 312 | 0.941539 | 0.466381 | -0.624974 |
| 4996 | 1000 | 511 | 0.859231 | 0.423671 | -1.143292 |
| 4997 | 1000 | 607 | 0.944062 | 0.580819 | -0.840995 |
| 4998 | 1000 | 811 | 1.000000 | 0.605746 | -1.494142 |
| 4999 | 1000 | 905 | 0.880865 | 0.334668 | -0.910113 |
5000 rows × 5 columns
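How a function can stay the same for 2D and 3D input is easiest to see in a small example. The sketch below picks the coordinate columns depending on whether a `z` column is present and then computes the distance travelled per time step. `coordinate_columns` and `step_lengths` are hypothetical helpers written for this illustration; they are not movekit functions and do not describe movekit's internals.

```python
import numpy as np
import pandas as pd

def coordinate_columns(df):
    """Use x, y, z if a z column is present, otherwise x, y."""
    return ["x", "y", "z"] if "z" in df.columns else ["x", "y"]

def step_lengths(df):
    """Euclidean distance between consecutive positions of each mover."""
    cols = coordinate_columns(df)
    ordered = df.sort_values(["animal_id", "time"])
    deltas = ordered.groupby("animal_id")[cols].diff()
    # The first step of each mover has no predecessor and stays NaN
    return np.sqrt((deltas ** 2).sum(axis=1, skipna=False))

df_2d = pd.DataFrame({"time": [1, 2], "animal_id": [1, 1],
                      "x": [0.0, 3.0], "y": [0.0, 4.0]})
df_3d = df_2d.assign(z=[0.0, 12.0])

steps_2d = step_lengths(df_2d)
steps_3d = step_lengths(df_3d)
```

The caller uses the same function for both frames; only the presence of `z` changes which columns enter the distance.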
Support for geographic coordinates¶
movekit is able to project data from GPS coordinates in latitude and longitude format to a Cartesian coordinate system.
[12]:
path = "./datasets/geo.csv"
# Read in the semicolon-separated file using pandas
geo_data = pd.read_csv(path, sep=';')
geo_data.head()
[12]:
| | time | animal_id | latitude | longitude |
|---|---|---|---|---|
| 0 | 1 | 1 | 47.691358 | 9.176731 |
| 1 | 1 | 2 | 52.472161 | 13.402034 |
| 2 | 1 | 3 | 47.692101 | 9.055353 |
[13]:
# mkit.convert_latlon(data, latitude='latitude', longitude='longitude', replace=True)
#Parameters:
#data: DataFrame with GPS coordinates
#latitude: str. Name of the column where latitude is stored
#longitude: str. Name of the column where longitude is stored
#replace: bool. Flag whether the xy columns should replace the latlon columns
#return: DataFrame after the transformation where latitude is projected into y and longitude is projected into x
projected_data = mkit.convert_latlon(geo_data)
projected_data.head()
[13]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 1 | 1 | 513261.777038 | 5.282012e+06 |
| 1 | 1 | 2 | 391460.276950 | 5.814756e+06 |
| 2 | 1 | 3 | 504153.593963 | 5.282081e+06 |
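Which projection `convert_latlon` applies is not stated on this page. Purely to illustrate what projecting latitude and longitude onto a plane means, the sketch below uses the much simpler equirectangular approximation around a reference point. This is a swapped-in technique, not movekit's method, and it is only reasonable for points that are close together. The function name and the spherical Earth radius are assumptions of this example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation

def equirectangular(lat_deg, lon_deg, lat0_deg, lon0_deg):
    """Approximate metric offsets (x east, y north) from a reference point."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    lat0, lon0 = math.radians(lat0_deg), math.radians(lon0_deg)
    x = EARTH_RADIUS_M * (lon - lon0) * math.cos(lat0)
    y = EARTH_RADIUS_M * (lat - lat0)
    return x, y

# Offsets of animal 3 relative to animal 1 (coordinates from the table above)
origin = (47.691358, 9.176731)
x, y = equirectangular(47.692101, 9.055353, *origin)
print(round(x), round(y))
```

As in the converted output above, longitude maps to `x` and latitude maps to `y`.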
Support for data stored as GeoJSON and JSON¶
movekit is able to read data stored as GeoJSON (.geojson) or JSON (.json) files.
[14]:
json_data = mkit.read_geojson('./datasets/fish-4.geojson')
json_data
[14]:
| | time | animal_id | x | y |
|---|---|---|---|---|
| 0 | 1 | fish1 | 99.0 | 0.0 |
| 1 | 1 | fish2 | 120.0 | 4.0 |
| 2 | 1 | fish3 | 120.0 | 6.0 |
| 3 | 2 | fish1 | 101.0 | 1.0 |
| 4 | 2 | fish2 | 200.0 | 5.0 |
| 5 | 2 | fish3 | 33.0 | 5.0 |
| 6 | 3 | fish1 | 8.0 | 8.0 |
| 7 | 3 | fish2 | 125.0 | 43.0 |
| 8 | 3 | fish3 | 45.0 | 87.0 |
| 9 | 4 | fish1 | -44.0 | -11.0 |
| 10 | 4 | fish2 | 12.0 | 5.0 |
| 11 | 4 | fish3 | 11.0 | -12.0 |
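The layout that `read_geojson` expects inside the file is not shown on this page. As a rough sketch of how point features could be flattened into the four movekit columns, the example below parses a small GeoJSON `FeatureCollection` with the standard library. The property names `time` and `animal_id` and the one-point-per-feature layout are assumptions made for this illustration.

```python
import json

# Hypothetical file content: one Point feature per observation
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [99.0, 0.0]},
     "properties": {"time": 1, "animal_id": "fish1"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [120.0, 4.0]},
     "properties": {"time": 1, "animal_id": "fish2"}}
  ]
}
"""

collection = json.loads(geojson_text)

# Flatten each feature into one row with time, animal_id, x and y
rows = []
for feature in collection["features"]:
    x, y = feature["geometry"]["coordinates"][:2]
    props = feature["properties"]
    rows.append({"time": props["time"], "animal_id": props["animal_id"],
                 "x": x, "y": y})
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows)` to obtain a frame with the same columns as the output above.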