Data preprocessing¶

[1]:

import movekit as mkit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read data input¶

[2]:

# Enter path to CSV file
path = "./datasets/fish-5.csv"

# Alternative: enter path to Excel file
# path = "./datasets/fish-5.xlsx"

[3]:

# Read in the file using movekit's reader
data = mkit.read_data(path)
data.head()

[3]:

	time	animal_id	x	y
0	1	312	405.29	417.76
1	1	511	369.99	428.78
2	1	607	390.33	405.89
3	1	811	445.15	411.94
4	1	905	366.06	451.76

General preprocessing method¶

[4]:

# Simple call of the preprocessing method
preprocessed_data = mkit.preprocess(data)

Total number of missing values =  0
time         0
animal_id    0
x            0
y            0
dtype: int64

[5]:

# OPTIONAL: more parameters to control the preprocessing of data

# preprocessed_data = mkit.preprocess(data, dropna=True, interpolation=False, limit=1, limit_direction="forward", inplace=False, method="linear", order=1, date_format=False)

# Parameters
#  data: DataFrame to perform preprocessing on
#  dropna: Optional parameter to drop rows with missing values for 'time' and 'animal_id'
#  interpolation: Optional parameter to interpolate missing values
#  limit: Maximum number of consecutive NANs to fill
#  limit_direction: If limit is specified, consecutive NaNs will be filled in this direction.
#  method: Interpolation technique to use. Default is "linear".
#  order: To be used in case of polynomial and spline interpolation.
#  date_format: Boolean to define whether time is some kind of date format instead of a number.
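To illustrate what `limit`, `limit_direction` and `method` control, here is a minimal pandas-only sketch. It assumes these parameters behave like the identically named arguments of `pandas.Series.interpolate`; that is an assumption about movekit's internals, and the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy track of one animal with two consecutive missing x values (made-up data).
df = pd.DataFrame({
    "time": [1, 2, 3, 4],
    "animal_id": [312, 312, 312, 312],
    "x": [1.0, np.nan, np.nan, 4.0],
    "y": [0.0, 1.0, 2.0, 3.0],
})

# Linear interpolation that fills at most one NaN per gap, moving forward in time.
filled = df["x"].interpolate(method="linear", limit=1, limit_direction="forward")
print(filled.tolist())
```

With `limit=1` only the first NaN of the gap is filled, so the second one remains missing.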

Examples of some sampling methods¶

For a large data set, it can be more efficient to reduce its size by sampling (systematically or randomly) or by filtering the data.

[6]:

downsampled_data = mkit.resample_systematic(preprocessed_data, 2)
filtered_data = mkit.filter_dataframe(preprocessed_data, 5, 6)
downsampled_data

[6]:

	time	animal_id	x	y
0	1	312	405.29	417.76
1	1	511	369.99	428.78
2	1	607	390.33	405.89
3	1	811	445.15	411.94
4	1	905	366.06	451.76
5	501	312	106.20	386.81
6	501	511	111.52	422.73
7	501	607	61.26	365.88
8	501	811	71.48	332.31
9	501	905	71.26	365.12
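The output above keeps only time steps 1 and 501 out of 1000, which suggests that the second argument is the number of time steps to keep. Under that assumption, systematic sampling can be sketched in plain pandas as follows; `systematic_sample` is a hypothetical helper for illustration, not part of movekit, and the toy data is made up.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, n_steps):
    """Keep n_steps evenly spaced time steps (hypothetical helper, not movekit API)."""
    times = np.sort(df["time"].unique())
    step = max(1, len(times) // n_steps)   # distance between kept time steps
    keep = times[::step][:n_steps]
    return df[df["time"].isin(keep)].reset_index(drop=True)

# Made-up data: 10 time steps, 2 animals.
toy = pd.DataFrame({
    "time": np.repeat(np.arange(1, 11), 2),
    "animal_id": np.tile([312, 511], 10),
    "x": np.linspace(0.0, 1.9, 20),
    "y": np.linspace(5.0, 6.9, 20),
})

sampled = systematic_sample(toy, 2)
print(sorted(sampled["time"].unique().tolist()))
```

All animals observed at a kept time step stay in the result, so the sample preserves complete snapshots of the group.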

Methods to replace/convert specific values (duplicates, missings, selected values)¶

One can replace the coordinate values of a specific mover at specific time steps. This can be a useful way to deal with outliers.

[7]:

arr_index = np.array([1, 3])
replaced_data_groups = mkit.replace_parts_animal_movement(preprocessed_data, 811, arr_index, 100, 90)
replaced_data_groups[replaced_data_groups['animal_id']==811]

[7]:

	time	animal_id	x	y
3	1	811	100.00	90.00
8	2	811	445.48	412.26
13	3	811	100.00	90.00
18	4	811	446.03	413.00
23	5	811	446.24	413.42
...	...	...	...	...
4978	996	811	761.31	307.65
4983	997	811	761.56	307.65
4988	998	811	761.86	307.65
4993	999	811	762.12	307.65
4998	1000	811	762.44	307.61

1000 rows × 4 columns
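The same kind of replacement can be expressed with a boolean mask in plain pandas. This is only a sketch of the idea on made-up data, not movekit's implementation.

```python
import numpy as np
import pandas as pd

# Made-up data: two animals over three time steps.
toy = pd.DataFrame({
    "time": [1, 1, 2, 2, 3, 3],
    "animal_id": [811, 905, 811, 905, 811, 905],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

times_to_fix = np.array([1, 3])

# Select the rows of animal 811 at the chosen time steps.
mask = (toy["animal_id"] == 811) & toy["time"].isin(times_to_fix)

# Work on a copy and overwrite only the selected coordinates.
fixed = toy.copy()
fixed.loc[mask, "x"] = 100.0
fixed.loc[mask, "y"] = 90.0
print(fixed[fixed["animal_id"] == 811])
```

Rows of other animals and of animal 811 at other time steps are left untouched.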

In many applications it is useful to normalize the coordinate data before the analysis.

[8]:

normalized_data = mkit.normalize(data)
normalized_data.head()

[8]:

	time	animal_id	x	y
0	1	312	0.496639	0.849376
1	1	511	0.446887	0.873750
2	1	607	0.475554	0.823122
3	1	811	0.552817	0.836504
4	1	905	0.441348	0.924578
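The normalized values above lie between 0 and 1, which is consistent with min-max scaling of each coordinate column. Assuming that is the scheme used, it can be sketched as follows; `min_max_normalize` is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def min_max_normalize(df, columns=("x", "y")):
    """Scale each coordinate column to [0, 1] (hypothetical helper, not movekit API)."""
    out = df.copy()
    for col in columns:
        lo, hi = out[col].min(), out[col].max()
        # Note: a constant column (hi == lo) would divide by zero here.
        out[col] = (out[col] - lo) / (hi - lo)
    return out

toy = pd.DataFrame({
    "time": [1, 2, 3],
    "animal_id": [312, 312, 312],
    "x": [0.0, 5.0, 10.0],
    "y": [2.0, 4.0, 3.0],
})

scaled = min_max_normalize(toy)
print(scaled)
```

The columns `time` and `animal_id` are not coordinates and are therefore left unchanged.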

There are two methods to get an overview of missing values and duplicate rows in the data.

[9]:

# For demonstration, set all x values at time step 3 to NaN.
# Work on a copy so that the original data stays unchanged.
missing_data = data.copy()
missing_data.loc[missing_data['time'] == 3, 'x'] = np.nan

mkit.print_missing(missing_data)
mkit.print_duplicate(missing_data)

Total number of missing values =  5
x            5
time         0
animal_id    0
y            0
dtype: int64
Duplicate rows based on the columns 'animal_id' and 'time' column are:
Empty DataFrame
Columns: [time, animal_id, x, y]
Index: []
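For readers who want the counts as objects rather than printed text, comparable summaries can be computed directly with pandas. This is a sketch on made-up data; it does not reproduce movekit's exact output format.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing x value and one duplicated (animal_id, time) pair.
toy = pd.DataFrame({
    "time": [1, 1, 2, 2, 2],
    "animal_id": [312, 511, 312, 511, 511],
    "x": [0.0, 1.0, np.nan, 3.0, 3.5],
    "y": [5.0, 6.0, 7.0, 8.0, 8.5],
})

# Number of missing values per column and in total.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

# All rows that share their 'animal_id' and 'time' with another row.
duplicates = toy[toy.duplicated(subset=["animal_id", "time"], keep=False)]

print(total_missing)
print(duplicates)
```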

Making a pandas DataFrame compatible with `movekit`¶

If the data is already stored in a pandas DataFrame, one can easily make the DataFrame compatible with movekit by renaming its columns.

[10]:

#Parameters:
# data: the existing data frame
# dictionary: Key-value pairs of column names. Keys store the old column names. The respective new column names are stored as their values. Values that need to be defined include 'time', 'animal_id', 'x' and 'y'.

wrong_df = pd.DataFrame({'Time':[0,1,2,3],'IDs':['A','B','C','D'],'x-values':[0,1,2,3],'y-values':[5,6,7,8]})
correct_df = mkit.from_dataframe(wrong_df, {'Time': 'time', 'IDs': 'animal_id', 'x-values': 'x', 'y-values': 'y'})
correct_df

[10]:

	time	animal_id	x	y
0	0	A	0	5
1	1	B	1	6
2	2	C	2	7
3	3	D	3	8
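Conceptually this is a column rename followed by a check that the four required columns exist. A minimal pandas sketch of that idea is shown below; `rename_for_movekit` is a hypothetical helper, and movekit's own function may perform additional steps.

```python
import pandas as pd

REQUIRED = ("time", "animal_id", "x", "y")

def rename_for_movekit(df, mapping):
    """Rename columns and verify the required ones exist (hypothetical helper)."""
    renamed = df.rename(columns=mapping)
    missing = [col for col in REQUIRED if col not in renamed.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return renamed

wrong_df = pd.DataFrame({
    "Time": [0, 1, 2, 3],
    "IDs": ["A", "B", "C", "D"],
    "x-values": [0, 1, 2, 3],
    "y-values": [5, 6, 7, 8],
})
mapping = {"Time": "time", "IDs": "animal_id", "x-values": "x", "y-values": "y"}

renamed_df = rename_for_movekit(wrong_df, mapping)
print(list(renamed_df.columns))
```

An incomplete mapping raises a `ValueError` early instead of failing later inside an analysis function.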

Support for 3D datasets¶

movekit also supports movement in three dimensions. Function calls stay the same for the user, because movekit recognizes the presence of a third dimension in the data.

Below we show an example of a 3D dataset that can be given to movekit.

[11]:

# create a synthetic 3D dataset by appending a third dimension to the 2D dataset from above
z = np.random.normal(loc=0.0, scale=1.0, size=len(preprocessed_data))
preprocessed_data['z'] = z
preprocessed_data

[11]:

	time	animal_id	x	y	z
0	1	312	0.496639	0.849376	-0.271515
1	1	511	0.446887	0.873750	0.209787
2	1	607	0.475554	0.823122	-0.518311
3	1	811	0.552817	0.836504	-0.104081
4	1	905	0.441348	0.924578	1.042317
...	...	...	...	...	...
4995	1000	312	0.941539	0.466381	-0.624974
4996	1000	511	0.859231	0.423671	-1.143292
4997	1000	607	0.944062	0.580819	-0.840995
4998	1000	811	1.000000	0.605746	-1.494142
4999	1000	905	0.880865	0.334668	-0.910113

5000 rows × 5 columns
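One way such dimension-agnostic behaviour can be achieved is to look up which coordinate columns are present and compute on exactly those. The sketch below illustrates the idea with step lengths of a single track; it assumes the third dimension is stored in a column named `z`, as in the example above, and is not movekit's actual code.

```python
import numpy as np
import pandas as pd

def coordinate_columns(df):
    """Return the coordinate columns present in df, in x, y, z order."""
    return [col for col in ("x", "y", "z") if col in df.columns]

def step_lengths(track):
    """Euclidean distance between consecutive rows of one animal's track."""
    coords = track[coordinate_columns(track)].to_numpy(dtype=float)
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)

# Made-up track: the same two points in 2D and, with an added z column, in 3D.
track_2d = pd.DataFrame({"x": [0.0, 3.0], "y": [0.0, 4.0]})
track_3d = track_2d.assign(z=[0.0, 12.0])

print(step_lengths(track_2d))
print(step_lengths(track_3d))
```

The calling code is identical for both tracks; only the detected columns differ.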

Support for geographic coordinates¶

movekit can project GPS coordinates given as latitude and longitude onto a Cartesian coordinate system.

[12]:

path = "./datasets/geo.csv"

# Read in the file using pandas (the columns are separated by semicolons)
geo_data = pd.read_csv(path, sep=';')
geo_data.head()

[12]:

	time	animal_id	latitude	longitude
0	1	1	47.691358	9.176731
1	1	2	52.472161	13.402034
2	1	3	47.692101	9.055353

[13]:

# mkit.convert_latlon(data, latitude='latitude', longitude='longitude', replace=True)

#Parameters:
#data: DataFrame with GPS coordinates
#latitude: str. Name of the column where latitude is stored
#longitude: str. Name of the column where longitude is stored
#replace: bool. Flag whether the xy columns should replace the latlon columns
#return: DataFrame after the transformation where latitude is projected into y and longitude is projected into x

projected_data = mkit.convert_latlon(geo_data)
projected_data.head()

[13]:

	time	animal_id	x	y
0	1	1	513261.777038	5.282012e+06
1	1	2	391460.276950	5.814756e+06
2	1	3	504153.593963	5.282081e+06
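The projected values above look like UTM eastings and northings in metres, with the zone chosen per point. This is an inference from the output, not something the page states. Assuming UTM, the zone a longitude falls into can be computed with the standard 6-degree rule, as in the sketch below; `utm_zone` is a hypothetical helper that ignores the special zones around Norway and Svalbard.

```python
def utm_zone(longitude):
    """Return the standard UTM zone number (1-60) for a longitude in degrees.

    Hypothetical helper: ignores the exceptions for Norway and Svalbard.
    """
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    # Zones are 6 degrees wide, starting at -180; clamp +180 into zone 60.
    return min(int((longitude + 180.0) // 6) + 1, 60)

# Longitudes of the three animals in geo.csv shown above.
for lon in (9.176731, 13.402034, 9.055353):
    print(lon, utm_zone(lon))
```

Points in different zones are measured against different central meridians, so their x values are not directly comparable.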

Support for data stored as GeoJSON and JSON¶

movekit can read data stored in GeoJSON (.geojson) or JSON (.json) files.

[14]:

json_data = mkit.read_geojson('./datasets/fish-4.geojson')
json_data

[14]:

	time	animal_id	x	y
0	1	fish1	99.0	0.0
1	1	fish2	120.0	4.0
2	1	fish3	120.0	6.0
3	2	fish1	101.0	1.0
4	2	fish2	200.0	5.0
5	2	fish3	33.0	5.0
6	3	fish1	8.0	8.0
7	3	fish2	125.0	43.0
8	3	fish3	45.0	87.0
9	4	fish1	-44.0	-11.0
10	4	fish2	12.0	5.0
11	4	fish3	11.0	-12.0
