ml_dataframe 1.5.0
ml_dataframe: ^1.5.0 copied to clipboard
An in-memory untyped data storage with the possibility to query and modify it
ml_dataframe #
A way to store and manipulate data
The library exposes in-memory storage for dynamically typed data. The storage is represented by DataFrame class.
Table of contents #
- Usage example
- DataFrame API
- Get the header
- Get the rows
- Get the series
- Get the shape
- Add a series
- Drop a series by a name
- Drop a series by an index
- Sample a dataframe from rows
- Sample a dataframe from series indices
- Sample a dataframe from series names
- Save a dataframe
- Shuffle rows of a dataframe
- Get a JSON representation
- Convert to Matrix
- Get a series by name
- Get a series by index
- Map values
- Map values of a series
- Ways to create a dataframe
- Prefilled dataframes
- Contacts
Usage example: #
final data = [
['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'],
[ 1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[ 2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[ 89, 5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'],
[ 90, 5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
[ 91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
];
final dataframe = DataFrame(data);
print(dataframe);
// DataFrame (5 x 6)
// Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
// 1 5.1 3.5 1.4 0.2 Iris-setosa
// 2 4.9 3.0 1.4 0.2 Iris-setosa
// 89 5.6 3.0 4.1 1.3 Iris-versicolor
// 90 5.5 2.5 4.0 1.3 Iris-versicolor
// 91 5.5 2.6 4.4 1.2 Iris-versicolor
DataFrame
API with examples: #
Let's assume that all the examples below are applied to the dataframe instance which was created above.
Get the header of the data #
By default, the very first row is considered a header, unless one specify their own header or autogenerated one. More on this is here
final header = dataframe.header;
print(header);
// ['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']
Get the rows of the data #
final rows = dataframe.rows;
print(rows);
// [
// [1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
// [2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
// [89, 5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'],
// [90, 5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
// [91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
// ],
Get the series collection (columns) of the data #
final series = dataframe.series;
print(series);
// [
// 'Id': [1, 2, 89, 90, 91],
// 'SepalLengthCm': [5.1, 4.9, 5.6, 5.5, 5.5],
// 'SepalWidthCm': [3.5, 3.0, 3.0, 2.5, 2.6],
// 'PetalLengthCm': [1.4, 1.4, 4.1, 4.0, 4.4],
// 'PetalWidthCm': [0.2, 0.2, 1.3, 1.3, 1.2],
// 'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor'],
// ],
Get the shape of the data #
final shape = dataframe.shape;
print(shape);
// [5, 6] - 5 rows, 6 columns
Add a series #
final firstSeries = Series('super_series', [1, 2, 3, 4, 5, 6]);
dataframe.addSeries([firstSeries]);
print(dataframe.series.last);
// 'super_series': [1, 2, 3, 4, 5, 6]
Drop a series by a series name #
print(dataframe.shape);
// [5, 6] - 6 rows, 6 columns
dataframe.dropSeries(names: ['Id']);
print(dataframe.shape);
// [5, 5] - after a series had been dropped, the number of columns became one lesser
Drop a series by a series index #
print(dataframe.shape);
// [5, 6] - 5 rows, 6 columns
dataframe.dropSeries(indices: [0]);
print(dataframe.shape);
// [5, 5] - after a series had been dropped, the number of columns became one lesser
Sample a new dataframe from rows of an existing dataframe #
final sampled = dataframe.sampleFromRows([0, 5]);
print(sampled);
// DataFrame (2 x 6)
// Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
// 1 5.1 3.5 1.4 0.2 Iris-setosa
// 91 5.5 2.6 4.4 1.2 Iris-versicolor
Sample a new dataframe from series indices of an existing dataframe #
final sampled = dataframe.sampleFromSeries(indices: [0, 1]);
print(sampled);
// DataFrame (5 x 2)
// Id SepalLengthCm
// 1 5.1
// 2 4.9
// 89 5.6
// 90 5.5
// 91 5.5
Sample a new dataframe from series names of an existing dataframe #
final sampled = dataframe.sampleFromSeries(names: ['Id', 'SepalLengthCm']);
print(sampled);
// DataFrame (5 x 2)
// Id SepalLengthCm
// 1 5.1
// 2 4.9
// 89 5.6
// 90 5.5
// 91 5.5
Save a dataframe to a JSON file #
await dataframe.saveAsJson('path/to/json/file.json');
Shuffle rows in a dataframe #
print(dataframe);
// DataFrame (5 x 6)
// Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
// 1 5.1 3.5 1.4 0.2 Iris-setosa
// 2 4.9 3.0 1.4 0.2 Iris-setosa
// 89 5.6 3.0 4.1 1.3 Iris-versicolor
// 90 5.5 2.5 4.0 1.3 Iris-versicolor
// 91 5.5 2.6 4.4 1.2 Iris-versicolor
dataframe.shuffle();
print(dataframe);
// DataFrame (5 x 6)
// Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
// 89 5.6 3.0 4.1 1.3 Iris-versicolor
// 1 5.1 3.5 1.4 0.2 Iris-setosa
// 91 5.5 2.6 4.4 1.2 Iris-versicolor
// 2 4.9 3.0 1.4 0.2 Iris-setosa
// 90 5.5 2.5 4.0 1.3 Iris-versicolor
One can use seed
parameter to keep the order of rows disregard the number of shuffle
calls:
dataframe.shuffle(seed: 10);
Get a json-serializable representation #
final json = dataframe.toJson(); // json contains a serializable map
Convert a dataframe to a matrix: #
dataframe.toMatrix();
the method throws an error if there are inconvertible to a number values in the dataframe.
Get a series by its index #
final series = dataframe[0];
print(series);
// Id: [1, 2, 89, 90, 91]
Get a series by its name #
final series = dataframe['Id'];
print(series);
// Id: [1, 2, 89, 90, 91]
Map values of a dataframe #
import 'package:ml_dataframe/ml_dataframe';
void main() {
final data = DataFrame([
['col_1', 'col_2', 'col_3'],
[ 2, 20, 200],
[ 3, 30, 300],
[ 4, 40, 400],
]);
// the first generic type ia a type of the source value, the second generic type is a type of the mapped value
final modifiedData = data.map<num, num>((value) => value * 2);
print(modifiedData);
// DataFrame (3 x 3)
// col_1 col_2 col_3
// 4 40 400
// 6 60 600
// 8 80 800
}
Map values of a specific dataframe series #
import 'package:ml_dataframe/ml_dataframe';
void main() {
final data = DataFrame([
['col_1', 'col_2', 'col_3'],
[ 2, 20, 200],
[ 3, 30, 300],
[ 4, 40, 400],
]);
// the first generic type ia a type of the source value, the second generic type is a type of the mapped value
final modifiedData = data.mapSeries<num, num>((value) => value * 2, name: 'col_2');
print(modifiedData);
// DataFrame (3 x 3)
// col_1 col_2 col_3
// 2 40 200
// 3 60 300
// 4 80 400
}
Ways to create a dataframe #
DataFrame
constructor #
import 'package:ml_dataframe/ml_dataframe.dart';
final data = [
['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'],
[ 1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[ 2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[ 89, 5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'],
[ 90, 5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
[ 91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
];
final dataframe = DataFrame(data);
By default, the very first row is considered a header. If the data does not have a header, one can use autogenerated
header by providing headerExists: false
config to the constructor:
final data = [
[1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[89, 5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'],
[90, 5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
[91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
];
final dataframe = DataFrame(data, headerExists: false);
print(dataframe.header);
It outputs ['col_1', 'col_2', 'col_3', 'col_4', 'col_5', 'col_6']
. col_
is a default prefix for the autogenerated
columns.
Also, if there are no header row in the data, one can use a predefined header:
final data = [
[1, 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[2, 4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[89, 5.6, 3.0, 4.1, 1.3, 'Iris-versicolor'],
[90, 5.5, 2.5, 4.0, 1.3, 'Iris-versicolor'],
[91, 5.5, 2.6, 4.4, 1.2, 'Iris-versicolor'],
];
final dataframe = DataFrame(data, header: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6']);
fromCsv
function #
import 'package:ml_dataframe/ml_dataframe.dart';
final data = await fromCsv('path/to/csv/file.csv');
If the csv
file doe not have a header row, it's needed to provide the corresponding flag:
import 'package:ml_dataframe/ml_dataframe.dart';
final data = await fromCsv('path/to/csv/file.csv', headerExists: false);
Restore previously persisted as a json file dataframe - fromJson
function #
import 'package:ml_dataframe/ml_dataframe.dart';
final data = await fromJson('path/to/json/file.json');
This function works in conjunction with DataFrame saveAsJson
method.
Dataframes with prefilled data #
In order to test data processing algorithms, one can use "toy" datasets. The library exposes several of them:
Iris dataset - function getIrisDataset
#
One can create a dataframe filled with Iris data:
import 'package:ml_dataframe/ml_dataframe.dart';
void main() async {
final data = await getIrisDataset();
print(data);
// DataFrame (150 x 6)
// Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
// ...
}
Pima Indians diabetes dataset - function getPimaIndiansDiabetesDataFrame
#
One can create a dataframe filled with Pima Indians diabetes data:
import 'package:ml_dataframe/ml_dataframe.dart';
void main() {
final data = getPimaIndiansDiabetesDataFrame();
print(data);
// DataFrame (768 x 9)
// Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
// ...
}
Red wine quality dataset - function getWineQualityDataframe
#
One can create a dataframe filled with Red wine quality data:
import 'package:ml_dataframe/ml_dataframe.dart';
void main() {
final data = getWineQualityDataframe();
print(data);
// DataFrame (1599 x 12)
// fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
// ...
}
Contacts #
If you have questions, feel free to text me on