rankeval.dataset package

The rankeval.dataset module includes utilities to load and dump datasets according to several supported formats.

class rankeval.dataset.Dataset(X, y, query_ids, name=None)[source]

Bases: object

This class describes the dataset object, along with its utilities and features.

X : numpy 2d array of float
It is a dense numpy matrix of shape (n_samples, n_features).
y : numpy 1d array of float
It is a ndarray of shape (n_samples,) with the gold labels.
query_ids : numpy 1d array of int
It is a ndarray of shape (n_samples,) with the query id of each sample.
name : str
The name to give to the dataset
n_instances : int
The number of instances in the dataset
n_features : int
The number of features in the dataset
n_queries : int
The number of queries in the dataset

X : numpy.ndarray
The matrix with feature values
y : numpy.array
The vector with label values
query_ids : numpy.array
The vector with the query_id for each sample.
clear_X()[source]

This method frees the space used by the dataset instance for storing X (the dataset features). This space is needed only for scoring, so it can be freed afterwards.

dump(f, format)[source]

This method writes a previously loaded dataset to file in the given format.

f : path
The file path where to store the dataset
format : str
The format to use for dumping the dataset to file (currently only the “svmlight” format is supported)
static load(f, name=None, format='svmlight')[source]

This static method implements the loading of a dataset from file.

f : path
The file name of the dataset to load
name : str
The name to be given to the current dataset
format : str
The format of the dataset file to load (currently only the “svmlight” format is supported)
dataset : Dataset
The dataset read from file
query_offset_iterator()[source]

This method implements an iterator over the offsets of the query_ids in the dataset.

offsets : tuple of (int, int)
The row indices of the instances belonging to the same query. The two indices represent the (start, end) offsets.
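These offsets are straightforward to compute from the query_ids vector, assuming rows of the same query are contiguous. The following is a plain-NumPy sketch of what the iterator yields, using made-up toy data (not the library's implementation):

```python
import numpy as np

# Toy query_ids vector: 3 documents for query 1, 2 for query 2, 4 for query 7.
query_ids = np.array([1, 1, 1, 2, 2, 7, 7, 7, 7])

def query_offsets(query_ids):
    """Yield (start, end) row offsets for each run of equal query ids,
    mirroring what query_offset_iterator() is documented to return."""
    # Positions where the query id changes, plus the two boundaries.
    boundaries = np.concatenate(
        ([0], np.where(query_ids[1:] != query_ids[:-1])[0] + 1, [len(query_ids)]))
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        yield int(start), int(end)

offsets = list(query_offsets(query_ids))
print(offsets)  # [(0, 3), (3, 5), (5, 9)]
```

Each (start, end) pair can then be used to slice X[start:end] and y[start:end] for per-query processing.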
subset_features(features)[source]

Create a new Dataset containing only the features identified by the given feature indices. It is useful for performing feature selection.

features : numpy array or list
The indices of the features to select in the resulting dataset
dataset : rankeval.dataset.Dataset
The resulting dataset with the given subset of features
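In terms of the underlying arrays, subsetting features amounts to NumPy column indexing on X, while y and query_ids are left untouched. A minimal sketch with made-up toy data (not the library's code):

```python
import numpy as np

# Toy feature matrix: 4 samples, 3 features.
X = np.arange(12, dtype=float).reshape(4, 3)
features = [0, 2]            # indices of the features to keep

# Column indexing is the core of the operation; the real method wraps
# the result in a new Dataset carrying the same y and query_ids.
X_subset = X[:, features]
print(X_subset.shape)  # (4, 2)
```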
class rankeval.dataset.DatasetContainer[source]

Bases: object

This class is a container used to easily manage a dataset and the associated learning-to-rank models trained on it. It also offers the possibility to store the license coming with a public dataset.

license_agreement = ''
model_filenames = None
test_dataset = None
train_dataset = None
validation_dataset = None

Submodules

rankeval.dataset.dataset module

This module implements the generic class for loading/dumping a dataset from/to file.

class rankeval.dataset.dataset.Dataset(X, y, query_ids, name=None)[source]

Bases: object

This class describes the dataset object, along with its utilities and features.

X : numpy 2d array of float
It is a dense numpy matrix of shape (n_samples, n_features).
y : numpy 1d array of float
It is a ndarray of shape (n_samples,) with the gold labels.
query_ids : numpy 1d array of int
It is a ndarray of shape (n_samples,) with the query id of each sample.
name : str
The name to give to the dataset
n_instances : int
The number of instances in the dataset
n_features : int
The number of features in the dataset
n_queries : int
The number of queries in the dataset

X : numpy.ndarray
The matrix with feature values
y : numpy.array
The vector with label values
query_ids : numpy.array
The vector with the query_id for each sample.
clear_X()[source]

This method frees the space used by the dataset instance for storing X (the dataset features). This space is needed only for scoring, so it can be freed afterwards.

dump(f, format)[source]

This method writes a previously loaded dataset to file in the given format.

f : path
The file path where to store the dataset
format : str
The format to use for dumping the dataset to file (currently only the “svmlight” format is supported)
static load(f, name=None, format='svmlight')[source]

This static method implements the loading of a dataset from file.

f : path
The file name of the dataset to load
name : str
The name to be given to the current dataset
format : str
The format of the dataset file to load (currently only the “svmlight” format is supported)
dataset : Dataset
The dataset read from file
query_offset_iterator()[source]

This method implements an iterator over the offsets of the query_ids in the dataset.

offsets : tuple of (int, int)
The row indices of the instances belonging to the same query. The two indices represent the (start, end) offsets.
subset_features(features)[source]

Create a new Dataset containing only the features identified by the given feature indices. It is useful for performing feature selection.

features : numpy array or list
The indices of the features to select in the resulting dataset
dataset : rankeval.dataset.Dataset
The resulting dataset with the given subset of features

rankeval.dataset.dataset_container module

class rankeval.dataset.dataset_container.DatasetContainer[source]

Bases: object

This class is a container used to easily manage a dataset and the associated learning-to-rank models trained on it. It also offers the possibility to store the license coming with a public dataset.

license_agreement = ''
model_filenames = None
test_dataset = None
train_dataset = None
validation_dataset = None

rankeval.dataset.datasets_fetcher module

rankeval.dataset.datasets_fetcher.load_dataset(dataset_name, fold=None, download_if_missing=True, force_download=False, with_models=True)[source]

This method allows downloading a given dataset (and the available models) by providing its name.

Datasets and models are available at the following link:
http://rankeval.isti.cnr.it/rankeval-datasets/dataset_dictionary.json
dataset_name : str
The name of the dataset (and models) to download.
fold : optional, None by default.
If provided, an integer identifying the specific fold to load. Example: dataset_name=msn10k, fold=1, will load train/validation/test files from the ‘Fold1’ directory. This option holds when using datasets that are already k-folded.
download_if_missing : optional, True by default.
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
force_download : optional, False by default.
If True, download data even if it is on disk.
with_models : optional, True by default.
When True, the method downloads the models generated with different tools (QuickRank, LightGBM, XGBoost, etc.) to ease the comparison.

rankeval.dataset.svmlight_format module

This module implements a fast and memory-efficient (no memory copying) loader for the svmlight / libsvm sparse dataset format.

rankeval.dataset.svmlight_format.dump_svmlight_file(X, y, f, query_id=None, zero_based=True)[source]

Dump the dataset in svmlight / libsvm file format.

This format is a text-based format, with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

X : CSR sparse matrix, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
f : str
Specifies the path that will contain the data.
comment : list, optional
Comments to append to each row after a # character. If specified, len(comment) must equal n_samples.
query_id : list, optional
Query identifiers to prepend to each row. If specified, len(query_id) must equal n_samples.
zero_based : boolean, optional
Whether column indices should be written zero-based (True) or one-based (False).
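The resulting text format is easy to illustrate. The following toy helper (a sketch, not the library's implementation) formats one sample by hand, showing the target, the optional qid field, and the index:value pairs for the non-zero features:

```python
def svmlight_line(target, features, qid=None, zero_based=True):
    """Format one sample in the svmlight / libsvm text format (sketch)."""
    offset = 0 if zero_based else 1
    parts = [repr(float(target))]
    if qid is not None:
        parts.append("qid:%d" % qid)
    # Zero-valued features are simply omitted, which is why the format
    # suits sparse datasets.
    parts += ["%d:%g" % (i + offset, v) for i, v in enumerate(features) if v != 0]
    return " ".join(parts)

print(svmlight_line(2.0, [0.5, 0.0, 1.25], qid=1, zero_based=False))
# 2.0 qid:1 1:0.5 3:1.25
```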
rankeval.dataset.svmlight_format.load_svmlight_file(file_path, buffer_mb=40, query_id=False)[source]

Load datasets in the svmlight / libsvm format into a sparse CSR matrix

This format is a text-based format, with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

This format is used as the default format for both svmlight and the libsvm command line programs.

Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call, and benefit from the near-instantaneous loading of memmapped structures in subsequent calls.

file_path : str
Path to a file to load.
buffer_mb : integer
Buffer size to use for low level read
query_id : bool
True if the query ids have to be loaded, False otherwise

(X, y, [query_ids])

where X is a dense numpy matrix of shape (n_samples, n_features),
y is a ndarray of shape (n_samples,), and query_ids is a ndarray of shape (n_samples,) that is returned only if query_id is True.
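To make the parsed structure concrete, here is a pure-Python sketch of how a single svmlight line decomposes into label, qid and feature values (the actual loader is a fast low-level reader returning numpy arrays, not this code):

```python
def parse_svmlight_line(line):
    """Split one svmlight / libsvm line into (label, qid, {index: value})."""
    tokens = line.split("#", 1)[0].split()   # drop any trailing comment
    label = float(tokens[0])
    qid = None
    features = {}
    for tok in tokens[1:]:
        key, value = tok.split(":")
        if key == "qid":
            qid = int(value)
        else:
            features[int(key)] = float(value)
    return label, qid, features

print(parse_svmlight_line("2.0 qid:1 1:0.5 3:1.25 # doc42"))
# (2.0, 1, {1: 0.5, 3: 1.25})
```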
rankeval.dataset.svmlight_format.load_svmlight_files(files, buffer_mb=40, query_id=False)[source]

Load dataset from multiple files in SVMlight format

This function is equivalent to mapping load_svmlight_file over a list of files, except that the results are concatenated into a single, flat list and the sample vectors are constrained to all have the same number of features.

files : iterable over str
Paths to files to load.
n_features : int or None
The number of features to use. If None, it will be inferred from the first file. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another.

[X1, y1, …, Xn, yn]

where each (Xi, yi, [comment_i, query_id_i]) tuple is the result from load_svmlight_file(files[i]).

When fitting a model to a matrix X_train and evaluating it against a matrix X_test, it is essential that X_train and X_test have the same number of features (X_train.shape[1] == X_test.shape[1]). This may not be the case if you load them with load_svmlight_file separately.
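The pitfall can be reproduced with plain NumPy: if the last features of one file are never non-zero, the inferred width differs and the matrices are incompatible until the narrower one is padded. A toy illustration (not rankeval code):

```python
import numpy as np

# Suppose the training file yields 5 features, but the test file, whose
# last two features happen to be always zero, yields only 3.
X_train = np.ones((4, 5))
X_test = np.ones((2, 3))

# Pad the narrower matrix with zero columns so that
# X_train.shape[1] == X_test.shape[1] holds.
if X_test.shape[1] < X_train.shape[1]:
    pad = X_train.shape[1] - X_test.shape[1]
    X_test = np.hstack([X_test, np.zeros((X_test.shape[0], pad))])

print(X_test.shape)  # (2, 5)
```

Loading both files through load_svmlight_files avoids the problem altogether, since the feature counts are made consistent at load time.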

See also: load_svmlight_file

rankeval.dataset.write_json_dataset_catalogue module

rankeval.dataset.write_json_dataset_catalogue.main()[source]