rankeval.dataset package

The rankeval.dataset module includes utilities to load and dump datasets according to several supported formats.

class rankeval.dataset.Dataset(X, y, query_ids, name=None)[source]

Bases: object

This class describes the dataset object, along with its utilities and features.

X : numpy 2d array of float
It is a dense numpy matrix of shape (n_samples, n_features).
y : numpy 1d array of float
It is a ndarray of shape (n_samples,) with the gold labels.
query_ids : numpy 1d array of int
It is a ndarray of shape (n_samples,) with the query id of each sample.
name : str
The name to give to the dataset
n_instances : int
The number of instances in the dataset
n_features : int
The number of features in the dataset
n_queries : int
The number of queries in the dataset

X : numpy.ndarray
The matrix with feature values
y : numpy.array
The vector with label values
query_ids : numpy.array
The vector with the query_id for each sample.
clear_X()[source]

This method frees the space used by the dataset instance for storing X (the dataset features). This space is needed only for scoring, so it can be freed afterwards.

dump(f, format)[source]

This method writes a previously loaded dataset to file in the given format.

f : path
The file path where to store the dataset
format : str
The format to use for dumping the dataset to file (currently only the “svmlight” format is supported)
static load(f, name=None, format='svmlight')[source]

This static method implements the loading of a dataset from file.

f : path
The file name of the dataset to load
name : str
The name to be given to the current dataset
format : str
The format of the dataset file to load (currently only the “svmlight” format is supported)
dataset : Dataset
The dataset read from file
query_offset_iterator()[source]

This method implements an iterator over the offsets of the query_ids in the dataset.

offsets : tuple of (int, int)
The row indices of the instances belonging to the same query. The two indices represent the (start, end) offsets.
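These offsets are straightforward to compute from the query_ids vector, assuming rows of the same query are contiguous. The following is a plain-NumPy sketch of what the iterator yields, using made-up toy data (not the library's implementation):

```python
import numpy as np

# Toy query_ids vector: 3 documents for query 1, 2 for query 2, 4 for query 7.
query_ids = np.array([1, 1, 1, 2, 2, 7, 7, 7, 7])

def query_offsets(query_ids):
    """Yield (start, end) row offsets for each run of equal query ids,
    mirroring what query_offset_iterator() is documented to return."""
    # Positions where the query id changes, plus the two boundaries.
    boundaries = np.concatenate(
        ([0], np.where(query_ids[1:] != query_ids[:-1])[0] + 1, [len(query_ids)]))
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        yield int(start), int(end)

offsets = list(query_offsets(query_ids))
print(offsets)  # [(0, 3), (3, 5), (5, 9)]
```

Each (start, end) pair can then be used to slice X[start:end] and y[start:end] for per-query processing.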
subset_features(features)[source]

Create a new Dataset containing only the features identified by the given feature indices. It is useful for performing feature selection.

features : numpy array or list
The indices of the features to select in the resulting dataset
dataset : rankeval.dataset.Dataset
The resulting dataset with the given subset of features
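In terms of the underlying arrays, subsetting features amounts to NumPy column indexing on X, while y and query_ids are left untouched. A minimal sketch with made-up toy data (not the library's code):

```python
import numpy as np

# Toy feature matrix: 4 samples, 3 features.
X = np.arange(12, dtype=float).reshape(4, 3)
features = [0, 2]            # indices of the features to keep

# Column indexing is the core of the operation; the real method wraps
# the result in a new Dataset carrying the same y and query_ids.
X_subset = X[:, features]
print(X_subset.shape)  # (4, 2)
```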
class rankeval.dataset.DatasetContainer[source]

Bases: object

This class is a container used to easily manage a dataset and the associated learning-to-rank models trained on it. It also offers the possibility to store the license coming with a public dataset.

license_agreement = ''
model_filenames = None
test_dataset = None
train_dataset = None
validation_dataset = None

Submodules

rankeval.dataset.dataset module

This module implements the generic class for loading/dumping a dataset from/to file.

class rankeval.dataset.dataset.Dataset(X, y, query_ids, name=None)[source]

Bases: object

This class describes the dataset object, along with its utilities and features.

X : numpy 2d array of float
It is a dense numpy matrix of shape (n_samples, n_features).
y : numpy 1d array of float
It is a ndarray of shape (n_samples,) with the gold labels.
query_ids : numpy 1d array of int
It is a ndarray of shape (n_samples,) with the query id of each sample.
name : str
The name to give to the dataset
n_instances : int
The number of instances in the dataset
n_features : int
The number of features in the dataset
n_queries : int
The number of queries in the dataset

X : numpy.ndarray
The matrix with feature values
y : numpy.array
The vector with label values
query_ids : numpy.array
The vector with the query_id for each sample.
clear_X()[source]

This method frees the space used by the dataset instance for storing X (the dataset features). This space is needed only for scoring, so it can be freed afterwards.

dump(f, format)[source]

This method writes a previously loaded dataset to file in the given format.

f : path
The file path where to store the dataset
format : str
The format to use for dumping the dataset to file (currently only the “svmlight” format is supported)
static load(f, name=None, format='svmlight')[source]

This static method implements the loading of a dataset from file.

f : path
The file name of the dataset to load
name : str
The name to be given to the current dataset
format : str
The format of the dataset file to load (currently only the “svmlight” format is supported)
dataset : Dataset
The dataset read from file
query_offset_iterator()[source]

This method implements an iterator over the offsets of the query_ids in the dataset.

offsets : tuple of (int, int)
The row indices of the instances belonging to the same query. The two indices represent the (start, end) offsets.
subset_features(features)[source]

Create a new Dataset containing only the features identified by the given feature indices. It is useful for performing feature selection.

features : numpy array or list
The indices of the features to select in the resulting dataset
dataset : rankeval.dataset.Dataset
The resulting dataset with the given subset of features

rankeval.dataset.dataset_container module

class rankeval.dataset.dataset_container.DatasetContainer[source]

Bases: object

This class is a container used to easily manage a dataset and the associated learning-to-rank models trained on it. It also offers the possibility to store the license coming with a public dataset.

license_agreement = ''
model_filenames = None
test_dataset = None
train_dataset = None
validation_dataset = None

rankeval.dataset.datasets_fetcher module

rankeval.dataset.datasets_fetcher.load_dataset(dataset_name, fold=None, download_if_missing=True, force_download=False, with_models=True)[source]

This method allows downloading a given dataset (and the available models) by providing its name.

Datasets and models are available at the following link:
http://rankeval.isti.cnr.it/rankeval-datasets/dataset_dictionary.json
dataset_name : str
The name of the dataset (and models) to download.
fold : optional, None by default.
If provided, an integer identifying the specific fold to load. Example: dataset_name=msn10k, fold=1, will load train/validation/test files from the ‘Fold1’ directory. This option holds when using datasets that are already k-folded.
download_if_missing : optional, True by default.
If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.
force_download : optional, False by default.
If True, download data even if it is on disk.
with_models : optional, True by default.
When True, the method downloads the models generated with different tools (QuickRank, LightGBM, XGBoost, etc.) to ease the comparison.

rankeval.dataset.svmlight_format module

This module implements a fast and memory-efficient (no memory copying) loader for the svmlight / libsvm sparse dataset format.

rankeval.dataset.svmlight_format.dump_svmlight_file(X, y, f, query_id=None, zero_based=True)[source]

Dump the dataset in svmlight / libsvm file format.

This format is a text-based format, with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

X : CSR sparse matrix, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]
Target values.
f : str
Specifies the path that will contain the data.
comment : list, optional
Comments to append to each row after a # character. If specified, len(comment) must equal n_samples.
query_id : list, optional
Query identifiers to prepend to each row. If specified, len(query_id) must equal n_samples.
zero_based : boolean, optional
Whether column indices should be written zero-based (True) or one-based (False).
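The resulting text format is easy to illustrate. The following toy helper (a sketch, not the library's implementation) formats one sample by hand, showing the target, the optional qid field, and the index:value pairs for the non-zero features:

```python
def svmlight_line(target, features, qid=None, zero_based=True):
    """Format one sample in the svmlight / libsvm text format (sketch)."""
    offset = 0 if zero_based else 1
    parts = [repr(float(target))]
    if qid is not None:
        parts.append("qid:%d" % qid)
    # Zero-valued features are simply omitted, which is why the format
    # suits sparse datasets.
    parts += ["%d:%g" % (i + offset, v) for i, v in enumerate(features) if v != 0]
    return " ".join(parts)

print(svmlight_line(2.0, [0.5, 0.0, 1.25], qid=1, zero_based=False))
# 2.0 qid:1 1:0.5 3:1.25
```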
rankeval.dataset.svmlight_format.load_svmlight_file(file_path, buffer_mb=40, query_id=False)[source]

Load datasets in the svmlight / libsvm format into a sparse CSR matrix

This format is a text-based format, with one sample per line. It does not store zero-valued features and is therefore suitable for sparse datasets.

The first element of each line can be used to store a target variable to predict.

This format is used as the default format for both svmlight and the libsvm command line programs.

Parsing a text-based source can be expensive. When working repeatedly on the same dataset, it is recommended to wrap this loader with joblib.Memory.cache to store a memmapped backup of the CSR results of the first call, and benefit from the near-instantaneous loading of memmapped structures in subsequent calls.

file_path : str
Path to a file to load.
buffer_mb : integer
Buffer size to use for low level read
query_id : bool
True if the query ids have to be loaded, False otherwise

(X, y, [query_ids])

where X is a dense numpy matrix of shape (n_samples, n_features),
y is a ndarray of shape (n_samples,), and query_ids is a ndarray of shape (n_samples,) that is returned only if query_id is True.
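To make the parsed structure concrete, here is a pure-Python sketch of how a single svmlight line decomposes into label, qid and feature values (the actual loader is a fast low-level reader returning numpy arrays, not this code):

```python
def parse_svmlight_line(line):
    """Split one svmlight / libsvm line into (label, qid, {index: value})."""
    tokens = line.split("#", 1)[0].split()   # drop any trailing comment
    label = float(tokens[0])
    qid = None
    features = {}
    for tok in tokens[1:]:
        key, value = tok.split(":")
        if key == "qid":
            qid = int(value)
        else:
            features[int(key)] = float(value)
    return label, qid, features

print(parse_svmlight_line("2.0 qid:1 1:0.5 3:1.25 # doc42"))
# (2.0, 1, {1: 0.5, 3: 1.25})
```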
rankeval.dataset.svmlight_format.load_svmlight_files(files, buffer_mb=40, query_id=False)[source]

Load dataset from multiple files in SVMlight format

This function is equivalent to mapping load_svmlight_file over a list of files, except that the results are concatenated into a single, flat list and the sample vectors are constrained to all have the same number of features.

files : iterable over str
Paths to files to load.
n_features : int or None
The number of features to use. If None, it will be inferred from the first file. This argument is useful to load several files that are subsets of a bigger sliced dataset: each subset might not have examples of every feature, hence the inferred shape might vary from one slice to another.

[X1, y1, …, Xn, yn]

where each (Xi, yi, [comment_i, query_id_i]) tuple is the result from load_svmlight_file(files[i]).

When fitting a model to a matrix X_train and evaluating it against a matrix X_test, it is essential that X_train and X_test have the same number of features (X_train.shape[1] == X_test.shape[1]). This may not be the case if you load them with load_svmlight_file separately.
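The pitfall can be reproduced with plain NumPy: if the last features of one file are never non-zero, the inferred width differs and the matrices are incompatible until the narrower one is padded. A toy illustration (not rankeval code):

```python
import numpy as np

# Suppose the training file yields 5 features, but the test file, whose
# last two features happen to be always zero, yields only 3.
X_train = np.ones((4, 5))
X_test = np.ones((2, 3))

# Pad the narrower matrix with zero columns so that
# X_train.shape[1] == X_test.shape[1] holds.
if X_test.shape[1] < X_train.shape[1]:
    pad = X_train.shape[1] - X_test.shape[1]
    X_test = np.hstack([X_test, np.zeros((X_test.shape[0], pad))])

print(X_test.shape)  # (2, 5)
```

Loading both files through load_svmlight_files avoids the problem altogether, since the feature counts are made consistent at load time.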

See also: load_svmlight_file

rankeval.dataset.write_json_dataset_catalogue module

rankeval.dataset.write_json_dataset_catalogue.main()[source]