rankeval.metrics package

The rankeval.metrics module includes the definition and implementation of the most common metrics adopted in the Learning to Rank community.

class rankeval.metrics.Metric(name)[source]

Bases: object

Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices for a certain query and another for iterating over the entire dataset based on those indices.

Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms

The constructor for any metric; it initializes that metric with the proper name.

name : string
Represents the name of that metric instance.
eval(dataset, y_pred)[source]

This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.

dataset : Dataset
Represents the Dataset object on which we want to apply the metric.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average value of the metric over all per-query metric scores.
detailed_scores: numpy 1d array of floats
Represents the detailed metric scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method evaluates the predicted scores for a specific query within the dataset.

y: numpy array
Represents the instance labels corresponding to the queries in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
score: float
Represents the metric score for one query.
query_iterator(dataset, y_pred)[source]

This method iterates over the dataset document scores (instance labels) and the predicted scores in blocks of instances which belong to the same query.

dataset : Dataset
Represents the Dataset object on which to iterate.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.

For each query, the iterator yields:

: int
The query id.
: numpy.array
The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
: numpy.array
The predicted scores for the instances in the dataset belonging to the same query id.
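A minimal usage sketch of this interface is given below. It assumes that query_iterator yields (query id, labels, predicted scores) tuples as documented above, and that eval returns the pair (avg_score, detailed_scores); the FirstRelevant subclass is purely hypothetical and not part of the library:

    import numpy as np

    from rankeval.metrics import Metric


    class FirstRelevant(Metric):
        # Hypothetical metric: rank position of the first relevant document.

        def __init__(self, name="FirstRel"):
            super(FirstRelevant, self).__init__(name)

        def eval(self, dataset, y_pred):
            # Reuse the per-query iterator provided by the base class.
            scores = np.array([self.eval_per_query(y, yp)
                               for qid, y, yp in self.query_iterator(dataset, y_pred)])
            return scores.mean(), scores

        def eval_per_query(self, y, y_pred):
            ranking = np.argsort(y_pred)[::-1]   # documents sorted by predicted score
            hits = np.where(y[ranking] > 0)[0]   # positions of relevant documents
            return float(hits[0] + 1) if hits.size else 0.0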
class rankeval.metrics.Precision(name='P', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
P
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.

dataset : Dataset
Represents the Dataset object on which to apply Precision.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Precision score (averages over the detailed precision scores).
detailed_scores: numpy 1d array of floats
The detailed Precision scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Precision at the per-query level (on the instances belonging to a specific query). The Precision per query is calculated as (relevant docs & retrieved docs) / retrieved docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
precision: float
The precision per query.
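As a worked illustration of the formula above, the following standalone NumPy sketch computes precision for a single query; it assumes documents are ranked by descending predicted score and that labels at or above threshold count as relevant, and it is not the library's internal implementation:

    import numpy as np

    def precision_per_query(y, y_pred, cutoff=None, threshold=1):
        # P@k for one query: |relevant & retrieved| / |retrieved|.
        ranking = np.argsort(y_pred)[::-1]     # documents sorted by predicted score
        if cutoff is not None:
            ranking = ranking[:cutoff]         # keep only the top-k retrieved documents
        relevant = y[ranking] >= threshold     # labels at or above the threshold are relevant
        return relevant.sum() / float(len(ranking))

    y = np.array([2, 0, 1, 0, 3])
    y_pred = np.array([0.9, 0.8, 0.7, 0.4, 0.6])
    print(precision_per_query(y, y_pred, cutoff=4))   # 3 of the top 4 are relevant -> 0.75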
class rankeval.metrics.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
R
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.

dataset : Dataset
Represents the Dataset object on which to apply Recall.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Recall score (averaged over the detailed recall scores).
detailed_scores: numpy 1d array of floats
The detailed Recall scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Recall at the per-query level (on the instances belonging to a specific query). The Recall per query is calculated as (relevant docs & retrieved docs) / relevant docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
recall: float
The Recall score per query.
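A matching NumPy sketch of per-query recall under the same assumptions (ranking by descending predicted score, relevance defined by the threshold, and no_relevant_results returned when the query has no relevant documents); again an illustration rather than the library code:

    import numpy as np

    def recall_per_query(y, y_pred, cutoff=None, threshold=1, no_relevant_results=0.0):
        # Recall for one query: |relevant & retrieved| / |relevant|.
        n_relevant = (y >= threshold).sum()
        if n_relevant == 0:
            return no_relevant_results         # value used when the query has no relevant documents
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        return (y[ranking] >= threshold).sum() / float(n_relevant)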
class rankeval.metrics.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]

Bases: rankeval.metrics.metric.Metric

This class implements NDCG with several parameters.

This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values

  • cutoff: the top k results to be considered at per query level
  • no_relevant_results: a float value indicating how to treat the
    cases where there are no relevant results
  • ties: indicates how we should consider the ties
  • implementation: indicates whether to consider the flat or the
    exponential NDCG formula
name: string
NDCG
cutoff: int
The top k results to be considered at per query level (e.g. 10)
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
implementation: string
Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).
eval(dataset, y_pred)[source]

The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply NDCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average NDCG over all NDCG scores per query.
detailed_scores: numpy array of floats
Represents the detailed NDCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

It calculates NDCG per query as dcg_score / idcg_score. If there are no relevant results, NDCG returns the value set by default or by the user when creating the metric.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
ndcg: float
Represents the NDCG score for one query.
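The sketch below shows one common way to compute the per-query score described above, using the exponential gain 2^label - 1 for implementation='exp' and the raw label for 'flat'; whether rankeval computes the ideal DCG at the same cutoff is an assumption of this illustration:

    import numpy as np

    def dcg_score(labels, implementation="exp"):
        # DCG of labels already ordered by the ranking under evaluation.
        gains = 2.0 ** labels - 1.0 if implementation == "exp" else labels.astype(float)
        discounts = np.log2(np.arange(2, len(labels) + 2))   # log2(rank + 1)
        return float(np.sum(gains / discounts))

    def ndcg_per_query(y, y_pred, cutoff=None, no_relevant_results=1.0, implementation="exp"):
        if y.max() <= 0:
            return no_relevant_results                       # no relevant documents for this query
        k = cutoff if cutoff is not None else len(y)
        predicted = y[np.argsort(y_pred)[::-1][:k]]          # labels in predicted order, top-k
        ideal = np.sort(y)[::-1][:k]                         # labels in ideal order, top-k
        return dcg_score(predicted, implementation) / dcg_score(ideal, implementation)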
class rankeval.metrics.DCG(name='DCG', cutoff=None, implementation='flat')[source]

Bases: rankeval.metrics.metric.Metric

This class implements DCG with several parameters.

This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.

name: string
DCG
cutoff: int
The top k results to be considered at per query level (e.g. 10).
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
implementation: string
Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).
eval(dataset, y_pred)[source]

The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply DCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average DCG over all DCG scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed DCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
dcg: float
Represents the DCG score for one query.
class rankeval.metrics.ERR(name='ERR', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf

This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.

name: string
ERR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
eval(dataset, y_pred)[source]

The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.

dataset : Dataset
Represents the Dataset object on which to apply ERR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average ERR over all ERR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed ERR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
err: float
Represents the ERR score for one query.
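The per-query computation follows the cascade model of the paper linked above; the sketch below illustrates it, and the normalisation by the query's maximum label is an assumption of this example (the paper normalises by the maximum grade of the judging scale):

    import numpy as np

    def err_per_query(y, y_pred, cutoff=None, max_label=None):
        # ERR cascade: the user stops at rank i with probability R_i = (2^label - 1) / 2^max_label.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        if max_label is None:
            max_label = y.max()                      # assumption: normalise by the query's highest label
        satisfaction = (2.0 ** y[ranking] - 1.0) / (2.0 ** max_label)
        err, p_not_stopped = 0.0, 1.0
        for rank, r in enumerate(satisfaction, start=1):
            err += p_not_stopped * r / rank          # contribution of stopping exactly at this rank
            p_not_stopped *= 1.0 - r                 # user continues only if unsatisfied so far
        return err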
class rankeval.metrics.Kendalltau(name='K')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.

This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.

name: string
K
eval(dataset, y_pred)[source]

This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.

dataset : Dataset
Represents the Dataset object on which to apply Kendall Tau.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Kendall tau score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Kendall tau scores for each query, an array with length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Kendall tau at the per-query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:

tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))

where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U. The underlying scipy implementation sorts the inputs with lexsort by default, for which kendalltau has complexity O(n log(n)); with quicksort the complexity is O(n^2), but with a smaller pre-factor, so it may be faster for small arrays.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
kendalltau: float
The Kendall tau per query.
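Since the class delegates to scipy, the per-query value can be reproduced directly with scipy.stats.kendalltau on the labels and the predicted scores of one query; the arrays below are made-up example data:

    import numpy as np
    from scipy.stats import kendalltau

    y = np.array([3, 2, 0, 1, 0])                 # ground-truth labels for one query
    y_pred = np.array([0.7, 0.9, 0.1, 0.3, 0.2])  # predicted scores for the same documents

    tau, p_value = kendalltau(y, y_pred)          # correlation between label order and score order
    print(tau)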
class rankeval.metrics.MAP(name='MAP', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf

This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.

name: string
MAP
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the AP@k for each query and calculates the average, thus MAP@k.

dataset : Dataset
Represents the Dataset object on which to apply MAP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MAP@k score (averages over the detailed MAP scores).
detailed_scores: numpy 1d array of floats
The detailed AP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes AP@k at the per-query level (on the instances belonging to a specific query). The AP@k per query is calculated as

ap@k = sum( P(k) / min(m, n) ), for k = 1, ..., n

where:
  • P(k) is the precision at cut-off k in the item list; P(k) equals 0 when the k-th item is not relevant (not followed upon recommendation).
  • m is the number of relevant documents.
  • n is the number of predicted documents.

If the denominator is zero, P(k)/min(m, n) is set to zero.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
map : float
The AP@k score for the query.
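A standalone sketch of the AP@k computation described above; treating any non-zero label as relevant is an assumption of this illustration, since the MAP constructor does not expose a threshold parameter:

    import numpy as np

    def average_precision_per_query(y, y_pred, cutoff=None):
        # AP@k: sum of P(i) at the relevant positions i, divided by min(m, n).
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        relevant = (y[ranking] > 0).astype(float)        # assumption: non-zero label = relevant
        m = int((y > 0).sum())                           # relevant documents for the query
        n = len(ranking)                                 # retrieved (predicted) documents
        if min(m, n) == 0:
            return 0.0                                   # denominator is zero -> AP set to zero
        precision_at_i = np.cumsum(relevant) / np.arange(1, n + 1)
        return float(np.sum(precision_at_i * relevant) / min(m, n))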
class rankeval.metrics.MRR(name='MRR', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean Reciprocal Rank.

This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.

name: string
MRR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.

dataset : Dataset
Represents the Dataset object on which to apply MRR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average MRR over all MRR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed MRR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.

We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mrr: float
Represents the MRR score for one query.
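A sketch of the per-query reciprocal rank under the documented relevance threshold; returning 0 when no relevant document appears in the (possibly cut-off) ranking is an assumption of this illustration:

    import numpy as np

    def reciprocal_rank_per_query(y, y_pred, cutoff=None, threshold=1):
        # 1 / rank of the first relevant document in the predicted ranking.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        hits = np.where(y[ranking] >= threshold)[0]
        return 1.0 / (hits[0] + 1) if hits.size else 0.0   # assumption: 0.0 when nothing relevant is retrieved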
class rankeval.metrics.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Pfound with several parameters.

The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].

In fact, pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound following that description (Section 7.2 of http://olivier.chapelle.cc/pub/err.pdf).

This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.

name: string
Pf
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
p_abandonment: float
This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1-y of abandoning.
eval(dataset, y_pred)[source]

The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.

dataset : Dataset
Represents the Dataset object on which to apply Pfound.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average Pfound over all Pfound scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed Pfound scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array
Represents the predicted document scores obtained during the model scoring phase for that query.
pfound: float
Represents the Pfound score for one query.
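A sketch of the extended cascade described above, where at every position the user continues only if unsatisfied and not abandoning; the label-to-satisfaction mapping (the same exponential mapping as ERR) and the per-query normalisation are assumptions of this example:

    import numpy as np

    def pfound_per_query(y, y_pred, cutoff=None, p_abandonment=0.15, max_label=None):
        # pFound: ERR-style cascade extended with an abandonment probability.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        if max_label is None:
            max_label = y.max()                              # assumption: per-query normalisation
        p_rel = (2.0 ** y[ranking] - 1.0) / (2.0 ** max_label)
        pfound, p_looking = 0.0, 1.0
        for r in p_rel:
            pfound += p_looking * r                          # probability the user is satisfied here
            p_looking *= (1.0 - r) * (1.0 - p_abandonment)   # continue only if unsatisfied and not abandoning
        return pfound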
class rankeval.metrics.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as described in: Alistair Moffat and Justin Zobel. 2008.

Rank-biased precision for measurement of retrieval effectiveness.

ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952

RBP is an extension of P@k. The user has a certain chance of viewing each result.

RBP = E(# viewed relevant results) / E(# viewed results)

p is based on the user model perspective and allows simulating different types of users, e.g.:

  • p = 0.95 for persistent users
  • p = 0.8 for patient users
  • p = 0.5 for impatient users
  • p = 0 for “I’m feeling lucky” users (equivalent to P@1)

The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.

This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.

name: string
RBP
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
p: float
This parameter simulates the user type and, consequently, the probability that a user actually inspects the document at rank k.
eval(dataset, y_pred)[source]

This method takes the RBP for each query and calculates the average RBP.

dataset : Dataset
Represents the Dataset object on which to apply RBP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RBP score (averaged over the detailed RBP scores).
detailed_scores: numpy 1d array of floats
The detailed RBP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rbp: float
Represents the RBP score for one query.
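The closed form RBP = (1 - p) * sum_k r_k * p^(k-1), with binary relevance r_k derived from the threshold, can be sketched as follows; this is an illustration of the formula rather than the library's code:

    import numpy as np

    def rbp_per_query(y, y_pred, cutoff=None, threshold=1, p=0.5):
        # RBP: (1 - p) * sum over ranks k of r_k * p**(k - 1).
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        relevant = (y[ranking] >= threshold).astype(float)
        weights = (1.0 - p) * p ** np.arange(len(ranking))   # geometric viewing probabilities
        return float(np.sum(relevant * weights))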
class rankeval.metrics.MSE(name='MSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean squared error (MSE) with several parameters.

This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.

name: string
MSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the MSE for each query and calculates the average MSE.

dataset : Dataset
Represents the Dataset object on which to apply MSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MSE score (averaged over the detailed MSE scores).
detailed_scores: numpy 1d array of floats
The detailed MSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mse: float
Represents the MSE score for one query.
class rankeval.metrics.RMSE(name='RMSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Root mean squared error (RMSE) with several parameters.

This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.

name: string
RMSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the RMSE for each query and calculates the average RMSE.

dataset : Dataset
Represents the Dataset object on which to apply RMSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RMSE score (averages over the detailed RMSE scores).
detailed_scores: numpy 1d array of floats
The detailed RMSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rmse: float
Represents the RMSE score for one query.
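For completeness, a sketch of the per-query MSE and RMSE on labels versus predicted scores; restricting to the top-k documents of the predicted ranking when a cutoff is given is an assumption of this illustration:

    import numpy as np

    def mse_per_query(y, y_pred, cutoff=None):
        # Mean squared error between labels and predicted scores for one query.
        if cutoff is not None:
            top = np.argsort(y_pred)[::-1][:cutoff]   # assumption: keep only the top-k predicted documents
            y, y_pred = y[top], y_pred[top]
        return float(np.mean((y - y_pred) ** 2))

    def rmse_per_query(y, y_pred, cutoff=None):
        # RMSE is the square root of the per-query MSE.
        return float(np.sqrt(mse_per_query(y, y_pred, cutoff)))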
class rankeval.metrics.SpearmanRho(name='Rho')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.

This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.

name: string
Rho
eval(dataset, y_pred)[source]

This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.

dataset : Dataset
Represents the Dataset object on which to apply Spearman Rho.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Spearman Rho score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Spearman Rho scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Spearman Rho at the per-query level (on the instances belonging to a specific query).

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rho: float
The Spearman Rho per query.
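As with Kendall's Tau, the per-query value comes from scipy; the snippet below reproduces it with scipy.stats.spearmanr on made-up example data:

    import numpy as np
    from scipy.stats import spearmanr

    y = np.array([3, 2, 0, 1, 0])                 # ground-truth labels for one query
    y_pred = np.array([0.7, 0.9, 0.1, 0.3, 0.2])  # predicted scores for the same documents

    rho, p_value = spearmanr(y, y_pred)           # rank correlation between labels and scores
    print(rho)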

Submodules

rankeval.metrics.dcg module

class rankeval.metrics.dcg.DCG(name='DCG', cutoff=None, implementation='flat')[source]

Bases: rankeval.metrics.metric.Metric

This class implements DCG with several parameters.

This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.

name: string
DCG
cutoff: int
The top k results to be considered at per query level (e.g. 10).
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
implementation: string
Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).
eval(dataset, y_pred)[source]

The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply DCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average DCG over all DCG scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed DCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
dcg: float
Represents the DCG score for one query.

rankeval.metrics.err module

class rankeval.metrics.err.ERR(name='ERR', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf

This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.

name: string
ERR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
eval(dataset, y_pred)[source]

The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.

dataset : Dataset
Represents the Dataset object on which to apply ERR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average ERR over all ERR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed ERR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
err: float
Represents the ERR score for one query.

rankeval.metrics.kendall_tau module

class rankeval.metrics.kendall_tau.Kendalltau(name='K')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.

This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.

name: string
K
eval(dataset, y_pred)[source]

This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.

dataset : Dataset
Represents the Dataset object on which to apply Kendall Tau.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Kendall tau score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Kendall tau scores for each query, an array with length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Kendall tau at the per-query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:

tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))

where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U. The underlying scipy implementation sorts the inputs with lexsort by default, for which kendalltau has complexity O(n log(n)); with quicksort the complexity is O(n^2), but with a smaller pre-factor, so it may be faster for small arrays.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
kendalltau: float
The Kendall tau per query.

rankeval.metrics.map module

class rankeval.metrics.map.MAP(name='MAP', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf

This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.

name: string
MAP
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the AP@k for each query and calculates the average, thus MAP@k.

dataset : Dataset
Represents the Dataset object on which to apply MAP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MAP@k score (averages over the detailed MAP scores).
detailed_scores: numpy 1d array of floats
The detailed AP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes AP@k at the per-query level (on the instances belonging to a specific query). The AP@k per query is calculated as

ap@k = sum( P(k) / min(m, n) ), for k = 1, ..., n

where:
  • P(k) is the precision at cut-off k in the item list; P(k) equals 0 when the k-th item is not relevant (not followed upon recommendation).
  • m is the number of relevant documents.
  • n is the number of predicted documents.

If the denominator is zero, P(k)/min(m, n) is set to zero.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
map : float
The AP@k score for the query.

rankeval.metrics.metric module

class rankeval.metrics.metric.Metric(name)[source]

Bases: object

Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices for a certain query and another for iterating over the entire dataset based on those indices.

Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms

The constructor for any metric; it initializes that metric with the proper name.

name : string
Represents the name of that metric instance.
eval(dataset, y_pred)[source]

This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.

dataset : Dataset
Represents the Dataset object on which we want to apply the metric.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average value of the metric over all per-query metric scores.
detailed_scores: numpy 1d array of floats
Represents the detailed metric scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method evaluates the predicted scores for a specific query within the dataset.

y: numpy array
Represents the instance labels corresponding to the queries in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
score: float
Represents the metric score for one query.
query_iterator(dataset, y_pred)[source]

This method iterates over the dataset document scores (instance labels) and the predicted scores in blocks of instances which belong to the same query.

dataset : Dataset
Represents the Dataset object on which to iterate.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.

For each query, the iterator yields:

: int
The query id.
: numpy.array
The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
: numpy.array
The predicted scores for the instances in the dataset belonging to the same query id.

rankeval.metrics.mrr module

class rankeval.metrics.mrr.MRR(name='MRR', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean Reciprocal Rank.

This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.

name: string
MRR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.

dataset : Dataset
Represents the Dataset object on which to apply MRR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average MRR over all MRR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed MRR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.

We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mrr: float
Represents the MRR score for one query.

rankeval.metrics.mse module

class rankeval.metrics.mse.MSE(name='MSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean squared error (MSE) with several parameters.

This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.

name: string
MSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the MSE for each query and calculates the average MSE.

dataset : Dataset
Represents the Dataset object on which to apply MSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MSE score (averaged over the detailed MSE scores).
detailed_scores: numpy 1d array of floats
The detailed MSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mse: float
Represents the MSE score for one query.

rankeval.metrics.ndcg module

class rankeval.metrics.ndcg.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]

Bases: rankeval.metrics.metric.Metric

This class implements NDCG with several parameters.

This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values

  • cutoff: the top k results to be considered at per query level
  • no_relevant_results: a float value indicating how to treat the
    cases where there are no relevant results
  • ties: indicates how we should consider the ties
  • implementation: indicates whether to consider the flat or the
    exponential NDCG formula
name: string
NDCG
cutoff: int
The top k results to be considered at per query level (e.g. 10)
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
implementation: string
Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).
eval(dataset, y_pred)[source]

The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply NDCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average NDCG over all NDCG scores per query.
detailed_scores: numpy array of floats
Represents the detailed NDCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

It calculates NDCG per query as dcg_score / idcg_score. If there are no relevant results, NDCG returns the value set by default or by the user when creating the metric.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
ndcg: float
Represents the NDCG score for one query.

rankeval.metrics.pfound module

class rankeval.metrics.pfound.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Pfound with several parameters.

The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].

In fact, pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound following that description (Section 7.2 of http://olivier.chapelle.cc/pub/err.pdf).

This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.

name: string
Pf
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
p_abandonment: float
This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1-y of abandoning.
eval(dataset, y_pred)[source]

The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.

dataset : Dataset
Represents the Dataset object on which to apply Pfound.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average Pfound over all Pfound scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed Pfound scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array
Represents the predicted document scores obtained during the model scoring phase for that query.
pfound: float
Represents the Pfound score for one query.

rankeval.metrics.precision module

class rankeval.metrics.precision.Precision(name='P', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
P
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.

dataset : Dataset
Represents the Dataset object on which to apply Precision.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Precision score (averages over the detailed precision scores).
detailed_scores: numpy 1d array of floats
The detailed Precision scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Precision at the per-query level (on the instances belonging to a specific query). The Precision per query is calculated as (relevant docs & retrieved docs) / retrieved docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
precision: float
The precision per query.

rankeval.metrics.rbp module

class rankeval.metrics.rbp.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as described in: Alistair Moffat and Justin Zobel. 2008.

Rank-biased precision for measurement of retrieval effectiveness.

ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952

RBP is an extension of P@k. The user has a certain chance of viewing each result.

RBP = E(# viewed relevant results) / E(# viewed results)

p is based on the user model perspective and allows simulating different types of users, e.g.:

  • p = 0.95 for persistent users
  • p = 0.8 for patient users
  • p = 0.5 for impatient users
  • p = 0 for “I’m feeling lucky” users (equivalent to P@1)

The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.

This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.

name: string
RBP
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
p: float
This parameter simulates the user type and, consequently, the probability that a user actually inspects the document at rank k.
eval(dataset, y_pred)[source]

This method takes the RBP for each query and calculates the average RBP.

dataset : Dataset
Represents the Dataset object on which to apply RBP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RBP score (averaged over the detailed RBP scores).
detailed_scores: numpy 1d array of floats
The detailed RBP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rbp: float
Represents the RBP score for one query.

rankeval.metrics.recall module

class rankeval.metrics.recall.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
R
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.

dataset : Dataset
Represents the Dataset object on which to apply Recall.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Recall score (averaged over the detailed recall scores).
detailed_scores: numpy 1d array of floats
The detailed Recall scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Recall at the per-query level (on the instances belonging to a specific query). The Recall per query is calculated as (relevant docs & retrieved docs) / relevant docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
recall: float
The Recall score per query.

rankeval.metrics.rmse module

class rankeval.metrics.rmse.RMSE(name='RMSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Root mean squared error (RMSE) with several parameters.

This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.

name: string
RMSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the RMSE for each query and calculates the average RMSE.

dataset : Dataset
Represents the Dataset object on which to apply RMSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RMSE score (averages over the detailed RMSE scores).
detailed_scores: numpy 1d array of floats
The detailed RMSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rmse: float
Represents the RMSE score for one query.

rankeval.metrics.spearman_rho module

class rankeval.metrics.spearman_rho.SpearmanRho(name='Rho')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.

This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.

name: string
Rho
eval(dataset, y_pred)[source]

This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.

dataset : Dataset
Represents the Dataset object on which to apply Spearman Rho.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Spearman Rho score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Spearman Rho scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Spearman Rho at the per-query level (on the instances belonging to a specific query).

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rho: float
The Spearman Rho per query.