rankeval.metrics package¶
The rankeval.metrics
module includes the definition and implementation of
the most common metrics adopted in the Learning to Rank community.

class rankeval.metrics.Metric(name)[source]¶
Bases: object
Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices of a certain query and another for iterating over the entire dataset based on those indices.
Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms
The constructor for any metric; it initializes that metric with the proper name.
 name : string
 Represents the name of that metric instance.

eval
(dataset, y_pred)[source]¶ This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.
 dataset : Dataset
 Represents the Dataset object on which we want to apply the metric.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average values of a metric over all metric scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed metric scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps to evaluate the predicted scores for a specific query within the dataset.
 y: numpy array
 Represents the instance labels corresponding to the queries in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 score: float
 Represents the metric score for one query.

query_iterator
(dataset, y_pred)[source]¶ This method iterates over the dataset document scores and the predicted scores in blocks of instances which belong to the same query.
 dataset : Dataset
 y_pred : numpy array
For each query it yields:
 : int
 The query id.
 : numpy.array
 The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
 : numpy.array
 The predicted scores for the instances in the dataset belonging to the same query id.
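For illustration, the sketch below shows how a custom metric could be built on top of this interface. It assumes only what is documented above (the Metric(name) constructor and a query_iterator yielding the query id, the per-query labels and the per-query predicted scores); the metric itself is hypothetical and not part of rankeval.
import numpy as np
from rankeval.metrics import Metric

class LabelOfTopResult(Metric):
    # Toy metric (hypothetical): the ground-truth label of the highest-scored
    # document of each query, averaged over queries by eval().
    def __init__(self, name='TOP1'):
        super(LabelOfTopResult, self).__init__(name)

    def eval(self, dataset, y_pred):
        per_query = np.array([self.eval_per_query(y, yp)
                              for qid, y, yp in self.query_iterator(dataset, y_pred)])
        return per_query.mean(), per_query

    def eval_per_query(self, y, y_pred):
        return float(y[np.argmax(y_pred)])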

class rankeval.metrics.Precision(name='P', cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.
It allows setting custom values for cutoff and threshold, otherwise it uses the default values.
This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.
 name: string
 P
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.
 dataset : Dataset
 Represents the Dataset object on which to apply Precision.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Precision score (averages over the detailed precision scores).
 detailed_scores: numpy 1d array of floats
 The detailed Precision scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Precision at per query level (on the instances belonging to a specific query). The Precision per query is calculated as <(relevant docs & retrieved docs) / retrieved docs>.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 precision: float
 The precision per query.
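A short usage sketch based on the signatures documented above; msn_test and scores are placeholder names for an already loaded rankeval Dataset and the predicted scores of a model on it.
from rankeval.metrics import Precision

# Precision@10, counting any non-zero label as relevant (placeholder inputs).
p10 = Precision(cutoff=10, threshold=1)
avg_p10, per_query_p10 = p10.eval(msn_test, scores)
print(avg_p10)              # average Precision@10 over all queries
print(per_query_p10.shape)  # one detailed score per query: (n_queries,)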

class rankeval.metrics.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.
It allows setting custom values for cutoff and threshold, otherwise it uses the default values.
This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.
 name: string
 R
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.
 dataset : Dataset
 Represents the Dataset object on which to apply Recall.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Recall score (averages over the detailed Recall scores).
 detailed_scores: numpy 1d array of floats
 The detailed Recall scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Recall at per query level (on the instances belonging to a specific query). The Recall per query is calculated as <(relevant docs & retrieved docs) / relevant docs>.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 recall: float
 The Recall score per query.

class rankeval.metrics.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements NDCG with several parameters.
This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values for:
 cutoff: the top k results to be considered at per query level
 no_relevant_results: a float value indicating how to treat the cases where there are no relevant results
 ties: indicates how we should consider the ties
 implementation: indicates whether to consider the flat or the exponential NDCG formula
 name: string
 NDCG
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
 implementation: string
 Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).

eval
(dataset, y_pred)[source]¶ The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply NDCG.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average NDCG over all NDCG scores per query.
 detailed_scores: numpy array of floats
 Represents the detailed NDCG scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.
It calculates NDCG per query as <dcg_score/idcg_score>. If there are no relevant results, NDCG returns the values set by default or by the user when creating the metric.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 ndcg: float
 Represents the NDCG score for one query.
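A usage sketch for NDCG following the parameters documented above; test_dataset and y_pred are placeholders for a rankeval Dataset and the model scores for it.
from rankeval.metrics import NDCG

ndcg_exp = NDCG(cutoff=10, implementation='exp')    # exponential gain (default)
ndcg_flat = NDCG(cutoff=10, implementation='flat')  # linear (flat) gain

avg_exp, detailed_exp = ndcg_exp.eval(test_dataset, y_pred)
avg_flat, detailed_flat = ndcg_flat.eval(test_dataset, y_pred)
print(avg_exp, avg_flat)  # averaged NDCG@10 under the two gain functions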

class rankeval.metrics.DCG(name='DCG', cutoff=None, implementation='flat')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements DCG with several parameters.
This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.
 name: string
 DCG
 cutoff: int
 The top k results to be considered at per query level (e.g. 10).
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
 implementation: string
 Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).

eval
(dataset, y_pred)[source]¶ The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply DCG.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average DCG over all DCG scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed DCG scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 dcg: float
 Represents the DCG score for one query.

class rankeval.metrics.ERR(name='ERR', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf
This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.
 name: string
 ERR
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)

eval
(dataset, y_pred)[source]¶ The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply ERR.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average ERR over all ERR scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed ERR scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 err: float
 Represents the ERR score for one query.
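The standalone sketch below illustrates the cascade formula from the cited paper (ERR = sum over ranks r of (1/r) * R_r * prod_{i<r}(1 - R_i), with R derived from the graded labels). It is an illustration of the definition under an assumed maximum relevance grade, not the rankeval implementation.
import numpy as np

def err_sketch(y, y_pred, cutoff=None, max_label=4):
    # Rank documents by predicted score and map graded labels to
    # satisfaction probabilities R = (2^label - 1) / 2^max_label.
    order = np.argsort(y_pred)[::-1]
    gains = np.asarray(y, dtype=float)[order]
    if cutoff is not None:
        gains = gains[:cutoff]
    sat = (2.0 ** gains - 1.0) / (2.0 ** max_label)
    err, not_satisfied_yet = 0.0, 1.0
    for rank, r in enumerate(sat, start=1):
        err += not_satisfied_yet * r / rank
        not_satisfied_yet *= (1.0 - r)
    return err

print(err_sketch(y=[3, 0, 1, 2], y_pred=[2.5, 1.2, 0.7, 0.1]))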

class rankeval.metrics.Kendalltau(name='K')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.
This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.
 name: string
 K

eval
(dataset, y_pred)[source]¶ This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.
 dataset : Dataset
 Represents the Dataset object on which to apply Kendall Tau.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Kendall tau score (averages over the detailed scores).
 detailed_scores: numpy 1d array of floats
 The detailed Kendall tau scores for each query, an array with length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Kendall tau at per query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:
tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))
where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U.
A note inherited from scipy's kendalltau documentation (its initial_lexsort option) also applies: it selects whether to use lexsort or quicksort as the sorting method for the initial sort of the inputs. The default is lexsort (True), for which kendalltau is of complexity O(n log(n)). If False, the complexity is O(n^2), but with a smaller prefactor (so quicksort may be faster for small arrays).
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 kendalltau: float
 The Kendall tau per query.
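Since the documentation above states that scipy's implementation is used, the per-query value should agree with scipy.stats.kendalltau applied to the labels and the predicted scores of a single query, e.g.:
import numpy as np
from scipy.stats import kendalltau

y = np.array([2, 0, 1, 3])               # ground-truth labels of one query
y_pred = np.array([1.9, 0.3, 0.8, 2.4])  # predicted scores for the same documents

tau, p_value = kendalltau(y, y_pred)
print(tau)  # 1.0: the predicted ordering is fully concordant with the labels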

class rankeval.metrics.MAP(name='MAP', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/200409.pdf
This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.
 name: string
 MAP
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the AP@k for each query and calculates the average, thus MAP@k.
 dataset : Dataset
 Represents the Dataset object on which to apply MAP.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall MAP@k score (averages over the detailed MAP scores).
 detailed_scores: numpy 1d array of floats
 The detailed AP@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes AP@k at per query level (on the instances belonging to a specific query). The AP@k per query is calculated as:
ap@k = sum( P(k) / min(m, n) ), for k = 1..n
where:
 P(k) means the precision at cutoff k in the item list; P(k) equals 0 when the k-th item is not relevant
 m is the number of relevant documents
 n is the number of predicted documents
If the denominator is zero, P(k)/min(m,n) is set to zero.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 map : float
 The AP@k score per query.
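The following worked example applies the AP@k definition above to a small query. It is an illustration of the formula with binary relevance (label >= 1 meaning relevant), not the rankeval code itself.
import numpy as np

def ap_at_k(y, y_pred, k=None):
    order = np.argsort(y_pred)[::-1]                # rank documents by predicted score
    rel = (np.asarray(y)[order] >= 1).astype(float)
    if k is not None:
        rel = rel[:k]
    m = int((np.asarray(y) >= 1).sum())             # number of relevant documents
    if min(m, len(rel)) == 0:
        return 0.0
    p_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1.0)
    return float(np.sum(p_at_i * rel) / min(m, len(rel)))

# P(1) = 1/1 and P(3) = 2/3 contribute; m = 2 relevant docs -> AP@5 = 0.8333...
print(ap_at_k([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1], k=5))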

class rankeval.metrics.MRR(name='MRR', cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Mean Reciprocal Rank.
This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.
 name: string
 MRR
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.
The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.
 dataset : Dataset
 Represents the Dataset object on which to apply MRR.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average MRR over all MRR scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed MRR scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.
We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 mrr: float
 Represents the MRR score for one query.
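A minimal per-query sketch of the reciprocal rank described above (illustrative only, not the rankeval code); a document is treated as relevant when its label is at least the threshold.
import numpy as np

def reciprocal_rank(y, y_pred, threshold=1):
    order = np.argsort(y_pred)[::-1]            # predicted ranking
    relevant = np.asarray(y)[order] >= threshold
    hits = np.where(relevant)[0]
    return 0.0 if hits.size == 0 else 1.0 / (hits[0] + 1)

print(reciprocal_rank([0, 0, 2, 1], [0.9, 0.5, 0.4, 0.2]))  # first relevant at rank 3 -> 1/3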

class rankeval.metrics.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Pfound with several parameters.
The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].
In fact pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound similar to ERR in section 7.2 of http://olivier.chapelle.cc/pub/err.pdf.
This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.
 name: string
 Pf
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
 p_abandonment: float
 This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1 - y of abandoning.

eval
(dataset, y_pred)[source]¶ The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply Pfound.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average Pfound over all Pfound scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed Pfound scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array
 Represents the predicted document scores obtained during the model scoring phase for that query.
 pfound: float
 Represents the Pfound score for one query.

class rankeval.metrics.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as in: Alistair Moffat and Justin Zobel. 2008.
Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952
RBP is an extension of P@k. The user has a certain chance to view each result.
RBP = E(# viewed relevant results) / E(# viewed results)
p is based on the user model perspective and allows simulating different types of users, e.g.:
 p = 0.95 for persistent users
 p = 0.8 for patient users
 p = 0.5 for impatient users
 p = 0 for "I'm feeling lucky" (equivalent to P@1)
The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.
This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.
 name: string
 RBP
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.
 p: float
 This parameter simulates the user type, and consequently the probability that a viewer actually inspects the document at rank k.

eval
(dataset, y_pred)[source]¶ This method takes the RBP for each query and calculates the average RBP.
 dataset : Dataset
 Represents the Dataset object on which to apply RBP.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall RBP score (averages over the detailed RBP scores).
 detailed_scores: numpy 1d array of floats
 The detailed RBP@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rbp: float
 Represents the RBP score for one query.
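The sketch below illustrates the geometric user model behind RBP (weight (1 - p) * p^(i-1) for the document at rank i), with relevance binarised by the threshold. It is an illustration of the Moffat & Zobel formulation, not the exact rankeval code.
import numpy as np

def rbp_sketch(y, y_pred, p=0.5, threshold=1, cutoff=None):
    order = np.argsort(y_pred)[::-1]
    rel = (np.asarray(y)[order] >= threshold).astype(float)
    if cutoff is not None:
        rel = rel[:cutoff]
    weights = (1.0 - p) * p ** np.arange(len(rel))  # chance the user inspects rank i+1
    return float(np.sum(weights * rel))

# An impatient user (p = 0.5) mostly credits the first positions:
print(rbp_sketch([1, 0, 1, 0], [0.8, 0.6, 0.4, 0.2], p=0.5))  # 0.5 + 0.125 = 0.625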

class rankeval.metrics.MSE(name='MSE', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Mean squared error (MSE) with several parameters.
This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.
 name: string
 MSE
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the MSE for each query and calculates the average MSE.
 dataset : Dataset
 Represents the Dataset object on which to apply MSE.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall MSE score (averaged over the detailed MSE scores).
 detailed_scores: numpy 1d array of floats
 The detailed MSE@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 mse: float
 Represents the MSE score for one query.

class rankeval.metrics.RMSE(name='RMSE', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Root mean squared error (RMSE) with several parameters.
This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.
 name: string
 RMSE
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the RMSE for each query and calculates the average RMSE.
 dataset : Dataset
 Represents the Dataset object on which to apply RMSE.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall RMSE score (averages over the detailed RMSE scores).
 detailed_scores: numpy 1d array of floats
 The detailed RMSE@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rmse: float
 Represents the RMSE score for one query.
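A minimal per-query sketch consistent with the description above (illustrative, not the rankeval code): the root of the mean squared difference between labels and predicted scores, optionally restricted to the top-k predictions.
import numpy as np

def rmse_per_query(y, y_pred, cutoff=None):
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if cutoff is not None:
        top = np.argsort(y_pred)[::-1][:cutoff]  # keep only the top-k predictions
        y, y_pred = y[top], y_pred[top]
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

print(rmse_per_query([2.0, 0.0, 1.0], [1.5, 0.5, 1.0]))  # sqrt((0.25 + 0.25 + 0) / 3)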

class rankeval.metrics.SpearmanRho(name='Rho')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.
This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.
 name: string
 Rho

eval
(dataset, y_pred)[source]¶ This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.
 dataset : Dataset
 Represents the Dataset object on which to apply Spearman Rho.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Spearman Rho score (averages over the detailed scores).
 detailed_scores: numpy 1d array of floats
 The detailed Spearman Rho scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Spearman Rho at per query level (on the instances belonging to a specific query).
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rho: float
 The Spearman Rho per query.
Submodules¶
rankeval.metrics.dcg module¶

class rankeval.metrics.dcg.DCG(name='DCG', cutoff=None, implementation='flat')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements DCG with several parameters.
This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.
 name: string
 DCG
 cutoff: int
 The top k results to be considered at per query level (e.g. 10).
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
 implementation: string
 Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).

eval
(dataset, y_pred)[source]¶ The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply DCG.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average DCG over all DCG scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed DCG scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 dcg: float
 Represents the DCG score for one query.
rankeval.metrics.err module¶

class rankeval.metrics.err.ERR(name='ERR', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf
This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.
 name: string
 ERR
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)

eval
(dataset, y_pred)[source]¶ The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply ERR.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average ERR over all ERR scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed ERR scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 err: float
 Represents the ERR score for one query.
rankeval.metrics.kendall_tau module¶

class rankeval.metrics.kendall_tau.Kendalltau(name='K')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.
This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.
 name: string
 K

eval
(dataset, y_pred)[source]¶ This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.
 dataset : Dataset
 Represents the Dataset object on which to apply Kendall Tau.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Kendall tau score (averages over the detailed scores).
 detailed_scores: numpy 1d array of floats
 The detailed Kendall tau scores for each query, an array with length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Kendall tau at per query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:
tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))
where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U.
A note inherited from scipy's kendalltau documentation (its initial_lexsort option) also applies: it selects whether to use lexsort or quicksort as the sorting method for the initial sort of the inputs. The default is lexsort (True), for which kendalltau is of complexity O(n log(n)). If False, the complexity is O(n^2), but with a smaller prefactor (so quicksort may be faster for small arrays).
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 kendalltau: float
 The Kendall tau per query.
rankeval.metrics.map module¶

class rankeval.metrics.map.MAP(name='MAP', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/200409.pdf
This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.
 name: string
 MAP
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the AP@k for each query and calculates the average, thus MAP@k.
 dataset : Dataset
 Represents the Dataset object on which to apply MAP.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall MAP@k score (averages over the detailed MAP scores).
 detailed_scores: numpy 1d array of floats
 The detailed AP@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes AP@k at per query level (on the instances belonging to a specific query). The AP@k per query is calculated as:
ap@k = sum( P(k) / min(m, n) ), for k = 1..n
where:
 P(k) means the precision at cutoff k in the item list; P(k) equals 0 when the k-th item is not relevant
 m is the number of relevant documents
 n is the number of predicted documents
If the denominator is zero, P(k)/min(m,n) is set to zero.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 map : float
 The AP@k score per query.
rankeval.metrics.metric module¶

class rankeval.metrics.metric.Metric(name)[source]¶
Bases: object
Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices of a certain query and another for iterating over the entire dataset based on those indices.
Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms
The constructor for any metric; it initializes that metric with the proper name.
 name : string
 Represents the name of that metric instance.

eval
(dataset, y_pred)[source]¶ This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.
 dataset : Dataset
 Represents the Dataset object on which we want to apply the metric.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average values of a metric over all metric scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed metric scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps to evaluate the predicted scores for a specific query within the dataset.
 y: numpy array
 Represents the instance labels corresponding to the queries in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 score: float
 Represents the metric score for one query.

query_iterator
(dataset, y_pred)[source]¶ This method iterates over the dataset document scores and the predicted scores in blocks of instances which belong to the same query.
 dataset : Dataset
 y_pred : numpy array
For each query it yields:
 : int
 The query id.
 : numpy.array
 The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
 : numpy.array
 The predicted scores for the instances in the dataset belonging to the same query id.
rankeval.metrics.mrr module¶

class rankeval.metrics.mrr.MRR(name='MRR', cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Mean Reciprocal Rank.
This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.
 name: string
 MRR
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.
The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.
 dataset : Dataset
 Represents the Dataset object on which to apply MRR.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average MRR over all MRR scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed MRR scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.
We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 mrr: float
 Represents the MRR score for one query.
rankeval.metrics.mse module¶

class rankeval.metrics.mse.MSE(name='MSE', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Mean squared error (MSE) with several parameters.
This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.
 name: string
 MSE
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the MSE for each query and calculates the average MSE.
 dataset : Dataset
 Represents the Dataset object on which to apply MSE.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall MSE score (averaged over the detailed MSE scores).
 detailed_scores: numpy 1d array of floats
 The detailed MSE@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 mse: float
 Represents the MSE score for one query.
rankeval.metrics.ndcg module¶

class rankeval.metrics.ndcg.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements NDCG with several parameters.
This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values for:
 cutoff: the top k results to be considered at per query level
 no_relevant_results: a float value indicating how to treat the cases where there are no relevant results
 ties: indicates how we should consider the ties
 implementation: indicates whether to consider the flat or the exponential NDCG formula
 name: string
 NDCG
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
 implementation: string
 Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).

eval
(dataset, y_pred)[source]¶ The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply NDCG.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average NDCG over all NDCG scores per query.
 detailed_scores: numpy array of floats
 Represents the detailed NDCG scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.
It calculates NDCG per query as <dcg_score/idcg_score>. If there are no relevant results, NDCG returns the values set by default or by the user when creating the metric.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 ndcg: float
 Represents the NDCG score for one query.
rankeval.metrics.pfound module¶

class rankeval.metrics.pfound.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Pfound with several parameters.
The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].
In fact pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound similar to ERR in section 7.2 of http://olivier.chapelle.cc/pub/err.pdf.
This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.
 name: string
 Pf
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
 p_abandonment: float
 This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1 - y of abandoning.

eval
(dataset, y_pred)[source]¶ The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.
 dataset : Dataset
 Represents the Dataset object on which to apply Pfound.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 Represents the average Pfound over all Pfound scores per query.
 detailed_scores: numpy 1d array of floats
 Represents the detailed Pfound scores for each query. It has the length of n_queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array
 Represents the predicted document scores obtained during the model scoring phase for that query.
 pfound: float
 Represents the Pfound score for one query.
rankeval.metrics.precision module¶

class rankeval.metrics.precision.Precision(name='P', cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.
It allows setting custom values for cutoff and threshold, otherwise it uses the default values.
This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.
 name: string
 P
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.
 dataset : Dataset
 Represents the Dataset object on which to apply Precision.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Precision score (averages over the detailed precision scores).
 detailed_scores: numpy 1d array of floats
 The detailed Precision scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Precision at per query level (on the instances belonging to a specific query). The Precision per query is calculated as <(relevant docs & retrieved docs) / retrieved docs>.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 precision: float
 The precision per query.
rankeval.metrics.rbp module¶

class rankeval.metrics.rbp.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as in: Alistair Moffat and Justin Zobel. 2008.
Rank-biased precision for measurement of retrieval effectiveness. ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952
RBP is an extension of P@k. The user has a certain chance to view each result.
RBP = E(# viewed relevant results) / E(# viewed results)
p is based on the user model perspective and allows simulating different types of users, e.g.:
 p = 0.95 for persistent users
 p = 0.8 for patient users
 p = 0.5 for impatient users
 p = 0 for "I'm feeling lucky" (equivalent to P@1)
The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.
This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.
 name: string
 RBP
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.
 p: float
 This parameter simulates the user type, and consequently the probability that a viewer actually inspects the document at rank k.

eval
(dataset, y_pred)[source]¶ This method takes the RBP for each query and calculates the average RBP.
 dataset : Dataset
 Represents the Dataset object on which to apply RBP.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall RBP score (averages over the detailed RBP scores).
 detailed_scores: numpy 1d array of floats
 The detailed RBP@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rbp: float
 Represents the RBP score for one query.
rankeval.metrics.recall module¶

class rankeval.metrics.recall.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.
It allows setting custom values for cutoff and threshold, otherwise it uses the default values.
This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.
 name: string
 R
 no_relevant_results: float
 Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
 cutoff: int
 The top k results to be considered at per query level (e.g. 10)
 threshold: float
 This parameter sets the minimum label value for an instance to be considered relevant: with the default of 1, all instances with labels different from 0 are relevant. It can be set to other values as well (e.g. 3), within the range of possible labels.

eval
(dataset, y_pred)[source]¶ This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.
 dataset : Dataset
 Represents the Dataset object on which to apply Recall.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Recall score (averages over the detailed Recall scores).
 detailed_scores: numpy 1d array of floats
 The detailed Recall scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Recall at per query level (on the instances belonging to a specific query). The Recall per query is calculated as <(relevant docs & retrieved docs) / relevant docs>.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 recall: float
 The Recall score per query.
rankeval.metrics.rmse module¶

class rankeval.metrics.rmse.RMSE(name='RMSE', cutoff=None)[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Root mean squared error (RMSE) with several parameters.
This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.
 name: string
 RMSE
 cutoff: int
 The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.

eval
(dataset, y_pred)[source]¶ This method takes the RMSE for each query and calculates the average RMSE.
 dataset : Dataset
 Represents the Dataset object on which to apply RMSE.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall RMSE score (averages over the detailed RMSE scores).
 detailed_scores: numpy 1d array of floats
 The detailed RMSE@k scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rmse: float
 Represents the RMSE score for one query.
rankeval.metrics.spearman_rho module¶

class rankeval.metrics.spearman_rho.SpearmanRho(name='Rho')[source]¶
Bases: rankeval.metrics.metric.Metric
This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.
This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.
 name: string
 Rho

eval
(dataset, y_pred)[source]¶ This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.
 dataset : Dataset
 Represents the Dataset object on which to apply Spearman Rho.
 y_pred : numpy 1d array of float
 Represents the predicted document scores for each instance in the dataset.
 avg_score: float
 The overall Spearman Rho score (averages over the detailed scores).
 detailed_scores: numpy 1d array of floats
 The detailed Spearman Rho scores for each query, an array of length of the number of queries.

eval_per_query
(y, y_pred)[source]¶ This method computes Spearman Rho at per query level (on the instances belonging to a specific query).
 y: numpy array
 Represents the labels of instances corresponding to one query in the dataset (ground truth).
 y_pred: numpy array.
 Represents the predicted document scores obtained during the model scoring phase for that query.
 rho: float
 The Spearman Rho per query.