rankeval.metrics package

The rankeval.metrics module includes the definition and implementation of the most common metrics adopted in the Learning to Rank community.

class rankeval.metrics.Metric(name)[source]

Bases: object

Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices for a certain query and another for iterating over the entire dataset based on those indices.

Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms

The constructor for any metric; it initializes that metric with the proper name.

name : string
Represents the name of that metric instance.
eval(dataset, y_pred)[source]

This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.

dataset : Dataset
Represents the Dataset object on which we want to apply the metric.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average value of the metric over all per-query metric scores.
detailed_scores: numpy 1d array of floats
Represents the detailed metric scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method evaluates the predicted scores for a specific query within the dataset.

y: numpy array
Represents the instance labels corresponding to the queries in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
score: float
Represents the metric score for one query.
query_iterator(dataset, y_pred)[source]

This method iterates over the dataset document scores (instance labels) and the predicted scores in blocks of instances which belong to the same query.

dataset : Dataset
Represents the Dataset object on which to iterate.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.

For each query, the iterator yields:

: int
The query id.
: numpy.array
The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
: numpy.array
The predicted scores for the instances in the dataset belonging to the same query id.
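A minimal usage sketch of this interface is given below. It assumes that query_iterator yields (query id, labels, predicted scores) tuples as documented above, and that eval returns the pair (avg_score, detailed_scores); the FirstRelevant subclass is purely hypothetical and not part of the library:

    import numpy as np

    from rankeval.metrics import Metric


    class FirstRelevant(Metric):
        # Hypothetical metric: rank position of the first relevant document.

        def __init__(self, name="FirstRel"):
            super(FirstRelevant, self).__init__(name)

        def eval(self, dataset, y_pred):
            # Reuse the per-query iterator provided by the base class.
            scores = np.array([self.eval_per_query(y, yp)
                               for qid, y, yp in self.query_iterator(dataset, y_pred)])
            return scores.mean(), scores

        def eval_per_query(self, y, y_pred):
            ranking = np.argsort(y_pred)[::-1]   # documents sorted by predicted score
            hits = np.where(y[ranking] > 0)[0]   # positions of relevant documents
            return float(hits[0] + 1) if hits.size else 0.0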
class rankeval.metrics.Precision(name='P', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
P
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.

dataset : Dataset
Represents the Dataset object on which to apply Precision.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Precision score (averages over the detailed precision scores).
detailed_scores: numpy 1d array of floats
The detailed Precision scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Precision at the per-query level (on the instances belonging to a specific query). The Precision per query is calculated as (relevant docs & retrieved docs) / retrieved docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
precision: float
The precision per query.
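As a worked illustration of the formula above, the following standalone NumPy sketch computes precision for a single query; it assumes documents are ranked by descending predicted score and that labels at or above threshold count as relevant, and it is not the library's internal implementation:

    import numpy as np

    def precision_per_query(y, y_pred, cutoff=None, threshold=1):
        # P@k for one query: |relevant & retrieved| / |retrieved|.
        ranking = np.argsort(y_pred)[::-1]     # documents sorted by predicted score
        if cutoff is not None:
            ranking = ranking[:cutoff]         # keep only the top-k retrieved documents
        relevant = y[ranking] >= threshold     # labels at or above the threshold are relevant
        return relevant.sum() / float(len(ranking))

    y = np.array([2, 0, 1, 0, 3])
    y_pred = np.array([0.9, 0.8, 0.7, 0.4, 0.6])
    print(precision_per_query(y, y_pred, cutoff=4))   # 3 of the top 4 are relevant -> 0.75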
class rankeval.metrics.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
R
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.

dataset : Dataset
Represents the Dataset object on which to apply Recall.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Recall score (averaged over the detailed recall scores).
detailed_scores: numpy 1d array of floats
The detailed Recall scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Recall at the per-query level (on the instances belonging to a specific query). The Recall per query is calculated as (relevant docs & retrieved docs) / relevant docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
recall: float
The Recall score per query.
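A matching NumPy sketch of per-query recall under the same assumptions (ranking by descending predicted score, relevance defined by the threshold, and no_relevant_results returned when the query has no relevant documents); again an illustration rather than the library code:

    import numpy as np

    def recall_per_query(y, y_pred, cutoff=None, threshold=1, no_relevant_results=0.0):
        # Recall for one query: |relevant & retrieved| / |relevant|.
        n_relevant = (y >= threshold).sum()
        if n_relevant == 0:
            return no_relevant_results         # value used when the query has no relevant documents
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        return (y[ranking] >= threshold).sum() / float(n_relevant)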
class rankeval.metrics.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]

Bases: rankeval.metrics.metric.Metric

This class implements NDCG with several parameters.

This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values

  • cutoff: the top k results to be considered at per query level
  • no_relevant_results: a float value indicating how to treat the
    cases where there are no relevant results
  • ties: indicates how we should consider the ties
  • implementation: indicates whether to consider the flat or the
    exponential NDCG formula
name: string
NDCG
cutoff: int
The top k results to be considered at per query level (e.g. 10)
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
implementation: string
Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).
eval(dataset, y_pred)[source]

The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply NDCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average NDCG over all NDCG scores per query.
detailed_scores: numpy array of floats
Represents the detailed NDCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

It calculates NDCG per query as dcg_score / idcg_score. If there are no relevant results, NDCG returns the value set by default or by the user when creating the metric.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
ndcg: float
Represents the NDCG score for one query.
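The sketch below shows one common way to compute the per-query score described above, using the exponential gain 2^label - 1 for implementation='exp' and the raw label for 'flat'; whether rankeval computes the ideal DCG at the same cutoff is an assumption of this illustration:

    import numpy as np

    def dcg_score(labels, implementation="exp"):
        # DCG of labels already ordered by the ranking under evaluation.
        gains = 2.0 ** labels - 1.0 if implementation == "exp" else labels.astype(float)
        discounts = np.log2(np.arange(2, len(labels) + 2))   # log2(rank + 1)
        return float(np.sum(gains / discounts))

    def ndcg_per_query(y, y_pred, cutoff=None, no_relevant_results=1.0, implementation="exp"):
        if y.max() <= 0:
            return no_relevant_results                       # no relevant documents for this query
        k = cutoff if cutoff is not None else len(y)
        predicted = y[np.argsort(y_pred)[::-1][:k]]          # labels in predicted order, top-k
        ideal = np.sort(y)[::-1][:k]                         # labels in ideal order, top-k
        return dcg_score(predicted, implementation) / dcg_score(ideal, implementation)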
class rankeval.metrics.DCG(name='DCG', cutoff=None, implementation='flat')[source]

Bases: rankeval.metrics.metric.Metric

This class implements DCG with several parameters.

This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.

name: string
DCG
cutoff: int
The top k results to be considered at per query level (e.g. 10).
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
implementation: string
Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).
eval(dataset, y_pred)[source]

The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply DCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average DCG over all DCG scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed DCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
dcg: float
Represents the DCG score for one query.
class rankeval.metrics.ERR(name='ERR', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf

This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.

name: string
ERR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
eval(dataset, y_pred)[source]

The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.

dataset : Dataset
Represents the Dataset object on which to apply ERR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average ERR over all ERR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed ERR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
err: float
Represents the ERR score for one query.
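The per-query computation follows the cascade model of the paper linked above; the sketch below illustrates it, and the normalisation by the query's maximum label is an assumption of this example (the paper normalises by the maximum grade of the judging scale):

    import numpy as np

    def err_per_query(y, y_pred, cutoff=None, max_label=None):
        # ERR cascade: the user stops at rank i with probability R_i = (2^label - 1) / 2^max_label.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        if max_label is None:
            max_label = y.max()                      # assumption: normalise by the query's highest label
        satisfaction = (2.0 ** y[ranking] - 1.0) / (2.0 ** max_label)
        err, p_not_stopped = 0.0, 1.0
        for rank, r in enumerate(satisfaction, start=1):
            err += p_not_stopped * r / rank          # contribution of stopping exactly at this rank
            p_not_stopped *= 1.0 - r                 # user continues only if unsatisfied so far
        return err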
class rankeval.metrics.Kendalltau(name='K')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.

This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.

name: string
K
eval(dataset, y_pred)[source]

This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.

dataset : Dataset
Represents the Dataset object on which to apply Kendall Tau.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Kendall tau score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Kendall tau scores for each query, an array with length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Kendall tau at the per-query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:

tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))

where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U. The underlying scipy implementation sorts the inputs with lexsort by default, for which kendalltau has complexity O(n log(n)); with quicksort the complexity is O(n^2), but with a smaller pre-factor, so it may be faster for small arrays.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
kendalltau: float
The Kendall tau per query.
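Since the class delegates to scipy, the per-query value can be reproduced directly with scipy.stats.kendalltau on the labels and the predicted scores of one query; the arrays below are made-up example data:

    import numpy as np
    from scipy.stats import kendalltau

    y = np.array([3, 2, 0, 1, 0])                 # ground-truth labels for one query
    y_pred = np.array([0.7, 0.9, 0.1, 0.3, 0.2])  # predicted scores for the same documents

    tau, p_value = kendalltau(y, y_pred)          # correlation between label order and score order
    print(tau)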
class rankeval.metrics.MAP(name='MAP', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf

This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.

name: string
MAP
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the AP@k for each query and calculates the average, thus MAP@k.

dataset : Dataset
Represents the Dataset object on which to apply MAP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MAP@k score (averages over the detailed MAP scores).
detailed_scores: numpy 1d array of floats
The detailed AP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes AP@k at the per-query level (on the instances belonging to a specific query). The AP@k per query is calculated as

ap@k = sum( P(k) / min(m, n) ), for k = 1, ..., n

where:
  • P(k) is the precision at cut-off k in the item list; P(k) equals 0 when the k-th item is not relevant (not followed upon recommendation).
  • m is the number of relevant documents.
  • n is the number of predicted documents.

If the denominator is zero, P(k)/min(m, n) is set to zero.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
map : float
The AP@k score for the query.
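A standalone sketch of the AP@k computation described above; treating any non-zero label as relevant is an assumption of this illustration, since the MAP constructor does not expose a threshold parameter:

    import numpy as np

    def average_precision_per_query(y, y_pred, cutoff=None):
        # AP@k: sum of P(i) at the relevant positions i, divided by min(m, n).
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        relevant = (y[ranking] > 0).astype(float)        # assumption: non-zero label = relevant
        m = int((y > 0).sum())                           # relevant documents for the query
        n = len(ranking)                                 # retrieved (predicted) documents
        if min(m, n) == 0:
            return 0.0                                   # denominator is zero -> AP set to zero
        precision_at_i = np.cumsum(relevant) / np.arange(1, n + 1)
        return float(np.sum(precision_at_i * relevant) / min(m, n))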
class rankeval.metrics.MRR(name='MRR', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean Reciprocal Rank.

This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.

name: string
MRR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.

dataset : Dataset
Represents the Dataset object on which to apply MRR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average MRR over all MRR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed MRR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.

We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mrr: float
Represents the MRR score for one query.
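A sketch of the per-query reciprocal rank under the documented relevance threshold; returning 0 when no relevant document appears in the (possibly cut-off) ranking is an assumption of this illustration:

    import numpy as np

    def reciprocal_rank_per_query(y, y_pred, cutoff=None, threshold=1):
        # 1 / rank of the first relevant document in the predicted ranking.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        hits = np.where(y[ranking] >= threshold)[0]
        return 1.0 / (hits[0] + 1) if hits.size else 0.0   # assumption: 0.0 when nothing relevant is retrieved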
class rankeval.metrics.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Pfound with several parameters.

The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].

In fact, pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound following that description (Section 7.2 of http://olivier.chapelle.cc/pub/err.pdf).

This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.

name: string
Pf
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
p_abandonment: float
This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1-y of abandoning.
eval(dataset, y_pred)[source]

The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.

dataset : Dataset
Represents the Dataset object on which to apply Pfound.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average Pfound over all Pfound scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed Pfound scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array
Represents the predicted document scores obtained during the model scoring phase for that query.
pfound: float
Represents the Pfound score for one query.
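A sketch of the extended cascade described above, where at every position the user continues only if unsatisfied and not abandoning; the label-to-satisfaction mapping (the same exponential mapping as ERR) and the per-query normalisation are assumptions of this example:

    import numpy as np

    def pfound_per_query(y, y_pred, cutoff=None, p_abandonment=0.15, max_label=None):
        # pFound: ERR-style cascade extended with an abandonment probability.
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        if max_label is None:
            max_label = y.max()                              # assumption: per-query normalisation
        p_rel = (2.0 ** y[ranking] - 1.0) / (2.0 ** max_label)
        pfound, p_looking = 0.0, 1.0
        for r in p_rel:
            pfound += p_looking * r                          # probability the user is satisfied here
            p_looking *= (1.0 - r) * (1.0 - p_abandonment)   # continue only if unsatisfied and not abandoning
        return pfound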
class rankeval.metrics.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as described in: Alistair Moffat and Justin Zobel. 2008.

Rank-biased precision for measurement of retrieval effectiveness.

ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952

RBP is an extension of P@k. The user has a certain chance of viewing each result.

RBP = E(# viewed relevant results) / E(# viewed results)

p is based on the user model perspective and allows simulating different types of users, e.g.:

  • p = 0.95 for persistent users
  • p = 0.8 for patient users
  • p = 0.5 for impatient users
  • p = 0 for “I’m feeling lucky” users (equivalent to P@1)

The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.

This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.

name: string
RBP
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
p: float
This parameter simulates the user type and, consequently, the probability that a user actually inspects the document at rank k.
eval(dataset, y_pred)[source]

This method takes the RBP for each query and calculates the average RBP.

dataset : Dataset
Represents the Dataset object on which to apply RBP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RBP score (averaged over the detailed RBP scores).
detailed_scores: numpy 1d array of floats
The detailed RBP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rbp: float
Represents the RBP score for one query.
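The closed form RBP = (1 - p) * sum_k r_k * p^(k-1), with binary relevance r_k derived from the threshold, can be sketched as follows; this is an illustration of the formula rather than the library's code:

    import numpy as np

    def rbp_per_query(y, y_pred, cutoff=None, threshold=1, p=0.5):
        # RBP: (1 - p) * sum over ranks k of r_k * p**(k - 1).
        ranking = np.argsort(y_pred)[::-1]
        if cutoff is not None:
            ranking = ranking[:cutoff]
        relevant = (y[ranking] >= threshold).astype(float)
        weights = (1.0 - p) * p ** np.arange(len(ranking))   # geometric viewing probabilities
        return float(np.sum(relevant * weights))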
class rankeval.metrics.MSE(name='MSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean squared error (MSE) with several parameters.

This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.

name: string
MSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the MSE for each query and calculates the average MSE.

dataset : Dataset
Represents the Dataset object on which to apply MSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MSE score (averaged over the detailed MSE scores).
detailed_scores: numpy 1d array of floats
The detailed MSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mse: float
Represents the MSE score for one query.
class rankeval.metrics.RMSE(name='RMSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Root mean squared error (RMSE) with several parameters.

This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.

name: string
RMSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the RMSE for each query and calculates the average RMSE.

dataset : Dataset
Represents the Dataset object on which to apply RMSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RMSE score (averages over the detailed RMSE scores).
detailed_scores: numpy 1d array of floats
The detailed RMSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rmse: float
Represents the RMSE score for one query.
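For completeness, a sketch of the per-query MSE and RMSE on labels versus predicted scores; restricting to the top-k documents of the predicted ranking when a cutoff is given is an assumption of this illustration:

    import numpy as np

    def mse_per_query(y, y_pred, cutoff=None):
        # Mean squared error between labels and predicted scores for one query.
        if cutoff is not None:
            top = np.argsort(y_pred)[::-1][:cutoff]   # assumption: keep only the top-k predicted documents
            y, y_pred = y[top], y_pred[top]
        return float(np.mean((y - y_pred) ** 2))

    def rmse_per_query(y, y_pred, cutoff=None):
        # RMSE is the square root of the per-query MSE.
        return float(np.sqrt(mse_per_query(y, y_pred, cutoff)))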
class rankeval.metrics.SpearmanRho(name='Rho')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.

This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.

name: string
Rho
eval(dataset, y_pred)[source]

This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.

dataset : Dataset
Represents the Dataset object on which to apply Spearman Rho.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Spearman Rho score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Spearman Rho scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Spearman Rho at the per-query level (on the instances belonging to a specific query).

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rho: float
The Spearman Rho per query.
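As with Kendall's Tau, the per-query value comes from scipy; the snippet below reproduces it with scipy.stats.spearmanr on made-up example data:

    import numpy as np
    from scipy.stats import spearmanr

    y = np.array([3, 2, 0, 1, 0])                 # ground-truth labels for one query
    y_pred = np.array([0.7, 0.9, 0.1, 0.3, 0.2])  # predicted scores for the same documents

    rho, p_value = spearmanr(y, y_pred)           # rank correlation between labels and scores
    print(rho)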

Submodules

rankeval.metrics.dcg module

class rankeval.metrics.dcg.DCG(name='DCG', cutoff=None, implementation='flat')[source]

Bases: rankeval.metrics.metric.Metric

This class implements DCG with several parameters.

This is the constructor of DCG, an object of type Metric, with the name DCG. The constructor also allows setting custom values in the following parameters.

name: string
DCG
cutoff: int
The top k results to be considered at per query level (e.g. 10).
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5).
implementation: string
Indicates whether to consider the flat or the exponential DCG formula (e.g. {“flat”, “exp”}).
eval(dataset, y_pred)[source]

The method computes DCG by taking as input the dataset and the predicted document scores. It returns the averaged DCG score over the entire dataset and the detailed DCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply DCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average DCG over all DCG scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed DCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the DCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
dcg: float
Represents the DCG score for one query.

rankeval.metrics.err module

class rankeval.metrics.err.ERR(name='ERR', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Expected Reciprocal Rank as proposed in http://olivier.chapelle.cc/pub/err.pdf

This is the constructor of ERR, an object of type Metric, with the name ERR. The constructor also allows setting custom values in the following parameters.

name: string
ERR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
eval(dataset, y_pred)[source]

The method computes ERR by taking as input the dataset and the predicted document scores. It returns the averaged ERR score over the entire dataset and the detailed ERR scores per query.

dataset : Dataset
Represents the Dataset object on which to apply ERR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average ERR over all ERR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed ERR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the ERR score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
err: float
Represents the ERR score for one query.

rankeval.metrics.kendall_tau module

class rankeval.metrics.kendall_tau.Kendalltau(name='K')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Kendall’s Tau. We use the Kendall tau coefficient implementation from scipy.

This is the constructor of Kendall Tau, an object of type Metric, with the name K. The constructor also allows setting custom values in the following parameters.

name: string
K
eval(dataset, y_pred)[source]

This method computes the Kendall tau score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Kendall tau score.

dataset : Dataset
Represents the Dataset object on which to apply Kendall Tau.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Kendall tau score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Kendall tau scores for each query, an array with length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Kendall tau at the per-query level (on the instances belonging to a specific query). The Kendall tau per query is calculated as:

tau = (P - Q) / sqrt((P + Q + T) * (P + Q + U))

where P is the number of concordant pairs, Q the number of discordant pairs, T the number of ties only in x, and U the number of ties only in y. If a tie occurs for the same pair in both x and y, it is not added to either T or U. The underlying scipy implementation sorts the inputs with lexsort by default, for which kendalltau has complexity O(n log(n)); with quicksort the complexity is O(n^2), but with a smaller pre-factor, so it may be faster for small arrays.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
kendalltau: float
The Kendall tau per query.

rankeval.metrics.map module

class rankeval.metrics.map.MAP(name='MAP', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements MAP with several parameters. We implemented MAP as in https://www.kaggle.com/wiki/MeanAveragePrecision, adapted from: http://en.wikipedia.org/wiki/Information_retrieval http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf

This is the constructor of MAP, an object of type Metric, with the name MAP. The constructor also allows setting custom values in the following parameters.

name: string
MAP
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the AP@k for each query and calculates the average, thus MAP@k.

dataset : Dataset
Represents the Dataset object on which to apply MAP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MAP@k score (averages over the detailed MAP scores).
detailed_scores: numpy 1d array of floats
The detailed AP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes AP@k at the per-query level (on the instances belonging to a specific query). The AP@k per query is calculated as

ap@k = sum( P(k) / min(m, n) ), for k = 1, ..., n

where:
  • P(k) is the precision at cut-off k in the item list; P(k) equals 0 when the k-th item is not relevant (not followed upon recommendation).
  • m is the number of relevant documents.
  • n is the number of predicted documents.

If the denominator is zero, P(k)/min(m, n) is set to zero.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
map : float
The AP@k score for the query.

rankeval.metrics.metric module

class rankeval.metrics.metric.Metric(name)[source]

Bases: object

Metric is an abstract class which provides an interface for specific metrics. It also offers two methods: one for iterating over the indices for a certain query and another for iterating over the entire dataset based on those indices.

Some intuitions: https://stats.stackexchange.com/questions/159657/metrics-for-evaluating-ranking-algorithms

The constructor for any metric; it initializes that metric with the proper name.

name : string
Represents the name of that metric instance.
eval(dataset, y_pred)[source]

This abstract method computes a specific metric over the predicted scores for a test dataset. It calls the eval_per_query method for each query in order to get the detailed metric score.

dataset : Dataset
Represents the Dataset object on which we want to apply the metric.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average value of the metric over all per-query metric scores.
detailed_scores: numpy 1d array of floats
Represents the detailed metric scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method evaluates the predicted scores for a specific query within the dataset.

y: numpy array
Represents the instance labels corresponding to the queries in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
score: float
Represents the metric score for one query.
query_iterator(dataset, y_pred)[source]

This method iterates over the dataset document scores (instance labels) and the predicted scores in blocks of instances which belong to the same query.

dataset : Dataset
Represents the Dataset object on which to iterate.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.

For each query, the iterator yields:

: int
The query id.
: numpy.array
The document scores of the instances in the labeled dataset (instance labels) belonging to the same query id.
: numpy.array
The predicted scores for the instances in the dataset belonging to the same query id.

rankeval.metrics.mrr module

class rankeval.metrics.mrr.MRR(name='MRR', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean Reciprocal Rank.

This is the constructor of MRR, an object of type Metric, with the name MRR. The constructor also allows setting custom values in the following parameters.

name: string
MRR
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

The method computes MRR by taking as input the dataset and the predicted document scores. It returns the averaged MRR score over the entire dataset and the detailed MRR scores per query.

The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries.

dataset : Dataset
Represents the Dataset object on which to apply MRR.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average MRR over all MRR scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed MRR scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MRR score per query. It is called by the eval function which averages and aggregates the scores for each query.

We compute the reciprocal rank. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mrr: float
Represents the MRR score for one query.

rankeval.metrics.mse module

class rankeval.metrics.mse.MSE(name='MSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Mean squared error (MSE) with several parameters.

This is the constructor of MSE, an object of type Metric, with the name MSE. The constructor also allows setting custom values in the following parameters.

name: string
MSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the MSE for each query and calculates the average MSE.

dataset : Dataset
Represents the Dataset object on which to apply MSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall MSE score (averaged over the detailed MSE scores).
detailed_scores: numpy 1d array of floats
The detailed MSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the MSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
mse: float
Represents the MSE score for one query.

rankeval.metrics.ndcg module

class rankeval.metrics.ndcg.NDCG(name='NDCG', cutoff=None, no_relevant_results=1.0, implementation='exp')[source]

Bases: rankeval.metrics.metric.Metric

This class implements NDCG with several parameters.

This is the constructor of NDCG, an object of type Metric, with the name NDCG. The constructor also allows setting custom values

  • cutoff: the top k results to be considered at per query level
  • no_relevant_results: a float value indicating how to treat the
    cases where there are no relevant results
  • ties: indicates how we should consider the ties
  • implementation: indicates whether to consider the flat or the
    exponential NDCG formula
name: string
NDCG
cutoff: int
The top k results to be considered at per query level (e.g. 10)
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.5). Default is 1.0.
implementation: string
Indicates whether to consider the flat or the exponential DCG formula: “flat” or “exp” (default).
eval(dataset, y_pred)[source]

The method computes NDCG by taking as input the dataset and the predicted document scores (obtained with the scoring methods). It returns the averaged NDCG score over the entire dataset and the detailed NDCG scores per query.

dataset : Dataset
Represents the Dataset object on which to apply NDCG.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average NDCG over all NDCG scores per query.
detailed_scores: numpy array of floats
Represents the detailed NDCG scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the NDCG score per query. It is called by the eval function which averages and aggregates the scores for each query.

It calculates NDCG per query as dcg_score / idcg_score. If there are no relevant results, NDCG returns the value set by default or by the user when creating the metric.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
ndcg: float
Represents the NDCG score for one query.

rankeval.metrics.pfound module

class rankeval.metrics.pfound.Pfound(name='Pf', cutoff=None, p_abandonment=0.15)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Pfound with several parameters.

The ERR metric is very similar to the pFound metric used by Yandex (Segalovich, 2010). [http://proceedings.mlr.press/v14/chapelle11a/chapelle11a.pdf].

In fact, pFound is identical to the ERR variant described in (Chapelle et al., 2009, Section 7.2). We implemented pFound following that description (Section 7.2 of http://olivier.chapelle.cc/pub/err.pdf).

This is the constructor of Pfound, an object of type Metric, with the name Pf. The constructor also allows setting custom values in the following parameters.

name: string
Pf
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
p_abandonment: float
This parameter indicates the probability of abandonment, i.e. the user stops looking at the ranked list due to an external reason. The original cascade model of ERR has later been extended to include an abandonment probability: if the user is not satisfied at a given position, he will examine the next url with probability y, but has a probability 1-y of abandoning.
eval(dataset, y_pred)[source]

The method computes Pfound by taking as input the dataset and the predicted document scores. It returns the averaged Pfound score over the entire dataset and the detailed Pfound scores per query.

dataset : Dataset
Represents the Dataset object on which to apply Pfound.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
Represents the average Pfound over all Pfound scores per query.
detailed_scores: numpy 1d array of floats
Represents the detailed Pfound scores for each query. It has the length of n_queries.
eval_per_query(y, y_pred)[source]

This method helps compute the Pfound score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array
Represents the predicted document scores obtained during the model scoring phase for that query.
pfound: float
Represents the Pfound score for one query.

rankeval.metrics.precision module

class rankeval.metrics.precision.Precision(name='P', cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Precision as: (relevant docs & retrieved docs) / retrieved docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Precision, an object of type Metric, with the name P. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
P
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Precision score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Precision score.

dataset : Dataset
Represents the Dataset object on which to apply Precision.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Precision score (averages over the detailed precision scores).
detailed_scores: numpy 1d array of floats
The detailed Precision scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Precision at the per-query level (on the instances belonging to a specific query). The Precision per query is calculated as (relevant docs & retrieved docs) / retrieved docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
precision: float
The precision per query.

rankeval.metrics.rbp module

class rankeval.metrics.rbp.RBP(name='RBP', cutoff=None, threshold=1, p=0.5)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Rank-biased Precision (RBP) with several parameters. We implemented RBP as described in: Alistair Moffat and Justin Zobel. 2008.

Rank-biased precision for measurement of retrieval effectiveness.

ACM Trans. Inf. Syst. 27, 1, Article 2 (December 2008), 27 pages. DOI=http://dx.doi.org/10.1145/1416950.1416952

RBP is an extension of P@k. The user has a certain chance of viewing each result.

RBP = E(# viewed relevant results) / E(# viewed results)

p is based on the user model perspective and allows simulating different types of users, e.g.:

  • p = 0.95 for persistent users
  • p = 0.8 for patient users
  • p = 0.5 for impatient users
  • p = 0 for “I’m feeling lucky” users (equivalent to P@1)

The use of different values of p reflects different ways in which ranked lists can be used. Values close to 1.0 are indicative of highly persistent users, who scrutinize many answers before ceasing their search. For example, at p = 0.95, there is a roughly 60% likelihood that a user will enter a second page of 10 results, and a 35% chance that they will go to a third page. Such users obtain a relatively low per-document utility from a search unless a high number of relevant documents are encountered, scattered through a long prefix of the ranking.

This is the constructor of RBP, an object of type Metric, with the name RBP. The constructor also allows setting custom values in the following parameters.

name: string
RBP
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
p: float
This parameter simulates the user type and, consequently, the probability that a user actually inspects the document at rank k.
eval(dataset, y_pred)[source]

This method takes the RBP for each query and calculates the average RBP.

dataset : Dataset
Represents the Dataset object on which to apply RBP.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RBP score (averaged over the detailed RBP scores).
detailed_scores: numpy 1d array of floats
The detailed RBP@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RBP score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rbp: float
Represents the RBP score for one query.

rankeval.metrics.recall module

class rankeval.metrics.recall.Recall(name='R', no_relevant_results=0.0, cutoff=None, threshold=1)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Recall as: (relevant docs & retrieved docs) / relevant docs.

It allows setting custom values for cutoff and threshold, otherwise it uses the default values.

This is the constructor of Recall, an object of type Metric, with the name R. The constructor also allows setting custom values for cutoff and threshold, otherwise it uses the default values.

name: string
R
no_relevant_results: float
Float indicating how to treat the cases where there are no relevant results (e.g. 0.0).
cutoff: int
The top k results to be considered at per query level (e.g. 10)
threshold: float
All instances with a label greater than or equal to this threshold are considered relevant. With the default value of 1, every instance with a non-zero label counts as relevant; the threshold can be set to other values (e.g. 3) within the range of possible labels.
eval(dataset, y_pred)[source]

This method computes the Recall score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Recall score.

dataset : Dataset
Represents the Dataset object on which to apply Recall.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Recall score (averaged over the detailed recall scores).
detailed_scores: numpy 1d array of floats
The detailed Recall scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Recall at the per-query level (on the instances belonging to a specific query). The Recall per query is calculated as (relevant docs & retrieved docs) / relevant docs.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
recall: float
The Recall score per query.

rankeval.metrics.rmse module

class rankeval.metrics.rmse.RMSE(name='RMSE', cutoff=None)[source]

Bases: rankeval.metrics.metric.Metric

This class implements Root mean squared error (RMSE) with several parameters.

This is the constructor of RMSE, an object of type Metric, with the name RMSE. The constructor also allows setting custom values in the following parameters.

name: string
RMSE
cutoff: int
The top k results to be considered at per query level (e.g. 10), otherwise the default value is None and is computed on all the instances of a query.
eval(dataset, y_pred)[source]

This method takes the RMSE for each query and calculates the average RMSE.

dataset : Dataset
Represents the Dataset object on which to apply RMSE.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall RMSE score (averages over the detailed RMSE scores).
detailed_scores: numpy 1d array of floats
The detailed RMSE@k scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method helps compute the RMSE score per query. It is called by the eval function which averages and aggregates the scores for each query.

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rmse: float
Represents the RMSE score for one query.

rankeval.metrics.spearman_rho module

class rankeval.metrics.spearman_rho.SpearmanRho(name='Rho')[source]

Bases: rankeval.metrics.metric.Metric

This class implements Spearman’s Rho. We use the Spearman Rho coefficient implementation from scipy.

This is the constructor of Spearman Rho, an object of type Metric, with the name Rho. The constructor also allows setting custom values in the following parameters.

name: string
Rho
eval(dataset, y_pred)[source]

This method computes the Spearman Rho score over the entire dataset and the detailed scores per query. It calls the eval_per_query method for each query in order to get the detailed Spearman Rho score.

dataset : Dataset
Represents the Dataset object on which to apply Spearman Rho.
y_pred : numpy 1d array of float
Represents the predicted document scores for each instance in the dataset.
avg_score: float
The overall Spearman Rho score (averages over the detailed scores).
detailed_scores: numpy 1d array of floats
The detailed Spearman Rho scores for each query, an array of length of the number of queries.
eval_per_query(y, y_pred)[source]

This method computes Spearman Rho at the per-query level (on the instances belonging to a specific query).

y: numpy array
Represents the labels of instances corresponding to one query in the dataset (ground truth).
y_pred: numpy array.
Represents the predicted document scores obtained during the model scoring phase for that query.
rho: float
The Spearman Rho per query.