Towards Realistic Known-item Topics for the ClueWeb

Claudia Hauff; Matthias Hagen; Anna Beyer; B. Stein

doi:10.1145/2362724.2362773

Web Information Systems

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

8 Citations (Scopus)

Abstract

Known-item finding is the task of re-finding and re-accessing an item previously seen. Typical examples of known items include accessed Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding heavily relies on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), a fact which does not allow for repeatable research. The existing publicly available corpora either contain automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature.

In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable realistic Web-scale known-item searches.

Original language	English
Title of host publication	IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium
Place of Publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	274-277
Number of pages	4
ISBN (Electronic)	978-1-4503-1282-0
DOIs	https://doi.org/10.1145/2362724.2362773
Publication status	Published - 21 Aug 2012
Event	The 4th Information Interaction in Context Symposium: IIIX'12 - Nijmegen, Netherlands Duration: 21 Aug 2012 → 24 Aug 2012

Conference

Conference	The 4th Information Interaction in Context Symposium
Country/Territory	Netherlands
City	Nijmegen
Period	21/08/12 → 24/08/12

Keywords

ClueWeb
known-item

Access to Document

10.1145/2362724.2362773

Cite this

@inproceedings{e3de9e6f01344cb1b376f8013432acb7,

title = "Towards Realistic Known-item Topics for the ClueWeb",

abstract = "Known-item finding is the task of re-finding and re-accessing an item previously seen. Typical examples of known items include accessed Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding heavily relies on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), a fact which does not allow for repeatable research. The existing publicly available corpora either contain automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature. In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable realistic Web-scale known-item searches.",

keywords = "ClueWeb, known-item",

author = "Claudia Hauff and Matthias Hagen and Anna Beyer and B. Stein",

year = "2012",

month = aug,

day = "21",

doi = "10.1145/2362724.2362773",

language = "English",

pages = "274--277",

booktitle = "IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium",

publisher = "Association for Computing Machinery (ACM)",

address = "New York",

note = "The 4th Information Interaction in Context Symposium : IIIX'12 ; Conference date: 21-08-2012 Through 24-08-2012",

}

Hauff, C, Hagen, M, Beyer, A & Stein, B 2012, Towards Realistic Known-item Topics for the ClueWeb. in IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium. Association for Computing Machinery (ACM), New York, pp. 274-277, The 4th Information Interaction in Context Symposium, Nijmegen, Netherlands, 21/08/12. https://doi.org/10.1145/2362724.2362773

Towards Realistic Known-item Topics for the ClueWeb. / Hauff, Claudia; Hagen, Matthias; Beyer, Anna et al.
IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium. New York: Association for Computing Machinery (ACM), 2012. p. 274-277.

TY - GEN

T1 - Towards Realistic Known-item Topics for the ClueWeb

AU - Hauff, Claudia

AU - Hagen, Matthias

AU - Beyer, Anna

AU - Stein, B.

PY - 2012/8/21

Y1 - 2012/8/21

N2 - Known-item finding is the task of re-finding and re-accessing an item previously seen. Typical examples of known items include accessed Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding heavily relies on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), a fact which does not allow for repeatable research. The existing publicly available corpora either contain automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature. In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable realistic Web-scale known-item searches.

AB - Known-item finding is the task of re-finding and re-accessing an item previously seen. Typical examples of known items include accessed Web sites, received emails, or documents on one's personal desktop. Current research on known-item finding heavily relies on corpora of known-item queries and the respective known items. However, many existing corpora are proprietary and not available to the public (in particular those derived from Web query logs), a fact which does not allow for repeatable research. The existing publicly available corpora either contain automatically generated queries or queries that were manually generated while seeing the known item itself. Hence, we consider these public corpora to be rather artificial in nature. In this paper, we propose a methodology to create a known-item topic set that is much more realistic and that is built on top of a large-scale public test corpus. From known-item questions posted on the popular Yahoo! Answers platform we extract queries for known items in a crowdsourcing setup. Since we ensure that all the known items correspond to Web pages in the publicly available ClueWeb09 corpus (a large static Web crawl), we provide an environment for repeatable realistic Web-scale known-item searches.

KW - ClueWeb

KW - known-item

U2 - 10.1145/2362724.2362773

DO - 10.1145/2362724.2362773

M3 - Conference contribution

SP - 274

EP - 277

BT - IIIX'12 Proceedings of the 4th Information Interaction in Context Symposium

PB - Association for Computing Machinery (ACM)

CY - New York

T2 - The 4th Information Interaction in Context Symposium

Y2 - 21 August 2012 through 24 August 2012

ER -
