First version of the wikipedia dataset + creating the track. #429

Merged
merged 23 commits into from
Sep 18, 2023
Changes from 19 commits
Commits
23 commits
0cf3250
First version of the wikipedia dataset + creating the track.
afoucret Jul 3, 2023
110ee76
Update wikipedia doc generation script
afoucret Jul 4, 2023
94209ab
Update dataset size.
afoucret Jul 6, 2023
da9fd9c
Add readme to the track.
afoucret Jul 6, 2023
4759ab3
Rename parse documents script.
afoucret Jul 10, 2023
aa3324c
Refactoring challenges and operations.
afoucret Jul 11, 2023
3697a2e
Adding support for different index mappings.
afoucret Jul 11, 2023
5b612e7
Calculate the probability distribution of wikipedia title clickstream…
saikatsarkar056 Jul 11, 2023
afd9e5f
[Click Dataset] Fix a failure when the download dir already exists.
afoucret Jul 13, 2023
a2d2d97
Add queries.json to the repo.
afoucret Jul 13, 2023
bb3a67b
Renaming queries.json into queries.csv
afoucret Jul 13, 2023
3c7d643
Rename queries.csv column
afoucret Jul 13, 2023
a86694b
Started the implementation of search application benchmarking.
afoucret Jul 17, 2023
e0525c8
Implemented the search application search endpoint benchmarking.
afoucret Jul 18, 2023
66d6aea
Change the track.py
saikatsarkar056 Jul 18, 2023
15c2aaa
Change the track.py
saikatsarkar056 Jul 18, 2023
37a276b
Change the track.py
saikatsarkar056 Jul 18, 2023
e3550e4
Refactor BM25 and Search Application search.
afoucret Jul 19, 2023
e25e9f1
Mention number_of_shards param in the settings
saikatsarkar056 Aug 7, 2023
86771b0
Update wikipedia/README.md
saikatsarkar056 Aug 14, 2023
861afe4
Apply suggestions from code review
afoucret Sep 5, 2023
380516f
Improved track:
afoucret Sep 5, 2023
58a44c9
Use a param for the random seed.
afoucret Sep 18, 2023
56 changes: 56 additions & 0 deletions wikipedia/README.md
@@ -0,0 +1,56 @@
## Wikipedia Search track

This track benchmarks indexing and full-text search performance using a dataset of English Wikipedia articles.

The dataset is derived from a dump of English Wikipedia available here:
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2.

Each page is formatted into a JSON document with the following fields:

* `title`: Page title.
* `namespace`: Optional namespace for the page. [Namespaces](https://en.wikipedia.org/wiki/Wikipedia:Namespace) allow for the organization and separation of content pages from administration pages.
* `content`: Page content.
* `redirect`: If the page is a redirect, the target of the redirection. In this case, the document has no `content` field.

Fields that do not have values have been left out.

### Generating the documents dataset

To regenerate the dataset from scratch, first download the compressed archive of the English Wikipedia dump
from [this link](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2) (~21GB).
The parsing script reads the `.bz2` archive directly, so there is no need to decompress it first.

Then run this command:

```bash
python _tools/parse_documents.py <path_to_xml_file> | bzip2 --best > documents.json.bz2
```
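
The resulting `documents.json.bz2` contains one JSON document per line. As a quick sanity check (a minimal sketch, assuming the corpus was generated in the current working directory), you can read back the first document:

```python
import bz2
import json

# The corpus is newline-delimited JSON compressed with bzip2: read back the first document.
with bz2.open("documents.json.bz2", "rt", encoding="utf-8") as corpus:
    first_doc = json.loads(next(corpus))

print(first_doc["title"])
print(sorted(first_doc.keys()))
```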

### Generating the clickstream probability distribution

To generate the probability distribution of the most frequent queries for a specific month of the Wikimedia clickstream, run the following command:

```bash
python3 _tools/parse_clicks.py --year 2023 --month 6 --lang en > queries.csv
```
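
The `probability` column in `queries.csv` is each query's click count divided by the total count over the retained queries (see `_tools/parse_clicks.py`). A minimal sketch of that normalization, using hypothetical click counts:

```python
from collections import Counter

# Hypothetical click counts per query title (illustrative only).
word_freq = Counter({"Anarchism": 1200, "Python (programming language)": 800, "Elasticsearch": 400})

# Normalize counts into a probability distribution, most frequent first.
total = sum(word_freq.values())
word_prob = [(query, count / total) for query, count in word_freq.most_common()]

for query, probability in word_prob:
    print(f"{query},{probability:.4f}")
```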

### Example Document

```json
{
"title": "Anarchism",
"content": "{{short description|Political philosophy and movement}}\n{{other uses}}\n{{redirect2|Anarchist|Anarchists|other uses|Anarchist (disambiguation)}}\n{{distinguish|Anarchy}}\n{{good article}}\n{{pp-semi-indef}}\n{{use British English|date=August 2021}}\n{{use dmy dates|date=August 2021}}\n{{Use shortened footnotes|date=May 2023}}\n{{anarchism sidebar}}\n{{basic forms of government}}\n\n'''Anarchism''' is a [[political philosophy]] and [[Political movement|movement]] that is skeptical of all justifications for [[authority]] and seeks to abolish the [[institutions]] it claims maintain unnecessary [[coercion]] and [[Social hierarchy|hierarchy]], typically including [[government]]s,<ref name=\":0\">{{Cite book |title=The Desk Encyclopedia of World History |publisher=[[Oxford University Press]] |year=2006 |isbn=978-0-7394-7809-7 |editor-last=Wright |editor-first=Edmund |location=New York |pages=20\u201321}}</ref> [[State (polity)|nation states]],{{sfn|Suissa|2019b|ps=: \"...as many anarchists have stressed, it is not government as such that they find objectionable, but the hierarchical forms of government associated with the nation state.\"}} [[law]] and [[law enforcement]],<ref name=\":0\" /> and [[capitalism]]. Anarchism advocates for the replacement of the state with [[Stateless society|stateless societies]] or other forms of [[Free association (communism and anarchism)|free associations]]. As a historically [[left-wing]] movement, this reading of anarchism is placed on the [[Far-left politics|farthest left]] of the [[political spectrum]], usually described as the [[libertarian]] wing of the [[socialist movement]] ([ ..."
}
```

### Parameters

This track accepts the following parameters with Rally 0.8.0+ using `--track-params`:

- `bulk_size` (default: 500)
- `bulk_indexing_clients` (default: 5)
- `bulk_warmup` (default: 40): warmup time period, in seconds, for the bulk indexing task
- `ingest_percentage` (default: 100)
- `search_clients` (default: 10)
- `search_size` (default: 20)
- `search_fields` (default: `*`)

### License

We use the same license as the original data: [CC BY-SA 3.0](http://creativecommons.org/licenses/by-sa/3.0/).
More details can be found on [this page](https://en.wikipedia.org/wiki/Wikipedia:Copyrights).
101 changes: 101 additions & 0 deletions wikipedia/_tools/parse_clicks.py
@@ -0,0 +1,101 @@
import argparse
import csv
import gzip
import logging
import os
import sys
from collections import Counter

import requests

# Set up the logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Define a handler to output log messages to the console
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)

# Define a formatter for the log messages
formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
console_handler.setFormatter(formatter)

# Add the console handler to the logger
logger.addHandler(console_handler)


class ClickStreamDist:
    def __init__(self, year, month, lang):
        self.filename = f"clickstream-{lang}wiki-{year}-{month:02d}.tsv.gz"
        self.url = f"https://dumps.wikimedia.org/other/clickstream/{year:d}-{month:02d}/{self.filename}"
        self.clickstream_output_file = os.path.expanduser(f"~/.rally/benchmarks/data/wikipedia/{self.filename}")

    def download(self):
        if os.path.exists(self.clickstream_output_file):
            logger.info("File already exists. Skipping download.")
            return

        logger.info("Downloading the clickstream file...")
        response = requests.get(self.url)

        if response.status_code == 200:
            # Create the output directory if it is missing
            os.makedirs(os.path.dirname(self.clickstream_output_file), exist_ok=True)

            # Write the content to the file
            with open(self.clickstream_output_file, "wb") as file:
                file.write(response.content)
            logger.info("File downloaded successfully.")
        else:
            logger.error("Failed to download the file.")

    def analyze(self):
        logger.info("Analyzing...")

        word_freq = self.calculate_word_frequency()
        word_prob = self.calculate_word_probability(word_freq)

        self.dump_probability_distribution(word_prob)

        logger.info("Analysis completed.")

    def calculate_word_frequency(self):
        logger.info("Calculating word frequency...")

        word_freq = Counter()
        with gzip.open(self.clickstream_output_file, "rt", encoding="utf-8") as file:
            # Documentation for clickstream format: https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
            for row in csv.reader(file, delimiter="\t"):
                # Columns are: prev, curr, type, n. Normalize the target title and accumulate its click count,
                # skipping rows coming from external search engines and clicks on the Main Page.
                prev, curr, count = row[0], row[1].replace("_", " ").strip().replace('"', ""), int(row[3])
                if prev != "other-search" and curr != "Main Page":
                    word_freq[curr] += count

        sorted_word_freq = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)

        # Keep only the 10,000 most clicked titles.
        return sorted_word_freq[:10000]

    def calculate_word_probability(self, word_freq):
        logger.info("Calculating word probability...")

        total_words = sum(count for _, count in word_freq)

        return [(word, count / total_words) for word, count in word_freq]

    def dump_probability_distribution(self, prob_dist):
        logger.info("Dumping probability distribution...")

        writer = csv.writer(sys.stdout)
        writer.writerow(["query", "probability"])
        writer.writerows(prob_dist)


# Parse the CLI arguments, then download the monthly clickstream dump and emit the query/probability CSV on stdout.
parser = argparse.ArgumentParser()
parser.add_argument("--year", type=int, default=2023, help="Year")
parser.add_argument("--month", type=int, default=6, help="Month")
parser.add_argument("--lang", type=str, default="en", help="Language")
args = parser.parse_args()

click_stream = ClickStreamDist(year=args.year, month=args.month, lang=args.lang)
click_stream.download()
click_stream.analyze()
54 changes: 54 additions & 0 deletions wikipedia/_tools/parse_documents.py
@@ -0,0 +1,54 @@
import bz2
import json
import sys
from xml.etree import ElementTree

PAGE_TAG = "page"
SITEINFO_TAG = "siteinfo"
XML_NAMESPACES = {"": "http://www.mediawiki.org/xml/export-0.10/"}


def doc_generator(f):
    # Stream the dump with iterparse so the whole XML file is never held in memory.
    namespaces = dict()
    for _, element in ElementTree.iterparse(f):
        _, tag = element.tag.split("}")
        if tag == PAGE_TAG:
            yield parse_page(element, namespaces)
            # Free the memory used by the processed page element.
            element.clear()
        if tag == SITEINFO_TAG:
            namespaces = parse_namespaces(element)


def to_json(f):
    with bz2.BZ2File(f, "r") as fp:
        for doc in doc_generator(fp):
            print(json.dumps(doc))


def parse_namespaces(element) -> dict:
    namespaces = dict()
    for namespace_element in element.findall("namespaces/namespace", XML_NAMESPACES):
        namespaces[namespace_element.get("key")] = namespace_element.text
    return namespaces


def parse_page(element, namespaces):
    page_data = {
        "title": element.find("title", XML_NAMESPACES).text,
    }

    redirect = element.find("redirect", XML_NAMESPACES)
    if redirect is not None:
        page_data["redirect"] = redirect.get("title")
    else:
        page_data["content"] = element.find("revision/text", XML_NAMESPACES).text

    namespace = namespaces[element.find("ns", XML_NAMESPACES).text]
    if namespace is not None:
        page_data["namespace"] = namespace

    return page_data


# Convert every dump file passed on the command line to newline-delimited JSON on stdout.
for file_name in sys.argv[1:]:
    to_json(file_name)
44 changes: 44 additions & 0 deletions wikipedia/challenges/default.json
@@ -0,0 +1,44 @@
{
  "name": "index-and-search",
  "description": "Indexes wikipedia data, then executes searches.",
  "default": true,
  "schedule": [
    {
      "name": "delete-index",
      "operation": "delete-index"
    },
    {
      "name": "create-index",
      "operation": "create-index"
    },
    {
      "name": "check-cluster-health",
      "operation": "check-cluster-health"
    },
    {
      "name": "index-documents",
      "operation": "index-documents",
      "warmup-time-period": {{ bulk_warmup | default(40) | int }},
      "clients": {{bulk_indexing_clients | default(5)}}
    },
    {
      "name": "refresh-after-index",
      "operation": "refresh-after-index"
    },
    {
      "name": "query-string-search",
      "operation": "query-string-search",
      "clients": {{search_clients | default(10)}},
      "warmup-iterations": 500
    },
    {
      "name": "create-default-search-application",
      "operation": "create-default-search-application"
    },
    {
      "name": "default-search-application-search",
      "operation": "default-search-application-search",
      "clients": {{search_clients | default(10)}}
    }
  ]
}
2 changes: 2 additions & 0 deletions wikipedia/files.txt
@@ -0,0 +1,2 @@
documents.json
documents-1k.json
46 changes: 46 additions & 0 deletions wikipedia/operations/default.json
@@ -0,0 +1,46 @@
{
  "name": "delete-index",
  "operation-type": "delete-index"
},
{
  "name": "create-index",
  "operation-type": "create-index"
},
{
  "name": "check-cluster-health",
  "operation-type": "cluster-health",
  "request-params": {
    "wait_for_status": "green"
  },
  "retry-until-success": true
},
{
  "name": "index-documents",
  "operation-type": "bulk",
  "bulk-size": {{bulk_size | default(500)}},
  "ingest-percentage": {{ingest_percentage | default(100)}}
},
{
  "name": "refresh-after-index",
  "operation-type": "refresh",
  "request-timeout": 1000,
  "include-in-reporting": true
},
{
  "name": "create-default-search-application",
  "operation-type": "raw-request",
  "param-source": "create-search-application-param-source"
},
{
  "name": "default-search-application-search",
  "operation-type": "raw-request",
  "param-source": "search-application-search-param-source",
  "iterations": 10000
},
{
  "name": "query-string-search",
  "operation-type": "search",
  "param-source": "query-string-search",
  "size": {{search_size | default(20)}},
  "search-fields": "{{search_fields | default("*")}}"
}