Arachne 1.0
Arachne - the perpetual stitcher of Wikidata entities.
pheidippides.hpp
/*
 * The MIT License (MIT)
 *
 * Copyright (c) 2025 Yaroslav Riabtsev <yaroslav.riabtsev@rwth-aachen.de>
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included
 * in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#ifndef ARACHNE_PHEIDIPPIDES_HPP
#define ARACHNE_PHEIDIPPIDES_HPP
#include "http_client.hpp"

#include <nlohmann/json.hpp>
#include <unordered_set>

namespace arachnespace {
/**
 * @class pheidippides
 * @brief Batch courier for Wikidata/Commons: collects IDs, issues HTTP
 * requests, and returns a merged JSON payload.
 *
 * Responsibilities:
 * - Pick the endpoint based on entity kind:
 *   * Q/P/L/E -> https://www.wikidata.org/w/api.php
 *   * M (mediainfo) -> https://commons.wikimedia.org/w/api.php
 * - Build request parameters:
 *   * for E (EntitySchema): `action=query`, `titles=EntitySchema:<id>`,
 *     `prop=<joined opt.prop>`
 *   * for others: `action=wbgetentities`, `ids=<id>|<id>...`,
 *     `props=<joined opt.props>`
 * - Split the input set into chunks of at most `batch_threshold` IDs.
 * - Filter IDs by expected kind using `arachne::identify(id)`.
 * - Merge per-chunk JSON responses using `merge_patch`.
 *
 * Thread-safety:
 * - Not thread-safe; the instance owns a reusable `http_client` (single easy
 *   handle). Use one instance per calling thread.
 *
 * @note The implementation currently issues requests even when a chunk becomes
 * empty after filtering (for example when `kind == entity_kind::any`).
 * The server response for an empty identifier list is merged as-is.
 */
class pheidippides {
public:
    /**
     * @brief Fetch metadata for a set of entity IDs and return a merged JSON
     * object.
     *
     * Behavior:
     * - Empty `batch` results in an empty JSON object.
     * - For `kind == entity_kind::entity_schema`, IDs are prefixed with
     *   `EntitySchema:` and fields come from `opt.prop`.
     * - For other kinds, fields come from `opt.props`.
     * - Only elements where `arachne::identify(id) == kind` are included in a
     *   request chunk; if the filter removes every element, the request still
     *   executes with an empty identifier list and the response is merged.
     * - Chunk responses are merged into a single object via `merge_patch`.
     *
     * Errors:
     * - Transport or HTTP errors are handled by the internal `http_client`
     *   retry policy; terminal failures throw `std::runtime_error`.
     * - Invalid JSON payloads propagate `nlohmann::json::parse_error` from
     *   `nlohmann::json::parse`.
     *
     * @param batch Set of full IDs (e.g., "Q123", "L7-F1", "E42").
     * @param kind Target entity kind (selects API, fields, and filtering).
     * @return Merged JSON object with fetched data.
     */
    nlohmann::json fetch_json(
        const std::unordered_set<std::string>& batch,
        corespace::entity_kind kind = corespace::entity_kind::any
    );

    /**
     * @brief Access aggregated network metrics of the underlying client.
     * @return Const reference to metrics snapshot.
     */
    [[nodiscard]] const corespace::network_metrics& metrics_info() const;
    /**
     * @brief Join a span of strings with a separator (no encoding or
     * validation).
     *
     * Edge cases:
     * - Empty input yields an empty string.
     * - Separator defaults to "|" (useful for MediaWiki multi-ID parameters).
     *
     * @param ids Input strings to join.
     * @param separator Separator between elements (default: "|").
     * @return Concatenated string.
     */
    static std::string join_str(
        std::span<const std::string> ids, std::string_view separator = "|"
    );

private:
    corespace::options
        opt {}; ///< Request shaping parameters (chunking, fields, base params).
    corespace::http_client
        client {}; ///< Reused HTTP client (not thread-safe across threads).
};
}
#endif // ARACHNE_PHEIDIPPIDES_HPP
Accumulates entity IDs into per-kind batches and organizes groups.
Definition arachne.hpp:47
std::unordered_map< std::string, int > candidates
Definition arachne.hpp:280
std::array< std::unordered_set< std::string >, batched_kind_count > extra_batches
Definition arachne.hpp:273
bool touch_entity(const std::string &id_with_prefix) noexcept
Increment the touch counter for a single full ID (prefix REQUIRED).
Definition arachne.cpp:224
static std::string entity_root(const std::string &id)
Extract the lexeme root from a full ID string.
Definition arachne.cpp:74
std::string current_group
Definition arachne.hpp:290
int touch_ids(std::span< const int > ids, corespace::entity_kind kind)
Batch variant of touch for numeric IDs.
Definition arachne.cpp:59
static bool parse_id(const std::string &entity, size_t &pos, int &id)
Parse a full ID string and extract the numeric portion.
Definition arachne.cpp:149
bool new_group(std::string name="")
Create or select a group and make it current.
Definition arachne.cpp:31
size_t add_entity(const std::string &id_with_prefix, bool force=false, std::string name="")
Enqueue a full (prefixed) ID string and add it to a group.
Definition arachne.cpp:235
std::unordered_map< std::string, std::unordered_set< std::string > > groups
Definition arachne.hpp:277
std::chrono::milliseconds staleness_threshold
Definition arachne.hpp:291
bool enqueue(std::string_view id, corespace::entity_kind kind, bool interactive) const
Decide whether an entity should be enqueued for fetching.
Definition arachne.cpp:201
const size_t batch_threshold
Typical unauthenticated entity-per-request cap.
Definition arachne.hpp:284
pheidippides phe_client
Definition arachne.hpp:293
const int candidates_threshold
Intentional high bar for curiosity-driven candidates.
Definition arachne.hpp:286
static std::string normalize(int id, corespace::entity_kind kind)
Normalize a numeric ID with the given kind to a prefixed string.
Definition arachne.cpp:165
static bool ask_update(std::string_view id, corespace::entity_kind kind, std::chrono::milliseconds age)
Placeholder for interactive staleness confirmation.
Definition arachne.cpp:194
void select_group(std::string name)
Select an existing group or create it on demand.
Definition arachne.cpp:184
std::array< std::unordered_set< std::string >, batched_kind_count > main_batches
Definition arachne.hpp:271
int queue_size(corespace::entity_kind kind) const noexcept
Get the number of queued (pending) entities tracked in the main batch containers.
Definition arachne.cpp:107
corespace::interface ui
Definition arachne.hpp:292
static corespace::entity_kind identify(const std::string &entity) noexcept
Determine the kind of a full ID string.
Definition arachne.cpp:122
bool flush(corespace::entity_kind kind=corespace::entity_kind::any)
Flush (send) up to batch_threshold entities of a specific kind.
Definition arachne.cpp:99
size_t add_ids(std::span< const int > ids, corespace::entity_kind kind, std::string name="")
Enqueue numeric IDs with a given kind and add them to a group.
Definition arachne.cpp:42
Batch courier for Wikidata/Commons: collects IDs, issues HTTP requests, and returns a merged JSON pay...
corespace::http_client client
Reused HTTP client (not thread-safe across threads).
static std::string join_str(std::span< const std::string > ids, std::string_view separator="|")
Join a span of strings with a separator (no encoding or validation).
const corespace::network_metrics & metrics_info() const
Access aggregated network metrics of the underlying client.
nlohmann::json fetch_json(const std::unordered_set< std::string > &batch, corespace::entity_kind kind=corespace::entity_kind::any)
Fetch metadata for a set of entity IDs and return a merged JSON object.
corespace::options opt
Request shaping parameters (chunking, fields, base params).
std::unique_ptr< CURLU, decltype(&curl_url_cleanup)> curl_url_ptr
Unique pointer type for CURLU with proper deleter.
static constexpr std::string prefixes
Definition arachne.cpp:29
constexpr std::size_t batched_kind_count
Number of batchable kinds (Q, P, L, M, E, form, sense).
Definition arachne.hpp:33
entity_kind
Wikidata entity kind.
Definition utils.hpp:46
@ any
API selector (e.g., flush(any)); not directly batchable.
Definition utils.hpp:54
@ lexeme
IDs prefixed with 'L'.
Definition utils.hpp:49
@ form
Lexeme form IDs such as "L<lexeme>-F<form>".
Definition utils.hpp:52
@ unknown
Unrecognized/invalid identifier.
Definition utils.hpp:55
@ sense
Lexeme sense IDs such as "L<lexeme>-S<sense>".
Definition utils.hpp:53
std::string random_hex(const std::size_t n)
Return exactly n random hexadecimal characters (lowercase).
Definition rng.cpp:33
Configuration for fetching entities via MediaWiki/Wikibase API.
Definition utils.hpp:75
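
The kind descriptions above (prefix letters, and forms or senses written as "L<lexeme>-F<form>" and "L<lexeme>-S<sense>") suggest how a full ID could be classified before filtering. The sketch below is an illustration under assumptions, not the code behind `arachne::identify`: `kind_sketch`, `all_digits`, and `classify` are hypothetical names, and the strict digits-only validation is a choice made for this example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical enum for this sketch only; the real corespace::entity_kind
// may name or order its enumerators differently.
enum class kind_sketch {
    item, property, lexeme, mediainfo, entity_schema, form, sense, unknown
};

// True when `s` is a non-empty run of ASCII decimal digits.
constexpr bool all_digits(std::string_view s) {
    if (s.empty()) {
        return false;
    }
    for (const char c : s) {
        if (c < '0' || c > '9') {
            return false;
        }
    }
    return true;
}

// Classify a full ID by its prefix letter; lexeme forms and senses are
// recognized by the "-F<n>" or "-S<n>" suffix after the lexeme number.
constexpr kind_sketch classify(std::string_view id) {
    if (id.size() < 2) {
        return kind_sketch::unknown;
    }
    const char prefix = id.front();
    const std::string_view rest = id.substr(1);
    if (prefix == 'L') {
        const std::size_t dash = rest.find('-');
        if (dash == std::string_view::npos) {
            return all_digits(rest) ? kind_sketch::lexeme
                                    : kind_sketch::unknown;
        }
        const std::string_view tail = rest.substr(dash + 1);
        if (!all_digits(rest.substr(0, dash)) || tail.size() < 2
            || !all_digits(tail.substr(1))) {
            return kind_sketch::unknown;
        }
        if (tail.front() == 'F') {
            return kind_sketch::form;
        }
        if (tail.front() == 'S') {
            return kind_sketch::sense;
        }
        return kind_sketch::unknown;
    }
    if (!all_digits(rest)) {
        return kind_sketch::unknown;
    }
    switch (prefix) {
    case 'Q': return kind_sketch::item;
    case 'P': return kind_sketch::property;
    case 'M': return kind_sketch::mediainfo;
    case 'E': return kind_sketch::entity_schema;
    default: return kind_sketch::unknown;
    }
}

// Compile-time checks using the example IDs from the fetch_json comment.
static_assert(classify("Q123") == kind_sketch::item);
static_assert(classify("L7-F1") == kind_sketch::form);
static_assert(classify("E42") == kind_sketch::entity_schema);
static_assert(classify("X1") == kind_sketch::unknown);
```

Keeping the classifier `constexpr` lets the examples be verified by the compiler; the real function is declared `noexcept` and takes a `const std::string&`, so its actual rules may be looser or stricter than this sketch.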