Arachne 1.0
Arachne - the perpetual stitcher of Wikidata entities.
arachne.hpp
1/*
2 * The MIT License (MIT)
3 *
4 * Copyright (c) 2025 Yaroslav Riabtsev <yaroslav.riabtsev@rwth-aachen.de>
5 *
6 * Permission is hereby granted, free of charge, to any person obtaining a copy
7 * of this software and associated documentation files (the "Software"), to deal
8 * in the Software without restriction, including without limitation the rights
9 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
10 * copies of the Software, and to permit persons to whom the Software is
11 * furnished to do so, subject to the following conditions:
12 *
13 * The above copyright notice and this permission notice shall be included
14 * in all copies or substantial portions of the Software.
15 *
16 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
17 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
18 * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE
19 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
20 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
21 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
22 * SOFTWARE.
23 */
24
25#ifndef ARACHNE_ARACHNE_HPP
26#define ARACHNE_ARACHNE_HPP
27#include "pheidippides.hpp"
28
29namespace arachnespace {
30using std::chrono_literals::operator""h;
31
32/** @brief Number of batchable kinds (Q, P, L, M, E, form, sense). */
33inline constexpr std::size_t batched_kind_count = 7;
34
35/**
36 * @class arachne
37 * @brief Accumulates entity IDs into per-kind batches and organizes groups.
38 *
39 * Invariants:
40 * - Queues store normalized ID strings per kind ("Q123", "P45", "L7", "M9",
41 * "E2", "L7-F1", "L7-S2").
42 * - For numeric add/touch with kind = form or sense, normalization produces
43 * "L<id>" (no warning is emitted yet), because numeric IDs for forms/senses
44 * are not representable; string APIs keep the exact ID.
45 * - Deduplication is by string identity in the respective containers.
46 */
47class arachne {
48public:
49 /// @name Public API
50 /// @{
51
52 /**
53 * @brief Create or select a group and make it current.
54 *
55 * If @p name is empty, creates a new anonymous group with a random name and
56 * makes it current. If @p name exists, it becomes current but is NOT
57 * cleared. If it doesn't exist, the group is created and then selected.
58 *
59 * @param name Group name or empty for an anonymous group.
60 * @return true if a new group was created; false if the group already
61 * existed.
62 *
63 * @note The current group's name is intentionally not exposed; anonymous
64 * groups cannot be addressed explicitly.
65 */
66 bool new_group(std::string name = "");
67
68 /**
69 * @brief Enqueue numeric IDs with a given kind and add them to a group.
70 *
71 * Numeric IDs are normalized by adding the kind prefix.
72 * - If @p kind is form or sense, normalization maps to the lexeme prefix
73 * ("L<id>"); no warning is emitted yet (logging TODO).
74 * - Freshness checks are stubbed; the helper `enqueue` always asks for a
75 * fetch, and the underlying sets deduplicate repeated IDs automatically.
76 *
77 * @param ids Span of numeric IDs.
78 * @param kind Entity kind (must NOT be any/unknown).
79 * @param name Group name; empty targets the current/anonymous group
80 * (auto-created if needed).
81 * @return The resulting size of the target group after insertions.
82 * @throws std::invalid_argument if @p kind is any/unknown.
83 */
84 size_t add_ids(
85 std::span<const int> ids, corespace::entity_kind kind,
86 std::string name = ""
87 );
88
89 /**
90 * @brief Batch variant of touch for numeric IDs.
91 *
92 * Each numeric ID is normalized using @p kind.
93 * If @p kind is form/sense, normalization yields "L<id>" (lexeme); no
94 * warning is emitted yet (logging TODO).
95 *
96 * @param ids Span of numeric IDs.
97 * @param kind Normalization kind (must not be any/unknown).
98 * @return The number of entities for which `touch_entity()` returned true.
99 * @throws std::invalid_argument if @p kind is any/unknown.
100 */
101 int touch_ids(std::span<const int> ids, corespace::entity_kind kind);
102 /**
103 * @brief Extract the lexeme root from a full ID string.
104 *
105 * For IDs beginning with "L" followed by digits, returns "L<digits>". For
106 * other prefixes or malformed strings, returns an empty string.
107 *
108 * @param id Identifier to inspect (e.g., "L7-F1").
109 * @return Lexeme root ("L7") or empty on failure.
110 */
111 static std::string entity_root(const std::string& id);
112
113 /**
114 * @brief Flush (send) up to `batch_threshold` entities of a specific kind.
115 *
116 * For `kind != any`, attempts a single-batch flush for that kind (up to the
117 * threshold). For `kind == any`, a round-robin strategy over batchable
118 * kinds is used.
119 *
120 * @param kind Entity kind selector or entity_kind::any.
121 * @return true if at least one entity was flushed; false otherwise.
122 */
123 bool flush(corespace::entity_kind kind = corespace::entity_kind::any);
124
125 /**
126 * @brief Get the number of queued (pending) entities tracked in the main
127 * batch containers.
128 *
129 * @param kind Specific kind, or entity_kind::any to return the sum across
130 * all batchable kinds.
131 * @return Count of queued entities for the requested kind. Values are
132 * narrowed to `int` via `static_cast`; if the count exceeds
133 * `INT_MAX`, the result wraps modulo 2^N (well-defined since C++20,
134 * implementation-defined in earlier standards).
135 */
136 int queue_size(corespace::entity_kind kind) const noexcept;
137
138 /**
139 * @brief Determine the kind of a full ID string.
140 *
141 * Accepts prefixed IDs (e.g., "Q123", "L77-F2"). Returns `unknown` if the
142 * string is not a valid ID. The function does not throw.
143 *
144 * @param entity Full ID with prefix.
145 * @return Detected kind (may be `unknown`).
146 */
147 static corespace::entity_kind identify(const std::string& entity) noexcept;
148 /**
149 * @brief Parse a full ID string and extract the numeric portion.
150 *
151 * @param entity Full ID (e.g., "Q123", "L7-F1", "L7-S2").
152 * @param pos In/out index of the first digit within @p entity. On
153 * success the index is advanced past the number.
154 * @param id Out parameter for the parsed integer portion.
155 * @return true on successful parse; false otherwise. Never throws.
156 *
157 * @internal Helper used by ID validation and normalization routines.
158 */
159 static bool parse_id(const std::string& entity, size_t& pos, int& id);
160 /**
161 * @brief Normalize a numeric ID with the given kind to a prefixed string.
162 *
163 * Examples:
164 * - (123, item) -> "Q123"
165 * - (45, property) -> "P45"
166 * - (7, lexeme) -> "L7"
167 * - (9, mediainfo) -> "M9"
168 * - (2, entity_schema) -> "E2"
169 * - (7, form) -> "L7" (mapped to lexeme)
170 * - (7, sense) -> "L7" (mapped to lexeme)
171 *
172 * @param id Numeric identifier.
173 * @param kind Kind to prefix with (must not be any/unknown).
174 * @return Prefixed ID string.
175 * @throws std::invalid_argument if @p id is negative or @p kind is
176 * any/unknown.
177 *
178 * @note Form and sense identifiers are currently coerced to the lexeme
179 * prefix without emitting diagnostics; logging is a planned
180 * enhancement.
181 */
182 static std::string normalize(int id, corespace::entity_kind kind);
183 /// @}
184
185private:
186 /**
187 * @brief Select an existing group or create it on demand.
188 *
189 * An empty @p name selects/creates the anonymous group. A non-empty name is
190 * delegated to `new_group`, which creates the group if necessary.
191 *
192 * @param name Group name to activate; empty targets the anonymous group.
193 */
194 void select_group(std::string name);
195
196 /**
197 * @brief Placeholder for interactive staleness confirmation.
198 *
199 * The current implementation is non-interactive and always returns false.
200 * A future version is expected to prompt the user when cached data is
201 * stale and return the user's decision.
202 *
203 * @param id Entity identifier under consideration.
204 * @param kind Detected kind of the entity.
205 * @param age Age of the cached entry.
206 * @return Currently always false; future behavior should reflect user
207 * confirmation.
208 */
209 static bool ask_update(
210 std::string_view id, corespace::entity_kind kind,
211 std::chrono::milliseconds age
212 );
213
214 /**
215 * @brief Decide whether an entity should be enqueued for fetching.
216 *
217 * This placeholder implementation always returns true, effectively
218 * requesting a fetch for every entity. The expected behavior is to consult
219 * storage state (`exist`, `last`) and return true only when an update is
220 * required.
221 *
222 * @param id Canonical identifier (e.g., "Q123" or "L7").
223 * @param kind Entity kind (lexeme for forms/senses).
224 * @return true if the caller should enqueue the entity; placeholder always
225 * true.
226 */
227 bool enqueue(
228 std::string_view id, corespace::entity_kind kind, bool interactive
229 ) const;
230
231 /**
232 * @brief Increment the touch counter for a single full ID (prefix
233 * REQUIRED).
234 *
235 * If the entity is already queued or already has data, returns false (no
236 * increment). If the counter reaches `candidates_threshold` and the entity
237 * is not queued, it is moved into the queue. For "L…-F…"/"L…-S…", the exact
238 * ID is enqueued (no mapping).
239 *
240 * @param id_with_prefix Full ID with prefix.
241 * @return true if the counter was incremented; false otherwise.
242 */
243 bool touch_entity(const std::string& id_with_prefix) noexcept;
244
245 /**
246 * @brief Enqueue a full (prefixed) ID string and add it to a group.
247 *
248 * The ID must include its prefix (e.g., "Q123", "L77-F2").
249 * Validation is performed via `identify()`. Invalid IDs cause an exception.
250 * For "L...-F..."/"L...-S...", the group receives the verbatim string while
251 * the batch queue stores the lexeme root ("L...") so fetches target the
252 * parent lexeme.
253 *
254 * @param id_with_prefix Full ID with prefix.
255 * @param force If true, bypass freshness/existence checks and enqueue
256 * anyway.
257 * @param name Group name; empty targets the current/anonymous group
258 * (auto-created if needed).
259 * @return The resulting size of the target group after insertion.
260 * @throws std::invalid_argument if the ID is invalid or has an unknown
261 * prefix.
262 */
263 size_t add_entity(
264 const std::string& id_with_prefix, bool force = false,
265 std::string name = ""
266 );
267 // Queues (batches) per batchable kind; elements are expected to be
268 // normalized IDs such as "Q123", "P45", "L7", "M9", or "E2". Forms and
269 // senses contribute their lexeme root ("L<id>").
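270 std::array<std::unordered_set<std::string>, batched_kind_count>
271 main_batches;
272 std::array<std::unordered_set<std::string>, batched_kind_count>
273 extra_batches;
274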
275 // Groups: group name -> set of entity IDs as provided by callers (verbatim;
276 // includes "L...-F..." and "L...-S...").
277 std::unordered_map<std::string, std::unordered_set<std::string>> groups;
278
279 // Touch candidates: full ID string -> touch count.
280 std::unordered_map<std::string, int> candidates;
281
282 // Thresholds (kept constant for now; make configurable later if needed).
283 const size_t batch_threshold
284 = 50; ///< Typical unauthenticated entity-per-request cap.
285 const int candidates_threshold
286 = 50; ///< Intentional high bar for curiosity-driven candidates.
287
288 // Current group name (private by design; anonymous groups cannot be
289 // addressed explicitly).
290 std::string current_group;
291 std::chrono::milliseconds staleness_threshold = 24h;
292 corespace::interface ui;
293 pheidippides phe_client;
294};
295}
296#endif // ARACHNE_ARACHNE_HPP
Accumulates entity IDs into per-kind batches and organizes groups.
Definition arachne.hpp:47
std::unordered_map< std::string, int > candidates
Definition arachne.hpp:280
std::array< std::unordered_set< std::string >, batched_kind_count > extra_batches
Definition arachne.hpp:273
bool touch_entity(const std::string &id_with_prefix) noexcept
Increment the touch counter for a single full ID (prefix REQUIRED).
Definition arachne.cpp:224
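The touch counter described for `touch_entity` can be sketched as follows. This is a minimal illustration, not the code in arachne.cpp: `touch_tracker_sketch` and its members are hypothetical, the "already has data" check is left out because storage is not modeled, and erasing the candidate on promotion is an assumption based on the phrase "moved into the queue".

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the touch bookkeeping; not the real arachne class.
struct touch_tracker_sketch {
    int candidates_threshold = 50;                    // same default as the header
    std::unordered_set<std::string> queued;           // IDs already in a batch queue
    std::unordered_map<std::string, int> candidates;  // full ID -> touch count

    // Returns true if the counter was incremented, false if the ID was
    // already queued. The "already has data" check is omitted in this sketch.
    bool touch(const std::string& id) {
        if (queued.count(id) != 0) {
            return false;
        }
        const int touches = ++candidates[id];
        if (touches >= candidates_threshold) {
            // Promote: the candidate leaves the counter map and joins the queue.
            candidates.erase(id);
            queued.insert(id);
        }
        return true;
    }
};

// Usage: with a threshold of 2, the second touch promotes the ID.
inline void touch_usage() {
    touch_tracker_sketch tracker;
    tracker.candidates_threshold = 2;
    assert(tracker.touch("Q42"));
    assert(tracker.touch("Q42"));
    assert(tracker.queued.count("Q42") == 1);
    assert(!tracker.touch("Q42"));
}
```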
static std::string entity_root(const std::string &id)
Extract the lexeme root from a full ID string.
Definition arachne.cpp:74
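One way to read the `entity_root` contract ("L" followed by digits yields "L<digits>", anything else yields an empty string) is sketched below. `entity_root_sketch` is a hypothetical name, and the sketch deliberately does not validate what follows the digits, which the real function may well do.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Sketch only: returns "L<digits>" for IDs such as "L7" or "L7-F1",
// and an empty string when the ID does not start with 'L' plus a digit.
// The suffix after the digits is NOT validated here (assumption).
inline std::string entity_root_sketch(const std::string& id) {
    if (id.size() < 2 || id[0] != 'L') {
        return {};
    }
    std::size_t end = 1;
    while (end < id.size()
           && std::isdigit(static_cast<unsigned char>(id[end])) != 0) {
        ++end;
    }
    if (end == 1) {
        return {};  // 'L' was not followed by any digit
    }
    return id.substr(0, end);
}

// Usage: forms and senses collapse to their parent lexeme.
inline void entity_root_usage() {
    assert(entity_root_sketch("L7-F1") == "L7");
    assert(entity_root_sketch("Q123").empty());
}
```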
std::string current_group
Definition arachne.hpp:290
int touch_ids(std::span< const int > ids, corespace::entity_kind kind)
Batch variant of touch for numeric IDs.
Definition arachne.cpp:59
static bool parse_id(const std::string &entity, size_t &pos, int &id)
Parse a full ID string and extract the numeric portion.
Definition arachne.cpp:149
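The in/out `pos` convention of `parse_id` is easier to see in code. The sketch below is an assumption about how such a helper could be written with `std::from_chars`; `parse_id_sketch` is a hypothetical name and the real implementation may differ.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string>
#include <system_error>

// Sketch only: parse the run of digits starting at `pos`. On success, `id`
// receives the value and `pos` is advanced past the number. On failure,
// `pos` and `id` are left untouched. Never throws.
inline bool parse_id_sketch(const std::string& entity, std::size_t& pos, int& id) {
    if (pos >= entity.size()) {
        return false;
    }
    const char* first = entity.data() + pos;
    const char* last = entity.data() + entity.size();
    if (*first < '0' || *first > '9') {
        return false;  // std::from_chars would otherwise accept a leading '-'
    }
    int value = 0;
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{}) {
        return false;  // e.g. the number does not fit in an int
    }
    id = value;
    pos = static_cast<std::size_t>(ptr - entity.data());
    return true;
}

// Usage: parse the lexeme number of "L7-F1", then the form number.
inline void parse_id_usage() {
    const std::string entity = "L7-F1";
    std::size_t pos = 1;  // index of the first digit after 'L'
    int id = 0;
    assert(parse_id_sketch(entity, pos, id) && id == 7 && pos == 2);
    pos = 4;  // skip "-F"
    assert(parse_id_sketch(entity, pos, id) && id == 1 && pos == 5);
}
```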
bool new_group(std::string name="")
Create or select a group and make it current.
Definition arachne.cpp:31
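The three cases of `new_group` (empty name, existing name, new name) can be illustrated with a small stand-in. Everything here is hypothetical: `group_book_sketch` is not the real class, and a counter-based name replaces the random name the header describes so that the example stays deterministic.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for the group bookkeeping.
struct group_book_sketch {
    std::unordered_map<std::string, std::unordered_set<std::string>> groups;
    std::string current_group;
    int anonymous_counter = 0;  // assumption: replaces the random name generator

    // Returns true if a new group was created, false if it already existed.
    // An existing group becomes current without being cleared.
    bool new_group(std::string name = "") {
        if (name.empty()) {
            do {
                name = "anon-" + std::to_string(anonymous_counter++);
            } while (groups.count(name) != 0);
        }
        const bool created = groups.try_emplace(name).second;
        current_group = std::move(name);
        return created;
    }
};

// Usage: re-selecting an existing group keeps its members.
inline void new_group_usage() {
    group_book_sketch book;
    assert(book.new_group("cats"));
    book.groups["cats"].insert("Q146");
    assert(!book.new_group("cats"));
    assert(book.groups.at("cats").size() == 1);
}
```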
size_t add_entity(const std::string &id_with_prefix, bool force=false, std::string name="")
Enqueue a full (prefixed) ID string and add it to a group.
Definition arachne.cpp:235
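The split described for `add_entity`, where the group keeps the verbatim ID while the batch queue stores the lexeme root, might look like the sketch below for lexeme-family IDs. `lexeme_intake_sketch` is a hypothetical type; validation via `identify()`, the `force` flag, and non-lexeme kinds are all omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical stand-in covering lexeme-family IDs only ("L7", "L7-F1", "L7-S2").
struct lexeme_intake_sketch {
    std::unordered_set<std::string> lexeme_queue;  // what would be fetched
    std::unordered_set<std::string> group;         // what the caller asked for

    // Returns the size of the group after insertion.
    std::size_t add(const std::string& id) {
        // Everything before the first '-' is the lexeme root; without a '-',
        // find() returns npos and substr() yields the whole ID.
        lexeme_queue.insert(id.substr(0, id.find('-')));
        group.insert(id);
        return group.size();
    }
};

// Usage: two forms of the same lexeme produce a single fetch target.
inline void add_entity_usage() {
    lexeme_intake_sketch intake;
    assert(intake.add("L7-F1") == 1);
    assert(intake.add("L7-F2") == 2);
    assert(intake.lexeme_queue.size() == 1);
}
```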
std::unordered_map< std::string, std::unordered_set< std::string > > groups
Definition arachne.hpp:277
std::chrono::milliseconds staleness_threshold
Definition arachne.hpp:291
bool enqueue(std::string_view id, corespace::entity_kind kind, bool interactive) const
Decide whether an entity should be enqueued for fetching.
Definition arachne.cpp:201
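The header says `enqueue` is a placeholder that should eventually consult storage state. A possible freshness rule is sketched below purely as an assumption: `needs_fetch_sketch`, `exists`, and `age` are hypothetical names, and the 24-hour default mirrors `staleness_threshold`.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical freshness rule; `exists` and `age` stand in for whatever the
// storage layer reports (the header only mentions `exist` and `last`).
inline bool needs_fetch_sketch(
    bool exists, std::chrono::milliseconds age,
    std::chrono::milliseconds staleness_threshold = std::chrono::hours(24)
) {
    if (!exists) {
        return true;  // never fetched: always enqueue
    }
    return age > staleness_threshold;  // cached: enqueue only when stale
}

// Usage: a one-hour-old entry is fresh, a two-day-old entry is stale.
inline void needs_fetch_usage() {
    assert(!needs_fetch_sketch(true, std::chrono::hours(1)));
    assert(needs_fetch_sketch(true, std::chrono::hours(48)));
}
```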
const size_t batch_threshold
Typical unauthenticated entity-per-request cap.
Definition arachne.hpp:284
pheidippides phe_client
Definition arachne.hpp:293
const int candidates_threshold
Intentional high bar for curiosity-driven candidates.
Definition arachne.hpp:286
static std::string normalize(int id, corespace::entity_kind kind)
Normalize a numeric ID with the given kind to a prefixed string.
Definition arachne.cpp:165
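The mapping documented for `normalize` can be written out as a short sketch. The enum below is a hypothetical stand-in for `corespace::entity_kind` (its real layout in utils.hpp is assumed, not confirmed), and `normalize_sketch` only mirrors the examples and error conditions listed in the header.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for corespace::entity_kind.
enum class entity_kind {
    item, property, lexeme, mediainfo, entity_schema, form, sense, any, unknown
};

// Sketch of the documented mapping: prefix the number with the kind letter;
// form and sense collapse to the lexeme prefix.
inline std::string normalize_sketch(int id, entity_kind kind) {
    if (id < 0) {
        throw std::invalid_argument("id must not be negative");
    }
    char prefix = '\0';
    switch (kind) {
    case entity_kind::item: prefix = 'Q'; break;
    case entity_kind::property: prefix = 'P'; break;
    case entity_kind::lexeme:
    case entity_kind::form:
    case entity_kind::sense: prefix = 'L'; break;
    case entity_kind::mediainfo: prefix = 'M'; break;
    case entity_kind::entity_schema: prefix = 'E'; break;
    default: throw std::invalid_argument("kind must not be any/unknown");
    }
    return prefix + std::to_string(id);
}

// Usage: the examples from the header comment.
inline void normalize_usage() {
    assert(normalize_sketch(123, entity_kind::item) == "Q123");
    assert(normalize_sketch(7, entity_kind::sense) == "L7");
}
```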
static bool ask_update(std::string_view id, corespace::entity_kind kind, std::chrono::milliseconds age)
Placeholder for interactive staleness confirmation.
Definition arachne.cpp:194
void select_group(std::string name)
Select an existing group or create it on demand.
Definition arachne.cpp:184
std::array< std::unordered_set< std::string >, batched_kind_count > main_batches
Definition arachne.hpp:271
int queue_size(corespace::entity_kind kind) const noexcept
Get the number of queued (pending) entities tracked in the main batch containers.
Definition arachne.cpp:107
corespace::interface ui
Definition arachne.hpp:292
static corespace::entity_kind identify(const std::string &entity) noexcept
Determine the kind of a full ID string.
Definition arachne.cpp:122
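A classifier in the spirit of `identify` is sketched below. It assumes the ID grammar implied by the header's examples (one prefix letter, digits, and for lexemes an optional "-F<digits>" or "-S<digits>" suffix); the enum and both function names are hypothetical, and the real function may accept or reject other shapes.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for corespace::entity_kind.
enum class entity_kind {
    item, property, lexeme, mediainfo, entity_schema, form, sense, any, unknown
};

// Consume one or more digits starting at `pos`; false if there are none.
inline bool skip_digits_sketch(std::string_view s, std::size_t& pos) noexcept {
    const std::size_t start = pos;
    while (pos < s.size()
           && std::isdigit(static_cast<unsigned char>(s[pos])) != 0) {
        ++pos;
    }
    return pos > start;
}

inline entity_kind identify_sketch(std::string_view entity) noexcept {
    std::size_t pos = 1;
    if (entity.empty() || !skip_digits_sketch(entity, pos)) {
        return entity_kind::unknown;
    }
    const bool at_end = pos == entity.size();
    switch (entity[0]) {
    case 'Q': return at_end ? entity_kind::item : entity_kind::unknown;
    case 'P': return at_end ? entity_kind::property : entity_kind::unknown;
    case 'M': return at_end ? entity_kind::mediainfo : entity_kind::unknown;
    case 'E': return at_end ? entity_kind::entity_schema : entity_kind::unknown;
    case 'L': break;  // lexemes may carry a form/sense suffix
    default: return entity_kind::unknown;
    }
    if (at_end) {
        return entity_kind::lexeme;
    }
    // Expect "-F<digits>" or "-S<digits>" and nothing after it.
    if (entity.size() < pos + 3 || entity[pos] != '-') {
        return entity_kind::unknown;
    }
    const char tag = entity[pos + 1];
    pos += 2;
    if (!skip_digits_sketch(entity, pos) || pos != entity.size()) {
        return entity_kind::unknown;
    }
    if (tag == 'F') { return entity_kind::form; }
    if (tag == 'S') { return entity_kind::sense; }
    return entity_kind::unknown;
}

// Usage: the two examples named in the header comment.
inline void identify_usage() {
    assert(identify_sketch("Q123") == entity_kind::item);
    assert(identify_sketch("L77-F2") == entity_kind::form);
}
```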
bool flush(corespace::entity_kind kind=corespace::entity_kind::any)
Flush (send) up to batch_threshold entities of a specific kind.
Definition arachne.cpp:99
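The two flush modes (a single capped batch for one kind, round-robin for `any`) can be sketched as follows. This is an illustration under assumptions: `flusher_sketch` is hypothetical, a vector stands in for the network client, kinds are plain indices, and the round-robin order shown is only one plausible reading of the header.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

constexpr std::size_t kind_count_sketch = 7;  // mirrors batched_kind_count

struct flusher_sketch {
    std::size_t batch_threshold = 50;
    std::array<std::unordered_set<std::string>, kind_count_sketch> batches;
    std::size_t next_kind = 0;      // where the next round-robin scan starts
    std::vector<std::string> sent;  // stand-in for the HTTP client

    // Move up to batch_threshold IDs of one kind out of its queue.
    bool flush_kind(std::size_t kind) {
        auto& queue = batches[kind];
        std::size_t taken = 0;
        for (auto it = queue.begin();
             it != queue.end() && taken < batch_threshold; ++taken) {
            sent.push_back(*it);
            it = queue.erase(it);
        }
        return taken > 0;
    }

    // Round-robin: flush the first non-empty kind at or after next_kind.
    bool flush_any() {
        for (std::size_t step = 0; step < kind_count_sketch; ++step) {
            const std::size_t kind = (next_kind + step) % kind_count_sketch;
            if (flush_kind(kind)) {
                next_kind = (kind + 1) % kind_count_sketch;
                return true;
            }
        }
        return false;
    }
};

// Usage: an empty flusher has nothing to send.
inline void flush_usage() {
    flusher_sketch flusher;
    assert(!flusher.flush_any());
    flusher.batches[0].insert("Q1");
    assert(flusher.flush_any());
    assert(flusher.sent.size() == 1);
}
```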
size_t add_ids(std::span< const int > ids, corespace::entity_kind kind, std::string name="")
Enqueue numeric IDs with a given kind and add them to a group.
Definition arachne.cpp:42
Batch courier for Wikidata/Commons: collects IDs, issues HTTP requests, and returns a merged JSON pay...
static constexpr std::string prefixes
Definition arachne.cpp:29
constexpr std::size_t batched_kind_count
Number of batchable kinds (Q, P, L, M, E, form, sense).
Definition arachne.hpp:33
entity_kind
Wikidata entity kind.
Definition utils.hpp:46
@ any
API selector (e.g., flush(any)); not directly batchable.
Definition utils.hpp:54
@ lexeme
IDs prefixed with 'L'.
Definition utils.hpp:49
@ form
Lexeme form IDs such as "L<lexeme>-F<form>".
Definition utils.hpp:52
@ unknown
Unrecognized/invalid identifier.
Definition utils.hpp:55
@ sense
Lexeme sense IDs such as "L<lexeme>-S<sense>".
Definition utils.hpp:53
std::string random_hex(const std::size_t n)
Return exactly n random hexadecimal characters (lowercase).
Definition rng.cpp:33
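A helper with the contract stated for `random_hex` (exactly n lowercase hexadecimal characters) could be written as below. This is a sketch built on the standard `<random>` facilities, not the code in rng.cpp; `random_hex_sketch` is a hypothetical name and the choice of engine is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Sketch only: n characters drawn uniformly from "0123456789abcdef".
inline std::string random_hex_sketch(std::size_t n) {
    static constexpr char digits[] = "0123456789abcdef";
    std::random_device seed;
    std::mt19937 engine(seed());
    std::uniform_int_distribution<int> pick(0, 15);
    std::string out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(digits[pick(engine)]);
    }
    return out;
}

// Usage: e.g. a suffix for an anonymous group name.
inline void random_hex_usage() {
    const std::string suffix = random_hex_sketch(8);
    assert(suffix.size() == 8);
}
```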