Arachne 1.0
Arachne - the perpetual stitcher of Wikidata entities.
arachnespace::arachne Class Reference

Accumulates entity IDs into per-kind batches and organizes groups. More...

#include <include/arachne.hpp>

Private Member Functions

void select_group (std::string name)
 Select an existing group or create it on demand.
bool enqueue (std::string_view id, corespace::entity_kind kind, bool interactive) const
 Decide whether an entity should be enqueued for fetching.
bool touch_entity (const std::string &id_with_prefix) noexcept
 Increment the touch counter for a single full ID (prefix REQUIRED).
size_t add_entity (const std::string &id_with_prefix, bool force=false, std::string name="")
 Enqueue a full (prefixed) ID string and add it to a group.

Static Private Member Functions

static bool ask_update (std::string_view id, corespace::entity_kind kind, std::chrono::milliseconds age)
 Placeholder for interactive staleness confirmation.

Private Attributes

std::array< std::unordered_set< std::string >, batched_kind_count > main_batches
std::array< std::unordered_set< std::string >, batched_kind_count > extra_batches
std::unordered_map< std::string, std::unordered_set< std::string > > groups
std::unordered_map< std::string, int > candidates
const size_t batch_threshold = 50
 Typical unauthenticated entity-per-request cap.
const int candidates_threshold = 50
 Intentional high bar for curiosity-driven candidates.
std::string current_group
std::chrono::milliseconds staleness_threshold = 24h
corespace::interface ui = corespace::interface::command_line
pheidippides phe_client

Public API

bool new_group (std::string name="")
 Create or select a group and make it current.
size_t add_ids (std::span< const int > ids, corespace::entity_kind kind, std::string name="")
 Enqueue numeric IDs with a given kind and add them to a group.
int touch_ids (std::span< const int > ids, corespace::entity_kind kind)
 Batch variant of touch for numeric IDs.
bool flush (corespace::entity_kind kind=corespace::entity_kind::any)
 Flush (send) up to batch_threshold entities of a specific kind.
int queue_size (corespace::entity_kind kind) const noexcept
 Get the number of queued (pending) entities tracked in the main batch containers.
static std::string entity_root (const std::string &id)
 Extract the lexeme root from a full ID string.
static corespace::entity_kind identify (const std::string &entity) noexcept
 Determine the kind of a full ID string.
static bool parse_id (const std::string &entity, size_t &pos, int &id)
 Parse a full ID string and extract the numeric portion.
static std::string normalize (int id, corespace::entity_kind kind)
 Normalize a numeric ID with the given kind to a prefixed string.

Detailed Description

Accumulates entity IDs into per-kind batches and organizes groups.

Invariants:

  • Queues store normalized ID strings per kind ("Q123", "P45", "L7", "M9", "E2", "L7-F1", "L7-S2").
  • For numeric add/touch with kind = form or sense, normalization produces "L<id>" because numeric IDs for forms/senses are not representable (no warning is emitted yet); string APIs keep the exact ID.
  • Deduplication is by string identity in the respective containers.
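
The invariants above can be illustrated with a minimal sketch. The enum `kind`, the prefix table "QPLME", and the helper `prefixed` are assumptions made for this example; they mirror the documented examples but are not taken from Arachne's sources.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Assumed kind order: the five prefixed kinds first, then form and sense.
enum class kind { item, property, lexeme, mediainfo, entity_schema, form, sense };

// Hypothetical counterpart of normalize(): prefix character + decimal number.
inline std::string prefixed(int id, kind k) {
    if (id < 0) {
        throw std::invalid_argument("id must be non-negative");
    }
    constexpr std::string_view table = "QPLME";  // assumed prefix table
    auto idx = static_cast<std::size_t>(k);
    if (idx >= table.size()) {
        // form and sense have no numeric representation: fall back to lexeme
        idx = static_cast<std::size_t>(kind::lexeme);
    }
    return table[idx] + std::to_string(id);
}
```

Because `prefixed(7, kind::form)` and `prefixed(7, kind::lexeme)` produce the same string, inserting both into a `std::unordered_set<std::string>` leaves a single element, which is the string-identity deduplication the last invariant refers to.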

Definition at line 47 of file arachne.hpp.

Member Function Documentation

◆ add_entity()

size_t arachnespace::arachne::add_entity ( const std::string & id_with_prefix,
bool force = false,
std::string name = "" )
private

Enqueue a full (prefixed) ID string and add it to a group.

The ID must include its prefix (e.g., "Q123", "L77-F2"). Validation is performed via identify(). Invalid IDs cause an exception. For "L...-F..."/"L...-S...", the group receives the verbatim string while the batch queue stores the lexeme root ("L...") so fetches target the parent lexeme.

Parameters
  • id_with_prefix: Full ID with prefix.
  • force: If true, bypass freshness/existence checks and enqueue anyway.
  • name: Group name; empty targets the current/anonymous group (auto-created if needed).
Returns
The resulting size of the target group after insertion.
Exceptions
  • std::invalid_argument: if the ID is invalid or has an unknown prefix.
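
The split between what the group stores and what the queue stores can be sketched as follows. This is a simplified stand-in, not the class itself: `mini_stitcher`, `lexeme_root`, and `add` are hypothetical names, and the helper performs no validation, unlike entity_root().

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical helper: drops a "-F<n>" / "-S<n>" suffix if one is present.
inline std::string lexeme_root(const std::string& id) {
    const std::size_t dash = id.find('-');
    return dash == std::string::npos ? id : id.substr(0, dash);
}

// Hypothetical stand-in for one group plus one pending queue.
struct mini_stitcher {
    std::unordered_set<std::string> group;  // verbatim IDs as supplied
    std::unordered_set<std::string> queue;  // canonical IDs to fetch

    // Returns the group size after insertion, as add_entity() is documented to do.
    std::size_t add(const std::string& id) {
        group.insert(id);
        queue.insert(lexeme_root(id));
        return group.size();
    }
};
```

Adding "L7-F1" and "L7-S2" to such an object yields a group of two entries and a queue holding only "L7", so the parent lexeme would be fetched once.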

Definition at line 235 of file arachne.cpp.

size_t arachne::add_entity(
    const std::string& id_with_prefix, bool force, std::string name
) {
    const std::string canonical = entity_root(id_with_prefix);
    select_group(std::move(name));
    auto& group = groups[current_group];
    group.insert(id_with_prefix);
    if (corespace::entity_kind kind = identify(canonical);
        force || enqueue(canonical, kind, ui == corespace::interface::command_line)) {
        auto& pool = main_batches[static_cast<size_t>(kind)];
        pool.insert(canonical);
        if (pool.size() >= batch_threshold) {
            flush(kind);
        }
    }
    return group.size();
}
corespace::entity_kind: Wikidata entity kind. Definition: utils.hpp:47

References entity_root(), flush(), identify(), and select_group().

◆ add_ids()

size_t arachnespace::arachne::add_ids ( std::span< const int > ids,
corespace::entity_kind kind,
std::string name = "" )

Enqueue numeric IDs with a given kind and add them to a group.

Numeric IDs are normalized by adding the kind prefix.

  • If kind is form or sense, normalization maps to the lexeme prefix ("L<id>"); no warning is emitted yet (logging TODO).
  • Freshness checks are stubbed; the helper enqueue always asks for a fetch, and the underlying sets deduplicate repeated IDs automatically.
Parameters
  • ids: Span of numeric IDs.
  • kind: Entity kind (must NOT be any/unknown).
  • name: Group name; empty targets the current/anonymous group (auto-created if needed).
Returns
The resulting size of the target group after insertions.
Exceptions
  • std::invalid_argument: if kind is any/unknown.

Definition at line 42 of file arachne.cpp.

size_t arachne::add_ids(
    std::span<const int> ids, corespace::entity_kind kind, std::string name
) {
    if (kind == corespace::entity_kind::any
        || kind == corespace::entity_kind::unknown) {
        throw std::invalid_argument("unknown kind of numeric IDs");
    }
    select_group(std::move(name));
    size_t last_size = groups[current_group].size();
    for (const int id : ids) {
        std::string id_with_prefix = normalize(id, kind);
        last_size = add_entity(id_with_prefix, false, current_group);
    }
    return last_size;
}
corespace::entity_kind::any: API selector (e.g., flush(any)); not directly batchable. Definition: utils.hpp:55
corespace::entity_kind::unknown: Unrecognized/invalid identifier. Definition: utils.hpp:56

References corespace::any, select_group(), and corespace::unknown.

◆ ask_update()

bool arachnespace::arachne::ask_update ( std::string_view id,
corespace::entity_kind kind,
std::chrono::milliseconds age )
staticprivate

Placeholder for interactive staleness confirmation.

The current implementation is non-interactive and always returns false. A future version is expected to prompt the user when cached data is stale and return the user's decision.

Parameters
  • id: Entity identifier under consideration.
  • kind: Detected kind of the entity.
  • age: Age of the cached entry.
Returns
Currently always false; future behavior should reflect user confirmation.

Definition at line 194 of file arachne.cpp.

bool arachne::ask_update(
    std::string_view id, corespace::entity_kind kind, std::chrono::milliseconds age
) {
    // UI/UX: todo: ask user if update is needed
    return false;
}

◆ enqueue()

bool arachnespace::arachne::enqueue ( std::string_view id,
corespace::entity_kind kind,
bool interactive ) const
private

Decide whether an entity should be enqueued for fetching.

This placeholder implementation always returns true, effectively requesting a fetch for every entity. The expected behavior is to consult storage state (exist, last) and return true only when an update is required.

Parameters
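  • id: Canonical identifier (e.g., "Q123" or "L7").
  • kind: Entity kind (lexeme for forms/senses).
  • interactive: If true, a cached entry that is not yet stale is referred to ask_update() for confirmation.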
Returns
true if the caller should enqueue the entity; placeholder always true.
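
The expected decision can be sketched as a pure function once storage state is available. The name `needs_fetch` and the `last_fetch_age` parameter are hypothetical; the interactive confirmation step is left out, and the sketch assumes storage can report how long ago an entity was last fetched.

```cpp
#include <chrono>
#include <optional>

using ms = std::chrono::milliseconds;

// Hypothetical decision helper. `last_fetch_age` is empty when the entity
// has never been stored; otherwise it holds the age of the cached entry.
inline bool needs_fetch(std::optional<ms> last_fetch_age, ms staleness_threshold) {
    if (!last_fetch_age) {
        return true;  // nothing cached yet: always fetch
    }
    return *last_fetch_age > staleness_threshold;  // fetch only stale entries
}
```

With the documented 24h threshold, an entry cached 25 hours ago would be refetched and one cached an hour ago would not.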

Definition at line 201 of file arachne.cpp.

bool arachne::enqueue(
    std::string_view id, corespace::entity_kind kind, bool interactive
) const {
    // ariadne.entity_status(id)
    auto [exist, last] = std::pair<bool, long long>(false, -1);
    if (!exist || last < 0) {
        return true;
    }
    const auto now_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                            std::chrono::system_clock::now().time_since_epoch()
    )
                            .count();
    const auto age = std::chrono::milliseconds { now_ms - last };
    if (age > staleness_threshold) {
        return true;
    }
    if (interactive) {
        return ask_update(id, kind, age);
    }
    return false;
}

◆ entity_root()

std::string arachnespace::arachne::entity_root ( const std::string & id)
static

Extract the lexeme root from a full ID string.

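For form and sense IDs ("L<digits>-F<digits>", "L<digits>-S<digits>"), returns the parent lexeme "L<digits>". Any other valid ID is returned unchanged. Invalid IDs cause an exception.

Parameters
  • id: Identifier to inspect (e.g., "L7-F1").
Returns
Lexeme root ("L7") for forms and senses; otherwise id itself.
Exceptions
  • std::invalid_argument: if identify() reports any/unknown for id, or if the lexeme prefix or numeric part cannot be parsed.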

Definition at line 74 of file arachne.cpp.

std::string arachne::entity_root(const std::string& id) {
    const corespace::entity_kind kind = identify(id);
    if (kind == corespace::entity_kind::any
        || kind == corespace::entity_kind::unknown) {
        throw std::invalid_argument("invalid or unknown entity kind");
    }

    if (kind == corespace::entity_kind::form
        || kind == corespace::entity_kind::sense) {
        if (id.size() < 2 || id.front() != 'L') {
            throw std::invalid_argument(
                "bad root-lexeme prefix of the entity: " + id
            );
        }
        int val {};
        if (size_t pos = 1; !parse_id(id, pos, val)) {
            throw std::invalid_argument(
                "bad numeric identifier of the entity: " + id
            );
        }
        return "L" + std::to_string(val);
    }
    return id;
}
corespace::entity_kind::form: Lexeme form IDs such as "L<lexeme>-F<form>". Definition: utils.hpp:53
corespace::entity_kind::sense: Lexeme sense IDs such as "L<lexeme>-S<sense>". Definition: utils.hpp:54

References corespace::any, corespace::form, identify(), corespace::sense, and corespace::unknown.

Referenced by add_entity(), and touch_entity().

◆ flush()

bool arachnespace::arachne::flush ( corespace::entity_kind kind = corespace::entity_kind::any)

Flush (send) up to batch_threshold entities of a specific kind.

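For kind != any, attempts a single-batch flush for that kind (up to the threshold). For kind == any, a round-robin strategy over batchable kinds is intended; the listed implementation indexes main_batches directly with kind and does not yet special-case any.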

Parameters
  • kind: Entity kind selector or entity_kind::any.
Returns
true if at least one entity was flushed; false otherwise.
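
One way to realize a round-robin choice over batchable kinds is sketched below. This is an illustration only: `kind_count` (assumed to be 5), the alias `batches`, and the helper `next_batch` are hypothetical and do not come from the Arachne sources.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_set>

constexpr std::size_t kind_count = 5;  // assumed number of batchable kinds
using batches = std::array<std::unordered_set<std::string>, kind_count>;

// Returns the index of the next non-empty batch after `last`, wrapping
// around, or std::nullopt when every batch is empty.
inline std::optional<std::size_t> next_batch(const batches& all, std::size_t last) {
    for (std::size_t step = 1; step <= kind_count; ++step) {
        const std::size_t idx = (last + step) % kind_count;
        if (!all[idx].empty()) {
            return idx;
        }
    }
    return std::nullopt;
}
```

A caller would remember the index it flushed last and pass it back in, so that a kind with a constantly refilled batch cannot starve the others.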

Definition at line 99 of file arachne.cpp.

bool arachne::flush(corespace::entity_kind kind) {
    const auto& batch = main_batches[static_cast<size_t>(kind)];
    const size_t size = batch.size();
    auto data = phe_client.fetch_json(batch, kind);
    // ariadne.store(data);
    return size > batch.size();
}

Referenced by add_entity().

◆ identify()

corespace::entity_kind arachnespace::arachne::identify ( const std::string & entity)
staticnoexcept

Determine the kind of a full ID string.

Accepts prefixed IDs (e.g., "Q123", "L77-F2"). Returns unknown if the string is not a valid ID. The function does not throw.

Parameters
  • entity: Full ID with prefix.
Returns
Detected kind (may be unknown).
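
The accepted ID grammar can also be expressed as a regular expression. The sketch below swaps in `std::regex` for the hand-written parser, so it is an alternative technique rather than the documented implementation. The enum `kind`, the function `classify`, and the prefix set "QPLME" are assumptions; unlike identify(), this version is not noexcept and does not reject numbers that overflow int.

```cpp
#include <regex>
#include <string>

// Assumed subset of entity kinds for this example.
enum class kind { item, property, lexeme, mediainfo, entity_schema, form, sense, unknown };

// Hypothetical regex-based classifier for prefixed IDs.
inline kind classify(const std::string& id) {
    // prefix, canonical number, optional "-F<n>" or "-S<n>" suffix
    static const std::regex pattern(
        R"(([QPLME])(0|[1-9][0-9]*)(?:-([FS])(0|[1-9][0-9]*))?)");
    std::smatch m;
    if (!std::regex_match(id, m, pattern)) {
        return kind::unknown;
    }
    const char prefix = m[1].str()[0];
    if (m[3].matched) {
        // Only lexemes may carry a form or sense suffix.
        if (prefix != 'L') {
            return kind::unknown;
        }
        return m[3].str() == "F" ? kind::form : kind::sense;
    }
    switch (prefix) {
        case 'Q': return kind::item;
        case 'P': return kind::property;
        case 'L': return kind::lexeme;
        case 'M': return kind::mediainfo;
        default:  return kind::entity_schema;  // 'E'
    }
}
```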

Definition at line 122 of file arachne.cpp.

corespace::entity_kind arachne::identify(const std::string& entity) noexcept {
    if (entity.size() < 2) {
        return corespace::entity_kind::unknown;
    }
    size_t pos = 0;
    size_t kind = prefixes.find(entity[pos++]);
    int id {};
    if (kind == std::string::npos || !parse_id(entity, pos, id)) {
        return corespace::entity_kind::unknown;
    }
    if (pos == entity.size()) {
        return static_cast<corespace::entity_kind>(kind);
    }
    // Only lexemes may continue with "-F<n>" or "-S<n>".
    if (kind != static_cast<size_t>(corespace::entity_kind::lexeme)
        || pos >= entity.size() || entity[pos++] != '-'
        || pos >= entity.size()) {
        return corespace::entity_kind::unknown;
    }
    const char tag = entity[pos++];
    if ((tag != 'F' && tag != 'S') || !parse_id(entity, pos, id)
        || pos != entity.size()) {
        return corespace::entity_kind::unknown;
    }
    return tag == 'F' ? corespace::entity_kind::form
                      : corespace::entity_kind::sense;
}
arachnespace::prefixes: static constexpr std::string. Definition: arachne.cpp:29
corespace::entity_kind::lexeme: IDs prefixed with 'L'. Definition: utils.hpp:50

References corespace::form, corespace::lexeme, arachnespace::prefixes, corespace::sense, and corespace::unknown.

Referenced by add_entity(), entity_root(), and touch_entity().

◆ new_group()

bool arachnespace::arachne::new_group ( std::string name = "")

Create or select a group and make it current.

If name is empty, creates a new anonymous group with a random name and makes it current. If name exists, it becomes current but is NOT cleared. If it doesn't exist, the group is created and then selected.

Parameters
  • name: Group name or empty for an anonymous group.
Returns
true if a new group was created; false if the group already existed.
Note
The current group's name is intentionally not exposed; anonymous groups cannot be addressed explicitly.

Definition at line 31 of file arachne.cpp.

bool arachne::new_group(std::string name) {
    if (name.empty()) {
        do {
            name = "g_" + corespace::random_hex(8);
        } while (groups.contains(name));
    }
    auto [it, inserted] = groups.try_emplace(name);
    current_group = it->first;
    return inserted;
}
corespace::random_hex(const std::size_t n): Return exactly n random hexadecimal characters (lowercase). Definition: rng.cpp:33

References current_group, and corespace::random_hex().

Referenced by select_group().

◆ normalize()

std::string arachnespace::arachne::normalize ( int id,
corespace::entity_kind kind )
static

Normalize a numeric ID with the given kind to a prefixed string.

Examples:

  • (123, item) -> "Q123"
  • (45, property) -> "P45"
  • (7, lexeme) -> "L7"
  • (9, mediainfo) -> "M9"
  • (2, entity_schema) -> "E2"
  • (7, form) -> "L7" (mapped to lexeme)
  • (7, sense) -> "L7" (mapped to lexeme)
Parameters
  • id: Numeric identifier.
  • kind: Kind to prefix with (must not be any/unknown).
Returns
Prefixed ID string.
Exceptions
  • std::invalid_argument: if id is negative or kind is any/unknown.
Note
Form and sense identifiers are currently coerced to the lexeme prefix without emitting diagnostics; logging is a planned enhancement.

Definition at line 165 of file arachne.cpp.

std::string arachne::normalize(int id, corespace::entity_kind kind) {
    if (id < 0) {
        throw std::invalid_argument("normalize: id must be non-negative");
    }
    if (kind == corespace::entity_kind::any
        || kind == corespace::entity_kind::unknown) {
        throw std::invalid_argument(
            "normalize: kind must be a concrete, known entity kind"
        );
    }
    auto idx = static_cast<std::size_t>(kind);
    if (idx >= static_cast<size_t>(corespace::entity_kind::form)) {
        // Numeric Form/Sense are not representable; map to lexeme.
        // TODO: emit warning via logging sink.
        idx = static_cast<size_t>(corespace::entity_kind::lexeme);
    }
    return prefixes[idx] + std::to_string(id);
}

References corespace::any, corespace::form, corespace::lexeme, and corespace::unknown.

◆ parse_id()

bool arachnespace::arachne::parse_id ( const std::string & entity,
size_t & pos,
int & id )
static

Parse a full ID string and extract the numeric portion.

Parameters
  • entity: Full ID (e.g., "Q123", "L7-F1", "L7-S2").
  • pos: In/out index of the first digit within entity. On success the index is advanced past the number.
  • id: Out parameter for the parsed integer portion.
Returns
true on successful parse; false otherwise. Never throws.
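
The same contract can be met without exceptions or a temporary substring by using `std::from_chars`. The sketch below is an alternative under stated assumptions, not the documented implementation: `parse_number` is a hypothetical name, it takes a `std::string_view`, and on failure it leaves `id` untouched instead of resetting it.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Hypothetical variant of parse_id() built on std::from_chars.
// Reads a canonical non-negative decimal starting at text[pos]; on success
// stores it in `id` and advances `pos` past the digits.
inline bool parse_number(std::string_view text, std::size_t& pos, int& id) {
    if (pos >= text.size()) {
        return false;
    }
    const char* first = text.data() + pos;
    const char* last = text.data() + text.size();
    // Reject signs and whitespace up front; from_chars would accept '-'.
    if (*first < '0' || *first > '9') {
        return false;
    }
    // Reject leading zeros such as "007"; a lone "0" is still accepted.
    if (*first == '0' && first + 1 != last && first[1] >= '0' && first[1] <= '9') {
        return false;
    }
    int value = 0;
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{}) {
        return false;  // covers values that do not fit in int
    }
    id = value;
    pos += static_cast<std::size_t>(ptr - first);
    return true;
}
```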

Definition at line 149 of file arachne.cpp.

bool arachne::parse_id(const std::string& entity, size_t& pos, int& id) {
    id = 0;
    size_t len = 0;
    try {
        id = std::stoi(entity.substr(pos), &len);
    } catch (...) {
        return false;
    }
    // Reject negatives and non-canonical spellings (leading zeros, signs, spaces).
    if (id < 0 || len == 0 || std::to_string(id).size() != len) {
        return false;
    }
    pos += len;
    return true;
}

◆ queue_size()

int arachnespace::arachne::queue_size ( corespace::entity_kind kind) const
noexcept

Get the number of queued (pending) entities tracked in the main batch containers.

Parameters
  • kind: Specific kind, or entity_kind::any to return the sum across all batchable kinds.
Returns
Count of queued entities for the requested kind, or 0 if kind has no batch container. Values are narrowed to int via static_cast; a count above INT_MAX wraps modulo 2^N under C++20 and is implementation-defined under earlier standards.
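
If wrapping is undesirable, the count could be clamped before narrowing. The helper below is a hypothetical sketch (`saturating_int` is not part of Arachne) of a saturating conversion from std::size_t to int.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: clamps a container size to the int range instead of
// relying on a narrowing static_cast.
constexpr int saturating_int(std::size_t n) noexcept {
    constexpr auto max = static_cast<std::size_t>(std::numeric_limits<int>::max());
    return n > max ? std::numeric_limits<int>::max() : static_cast<int>(n);
}
```

For realistic queue sizes the result is identical to the plain cast; only counts above INT_MAX differ, where this version reports INT_MAX.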

Definition at line 107 of file arachne.cpp.

int arachne::queue_size(corespace::entity_kind kind) const noexcept {
    if (kind == corespace::entity_kind::any) {
        std::size_t sum = 0;
        for (const auto& batch : main_batches) {
            sum += batch.size();
        }
        return static_cast<int>(sum);
    }
    const auto idx = static_cast<std::size_t>(kind);
    if (idx >= main_batches.size()) {
        return 0;
    }
    return static_cast<int>(main_batches[idx].size());
}

References corespace::any.

◆ select_group()

void arachnespace::arachne::select_group ( std::string name)
private

Select an existing group or create it on demand.

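An empty name keeps the current group, creating an anonymous group via new_group() only when no group is current. A non-empty name is delegated to new_group(), which creates the group if necessary and makes it current.

Parameters
  • name: Group name to activate; empty targets the current group (an anonymous one is created if none exists).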

Definition at line 184 of file arachne.cpp.

void arachne::select_group(std::string name) {
    if (name.empty()) {
        if (current_group.empty()) {
            new_group();
        }
        return;
    }
    new_group(std::move(name));
}

References current_group, and new_group().

Referenced by add_entity(), and add_ids().

◆ touch_entity()

bool arachnespace::arachne::touch_entity ( const std::string & id_with_prefix)
privatenoexcept

Increment the touch counter for a single full ID (prefix REQUIRED).

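Increments the counter for the ID. Once the counter reaches candidates_threshold, the canonical ID from entity_root() is inserted into extra_batches; for "L…-F…"/"L…-S…" this is the parent lexeme, not the exact ID. The intended behavior also skips entities that are already queued or already have data (returning false without incrementing); the listed implementation does not yet perform these checks.

Parameters
  • id_with_prefix: Full ID with prefix.
Returns
true if the counter has reached candidates_threshold and the entity was placed in the extra batch; false otherwise.
Note
The function is declared noexcept, yet entity_root() throws std::invalid_argument for invalid IDs, so passing an invalid ID terminates the program.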
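
A touch counter that skips already-queued IDs and promotes an ID once it reaches the threshold can be sketched as follows. `touch_tracker` and its members are hypothetical names; the "already has data" check is omitted because it depends on storage that is not shown here.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the candidate bookkeeping.
struct touch_tracker {
    int threshold = 50;                             // mirrors candidates_threshold
    std::unordered_map<std::string, int> counters;  // touches per ID
    std::unordered_set<std::string> queued;         // IDs already pending

    // Returns true if the counter was incremented.
    bool touch(const std::string& id) {
        if (queued.count(id) != 0) {
            return false;  // already pending: no increment
        }
        if (++counters[id] >= threshold) {
            queued.insert(id);  // promote the candidate to the queue
            counters.erase(id);
        }
        return true;
    }
};
```

Erasing the counter on promotion keeps the candidate map from growing with IDs that no longer need tracking; whether Arachne does the same is not stated in this page.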

Definition at line 224 of file arachne.cpp.

bool arachne::touch_entity(const std::string& id_with_prefix) noexcept {
    candidates[id_with_prefix]++;
    if (candidates[id_with_prefix] >= candidates_threshold) {
        const std::string canonical = entity_root(id_with_prefix);
        corespace::entity_kind kind = identify(canonical);
        extra_batches[static_cast<size_t>(kind)].insert(canonical);
        return true;
    }
    return false;
}

References entity_root(), and identify().

◆ touch_ids()

int arachnespace::arachne::touch_ids ( std::span< const int > ids,
corespace::entity_kind kind )

Batch variant of touch for numeric IDs.

Each numeric ID is normalized using kind. If kind is form/sense, normalization yields "L<id>" (lexeme); no warning is emitted yet (logging TODO).

Parameters
  • ids: Span of numeric IDs.
  • kind: Normalization kind (must not be any/unknown).
Returns
The number of entities for which touch_entity() returned true.
Exceptions
  • std::invalid_argument: if kind is any/unknown.

Definition at line 59 of file arachne.cpp.

int arachne::touch_ids(
    std::span<const int> ids, corespace::entity_kind kind
) {
    if (kind == corespace::entity_kind::any
        || kind == corespace::entity_kind::unknown) {
        throw std::invalid_argument("unknown kind of numeric IDs");
    }
    int added = 0;
    for (const int id : ids) {
        std::string id_with_prefix = normalize(id, kind);
        added += touch_entity(id_with_prefix);
    }
    return added;
}

References corespace::any, and corespace::unknown.

Member Data Documentation

◆ batch_threshold

const size_t arachnespace::arachne::batch_threshold = 50
private

Typical unauthenticated entity-per-request cap.

Definition at line 283 of file arachne.hpp.

◆ candidates

std::unordered_map<std::string, int> arachnespace::arachne::candidates
private

Definition at line 280 of file arachne.hpp.

◆ candidates_threshold

const int arachnespace::arachne::candidates_threshold = 50
private

Intentional high bar for curiosity-driven candidates.

Definition at line 285 of file arachne.hpp.

◆ current_group

std::string arachnespace::arachne::current_group
private

Definition at line 290 of file arachne.hpp.

Referenced by new_group(), and select_group().

◆ extra_batches

std::array<std::unordered_set<std::string>, batched_kind_count> arachnespace::arachne::extra_batches
private

Definition at line 273 of file arachne.hpp.

◆ groups

std::unordered_map<std::string, std::unordered_set<std::string> > arachnespace::arachne::groups
private

Definition at line 277 of file arachne.hpp.

◆ main_batches

std::array<std::unordered_set<std::string>, batched_kind_count> arachnespace::arachne::main_batches
private

Definition at line 271 of file arachne.hpp.

◆ phe_client

pheidippides arachnespace::arachne::phe_client
private

Definition at line 293 of file arachne.hpp.

◆ staleness_threshold

std::chrono::milliseconds arachnespace::arachne::staleness_threshold = 24h
private

Definition at line 291 of file arachne.hpp.

◆ ui

corespace::interface arachnespace::arachne::ui = corespace::interface::command_line
private

Definition at line 292 of file arachne.hpp.


The documentation for this class was generated from the following files: