Arachne 1.0
Arachne - the perpetual stitcher of Wikidata entities.
http_client.hpp
/*
 * The MIT License (MIT)
 *
 * Copyright (c) 2025 Yaroslav Riabtsev <yaroslav.riabtsev@rwth-aachen.de>
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included
 * in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#ifndef ARACHNE_HTTP_CLIENT_HPP
#define ARACHNE_HTTP_CLIENT_HPP

#include "utils.hpp"

#include <chrono>

namespace corespace {
/**
 * @class http_client
 * @brief Minimal, synchronous HTTP GET client built on libcurl.
 *
 * Responsibilities:
 *  - Build request URLs with encoded query parameters.
 *  - Issue HTTP GET requests with redirect following enabled.
 *  - Apply bounded exponential backoff with jitter for retryable outcomes:
 *    network errors, 408 (Request Timeout), 429 (Too Many Requests), and 5xx.
 *  - Aggregate lightweight, thread-safe network metrics.
 *
 * Lifetime and thread-safety:
 *  - A single easy handle (`CURL*`) is owned by the instance and reused
 *    across requests; therefore an `http_client` object is not thread-safe.
 *    Use one instance per calling thread.
 *  - `curl_global_init` is performed once per process via `std::call_once`.
 */
class http_client final {
public:
    /**
     * @brief Construct a client and initialize libcurl.
     *
     * Effects:
     *  - Ensures `curl_global_init` is called exactly once process-wide.
     *  - Creates an easy handle and installs default options: user agent,
     *    `Accept` header, redirect following, transparent decoding,
     *    timeouts, and signal suppression.
     *
     * @throws std::runtime_error if libcurl initialization fails or
     *         header allocation fails.
     */
    explicit http_client();

    /**
     * @brief Perform an HTTP GET to @p url with optional query @p params.
     *
     * Behavior:
     *  - Builds a `CURLU` URL with `params` URL-encoded and appended.
     *  - Executes the request; on non-2xx or transport errors, applies the
     *    retry policy up to `opt.max_retries` with jittered backoff and
     *    an optional server `Retry-After` hint.
     *  - On success (2xx + `CURLE_OK`) returns the populated response.
     *
     * Failure:
     *  - If all attempts fail with a libcurl error, throws
     *    `std::runtime_error("curl error: ...")`.
     *  - If all attempts return non-success HTTP codes, throws
     *    `std::runtime_error("http error: <status>")`.
     *
     * Metrics:
     *  - Updates `metrics` after each attempt (including retries).
     *
     * @param url Absolute or base URL.
     * @param params Optional list of query parameters to append.
     * @param override
     * @return http_response on success (2xx).
     * @throws std::runtime_error on terminal failure as described above.
     */
    http_response
    get(std::string_view url, const parameter_list& params = {},
        std::string_view override = {});
    http_response post_form(
        std::string_view url, const parameter_list& form,
        const parameter_list& query = {}, std::string_view override = {}
    );
    http_response post_raw(
        std::string_view url, std::string_view body,
        std::string_view content_type, const parameter_list& query = {},
        std::string_view override = {}
    );

    /**
     * @brief Access aggregated network metrics.
     * @return Const reference to the metrics snapshot.
     */
    [[nodiscard]] const network_metrics& metrics_info() const;

private:
    /// Unique pointer type for `CURLU` with proper deleter.
    using curl_url_ptr = std::unique_ptr<CURLU, decltype(&curl_url_cleanup)>;
    /**
     * @brief Construct a `CURLU` handle from @p url and append @p params.
     *
     * Each parameter is URL-encoded and appended via `CURLU_APPENDQUERY`.
     *
     * @param url Base URL.
     * @param params Query parameters.
     * @return Owning smart pointer to a configured `CURLU` handle.
     * @throws std::runtime_error if allocation or URL assembly fails.
     */
    static curl_url_ptr
    build_url(std::string_view url, const parameter_list& params);
    /**
     * @brief Execute a single HTTP GET using the prepared URL handle.
     *
     * Side effects:
     *  - Installs write callback to accumulate the response body.
     *  - Measures elapsed steady-clock time and returns it via @p elapsed.
     *  - Reads HTTP status and headers after the transfer.
     *
     * @param url_handle Prepared `CURLU` handle (owned by caller).
     * @param elapsed Out: time spent in `curl_easy_perform`.
     * @param override
     * @return Populated `http_response` (may carry a libcurl error).
     */
    http_response request_get(
        CURLU* url_handle, std::chrono::milliseconds& elapsed,
        std::string_view override = {}
    ) const;
    http_response request_post(
        CURLU* url_handle, std::chrono::milliseconds& elapsed,
        std::string_view content_type, std::string_view body,
        std::string_view override
    ) const;
    std::string build_form_body(const parameter_list& form) const;

    /**
     * @brief Refresh the header multimap from the last transfer.
     *
     * Enumerates headers via `curl_easy_nextheader` and fills
     * `response.header`.
     *
     * @param response Response object to update.
     */
    void update_headers(http_response& response) const;
    /**
     * @brief Update counters and histograms after an attempt.
     *
     * Increments `requests`, accumulates `network_ms`, bumps status
     * histogram (if within bounds), and adds to `bytes_received`.
     *
     * @param response Result of the attempt.
     * @param elapsed Duration spent in libcurl during the attempt.
     */
    void update_metrics(
        const http_response& response, std::chrono::milliseconds elapsed
    );
    /**
     * @brief Success predicate: transport OK and HTTP 2xx.
     * @param response Response to check.
     * @return true if `CURLE_OK` and 200 <= status < 300.
     */
    [[nodiscard]] static bool status_good(const http_response& response);
    /**
     * @brief Retry predicate for transient outcomes.
     *
     * Retries on:
     *  - any libcurl error (i.e., `!net_ok`),
     *  - HTTP 408 (Request Timeout),
     *  - HTTP 429 (Too Many Requests),
     *  - HTTP 5xx.
     *
     * @param response Response to inspect.
     * @param net_ok Whether the transport completed without libcurl error.
     * @return true if another attempt should be made.
     */
    [[nodiscard]] static bool
    status_retry(const http_response& response, bool net_ok);
    /**
     * @brief Compute the next backoff delay for @p attempt (1-based).
     *
     * Strategy: exponential backoff with full jitter. The base grows as
     * `retry_base_ms * 2^(attempt-1)` and a uniform random component in
     * `[0, base]` is added; the result is capped at `retry_max_ms`.
     *
     * @param attempt Attempt number starting from 1.
     * @return Milliseconds to sleep before the next attempt.
     */
    [[nodiscard]] long long next_delay(int attempt) const;
    /**
     * @brief Apply server-provided retry hint if present.
     *
     * If `CURLINFO_RETRY_AFTER` yields a non-negative value, interpret it
     * as seconds and raise @p sleep_ms to at least that many milliseconds.
     *
     * @param sleep_ms In/out: proposed client backoff in milliseconds.
     */
    void apply_server_retry_hint(long long& sleep_ms) const;

    /**
     * @brief libcurl write callback: append chunk to response body.
     * @param ptr Pointer to received data.
     * @param size Element size.
     * @param n Number of elements.
     * @param data `std::string*` accumulator (response body).
     * @return Number of bytes consumed (size * n).
     */
    static size_t
    write_callback(const char* ptr, size_t size, size_t n, void* data);

    const network_options opt {}; ///< Fixed options installed at construction.
    network_metrics metrics; ///< Aggregated metrics (atomic counters).
    mutable std::mutex mu;
    std::unique_ptr<CURL, decltype(&curl_easy_cleanup)> curl {
        nullptr, &curl_easy_cleanup
    }; ///< Reused easy handle (not thread-safe).
    std::unique_ptr<curl_slist, decltype(&curl_slist_free_all)> header_list {
        nullptr, &curl_slist_free_all
    }; ///< Owned request header list.
};
}
#endif // ARACHNE_HTTP_CLIENT_HPP
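
The backoff rule documented for `next_delay` can be illustrated with a small standalone sketch. It is not the project's implementation: `retry_options`, its default values, `next_delay_sketch`, and the injected random engine are all hypothetical names chosen for illustration. The sketch follows the formula as the comment states it (base plus a uniform component in `[0, base]`, capped at `retry_max_ms`).

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Hypothetical stand-in for the retry fields of network_options; the real
// struct lives in utils.hpp and its defaults are not shown in this listing.
struct retry_options {
    long long retry_base_ms = 200;  // assumed value, for illustration only
    long long retry_max_ms = 5000;  // assumed value, for illustration only
};

// Delay in milliseconds before the next attempt (attempt is 1-based):
// base = retry_base_ms * 2^(attempt-1), plus a uniform jitter in [0, base],
// with the result capped at retry_max_ms.
template <class Rng>
long long next_delay_sketch(const retry_options& opt, int attempt, Rng& rng) {
    assert(attempt >= 1);
    // Clamp the exponent so the doubling cannot overflow for large attempts.
    const int shift = std::min(attempt - 1, 20);
    // Capping the base early does not change the final (capped) result.
    const long long base
        = std::min(opt.retry_base_ms << shift, opt.retry_max_ms);
    std::uniform_int_distribution<long long> jitter(0, base);
    return std::min(base + jitter(rng), opt.retry_max_ms);
}
```

With the assumed defaults, the first attempt sleeps between 200 and 400 ms, and later attempts saturate at 5000 ms. Passing the engine in as a parameter keeps the function deterministic under a seeded generator, which makes it easy to test.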
Accumulates entity IDs into per-kind batches and organizes groups.
Definition arachne.hpp:47
std::unordered_map< std::string, int > candidates
Definition arachne.hpp:280
std::array< std::unordered_set< std::string >, batched_kind_count > extra_batches
Definition arachne.hpp:273
bool touch_entity(const std::string &id_with_prefix) noexcept
Increment the touch counter for a single full ID (prefix REQUIRED).
Definition arachne.cpp:224
static std::string entity_root(const std::string &id)
Extract the lexeme root from a full ID string.
Definition arachne.cpp:74
std::string current_group
Definition arachne.hpp:290
int touch_ids(std::span< const int > ids, corespace::entity_kind kind)
Batch variant of touch for numeric IDs.
Definition arachne.cpp:59
static bool parse_id(const std::string &entity, size_t &pos, int &id)
Parse a full ID string and extract the numeric portion.
Definition arachne.cpp:149
bool new_group(std::string name="")
Create or select a group and make it current.
Definition arachne.cpp:31
size_t add_entity(const std::string &id_with_prefix, bool force=false, std::string name="")
Enqueue a full (prefixed) ID string and add it to a group.
Definition arachne.cpp:235
std::unordered_map< std::string, std::unordered_set< std::string > > groups
Definition arachne.hpp:277
std::chrono::milliseconds staleness_threshold
Definition arachne.hpp:291
bool enqueue(std::string_view id, corespace::entity_kind kind, bool interactive) const
Decide whether an entity should be enqueued for fetching.
Definition arachne.cpp:201
const size_t batch_threshold
Typical unauthenticated entity-per-request cap.
Definition arachne.hpp:284
pheidippides phe_client
Definition arachne.hpp:293
const int candidates_threshold
Intentional high bar for curiosity-driven candidates.
Definition arachne.hpp:286
static std::string normalize(int id, corespace::entity_kind kind)
Normalize a numeric ID with the given kind to a prefixed string.
Definition arachne.cpp:165
static bool ask_update(std::string_view id, corespace::entity_kind kind, std::chrono::milliseconds age)
Placeholder for interactive staleness confirmation.
Definition arachne.cpp:194
void select_group(std::string name)
Select an existing group or create it on demand.
Definition arachne.cpp:184
std::array< std::unordered_set< std::string >, batched_kind_count > main_batches
Definition arachne.hpp:271
int queue_size(corespace::entity_kind kind) const noexcept
Get the number of queued (pending) entities tracked in the main batch containers.
Definition arachne.cpp:107
corespace::interface ui
Definition arachne.hpp:292
static corespace::entity_kind identify(const std::string &entity) noexcept
Determine the kind of a full ID string.
Definition arachne.cpp:122
bool flush(corespace::entity_kind kind=corespace::entity_kind::any)
Flush (send) up to batch_threshold entities of a specific kind.
Definition arachne.cpp:99
size_t add_ids(std::span< const int > ids, corespace::entity_kind kind, std::string name="")
Enqueue numeric IDs with a given kind and add them to a group.
Definition arachne.cpp:42
Batch courier for Wikidata/Commons: collects IDs, issues HTTP requests, and returns a merged JSON pay...
corespace::http_client client
Reused HTTP client (not thread-safe across threads).
static std::string join_str(std::span< const std::string > ids, std::string_view separator="|")
Join a span of strings with a separator (no encoding or validation).
const corespace::network_metrics & metrics_info() const
Access aggregated network metrics of the underlying client.
nlohmann::json fetch_json(const std::unordered_set< std::string > &batch, corespace::entity_kind kind=corespace::entity_kind::any)
Fetch metadata for a set of entity IDs and return a merged JSON object.
corespace::options opt
Request shaping parameters (chunking, fields, base params).
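
As a rough illustration of what `join_str` is described as doing (joining IDs with a separator, with no encoding or validation), here is a minimal sketch. The name `join_str_sketch` is hypothetical, and it takes a `std::vector` instead of the `std::span` in the real signature so that it compiles under C++17; the actual implementation is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Join ids with the given separator. No encoding or validation is done:
// the caller is responsible for passing well-formed IDs.
std::string join_str_sketch(
    const std::vector<std::string>& ids, std::string_view separator = "|"
) {
    std::string out;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        if (i != 0) {
            out.append(separator); // separator only between elements
        }
        out.append(ids[i]);
    }
    return out;
}
```

For example, joining `Q42`, `P31`, and `L7` with the default separator gives `Q42|P31|L7`, which is the pipe-separated form the MediaWiki API accepts for multi-value parameters.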
static bool status_retry(const http_response &response, bool net_ok)
Retry predicate for transient outcomes.
http_response post_form(std::string_view url, const parameter_list &form, const parameter_list &query={}, std::string_view override={})
std::unique_ptr< curl_slist, decltype(&curl_slist_free_all)> header_list
Owned request header list.
http_response request_post(CURLU *url_handle, std::chrono::milliseconds &elapsed, std::string_view content_type, std::string_view body, std::string_view override) const
void update_headers(http_response &response) const
Refresh the header multimap from the last transfer.
http_client()
Construct a client and initialize libcurl.
const network_metrics & metrics_info() const
Access aggregated network metrics.
network_metrics metrics
Aggregated metrics (atomic counters).
long long next_delay(int attempt) const
Compute the next backoff delay for attempt (1-based).
const network_options opt
Fixed options installed at construction.
http_response request_get(CURLU *url_handle, std::chrono::milliseconds &elapsed, std::string_view override={}) const
Execute a single HTTP GET using the prepared URL handle.
static curl_url_ptr build_url(std::string_view url, const parameter_list &params)
Construct a CURLU handle from url and append params.
static bool status_good(const http_response &response)
Success predicate: transport OK and HTTP 2xx.
http_response post_raw(std::string_view url, std::string_view body, std::string_view content_type, const parameter_list &query={}, std::string_view override={})
void apply_server_retry_hint(long long &sleep_ms) const
Apply server-provided retry hint if present.
std::unique_ptr< CURLU, decltype(&curl_url_cleanup)> curl_url_ptr
Unique pointer type for CURLU with proper deleter.
std::string build_form_body(const parameter_list &form) const
void update_metrics(const http_response &response, std::chrono::milliseconds elapsed)
Update counters and histograms after an attempt.
std::unique_ptr< CURL, decltype(&curl_easy_cleanup)> curl
Reused easy handle (not thread-safe).
static size_t write_callback(const char *ptr, size_t size, size_t n, void *data)
libcurl write callback: append chunk to response body.
http_response get(std::string_view url, const parameter_list &params={}, std::string_view override={})
Perform an HTTP GET to url with optional query params.
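
The success and retry predicates described for `status_good` and `status_retry` can be sketched without libcurl. `response_sketch` and its two fields are hypothetical, reduced stand-ins for `http_response` (whose real layout is defined in utils.hpp and not shown here); only the decision rules come from the documentation above.

```cpp
#include <cassert>

// Hypothetical, reduced stand-in for http_response.
struct response_sketch {
    long status_code = 0;      // HTTP status; 0 if no response was received
    bool transport_ok = false; // true when libcurl reported CURLE_OK
};

// Success: transport OK and 200 <= status < 300.
bool status_good_sketch(const response_sketch& r) {
    return r.transport_ok && r.status_code >= 200 && r.status_code < 300;
}

// Retry on any transport error, 408, 429, or any 5xx status.
bool status_retry_sketch(const response_sketch& r, bool net_ok) {
    if (!net_ok) {
        return true;
    }
    const long s = r.status_code;
    return s == 408 || s == 429 || (s >= 500 && s < 600);
}
```

Under these rules a 404 is neither good nor retryable, so a caller following the documented behavior of `get` would stop and report `http error: 404` without sleeping.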
static constexpr std::string prefixes
Definition arachne.cpp:29
constexpr std::size_t batched_kind_count
Number of batchable kinds (Q, P, L, M, E, form, sense).
Definition arachne.hpp:33
entity_kind
Wikidata entity kind.
Definition utils.hpp:46
@ any
API selector (e.g., flush(any)); not directly batchable.
Definition utils.hpp:54
@ lexeme
IDs prefixed with 'L'.
Definition utils.hpp:49
@ form
Lexeme form IDs such as "L<lexeme>-F<form>".
Definition utils.hpp:52
@ unknown
Unrecognized/invalid identifier.
Definition utils.hpp:55
@ sense
Lexeme sense IDs such as "L<lexeme>-S<sense>".
Definition utils.hpp:53
std::string random_hex(const std::size_t n)
Return exactly n random hexadecimal characters (lowercase).
Definition rng.cpp:33
Result object for an HTTP transfer.
Definition utils.hpp:145
Fixed runtime options for the HTTP client.
Definition utils.hpp:171
Configuration for fetching entities via MediaWiki/Wikibase API.
Definition utils.hpp:75