Arachne 1.0
Arachne - the perpetual stitcher of Wikidata entities.
http_client.hpp
/*
 * The MIT License (MIT)
 *
 * Copyright (c) 2025 Yaroslav Riabtsev <yaroslav.riabtsev@rwth-aachen.de>
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included
 * in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */

#ifndef ARACHNE_HTTP_CLIENT_HPP
#define ARACHNE_HTTP_CLIENT_HPP

#include "utils.hpp"

#include <chrono>

namespace corespace {
/**
 * @class http_client
 * @brief Minimal, synchronous HTTP GET client built on libcurl.
 *
 * Responsibilities:
 * - Build request URLs with encoded query parameters.
 * - Issue HTTP GET requests with redirect following enabled.
 * - Apply bounded exponential backoff with jitter for retryable outcomes:
 *   network errors, 408 (Request Timeout), 429 (Too Many Requests), and 5xx.
 * - Aggregate lightweight, thread-safe network metrics.
 *
 * Lifetime and thread-safety:
 * - A single easy handle (`CURL*`) is owned by the instance and reused
 *   across requests; therefore an `http_client` object is not thread-safe.
 *   Use one instance per calling thread.
 * - `curl_global_init` is performed once per process via `std::call_once`.
 */
class http_client final {
public:
    /**
     * @brief Construct a client and initialize libcurl.
     *
     * Effects:
     * - Ensures `curl_global_init` is called exactly once process-wide.
     * - Creates an easy handle and installs default options: user agent,
     *   `Accept` header, redirect following, transparent decoding,
     *   timeouts, and signal suppression.
     *
     * @throws std::runtime_error if libcurl initialization fails or
     * header allocation fails.
     */
    explicit http_client();

    /**
     * @brief Perform an HTTP GET to @p url with optional query @p params.
     *
     * Behavior:
     * - Builds a `CURLU` URL with `params` URL-encoded and appended.
     * - Executes the request; on transport errors or retryable HTTP
     *   statuses (408, 429, 5xx), applies the retry policy up to
     *   `opt.max_retries` with jittered backoff and an optional server
     *   `Retry-After` hint.
     * - On success (2xx + `CURLE_OK`) returns the populated response.
     *
     * Failure:
     * - If all attempts fail with a libcurl error, throws
     *   `std::runtime_error("curl error: ...")`.
     * - If all attempts return non-success HTTP codes, throws
     *   `std::runtime_error("http error: <status>")`.
     *
     * Metrics:
     * - Updates `metrics` after each attempt (including retries).
     *
     * @param url Absolute or base URL.
     * @param params Optional list of query parameters to append.
     * @param accept Optional Accept header value to override the default;
     * empty string uses the client's configured accept header.
     * @param timeout_sec Optional per-request timeout in seconds; if negative
     * the client's default timeout is used.
     * @return http_response on success (2xx).
     * @throws std::runtime_error on terminal failure as described above.
     */
    http_response
    get(std::string_view url, const parameter_list& params = {},
        std::string_view accept = {}, int timeout_sec = -1);
    /**
     * @brief Perform an HTTP POST with form-encoded body.
     *
     * Builds a URL from @p url and @p query, serializes @p form as
     * application/x-www-form-urlencoded and posts it. Retry behaviour and
     * metrics follow the same semantics as get().
     *
     * @param url Endpoint URL.
     * @param form Form key/value pairs to serialize in the body.
     * @param query Optional query parameters appended to the URL.
     * @param accept Optional Accept header override.
     * @param timeout_sec Optional per-request timeout in seconds; negative to
     * use default.
     * @return http_response on success (2xx).
     * @throws std::runtime_error on terminal failure.
     */
    http_response post_form(
        std::string_view url, const parameter_list& form,
        const parameter_list& query = {}, std::string_view accept = {},
        int timeout_sec = -1
    );
    /**
     * @brief Perform an HTTP POST with a raw body.
     *
     * Builds a URL from @p url and @p query and posts the raw @p body with
     * Content-Type set to @p content_type. Retry behaviour and metrics follow
     * the same semantics as get().
     *
     * @param url Endpoint URL.
     * @param body Raw request body to send.
     * @param content_type Content-Type header value for the body.
     * @param query Optional query parameters appended to the URL.
     * @param accept Optional Accept header override.
     * @param timeout_sec Optional per-request timeout in seconds; negative to
     * use default.
     * @return http_response on success (2xx).
     * @throws std::runtime_error on terminal failure.
     */
    http_response post_raw(
        std::string_view url, std::string_view body,
        std::string_view content_type, const parameter_list& query = {},
        std::string_view accept = {}, int timeout_sec = -1
    );

    /**
     * @brief Access aggregated network metrics.
     * @return Const reference to the metrics snapshot.
     */
    [[nodiscard]] const network_metrics& metrics_info() const;

private:
    /// Unique pointer type for `CURLU` with proper deleter.
    using curl_url_ptr = std::unique_ptr<CURLU, decltype(&curl_url_cleanup)>;
    /**
     * @brief Construct a `CURLU` handle from @p url and append @p params.
     *
     * Each parameter is URL-encoded and appended via `CURLU_APPENDQUERY`.
     *
     * @param url Base URL.
     * @param params Query parameters.
     * @return Owning smart pointer to a configured `CURLU` handle.
     * @throws std::runtime_error if allocation or URL assembly fails.
     */
    static curl_url_ptr
    build_url(std::string_view url, const parameter_list& params);
    /**
     * @brief Execute a single HTTP GET using the prepared URL handle.
     *
     * Side effects:
     * - Installs write callback to accumulate the response body.
     * - Measures elapsed steady-clock time and returns it via @p elapsed.
     * - Reads HTTP status and headers after the transfer.
     *
     * @param url_handle Prepared `CURLU` handle (owned by caller).
     * @param elapsed Out: time spent in `curl_easy_perform`.
     * @param accept Optional Accept header override; empty means use client
     * default.
     * @param timeout_sec Optional per-request timeout in seconds; negative to
     * use client default.
     * @return Populated `http_response` (may carry a libcurl error).
     */
    http_response request_get(
        CURLU* url_handle, std::chrono::milliseconds& elapsed,
        std::string_view accept = {}, int timeout_sec = -1
    ) const;

    /**
     * @brief Execute a single HTTP POST with given body and content type.
     *
     * Sets temporary headers (Content-Type and Accept), posts the body,
     * measures elapsed time in @p elapsed, reads status and headers, and
     * records any libcurl error message.
     *
     * @param url_handle Prepared `CURLU` handle (owned by caller).
     * @param elapsed Out: time spent in `curl_easy_perform`.
     * @param content_type Content-Type header value.
     * @param body Body bytes to send.
     * @param accept Optional Accept header override; empty means use
     * client default.
     * @param timeout_sec Optional per-request timeout in seconds; negative to
     * use client default.
     * @return Populated `http_response` (may carry a libcurl error).
     */
    http_response request_post(
        CURLU* url_handle, std::chrono::milliseconds& elapsed,
        std::string_view content_type, std::string_view body,
        std::string_view accept = {}, int timeout_sec = -1
    ) const;
    /// Serialize @p form as an application/x-www-form-urlencoded body.
    std::string build_form_body(const parameter_list& form) const;

    /**
     * @brief Refresh the header multimap from the last transfer.
     *
     * Enumerates headers via `curl_easy_nextheader` and fills
     * `response.header`.
     *
     * @param response Response object to update.
     */
    void update_headers(http_response& response) const;
    /**
     * @brief Update counters and histograms after an attempt.
     *
     * Increments `requests`, accumulates `network_ms`, bumps status
     * histogram (if within bounds), and adds to `bytes_received`.
     *
     * @param response Result of the attempt.
     * @param elapsed Duration spent in libcurl during the attempt.
     */
    void update_metrics(
        const http_response& response, std::chrono::milliseconds elapsed
    );
    /**
     * @brief Success predicate: transport OK and HTTP 2xx.
     * @param response Response to check.
     * @return true if `CURLE_OK` and 200 <= status < 300.
     */
    [[nodiscard]] static bool status_good(const http_response& response);
    /**
     * @brief Retry predicate for transient outcomes.
     *
     * Retries on:
     * - any libcurl error (i.e., `!net_ok`),
     * - HTTP 408 (Request Timeout),
     * - HTTP 429 (Too Many Requests),
     * - HTTP 5xx.
     *
     * @param response Response to inspect.
     * @param net_ok Whether the transport completed without libcurl error.
     * @return true if another attempt should be made.
     */
    [[nodiscard]] static bool
    status_retry(const http_response& response, bool net_ok);
    /**
     * @brief Compute the next backoff delay for @p attempt (1-based).
     *
     * Strategy: exponential backoff with full jitter. The base grows as
     * `retry_base_ms * 2^(attempt-1)` and a uniform random component in
     * `[0, base]` is added; the result is capped at `retry_max_ms`.
     *
     * @param attempt Attempt number starting from 1.
     * @return Milliseconds to sleep before the next attempt.
     */
    [[nodiscard]] long long next_delay(int attempt) const;
    /**
     * @brief Apply server-provided retry hint if present.
     *
     * If `CURLINFO_RETRY_AFTER` yields a non-negative value, interpret it
     * as seconds and raise @p sleep_ms to at least that many milliseconds.
     *
     * @param sleep_ms In/out: proposed client backoff in milliseconds.
     */
    void apply_server_retry_hint(long long& sleep_ms) const;

    /**
     * @brief libcurl write callback: append chunk to response body.
     * @param ptr Pointer to received data.
     * @param size Element size.
     * @param n Number of elements.
     * @param data `std::string*` accumulator (response body).
     * @return Number of bytes consumed (size * n).
     */
    static size_t
    write_callback(const char* ptr, size_t size, size_t n, void* data);

    const network_options opt {}; ///< Fixed options installed at construction.
    network_metrics metrics; ///< Aggregated metrics (atomic counters).
    mutable std::mutex mu;
    std::unique_ptr<CURL, decltype(&curl_easy_cleanup)> curl {
        nullptr, &curl_easy_cleanup
    }; ///< Reused easy handle (not thread-safe).
    std::unique_ptr<curl_slist, decltype(&curl_slist_free_all)> header_list {
        nullptr, &curl_slist_free_all
    }; ///< Owned request header list.
};
}
#endif // ARACHNE_HTTP_CLIENT_HPP
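
The retry policy described in the comments for `status_retry`, `next_delay`, and `apply_server_retry_hint` can be sketched in isolation. This is a minimal sketch under stated assumptions, not the library's implementation: the names `retry_base_ms` and `retry_max_ms` come from the `next_delay` comment, while the free functions `should_retry`, `backoff_delay_ms`, and `apply_retry_after`, their signatures, and the explicit random-generator parameter are hypothetical and exist only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Hypothetical stand-ins for the documented predicates; not the library's code.

// Retry on any transport error, HTTP 408, HTTP 429, and HTTP 5xx.
inline bool should_retry(long status, bool net_ok) {
    if (!net_ok) {
        return true;
    }
    return status == 408 || status == 429 || (status >= 500 && status < 600);
}

// Delay before the next attempt (attempt is 1-based), as the comment
// describes it: base = retry_base_ms * 2^(attempt-1), plus a uniform random
// component in [0, base], with the result capped at retry_max_ms.
template <class Rng>
long long backoff_delay_ms(
    int attempt, long long retry_base_ms, long long retry_max_ms, Rng& rng
) {
    assert(attempt >= 1 && retry_base_ms >= 0 && retry_max_ms >= 0);
    // Clamp the exponent so the shift cannot overflow for large attempts.
    const int shift = std::clamp(attempt - 1, 0, 30);
    const long long base = std::min(retry_base_ms << shift, retry_max_ms);
    std::uniform_int_distribution<long long> jitter(0, base);
    return std::min(base + jitter(rng), retry_max_ms);
}

// Raise sleep_ms to at least the server hint, given in seconds.
// A negative hint is treated as "no hint" (an assumption of this sketch).
inline void apply_retry_after(long long& sleep_ms, long long retry_after_sec) {
    if (retry_after_sec >= 0) {
        sleep_ms = std::max(sleep_ms, retry_after_sec * 1000);
    }
}
```

A caller would typically check `should_retry` after each attempt, compute `backoff_delay_ms` for the attempt number, pass the result through `apply_retry_after`, and sleep for that long before trying again. Passing the generator in explicitly keeps the delay reproducible in tests; the real client may manage its randomness differently.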
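
The write callback documented for `write_callback` follows libcurl's usual contract: the callback receives `size * n` bytes and must return the number of bytes it consumed, and any other return value makes libcurl abort the transfer. The sketch below shows that contract with a plain `std::string` accumulator. The function name `append_chunk` is hypothetical, and the body is an illustration of the documented behavior, not the library's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical sketch of a libcurl-style write callback.
// `data` is expected to point at the std::string that accumulates the body.
std::size_t
append_chunk(const char* ptr, std::size_t size, std::size_t n, void* data) {
    assert(data != nullptr);
    auto* body = static_cast<std::string*>(data);
    const std::size_t bytes = size * n; // libcurl passes size == 1
    body->append(ptr, bytes);
    return bytes; // returning fewer bytes would signal an error to libcurl
}
```

Because the accumulator arrives as a `void*`, the callback can stay a static member function with no access to the client instance, which matches the `static` declaration in the header.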
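
`post_form` is documented as serializing its `form` argument as application/x-www-form-urlencoded. A minimal sketch of that serialization follows. It assumes that `parameter_list` behaves like a sequence of string key/value pairs (its real definition lives in `utils.hpp` and is not shown here), and the names `param_list`, `percent_encode`, and `encode_form` are hypothetical. The sketch percent-encodes every byte outside the unreserved set, so a space becomes `%20` rather than `+`; whether the real `build_form_body` does the same is not stated in the header.

```cpp
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Assumed shape of parameter_list; the real type is defined in utils.hpp.
using param_list = std::vector<std::pair<std::string, std::string>>;

// Percent-encode everything except the unreserved characters
// A-Z a-z 0-9 '-' '_' '.' '~'.
std::string percent_encode(std::string_view in) {
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size());
    for (const unsigned char c : in) {
        const bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9') || c == '-' || c == '_' || c == '.'
            || c == '~';
        if (unreserved) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back('%');
            out.push_back(hex[c >> 4]);
            out.push_back(hex[c & 0x0F]);
        }
    }
    return out;
}

// Join encoded key=value pairs with '&'.
std::string encode_form(const param_list& form) {
    std::string body;
    for (const auto& [key, value] : form) {
        if (!body.empty()) {
            body.push_back('&');
        }
        body += percent_encode(key);
        body.push_back('=');
        body += percent_encode(value);
    }
    return body;
}
```

For example, `encode_form({{"query", "SELECT ?x"}, {"format", "json"}})` yields `query=SELECT%20%3Fx&format=json`.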