Scroll to navigation

SC_HOIHO(1) General Commands Manual SC_HOIHO(1)

NAME

sc_hoihoHolistic Orthography of Internet Hostname Observations

SYNOPSIS

sc_hoiho [-6] [-d dump] [-D domain] [-f rtt-fudge] [-g dictionary] [-l light-speed] [-O option] [-r regex] [-R rtt-file] [-s stopid] [-S siblings] [-t threadc] public-suffix-list.dat routers.txt

DESCRIPTION

The sc_hoiho utility automatically learns regular expressions that extract router names, geolocation hints, autonomous system numbers (ASNs), and autonomous system names (AS names) from hostnames. A regular expression that extracts a router name from a router hostname can be used to infer which interfaces belong to the same router (are aliases). A regular expression that extracts an ASN or AS name from a router hostname can be used to infer which AS operates the router. A regular expression that extracts geolocation hints from a router hostname can be used to infer the approximate location of the router.

To learn these regular expressions, sc_hoiho uses a set of training routers whose interfaces were inferred to be aliases, optionally annotated with their operating ASN. The core of the technique is described in the paper "Learning to Extract Router Names from Hostnames" published in the ACM Internet Measurement Conference (IMC) 2019. The technique that learns to extract ASNs is described in the paper "Learning to Extract and Use ASNs in Hostnames" published in IMC 2020. The technique that learns to extract geolocation hints is described in the paper "Learning to Extract Geographic Hints from Internet Router Hostnames" published in the ACM Conference on emerging Networking EXperiments and Technologies (CoNEXT) 2021. The technique that learns to extract AS names is described in the paper "Learning Regexes to Extract Network Names from Hostnames" published in the Asian Internet Engineering Conference (AINTEC) 2021.

The supported options to sc_hoiho are as follows:

specifies that the input training set contains IPv6 addresses, and not IPv4 addresses.
dump
specifies the dump ID to use to analyze the collected data. Currently, ID values 1 (working-set), 2 (routers), 3 (best-regex), 4 (interfaces) are valid, which (1) dump the working set of regexes for each suffix, (2) the router-level result of applying the best regexes to the training data, (3) the best regex per domain, and (4) the interface-level result of applying the best regexes to the training data, respectively.
domain
specifies the domain suffix that sc_hoiho should operate on. If no suffix is specified, then sc_hoiho operates on all suffixes.
rtt-fudge
specifies a fudge value, in milliseconds, to add to measured RTTs before evaluating if the RTT between two lat/longs is feasible. By default, this is 1ms.
dictionary
specifies a filename containing a geolocation or AS name dictionary. The format of this file is specified below.
light-speed
specifies the distance covered in metres per millisecond, and is used to evaluate if the RTT between two lat/longs is feasible. By default, this is 204.190477 metres per millisecond, the speed of light in fiber optic cable.
option
allows the behavior of sc_hoiho to be further tailored. The current choices for this option are:
  • Show the outcome of applying the regular expressions to the application set.
  • output significant volumes of debugging information. Debugging information is only printed when a sc_hoiho is using a single thread.
  • Output inferences using json format. This option is applicable on dump IDs 2 and 3.
  • infer regexes to extract router names from hostnames.
  • infer regexes to extract AS numbers from hostnames.
  • infer regexes to extract AS names from hostnames.
  • infer regexes to extract geolocation hints from hostnames.
  • load the training data into memory and then exit.
  • do not emit geolocation information for CLLI code prefixes.
  • Do not infer true positives between extracted and inferred ASNs where the ASNs have an edit distance of one.
  • Do not infer the location of an embedded IP address portion, if any, in the hostname.
  • Do not use pcre(3) or pcre2(3) just in time complication to improve regex performance.
  • do not do any refinement of regular expressions.
  • do not, or do, execute the phase that builds literals into components that extract a string from the hostnames. This phase is described in phase 5.2 of the IMC 2019 paper.
  • do not, or do, execute the phase that embeds literals in a regex component that separates hostnames from their training routers. This phase is described in section 5.3 of the IMC 2019 paper.
  • do not, or do, execute the phase that embeds character classes in regular expressions. This phase is described in section 5.4 of the IMC 2019 paper.
  • do not, or do, execute the phase that builds additional regexes to use as a companion regex with another regex that did not cluster all interfaces on a router. This phase is described in section 5.5 of the IMC 2019 paper.
  • do not, or do, execute the phase that builds naming convention sets using the regexes in the working set. This phase is described in section 5.6 of the IMC 2019 paper.
  • do not, or do, execute the phase that builds regexes to filter hostnames with IP addresses in them, where the IP address would be part of an extracted name. This phase is described in section 5.7 of the IMC 2019 paper.
  • do not, or do, execute the phase that builds regexes to filter hostnames that, if matched, would result in false positives according to the training data. This phase is described in phase 5.7 of the IMC 2019 paper.
  • do not, or do, execute the phase that merges regular expressions that differ by a single simple string. This phase is described in section 3.3 of the IMC 2020 paper.
  • do not, or do, execute the phase that refines the geolocation dictionary. This phase is described in section 5.4 of the CoNEXT 2021 paper.
  • do not remove any redundant regexes at the end of each phase, as described in section 4.5 of the IMC 2019 paper.
  • do not, or do, remove any regexes that do not meet the minimum number of matches to be considered capturing a convention. The conditions are described in section 4.5 of the IMC 2019 paper.
  • do not, or do, remove redundant regexes that make the same inferences.
  • do not, or do, remove redundant regexes whose inferences are entirely contained in another regex, with no additional false positives.
  • compute the Rand Index metric on the clustering of hostnames by router name regexes according to the training data.
  • only show hostnames where sc_hoiho made a classification using a regex.
  • show regexes that sc_hoiho classifies as good.
  • show regexes that sc_hoiho classifies as promising.
  • show regexes that sc_hoiho classifies as poor.
  • identify LOCODEs that appear to be split into lengths of 2 (country-code) and 3 (location code) in hostnames.
regex
specifies the name of a file containing a working set of regexes, or a naming convention, to apply.
rtt-file
specifies the name of a file containing round trip time (RTT) measurements from systems with known locations towards routers.
stop-id
specifies the stage number to halt processing.
siblings
specifies the name of a file containing sibling ASes. Each line in the file contains a list of sibling ASes that belong to the same organization.
threadc
specifies the number of threads to use in the threadpool. By default, sc_hoiho will determine the number of processors online, and use all of them.

EXAMPLES

Given a set of routers in a file named routers.txt, and a copy of public_suffix_list.dat obtained from the Mozilla Foundation's https://publicsuffix.org/list/ website:

# node2id: 1
# node2as: 64496
192.0.2.1    esr1-ge-5-0-0.jfk2.example.net

192.0.2.10 esr1-ge-5-0-6.jfk2.example.net
192.0.31.60 # node2id: 2 # node2as: 64496 192.0.2.2 esr2-xe-4-0-0.lax.example.net
192.0.2.5 esr2-xe-4-0-1.lax.example.net
192.0.31.8 # node2id: 3 # node2as: 64496 192.0.2.6 das1-v3005.akl.example.net
192.0.2.9 das1-v3006.akl.example.net
192.0.2.44 44.2.0.192.example.net # node2id: 4 # node2as: 64496 192.0.2.13 esr1-xe-4-0-0.lax.example.net # node2id: 5 # node2as: 64496 192.0.2.17 esr1-xe-4-0-1.lax.example.net # node2id: 6 # node2as: 64496 192.0.2.21 esr1-xe-4-0-1.lax.example.net # node2id: 7 # node2as: 64500 192.0.2.25 as64500.cust.example.net # node2id: 8 # node2as: 64501 192.0.2.29 as64501.cust.example.net # node2id: 9 # node2as: 64502 192.0.2.33 as64502.cust.example.net # node2id: 10 # node2as: 64503 192.0.2.37 as64503.cust.example.net

Then the following command will build a base set of regular expressions that extract router names, as described in section 5.1 of the IMC 2019 paper, and output the working set of regexes inferred for each suffix at the end of that phase.

sc_hoiho -O learnalias -d working-set -O norefine public_suffix_list.dat routers.txt

To obtain the best selected regular expression that extracts router names for example.net, use:

sc_hoiho -O learnalias -d best-regex -D example.net public_suffix_list.dat routers.txt

To examine how the best regular expression that extracts router names applies to the training data for example.net, use:

sc_hoiho -O learnalias -d routers -D example.net public_suffix_list.dat routers.txt

To examine how the best regular expression that extracts router names applies to the training data, as well as interfaces in the application set, use:

sc_hoiho -O learnalias -d routers -D example.net -O application public_suffix_list.dat routers.txt

To see the working set of regular expressions that extract router names built after embedding literals in captures for example.net, use:

sc_hoiho -O learnalias -d working-set -D example.net -s 2 public_suffix_list.dat routers.txt

To see how a manually-derived regular expression clusters hostnames according to the extracted router name, use:

sc_hoiho -O learnalias -d routers -D example.net -r "^([a-z]+\d+)-.+\.([a-z\d]+)\.example\.net$"
public_suffix_list.dat routers.txt

To infer regular expressions that extract ASNs from hostnames, use:

sc_hoiho -O learnasn -d best-regex public_suffix_list.dat routers.txt

To infer regular expressions that extract geohints from hostnames, and formatting the output as JSON, use:

sc_hoiho -O learngeo -d best-regex -O json -R rtts.txt -g geohints.txt public_suffix_list.dat routers.txt

HINTS

sc_hoiho can take a long time to run when inferring regular expressions that extract router names, depending on the training set involved. One option to breaking up the runtime (but not reducing it) is to capture the output from one phase, and then use that as input to the next phase. For example, to run the first three phases:

sc_hoiho -O learnalias -d working-set -s 1 public_suffix_list.dat routers.txt >phase-1.re
sc_hoiho -O learnalias -d working-set -s 2 -r phase-1.re public_suffix_list.dat routers.txt >phase-2.re
sc_hoiho -O learnalias -d working-set -s 3 -r phase-2.re public_suffix_list.dat routers.txt >phase-3.re

NOTES

sc_hoiho follows the format of the hostnames files stored in CAIDA's Internet Topology Data Kit (ITDK) which stores hostnames in lower-case, and stores characters that do not form part of the DNS's alphabet (A-Z, a-z, - and .) as a hexadecimal escaped string. For example, if a hostname contains an underscore character, such as foo_bar, then encode the underscore using the hexadecimal dictionary in ascii(7) as follows: foo\x5fbar.

When learning ASN regexes, sc_hoiho can take an optional parameter that specifies which ASNs belong to the same organization (are siblings). Each line specifies ASNs that belong to the same organization. For example, a file with the following contents:

64504 64505 64506

64507 64508

defines the ASNs operated by two organizations: one organization with ASes 64504, 64505, and 64506, and the other organization with ASes 64507 and 64508.

When learning geohint regexes, sc_hoiho requires parameters that specify a geohint dictionary, and RTT measurements. For the geohint dictionary, the format of the file is as follows:

iata code lat lng cc-st "city"

icao code lat lng cc-st "city"
clli code lat lng cc-st "city"
place "city" cc-st lat lng population
locode code lat lng cc-st "city"
facility "city" cc-st lat lng "street" "name"
country iso3166-2 iso3166-3 "name"
state cc-st "name"
vp code lat lng

For example:

iata IAD 38.9445 -77.455803 US-VA "Washington, DC"

icao KIAD 38.9445 -77.455803 US-VA "Washington, DC"
clli washdc 38.89511 -77.03637 US-DC "Washington, D.C."
place "Washington" US-DC 38.89511 -77.03637 601723
locode USIAD 38.94877 -77.4491 US-VA "Dulles Int Apt/Washington"
facility "Washington" US-DC 38.902918 -77.029149 "1275 K Street, NW" "CoreSite - Washington,DC (DC1)"
country us usa "United States"
state US-DC "Washington, D.C."
vp hlz2-nz 37.86 175.33

The VP geohints in the file are used by sc_hoiho to record where a given VP is located, and are used to identify the location associated with an RTT sample. VP geohints are not used when searching a hostname for location identifier. If the RTT file contains VP codes without a VP geohint, the VP code must match one of the other codes with a lat/lng.

For the RTT measurements, the format of each line of the file is as follows:

nodeid iata ms

For example:

N1 jfk 1

N1 iad 10
N1 yhu 20
N2 sjc 10
N2 san 12
N2 tij 16
N3 akl 2
N3 wlg 11
N3 syd 25

When learning AS name regexes, sc_hoiho can optionally be provided an AS name dictionary. If an AS name dictionary is not provided, then sc_hoiho will learn an AS name dictionary automatically. For an AS name dictionary that maps 64504 to foo, 64505 to bar, and 64506 to baz, the format of the file is as follows:

64504 foo

64505 bar
64506 baz

SEE ALSO

pcre(3), pcre2(3), sc_ally(1), sc_pinger(1), sc_radargun(1), sc_speedtrap(1),

M. Luckie, B. Huffaker, and k claffy, Learning to Extract Router Names from Hostnames, Proc. ACM Internet Measurement Conference (IMC) 2019.

M. Luckie, A. Marder, M. Fletcher, B. Huffaker, and k claffy, Learning to Extract and Use ASNs in Hostnames, Proc. ACM Internet Measurement Conference (IMC) 2020.

M. Luckie, B. Huffaker, A. Marder, Z. Bischof, M. Fletcher, and k claffy, Learning to Extract Geographic Information from Internet Router Hostnames, Proc. ACM Conference on emerging Networking EXperiments and Technologies (CoNEXT) 2021.

M. Luckie, A. Marder, B. Huffaker, and k claffy, Learning Regexes to Extract Network Names from Hostnames, Proc. Asian Internet Engineering Conference (AINTEC) 2021.

Mozilla Foundation, Public Suffix List, https://publicsuffix.org/list/.

Center for Applied Internet Data Analysis (CAIDA), Macroscopic Internet Topology Data Kit (ITDK), https://www.caida.org/data/internet-topology-data-kit/.

R. Govindan and H. Tangmunarunkit, Heuristics for Internet Map Discovery, Proc. IEEE INFOCOM 2000.

N. Spring, R. Mahajan, and D. Wetherall, Measuring ISP topologies with Rocketfuel, Proc. ACM SIGCOMM 2002.

A. Bender, R. Sherwood, and N. Spring, Fixing Ally's growing pains with velocity modeling, Proc. ACM/SIGCOMM Internet Measurement Conference 2008.

K. Keys, Y. Hyun, M. Luckie, and k claffy, Internet-Scale IPv4 Alias Resolution with MIDAR, IEEE/ACM Transactions on Networking 2013.

M. Luckie, R. Beverly, W. Brinkmeyer, and k claffy, Speedtrap: Internet-scale IPv6 Alias Resolution, Proc. ACM/SIGCOMM Internet Measurement Conference 2013.

A. Marder, M. Luckie, A. Dhamdhere, B. Huffaker, J. Smith, and k claffy, Pushing the Boundaries with bdrmapIT: Mapping Router Ownership at Internet Scale, Proc. ACM Internet Measurement Conference 2018.

AUTHORS

sc_hoiho was written by Matthew Luckie. Marianne Fletcher added support for inferring regexes that extract ASNs and geohints from hostnames.

May 8, 2023 Linux 5.14.21-150500.55.52-default