Title: Estimating the number of entities with vacancies using administrative and online data
Authors: Beręsewicz, Maciej; Cherniaiev, Herman; Pater, Robert
Date: 2021-06-06

Abstract: In this article we describe a study aimed at estimating job vacancy statistics, in particular the number of entities with at least one vacancy. To achieve this goal, we propose an alternative to the survey-based methodology, one that relies solely on data from administrative registers and online sources and on dual system estimation (DSE). As these sources do not cover the whole reference population and the number of units appearing in all datasets is small, we have developed a DSE approach for negatively dependent sources based on a recent work by Chatterjee and Bhuyan (2020). To achieve the main goal we conducted a thorough data cleaning procedure in order to remove out-of-scope units, identify entities from the target population, and link them by identifiers to minimize linkage errors. We verified the effectiveness and sensitivity of the proposed estimator in simulation studies. From a practical point of view, our results show that the current vacancy survey in Poland underestimates the number of entities with at least one vacancy by about 10-15%. The main reasons for this discrepancy are non-sampling errors due to non-response and under-reporting, which is identified by comparing survey data with administrative data.
Keywords: administrative data; capture-recapture methods; online job vacancies; big data; non-probability samples

1 Introduction

Statistics about job vacancies are important, yet notoriously hard to produce. From an economic point of view, job vacancies represent the portion of labor demand that has not yet been filled. By complementing the level of employment with vacancy statistics, it is possible to measure labor demand fully. Vacancy statistics have other useful functions as well. Vacancies can provide information about the skills gap experienced by companies or about other structural characteristics, such as job contract types. Since vacancies represent unmet demand, they can be treated as predictors of labor demand and can be used to construct a leading index for the labor market. At the same time, vacancies can reveal various mismatches that occur in the labor market and explain trends in structural unemployment. However, vacancies are not reported for tax purposes, even though employment is; as a result, 'vacancies' are an intangible measure. The main source of European official statistics about vacancies is the Job Vacancy Survey (JVS), conducted quarterly in all EU member states. However, job advertisements are commonly used as proxies of job vacancies.
The earliest attempt to track job offers, the Help-Wanted Index (HWI), was created in 1951 and restructured in 1987. For a long time it was based only on the classified pages of newspapers (Abraham and Wachter, 1987). However, that dataset included only a few variables. Only recently has there been a significant rise in the number of publications on how to extract more detailed information from online job offers, driven by the development of web scraping and natural language processing methods, as well as the growing availability of open source software. Since 2005 the Conference Board has published the Help Wanted OnLine series, which is based on online job postings. Researchers who analyse job offers collect data either from one or from many websites. The advantage of the first approach is more in-depth knowledge of a smaller amount of scrupulously classified data. The latter approach makes it possible to obtain a lot of information with potentially many variables, but with a larger misclassification error due to the automatic procedures used for text classification.

In this study we focus on job vacancies in Poland. These data come from three distinct sources. The first one is the Demand for Labor (DL) survey, a complex random sample survey of businesses, conducted quarterly by Statistics Poland (in particular by the Statistical Office in Bydgoszcz). Its methodology is the same as that used in the pan-European JVS. The second source is the Central Job Offers Database (CBOP), which contains all job offers registered by public employment offices (PEOs), regional employment offices (REOs) and voluntary labor corps (VLCs) in Poland 1. Inclusion in the database means that a job offer was officially submitted by a company to a PEO (or was acquired by a PEO from a REO or a VLC), and that the job offer was registered by completing a form with the required details.
The CBOP database does not contain all job offers in Poland, but those that are included have been classified and checked by a PEO worker. The third database contains job offers from the online job board Pracuj.pl (hereinafter Pracuj). Websites generally provide a richer (in terms of description) source of job postings than employment offices. However, online job offers appear on websites with various structures. They are not checked by clerks, as is the case with offers posted in PEOs, but instead are classified automatically using rule-based or machine learning algorithms. Pracuj is arguably the most recognizable job search website in Poland (PBI/Gemius MegaPanel, 2021). It specializes in advertising job offers and provides detailed information about vacancies obtained from employers. In order to post a job offer a company needs to pay a fee.

In this study we use the same definition of the target population as in the DL survey, but we focus on economic entities with job vacancies (i.e. legal units and their local units together; see Section 3 and Appendix A for exact definitions). In order to correctly identify units from the target population we established collaboration with the Statistical Office in Bydgoszcz (Poland), which is responsible for conducting the DL survey, to verify whether a given entity belongs to the target population. Our target quantity is the number of units with at least one vacancy. Unfortunately, CBOP and Pracuj.pl do not fully cover the target population; neither are they random samples, which is why they cannot be used directly to estimate the target variable. To overcome this, we use dual system estimation (DSE, i.e. capture-recapture methods, CR), which makes it possible to estimate the target quantity after integrating the two sources.

1 In the end, we relied almost exclusively on data from PEOs, because the other sources contained foreign job offers.
CR methods are widely used in official statistics and are recommended as a standard tool for assessing census quality (cf. United Nations, 2010). CR methods rely on several assumptions, the most important being independence of the data sources and the absence of linkage and over-coverage errors (Wolf et al., 2019; Zhang, 2019).

Our contribution can be summarized as follows. From the methodological point of view, we extend the model proposed by Chatterjee and Bhuyan (2020) and derive a point and variance estimator for negatively dependent sources. We verify the performance and sensitivity of the proposed method in simulation studies. On the application side, we contribute to the literature on job vacancy statistics by providing a novel method based solely on non-survey data sources, combining administrative data with online data. We focus on quality aspects by applying rigorous data cleaning procedures and conducting a clerical review of a selected sample of job offers to verify the linkage between companies and their identifiers. We believe that the proposed method can be applied in other countries where such data sources exist.

The article has the following structure. Section 2 describes international experiences regarding the production of job vacancy statistics and the use of job offers as vacancy proxies. Section 3 describes the data used in the study and the procedure for preparing and cleaning online data. Section 4 provides an overview of the capture-recapture methods used in the study. Section 5 presents results from the study and compares them with existing results from the probability sample. Finally, Section 6 summarizes the article and discusses further research steps. Additional details and results are presented in the Appendix.

2 Job vacancy statistics

Eurostat 2 defines a job vacancy as a paid post that is newly created, unoccupied, or about to become vacant under two conditions: 1.
the employer is taking active steps and is prepared to take further steps to find a suitable candidate for the job from outside the enterprise concerned; and 2. the employer intends to fill the job either immediately or within a specific period. This definition is similar to others, but sometimes additional constraints are included. For example, Holt and David (1966) added the condition that a job vacancy should include employee requirements and, above all, the wage rate. Jackman et al. (1989) reported the UK definition, which was very similar but specified that the company should have taken a recruiting action within four weeks before posting a vacancy. The definition used by the Bureau of Labor Statistics 3 in the Job Openings and Labor Turnover Survey (JOLTS) contains information about possible types of contracts: full-time, part-time, short-term, permanent, or seasonal, but the requirement is that a job should start within 30 days. The BLS definition of a job opening excludes internal transfers of employees (promotions, demotions) and recalls of workers from layoffs. Also excluded are positions that will be filled by employees from leasing companies, temporary help agencies, outside contractors, and consultants. The Australian Bureau of Statistics states that a vacancy must be available for immediate filling.

2 Job vacancy statistics; https://ec.europa.eu/eurostat/cache/metadata/en/jvs_esms.htm
3 Job Openings and Labor Turnover Survey; https://www.bls.gov/opub/hom/jlt/pdf/jlt.pdf
The category excludes 4:
• "jobs not available for immediate filling on the survey reference date;
• jobs for which no recruitment action has been taken;
• jobs which became vacant on the survey date and were filled on the same day;
• jobs of less than one day's duration;
• jobs only available to be filled by internal applicants within an organisation;
• jobs to be filled by employees returning from paid or unpaid leave or after industrial disputes;
• vacancies for work to be carried out by contractors; and
• jobs for which a person has been appointed but has not yet commenced duty."

Since the methodology of the Polish Labor Demand survey is the same as that used in the pan-European JVS, the same definition applies. Job advertisements can include job positions that are not treated as employment according to labor law. In Europe, generally, the following job positions are not treated as employment contracts and must be excluded from vacancy statistics:
• independent contractors (freelance contractors);
• business-to-business contracts (B2B);
• temporary employee staffing agreements (temporary staffing agreements, temporary services agreements);
• apprenticeships, traineeships;
• voluntary workers;
• contracts for specific work (which can informally be called 'work contracts', so they might be mistaken for 'employment contracts');
• contracts of mandate (in Poland);
• contracts for the provision of independent services;
• association contracts (in Greece);
• representation contracts (in Greece).

An employer can take various 'active steps' to find a candidate. Out of the seven active steps listed by Eurostat, some are connected to job offers. These include: 1. notifying public employment services about a job vacancy; and 2. advertising the vacancy in the media (for example online, in newspapers or magazines).
Other methods include using a human resources (HR) agency, approaching and recruiting a worker directly, using personal contacts or internships, and advertising the vacancy on a public notice board. In this article, we use the same definition of a job vacancy as that adopted by Statistics Poland and Eurostat for the purpose of processing and cleaning data obtained from administrative and online sources. Despite some international differences between definitions of a job vacancy, the following conditions must be met for a workplace to be regarded as a job vacancy: i) the workplace is unoccupied, ii) the company is looking for an employee to fill it, and iii) the company is willing to fill the position as soon as it finds a suitable candidate. These conditions correspond to the definition of an unemployed person, who does not have a job but is actively searching for one and is willing to start working as soon as they find a suitable job. Problems occur when one attempts to empirically measure the number of vacancies. This issue is considered in the next section.

Online data can supplement official statistics, but it is important to remember that they are non-probability samples. Beresewicz and Pater (2021) identify and summarize various biases that can occur when online job offers are used as a measure of vacancies. They also point out certain advantages of using online job offers. So far, no article has been published that presents a thorough analysis of how to address all of these biases. The solutions presented in the literature do not propose a full statistical procedure for estimating the population of job vacancies based on job offers. Carnevale et al. (2014) estimate that the share of online job offers in all vacancies in the US economy in 2014 was between 60% and 70%, but they treat these figures as "back-of-the-envelope" estimates. Acemoglu et al. (2020), Deming and Noray (2020), Forsythe et al.
(2020), Blair and Deming (2020), and Modestino et al. (2020) use the Burning Glass Technologies (BGT) data, which are compiled from job offers collected from many US websites. Jobs in their database accounted for 85% of jobs from the probability sample survey (Job Openings and Labor Turnover Survey, JOLTS) in 2016. However, there is little information about the procedures used for data collection, cleaning, and classification. The representativeness of these data was analysed using an approach similar to that adopted in previous studies by Hershbein and Kahn (2018) and Deming and Kahn (2018). The procedure involves cross-validating the data (e.g. on skills) against other measures obtained from a probability-sample survey, and then comparing the distributions of the online job offers used, or a subsample of them, with the results of JOLTS and with the distributions of other measures of online job offers. For example, Deming and Noray (2020) exclude vacancies where information about the employer is missing. Scrivner et al. (2020) exclude any unclassified job postings. Blair and Deming (2020) weight data by the size of the labor force and the share of employment by occupation and metropolitan statistical area. They use six-digit occupational codes in their study and include fixed effects for occupation, region, and firm. Using comparable data, Shen and Taska (2020) conduct a similar analysis for the Australian vacancy market. Marinescu and Wolthoff (2020) and Marinescu and Rathelot (2018) use a different approach and collect data about job offers from one US website (CareerBuilder). This website provided the authors with many variables as well as information for job seekers. The main disadvantage of this approach is the low coverage of the population of vacancies. When they compared the collected online job offers with the JOLTS survey, they concluded that CareerBuilder.com represents 35% of the total number of vacancies in the US economy.
The largest study of European job offers is conducted by Cedefop using data from all EU member states and a method proposed by Colombo et al. (2018). It involves the use of automatic web crawling and scraping algorithms as well as data provided on the basis of agreements. These data show promising results that could supplement official statistics on vacancies, but the research is still in progress (Beresewicz and Pater, 2021).

3 Data description

The Demand for Labor (DL) survey is carried out as a complex probability sample consisting of 100,000 units. The selection is made using stratified Poisson sampling, where the population is initially split into two groups: one containing entities with more than 9 employees (50,000), and the second containing companies with up to 9 employees (50,000). In 2018 the sampling frame contained 844,280 entities, including 111,000 local units and 733,000 entities of the national economy (in total, 734,000 entities with one or no local unit). Regarding the entities with more than 9 employees, the objective of the survey is to obtain information about selected sectors of the economy (by NACE sections) in each province (NUTS2 level regions). As a result, this part of the population is divided into 304 separate subpopulations. The sample of about 50,000 entities is allocated between particular subpopulations in such a way as to obtain approximately the same level of precision of results for each subpopulation. Units in each subpopulation are sorted in descending order according to the number of employees (based on information in the sampling frame). The largest units in each subpopulation are included in the survey without sampling. The target sample is allocated between subpopulations using the numerical optimisation methods described in Lednicki and Wieczorkowski (2003) and Kozak (2004). In the case of units with up to 9 employees, the main objective of the survey is to obtain precise results for 19 NACE sections.
Within these sections, units are stratified by province and selected using stratified, proportional sampling. Both types of entities report the same survey variables.

For this study we obtained anonymized unit-level data for 2018 from Statistics Poland. Table 1 presents basic statistics regarding the data collection process and the estimated population size. The response rate was calculated as the share of companies that reported, relative to the initial sample size (we excluded inactive or out-of-scope units, the number of which ranged from 3,000 in 2018Q1 to 5,000 in 2018Q4). In each quarter, 63% of units on average responded, with the response rate varying considerably depending on unit size. There was no decline in the overall response rate, but the over-coverage error rose from 3% to 5%. The over-coverage error is the reason why the estimated population size differs from the initial figure of 844,000. Such discrepancies may arise when a sampled economic unit ceased to exist shortly before or during the survey. According to the survey, there were 789,000 and 745,000 units in the first and the last quarter of 2018 respectively. As expected, most units were small (about 70%) and had the highest rate of non-response (about 65%). We use this estimated population size as a reference for our study.

The structure of job offers from PEOs is different from that observed in the DL survey. For one thing, there may be an over-representation of jobs from companies that have an incentive to advertise their vacancies through public employment offices, for example in the case of refunded internships or publicly-subsidised workplaces for the disabled. Public entities, in particular, are more willing to publish job offers in PEOs because they are often obliged to do so by their own internal regulations. Finally, low-paying jobs are more often sent to PEOs because people with lower qualifications often rely on public institutions to help them find a job.
Better-paying jobs are more often advertised on job boards, in the media or through private HR agencies, which charge fees for their services. However, registered data also have valuable properties. They provide information about stocks and flows of job offers. The PEO register contains structured fields with a detailed workplace description, including occupation, type of contract, etc. Each submitted vacancy can be classified by a qualified PEO worker or by an employer. In the first case, one can be sure that industry and occupation are properly coded. In the second case, a PEO worker checks whether all information has been provided and whether it looks plausible (as required by law; see a detailed description in Appendix B.2). If the employer has not included the NACE section or the ISCO occupational code, PEO staff are obligated to supplement this information based on the company tax identification number and the job description 6. If submitted information looks suspicious, e.g. if it contains many workplaces or the format of working hours seems to be wrong, the company is contacted directly with a request for clarification. This generally guarantees a higher quality of data compared to information obtained from commercial websites and automatic classification algorithms. Job offers are also verified by PEO staff in terms of their expiration date, which reduces the over-coverage error. If the employer has indicated an expiration date on their offer, it is removed manually on that day. An employer who finds a suitable employee should immediately report this fact to the PEO so that the advert can be removed; the reason why the offer is no longer valid is also registered. While this requirement is not strictly obeyed, it is verified to some extent. On the basis of information provided by employers, PEOs determine the frequency of contacts with particular employers. These contacts should not be less frequent than once a month.
Usually, they happen once every two weeks. In the case of foreign job offers, the required frequency is every two weeks. After this period, a PEO employee contacts the employer, using the method and contact details indicated earlier, and asks whether the workplace has still not been filled. A further description of the data is presented in Appendix B.2.

Since CBOP is a governmental service, we decided to use a commercial job board as an additional source of data about job offers. There are many websites which advertise job offers, and their number cannot be accurately determined. Moreover, such websites differ from one another in terms of various features (e.g. paid vs. free websites, occupation-specialized websites, etc.). These websites can be divided into the following groups:
1. country-wide online recruitment services,
2. industry-specific websites (limited to e.g. IT or financial occupations),
3. local job search websites (e.g. limited to jobs offered within a particular local labor market defined by various (2-5) NUTS levels),
4. employers' websites,
5. Internet forums with job offers (e.g. Facebook groups),
6. job aggregators (e.g. Jooble).

A systematic collection of job offers is a non-trivial task. Unlike CBOP, online job boards usually do not have a public API, so it is necessary to use web scraping algorithms. Another condition is that such websites must contain an archive of job offers, or else historical job offers cannot be accessed after they expire. The biggest difference between CBOP and Pracuj is that the latter, being a commercial website, charges a fee for posting a job offer. Like other nation-wide job boards, Pracuj requires registration, but unlike some of them, it provides open access to employers' profiles. After opening a profile, users can track current and expired job offers.

6 Based on interviews with PEO staff we know that the majority of employers do not provide occupational codes.
Because the collected job advertisements did not contain company tax identifiers (REGON or NIP numbers), we conducted a multi-step data cleaning procedure involving text mining techniques, matching entity names between CBOP and Pracuj, and automated Google search queries to identify units that belong to our target population. A detailed description of all steps is given in Appendix B.3.1. In the next step, we focused on removing records with missing or erroneous data in employer fields. For instance, in CBOP there were 41 records without either a REGON or a NIP id, while in the case of Pracuj the most frequent problems were a missing company name (e.g. hidden recruitment) or the impossibility of assigning an identifier based on the company name or a Google search (see Appendix B.3.1 for details). Next, we removed records containing missing or erroneous values in key variables, such as the number of vacancies (CBOP), dates (Pracuj) or location (region). The main differences between the two datasets were erroneous values in employer fields in Pracuj and missing numbers of reported vacancies in CBOP. Note that each record represents a row in a data table and contains one advertisement. Finally, we removed records that did not belong to the populations of job vacancies and entities. Note that we did not deduplicate the data in either source, as we were not interested in estimating the total number of vacancies. A comparable number of records was removed from both datasets owing to over-coverage: 10,000-15,000 records turned out not to be job offers (e.g. internships, non-job contracts such as B2B contracts) and 10,000-15,000 records referred to employers that did not occur in the reference population defined by the DL survey. After completing the cleaning procedure our two datasets contained about 84,000 (CBOP) and 136,000 (Pracuj) records.
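The record-filtering logic described above can be sketched in Python. This is only an illustration, not the authors' pipeline: the field names (regon, nip, n_vacancies, contract) are hypothetical placeholders, not the actual CBOP or Pracuj schema.

```python
# Sketch of the record-filtering step described above. Field names
# (regon, nip, n_vacancies, contract) are hypothetical placeholders,
# not the actual CBOP/Pracuj schema.
OUT_OF_SCOPE = {"internship", "b2b"}  # non-employment contracts to drop

def keep_record(rec):
    # 1) an employer identifier (REGON or NIP) must be present;
    if not (rec.get("regon") or rec.get("nip")):
        return False
    # 2) key variables such as the number of vacancies must be valid;
    if not rec.get("n_vacancies") or rec["n_vacancies"] <= 0:
        return False
    # 3) the advertised position must be an actual job vacancy.
    return rec.get("contract") not in OUT_OF_SCOPE

records = [
    {"regon": "123456785", "n_vacancies": 2, "contract": "employment"},
    {"regon": None, "nip": None, "n_vacancies": 1, "contract": "employment"},
    {"nip": "9999999999", "n_vacancies": 1, "contract": "b2b"},
]
cleaned = [r for r in records if keep_record(r)]
print(len(cleaned))  # 1: only the first record survives all three filters
```

The three conditions mirror the order of the cleaning steps in the text: identifier availability, validity of key variables, and membership in the population of job vacancies.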
The CBOP database contained only 64 local units and Pracuj only 47, which is significantly fewer than the number found in the sampling frame for the DL survey. That is why we decided to limit our analysis to entities of the national economy and not to treat their local units separately. Table 3 contains a comparison of the three sources used in the study. Note that the DL survey contains a question about whether a given unit has reported a vacancy to a PEO. With this information we were able to assess the non-response or under-reporting error based on administrative data about all job vacancies registered by PEOs.

4 Capture-recapture methods

Numerous methods of population size estimation based on multiple data sources have been proposed in the literature (cf. Böhning et al., 2017). The most popular ones include dual system estimation (DSE; two sources) and multiple system estimation (MSE; three or more sources), both of which are used for estimating hard-to-reach populations or register-based census statistics. These methods require access to unit-level data and are based on certain assumptions (equal capture probabilities, no over-coverage, independent sources, perfect linkage), which might be difficult to meet in practice (cf. Zaslavsky and Wolfgang, 1993; Wolter, 1986; Zhang, 2019). To overcome these issues, a number of methods have been proposed in the literature. Stratification is used to account for between-group heterogeneity, under the assumption of within-group homogeneity (Cormack, 1989; Van der Heijden et al., 2012). Zhang (2015, 2021) proposed estimators for dependent dual system estimation, and Gerritse et al. (2015) and Chatterjee and Mukherjee (2020) discussed scenarios for detecting dependence in DSE. In this article we extend the method proposed by Chatterjee and Bhuyan (2020), assuming a negative correlation between the sources and truncation of data based on the number of days prior to the end of the quarter.
The starting point for DSE is a 2×2 contingency table containing information from two sources, as shown in Table 4, where n_11 denotes the number of units observed in both lists, n_10 and n_01 the numbers of units observed in only one list, and n_00 the number of units observed in neither list, with corresponding probabilities p_11, p_10, p_01 and p_00:

Table 4: Contingency table for two lists
                          List 2
                     Yes (1)       No (0)        Total
List 1  Yes (1)    n_11 (p_11)   n_10 (p_10)   n_1. (p_1.)
        No (0)     n_01 (p_01)   n_00 (p_00)   n_0. (p_0.)
        Total      n_.1 (p_.1)   n_.0 (p_.0)   N

The total population size can be estimated from Table 4 having first estimated the size of n_00. In order to estimate the total population size we can use the Lincoln-Petersen estimator, given by the following equation:

N̂ = n_1. n_.1 / n_11, (1)

where n_1. = n_11 + n_10 and n_.1 = n_11 + n_01. In this article we will refer to this estimator as the naive estimator, since it is based on assumptions that may not hold in real data applications. In the setting with two dependent sources, two scenarios can be considered: (1) positive dependence, where units are more likely to be observed in the second source/time; and (2) negative dependence, where units are less likely to be observed in the second source/time. In the literature, these scenarios are often referred to as behavioral response effects. The first scenario is observed in post-enumeration surveys used for assessing census undercount (see Bell (1993) for the USA and Chatterjee and Mukherjee (2016) for India), while the second can be observed in situations where the two sources are mutually exclusive (e.g. child injury data collected by hospitals and police stations) or where re-identification is associated with social stigma (e.g. populations of drug users or patients infected with HIV). In such situations the Lincoln-Petersen estimator given by (1) is biased, and Chatterjee and Mukherjee (2021) showed that the approximate bias is a function of a behavioural response effect φ > 0, the probability p = p_01/p_0. = p_01/(1 − p_1.) = Pr(an individual is captured in List 2 | not captured in List 1), the probability p_1. of being captured in List 1, and the population size N. To verify whether two sources are dependent, Chatterjee and Mukherjee (2020) suggest calculating c = p_11/p_1., which is Pr(an individual is captured in List 2 | captured in List 1), and comparing it with p to verify whether there exists φ such that c = φp.
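The naive estimator (1) and the dependence diagnostic c can be computed directly from the cell counts. The following Python sketch uses hypothetical counts and is not the authors' implementation (which was written in Julia); note that the comparison quantity p involves the unknown true N, which is why in practice only c is reported.

```python
# Naive Lincoln-Petersen estimator (1) and the dependence diagnostic
# c = p11/p1. versus p = p01/(1 - p1.). The counts are hypothetical.

def lincoln_petersen(n11, n10, n01):
    """Naive DSE estimator: N = n1. * n.1 / n11."""
    n1_dot = n11 + n10  # total observed in List 1
    n_dot1 = n11 + n01  # total observed in List 2
    return n1_dot * n_dot1 / n11

def dependence_diagnostic(n11, n10, n01, N):
    """c = Pr(in List 2 | in List 1), p = Pr(in List 2 | not in List 1).
    c < p suggests negative dependence. Computing p requires the
    (unknown) true N, so only c can be reported in applications."""
    n1_dot = n11 + n10
    c = n11 / n1_dot
    p = n01 / (N - n1_dot)
    return c, p

n11, n10, n01 = 50, 350, 250  # hypothetical 2x2 table counts
print(round(lincoln_petersen(n11, n10, n01)))  # 2400
c, p = dependence_diagnostic(n11, n10, n01, N=1000)
print(c < p)  # True -> consistent with negative dependence
```

In this hypothetical table negative dependence deflates n_11, so the naive estimate (2400) far exceeds the assumed true size (1000); this direction of bias follows mechanically from n_11 appearing in the denominator of (1).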
The negative dependence would be represented by smaller values of c (closer to 0), and the positive dependence by larger values of c (closer to 1). In this article we calculate c to indicate the direction of the relationship.

Below we focus on the method proposed by Chatterjee and Bhuyan (2020) to estimate the population size under dependence. Let us consider a population U of size N, and let Y_j denote inclusion of the j-th individual (j = 1, ..., N) in List 1 and Z_j its inclusion in List 2. To capture the dependency structure we define a pair (X*_1j, X*_2j), which represents the latent capture status of the j-th individual during the first and the second attempt. The latent capture status X*_lj takes values {0, 1}, which denote the absence or presence of the j-th individual in the l-th list (l = 1, 2). We assume that α is the proportion of individuals for whom there is a behavioural dependence between List 1 and List 2, i.e. for whom the value of X*_2j is the same as that of X*_1j (X*_2j = X*_1j). The observed pair (Y_j, Z_j) is the manifestation of the latent structure (X*_1j, X*_2j) for the j-th individual. Therefore, the positive dependence between sources can be formulated as follows:

(Y_j, Z_j) = (X*_1j, X*_1j) with probability α, and (Y_j, Z_j) = (X*_1j, X*_2j) with probability 1 − α, (3)

where X*_1j and X*_2j are independently and identically distributed Bernoulli random variables with parameters p_1 and p_2.
For the negative dependence between sources the relationship can be represented as follows:

(Y_j, Z_j) = (X*_1j, 1 − X*_1j) with probability α, and (Y_j, Z_j) = (X*_1j, X*_2j) with probability 1 − α. (4)

Now, we can derive Pr(Y = y, Z = z) under model (4) for the contingency table in Table 4 using the law of total probability:

p_11 = (1 − α) p_1 p_2,
p_10 = α p_1 + (1 − α) p_1 (1 − p_2),
p_01 = α (1 − p_1) + (1 − α) (1 − p_1) p_2,
p_00 = (1 − α) (1 − p_1) (1 − p_2).

For example, p_11 = α Pr(X*_1j = 1, 1 − X*_1j = 1) + (1 − α) Pr(X*_1j = 1, X*_2j = 1) = (1 − α) p_1 p_2, because in the case of negative dependence Pr(X*_1j = 1, 1 − X*_1j = 1) = 0; here α denotes the probability of behaviourally dependent units, which appear in one list but not in the other. The corresponding marginal probabilities are given by

p_1. = p_1 and p_.1 = α (1 − p_1) + (1 − α) p_2.

The parameters of model (4) can be practically interpreted: α represents the share of behaviourally dependent individuals in the population and p_l is the capture probability of a causally independent individual in the l-th list. However, the parameters of this model are not identifiable, as their number exceeds the number of observed counts. To overcome this problem, we assume that population U can be stratified into two mutually exclusive and exhaustive sub-populations, denoted by U_A and U_B. In addition to the standard assumptions of capture-recapture, i.e. that 1) the population is closed and 2) the probability of capture in each of the two attempts is the same, we specify the following assumptions that underlie model 2 proposed by Chatterjee and Bhuyan (2020):

1. the initial (List 1) probabilities of capturing individuals belonging to both sub-populations are the same (i.e. p_1A = p_1B = p_1);
2. probability p_2 differs between sub-populations, i.e. there are two parameters p_2A and p_2B.

We can establish a relationship between p_2A and p_2B using the method of moments, starting from a set of equations based on (4). Under these assumptions, we need to estimate 6 parameters: θ = (N_A, N_B, α, p_1, p_2A, p_2B), but this number can be reduced, as N_A = (x_1.A / x_1.B) N_B and because of the relation between p_2A and p_2B.
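The cell probabilities of the negative-dependence model (4) are easy to check numerically. The following Python sketch uses arbitrary parameter values and verifies only the two algebraic facts stated above: the four cells sum to one, and the List 1 marginal reduces to p_1.

```python
# Cell probabilities under the negative-dependence model (4): with
# probability alpha a unit captured in List 1 is absent from List 2
# (and vice versa); with probability 1 - alpha the two captures are
# independent Bernoulli(p1) and Bernoulli(p2) draws.
def model4_probs(alpha, p1, p2):
    p11 = (1 - alpha) * p1 * p2
    p10 = alpha * p1 + (1 - alpha) * p1 * (1 - p2)
    p01 = alpha * (1 - p1) + (1 - alpha) * (1 - p1) * p2
    p00 = (1 - alpha) * (1 - p1) * (1 - p2)
    return p11, p10, p01, p00

p11, p10, p01, p00 = model4_probs(alpha=0.3, p1=0.4, p2=0.5)
print(abs(p11 + p10 + p01 + p00 - 1.0) < 1e-12)  # True: cells sum to one
print(abs((p11 + p10) - 0.4) < 1e-12)            # True: p1. reduces to p1
```

Setting alpha = 0 recovers the independent two-list model, in which case the naive estimator (1) is unbiased.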
In the next section we discuss how to estimate these parameters using the maximum likelihood method. For the model given by (4) we decided to use a constrained MLE that maximizes the likelihood function (11) under constraints on θ = (N_A, N_B, α, p_1, p_{2A}, p_{2B}). In the constrained log-likelihood we defined lower and upper bounds for the population sizes N_A and N_B; the lower bounds are equal to the observed counts, denoted by x_{0A} and x_{0B} respectively. In the actual computation we minimize the negative log-likelihood function, which requires the calculation of log(N_A!) and similar terms. To overcome this, we used the approximation log(x!) ≈ x log(x) − x. For more details, see Appendix C. Because (11) is sensitive to the starting points, we decided to use the non-linear Ipopt optimizer (Wächter and Biegler, 2006) implemented in the JuMP.jl module (Dunning et al., 2017) available in the Julia language (Bezanson et al., 2017). In the article, we calculated standard errors using two techniques. First, we derived the Hessian of the log-likelihood (11). Second, we applied a parametric bootstrap: 1. draw, independently for each stratum s = A, B, a sample (x*_{11s}, x*_{10s}, x*_{01s}, x*_{00s}) of total size N̂_s with probabilities (p̂_{11s} = x_{11s}/N̂_s, p̂_{10s} = x_{10s}/N̂_s, p̂_{01s} = x_{01s}/N̂_s, p̂_{00s} = x̂_{00s}/N̂_s); 2. derive N̂_s from the model given by (11). For the total N we derived the variance in the same way, but for N̂ = N̂_A + N̂_B instead of N̂_s. We also considered the 95% confidence interval suggested by Chatterjee and Bhuyan (2020, p. 11): (x_{0s} + (N̂_s − x_{0s})/C, x_{0s} + (N̂_s − x_{0s})C), where s denotes a stratum, C = exp(1.96 √(log(1 + σ̂²_{N̂s}/(N̂_s − x_{0s})²))), and σ̂²_{N̂s} is the estimated variance of N̂_s calculated using either the Hessian or the bootstrap approach. We conducted simulation studies to compare the naive approach given by (1) with the proposed estimator obtained from the likelihood function given by (11).
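The paper solves the full six-parameter problem with Ipopt via JuMP.jl in Julia. As a minimal illustration of the constrained-MLE mechanics only, the sketch below fits the simpler independent two-list model in Python, using the same log(x!) ≈ x log(x) − x approximation and the same kind of lower bound (population size at least the observed count). It profiles out the capture probabilities and searches over integer N; the full model (11) additionally involves α and stratum-specific p_2 parameters.

```python
import math

def profile_neg_log_lik(N, x11, x10, x01):
    """Profile negative log-likelihood of the independent two-list model:
    for fixed N the capture probabilities are profiled out as
    p1 = x1./N and p2 = x.1/N. log(N!) uses the approximation
    x*log(x) - x, as in the paper."""
    x0 = x11 + x10 + x01                      # observed units
    p1 = (x11 + x10) / N
    p2 = (x11 + x01) / N
    stirling = lambda x: x * math.log(x) - x if x > 0 else 0.0
    ll = stirling(N) - stirling(N - x0)       # log( N! / (N - x0)! ), approx.
    for cnt, pr in ((x11, p1 * p2), (x10, p1 * (1 - p2)),
                    (x01, (1 - p1) * p2), (N - x0, (1 - p1) * (1 - p2))):
        if cnt > 0:
            ll += cnt * math.log(pr)
    return -ll

def fit_N(x11, x10, x01, N_max=100000):
    """Constrained search: the lower bound for N is the observed count x0,
    mirroring the constraints on N_A and N_B in the likelihood (11)."""
    x0 = x11 + x10 + x01
    return min(range(x0 + 1, N_max),
               key=lambda N: profile_neg_log_lik(N, x11, x10, x01))
```

For independent lists the maximizer is close to the Lincoln-Petersen value x_{1.}x_{.1}/x_{11}; the small discrepancy comes from the factorial approximation.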
In the first study, we focused on the properties of the estimators by evaluating the bias and the coverage of the confidence intervals under the model assumptions. In the second one, we verified the bias when the main identification assumption, i.e. p_{1A} = p_{1B} = p_1, is not met. These simulations show that the estimator is unbiased if p_{1A} is close to p_{1B}, but as the difference increases, so does the bias. Given the constraints, the bias of the proposed method is lower than that of the naive approach. For details, please consult Appendix E. Table 5 contains descriptive statistics about the DL sampling frame (denoted as 'Frame'), the DL survey results averaged over the whole period (denoted as 'Survey'), and unique entities identified in CBOP and Pracuj for 2018. We present information about the number of entities (in thousands) and proportions by sector of ownership, size and selected NACE sections. As expected, CBOP covers more public sector entities (almost 12%) in comparison with commercial portals, where this share is about 4%. Both sources differ from the survey, but CBOP seems to be closer. As far as entity size is concerned, the majority of companies in the DL population and the survey are small (70% and 58% respectively), while the corresponding proportions in CBOP and Pracuj are considerably smaller: 45% and 24% respectively. This means that the two sources underrepresent small companies, which may have limited budgets for online activities or be less willing to search for employees via administrative sources or online services. The differences between CBOP and Pracuj suggest that both sources cover different sub-populations, which means that the information they provide is likely to be complementary. On the other hand, both sources contain a similar share of medium-sized companies and, compared with the survey results, overrepresent large entities. Finally, we compared these sources across selected NACE sections.
The main difference between them concerns Construction (section F) and Professional, scientific and technical activities (section M). This discrepancy is mainly due to the commercial character of Pracuj, which targets highly-skilled professionals. The distribution of companies advertising in CBOP is consistent with the survey results, with small differences that may not be related to sampling and non-response error. Based on this analysis, it seems that the differences between the three sources are mainly associated with company size. In our estimation, we will use company size as a post-stratification variable for the negative dependence model. To verify over-coverage due to outdated advertisements, we assessed the number of days between the publication date and the end of a given quarter. This result is reported in Table 6. We counted the number of ads in CBOP and Pracuj according to the number of days, where zero means that the publication date is exactly the same as the end of the quarter, up to 10 means that the ad was placed 10 days before the end of the quarter, and so on. We found that most ads posted on Pracuj had been placed within 30 days of the end of the quarter. In order to harmonize the periods and minimize the over-coverage error, we decided to disregard ads posted over 30 days earlier and then counted the number of unique entities. In the next step, we verified whether false-negative and false-positive linkages exist by conducting a clerical review of linked units. We verified identifiers with different entity names and unique names with multiple IDs, but we did not find any units that had been omitted or wrongly linked. To clarify, it is possible that the sampling frame contains entities with the same name but with different IDs if they operate in different regions. In addition, within each quarter we selected a sample of 50 entities from among the matches and of 150 units from the online source.
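The 30-day freshness filter described above can be sketched as follows; the record layout (dictionaries with a "published" date and a "nip" identifier) is an illustrative assumption, not the actual data structure used in the study.

```python
from datetime import date

def days_before_quarter_end(published: date, quarter_end: date) -> int:
    """0 means the ad was published exactly on the last day of the quarter."""
    return (quarter_end - published).days

def filter_fresh_ads(ads, quarter_end, max_age=30):
    """Keep ads published at most `max_age` days before the end of the
    quarter, then collect the unique entities behind them (deduplicated
    by fiscal identifier)."""
    fresh = [a for a in ads
             if 0 <= days_before_quarter_end(a["published"], quarter_end) <= max_age]
    entities = {a["nip"] for a in fresh}
    return fresh, entities
```

Ads published after the quarter end (negative day counts) are also dropped, since they belong to the next reference period.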
Having reviewed this sample, we did not find any false-negative or false-positive cases. The resulting number of unique entities and the estimated values of ĉ = p̂_{11}/p̂_{1.} are presented in Table 7. First, the estimated value of ĉ is very small for both categories of companies, in particular for small & medium-sized ones. This clearly indicates a negative dependence, which means that companies choosing to look for employees through PEOs are not willing to use Pracuj, which may be due to the fact that such services cost money and, also, that their target group is different. Estimation based on the naive Lincoln-Petersen estimator (1) will be biased because of the small number of units observed in both sources. From a practical point of view, and considering the possibility of using these sources to produce official statistics, it is important that the relationship within the groups is constant over time. The estimated value of ĉ is close to 17% and 2% for the whole of 2018. This suggests that the estimated population sizes should be similar over time. We verified other stratification variables and concluded that the proposed grouping (large vs. other) yields the most reliable estimates in comparison with the DL survey and ensures stability over time. For more results see Appendix D (cf. Tables 1 and 3). Secondly, estimates obtained from the proposed model indicate that the survey misses around 10-20% of entities with at least one vacancy. In previous approaches that relied on data from online job boards, the goal was to extract information about job vacancies from online sources in order to supplement data collected from probability-based surveys. This additional information can include required skills or more detailed elements of the job description. Those approaches have compared distributions of online data with those observed in survey data to ensure the representativeness of the former. Our aim was to provide estimates of the number of economic entities with at least one job vacancy without relying on the business sample survey.
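The quantities discussed above can be computed directly from the observed cell counts; a minimal sketch of the dependence measure ĉ = p̂_{11}/p̂_{1.} and the naive dual-system (Lincoln-Petersen) estimator in its standard form:

```python
def c_hat(x11, x10):
    """Estimated dependence measure c = p11 / p1. : the share of entities
    observed in the first source that also appear in the second.
    Values close to 0 indicate negative dependence."""
    return x11 / (x11 + x10)

def lincoln_petersen(x11, x10, x01):
    """Naive dual-system estimator N = x1. * x.1 / x11, which is biased
    when the two sources are dependent."""
    return (x11 + x10) * (x11 + x01) / x11
```

When negative dependence shrinks x_{11}, the denominator of the Lincoln-Petersen estimator collapses and the naive estimate of N explodes upward, which is why the constrained model-based estimator is preferred.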
At the same time, various studies take advantage of online repositories and web-scraping to show skills demand (see e.g. Colombo et al. (2018); Deming and Kahn (2018)). In this article we describe a procedure that can be applied with a view to using such data as a source of statistics about job vacancies. The procedure consists of data collection and preparation, as well as the development of an estimator of the population of economic entities with job vacancies. We showed that job vacancy statistics can benefit from online and administrative sources, which not only provide additional variables, but also serve as the basis for estimating the total number of vacancies. The advantage of using administrative and online data is the rich store of information provided by these sources and the lower cost of data collection in comparison with surveys. The biggest challenge encountered at the initial stages of analysis is associated with the various data structures that need to be processed. During the data preparation stage it is crucial to correctly identify job titles and company names. This proves difficult in practice because of differences between administrative and online sources. The former rely on institutional standards (e.g. job offer codes, official classifications), while the latter are more market-oriented (e.g. are more likely to reflect emerging market trends in skill or occupational terminology). This is why we put much effort into data preparation and linkage. We linked data from administrative records with online data and used dual system estimation in order to estimate the number of economic entities with job vacancies in Poland. We proposed a new capture-recapture estimator, built on the approach proposed by Chatterjee and Bhuyan (2020), which accounts for negatively correlated sources. This approach enabled us to provide estimates of the number of entities with job vacancies based solely on non-probability sources.
Additionally, with our approach, we identified the level of bias due to non-response and under-reporting errors in the DL survey. Our results suggest that the DL survey underestimates the number of economic entities with at least one job vacancy by 10-20%. The results differ significantly depending on the size of the economic entity: the number of medium-sized and large companies with vacancies is underestimated in the survey, while the number of small units is overestimated. Our simulation studies show that the proposed model provides unbiased estimates for all parameters under the model assumptions. Since the methodology of the job vacancy survey in the EU is similar across member states, our approach can be applied in other countries. Moreover, online job boards in different countries rely on similar technology, and at least some countries have detailed administrative data sources on job vacancies (see e.g. Bhuller et al. (2019) for Norway). In our study we also carefully checked the collected job offers against the definitions used in the Demand for Labor survey in Poland. • Reporting unit (abbreviated form: unit) - an entity of the national economy or its local unit, from which data are collected. • Local unit - an organized entity (an enterprise, a division, a branch, etc.) located at a place identified by a separate address, at which or from which the activity is managed by at least one working person. In the process of identifying local units, the following assumptions are adopted in the National Official Business Register. There are large differences between countries regarding reference dates because, as Eurostat points out, there is no international standard for recording job vacancies. The most common approach is to report the number of job vacancies on the last day of a quarter (12 countries). Some countries report data for the middle of the quarter, as an average of three months, or provide information about the flow of vacancies throughout the quarter.
Statistics Poland reports the number of vacancies at the end of a quarter (stock of vacancies), the number of newly created jobs and the number of eliminated jobs in the quarter (flow of vacancies, but without the ISCO breakdown), and the number of newly created jobs in a given quarter measured at its end (stock of vacancies). Both the stock and the flow of vacancies contain valuable information. The stock does not include vacancies that appeared but were filled within a given quarter, and the flow does not include vacancies from previous periods. About 95% of job offers are submitted to PEOs either personally or by phone. CBOP contains domestic job offers registered in PEOs, as well as foreign job offers from the EURES website maintained by the European Commission, collected by REOs, and job offers for younger persons, obtained from the VLC. We are only interested in job offers that refer to workplaces located in Poland; the majority of such offers come from PEOs. According to the ordinance that regulates the functioning of PEOs, every domestic job offer reported to a PEO should contain the following information: 1. employer data: name, address, telephone number, tax identification number, location, and information on whether the employer is a temporary work agency; 2. terms of employment: location and name of the workplace, the number of vacancies, a general description of responsibilities, type of contract (15 types according to Polish labor law), type and level of remuneration, start date, weekly working hours, and information on whether the job is temporary; 3. job requirements: type and level of education, skills, qualifications, required languages with levels, work experience, and whether foreign job seekers can apply for the position; 4.
additional information: date of job offer registration, expiration date, frequency of contacts with the employer or an employee responsible for contacts regarding the job offer, the number of workplaces to be filled, and whether it is a foreign job offer, an internship or an offer for a disabled person. A company can also provide additional information, including: 1. employer data: the NACE section representing the company's main type of activity. Information about a job offer is entered into an internal PEO system called Syriusz, which prevents the acceptance of data that do not meet certain requirements, for example offers with inappropriate PCOS codes, which should contain six digits. Nonetheless, it is possible that some detailed or sensitive data are entered incorrectly or happen to be outliers; these are usually wages and working hours. This information may be entered in a wrong format (a fraction of weekly hours instead of the number of hours) or be simply untrue (wages). Another important detail is the date when the vacancy opens and closes. While the former is not usually problematic, the latter often is. The CBOP database can be used by registered entities. It can be downloaded on a daily basis by means of a public API, provided that an entity is granted access by the Polish Ministry of Development, Labor and Technology. During a basic exploratory analysis of raw CBOP data at the end of each quarter, we found that employers whose job offers were published in 2018 were evenly distributed in terms of legal type (natural vs. legal person). According to Polish legislation, a sole proprietorship is a type of enterprise owned and run by one natural person, but it can also employ other people. In our database the most frequent types of companies with legal personality were private and public limited companies.
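The six-digit PCOS code requirement enforced by the Syriusz system can be expressed as a simple validation rule. This is a sketch of that single check only; the real system performs many more validations, and the function name is ours.

```python
import re

# An acceptable PCOS occupation code consists of exactly six digits,
# per the requirement described in the text.
PCOS_CODE = re.compile(r"^\d{6}$")

def valid_pcos(code: str) -> bool:
    """Return True if the occupation code has exactly six digits."""
    return bool(PCOS_CODE.fullmatch(code.strip()))
```

Rejecting malformed codes at entry time is what prevents some, but not all, of the data-quality problems mentioned above: wages and working hours are free-form values and can still be entered incorrectly.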
The least numerous category included public administration units, such as public kindergartens, schools and universities, municipalities, courts, and research and development institutions. Another small group of entities with legal personality outside public administration included cooperatives, foundations, associations, private schools, etc. The biggest advantage of CBOP (compared with other sources of online job offers) is that PEO staff usually manually classify job titles according to the ISCO-08 occupational classification at the lowest level of specific occupations. The distribution of job offers from CBOP across ISCO codes (aggregated into major 1-digit groups) differed from that observed in the DL survey. Using a structural index (see the V1 index in Jackman and Roper (1987)), we estimated the mismatch index at 0.22, which means that 78% of the two distributions overlapped. Comparing differences between the distributions across major occupational groups, we found higher shares of offers in CBOP in the following major groups: "Service and sales workers", "Elementary occupations", "Technicians and associate professionals" and "Clerical support workers". In the DL survey there were relatively more offers in the following groups: "Craft and related trades workers", "Professionals", "Plant and machine operators and assemblers" and "Managers". In PEOs employers more often sought low-qualified workers than professionals with advanced skills. Taking this into account, it is not sufficient to make statistical inferences about the population of job vacancies based only on job offers obtained from CBOP. Another distinguishing feature of job offers from CBOP was that they were usually poorly described, especially as regards job responsibilities and requirements. Sometimes employers did not even provide information about required qualifications and skills.
Additionally, job descriptions were usually unstructured. To supplement data about vacancies, we considered country-wide online recruitment services with job offers for professionals. We avoided local and employers' websites, because capturing job offers for the whole country would require scraping data from a large number of websites. For the same reason we excluded Internet forums. Data from occupation-specific websites (e.g. nofluffjobs.com for IT occupations) would not be representative for a given industry. Job aggregators contain job offers from various websites but of varying quality, which would be difficult to evaluate. Since our focus was on country-wide online services, the question was which websites to use. According to information obtained from artefakt.pl, about 97.3% of Internet users in Poland use the Google search engine. In addition to PBI/Gemius MegaPanel (2021), we used Google Trends to find the most popular Internet websites with job offers. We searched for "praca" (the Polish word for "job") to examine Google results. Excluding non-country-wide websites, the most frequently used website turned out to be https://www.pracuj.pl/ (Pracuj). Pracuj is owned by a digital recruitment agency, "The Pracuj Group", which operates in Poland and Ukraine. According to the company's own data, in March 2018 its website was visited by 3.1 million users. Pracuj.pl is a job board that offers services for a fee, so employers bear the cost of running a job advertisement. The number of collected job offers is presented in Table 2 (first row). The next step was to extract plain text from the offers. Text content was usually located in different places on the webpage, so in order to extract it we needed to identify these places. This could be done either by using the Chrome extension SelectorGadget or by extracting it directly from the HTML code. A disadvantage of Pracuj is that the same information for different job offers can be located in different places.
Basic information about a vacancy, such as job title, company name, location and dates, is usually located in the same place, but the job description is frequently included within different "div" tags. The web-scraping algorithm had to be prepared for this particular website type. After we identified the key elements/tags of an HTML page, we needed to extract the crucial information about job vacancies from these tags. We parsed the texts of all job offers and converted them into a "data table" format. The crucial features included job title, employer (company) name, location (country, NUTS region, city), publication and expiration dates, and contract type. Because Statistics Poland provides quarterly estimates of labor demand, we calculated the number of job offers at the end of each quarter to make the results comparable. We removed all job offers which did not meet the following conditions: 1. the publication date should be the last day of the quarter or earlier; 2. the expiration date should be the last day of the quarter or later. The number of job offers remaining after removing those that did not meet these conditions is presented in Table 2 (second row). This procedure helped us to significantly reduce the amount of data to be analyzed. As shown in Figure 1, the next step involved checking whether the collected job offers had been extracted properly. For example, all job offers needed to have publication and expiration dates. We made sure that jobs were located in Poland, that each location was appropriately described, etc. If at least one value in a given offer was missing, the offer was removed. In contrast to CBOP, in Pracuj we identified only a small number of job offers with wrong dates. Job offers with missing or inappropriate values (e.g. Cyrillic characters, bad encoding) in the company name or the job title were also removed, as were offers placed by foreign employers (even those located in Poland) without a fiscal identifier (NIP or REGON).
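The two date conditions above can be written as a single predicate: an offer is counted for a quarter exactly when the last day of the quarter falls inside its publication-expiration window.

```python
from datetime import date

def active_at_quarter_end(published: date, expires: date,
                          quarter_end: date) -> bool:
    """An offer counts for the quarter if it was published on or before
    the last day of the quarter and expires on or after that day."""
    return published <= quarter_end <= expires
```

This is a stock measure: offers that both appeared and expired within the quarter are not counted, which mirrors the stock-versus-flow distinction discussed for the survey.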
In Pracuj there were also job offers with hidden company names (i.e. hidden recruitment/anonymous advertisements). Since this made it impossible to identify the employer, such offers were also removed. The number of removed job offers is presented in Table 2 (third row). After removing incomplete job offers, we removed all those that could not be classified as employment offers. We also removed all offers of jobs outside Poland, regardless of whether the employer was domestic or foreign. Next, we removed offers of traineeships, internships and voluntary work. The biggest problem we faced at this stage was related to contract type. For some reason, job offers which had expired (even a few months earlier) did not contain any information about contract type, unlike active (not expired) ones, which contained such information. To identify contract type, we analysed job descriptions. Using regular expressions, we checked whether job descriptions contained the types of contracts described in Section 3. If a job offer mentioned both an employment and a non-employment contract, it was classified as meeting the criteria of a vacancy. It is worth noting that employers frequently offer different types of contracts for employees to choose from. Finally, we removed job offers containing contracts for specific work, contracts of mandate and B2B contracts, if there was no mention of an employment contract in the job description. The number of non-job offers removed at this stage is shown in Table 2 (fifth row). After removing non-job offers, the final set of vacancies was ready for processing. Since the job offers had been composed by different people, they contained natural language, which is characterized by much redundancy.
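The regular-expression classification of contract types can be sketched as follows. The Polish phrase patterns here are illustrative assumptions on our part, not the exact expressions used in the study, and real descriptions would require many more variants.

```python
import re

# Illustrative patterns (assumed, not from the paper): employment
# contract ("umowa o prace") vs. non-employment contracts
# (mandate "umowa zlecenie", specific work "umowa o dzielo", B2B).
EMPLOYMENT = re.compile(r"umow[aąęy] o prac[eę]", re.IGNORECASE)
NON_EMPLOYMENT = re.compile(r"umow[aąęy] (zlecenie|o dzie[lł]o)|\bb2b\b",
                            re.IGNORECASE)

def classify_contract(description: str) -> str:
    """Any mention of an employment contract qualifies the offer as a
    vacancy, even alongside non-employment contracts, mirroring the
    rule described in the text."""
    if EMPLOYMENT.search(description):
        return "vacancy"
    if NON_EMPLOYMENT.search(description):
        return "non-vacancy"
    return "unknown"
```

Offers classified as "non-vacancy" correspond to those removed at this stage; "unknown" offers would need further inspection.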
To make job offers more comparable across the databases (CBOP and Pracuj), the following steps were applied to all character strings within both CBOP and Pracuj databases: First of all, we converted character vectors (regions, company names, job titles etc.) into the lowercase format. Since the R programming language is case sensitive, we needed to eliminate apparent differences between strings: for example, given two versions of one company name "NESTLE" and "Nestle", we converted them both to "nestle". In CBOP, among 10 most frequently occurring company names we identified 3 cases of the same company with different versions of the name resulting from the use of special symbols and different letter case. The third step of processing consisted in removing legal forms from company names. We found that within the same source of data (e.g. CBOP) some company names contain different versions of legal form (e.g. public limited company or PLC). Since Pracuj requires registration, job offers were published from a verified account, so most of them did not contain such differences. Since individual job features were extracted from the HTML code using regular expressions, some HTML tags (e.g.
<h1>... job title ...</h1>) and their elements were also captured. All such tags were removed, as were all special symbols and unusual spaces (e.g. double and triple spaces, white spaces, etc.). We also removed all numeric values from job titles, which usually (particularly in CBOP) contain information about salary/wage, the number of working hours and the offer number. While some numbers in job titles may have been significant, the number of cases in which they were used inappropriately was much higher. To reduce the size of the vocabulary (the number of unique words) in the analysis of job titles and to eliminate differences between words due to declension, lemmatization was performed, which involves reducing inflectional forms of a word to one single form. We also removed some standard and custom stop words to make job titles across the databases more comparable. Job titles in CBOP usually contained a lot of unnecessary information about location, required skills, salary/wage, the number of working hours, the specific job offer number, etc. Job titles in the Pracuj database were more laconic and accurate. We investigated the most and the least frequently occurring words in job titles and identified 300 useless words to be removed. The last step of data processing consisted in replacing Polish letters with diacritics, present in all character features (except for company names), with their Latin variants. After this data preparation we made an attempt to compare job offers in the two data sources. To our surprise, even after data processing there was no guarantee that all job offers would be matched. We found that some offers of potentially the same jobs had different company names. Some examples of such job offers are presented in Table 9. We selected a control sample for each quarter of 2018; in this sample we identified 9 job offers with wrongly assigned NIP numbers.
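The normalization steps described above (lowercasing, removing legal forms, stripping HTML tags and numeric values, transliterating Polish diacritics) can be sketched as follows. The list of legal-form patterns is a small illustrative subset, and the study additionally applied lemmatization and stop-word removal, which are omitted here.

```python
import re
import unicodedata

LEGAL_FORMS = re.compile(r"\bsp\. z o\.o\.|\bs\.a\.|\bplc\b", re.IGNORECASE)
HTML_TAGS = re.compile(r"<[^>]+>")
DIGITS = re.compile(r"\d+")
MULTISPACE = re.compile(r"\s+")

def normalize_company(name: str) -> str:
    """Lowercase the name and strip legal forms, so that e.g. 'NESTLE'
    and 'Nestle S.A.' reduce to the same string (diacritics are kept
    in company names, as in the text)."""
    name = LEGAL_FORMS.sub(" ", name.lower())
    return MULTISPACE.sub(" ", name).strip()

def normalize_title(title: str) -> str:
    """Lowercase, drop HTML tags and numeric values, and replace Polish
    diacritics with their Latin variants."""
    title = HTML_TAGS.sub(" ", title.lower())
    title = DIGITS.sub(" ", title)
    title = unicodedata.normalize("NFKD", title)
    title = "".join(ch for ch in title if not unicodedata.combining(ch))
    title = title.replace("ł", "l")  # 'ł' has no NFKD decomposition
    return MULTISPACE.sub(" ", title).strip()
```

Applying the same pipeline to both CBOP and Pracuj is what makes string-based linkage of the two sources feasible.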
After we identified the NIP numbers of all employers in Pracuj, we used them to remove job offers published by employers that were not included in the sampling frame of the DL survey. The final counts of job offers and employers taken into account in the estimation of the number of job vacancies are presented in Table 3. In the previous subsection we described the data preparation procedure. At the stage of selecting data sources, we chose only one commercial source. At the beginning of our study we considered various data sources, for example OLX, an online marketplace with a section devoted to recruitment. Unlike Pracuj, OLX does not specialise in employment classified ads. The cost of posting a job offer on OLX is much lower than on Pracuj, and the basic service is free. This is why we expected to capture a variety of employers which cannot afford or simply do not want to place ads on Pracuj. The biggest advantage of OLX was that we already had job offers obtained directly from OLX. While Pracuj is a website for employers seeking highly-skilled candidates, OLX is mainly used for recruiting low-qualified persons. In this sense it is similar to CBOP but has a much lower share of offers placed by governmental organisations. During the data preparation stage we identified 9.3 times more job advertisements on OLX than on Pracuj. In the end, we decided not to use OLX for the reasons presented below. While this website may be a rich source of data, its offers did not meet all our criteria for classifying job listings as job vacancies. Unlike in CBOP and Pracuj, job offers on OLX did not usually contain company names. Instead, they gave the names of people responsible for handling inquiries from potential candidates (see examples in Table 10). Unfortunately, these names could not be used to identify specific employers: in the case of "Piotr" alone we identified 94,767 significantly different job offers.
We could analyse job offers on the OLX website, but we could not estimate the number of employers that had posted vacancies or their characteristics. Only for some job offers could company names be extracted from the job description. The second problem we encountered with OLX job offers was the lack of information about job location. Since we were interested in job offers located in Poland, workplaces outside Poland had to be removed. We analysed some job offers where the indicated location was in Poland but found that in fact they described foreign jobs. Table 11 contains some examples of foreign job offers with a Polish location. The log-likelihood function is given by (C.14). For log(x!) we use the approximation x log(x) − x, which yields the log-likelihood (C.15), where θ = (N_A, N_B, α, p_1, p_{2A}, p_{2B}). The first and second derivatives of (C.15) are given in Appendix C; for example, ∂² log L / (∂p_{2B} ∂p_1) = 0 (C.54). Table 13 presents a sensitivity analysis with respect to the selection of stratification variables. Because stratification variables could only have two levels, variables with more levels had to be aggregated (e.g. Size == "Large" and Size != "Large"). For some NACE sections, the public sector of ownership and Mazowieckie province, the estimated number of entities is high, even over 90-100k, which is significantly higher than reported by the DL survey. On the other hand, for Size == "Medium" and NACE == "G", the number is significantly underestimated. That is why we decided to use Size == "Large". In the first simulation, we verified the performance of the naive and the proposed estimators using data that resemble the target population. The simulation was conducted as follows: 1. we generate probabilities according to model (4) given by (5) with the following parameters: • N_A = 50,000, α = 0.05, p_1 = 0.15 and p_{2A} = 0.05, • N_B = 20,000, α = 0.05, p_1 = 0.15 and p_{2B} = 0.15, 2.
in each of 500 iterations: • we independently generate counts (x_11, x_10, x_01, x_00) for each sub-population A and B from the multinomial distribution of sizes N_A and N_B respectively, • we use the vector (x_11A, x_10A, x_01A, x_11B, x_10B, x_01B) to obtain two estimators of N_A and N_B: the naive one and the proposed one; 3. finally, we calculate the expected value, the relative bias and the coefficient of variation. The results of this simulation study are presented in Table 14. As expected, when the observed counts are generated according to the assumption of negative dependence, the proposed model provides unbiased estimates of all parameters. The naive estimator is significantly biased, with relative bias increasing with population size. Table 15 shows the coverage of the confidence intervals based on the estimated standard errors and of the suggested interval given by (13). As can be seen, the confidence intervals have similar lower and upper bounds, and their coverage, based on 10,000 replications, is consistent with the nominal 95% level. In the second simulation study we checked how the proposed method performs when the main assumption, i.e. p_1A = p_1B = p_1, is violated, and analysed its impact on the bias and root mean square error (RMSE) of N̂_A, N̂_B and N̂ = N̂_A + N̂_B. We assumed the same population sizes and α as in Simulation 1, and considered three scenarios for the second-list capture probabilities, for example p_2A = p_2B = 0.15, with varying p_1A and p_1B. If the values of p_1A and p_1B are close to the nominal value of 0.15, the bias disappears. On the other hand, when p_1A ≠ p_1B, bias is present (this result also holds for other values of p_1A and p_1B), as presented in Figure 4. Thus, the estimator is not robust when the main assumption is not met, but as we use a constrained MLE, the bias is lower than that of the naive estimator.
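One replication of the data-generation step above can be sketched in pure Python as follows (a vectorized multinomial generator would be used in practice; helper names are ours). The cell probabilities follow the negative-dependence model (4).

```python
import random

def cell_probs(alpha, p1, p2):
    """Cell probabilities (p11, p10, p01, p00) for one stratum under the
    negative-dependence model."""
    return [(1 - alpha) * p1 * p2,
            alpha * p1 + (1 - alpha) * p1 * (1 - p2),
            alpha * (1 - p1) + (1 - alpha) * (1 - p1) * p2,
            (1 - alpha) * (1 - p1) * (1 - p2)]

def draw_counts(N, alpha, p1, p2, rng):
    """One multinomial draw of (x11, x10, x01, x00) for a stratum of size N."""
    probs = cell_probs(alpha, p1, p2)
    counts = [0, 0, 0, 0]
    for _ in range(N):
        u, acc = rng.random(), 0.0
        for k, pr in enumerate(probs):
            acc += pr
            if u < acc:
                counts[k] += 1
                break
        else:  # guard against floating-point shortfall in the cumulative sum
            counts[3] += 1
    return counts

def relative_bias(estimates, true_value):
    """Relative bias over replications, as reported in the simulation tables."""
    return (sum(estimates) / len(estimates) - true_value) / true_value
```

Repeating `draw_counts` for strata A and B over 500 iterations and feeding each draw to the two estimators reproduces the structure of the study's Monte Carlo design.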
References
Help-wanted advertising, job vacancies, and unemployment
AI and jobs: Evidence from online vacancies
Using information from demographic analysis in post-enumeration survey estimation
Inferring job vacancies from online job advertisements, Publications Office of the European Union
Julia: A fresh approach to numerical computing
How broadband internet affects labor market matching
Structural increases in demand for skill after the Great Recession
Capture-recapture methods for the social and medical sciences
The identity of the zero-truncated, one-inflated likelihood and the zero-one-truncated likelihood for general count densities with an application to drink-driving in Britain
Understanding online job ads data
On the estimation of population size from a dependent triple-record system
On the estimation of population size from a post-stratified two-sample capture-recapture data under dependence
An improved estimator of omission rate for census count: with particular reference to India
A new integrated likelihood for estimating population size in dependent dual-record system
Identifying the direction of behavioral dependence in two-sample capture-recapture study
On the estimation of population size under dependent dual-record system: an adjusted profile-likelihood approach
Applying machine learning tools on web vacancies for labour market and skill analysis
Log-linear models for capture-recapture
Earnings dynamics, changing job skills, and STEM careers
Skill requirements across firms and labor markets: Evidence from job postings for professionals
Coverage evaluation on probabilistically linked data
Population size estimation and linkage errors: the multiple lists case
Dual system estimation of census undercount in the presence of matching error
JuMP: A modeling language for mathematical optimization
Labor demand in the time of COVID-19: Evidence from vacancy postings and UI claims
Sensitivity of population size estimation for violating parametric assumptions in log-linear models
Job search during the COVID-19 crisis
Do recessions accelerate routine-biased technological change? Evidence from vacancy postings
The concept of job vacancies in a dynamic theory of the labor market, in 'The measurement and interpretation of job vacancies'
On vacancies
Structural unemployment
Optimal stratification using random search method in agricultural surveys
Optimal stratification and sample allocation between subpopulations and strata
Analysis of complex survey samples
Mismatch unemployment and the geography of job search
Opening the black box of the matching function: The power of words
Upskilling: Do employers demand greater skill when workers are plentiful?
United Nations, Department of Economic and Social Affairs
People born in the Middle East but residing in the Netherlands: Invariant population size estimates and the role of active and passive covariates
On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming
Connecting correction methods for linkage error in capture-recapture
Some coverage error models for census data
Triple-system modeling of census, post-enumeration survey, and administrative-list data
On modelling register coverage errors
A note on dual system population size estimator
Trimmed dual system estimation, in 'Capture-Recapture Methods for the Social and Medical Sciences