[on-going] Project Guideline & My Proposal

밤 편지 2025. 4. 23. 07:09

Expected Time Line


Requirement

1. 20-minute presentation

2. Report (5 pages excluding tables, figures, and references).

Formats: Applied Data Analysis

1. Formulate a causal research question using a publicly available dataset.

2. Choose and justify an appropriate method.

3. Analyze and interpret results.

4. Implement and discuss robustness checks.

5. Relate findings to existing literature and articulate limitations.


proposal 

1. Background & Motivation

1) causal research question 

As Korea's birth rate continues to fall, the issue has become serious enough that even foreign media are reporting on the country's potential disappearance. However, systematic national-level efforts to develop policy responses are lacking. Addressing low birth rates requires formulating and consistently implementing policies based on a thorough analysis of how effective they are at raising the birth rate. In this context, we need to determine how much the recently implemented childbirth subsidy policies have actually contributed to an increase in the birth rate.

 

In this paper, we attempt to answer this causal question. Specifically, to see whether a childbirth subsidy can boost the birth rate, we focus on Incheon City, which implemented an unconventional support policy: providing 100 million won per newborn. We treat Incheon City as the treatment group and estimate the causal effect by comparing its birth rate before and after the intervention with that of other cities.

 

A simple approach is to compare the birth rate before and after the policy was implemented; an increase would be some evidence that the policy is effective. However, a more careful approach is needed to show that the increase reflects a policy effect rather than a pre-existing trend. In other words, we need to estimate the counterfactual: what would have happened if Incheon City had not implemented its childbirth subsidy policy. To estimate it, we leverage several powerful causal inference techniques: Difference-in-Differences, Synthetic Control, and a recently developed method that builds on both. Finally, we discuss the results from these methods.

 

★ The target population for the causal effect and the scope of the treatment are not clearly defined yet. I will decide after looking at the data. If it really doesn't work out, I may change the project topic!

2) related work (previous work)

* policy evaluation

* birth rate effect estimation 

(Need to look for papers that do policy evaluation using DiD, SC, etc. / prior studies that analyze the effect of childbirth subsidy policies.)


2. Methods

1) Difference-in-Differences (DiD)

When we observe periods before and after an intervention and wish to untangle the impact of the intervention from a general trend, one technique for answering this type of question is Difference-in-Differences (ref). Diff-in-Diff is commonly used to estimate the effect of policy interventions, such as the impact of a new healthcare law (ref) or the effects of paid family leave (ref).

 

The idea is to use another city as a control group to estimate the counterfactual for Incheon City. Specifically, the DiD estimator is an imputation strategy for what the birth rate of Incheon City would have been had the policy not been implemented. The counterfactual outcome for Incheon City after the intervention is imputed as the birth rate of Incheon City before the intervention plus a growth factor, which is estimated from the change in the control city over the same period.
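
The imputation described above can be sketched in a few lines. This is only a minimal illustration on city-level averages: the function name `did_estimate` is hypothetical and the birth rates are made-up numbers, not real data.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 Difference-in-Differences on group-level averages."""
    # Growth factor: the change observed in the control city.
    growth = control_post - control_pre
    # Imputed untreated outcome for the treated city after the intervention.
    counterfactual = treated_pre + growth
    # DiD estimate: observed outcome minus imputed counterfactual.
    effect = treated_post - counterfactual
    return counterfactual, effect


# Made-up birth rates, for illustration only (not real data).
counterfactual, effect = did_estimate(
    treated_pre=0.75, treated_post=0.80,
    control_pre=0.70, control_post=0.65,
)
print(round(counterfactual, 3), round(effect, 3))
```

With more than one control city, `control_pre` and `control_post` would be averages over the control cities.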

 

It is important to note that this assumes the trends in the treatment and control groups are the same (the parallel trends assumption). If the birth rate trend in Incheon City differs from that of the control, Diff-in-Diff will be biased. (ref: violation of parallel trend -> biased)

 

We check whether this assumption is plausible by plotting the birth rate trends over the pre-intervention periods. If the pre-intervention trends diverge, we would know that Diff-in-Diff is not a reliable estimator for our case. (attach plot)
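
Alongside the plot, a rough numeric check is to look at the gap between the two series in each pre-intervention period: under parallel trends the gap should stay roughly constant. The sketch below assumes a single control series; the helper names are hypothetical and the numbers are made up.

```python
def pre_trend_gaps(treated, control):
    """Gap between treated and control outcomes in each pre-intervention period."""
    return [t - c for t, c in zip(treated, control)]


def gap_changes(gaps):
    """Period-to-period change in the gap; values near zero suggest parallel trends."""
    return [later - earlier for earlier, later in zip(gaps, gaps[1:])]


# Made-up pre-intervention birth rates (illustration only).
incheon_pre = [1.00, 0.95, 0.90, 0.85]
control_pre = [0.90, 0.85, 0.80, 0.75]

gaps = pre_trend_gaps(incheon_pre, control_pre)
changes = gap_changes(gaps)
print([round(c, 3) for c in changes])
```

This is a descriptive diagnostic, not a formal test; what counts as "near zero" has to be judged against the size of the effect we expect.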

 

Another issue is that we cannot place confidence intervals around the Diff-in-Diff estimator, since we only have aggregated data. In our case, we only have access to the average birth rate at the city level, not individual-level data. Therefore, if we estimate the causal effect with Diff-in-Diff, we will not know its variance, and consequently we cannot say how precise our estimate is.

2) Synthetic Control (SC)

To address the problems mentioned above, we leverage another powerful causal inference tool: synthetic control (ref). This method uses multiple cities to create a synthetic city that closely follows the trend of the city of interest, Incheon City.

 

To answer the question of whether the childbirth subsidy policy boosted the birth rate, we use the pre-intervention period to build a synthetic control. We combine the other cities (the donor pool) into a fake city whose trend closely resembles that of Incheon City; a combination of units in the donor pool may approximate the characteristics of the treated unit much better than any untreated unit alone. Then, we see how this synthetic control behaves after the intervention. The difference between Incheon City and the synthetic control is the estimated treatment effect. The treatment effect is defined for each period, which means it can change over time; it does not need to be instantaneous.
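
To make the weighting step concrete, here is a minimal sketch that fits donor weights on pre-intervention outcomes only. It assumes the usual constraints (weights are non-negative and sum to one) and, to avoid extra dependencies, solves the least-squares problem with a simple Frank-Wolfe loop. That solver is my own choice for illustration, not the method of any particular package, and the function name and data are hypothetical.

```python
def synthetic_control_weights(treated_pre, donors_pre, n_iter=2000):
    """Donor weights (non-negative, summing to one) whose weighted combination
    tracks the treated unit over the pre-intervention periods.

    treated_pre: list of T pre-period outcomes for the treated unit.
    donors_pre:  list of J lists, each with T pre-period outcomes for one donor.
    """
    n_donors, n_periods = len(donors_pre), len(treated_pre)
    weights = [1.0 / n_donors] * n_donors  # start from equal weights
    for _ in range(n_iter):
        synth = [sum(w * d[t] for w, d in zip(weights, donors_pre))
                 for t in range(n_periods)]
        resid = [s - y for s, y in zip(synth, treated_pre)]
        # Gradient of the squared pre-period error w.r.t. each donor weight.
        grad = [2 * sum(r * d[t] for t, r in enumerate(resid))
                for d in donors_pre]
        best = min(range(n_donors), key=lambda j: grad[j])
        # Move the synthetic unit toward the donor with the steepest descent.
        direction = [donors_pre[best][t] - synth[t] for t in range(n_periods)]
        denom = sum(d * d for d in direction)
        if denom == 0:
            break
        # Exact line search, clipped so the weights stay valid.
        step = -sum(r * d for r, d in zip(resid, direction)) / denom
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break
        weights = [(1 - step) * w for w in weights]
        weights[best] += step
    return weights


# Made-up pre-intervention birth rates (illustration only).
donors = [[1.0, 0.9, 0.8], [0.6, 0.6, 0.5], [1.4, 1.5, 1.6]]
incheon = [0.8, 0.75, 0.65]  # constructed as 0.5 * donor 0 + 0.5 * donor 1
weights = synthetic_control_weights(incheon, donors)
synthetic = [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(3)]
print([round(w, 2) for w in weights])
```

The per-period effect is then the observed post-intervention outcome of Incheon City minus the same weighted combination of the donors' post-intervention outcomes.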

 

With synthetic control, we can also do inference. In the spirit of Fisher's exact test (ref), we generate placebo effects, i.e. effects estimated for units that did not receive the treatment, and check whether the treatment effect for Incheon City is statistically significant compared with these placebos.

3) Synthetic-Diff-in-Diff (SDID)

Synthetic-Diff-in-Diff (SDID) (ref) draws inspiration from both Diff-in-Diff and Synthetic Control and combines advantages of both. Like SC, SDID still works with multiple periods when pre-treatment trends are not parallel. Like DiD, SDID uses time and unit fixed effects, which explain much of the variance in the outcome and in turn reduce the variance of the SDID estimator. On top of that, SDID introduces several new ideas: an L2 penalty in the optimization of the unit weights, which spreads them more evenly across control units, and time weights, which are present in neither DiD nor SC.
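
With a single treated unit and a common treatment date, the SDID estimate can be written as a weighted double difference. The sketch below covers only that final step and takes the unit and time weights as given; in SDID they would be estimated from the pre-treatment data. The function name and numbers are hypothetical.

```python
def sdid_effect(treated, controls, unit_weights, time_weights, n_pre):
    """Weighted double difference for one treated unit.

    treated:      outcomes of the treated unit over all periods.
    controls:     list of outcome series, one per control unit.
    unit_weights: one weight per control unit (summing to one).
    time_weights: one weight per pre-treatment period (summing to one).
    n_pre:        number of pre-treatment periods.
    """
    def post_minus_weighted_pre(series):
        post = series[n_pre:]
        post_mean = sum(post) / len(post)
        pre_weighted = sum(lam * y for lam, y in zip(time_weights, series[:n_pre]))
        return post_mean - pre_weighted

    treated_change = post_minus_weighted_pre(treated)
    control_change = sum(w * post_minus_weighted_pre(c)
                         for w, c in zip(unit_weights, controls))
    return treated_change - control_change


# Made-up data: two pre-treatment periods, one post-treatment period.
incheon = [1.0, 0.9, 1.0]
others = [[0.8, 0.7, 0.6], [1.2, 1.1, 1.0]]

# With uniform weights the formula reduces to plain Diff-in-Diff.
effect = sdid_effect(incheon, others, unit_weights=[0.5, 0.5],
                     time_weights=[0.5, 0.5], n_pre=2)
print(round(effect, 3))
```

Replacing the uniform weights with estimated ones is what lets SDID put more emphasis on the control units and pre-treatment periods that best match the treated unit.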

We can construct a confidence interval by placebo variance estimation. We run a series of placebo tests in which we pretend that a unit from the control pool is treated when it actually is not, and use SDID to estimate the ATT for this placebo unit. After repeating this step many times, we use the variance of the placebo estimates to do inference on the SDID effect estimate.
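
Once the placebo estimates are available, the interval itself is a short calculation. The sketch below assumes a normal approximation (hence the default `z=1.96` for a 95% interval) and uses made-up placebo estimates; the function name is hypothetical.

```python
import math


def placebo_confidence_interval(effect, placebo_effects, z=1.96):
    """Confidence interval from the variance of placebo effect estimates."""
    mean = sum(placebo_effects) / len(placebo_effects)
    variance = sum((p - mean) ** 2 for p in placebo_effects) / len(placebo_effects)
    standard_error = math.sqrt(variance)
    return effect - z * standard_error, effect + z * standard_error


# Made-up numbers (illustration only): one estimate per placebo run.
placebo_effects = [-0.02, 0.01, 0.03, -0.01, -0.01]
lower, upper = placebo_confidence_interval(0.10, placebo_effects)
print(round(lower, 3), round(upper, 3))
```

In practice far more than five placebo runs would be needed for the variance estimate to be stable.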


3. Data

We want to estimate the effect of the childbirth subsidy policy on the birth rate in Incheon City in 2023 (####~####). To assess this effect, we gather birth rate data from multiple cities across a number of years. In our case, we have data from #### to #### for ## cities.

 

* Exploring - need to look at the characteristics of Incheon / other cities

 

* Plot - yearly birth rate, Incheon vs other cities

By examining the plot, we can guess that the childbirth subsidy policy is effective. It looks like after the intervention, the birth rate in Incheon City is growing compared to other cities. -------> This is what I want to say, but I need to look at the data first. I shouldn't hold such wishful expectations before even seeing the data, but honestly this is what I'm hoping for, lol.

 

And the problem is that other cities also run childbirth subsidy policies to some extent, so the key question is how to construct the control group. Hmm.

 

Is it possible?

 

If not.. I'll have to change the data..

 


4. Analysis

1) Empirical Specifications

 

2) Graphical results

 

3) Results

 

4) Robustness check

* placebo test 

- We can do inference by permuting units: we pretend that each control unit was treated and estimate its effect. This is referred to as a placebo test, since these units never actually received the treatment. If the estimated effect for the treated unit is larger than most of the placebo effects, we say that the effect estimate is significant.
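
"Larger than most of the placebo effects" can be turned into a permutation p-value. The sketch below assumes a two-sided comparison on absolute effects and counts the treated unit itself among the permutations; both are choices made for illustration, and the numbers are made up.

```python
def placebo_p_value(treated_effect, placebo_effects):
    """Share of units (placebos plus the treated one) whose absolute effect
    is at least as large as the treated unit's."""
    all_effects = [treated_effect] + list(placebo_effects)
    n_extreme = sum(abs(e) >= abs(treated_effect) for e in all_effects)
    return n_extreme / len(all_effects)


# Made-up effect estimates (illustration only): nine placebo cities.
placebos = [-0.02, 0.01, 0.03, -0.01, 0.12, -0.04, 0.02, 0.00, -0.03]
p_value = placebo_p_value(0.10, placebos)
print(p_value)
```

With J control cities the smallest attainable p-value is 1 / (J + 1), so a small donor pool limits how significant the result can ever look.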

 

* conformal inference

- Another method for inference is to recast effect estimation as a counterfactual prediction problem. If we do that, we can leverage conformal prediction for inference (ref).

 

The key idea is to generate data under the null hypothesis we want to test and check the residuals of a model for the counterfactual Y(0) on this generated data. If the residuals are too extreme, we say that the data are unlikely to have come from the null hypothesis we postulated.

 

The first step is to generate data under the null hypothesis. This is achieved by subtracting the postulated null effect from the outcome of the treated unit in the post-intervention period. The next step is to fit a model for the counterfactual Y(0) on the entire data, covering both the pre- and post-treatment periods. With this model, we then compute residuals for all time periods t. The idea is to see whether the residuals in the post-intervention period are too large. If they are, the data are unlikely to have come from this null (for example, the null that the effect is zero).

 

We define a test statistic that summarizes how large the residuals are and hence how unlikely the observed data are under the null. This statistic is computed using only the post-intervention residuals. High values of the test statistic indicate poor post-intervention fit and hence rejection of the null. To judge how large the post-intervention residuals are, we compute a p-value by comparing them against the pre-intervention residuals.

 

We block-permute the residuals and calculate the test statistic in each permutation. This gives T test statistics, one for each block permutation. The p-value is the proportion of block permutations whose test statistic is at least as large as the unpermuted test statistic. We can build a confidence interval by testing a series of null hypotheses and keeping the effect values whose p-values exceed the significance level.
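
The steps above can be sketched as follows. The sketch assumes cyclic shifts as the block permutations and a statistic of the form (sum of |u|^q over post-period residuals / sqrt(number of post periods))^(1/q). It takes the residuals of the counterfactual model as given rather than fitting one. All names and numbers are hypothetical.

```python
import math


def impose_null(treated, n_post, null_effect):
    """Step 1: subtract the postulated effect from the post-period outcomes."""
    n_pre = len(treated) - n_post
    return treated[:n_pre] + [y - null_effect for y in treated[n_pre:]]


def residual_statistic(post_residuals, q=1):
    """Summary of how large the post-intervention residuals are."""
    total = sum(abs(u) ** q for u in post_residuals)
    return (total / math.sqrt(len(post_residuals))) ** (1 / q)


def block_permutation_p_value(residuals, n_post, q=1):
    """Share of cyclic shifts whose statistic is at least as large as the
    statistic of the unpermuted residuals (shift 0 is the identity)."""
    n = len(residuals)
    observed = residual_statistic(residuals[n - n_post:], q)
    shifted_stats = []
    for shift in range(n):
        rolled = residuals[shift:] + residuals[:shift]
        shifted_stats.append(residual_statistic(rolled[n - n_post:], q))
    return sum(s >= observed for s in shifted_stats) / n


# Made-up residuals from a counterfactual model fitted under the null of
# zero effect: eight pre-intervention periods, then two post periods.
residuals = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.1, 2.0, 2.5]
p_value = block_permutation_p_value(residuals, n_post=2)
print(p_value)
```

To get a confidence interval, the same procedure would be repeated over a grid of postulated effects: call `impose_null`, refit the model, recompute the residuals and the p-value for each value, and keep the values that are not rejected.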

Since the effect of the intervention is heterogeneous over time, we construct a confidence interval for the effect in each post-treatment period individually via the conformal inference procedure above.

 

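Later I'll run this through GPT and get it proofread too, but I'm not sure whether the meaning comes across or whether I've written the sentences properly. I'd like to ask someone to read this and tell me whether it's understandable.. ;;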


5. Conclusion

1) limitations & discussion

 

2) future works

 


6. Reference