Results and Discussion

Hypothesis Testing

Generally, to test the hypotheses, the following steps are necessary:

Compare mean/median scores
Check test assumptions
Conduct hypothesis test
Calculate effect size
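
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the thesis's actual procedure. The test (a one-sided permutation test on the mean difference), the effect size (Cohen's d with a pooled standard deviation), the variance-ratio check, all function names, and the synthetic scores are assumptions chosen for the example.

```python
import random
import statistics
from math import sqrt


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (assumed effect size measure)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


def permutation_p_value(a, b, n_resamples=1000, seed=0):
    """One-sided p-value for mean(a) > mean(b), estimated by shuffling group labels."""
    rng = random.Random(seed)
    na, nb = len(a), len(b)
    observed = sum(a) / na - sum(b) / nb
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = sum(pooled[:na]) / na - sum(pooled[na:]) / nb
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def compare_groups(ir, nir, alpha=0.05):
    # Step 1: compare mean scores.
    mean_ir, mean_nir = statistics.mean(ir), statistics.mean(nir)
    higher = "IR" if mean_ir >= mean_nir else "NIR"
    # Step 2: crude assumption check (hypothetical heuristic): ratio of variances.
    variances = (statistics.variance(ir), statistics.variance(nir))
    variance_ratio = max(variances) / min(variances)
    # Step 3: test in the direction indicated by the means.
    p = permutation_p_value(ir, nir) if higher == "IR" else permutation_p_value(nir, ir)
    # Step 4: effect size, reported as an absolute value.
    d = abs(cohens_d(ir, nir))
    return {
        "higher": higher,
        "variance_ratio": variance_ratio,
        "p_value": p,
        "effect_size": d,
        "significant": p < alpha,
    }


# Synthetic scores for demonstration only; these are NOT the thesis data.
rng = random.Random(42)
ir_scores = [rng.gauss(0.26, 0.12) for _ in range(100)]
nir_scores = [rng.gauss(0.28, 0.18) for _ in range(100)]
print(compare_groups(ir_scores, nir_scores))
```

Testing in the direction suggested by the means mirrors the order of steps listed above. A permutation test was chosen for the sketch because it needs no distributional assumptions and only the standard library.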

All results are reported below except the checks of the test assumptions, which are documented in the thesis.

Hypothesis 1

Measure	IR	NIR
Mean	0.2612	0.2775
SD	0.1249	0.1788

The mean scores point in the opposite direction, so hypothesis H1 is rejected. Instead, the alternative hypothesis H1b (NIRs are more positive than IRs) is tested.

Measure	Value
p	0.0103
d	0.047

With p < 0.05, the difference between the groups is statistically significant (H1b is confirmed). Still, with a very small effect size of 0.047, the difference is considered to have no practical relevance.

Hypothesis 2

Measure	IR	NIR
Mean	5.2486	5.2886
SD	0.2718	0.4158

The mean scores point in the opposite direction, so hypothesis H2 is rejected. Instead, the alternative hypothesis H2b (NIRs are more complex than IRs) is tested.

Measure	Value
p	0.0233
d	0.04

With p < 0.05, the difference between the groups is statistically significant (H2b is confirmed). Still, with a very small effect size of 0.04, the difference is considered to have no practical relevance.

Hypothesis 3

Measure	IR	NIR
Mean	475.8192	327.6587
SD	311.3817	280.9906

The mean tendency supports the initial hypothesis.

Measure	Value
p	0.00
d	0.64

With p < 0.001¹, the difference between the groups is statistically highly significant (H3 is confirmed). Moreover, with an effect size of 0.64, the difference is considered to have considerable practical relevance.

Hypothesis 4

Measure	IR	NIR
Median	5	4

In this case, the median scores are compared because the variable is ordinal. The medians point in the opposite direction, so hypothesis H4 is rejected. Instead, the alternative hypothesis H4b (NIRs are less extreme than IRs) is tested.

Measure	Value
p	0.0001

With p < 0.001, the difference between the groups is statistically highly significant. To interpret this result, the relative frequencies of the 1- and 5-star categories are compared:

Review Type	⭐	⭐⭐⭐⭐⭐
IR	0.965	52.648
NIR	1.375	47.989

The relative frequencies show mixed results: while there are indeed fewer 1-star reviews in the IR sample, there are also more 5-star reviews. Thus, hypothesis H4b is rejected.
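
The reasoning behind this decision can be sketched as a small check. The function names and the rule itself (a group counts as less extreme only if its share is smaller in both the 1-star and the 5-star category) are assumptions made for illustration. The two dictionaries hold the values from the table above.

```python
from collections import Counter


def relative_frequencies(ratings):
    """Share of reviews per star rating (1-5), in percent."""
    counts = Counter(ratings)
    total = len(ratings)
    return {star: 100 * counts.get(star, 0) / total for star in range(1, 6)}


def less_extreme(candidate, reference):
    """True only if candidate has a smaller share in BOTH extreme categories."""
    return candidate[1] < reference[1] and candidate[5] < reference[5]


# Values copied from the table above (1-star and 5-star categories only).
ir = {1: 0.965, 5: 52.648}
nir = {1: 1.375, 5: 47.989}

print(less_extreme(nir, ir))  # False: NIRs have fewer 5-star but more 1-star reviews
```

With raw star ratings at hand, `relative_frequencies` would produce the per-category shares that feed into `less_extreme`.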

Hypothesis 5

Measure	IR	NIR
Mean	91.7514	91.3303
SD	2.0898	3.8464

The mean tendency supports the initial hypothesis.

Measure	Value
p	0.498

With p > 0.05, the difference between the groups is not statistically significant. Thus, H5 is rejected.

Summary

Hypothesis	Status	Relevant?
H1: IRs are more positive than NIRs	Rejected	-
H1b: NIRs are more positive than IRs	Confirmed	No
H2: IRs are more complex than NIRs	Rejected	-
H2b: NIRs are more complex than IRs	Confirmed	No
H3: IRs are more elaborate than NIRs	Confirmed	Yes
H4: IRs are less extreme than NIRs	Rejected	-
H4b: NIRs are less extreme than IRs	Rejected	-
H5: IRs are more objective than NIRs	Rejected	-

Note: “Status” reflects the outcome of either a) the comparison of mean / median scores or b) the hypothesis tests. The column “Relevant?” refers to the effect size (if computed) and whether the significant difference is considered to be relevant.
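
How the reported p-values and effect sizes map onto the two columns can be sketched as follows. The function `assess` and the relevance threshold of 0.2 are assumptions made for this example, since no explicit cut-off is stated here. The p-value for H3 is entered as its reported upper bound of 0.001.

```python
def assess(p_value, effect_size=None, alpha=0.05, min_effect=0.2):
    """Return (status, relevant) for a tested hypothesis.

    min_effect is an assumed threshold for practical relevance.
    """
    status = "Confirmed" if p_value < alpha else "Rejected"
    if status == "Rejected" or effect_size is None:
        relevant = "-"
    else:
        relevant = "Yes" if abs(effect_size) >= min_effect else "No"
    return status, relevant


# Reported values: (p, effect size); None where no effect size was reported.
reported = {
    "H1b": (0.0103, 0.047),
    "H2b": (0.0233, 0.04),
    "H3": (0.001, 0.64),  # reported as p < 0.001; upper bound used here
    "H5": (0.498, None),
}

for name, (p, d) in reported.items():
    print(name, *assess(p, d))
# H1b Confirmed No
# H2b Confirmed No
# H3 Confirmed Yes
# H5 Rejected -
```

H1, H2, and H4 are not covered by this rule because they were rejected on the direction of the mean or median scores alone, and H4b was decided on the relative frequencies rather than on the p-value.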

Only one of the five initial hypotheses could be confirmed: H3. The other four hypotheses were rejected. The differences tested under the two alternative hypotheses H1b and H2b were found to be statistically significant, but the effect sizes indicate that these differences have no practical relevance. In the case of H4b, a significant difference was found, but with respect to the relative frequencies of the two extreme star rating categories, the hypothesis was rejected.

Finally, the following table summarizes the hypotheses concepts and whether a relevant difference was found:

Hypothesis Concept	Difference?
Positivity	❌
Complexity	❌
Elaborateness	✅
Extremeness	❌
Objectivity	❌

This is the basis for the following discussion.

Discussion

Hypothesis 1
Book reviewers do not feel obliged to give anything more in return for the free copy than their opinion. Moreover, positive publicity might be more relevant for other product types than for books; for books, even negative publicity might be valuable. This could explain why IRs and NIRs do not differ with respect to positivity.

Hypotheses 2 & 3
Book reviewers might not be aware of the possible danger of adding a disclosure statement to their reviews; thus, the underlying assumption of a “self-fulfilling prophecy” can be rejected. At the same time, the aforementioned norm of reciprocity could explain the confirmation of H3.

Hypothesis 4
Reviewer motivations, while certainly different for incentivized and non-incentivized reviewers, are not reflected by extreme star ratings.

Hypothesis 5
It can be assumed that book reviewers develop a uniform writing style in their reviews. Therefore, it does not make a difference whether the review is published shortly after the product experience or not.

Conclusion

❓ Do incentivized book reviews show signs of influence if the reviewer received a free book copy?

❗ Incentivization indeed impacts the contents of book reviews, but the only form of impact that has been found is an influence on review elaborateness (in terms of review length). At the same time, book reviews do not differ with respect to positivity, complexity, extremeness, and objectivity.

However, the phenomenon of “influence” needs further investigation because there might be more dimensions than just the five considered in this thesis. Also, a conclusion such as “longer reviews are influenced” would be an oversimplification.

Limitations
The findings are only valid with respect to this study’s product type, genre, language, reviewing platform, and time frame.

Further Research Perspectives

different formalisation and operationalisation of the concepts
repeat analysis with sentence-based data
try to avoid misclassification of NIRs
use different NIR sample
avoid biases by analyzing intra-reviewer or intra-book differences
analyze a different genre
derive hypotheses from book market-specifics etc.

¹ Note that the p-value is not exactly zero. This stems from rounding the values.