Friday, June 28, 2019

IEA IRC 2019: Day 3

The final day of the IEA IRC 2019 started with Anna Rosling Rönnlund's talk "For a fact-based worldview". Of course, I had been looking forward to this; her book "Factfulness" (with Hans Rosling and Ola Rosling) is a wonderful eye-opener. She started by telling how the three of them have spent many years trying to make numbers more understandable and easily available. Then she went through our results on the "test" (from Factfulness). As it turns out, 43 % of this conference did better than chimpanzees on these questions, while in representative surveys only 10 % do better than chimps. On average, we got 4.2 points out of 12, which is 0.2 points better than an average chimp (a chimp picking randomly among the three options would average 12/3 = 4 points). (In representative surveys, the average is 2. Personally, I think having read Factfulness helps a lot...)

People tend to answer systematically wrong - they think the situation is worse than it is. (Moreover, if I remember the book correctly, people tend to answer "correctly" for the situation fifty years ago. People should not learn about the world by heart and then think that they never have to update their worldview.) Looking out of the window, we don't see the slow global trends. In newspapers, we see diseases and catastrophes - the extraordinary events.

She spent some time on Dollar Street - where differences in living standards are illustrated by lots of photos from 350 homes around the world (so far). She showed how families in different countries are amazingly similar when they live on the same income level, while the diversity within each country is striking. For instance, as income doubles from 2 to 4 dollars a day, people tend to prioritize the same things regardless of continent: a stove instead of a fire, a toilet instead of no toilet, a mattress instead of sleeping on the floor, and so on.

Towards the end, she gave a few of the rules of thumb for critical thinking that are also in the book. I highly recommend reading it...

Finally, she showed some animations using TIMSS data. They were interesting, but as hung up as we are on confidence intervals, it would be good to have a way of marking uncertainty, to avoid making a point of differences that are not actually differences or trends that are not trends. But she stressed that the tool should be seen as a hypothesis-generating tool, for noticing things that seem interesting. Of course, we then need the statistical models to investigate the hypotheses.

After this, the penultimate session I went to was "TIMSS, PIRLS and ICILS: Utilizing in-depth analysis of large-scale assessment data to improve teaching". The first talk here was Franck Salles: "Clarifying TIMSS Advanced mathematics 2015 results: A didactical approach through levels of mathematics knowledge operation". The ministry of education in France does detailed analyses of task performance to inform inspectors and teacher educators. In TIMSS Advanced, there was a 1 SD drop in French students' performance from 1995 to 2015. There had been large maths curriculum changes in that period. Of the three cognitive domains, France does relatively badly in "applying". He showed how two different tasks concerning "applying" were very different, illustrating the need for a better math task analysis model. One part of the model he proposed is whether a task assesses mathematical knowledge as an object or as a tool. As an object, there are items asking for a computation or asking students to show understanding of a concept. As a tool, there are items asking for a direct application of knowledge, items asking for application with adaptation, and items asking for application "with intermediary" - students have to add something not already in the task. This classification was done by a national expert panel. He showed how the classification of a task of course has to depend on the prior knowledge of students. The classification therefore had to be done based on the national curriculum, making it a national classification which would have been different in other countries.

Secondly in that session, I heard Jeppe Bundsgaard talk on "Differential item functioning as a pedagogical tool". He used ICILS 2013 (where Denmark didn't quite meet the sampling criteria), and the Rasch model. Differential Item Functioning (DIF) refers, of course, to the phenomenon that an item works differently for two different groups of students. Bundsgaard wanted to see if grouping items with DIF can identify challenging areas in the study. He studied DIFs for the countries together and the DIFs of Norway, Denmark and Germany separately. The short conclusion is that Norwegian and Danish students are better at items related to computer literacy, but worse at items related to information literacy.
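
For readers unfamiliar with DIF, a small aside: the talk worked within the Rasch model, but the basic idea can be illustrated with the simpler logistic-regression screening approach for uniform DIF. The sketch below is my own illustration with simulated data and hypothetical variable names, not Bundsgaard's analysis: after conditioning on overall ability, a significant group coefficient flags that the item behaves differently for the two groups.

```python
# A minimal sketch of logistic-regression DIF screening, not the Rasch-based
# analysis from the talk. Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 2000
ability = rng.normal(size=n)                      # proxy for overall ability
group = rng.integers(0, 2, size=n)                # 0/1 group indicator (e.g. two countries)
# Simulate one item that is easier for group 1 at equal ability, i.e. it has uniform DIF
logit = 1.2 * ability - 0.3 + 0.5 * group
correct = rng.binomial(1, 1 / (1 + np.exp(-logit)))

df = pd.DataFrame({"correct": correct, "ability": ability, "group": group})

# Does group membership still predict success once ability is controlled for?
model = smf.logit("correct ~ ability + group", data=df).fit(disp=False)
print(model.summary())   # a significant 'group' coefficient suggests uniform DIF
```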

Thirdly, Olesya Gladushyna and Rolf Strietholt talked on "Nerds or polymaths? Performance profiles at the end of primary education". Latent profile analysis (LPA) was used to see whether 4th-grade students have qualitatively distinct profiles even if they do not differ quantitatively (I'm not sure whether what I'm writing here makes sense). They did find different models with different profiles; for instance, the three-profile model included one profile where students are better in math than in reading and science, one where students are worse in math than in reading and science, and one where students are equally good in all three. (The designation of those who are better in math as "nerds" is troublesome, as another result was that children with a home language other than the language of the test tend to belong to that group, which probably means that they are not so good in reading and science rather than that they are outstanding in math.)
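
As a side note on the method: latent profile analysis is essentially a finite mixture model for continuous indicators. A rough stand-in can be fitted as a Gaussian mixture with diagonal covariances, choosing the number of profiles with BIC. The sketch below uses simulated math/reading/science scores and is my own illustration, not the authors' specification or software:

```python
# A rough stand-in for latent profile analysis using a diagonal Gaussian mixture.
# Purely illustrative, with simulated scores; not the authors' setup.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate three hypothetical profiles: "math-strong", "math-weak", "balanced"
means = np.array([[550, 480, 490],    # better in math than reading/science
                  [470, 540, 530],    # worse in math than reading/science
                  [510, 510, 510]])   # even profile
X = np.vstack([rng.normal(m, 40, size=(300, 3)) for m in means])

# Fit models with 1-5 profiles and compare BIC
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1))

# Profile membership (posterior class assignment) for the chosen model
best = GaussianMixture(n_components=3, covariance_type="diag", random_state=0).fit(X)
profiles = best.predict(X)
```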

Finally in this session, Nani Teig and others had the title "I know I can, but do I have the time? The role of teachers' self-efficacy and perceived time constraints in implementing cognitive-activation strategies". She used a framework for instructional quality with three basic dimensions: classroom management, supportive climate and cognitive activation. The focus here was cognitive activation strategies (CAS). These can be divided into general CAS ("asking challenging questions") and CAS specific to science - inquiry-based CAS, learning how to do science. We know that teachers lack confidence in enacting CAS, and that CAS can be demanding and time-consuming. This study focused on the interplay between teacher self-efficacy and teachers' perceived time constraints, using Norwegian TIMSS 2015 data. They found that general and inquiry-based CAS are distinct but correlated constructs. They found significant correlations between self-efficacy and both kinds of CAS, but only a significant correlation between perceived time constraints and inquiry-based CAS (which makes sense, I guess) - and this was significant only in grade 9 when analysed for each grade separately (at which point, of course, the number of students is smaller).

The final session (the closing ceremony excluded) was on "Socioeconomic background and student achievement: TIMSS and PIRLS". Here, there were three talks, the first of which was Rune Muller Kristensen: "Deconstruction of the negative social heritage? A search for variables confounding the simple relation between socioeconomic status and student achievement". They used Danish TIMSS data from 2015. ESCS (Economic, Social and Cultural Status) cancelled out other effects in that study, including school and class size. The point of this project was to understand these relationships better. However, no matter how many relevant variables were thrown into the model, not much of the relationship between ESCS and performance was explained. (The discussant at the end asked whether the variation in ESCS and in the potential confounding variables is too small in Denmark, and whether different results could be found in other countries with more variance.)

Secondly, Andrés Strello and others had the title "Effects of early tracking on performance and inequalities in achievement: Combined evidence from PIRLS, TIMSS and PISA". They studied all available cycles of the three studies and 75 countries, sorted according to when (or if) they start tracking, looking at dispersion, social inequality and performance level. They made 45 pairs of comparisons and then used a meta-analytical approach (Card 2011). The meta-analysis showed that tracking has a significant effect on inequality as dispersion, on social inequality and on performance level. (It seemed, however, that tracking has a negative effect on reading performance as measured in PISA.) Also, the earlier tracking takes place, the larger the effects. (In the discussion afterwards, and in the presentation itself, it was pointed out that tracking is a complex phenomenon with different implementations between and even within countries. Still, that makes it perhaps all the more surprising that significant findings were found.)

Finally, the last talk of the conference was Vasilik Pitsia and others: "High achievement in mathematics and science: A multilevel analysis of TIMSS 2015 data for Ireland" (using 4th grade data). They divided the students into high achievers (the TIMSS Advanced International Benchmark) and non-high achievers. As usual, confidence correlates highly with being a high achiever (probably partly meaning that high achievers notice that they are high achievers). Also, home resources are important. However, the chance of being a high achiever decreases when pupils think they get engaging teaching. (These were the results for mathematics; I didn't note down the results for science.) (Of course, it is tempting to find an explanation for the last result. Might high achievers be less easily engaged because the level of mathematics in the teaching is too low?)

Then, there was only the closing ceremony left. We were invited to join the IEA IRC in the United Arab Emirates in 2021. For me, there are many reasons to avoid a conference in the UAE. Of course, it is difficult to get to the UAE from Northern Europe in a climate-friendly way, and it is unpleasant to have a conference in high temperatures. Much more important, at least for me, is the human rights situation, where for instance gay people are arrested and can in theory receive a death sentence. There are examples of gay men being raped in the UAE only to then be investigated for illegal gay sex. So I will leave the 2021 IRC to more thick-skinned people than myself, and instead aim for the 2023 IRC.


(I have nothing personally against the UAE representative advertising the beauty of the UAE and the happiness of its people, but it would have been fair to mention that certain subgroups of the population are not happy at all.)

IEA IRC 2019: Day 2

The morning plenary at day 2 of the IEA IRC was Aaron Benavot's talk "How can IEA make a difference in measuring and monitoring learning in the 2030 agenda for sustainable development?" He is a former director of UNESCO's Global Education Monitoring (GEM) Report. The GEM Reports, published yearly, previously monitored progress on the six Education for All goals; now they monitor the educational targets in the 2030 Agenda for Sustainable Development.

He discussed the history behind the merging of several processes into the Sustainable Development Framework with its 17 goals, 169 targets and 230 indicators. He stressed that this is the most aspirational and comprehensive international education agenda ever. Such a comprehensive agenda reinvigorates earlier debates on how to measure and monitor learning. The countries are supposed to have voluntary national reviews, and there will be an elaborate indicator framework with different indicators and measures - at least one global indicator per target, a number of thematic indicators (globally comparable indicators), regional indicators and national indicators. For instance, target 4.1 talks about relevant and effective learning outcomes, while the global indicator narrows this down to reading and mathematics. However, the measurement of the global indicators is to be done in close cooperation with each state.

He stressed how different ways of measuring give very different results. For instance, the traditional way of measuring literacy is by census data, where often the head of the household is asked who in the household is literate. Now, a few countries are moving towards testing - for instance asking people to read a sentence for the census taker. This reduces the literacy estimates. He gave examples of how IEA might help in developing ways of measuring.

He also discussed how the international assessments are increasingly supplemented by regional and national assessments; more than 150 countries have performed national assessments since 2000.

There was a discussion after the talk about the country-led nature of the reviews and the measurement. A researcher from South Africa stressed the importance of each country being able to determine what the education priorities are in its own context. If South Africa cannot decide for itself but has to adopt measures from Western "North" countries, those would not be suited to the local context.

Again, I decided to skip the panel (which was on PIRLS and not part of my research interest presently).

After lunch, there was a Norwegian symposium on TIMSS, with three presentations from the University of Oslo research team (CEMO: Centre for Educational Measurement). The first was Rolf Vegar Olsen and Sigrid Blömeke: "Predicting change in mathematics achievement in Norway over time". He started out by pointing out the dramatic fall in Norwegian TIMSS results from 1995 to 2003 (comparable to twice the difference between 8th and 9th grade in 2015), followed by an increase up to 2015.
The method used for this paper was the Oaxaca-Blinder Decomposition, a method for studying the mean differences between two groups - basically looking at both the constant and the slope of the regression lines for the two groups. (Actually, they used a threefold OBD, with "endowments", "coefficients" and "interaction" terms.) They wanted to include predictors which had changed in the period used. However, fairly little of the change in score could be explained by the included predictors. (A lot of possible predictors had to be excluded because the questions were different in 2003 and 2015.)
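
For reference, the threefold decomposition of the change in mean score can be written as follows (my notation, with 2003 as the reference group; the roles of the two years can be swapped, and the talk did not present the formula in exactly this form):

$$
\bar{Y}_{2015} - \bar{Y}_{2003}
= \underbrace{(\bar{X}_{2015}-\bar{X}_{2003})'\hat{\beta}_{2003}}_{\text{endowments}}
+ \underbrace{\bar{X}_{2003}'(\hat{\beta}_{2015}-\hat{\beta}_{2003})}_{\text{coefficients}}
+ \underbrace{(\bar{X}_{2015}-\bar{X}_{2003})'(\hat{\beta}_{2015}-\hat{\beta}_{2003})}_{\text{interaction}}
$$

The "endowments" term captures the part of the change due to the predictors themselves changing, the "coefficients" term the part due to changes in how the predictors relate to achievement, and the interaction term the combination of the two.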

The second talk was Trude Nilsen, Julius Björnsson and Rolf Vegar Olsen: "Has equity changed in Norway over the last decades?" First, they discussed the definition of equity: it could be defined as a lack of achievement differences between schools, a small SES effect on achievement, or a low proportion of pupils getting low scores. They looked at all cycles of TIMSS and PISA. While many measures had changed over time, "number of books at home" had remained stable in both TIMSS and PISA. The findings were that the total variance had decreased over time (which may, however, be because the proportion of high-performing students has decreased), while on the school level there have been different developments in the different studies. The variance explained by SES has increased over time. The main problem, however, is the lack of stability in the SES measures. A solution could be to combine ILSA (international large-scale assessments) with register data (but that could be controversial for privacy reasons).

The third talk of this symposium was Hege Kaarstein and Trude Nilsen: "Twenty years of science motivation mirrored through TIMSS: Examples of Norway". Their goal was to look at the development of science motivation. Methodologically, every TIMSS cycle was compared to 1995; in addition, comparisons between 4th and 8th grade and between girls and boys in all cycles were planned. They studied intrinsic motivation, self-concept and extrinsic motivation (the third one not measured in 4th grade). It is an important point that it must be checked (within the means available) that questions are understood similarly over time, but the details of the scalar measurement invariance (MI) analyses I am not able to repeat. The results of the study were mixed, but motivation seems to have increased. Self-concept has the highest correlation with performance, but self-concept did not increase significantly in 8th grade. (However, Norwegian students already reported very high self-concept from the beginning.)

Jan-Eric Gustafsson was the discussant. He picked up on the difficulties of looking at change over time and asked how the ILSAs could be improved to make it easier to study change over time. He also pointed out that many of the independent variables used here are prone to large measurement errors (as they are self-reported by students), which can lead to regression coefficients being underestimated. He also pointed out that the PISA scales vary in reliability from year to year, while the TIMSS scales have higher reliability. He noted that "number of books at home" has been shown to work differently in different countries, so it may also be assumed to work differently over time within one country. (It was actually pointed out in the plenary on Friday that the proportion of pupils with many books in the home is decreasing in rich countries.) Also, he criticised cutting out lots of the comparisons based on the MI analyses, as these have such high power that they detect substantively insignificant differences. (A very interesting point, although he himself admitted that including these comparisons might make the paper impossible to publish.)
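
Gustafsson's point about error-prone self-reported predictors is the classical attenuation result: with a single predictor measured with classical measurement error, the estimated slope shrinks towards zero by the reliability of the measure (my addition; this standard errors-in-variables result was not spelled out in the talk):

$$
\operatorname{plim}\,\hat{\beta} \;=\; \beta \cdot \frac{\sigma_X^2}{\sigma_X^2 + \sigma_e^2}
$$

where $\sigma_X^2$ is the variance of the true predictor and $\sigma_e^2$ the variance of the measurement error.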

He also provided a fun example of problems of measuring: when a new grade scale was introduced in Sweden, confusion followed as teachers did not use the new scale consistently. This led to increased variance in the grades (and less correlation with the underlying competence of pupils, I guess), leading to a decrease in the SES effect. (Of course, if grades are assigned more randomly, all correlations between grades and other variables will decrease.)

Then, there was the last session of the second day. The first talk was Samo Varsik: "Differences in students' and teachers' characteristics between high and low performing classes in Slovakia". He used PIRLS 2016 4th grade data from the Slovak Republic. The methodological approach is based on similar research done in the Czech Republic. He first showed how SES has a huge impact in Slovakia. He also looked for differences in teaching methods between high- and low-performing classes, but found very few significant correlations. The only two significant differences were that high-performing classes are tested more often and are more often asked to summarize the main ideas. The second part of his work was regression models, showing for instance that students' confidence in reading is, not so surprisingly, correlated with performance, also when controlling for SES, gender and so on. However, he did not find significant results regarding teachers' characteristics. (Other than this, I did not manage to write down much of his results.) At the end, he noted that an important limitation of his method is the "Modern Teaching Methods" variable, which is based on only a few self-reported questions.

The second paper of the session was supposed to be Bieke De Fraine et al: "Reading comprehension growth from PIRLS Grade 4 to Grade 6", but this was cancelled.

The third paper was Marie Wiberg and Ewa Rolfsman: "Nordic students' achievement and school effectiveness in TIMSS 2015" (Nordic = Sweden & Norway). They looked at student variables (sex, native father (NF) and number of books) and school variables (student behaviour, urban school location, school climate (teacher, student, parents), aggregated SES, aggregated NF, general resources at school and resources in mathematics) and used linear regression. They included the concept of "effective schools", defined as schools having better results than expected based on background data. Students' background was important everywhere. In Norway, school location and school climate were significant, while in Sweden only school climate was significant. For future research, the possibility of connecting to register data will make other analyses possible.
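
The "effective schools" idea is essentially residual-based: predict achievement from background variables and look at which schools do better than predicted. Here is a hypothetical sketch of that logic (my own illustration, not the authors' code or variables):

```python
# Illustrative residual-based "school effectiveness": schools whose students score
# better than predicted from background variables. Hypothetical data and variable names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_schools, n_per_school = 50, 30
school = np.repeat(np.arange(n_schools), n_per_school)
books = rng.integers(0, 5, size=school.size)          # crude SES proxy (books-at-home category)
school_effect = rng.normal(0, 15, size=n_schools)     # true school-level deviation
score = 450 + 20 * books + school_effect[school] + rng.normal(0, 60, size=school.size)

df = pd.DataFrame({"score": score, "books": books, "school": school})

# Expected score given background, then average residual per school
fit = smf.ols("score ~ books", data=df).fit()
df["residual"] = fit.resid
effectiveness = df.groupby("school")["residual"].mean().sort_values(ascending=False)
print(effectiveness.head())   # schools doing better than their intake would predict
```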

The final paper of the day was André Rognes' talk on "Birth month and mathematics performance relationships in Norway", written with Annette Hessen Bjerke, Elisabeta Eriksen and myself. Of course, I knew the paper quite well in advance: the main point is that the Relative Age Effect (RAE) is statistically significant in all content and cognitive domains of mathematics in 4th, 5th and 8th grade. There was no statistically significant RAE in 9th grade. We also tested whether there was a significant difference in RAE between 4th and 5th grade and between 8th and 9th grade - there was not.
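
For readers curious what such a test looks like in practice, here is a deliberately simplified sketch: regress score on birth month and test the month effect, with simulated data. (Our actual analysis handles the TIMSS plausible values and sampling design, which this sketch ignores.)

```python
# A simplified illustration of testing a relative age effect: does birth month predict score?
# Hypothetical data; the real paper accounts for plausible values and the sampling design.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 4000
birth_month = rng.integers(1, 13, size=n)                 # 1 = January, 12 = December
# Simulate a small linear RAE: later-born (younger) pupils score slightly lower
score = 500 - 1.5 * (birth_month - 1) + rng.normal(0, 70, size=n)
df = pd.DataFrame({"score": score, "birth_month": birth_month})

linear = smf.ols("score ~ birth_month", data=df).fit()
print(linear.params["birth_month"], linear.pvalues["birth_month"])  # slope per month

# Month-by-month (dummy-coded) version, testing all month effects jointly
dummy = smf.ols("score ~ C(birth_month)", data=df).fit()
print(dummy.f_pvalue)   # joint test that the months do not differ
```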

We did get a question about whether we had looked at SES in our research. We had not. It is unlikely that birth month can be predicted by SES (and a colleague actually pointed out that he had checked this). Whether the RAE is larger in some SES groups than others is another question that it would be interesting to investigate. (Although I fear the Norwegian data alone would not provide enough power to find out. Even with more than 4000 students in each cohort, the number of students per month gets quite small when dividing into different SES groups.)

That was the end of the second day. For me, this day was more aligned with my research interests than the first, so I was happy about it.

IEA IRC 2019: Day 1

My first IEA IRC (the 8th IEA International Research Conference) took place at Aarhus University in June 2019. As this was my second conference at this venue, I was not surprised to find that the conference was actually in Copenhagen... However, unlike the ESU five years ago, this conference started with communal singing (allsang): "Svantes lykkelige dag" and "I Danmark er jeg født".

The first keynote was Christian Christrup Kjeldsen: "Global attitudes and perceptions of social justice among youth: When no (in)differences make the difference". He reminded us of the concept of a fuzzy set, where elements can be members of a set to different degrees. Becoming a subject is part of life, and (I suppose) people cannot always be put 100 % into fixed boxes. (He argued based on his reading of Bourdieu, but of course I'm not able to summarize that.) Part of his talk was on what is significant: the differences between (the continuum of) statistical significance, (the continuum of) substantial issues in a moral-philosophical approach, and (the continuum of) effect sizes. He argued for the concept of "substantial significance": differences in capabilities supporting a life that the individual has reason to value. When talking about effect size, he connected this to Hattie, who he claimed could serve as an inspiration. At the end, he talked about a case in which he merged results from different studies in a fuzzy way while trying to keep enough noise not to understate the variance. Again, hard to summarize.

I think there was food for thought there. Take gender as an example. Of course, we are well aware that gender is more complex than the old-fashioned binary "man"/"woman" concept. However, there are important differences between "men" and "women" in most fields of research when gender is treated as a binary concept. So which underlying concepts can be found that explain the differences, without us having to keep using a binary concept that we know is too simplistic?

As happens at conferences, I had to spend the next slot doing some last-minute edits to our presentation with my colleague. For the after-lunch slot I chose to take part in the "Open source publishing with IEA" panel. Of course, if we are to do more analyses of international studies, we need to know as much as possible about the publishing possibilities.

The journal "Large-scale Assessments in Education" has had its fifth anniersary, and is now a Springer open source journal, giving it more visibility. Also, there is the IEA Research for Education Book Series,   often 80-150 pages long. Calls for proposals are published biannually. (The authors actually get 25 000 euro for each book.) Unsolicited applications are also considered. Only IEA studies can be used for the book series, while the journal is more forgiving. The full process from accepted proposal to finished printed book is usually about two years.

Finally, Seamus Hegarty talked about the review process for the book series: there is a pre-review, then a review of each chapter (based on an annotated ToC, which is mandatory for proposals). The review is not double-blind - only the reviewers are anonymous.

He gave some examples of common editorial suggestions:
  • Do provide an argument about the significance of your work
  • Contextualise your work
  • Detail your methodology
  • Be rigorous and coherent (coherence is especially difficult to obtain when different teams of authors write different parts of the book)
  • Write clearly
  • Organise your own review! It is useful to use colleagues to do a "review" before the real review.

Then, for the final session of the day I decided to attend the session on "TIMSS and ICCS: Students' attitudes and achievement in TIMSS, TIMSS Advanced mathematics, and ICCS". The first paper was by Laura Palmerio and Elisa Caponera: "Relationship between students' attitudes and beliefs, and achievement in advanced mathematics". The TIMSS Advanced questionnaires and tests were supplemented by a national questionnaire given to the same students, on self-efficacy and anxiety. They found that self-efficacy is highly correlated with mathematics performance, not surprisingly. This could be a sign that we should work on students' self-efficacy.

A side note: they showed that "self-efficacy was the best predictor of mathematics performance" (according to the abstract). I think this is a good example of how the language of "predictor" can be problematic, as the relationship between self-efficacy and performance is of course going in two directions - performance leads to better self-efficacy and self-efficacy leads to better performance. (In the presentation it was very clear that self-efficacy and performance are part of a circular relationship which also includes behaviour and anxiety.)

The next talk was Michaelides and others: "Meaningful clusters of eighth grade students in 2015 TIMSS mathematics using motivation variables". They focused on confidence, enjoyment and value (all three scales administered in 8th grade, the first two also in 4th grade), to look at the interactions between them. For instance, some students report that they value math but do not enjoy it. They did this across 12 jurisdictions, in TIMSS cycles 1995-2007. The analysis was based on a two-step clustering approach. The clusters were developed per country, and then the clusters' participants' achievement and gender composition were explored. In inconsistent clusters, value did not play much of a role for achievement - self-confidence and enjoyment were more important.

The third talk in this session was Dupont et al: "The role of parents' literacy attitudes on children's reading achievement (PIRLS 2016)". They had different hypotheses on the connection between parents' reading attitudes and the outcomes (students' reading motivation and students' reading achievement). Regression analyses were done, controlling for home resources for learning. They found a high correlation between parents' attitudes and students' reading achievement. (Some of the diagrams here could be useful in my teaching on quantitative methods in our master's courses.) The study underlines the importance of parents' literacy practices for students' reading achievement and attitudes.

The final talk of the day was Kwong and Macaskill: "The relationship between student engagement and achievement across countries within regions using latent class analysis". They looked at Asia, Europe and Latin America as three regions. They first used exploratory factor analysis to explore relations among the attitude indices. Thereafter LPA was used - a two-level LPA model for the Asia region. (Obviously, I cannot summarize all the tables showing the results of these analyses.) Through lots of diagrams, we were shown how the three regions had different profiles, although for instance Taiwan seemed to stand out a bit from the other education systems in the Asia region. (Sadly, complex diagrams with lots of small type do not work very well when the sun is flooding the room, so it was hard to get the details.)

That ended the first day of the conference. It is a different experience from many other conferences, as I usually go to conferences where I can choose talks on topics I am very interested in. Here, I more often find myself listening to talks where the topic in itself is not that relevant to my interests, but where the methodological ideas may very well be useful for exploring other topics. So it is a different focus.