In my previous post I wrote about how I achived a >40% speedup on sysbench read-only using profile-guided optimisation (PGO). While this is a preliminary result, I though it was so interesting that it deserved early mention. The fact that any benchmark can be improved that much shows clearly that PGO is something worth looking into. Even if we will probably not improve all workloads by 40%, it seems highly likely that we can obtain significant gains also for many real workloads.
I had one or two interesting comments on the post that raise valid concerns, so I wanted to write a follow-up here, explaining some of the points in more details and going deeper into the performance counter measurements. As I wrote before, actual observations and measurements are crucial to fully understand performance of complex code on modern CPUs. Intuition and guesswork just does not suffice.
On branch mispredictions
bludwarf suggested that branch mispredictions could be an important part of the picture:
> It would be interesting to see the hottest spots.... Having to execute just > a lot of code is not an explanation. If it would be so, then profiler > feedback won't affect performance so much. I believe the reson for so big > difference for the results is branch misprediction.
I had similar thoughts when I first saw the results, but I was able to verify
my explanation with actual measurements of instructions retired and icache
misses before and after profile-guided optimisations:
171821 INST_RETIRED.ANY 17730 ICACHE.MISSES 166686 INST_RETIRED.ANY 13643 ICACHE.MISSES
Before PGO, ICACHE.MISSES/INST_RETIRED.ANY = 10.3%. After, we get 8.2%. So we do get significant less icache misses. I wrote in my previous post some plausible reasons for that with my remarks on basic blocks, though I have not tried to verify those with actual measurements.
I also checked branch mispredictions. The Intel manual has two measures to
estimate the cost of branch misprediction. One is time wasted due to
speculatively issued uops that do not retire, %Bad_Speculation =
(UOPS_ISSUED.ANY – UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES )
/ (4*CPU_CLK_UNHALTED.THREAD). The other is cycles wasted based on an
estimated cost of 20 cycles per misprediction, %BR.MISP.COST = 20 *
BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD. Here are the
relevant measurements taken before and after (two sets of measurements, as
Sandy Bridge supports only 4 simultaneous performance counters):
486641 CPU_CLK_UNHALTED.THREAD 205880 UOPS_RETIRED.RETIRE_SLOTS 226184 UOPS_ISSUED.ANY 7219 INT_MISC.RECOVERY_CYCLES %Bad_Speculation = 2.5% 335051 CPU_CLK_UNHALTED.THREAD 194616 UOPS_RETIRED.RETIRE_SLOTS 216353 UOPS_ISSUED.ANY 6830 INT_MISC.RECOVERY_CYCLES %Bad_Speculation = 3.7% 490250 CPU_CLK_UNHALTED.THREAD 1310 BR_MISP_RETIRED.ALL_BRANCHES_PS %BR.MISP.COST = 5.3% 338524 CPU_CLK_UNHALTED.THREAD 1147 BR_MISP_RETIRED.ALL_BRANCHES_PS %BR.MISP.COST = 6.8%
Note that the actual branch misprediction is much the same. But with less time spent on other bottlenecks, the estimated relative cost becomes higher.
Most of the branch mispredictions are due to indirect branches (virtual functions and @plt entries in shared libraries). Even without PGO the CPU is able to predict other branches very well, so compiler or __builtin_expect can not help much.
At the end of this post, I have appended the top of some per-function profiles. They show how branch mispredictions are mostly from commonly called library functions like memcpy() (these are mispredictions of the indirect jumps in the PLT), and from virtual function calls. They also show how the icache misses are spread out more or less evenly, correlating strongly with the count of actual instructions executed in various parts of the code.
On better coverage of the PGO
A comment from Anonymous raised another valid concern:
> If I read you correctly, you are using the same benchmark for optimizing as > well as measuring performance increases. You need to have different test and > training sets or your measurements don't generalize.
This is true, of course. The sysbench read-only exercises only an extremely narrow subset of mysqld functionality. The next step, on which I am already working, is to implement a more suitable and general workload for the profile, which can hopefully cover, and hence speedup, a significant portion of common real-life workloads.
More tests will be needed to be sure that real-world gains can be achieved, but I am very hopeful. After all, much of the code of sys-bench read-only is also used in many other workloads, and much of the properties (eg. infrequency of the lots and lots of error check branches) should be similar.
My main concern is that GCC will choose to de-optimise code that is not covered by the test load, to the extent that it significantly harms performance in some cases. But it should not be too much of a problem – after all, GCC will still try to optimise the generated code as best it can, and if the information it has from the profile is incomplete, it should not be much worse than having no information, as is the case without PGO. And over time, we can extend the coverage of the test load to help eliminate any problems that do turn up.
Since the last post, I implemented a first version of a program that generates a test load. For now, I put it on github: gen_profile_load. The program generates a range of different loads, such as simple key lookups, small joins, inserts/updates/deletes, under various conditions such as with/without binlogging, with/without prepared statements, and so on.
I plan to next do some tests with using this for doing the PGO, and checking the effect on sysbench read-only and possibly other benchmarks. Another idea I had was to try to use only parts of the generated load, and test that this does not cause significant regressions on the omitted parts. This could help in getting some idea of what kind of (hopefully small) performance regressions can be expected for workloads that are not well covered by the PGO.
Detailed profiles
These are pasted from the output of perf report
. They show the top functions in terms of instructions executed, icache misses suffered, and branches mispredicted.
Instructions retired (INST_RETIRED.ANY), before PGO:
Samples: 171K of event 'rc0', Event count (approx.): 171821 1,94% mysqld libc-2.13.so [.] __memcpy_ssse3 1,92% mysqld mysqld [.] my_hash_sort_bin 1,60% mysqld mysqld [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1,48% mysqld mysqld [.] malloc 1,36% mysqld mysqld [.] free 1,34% mysqld mysqld [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long) 1,31% mysqld mysqld [.] alloc_root 1,30% mysqld mysqld [.] JOIN::optimize_inner() 1,16% mysqld mysqld [.] my_malloc_size_cb_func 1,14% mysqld mysqld [.] ha_innobase::build_template(bool) 1,08% mysqld mysqld [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*) 1,08% mysqld mysqld [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*) 1,07% mysqld libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 1,01% mysqld libpthread-2.13.so [.] pthread_mutex_lock 1,00% mysqld mysqld [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*) 0,98% mysqld libpthread-2.13.so [.] pthread_getspecific 0,93% mysqld mysqld [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_ 0,90% mysqld mysqld [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long) 0,88% mysqld [vdso] [.] 0x000000000000070c 0,83% mysqld mysqld [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long) 0,79% mysqld libc-2.13.so [.] __memset_sse2 0,78% mysqld libc-2.13.so [.] __strlen_sse42 0,77% mysqld [kernel.kallsyms] [k] unix_stream_recvmsg 0,77% mysqld mysqld [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int) 0,71% mysqld mysqld [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227 0,70% mysqld libc-2.13.so [.] __strcmp_sse42 0,70% mysqld mysqld [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int) 0,70% mysqld [kernel.kallsyms] [k] update_cfs_shares 0,69% mysqld mysqld [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 0,69% mysqld mysqld [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231 0,68% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 0,60% mysqld mysqld [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long) 0,59% mysqld [kernel.kallsyms] [k] effective_load 0,56% mysqld [kernel.kallsyms] [k] select_task_rq_fair 0,56% mysqld [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,50% mysqld mysqld [.] ha_innobase::external_lock(THD*, int) 0,48% mysqld mysqld [.] my_malloc 0,48% mysqld [kernel.kallsyms] [k] enqueue_task_fair 0,48% mysqld mysqld [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) 0,47% mysqld [kernel.kallsyms] [k] update_entity_load_avg 0,46% mysqld mysqld [.] mysql_execute_command(THD*) 0,46% mysqld mysqld [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*) 0,45% mysqld mysqld [.] Item_func::type() const 0,44% mysqld [kernel.kallsyms] [k] update_curr 0,44% mysqld [kernel.kallsyms] [k] copy_user_generic_string 0,44% mysqld [kernel.kallsyms] [k] do_raw_spin_lock 0,43% mysqld mysqld [.] my_lengthsp_8bit 0,43% mysqld [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0,43% mysqld [kernel.kallsyms] [k] init_sync_kiocb.constprop.15 0,41% mysqld mysqld [.] get_best_combination(JOIN*) 0,40% mysqld mysqld [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool) 0,36% mysqld mysqld [.] Item_equal::Item_equal(Item*, Item*, bool) 0,36% mysqld mysqld [.] Item::Item() 0,35% mysqld [kernel.kallsyms] [k] atomic64_dec_and_test 0,35% mysqld mysqld [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result) 0,35% mysqld mysqld [.] JOIN::exec_inner() 0,35% mysqld [kernel.kallsyms] [k] account_entity_enqueue 0,35% mysqld [kernel.kallsyms] [k] do_sys_poll 0,35% mysqld [kernel.kallsyms] [k] system_call 0,34% mysqld mysqld [.] JOIN::destroy() 0,34% mysqld mysqld [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)
Icache misses (ICACHE.MISSES), before PGO:
Samples: 17K of event 'r280', Event count (approx.): 17730 2,21% mysqld mysqld [.] JOIN::optimize_inner() 2,04% mysqld mysqld [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long) 1,91% mysqld mysqld [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1,42% mysqld mysqld [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*) 1,27% mysqld libc-2.13.so [.] __memcpy_ssse3 1,20% mysqld mysqld [.] malloc 1,09% mysqld mysqld [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_ 1,07% mysqld mysqld [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 0,90% mysqld mysqld [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long) 0,90% mysqld [vdso] [.] 0x0000000000000771 0,86% mysqld libpthread-2.13.so [.] pthread_mutex_lock 0,83% mysqld mysqld [.] _ZN4JOIN7prepareEPPP4ItemP10TABLE_LISTjS1_jP8st_orderbS7_S1_S7_P13st_select_lexP18st_select_lex_unit.part.227 0,73% mysqld mysqld [.] mysql_execute_command(THD*) 0,72% mysqld mysqld [.] alloc_root 0,68% mysqld mysqld [.] JOIN::exec_inner() 0,65% mysqld libc-2.13.so [.] __memset_sse2 0,64% mysqld mysqld [.] JOIN::destroy() 0,63% mysqld mysqld [.] free 0,59% mysqld [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,55% mysqld mysqld [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.231 0,55% mysqld [kernel.kallsyms] [k] do_raw_spin_lock 0,54% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 0,54% mysqld mysqld [.] ha_innobase::external_lock(THD*, int) 0,52% mysqld mysqld [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int) 0,51% mysqld mysqld [.] my_malloc_size_cb_func 0,51% mysqld mysqld [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int) 0,51% mysqld mysqld [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*) 0,50% mysqld mysqld [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) 0,47% mysqld mysqld [.] setup_conds(THD*, TABLE_LIST*, List<TABLE_LIST>&, Item**) 0,47% mysqld mysqld [.] get_best_combination(JOIN*) 0,47% mysqld mysqld [.] ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type) 0,47% mysqld libc-2.13.so [.] __strcmp_sse42 0,47% mysqld [kernel.kallsyms] [k] unix_stream_recvmsg 0,46% mysqld mysqld [.] trx_commit(trx_t*) 0,44% mysqld [kernel.kallsyms] [k] fget_light 0,43% mysqld mysqld [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool) 0,43% mysqld [kernel.kallsyms] [k] system_call 0,42% mysqld mysqld [.] JOIN::cleanup(bool) 0,41% mysqld mysqld [.] handle_select(THD*, LEX*, select_result*, unsigned long) 0,41% mysqld mysqld [.] mtr_commit(mtr_t*) 0,41% mysqld mysqld [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*) 0,41% mysqld mysqld [.] my_hash_sort_bin 0,40% mysqld mysqld [.] open_table(THD*, TABLE_LIST*, st_mem_root*, Open_table_context*) 0,40% mysqld libpthread-2.13.so [.] pthread_getspecific 0,39% mysqld mysqld [.] buf_page_get_known_nowait(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*) 0,39% mysqld libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,38% mysqld mysqld [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.226 0,38% mysqld mysqld [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) 0,38% mysqld mysqld [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*) 0,37% mysqld mysqld [.] ha_innobase::extra(ha_extra_function) 0,37% mysqld mysqld [.] ha_innobase::build_template(bool) 0,37% mysqld [kernel.kallsyms] [k] enqueue_task_fair 0,36% mysqld mysqld [.] add_key_field(JOIN*, key_field_t**, unsigned int, Item_func*, Field*, bool, Item**, unsigned int, unsigned long long, st_sargable_param**) 0,36% mysqld mysqld [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**) 0,35% mysqld [kernel.kallsyms] [k] update_rq_clock 0,34% mysqld mysqld [.] do_select(JOIN*, List<Item>*, TABLE*, Procedure*) 0,33% mysqld mysqld [.] my_real_read(st_net*, unsigned long*) 0,33% mysqld mysqld [.] Arg_comparator::set_cmp_func(Item_result_field*, Item**, Item**, Item_result) 0,33% mysqld [kernel.kallsyms] [k] __schedule 0,33% mysqld mysqld [.] Item::Item() 0,32% mysqld mysqld [.] thd_ha_data
Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), before PGO:
Samples: 1K of event 'rc5:p', Event count (approx.): 1310 7,02% mysqld libc-2.13.so [.] __memcpy_ssse3 6,64% mysqld libpthread-2.13.so [.] pthread_getspecific 3,74% mysqld libpthread-2.13.so [.] pthread_mutex_unlock 3,66% mysqld mysqld [.] Item_func::type() const 3,05% mysqld libc-2.13.so [.] __memset_sse2 2,67% mysqld mysqld [.] Field_long::type() const 2,60% mysqld libpthread-2.13.so [.] pthread_mutex_lock 2,52% mysqld libc-2.13.so [.] __strcmp_sse42 2,06% mysqld mysqld [.] Item_field::used_tables() const 2,06% mysqld mysqld [.] Item_param::used_tables() const 2,06% mysqld libc-2.13.so [.] __strlen_sse42 1,76% mysqld librt-2.13.so [.] clock_gettime 1,68% mysqld mysqld [.] Item::cmp_type() const 1,60% mysqld mysqld [.] st_select_lex::master_unit() 1,37% mysqld mysqld [.] Item_equal::functype() const 1,30% mysqld mysqld [.] Item::const_item() const 1,30% mysqld mysqld [.] Item::real_item() 1,07% mysqld mysqld [.] Field_str::charset() const 0,92% mysqld mysqld [.] Protocol::end_statement() 0,92% mysqld mysqld [.] Item_field::type() const 0,84% mysqld mysqld [.] mysql_execute_command(THD*) 0,84% mysqld mysqld [.] Item_field::cleanup() 0,84% mysqld mysqld [.] Item_field::field_type() const 0,84% mysqld mysqld [.] ha_innobase::extra(ha_extra_function) 0,84% mysqld libpthread-2.13.so [.] __libc_recv 0,76% mysqld mysqld [.] Item_param::save_in_field(Field*, bool) 0,76% mysqld mysqld [.] my_hash_sort_bin 0,69% mysqld mysqld [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 0,69% mysqld mysqld [.] my_real_read(st_net*, unsigned long*) 0,69% mysqld mysqld [.] scheduler_wait_net_end 0,69% mysqld mysqld [.] Item_param::val_int() 0,69% mysqld mysqld [.] Item::check_cols(unsigned int) 0,69% mysqld mysqld [.] Item_field::result_type() const 0,61% mysqld mysqld [.] Field_long::pack_length() const 0,61% mysqld mysqld [.] vio_read 0,61% mysqld [kernel.kallsyms] [k] system_call 0,53% mysqld mysqld [.] Item::is_expensive() 0,53% mysqld mysqld [.] st_select_lex::outer_select() 0,53% mysqld mysqld [.] do_handle_one_connection(THD*) 0,53% mysqld mysqld [.] Explain_select::~Explain_select() 0,53% mysqld mysqld [.] Field_num::result_type() const 0,53% mysqld mysqld [.] Item_param::result_type() const 0,53% mysqld mysqld [.] Item_param::field_type() const 0,53% mysqld mysqld [.] Item_equal::count_sargable_conds(unsigned char*) 0,53% mysqld mysqld [.] Item_func_eq::functype() const 0,53% mysqld mysqld [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) 0,53% mysqld [kernel.kallsyms] [k] __pollwait 0,53% mysqld [kernel.kallsyms] [k] sock_def_readable 0,46% mysqld mysqld [.] net_before_header_psi(st_net*, void*, unsigned long) 0,46% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 0,46% mysqld mysqld [.] mysqld_stmt_execute(THD*, char*, unsigned int) 0,46% mysqld mysqld [.] handler::index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function) 0,46% mysqld mysqld [.] Item_equal::select_optimize() const 0,46% mysqld mysqld [.] Item_func::used_tables() const 0,46% mysqld mysqld [.] ha_innobase::info(unsigned int) 0,46% mysqld libc-2.13.so [.] __memmove_ssse3 0,46% mysqld libc-2.13.so [.] __memcmp_sse4_1 0,46% mysqld [kernel.kallsyms] [k] dequeue_task_fair 0,46% mysqld [kernel.kallsyms] [k] pick_next_task_fair 0,38% mysqld mysqld [.] Protocol_binary::store(char const*, unsigned long, charset_info_st const*) 0,38% mysqld mysqld [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int)
Instructions retired (INST_RETIRED.ANY), after PGO:
Samples: 166K of event 'rc0', Event count (approx.): 166686 2,55% mysqld libc-2.13.so [.] __memcpy_ssse3 1,79% mysqld mysqld [.] my_hash_sort_bin 1,60% mysqld mysqld [.] alloc_root 1,48% mysqld mysqld [.] malloc 1,42% mysqld mysqld [.] my_malloc_size_cb_func 1,42% mysqld mysqld [.] free 1,29% mysqld libc-2.13.so [.] __memset_sse2 1,23% mysqld libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 1,20% mysqld mysqld [.] ha_innobase::build_template(bool) 1,16% mysqld libpthread-2.13.so [.] pthread_mutex_lock 1,15% mysqld mysqld [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1,15% mysqld mysqld [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*) 1,13% mysqld mysqld [.] JOIN::optimize_inner() 1,12% mysqld mysqld [.] build_template_field(row_prebuilt_t*, dict_index_t*, dict_index_t*, TABLE*, Field const*, unsigned long) 1,07% mysqld [vdso] [.] 0x00000000000006a1 1,03% mysqld mysqld [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long) 1,00% mysqld libpthread-2.13.so [.] pthread_getspecific 0,99% mysqld mysqld [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int) 0,94% mysqld libc-2.13.so [.] __strlen_sse42 0,83% mysqld mysqld [.] cmp_dtuple_rec_with_match_low(dtuple_t const*, unsigned char const*, unsigned long const*, unsigned long, unsigned long*, unsigned long*) 0,80% mysqld libc-2.13.so [.] __strcmp_sse42 0,78% mysqld [kernel.kallsyms] [k] unix_stream_recvmsg 0,78% mysqld mysqld [.] Protocol::net_store_data(unsigned char const*, unsigned long, charset_info_st const*, charset_info_st const*) 0,75% mysqld mysqld [.] rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned long*, unsigned long, mem_block_info_t**, char const*, unsigned long) 0,69% mysqld mysqld [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*) 0,68% mysqld mysqld [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183 0,68% mysqld [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,63% mysqld [kernel.kallsyms] [k] effective_load 0,63% mysqld mysqld [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_ 0,62% mysqld mysqld [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long) 0,59% mysqld [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0,57% mysqld [kernel.kallsyms] [k] select_task_rq_fair 0,56% mysqld [kernel.kallsyms] [k] update_cfs_shares 0,56% mysqld mysqld [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 0,52% mysqld [kernel.kallsyms] [k] update_curr 0,52% mysqld mysqld [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int) 0,52% mysqld mysqld [.] ha_innobase::external_lock(THD*, int) 0,51% mysqld mysqld [.] my_malloc 0,50% mysqld [kernel.kallsyms] [k] update_entity_load_avg 0,46% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 0,46% mysqld [kernel.kallsyms] [k] copy_user_generic_string 0,45% mysqld mysqld [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*) 0,43% mysqld mysqld [.] get_best_combination(JOIN*) 0,43% mysqld [kernel.kallsyms] [k] do_raw_spin_lock 0,42% mysqld mysqld [.] find_field_in_table(THD*, TABLE*, char const*, unsigned int, bool, unsigned int*) 0,42% mysqld [kernel.kallsyms] [k] enqueue_task_fair 0,41% mysqld [kernel.kallsyms] [k] account_entity_enqueue 0,39% mysqld [kernel.kallsyms] [k] init_sync_kiocb.constprop.15 0,39% mysqld mysqld [.] sort_and_filter_keyuse(THD*, st_dynamic_array*, bool) 0,39% mysqld mysqld [.] mysql_execute_command(THD*) 0,38% mysqld mysqld [.] my_lengthsp_8bit 0,38% mysqld [kernel.kallsyms] [k] system_call 0,38% mysqld mysqld [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) 0,37% mysqld mysqld [.] Item::Item() 0,36% mysqld [kernel.kallsyms] [k] atomic64_dec_and_test 0,35% mysqld librt-2.13.so [.] clock_gettime 0,34% mysqld mysqld [.] JOIN::cleanup(bool) 0,34% mysqld mysqld [.] Item_func::type() const 0,33% mysqld mysqld [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long) 0,33% mysqld mysqld [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**) 0,33% mysqld [kernel.kallsyms] [k] __update_tg_runnable_avg.isra.25
Icache misses (ICACHE.MISSES), after PGO:
Samples: 13K of event 'r280', Event count (approx.): 13643 2,07% mysqld libc-2.13.so [.] __memcpy_ssse3 1,39% mysqld mysqld [.] malloc 1,33% mysqld mysqld [.] JOIN::optimize_inner() 1,26% mysqld mysqld [.] JOIN::save_explain_data_intern(Explain_query*, bool, bool, bool, char const*) 1,17% mysqld mysqld [.] make_join_statistics(JOIN*, List<TABLE_LIST>&, Item*, st_dynamic_array*) 1,08% mysqld [vdso] [.] 0x0000000000000864 1,07% mysqld mysqld [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long) 1,06% mysqld libc-2.13.so [.] __memset_sse2 1,01% mysqld mysqld [.] free 0,97% mysqld libpthread-2.13.so [.] pthread_mutex_lock 0,92% mysqld [kernel.kallsyms] [k] _raw_spin_lock_irqsave 0,80% mysqld mysqld [.] THD::enter_stage(PSI_stage_info_none const*, PSI_stage_info_none*, char const*, char const*, unsigned int) 0,76% mysqld mysqld [.] alloc_root 0,74% mysqld libpthread-2.13.so [.] __pthread_mutex_unlock_usercnt 0,73% mysqld libc-2.13.so [.] __strcmp_sse42 0,71% mysqld mysqld [.] open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) 0,69% mysqld mysqld [.] my_hash_sort_bin 0,67% mysqld [kernel.kallsyms] [k] fget_light 0,63% mysqld [kernel.kallsyms] [k] do_raw_spin_lock 0,62% mysqld mysqld [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_ 0,59% mysqld mysqld [.] Item::Item() 0,57% mysqld mysqld [.] _ZL19update_ref_and_keysP3THDP16st_dynamic_arrayP13st_join_tablejP4ItemyP13st_select_lexPP17st_sargable_param.isra.183 0,53% mysqld libpthread-2.13.so [.] pthread_getspecific 0,52% mysqld [kernel.kallsyms] [k] update_cfs_shares 0,51% mysqld mysqld [.] ha_innobase::external_lock(THD*, int) 0,51% mysqld mysqld [.] my_malloc_size_cb_func 0,50% mysqld mysqld [.] JOIN::prepare(Item***, TABLE_LIST*, unsigned int, Item*, unsigned int, st_order*, bool, st_order*, Item*, st_order*, st_select_lex*, st_select_lex_unit*) 0,48% mysqld mysqld [.] mysql_execute_command(THD*) 0,47% mysqld mysqld [.] free_root 0,47% mysqld [kernel.kallsyms] [k] idle_cpu 0,47% mysqld [kernel.kallsyms] [k] unix_stream_sendmsg 0,46% mysqld [kernel.kallsyms] [k] system_call 0,44% mysqld [kernel.kallsyms] [k] unix_stream_recvmsg 0,43% mysqld mysqld [.] ha_innobase::info_low(unsigned int, bool) 0,43% mysqld mysqld [.] MDL_context::try_acquire_lock_impl(MDL_request*, MDL_ticket**) 0,43% mysqld librt-2.13.so [.] clock_gettime 0,43% mysqld [kernel.kallsyms] [k] enqueue_task_fair 0,42% mysqld mysqld [.] JOIN::init(THD*, List<Item>&, unsigned long long, select_result*) 0,42% mysqld mysqld [.] Item_field::used_tables() const 0,41% mysqld mysqld [.] create_ref_for_key(JOIN*, st_join_table*, keyuse_t*, bool, unsigned long long) 0,40% mysqld mysqld [.] Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*) 0,40% mysqld [kernel.kallsyms] [k] emulate_vsyscall 0,40% mysqld mysqld [.] dict_index_copy_types(dtuple_t*, dict_index_t const*, unsigned long) 0,40% mysqld [kernel.kallsyms] [k] update_rq_clock 0,39% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 0,38% mysqld mysqld [.] ha_innobase::extra(ha_extra_function) 0,38% mysqld [kernel.kallsyms] [k] update_entity_load_avg 0,37% mysqld mysqld [.] get_best_combination(JOIN*) 0,37% mysqld mysqld [.] Item_field::fix_fields(THD*, Item**) 0,37% mysqld mysqld [.] my_malloc 0,37% mysqld [kernel.kallsyms] [k] __perf_event_task_sched_out 0,37% mysqld [kernel.kallsyms] [k] __schedule 0,36% mysqld mysqld [.] lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) 0,36% mysqld mysqld [.] mysqld_stmt_execute(THD*, char*, unsigned int) 0,35% mysqld mysqld [.] Protocol::send_result_set_metadata(List<Item>*, unsigned int) 0,35% mysqld mysqld [.] Prepared_statement::execute(String*, bool) 0,34% mysqld mysqld [.] thd_ha_data 0,34% mysqld mysqld [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) 0,34% mysqld [kernel.kallsyms] [k] select_task_rq_fair 0,34% mysqld [kernel.kallsyms] [k] do_sys_poll 0,34% mysqld mysqld [.] st_select_lex::master_unit()
Branch mispredictions (BR_MISP_RETIRED.ALL_BRANCHES_PS), after PGO:
Samples: 1K of event 'rc5:p', Event count (approx.): 1147 8,37% mysqld libc-2.13.so [.] __memcpy_ssse3 7,06% mysqld libpthread-2.13.so [.] pthread_getspecific 3,84% mysqld mysqld [.] Item_func::type() const 2,88% mysqld libc-2.13.so [.] __memset_sse2 2,70% mysqld libpthread-2.13.so [.] pthread_mutex_lock 2,35% mysqld mysqld [.] st_select_lex::master_unit() 2,27% mysqld libpthread-2.13.so [.] pthread_mutex_unlock 2,18% mysqld mysqld [.] Item_param::used_tables() const 2,01% mysqld mysqld [.] Field_long::type() const 2,01% mysqld libc-2.13.so [.] __strlen_sse42 1,92% mysqld libc-2.13.so [.] __strcmp_sse42 1,39% mysqld mysqld [.] Item::const_item() const 1,31% mysqld mysqld [.] ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) 1,22% mysqld mysqld [.] handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function) 1,13% mysqld mysqld [.] dispatch_command(enum_server_command, THD*, char*, unsigned int) 1,13% mysqld mysqld [.] end_send(JOIN*, st_join_table*, bool) 1,05% mysqld mysqld [.] my_real_read(st_net*, unsigned long*) 1,05% mysqld mysqld [.] Item_equal::functype() const 1,05% mysqld mysqld [.] Item_bool_func2::cleanup() 0,96% mysqld mysqld [.] Item::is_expensive() 0,96% mysqld mysqld [.] Item_field::type() const 0,96% mysqld mysqld [.] Item_param::type() const 0,87% mysqld mysqld [.] Item::real_item() 0,78% mysqld mysqld [.] Item_field::used_tables() const 0,78% mysqld mysqld [.] scheduler_wait_net_end 0,78% mysqld librt-2.13.so [.] clock_gettime 0,70% mysqld mysqld [.] Statement::set_statement(Statement*) 0,70% mysqld mysqld [.] _ZL21join_read_const_tableP13st_join_tableP11st_position.isra.180 0,70% mysqld mysqld [.] JOIN::optimize() 0,70% mysqld mysqld [.] Field_num::size_of() const 0,70% mysqld mysqld [.] Field_str::charset() const 0,70% mysqld mysqld [.] Item_field::cleanup() 0,70% mysqld libpthread-2.13.so [.] pthread_rwlock_rdlock 0,61% mysqld mysqld [.] select_result::initialize_tables(JOIN*) 0,61% mysqld mysqld [.] mysql_execute_command(THD*) 0,61% mysqld mysqld [.] mysqld_stmt_execute(THD*, char*, unsigned int) 0,61% mysqld mysqld [.] JOIN::optimize_inner() 0,61% mysqld mysqld [.] Field_str::decimals() const 0,61% mysqld mysqld [.] handler::rebind_psi() 0,61% mysqld mysqld [.] Item_field::field_type() const 0,61% mysqld mysqld [.] Item::cmp_type() const 0,52% mysqld mysqld [.] net_before_header_psi(st_net*, void*, unsigned long) 0,52% mysqld mysqld [.] my_net_read 0,52% mysqld mysqld [.] mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_ 0,52% mysqld mysqld [.] Field::new_key_field(st_mem_root*, TABLE*, unsigned char*, unsigned char*, unsigned int) 0,52% mysqld mysqld [.] Field::send_binary(Protocol*) 0,52% mysqld mysqld [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long) 0,52% mysqld mysqld [.] btr_search_guess_on_hash(dict_index_t*, btr_search_t*, dtuple_t const*, unsigned long, unsigned long, btr_cur_t*, unsigned long, mtr_t*) 0,52% mysqld [vdso] [.] 0x0000000000000630 0,44% mysqld mysqld [.] Protocol::send_eof(unsigned int, unsigned int) 0,44% mysqld mysqld [.] Item::has_subquery() const 0,44% mysqld mysqld [.] Item::mark_as_condition_AND_part(TABLE_LIST*) 0,44% mysqld mysqld [.] select_send::send_eof() 0,44% mysqld mysqld [.] select_send::send_data(List<Item>&) 0,44% mysqld mysqld [.] add_key_fields(JOIN*, key_field_t**, unsigned int*, Item*, unsigned long long, st_sargable_param**) 0,44% mysqld mysqld [.] do_handle_one_connection(THD*) 0,44% mysqld mysqld [.] Field_str::repertoire() const 0,44% mysqld mysqld [.] handler::cancel_pushed_idx_cond() 0,44% mysqld mysqld [.] Item_field::get_item_equal() 0,44% mysqld mysqld [.] Item::check_cols(unsigned int) 0,44% mysqld mysqld [.] Item_field::fix_fields(THD*, Item**)
Hi Kristian,
very interesting results.
Before PGO my_hash_sort_bin was the second just after the hottest spot (__memcpy_ssse3) – 1,92% and 1,94% correspondingly. After the optimization it’s still the second, but the gap is significantly larger – 2,55% and 1,79%. So it seems the function was optimized. Actually, it’s just a simple loop calculating a hash, so it’s assembly code should be very short. Is it possible to have a look at the assembly for the function before and after PGO? Maybe there is some cycle unrolling?.. The loop depends on input data (key) so probably compiler unrolls some iterations?..
Looks like the compiler detected that the code is run sufficiently often that it decided to unroll the loop a couple of times. See dissasembly here: http://knielsen-hq.dk/pgo-disassembly.txt
However, note that this profile is about instructions retired, and my_hash_sort_bin() did not change that much. The reason for the larger gap seems to be that the compiler decided to use __memcpy_ssse3() in more places (instead of inline rep movsb), hence more instructions are executed inside it.
I also noticed that the my_hash_sort_bin() function is written poorly. The compiler is forced to store and re-load the nr1[0] and nr2[0] values in each iteration, which is very costly – but this is necessary, otherwise the semantics would be wrong in case nr1 and nr2 alias with one another (or with key). Of course, they never alias in practise, but the compiler does not know this. I will fix this (__restrict might work, but it is easier to just copy into and out of temporary variables).
And thanks for you interesting observations and comments, keep them coming 🙂
– Kristian.
my results on x86 architecture, mariadb 10.0.7
http://lulu.jong.my/mariadb-pro-bench.png
Nice, thanks for testing 🙂 Pretty good results, those!